How to read any formats of spreadsheet / excel using Pandas [CSV, XLSX, XLS, TXT...]


In this blog, I wish to help you easily import excel files with any file extensions. More specifically, use only one function to read any excel format (including automatically get the separator/delimiter used inside). 

Required packages:

pandas: python open source data analysis and manipulation tool.
https://pandas.pydata.org/

xlrd: a library for reading data and formatting information from Excel files in the historical .xls format.
https://xlrd.readthedocs.io/en/latest/

openpyxl: a Python library to read/write Excel 2010 xlsx/xlsm files.
https://pypi.org/project/openpyxl/

Pathlib (already installed with python): It offers classes representing filesystem paths with semantics appropriate for different operating systems.
https://docs.python.org/3/library/pathlib.html

csv (already installed with python): It implements classes to read and write tabular data in CSV format.
https://docs.python.org/3/library/csv.html

Explanations:
-  For .csv and .txt file, the function 'read_csv()' is used here. It is worthing mentioning that we
   used 'csv.Sniffer()' to first obtain the separator/delimiter inside automatically.

- For .xlsx / .xlsm / .xltx / .xltm file, the function 'read_excel()' is used together with 
   openpyxl  engine.

- For .xls file, the function 'read_excel()' is used.

Code:

Tips:

If the spreadsheet and your python file are in the same folder, the input argument 'file_path' is your
spreadsheet file name, e.g. 'XXX.csv'

Reminder:

Thanks to **kwargs arguments, we can pass extra arguments in this functions.

For example, the 2 most often used arguments when importing a spreadsheet:

    header : int, list of int, or None
        Row (0-indexed) to use for the column labels of the parsed DataFrame.
    index_col : int, list of int, or None
        Column (0-indexed) to use as the row labels of the DataFrame.

We can write in arguments when using this function:

See more input arguments of reading spreadsheets in :

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

-------------------

Thank you so much for reading my blog! If you have any thoughts or opinions on the topic, I would love to hear from you in the comments below. See you soon !





-------------------





In case the github code cannot be displayed due to browser or other problems:
import pandas as pd from pathlib import Path import csv def import_spreadsheet(file_path,**kwargs): """ import a spreadsheet/excel with any extension. Parameters ---------- file_path: string file path of the spreadsheet/ excel file (relative or absolute). **kwargs: shared arguments of read_csv and read_excel Returns ------- df: pandas Dataframe dataframe of the spreadsheet. """ file_path = Path(file_path) # Path.suffix return the extension of a file if file_path.suffix == '.csv' or file_path.suffix == '.txt': # check the deliminiter with open(file_path, 'r', encoding='utf8') as csvfile: dialect = csv.Sniffer().sniff(csvfile.readline()) df = pd.read_csv(file_path, sep=str(dialect.delimiter),**kwargs) elif file_path.suffix in ['.xlsx', '.xlsm', '.xltx', '.xltm']: df = pd.read_excel(file_path, engine='openpyxl',**kwargs) # be cafefull that xlrd has explicitly removed support for anything other than xls files. else: df = pd.read_excel(file_path,**kwargs) return df
# using default arguments df = import_spreadsheet(file_path='YourSpreadSheet.csv') # define arguments yourself df = import_spreadsheet(file_path='YourSpreadSheet.csv', header = 1, index_col=1)

Comments

Popular Posts