How to read any formats of spreadsheet / excel using Pandas [CSV, XLSX, XLS, TXT...]

December 10, 2022

How to read any formats of spreadsheet / excel using Pandas [CSV, XLSX, XLS, TXT...]

In this blog, I wish to help you easily import excel files with any file extensions. More specifically, use only one function to read any excel format (including automatically get the separator/delimiter used inside).

Required packages:

pandas:  python open source data analysis and manipulation tool.
https://pandas.pydata.org/

xlrd: a library for reading data and formatting information from Excel files in the historical .xls format.
https://xlrd.readthedocs.io/en/latest/

openpyxl: a Python library to read/write Excel 2010 xlsx/xlsm files. 
https://pypi.org/project/openpyxl/ 

Pathlib (already installed with python): It offers classes representing filesystem paths with semantics appropriate for different operating systems.
https://docs.python.org/3/library/pathlib.html

csv (already installed with python): It implements classes to read and write tabular data in CSV format. 
https://docs.python.org/3/library/csv.html

Explanations:

- For .csv and .txt file, the function 'read_csv()' is used here. It is worthing mentioning that we

used 'csv.Sniffer()' to first obtain the separator/delimiter inside automatically.

- For .xlsx / .xlsm / .xltx / .xltm file, the function 'read_excel()' is used together with

openpyxl engine.

- For .xls file, the function 'read_excel()' is used.

Code:


Tips:

If the spreadsheet and your python file are in the same folder, the input argument 'file_path' is your 
spreadsheet file name, e.g. 'XXX.csv'

Reminder:

Thanks to **kwargs arguments, we can pass extra arguments in this functions.

For example, the 2 most often used arguments when importing a spreadsheet:

header : int, list of int, or None

Row (0-indexed) to use for the column labels of the parsed DataFrame.

index_col : int, list of int, or None

Column (0-indexed) to use as the row labels of the DataFrame.

We can write in arguments when using this function:

See more input arguments of  reading spreadsheets in :

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

-------------------

Thank you so much for reading my blog!   If you have any thoughts or opinions on the topic, I would love to hear from you in the comments below. See you soon ! 


Github: https://github.com/TianyiDataScience

Linkedin: https://www.linkedin.com/in/tianyi-li-v


-------------------





In case the github code cannot be displayed due to browser or other problems:

import pandas as pd from pathlib import Path
import csv

def import_spreadsheet(file_path,**kwargs):
    """
    import a spreadsheet/excel with any extension.

    Parameters
    ----------
    file_path: string
        file path of the spreadsheet/ excel file (relative or absolute).
    **kwargs:
        shared arguments of read_csv and read_excel 
    
    Returns
    -------
    df: pandas Dataframe
        dataframe of the spreadsheet.
    """

    file_path = Path(file_path) 

    # Path.suffix return the extension of a file
    if file_path.suffix == '.csv' or file_path.suffix == '.txt':      

        # check the deliminiter
        with open(file_path, 'r', encoding='utf8') as csvfile:
            dialect = csv.Sniffer().sniff(csvfile.readline())

        df = pd.read_csv(file_path, sep=str(dialect.delimiter),**kwargs)

    elif file_path.suffix in ['.xlsx', '.xlsm', '.xltx', '.xltm']:
        df = pd.read_excel(file_path, engine='openpyxl',**kwargs)

    # be cafefull that xlrd has explicitly removed support for anything other than xls files.
    else:
        df = pd.read_excel(file_path,**kwargs)

    return df 

 # using default arguments
df = import_spreadsheet(file_path='YourSpreadSheet.csv') 

# define arguments yourself
df = import_spreadsheet(file_path='YourSpreadSheet.csv', header = 1, index_col=1)

Search This Blog

Data Science Techno

How to read any formats of spreadsheet / excel using Pandas [CSV, XLSX, XLS, TXT...]

Comments

Post a Comment

Popular Posts

Understand common-used Evaluations Metrics for Binary Classification