How to read any formats of spreadsheet / excel using Pandas [CSV, XLSX, XLS, TXT...]
In this blog, I wish to help you easily import excel files with any file extensions. More specifically, use only one function to read any excel format (including automatically get the separator/delimiter used inside).
Required packages:
pandas: python open source data analysis and manipulation tool.
https://pandas.pydata.org/
xlrd: a library for reading data and formatting information from Excel files in the historical .xls format.
https://xlrd.readthedocs.io/en/latest/
openpyxl: a Python library to read/write Excel 2010 xlsx/xlsm files.
https://pypi.org/project/openpyxl/
Pathlib (already installed with python): It offers classes representing filesystem paths with semantics appropriate for different operating systems.
https://docs.python.org/3/library/pathlib.html
csv (already installed with python): It implements classes to read and write tabular data in CSV format.
https://docs.python.org/3/library/csv.html
Explanations:
- For .csv and .txt file, the function 'read_csv()' is used here. It is worthing mentioning that we
used 'csv.Sniffer()' to first obtain the separator/delimiter inside automatically.
- For .xlsx / .xlsm / .xltx / .xltm file, the function 'read_excel()' is used together with
openpyxl engine.
- For .xls file, the function 'read_excel()' is used.
Code:
Tips:If the spreadsheet and your python file are in the same folder, the input argument 'file_path' is yourspreadsheet file name, e.g. 'XXX.csv'
Reminder:
Thanks to **kwargs arguments, we can pass extra arguments in this functions.
For example, the 2 most often used arguments when importing a spreadsheet:
header : int, list of int, or None
Row (0-indexed) to use for the column labels of the parsed DataFrame.
index_col : int, list of int, or None
Column (0-indexed) to use as the row labels of the DataFrame.
We can write in arguments when using this function:
See more input arguments of reading spreadsheets in :
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
-------------------
Thank you so much for reading my blog! If you have any thoughts or opinions on the topic, I would love to hear from you in the comments below. See you soon !
Linkedin: https://www.linkedin.com/in/tianyi-li-v
-------------------
In case the github code cannot be displayed due to browser or other problems:
import pandas as pd from pathlib import Path import csv def import_spreadsheet(file_path,**kwargs): """ import a spreadsheet/excel with any extension. Parameters ---------- file_path: string file path of the spreadsheet/ excel file (relative or absolute). **kwargs: shared arguments of read_csv and read_excel Returns ------- df: pandas Dataframe dataframe of the spreadsheet. """ file_path = Path(file_path) # Path.suffix return the extension of a file if file_path.suffix == '.csv' or file_path.suffix == '.txt': # check the deliminiter with open(file_path, 'r', encoding='utf8') as csvfile: dialect = csv.Sniffer().sniff(csvfile.readline()) df = pd.read_csv(file_path, sep=str(dialect.delimiter),**kwargs) elif file_path.suffix in ['.xlsx', '.xlsm', '.xltx', '.xltm']: df = pd.read_excel(file_path, engine='openpyxl',**kwargs) # be cafefull that xlrd has explicitly removed support for anything other than xls files. else: df = pd.read_excel(file_path,**kwargs) return df# using default arguments df = import_spreadsheet(file_path='YourSpreadSheet.csv') # define arguments yourself df = import_spreadsheet(file_path='YourSpreadSheet.csv', header = 1, index_col=1)
Comments
Post a Comment