How to read any formats of spreadsheet / excel using Pandas [CSV, XLSX, XLS, TXT...]
In this blog, I wish to help you easily import excel files with any file extensions. More specifically, use only one function to read any excel format (including automatically get the separator/delimiter used inside).
Required packages:
pandas: python open source data analysis and manipulation tool.
https://pandas.pydata.org/
xlrd: a library for reading data and formatting information from Excel files in the historical .xls format.
https://xlrd.readthedocs.io/en/latest/
openpyxl: a Python library to read/write Excel 2010 xlsx/xlsm files.
https://pypi.org/project/openpyxl/
Pathlib (already installed with python): It offers classes representing filesystem paths with semantics appropriate for different operating systems.
https://docs.python.org/3/library/pathlib.html
csv (already installed with python): It implements classes to read and write tabular data in CSV format.
https://docs.python.org/3/library/csv.html
Explanations:
- For .csv and .txt file, the function 'read_csv()' is used here. It is worthing mentioning that we
used 'csv.Sniffer()' to first obtain the separator/delimiter inside automatically.
- For .xlsx / .xlsm / .xltx / .xltm file, the function 'read_excel()' is used together with
openpyxl engine.
- For .xls file, the function 'read_excel()' is used.
Code:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
import pandas as pd | |
from pathlib import Path | |
import csv | |
def import_spreadsheet(file_path,**kwargs): | |
""" | |
import a spreadsheet/excel with any extension. | |
Parameters | |
---------- | |
file_path: string | |
file path of the spreadsheet/ excel file (relative or absolute). | |
**kwargs: | |
shared arguments of read_csv and read_excel | |
Returns | |
------- | |
df: pandas Dataframe | |
dataframe of the spreadsheet. | |
""" | |
file_path = Path(file_path) | |
# Path.suffix return the extension of a file | |
if file_path.suffix == '.csv' or file_path.suffix == '.txt': | |
# check the deliminiter | |
with open(file_path, 'r', encoding='utf8') as csvfile: | |
dialect = csv.Sniffer().sniff(csvfile.readline()) | |
df = pd.read_csv(file_path, sep=str(dialect.delimiter),**kwargs) | |
elif file_path.suffix in ['.xlsx', '.xlsm', '.xltx', '.xltm']: | |
df = pd.read_excel(file_path, engine='openpyxl',**kwargs) | |
# be cafefull that xlrd has explicitly removed support for anything other than xls files. | |
else: | |
df = pd.read_excel(file_path,**kwargs) | |
return df |
Tips:If the spreadsheet and your python file are in the same folder, the input argument 'file_path' is yourspreadsheet file name, e.g. 'XXX.csv'
Reminder:
Thanks to **kwargs arguments, we can pass extra arguments in this functions.
For example, the 2 most often used arguments when importing a spreadsheet:
header : int, list of int, or None
Row (0-indexed) to use for the column labels of the parsed DataFrame.
index_col : int, list of int, or None
Column (0-indexed) to use as the row labels of the DataFrame.
We can write in arguments when using this function:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# using default arguments | |
df = import_spreadsheet(file_path='YourSpreadSheet.csv') | |
# define arguments yourself | |
df = import_spreadsheet(file_path='YourSpreadSheet.csv', header = 1, index_col=1) |
See more input arguments of reading spreadsheets in :
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
-------------------
Thank you so much for reading my blog! If you have any thoughts or opinions on the topic, I would love to hear from you in the comments below. See you soon !
Linkedin: https://www.linkedin.com/in/tianyi-li-v
-------------------
In case the github code cannot be displayed due to browser or other problems:
import pandas as pd from pathlib import Path import csv def import_spreadsheet(file_path,**kwargs): """ import a spreadsheet/excel with any extension. Parameters ---------- file_path: string file path of the spreadsheet/ excel file (relative or absolute). **kwargs: shared arguments of read_csv and read_excel Returns ------- df: pandas Dataframe dataframe of the spreadsheet. """ file_path = Path(file_path) # Path.suffix return the extension of a file if file_path.suffix == '.csv' or file_path.suffix == '.txt': # check the deliminiter with open(file_path, 'r', encoding='utf8') as csvfile: dialect = csv.Sniffer().sniff(csvfile.readline()) df = pd.read_csv(file_path, sep=str(dialect.delimiter),**kwargs) elif file_path.suffix in ['.xlsx', '.xlsm', '.xltx', '.xltm']: df = pd.read_excel(file_path, engine='openpyxl',**kwargs) # be cafefull that xlrd has explicitly removed support for anything other than xls files. else: df = pd.read_excel(file_path,**kwargs) return df# using default arguments df = import_spreadsheet(file_path='YourSpreadSheet.csv') # define arguments yourself df = import_spreadsheet(file_path='YourSpreadSheet.csv', header = 1, index_col=1)
Comments
Post a Comment