Ponder with Pandas — Join Thousands of Feature files Instantly into a Single File

Shalabh Bhatnagar
3 min readMar 2, 2023

(60 seconds read)

Photo by Andreas Klassen on Unsplash

Feature Engineering

Data comes in various shapes and different file formats. In an enterprise, different components or apps surely generate data on the same entity but with different schemas. For example, a HR system may generate information such as Employee Code, Employee Name, Address 1, Phone etc. Similarly, the payroll system may generate data such as Employee Code, Employee Name, Basic Salary, HRA etc. Common to both is the Entity called Employee. These files are useful for downstream apps or downstream business users.

Machine learning implementations go through the Pre-Processing phase to combine and clean the files. Issue is that these files could be in thousands and may contain a point-in-time information. For example, some file may be from January while others from July. Regardless, you must be combining (read concatenating/clubbing/joining) them invariably for a holistic view. Organization dashboards is an example of full employee view that are usually published at a periodic frequency for stakeholder review and decision making.

If you have ever used SQL to join these files, then you have experienced the nightmare as I have. Phew.

If you are tasked to write a machine learning piece that predicts the salaries of, say, the future joiners then ‘joining’ these files one by one is super painful. You might open each file and paste the contents into the target file. Error prone. For organizations over 500 employees, it is daunting as is, so imagine a count of 5,000 employees or more.

From my bank of utilities, I share an implementation that I hope simplifies your life. It has made my life easy over the years. It works for CSV as well. I stuck to Excel to make the point.

What does it do?

1. Iterates through the default folder.

2. Picks all the Excel files I elaborated on above.

3. Clubs the files into one Excel file and writes it in the same folder.

Applies to

· Thousands or millions of Excel files sitting in the same folder, that you need to join for feature engineering.

Benefits

1. Rapidly create a single unified file containing all the features within few seconds. (Pandas are crazy fast).

2. Features are all in one big file (*Be advised that Excel has a record-count limitations, so I use CSV to bypass this limit in that just replace to_excel() with to_csv() in the code.

When to use

· When you have disparate information systems that generate data on common entities may be at different times and at different folders.

Code

import os
import pandas as pd

# Create early binding
df_clubbed = pd.DataFrame()
# Iterate through the default directory
for an_item in os.scandir():
# Check if is an Excel file
if an_item.is_file() and str(an_item).__contains__("xlsx"):
print(f"Clubbing file {an_item}...")
# Read each Excel file
df1 = pd.read_excel(an_item)
# When axis = 0, columns are clubbed one after the other
df_clubbed = pd.concat([df_clubbed, df1], axis=0)
# Write the final file
print("Writing 'the_final.xlsx'...")
df_clubbed.to_excel("the_final.xlsx", index=False)

Git

ponderwithpandas_clubfeatures/ponderpandas_clubfiles.py at main · penredink/ponderwithpandas_clubfeatures · GitHub

Disclaimer: All copyrights and trademarks belong to their respective companies and owners.

--

--