Ponder with Pandas — Writing Features Once and For All
Writing features to files is part of our daily activities in machine learning implementations. Those of us who have worked on legacy statistical and mathematical software know well that CSV or delimited files have been around for ages.
Today, we see many formats from different big data, cloud, and proprietary implementations. Surprisingly, the fact remains that the majority of enterprise computing revolves around CSV, Microsoft Excel file formats, and text files of different types. Most SQL or ETL developers continue to use these formats for performance. Most visualization tools and enterprise-class dashboards employ Microsoft Excel files.
Whether you come from a contemporary or a legacy era, some file formats are ubiquitous.
In this little code fragment, I offer you an implementation that writes multiple (feature) files in different formats using Pandas. I seriously recommend that you move away from detailed byte-by-byte implementations for writing files. One line of code is all you need to write a file, so why waste CPU cycles?
A little goodie thrown in here is that the filename you give to this code will become the filename of each feature file this code writes. I have not written exception handlers; I leave those to you, as your production code will have many handlers of its own which you might want to supplement.
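If you want a starting point for those handlers, here is a minimal sketch. The helper names base_name and safe_write are my own inventions for illustration and are not part of Pandas; the choice to catch OSError and ValueError is an assumption about the failures you are most likely to see, so adjust it to your production needs.

```python
import logging
from pathlib import Path

import pandas as pd

logger = logging.getLogger(__name__)


def base_name(script_path):
    """Return a script's file name without its directory or extension."""
    return Path(script_path).stem


def safe_write(writer, path, **kwargs):
    """Call a pandas writer (for example df.to_csv) and log failures.

    Hypothetical helper: returns True on success and False if the write
    raised an OSError (bad path, permissions) or a ValueError.
    """
    try:
        writer(path, **kwargs)
        return True
    except (OSError, ValueError) as exc:
        logger.error("Could not write %s: %s", path, exc)
        return False
```

Inside a script you would call it as safe_write(df.to_csv, base_name(__file__) + ".csv", index=False) and decide what to do when it returns False.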
Do note that the code overwrites previous output files each time you run it. This is not about feature selection or feature generation but writing features to different types of files.
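If overwriting is a problem for you, one option is to put a timestamp in each output filename. This is only a sketch of that idea: versioned_path is a hypothetical helper of mine, and the timestamp format is an arbitrary choice.

```python
from datetime import datetime
from pathlib import Path


def versioned_path(folder, stem, suffix, now=None):
    """Build folder/stem_YYYYMMDD_HHMMSS + suffix so each run gets its own file.

    `now` can be passed in for reproducible names; it defaults to the
    current local time.
    """
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(folder) / f"{stem}_{stamp}{suffix}"
```

You could then write df.to_csv(versioned_path("output", "features", ".csv")) instead of a fixed name, at the cost of cleaning up old files yourself.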
I call upon my ever-reliable friend — the rock-solid Pandas!
Here are the contents of the text file fox.txt used in the code:
The quick brown fox jumped over the silly lazy dog and then dog woke up only to find that fox was playing
And here is the code.
import os

import openpyxl  # not called directly; pandas needs it installed to write .xlsx files
import pandas as pd

# Make the output directory if it does not already exist
dir_name = os.path.join(os.getcwd(), "output")
os.makedirs(dir_name, exist_ok=True)

# Create a dataframe from a text file named fox.txt. Please ensure that the text file
# is in the same folder as this code, or change the path as per your needs
with open("fox.txt", "rt") as f:
    contents = f.read()

# Use the code filename (without folder or extension) as the filename for the generated files
filename = os.path.splitext(os.path.basename(__file__))[0]

# Split the text on whitespace to get a list of words
content_in_list = contents.split()

# Make a DataFrame from the list, one word per row. The column name "word" is my choice.
# (from_records on a list of strings would split each word into characters)
df = pd.DataFrame(content_in_list, columns=["word"])

folder_and_file = os.path.join(dir_name, filename)

# Write commonly used file formats
# ================================
# write an Excel file
df.to_excel(folder_and_file + ".xlsx", merge_cells=True, index=False)
# write a CSV file
df.to_csv(folder_and_file + ".csv")
# write a pipe delimited file, UNIX et al
df.to_csv(folder_and_file + ".pipe", sep="|")
# write a tab delimited file, UNIX et al
df.to_csv(folder_and_file + ".tab", sep="\t")
# write a pickle file if you want to retrieve it in a model later
df.to_pickle(folder_and_file + ".pik")
# write a JSON file, hope the JavaScript world is smiling
df.to_json(folder_and_file + ".json")
# write a text file
with open(folder_and_file + ".txt", "wt") as f:
    f.write(df.to_string())
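Reading the files back is just as short as writing them. The sketch below is a rough round-trip check, not part of the code above: the round_trip function, the "features" base name, and the use of index=False are my assumptions for illustration. I have left Excel out so the sketch does not depend on openpyxl.

```python
from pathlib import Path

import pandas as pd


def round_trip(df, folder):
    """Write df in several formats under `folder`, then read each one back.

    Returns a dict mapping a format label to the DataFrame read from disk.
    """
    base = Path(folder) / "features"  # hypothetical base name

    # One line per format to write
    df.to_csv(base.with_suffix(".csv"), index=False)
    df.to_csv(base.with_suffix(".pipe"), sep="|", index=False)
    df.to_csv(base.with_suffix(".tab"), sep="\t", index=False)
    df.to_pickle(base.with_suffix(".pik"))
    df.to_json(base.with_suffix(".json"))

    # One line per format to read; the separator must match the one used to write
    return {
        "csv": pd.read_csv(base.with_suffix(".csv")),
        "pipe": pd.read_csv(base.with_suffix(".pipe"), sep="|"),
        "tab": pd.read_csv(base.with_suffix(".tab"), sep="\t"),
        "pickle": pd.read_pickle(base.with_suffix(".pik")),
        "json": pd.read_json(base.with_suffix(".json")),
    }
```

Only unpickle files you wrote yourself or otherwise trust, since loading a pickle can execute arbitrary code.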