Ponder with Pandas

Shalabh Bhatnagar
3 min read · Dec 13, 2022


To Cloud or Not: Data Too Huge to Handle, or Too Little to Include (Part I)

Photo by Marlon Corona on Unsplash

Interfacing with the operating system is a standard task in all software implementations.

Even more so in machine learning, where you read and write feature files locally, or use Compute and Storage from any of the major cloud providers.

Invariably, you will work with thousands, if not millions, of files, process the raw data in them, and then generate features or apply transformations.

However, when you get many files that are either too huge to process or too small to include (zero bytes), you have an issue at hand. Always.

How do you optimize your data pre-processing? How do you know whether you need to push all your files to the cloud for processing? Remember that cloud Compute and Storage are expensive and drain your credit card rather quickly.

So you are better off examining beforehand which files should be moved to the cloud and which should not. The point is:

1. Defer massive files for processing later or tackle them individually. Huge files are outliers 😊 and you must examine them separately before discarding them.

2. Remove files with zero bytes. Again, an outlier that contains nothing, so better to remove them upfront. A sketch of both rules follows.
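
Both rules are easy to apply once the file attributes are indexed. Here is a minimal sketch, assuming the index CSV produced by the code fragment later in this post (it has a file_size column in bytes); the 100 MB cut-off is an arbitrary example, not a recommendation:

import pandas as pd

# hypothetical cut-off for "massive" files; tune it for your data
HUGE_BYTES = 100 * 1024 * 1024

df = pd.read_csv("the_list.csv")

empty_files = df[df["file_size"] == 0]  # rule 2: nothing in them, remove upfront
huge_files = df[df["file_size"] >= HUGE_BYTES]  # rule 1: park for a separate look
to_cloud = df[(df["file_size"] > 0) & (df["file_size"] < HUGE_BYTES)]

print(f"{len(to_cloud)} files to push, {len(huge_files)} to review, "
      f"{len(empty_files)} to delete")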

Applies to

· Any machine learning implementation that requires directory or folder traversal.

· Any project that requires indexing of file attributes (the code fragment below generates a CSV file). You can extend the code to include more attributes that lovely Python libraries offer; see the sketch after this list.

Please experiment.
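
For instance, a single os.stat() call exposes several more attributes than the fragment below collects. A minimal, purely illustrative sketch:

import os

# any existing file works here; os.stat returns one struct with many fields
info = os.stat("the_list.csv")

print(info.st_size)       # size in bytes, same value as os.path.getsize
print(info.st_ctime)      # metadata-change time (creation time on Windows)
print(oct(info.st_mode))  # file type and permission bits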

Benefits

· A reliable dataset for the regular flow of your data pipeline.

· Less brittle code.

· A log of files is available. You do not have to subclass anything or spawn an operating system shell to pipe in a directory listing.

When to use

· When you want to list a folder and gather additional attributes.

· When you want to save processing time on a cloud Compute instance.

· When you want to save space on a cloud Storage instance.

Code fragment

import os
import pandas as pd
import time

# Lists to hold one value per file; together they become the CSV columns
username = []
#
filelocation = []
filename = []
filename_length = []
#
fileextension = []
fileextension_length = []
#
filesize = []
filemodified = []
filemodified_dt_tm = []
the_filename = "the_list"
schema = {
    "user_login_name": username,
    "file_path": filelocation,

    "file_name": filename,
    "file_name_len": filename_length,

    "file_ext": fileextension,
    "file_ext_len": fileextension_length,

    "file_size": filesize,

    "file_mod": filemodified,
    "file_mod_dt_tm": filemodified_dt_tm,
}

# root symbol is added to qualify a root drive
the_location = os.path.splitdrive(os.getcwd())[0] + "/"
# this is a random location to list files; you can specify a different folder
the_location = "c:/windows/"
# get the iterator
some_files = os.scandir(the_location)

# get the name of the user logged in
# so you can distinguish the listing across user accounts
user_name = os.getlogin()
# kept the code in a simple loop for easier reading
with some_files as an_iterator:
    for entry in an_iterator:
        if not entry.name.startswith('.') and entry.is_file():
            full_name = the_location + entry.name
            if os.path.exists(full_name):
                dt_time_stamp = time.ctime(os.path.getmtime(full_name))
                # split the name and extension safely, even for files
                # with no extension or with dots inside the name
                the_name, the_extn = os.path.splitext(entry.name)
                the_extn = the_extn.lstrip(".")
                # add values to the respective lists
                username.append(user_name)
                filelocation.append(the_location)
                filename.append(the_name)
                filename_length.append(len(the_name))

                fileextension.append(the_extn)
                fileextension_length.append(len(the_extn))

                filesize.append(os.path.getsize(full_name))
                filemodified.append(os.path.getmtime(full_name))
                filemodified_dt_tm.append(dt_time_stamp)

df = pd.DataFrame.from_dict(schema)
df.to_csv(the_filename + ".csv", index=False)
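
Once the CSV exists, later runs of your pipeline can reuse it as the file log instead of re-scanning the folder or shelling out for a directory listing. A minimal sketch:

import pandas as pd

# reuse the index written by the fragment above
log = pd.read_csv("the_list.csv")

# e.g., eyeball the largest files first before deciding what goes to the cloud
print(log.sort_values("file_size", ascending=False).head(10))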
