Feature Engineering Series, Part 2: Using Parts of Speech in Natural Language Processing with Python

Shalabh Bhatnagar
6 min read · Jan 29, 2021
Photo by Hugo Rocha on Unsplash

I suggest you read Part 1 in case you have reached here directly and cannot decipher what's going on. If the full implementation code is all you are after, you are most welcome to download it from Git!

The code is documented inline, and if you still have questions, please feel free to reach out to me.

Code download: penredink/tds_feature_engineering_pos_tagger (github.com)

You do not need to know Git commands to work with the files. Just download the zip file and unzip it.

Getting Started

I strongly recommend that you download the zip file and extract it as is. This ensures that you have the correct folder structure as shown below:

Project folder structure

Let us decipher the folder structure:

  • The data folder contains the input files, that is, your raw text files. Copy all your text files into the input_files folder. The code will pick up each file and extract PoS tags from it; each file is processed individually, not as a batch.
  • The _cleaned folder contains the processed files after the PoS tags are extracted. You will see a file that resembles the format shown below.
File structure of the cleaned file that shows line wrapped tokens
  • During execution, the Python code generates the following files, which are explained in later paragraphs.

1. the_files_features.csv

2. the_files_numerical.csv

3. the_files_pos.csv

4. the_files_postagged_interim.csv

Implementation

I did this implementation on Windows (my sense is that Linux users can use it with small changes); the modules used are listed in the Overview of the Implementation section below.

Here is a very small extract of the text file that I sourced from Wikipedia:

Sample fragment of text from the input file. Refer to Git for the full text

After the code is done, you will see a features file that looks like the one below:

Features generated from the raw text file shown above

Overview of the Implementation

You will need the following modules before running the code: numpy, pandas, sklearn and nltk. For nltk, please also download the corpora and treebanks using the nltk.download() function to ensure smooth execution (you do not need all of them, but the files are small, and if you plan to learn nltk you will need them).

This function opens a window that lets you choose and install the nltk components you need:

NLTK downloader window
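If you prefer to skip the window (for instance, on a headless machine), here is a minimal sketch that fetches specific packages directly; the three package names are my pick for tokenizing and tagging, not an exhaustive list:

```python
import nltk

# nltk.download() with no arguments opens the window shown above;
# passing a package name fetches that package directly instead
nltk.download("punkt")                       # tokenizer models
nltk.download("averaged_perceptron_tagger")  # default PoS tagger
nltk.download("treebank")                    # Penn Treebank sample corpus
```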

Here are the implementation steps so you can relate them to the code structure:

  1. The code looks for the input text files in the designated folders.
  2. The code opens one input text document at a time. For batch processing, put more than one file in the designated folder. The code will iterate and process each text file. Make sure the extension is .txt for easy identification.
  3. The code parses each input text file in these seven steps:

3.1 Opens the master punctuations file (sb_golden_punctuations.txt). It contains the punctuation marks and other characters you want to remove from your input files. You can add or remove characters in this file depending on what you want to delete or retain from your source data.

3.2 Removes punctuation and unwanted characters from each input file, using the characters listed in the master punctuations file.
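As a rough sketch of steps 3.1 and 3.2 (the punctuations file name comes from this article; the input path and the exact stripping logic are my assumptions, not the repository's implementation):

```python
# read the master punctuations file; assumption: it holds the
# characters to strip, stored as one string
with open("sb_golden_punctuations.txt", encoding="utf-8") as f:
    punctuations = f.read().strip()

def remove_punctuations(text: str, unwanted: str) -> str:
    # map every unwanted character to None and drop it from the text
    return text.translate(str.maketrans("", "", unwanted))

# hypothetical input path following the folder structure shown earlier
with open("data/input_files/sample.txt", encoding="utf-8") as f:
    raw_text = f.read()

cleaned_text = remove_punctuations(raw_text, punctuations)
```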

3.3 Opens the master Penn PoS tags file (sb_golden_penn_pos_tags.csv) as a pandas DataFrame.

3.4 Tokenizes the input text file. Each file’s tokens are eventually stored as features together with their counts.

3.5 Conditionally joins the DataFrame of the master Penn PoS tags file with that of the input text file, using the ‘filename’ column as the common key.

3.6 Saves each PoS tag and its count. Refer to the image in the later paragraphs for the file structure.
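To make steps 3.3 to 3.6 more concrete, here is a minimal sketch under my own assumptions about the column names and sample text; the repository code is documented line by line and may differ, and the conditional join of step 3.5 is simplified away here:

```python
from collections import Counter

import nltk
import pandas as pd

# 3.3: the master Penn PoS tags file as a pandas DataFrame
penn_tags = pd.read_csv("sb_golden_penn_pos_tags.csv")

# 3.4: tokenize the cleaned text (a short stand-in string here)
cleaned_text = "I live in Paris and I love NLTK"
tokens = nltk.word_tokenize(cleaned_text)

# tag each token; the result is a list of (token, tag) tuples
tagged = nltk.pos_tag(tokens)
tag_counts = Counter(tag for _, tag in tagged)

# 3.6: save each PoS tag and its count
# (the column names 'tag' and 'count' are my assumption)
pos_df = pd.DataFrame(list(tag_counts.items()), columns=["tag", "count"])
pos_df.to_csv("the_files_pos.csv", index=False)
```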

3.7 Uses sklearn’s LabelEncoder to convert the features into numerical data. This is needed only for the non-numerical features.

Tip: You may want to use sklearn’s MinMaxScaler to map the features into the range [0, 1]. This mapping is useful for deep learning, but make sure you convert all your values to a float datatype first.
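A minimal sketch of step 3.7 plus the tip above; the token values here are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# hypothetical non-numerical feature column
tokens = np.array(["paris", "love", "nltk", "paris"])

# LabelEncoder assigns each distinct value an integer code
encoder = LabelEncoder()
encoded = encoder.fit_transform(tokens)   # [2, 0, 1, 2]

# tip: convert to float, then map into the range [0, 1]
scaler = MinMaxScaler()
scaled = scaler.fit_transform(encoded.astype(float).reshape(-1, 1))
print(scaled.ravel())                     # [1.  0.  0.5 1. ]
```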

You can choose to add a target or a dependent variable as the last column to this final file.

The Python code creates various CSV files at critical points of code execution. This helps track progress of execution and data state while reducing debugging effort.

Coming back, here is a brief description of each file:

1. the_files_features.csv: the tokens extracted from the raw text, stored as features together with their counts.

2. the_files_numerical.csv: the features converted into numerical data using LabelEncoder.

3. the_files_pos.csv: each PoS tag and its count.

4. the_files_postagged_interim.csv: an interim file holding the tokens together with their PoS tags.

Connecting NLTK and your Raw Text

To ensure that your machine learning model learns from the right features, you must remove the words, sentences and paragraphs that are of no use to your customer.

Tip: It is not wrong for your model to learn the stop words. It is just that these stop words, too, will become part of your predictions. They will appear to contribute to accuracy when neither your customer nor you wanted them.

A conversational chatbot, for example, may learn stop words and eventually become great at stopping these words in a conversation when an end-user keys them into the chat window; such stop words could be profanities that are certainly unacceptable.
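Here is a minimal sketch of filtering NLTK’s built-in English stop words out of your tokens before feature extraction; extending the list with domain-specific words (or with profanities you want to block) is up to you:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")   # one-time download of the stop word lists

stop_words = set(stopwords.words("english"))

tokens = ["this", "chatbot", "is", "learning", "from", "raw", "text"]
useful_tokens = [t for t in tokens if t not in stop_words]
print(useful_tokens)         # ['chatbot', 'learning', 'raw', 'text']
```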

The Parts of Speech Tags

Code fragment to understand PoS tags

In the fragment above, we have 3 print() statements:

  • The first one prints sentence2 itself.
  • The second one prints the tokenized version of this sentence.
  • The third one prints the PoS tags together with the count of tags (Length: 8). This is a Python list of tuples, which makes processing much easier.

Let us take a small step forward and count the parts of speech in sentence2:

  • PRP appears 2 times
  • VBP appears 2 times
  • NNP appears 2 times
  • IN and CC appear once each

The total count of PoS tags is 8.
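The actual sentence2 lives in the repository; here is a hypothetical stand-in I chose because it yields the same tag counts, so you can reproduce the output described above:

```python
import nltk

# hypothetical stand-in for sentence2, chosen to match the tag counts
sentence2 = "I live in Paris and I love NLTK"

tokens = nltk.word_tokenize(sentence2)
tags = nltk.pos_tag(tokens)   # a Python list of (token, tag) tuples

print(sentence2)
print(tokens)
print(tags, "Length:", len(tags))
# expected: [('I', 'PRP'), ('live', 'VBP'), ('in', 'IN'), ('Paris', 'NNP'),
#            ('and', 'CC'), ('I', 'PRP'), ('love', 'VBP'), ('NLTK', 'NNP')] Length: 8
```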

Deep-Dive into Python Code

After these basics, please deep-dive into the full code implementation. I took the route of documenting the entire code line by line, in the hope of helping you comprehend the implementation and the concepts better.

Fragment of code execution output

Code download: penredink/tds_feature_engineering_pos_tagger (github.com)

A few words on the code:

· Code comments are all in lower case.

· Please read each code comment.

· I have used pure ANSI ASCII/text files for input processing.

· I have not followed all the PEP rules (sorry) but the code works just fine.

· I used Python 3.8, which was really a random choice. For your own implementations and version requirements, it is best to read the official documentation of each module to see which versions of Python it supports. Most world-class modules like numpy, pandas, sklearn and nltk are compatible with a large number of Python versions.

Thanks for your time. Claps or brickbats? Sure!
