Feature Engineering Series: Part 1 Using Parts of Speech in Natural Language Processing to Automatically Generate Features using Python
PART 1 — BACKGROUND AND BASICS
Using Parts of Speech in Natural Language Processing to Automatically Generate Features from Raw Text
In the machine learning world, we have all, at some point, struggled to find the right features for our machine learning models, especially when the inputs are raw, unstructured text. Customers have handed us documents as potential datasets and left us perplexed. The very nature of unstructured text has forced us to revise our work estimates, software design, and implementation, all of which leads to rework.
Processing text documents and converting them into features is daunting because of:
Structural Differences. Text documents can be notoriously different from each other, even when the same person writes them.
Grammatical Purity. If the language is foreign to the author, the purity of the grammar suffers. This could mean consistently erroneous documents.
Varied Writing Styles. Different people may have written documents on the same topic but with marked differences in composition and writing style. Occasionally this is by design, especially with professional writers. Regardless, you get differences among documents.
Different Genres. One document may be a news report while another is the manuscript of a novel, so again you have differences.
Technical Formats. One document is plain ANSI/ASCII while another is Unicode. This is a common issue not just in natural language processing but in structured datasets too.
Vocabulary Richness. If a committed linguist writes one document and a capable millennial another, the breadth and depth of the vocabulary could differ.
Search Criteria. Whether you use a simple search or complex regex expressions to extract values, the processing strategy may have to change from document to document.
These issues deter us from designing a consistent and replicable natural language processing model that extracts features. So, what do we do?
In this two-part piece, I humbly submit an approach to generating features from arbitrary ANSI text documents. The features that emerge from this implementation can be fed into any machine learning algorithm, especially deep learners.
When I started this implementation, I wasn’t sure it would work. Thankfully, I was able to use it in a project with robust prediction accuracy, so I hope it helps you too.
There is a bit of a learning journey ahead. You will need to understand a few topics from a linguistics and Python standpoint. If you are a veteran, you can jump straight to the second part of this piece. Let us dive in.
Nano view into Linguistics
Linguistics is the scientific study of language. It involves an analysis of language form, language meaning, and language in context, as well as an analysis of the social, cultural, historical, and political factors that influence language.
[Within Linguistics, Computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics draws upon linguistics, computer science, artificial intelligence, math, logic, philosophy, cognitive science, cognitive psychology, psycholinguistics, anthropology, and neuroscience, among others. Source: Linguistics — Wikipedia]
In this piece, it is the computational side of language that interests us.
What are Parts of Speech (PoS)?
Parts of speech play a foundational role in dissecting the structure of sentences. Each part of speech describes the function a word performs in a sentence, according to the category that word belongs to.
[In traditional grammar, a part of speech or part-of-speech (abbreviated as POS or PoS) is a category of words (or, more generally, of lexical items) that have similar grammatical properties. Words that are assigned to the same part of speech generally display similar syntactic behavior — they play similar roles within the grammatical structure of sentences — and sometimes similar morphology in that they undergo inflection for similar properties.
Source: Part of speech — Wikipedia]
For example, a noun, one of the many parts of speech, names a person, place, or object. It serves the purpose of finding concrete or abstract entities in a sentence.
If you collect all the nouns from a sentence, you can say that you have interpreted it from a “noun” perspective. It is this thought that I build upon: I collect the parts of speech of every line of text in a document, count each part-of-speech tag, and associate the counts with the file they came from.
A well-formed and meaningful sentence has much more than just nouns. For example, to know whether a sentence is talking about a name, place, or object, you search for nouns in that sentence. If you are looking for a task, deed, or action in the text, then you look for verbs, and so on.
Side note: if you extract all the nouns from a sentence (as a start), you have something more manageable than the full sentence. You dramatically reduce the sentence to the words (tokens) of your interest, nouns or verbs as the case may be.
You may find Internet search results claiming different counts of parts of speech for the English language. To steer clear of this, I will stick to the ten parts of speech used classically.
You may remember many of these parts of speech from your childhood days. Here is a list just in case:
Tip: Suppose you key a sentence into the conversation window of a fictitious chatbot. Further suppose that, behind the scenes, this chatbot engine extracts all the part-of-speech tags and their counts from your sentence, then turns those counts into a probability distribution. Why? Because you can then state that a certain sentence is ‘noun heavy’ or ‘adjective heavy’ simply from the counts. This approach by itself helps you classify sentences into categories!
For example, take the sentence “I am Shalabh and I stay in Delhi-NCR”. There are 2 nouns in a total of 8 words.
Back to Basics
What is NLTK?
No treatise on natural language processing is complete without showering heaps of praise on the ubiquitous NLTK, the Natural Language Toolkit. It is a rite of passage for anyone serious about computational linguistics, language models, making sense of text, machine learning, and more.
I quote the first two paragraphs from the Natural Language Toolkit — NLTK 3.5 documentation site:
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum. Thanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics, plus comprehensive API documentation, NLTK is suitable for linguists, engineers, students, educators, researchers, and industry users alike. NLTK is available for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven project. Source: https://www.nltk.org/
What role does NLTK play here?
Apart from the standard libraries of Python, the muscular pandas, and the nimble NumPy, the soul of this implementation rests on the massive shoulders of NLTK and scikit-learn. If you are unfamiliar with NLTK, or don’t want to learn it yet, that’s fine: I will cover the bare minimum you need to implement this piece.
To extract parts of speech, I have used NLTK’s Penn Treebank part-of-speech tagger. Discussing the Penn Treebank is well beyond the scope here and demands significant attention of its own. If you want to know more, please visit Treebank — Wikipedia.
A Text Document in NLTK
What is a text document?
A text file (sometimes spelled textfile; an old alternative name is flatfile) is a kind of computer file that is structured as a sequence of lines of electronic text. Source: Text file — Wikipedia
Sentences to Tokens
Valid Tokens and Not-so-Valid-Tokens
In real-life datasets, the raw text contains one or more invalid tokens.
You may have hundreds, thousands, or even millions of invalid tokens in datasets that could be gigabytes in size. I cannot emphasize enough that the definition of an ‘invalid token’ is a function of client requirements, your own knowledge, the text format, and common sense.
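As an illustration, here is one simple, entirely project-specific notion of an invalid token: flag anything that mixes letters and digits. The pattern and the sample tokens below are my own, not from the article.

```python
import re

# Flag tokens that mix letters and digits, e.g. "but981".
# Lookaheads require at least one letter AND at least one digit.
MIXED_ALNUM = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)\S+$")

tokens = ["but", "but981", "42", "hello", "x9y"]
invalid = [t for t in tokens if MIXED_ALNUM.match(t)]
print(invalid)  # → ['but981', 'x9y']
```

Note that "42" is not flagged: a pure number may well be valid data, which is exactly why the definition of "invalid" belongs to the client, not the code.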
You can see from the code output above that the token but981 is perfectly valid from a code perspective but quite strange for human consumption, in that it seems to be an error.
(Let me step back a bit. Such tokens are not always invalid; they could be valid data. So, when in doubt, always ask your customers whether they want to exclude such strange-looking tokens from their feature assessment.)
We treat these “strange-looking” tokens as unwanted ones that ‘stop’ us from keeping our input data clean. They introduce unwanted bias into the models. These are stop words, and we need to remove them.
Stop words are any words, punctuation marks, characters, and so on that your customers do not want as input to the machine learning model. Therefore, you should rightly spend quality time removing them.
NEXT: PART 2 — PYTHON IMPLEMENTATION
…and yes, I will share all the links to the source code on Git in the concluding Part 2.