Azure Cognitive Services and NLTK: Self-Labelling Data: Remove the Human-in-the-Loop (HITL)

using Microsoft Azure and NLTK

Shalabh Bhatnagar
6 min read · Feb 21, 2023

Supervised learning relies on labeled datasets to learn from. When a dataset is not labeled, business users are frequently asked to identify the valid instances and give them a correct identity and representation. This labeling process is extremely useful because the owners or experts on the data decide what the data should be called or treated as, while the machine learning project is still in flight.

These business users are human-in-the-loop.

They follow a process that is iterative and incremental until the dataset matures and is complete enough for training. These humans mark, annotate, scribble, validate, tune and do a lot more.

SQL or Excel pros work shoulder-to-shoulder with the business users to update the dataset, especially when it is huge and contains many features.

Business users are usually ‘non-technical’. They can look at data but cannot update millions of records by hand. With data ranging from simple text, to images, to structured data embedded in an image, each type requires a different labelling technique. The harsh reality is that nearly every organization is sitting on volumes of unlabeled data and therefore needs humans to label it correctly, completely and comprehensively. Is there a way to save humans from spending time on labelling the dataset and let them focus instead on delivering value to the processes and clients they serve? Yes.

In this implementation, I share a portion of my research: a technique that you can easily extend to label the digital assets sitting pretty in your workplace archives.

Running this code in a loop can batch-process your entire library of printed documents in a jiffy. All you need are small tweaks that I leave to you. I humbly claim that the possibilities are endless. For example, you can scan a business card and instantly create an accurate contact record on an Android phone. You will need the muscle power of RegEx to do the interim cleaning. Like I said, much potential.
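For instance, here is a minimal sketch of that interim RegEx cleaning, assuming the OCR text of a business card is already in hand; the sample text, patterns and field names are purely illustrative.

import re

# Illustrative OCR output for a scanned business card
card_text = "Jane Doe | Acme Corp\nemail: jane.doe@acme.example  Tel: +1 (555) 010-9999"

# Pull out the pieces a contact record needs with simple patterns
email = re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', card_text)
phone = re.search(r'\+?\d[\d\s()-]{7,}\d', card_text)

contact = {
    "email": email.group(0) if email else None,
    "phone": phone.group(0) if phone else None,
}
print(contact)  # {'email': 'jane.doe@acme.example', 'phone': '+1 (555) 010-9999'}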

Where to use?

If you have printed reports, forms, applications, resumes and so on, you can use this implementation to convert any scanned document instantly into a tagged instance. All you have to do is take a picture, or upload a scanned document from a URL or local drive, then run the code. To try the code, you will need to change the filename variable for now. Later you can traverse a folder and get all the files you want processed; consider using the powerful os and sys packages, as sketched below.
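Here is a rough sketch of that traversal, assuming the main code further below has been wrapped into a hypothetical process_document() function and that scanned_docs is the folder holding your scans; both names are placeholders.

import os

input_folder = "scanned_docs"  # placeholder folder name

# Walk the folder tree and hand every scanned file to the (hypothetical) wrapper
for root, _dirs, files in os.walk(input_folder):
    for name in files:
        if name.lower().endswith((".png", ".jpg", ".jpeg", ".tiff", ".pdf")):
            process_document(os.path.join(root, name))  # wraps the Form Recognizer + NLTK code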

On a less flamboyant note, I agree that this code is no great shakes, but here is the subtlety: you generate a labelled dataset. In one sweep, you convert images to text, label each image and then generate features, because the image’s textual content has come to life! If you transpose the features and then vectorize the dataset, you can leap further and do a variety of complex machine learning tasks.
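As one possible next step, and only as a sketch, you could read back the CSV this code produces and vectorize the labels into a multi-hot feature matrix; note that scikit-learn is an assumption here and not part of this implementation.

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer  # assumption: scikit-learn is installed

# The script below writes one row per (filename, label); collapse back to one label list per file
labels_df = pd.read_csv("Code2.png.csv")
per_file = labels_df.groupby("filename")["labels"].apply(list)

# One row per file, one column per label - ready for clustering, classification, etc.
mlb = MultiLabelBinarizer()
features = pd.DataFrame(mlb.fit_transform(per_file), index=per_file.index, columns=mlb.classes_)
print(features.head())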

For this implementation I chose Microsoft Azure Cognitive Services and NLTK.

What do you need?

You will need an Azure subscription; the trial offers a superb free period. If you already have a subscription, use it to create a resource group and a corresponding Form Recognizer instance, then retrieve the endpoint and Key 1 (both confidential). If you don’t have an Azure subscription, create a free one to try this out. More on this under Prerequisites.

What is Form Recognizer?

Form Recognizer is an AI service that applies advanced machine learning to extract text, key-value pairs, tables, and structures from documents automatically and accurately. It helps turn documents into usable data and shifts your focus to acting on information rather than compiling it. (Source: azure.microsoft.com/en-us/products/form-recognizer)

To generate phrases and important terms, I relied on my ever-favourite NLTK and its NER abilities. You don’t have to. You can use your favourite NER package or an endpoint from your preferred cloud platform.
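If you want true named entities from NLTK rather than the noun and pronoun shortlist used in the code further below, a minimal sketch with nltk.ne_chunk could look like the following; the sample sentence is illustrative only and the entity labels depend on NLTK’s bundled model.

import nltk
from nltk import ne_chunk, pos_tag, word_tokenize

# One-time downloads needed for tokenizing, tagging and chunking
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg)

sentence = "Satya Nadella leads Microsoft from Redmond."
tree = ne_chunk(pos_tag(word_tokenize(sentence)))

# Named-entity chunks come back as subtrees; plain tokens come back as (word, tag) tuples
entities = [
    (" ".join(word for word, _tag in subtree.leaves()), subtree.label())
    for subtree in tree
    if hasattr(subtree, "label")
]
print(entities)  # e.g. [('Satya Nadella', 'PERSON'), ('Microsoft', 'ORGANIZATION'), ('Redmond', 'GPE')]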

You can also use Microsoft Azure or Amazon AWS services to retrieve phrases of significance. This attracts additional fees beyond the stipulated free limits. With Form Recognizer, you are going down the commercial road if your volumes are likely to be high. Refer to the pricing slabs published by these cloud providers to see whether your volume stays within affordable limits or could exceed them.

It is best to model or estimate the consumption using the pricing calculators. Reports and dashboards are available within Azure and AWS that let you view, monitor and control the consumption and fees. Even if the fee is affordable for you, that doesn’t mean you need not control the usage!

Refer to the links below for more information.

NER on Cloud Providers

NER, on Microsoft Azure

What is the Named Entity Recognition (NER) feature in Azure Cognitive Service for Language? — Azure Cognitive Services | Microsoft Learn

NER, on AWS

Named Entity Recognition — Amazon SageMaker

About NER

Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc. (Source: Named-entity recognition — Wikipedia)

Why this Implementation?

Entities like banks, regulators, defence and government departments generate their deliverables on printed paper. To be honest, most “paperless companies” generate enough paper, which they preserve for statutory reasons, so this applies to nearly every business entity, in the private and public sector alike.

Prerequisites

1. A Microsoft Outlook e-mail ID

2. An Azure account — Free Tier created through your Outlook e-mail ID

3. An Azure Resource Group

4. An Azure Form Recognizer service and an instance. (Once the service is deployed, you should have an Endpoint and 2 keys. Use Key 1 in the code).

5. Python 3.6 or later

6. pip installs covering azure.core.credentials and azure.ai.formrecognizer, plus nltk and pandas (see the setup sketch after this list)

The process to create these resources in Azure is straightforward. You do not require any background in cloud computing.
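As a quick setup sketch, assuming a fresh Python environment: install the packages the code imports (pip install azure-ai-formrecognizer nltk pandas in a terminal), then fetch the NLTK data that word_tokenize and pos_tag rely on.

import nltk

# One-time downloads for the NLTK pieces used later in the code
nltk.download("punkt")                       # tokenizer models used by word_tokenize
nltk.download("averaged_perceptron_tagger")  # tagger used by pos_tag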

Applies To

· For historical data in your organization that needs digitization such as forms, licenses, certificates, affidavits, proofs, legal artifacts and reports.

· Anything printed and legible, of course

Input

· Any image containing printed text

Output

· Textual output of the document

· Depending on the document, additional fields may be reported

· Text extracted with NLTK NER

Benefits

· Robust reliability of Azure

· Fast processing, no ML experience needed

· You can go fully cloud, or, half-cloud and half-NLTK

· Sorry HITL removed

Caveats

· Too much power to machines and not humans? ☹

· Cost, if your volume is really high. Azure offers attractive pricing slabs.

· A half-cloud, half-NLTK implementation may not make sense architecturally. A fully cloud implementation might, but then the cost of the cloud NER APIs must be added too.

Note: Code2.png contains text courtesy Wikipedia.

Code is located at: penredink/Remove-the-Human-in-the-Loop-Phew- (github.com)

import re

import nltk
import pandas as pd
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# ----------------- Variables and Data Structures --------------------
filename = "Code2.png"
search_list = []
all_the_lines = []
full_text = str()

# TODO: PLEASE POPULATE YOUR ENDPOINT URL AND KEY IN THE CREDENTIALS OR THE CODE WILL NOT WORK :-)
endpoint = ""
credential = AzureKeyCredential("")

# -------------------------------------
def gen_NER_tags(paramtext: str):
    tagged_text = pos_tag(word_tokenize(paramtext))
    # Look for all nouns and pronouns. These 'Things' tend to be Actors!
    for t in tagged_text:
        # Use a loop though you can use smaller lambda functions
        if t[1] in ("NN", "NNS", "NNP", "NNPS", "PRP", "PRP$"):
            search_list.append(t[0])
    #
    return search_list

# -------------------------------------
def most_important_pareto(paramtext: str):
    listed_text = word_tokenize(paramtext)
    # Keep roughly the top 20% most frequent tokens (the Pareto slice)
    extraction_metric = int(0.20 * round(len(listed_text), 0))
    # print(f"\n\nMost important phrases: {extraction_metric} of {len(listed_text)}")
    freqdist = nltk.FreqDist(samples=listed_text)
    top20percent = freqdist.most_common(extraction_metric)
    for t in top20percent:
        search_list.append(t[0])
    #
    return search_list

#
print("Connecting to Azure...")
azure_client = DocumentAnalysisClient(endpoint, credential)
#
print(f"Opening {filename}...")
with open(filename, "rb") as fd:
    document = fd.read()
#
print(f"Analysing {filename}...")
polled_outcome = azure_client.begin_analyze_document("prebuilt-layout", document)
#
print("Fetching results from Azure...")
outcome = polled_outcome.result()
# Loop through the result set sent by the powerful Form Recognizer endpoint
for page in outcome.pages:
    print(f"Source file {filename}, dimensions: {page.width} x {page.height}, metric unit: {page.unit}")
    for the_lineid, line in enumerate(page.lines):
        all_the_lines.append(line.content)
        full_text = full_text + " " + line.content  # keep a space so words on adjacent lines do not merge

# Clean punctuation unless you want it to be searchable as well :-)
to_clean = r',|\(|\)|\[|\]|;|:|!|#|\.'
refined = re.sub(to_clean, '', full_text)
# Remove 1-character words that are not important unless you want them to be searchable as well :-)
to_clean = r' . '
refined = re.sub(to_clean, ' ', refined)
# Call functions to generate labels - this helps in search and removes the human from the loop
most_important_pareto(refined)
gen_NER_tags(refined)

# Make labels unique to reduce overhead and redundancy
labels_or_search_strings = list(set(search_list))
schema_data = {"filename": filename, "labels": labels_or_search_strings}
# Remove the human in the loop and label the image
df = pd.DataFrame(schema_data)
df.to_csv(filename + ".csv", index=False)
print(df)

Disclaimer

All copyrights and trademarks belong to their respective companies and owners.
