Auto-Generate UML Actors & Use Cases Models from — Business Requirements, Vision & Scope, Business Case, Project Charters and more — using Python & Natural Language Processing: Part 1
Every project generates requirements of some kind. It does not matter whether the requirements are explicit or implicit, stated or unstated: all projects produce documents that contain detailed or brief requirements, such as:
1. Vision & Scope
2. Project Charter
3. Business Case
4. User Requirements
5. Business Requirements
6. Product Requirements
7. Software Requirements Specification
You could argue that not all of the above document types are treated equally. They are not.
You create some of these documents even before you start a project, for example a Project Charter or a Business Case, while you typically write Business Requirements after your project is approved.
No matter which document you produce, contemporary development methodologies expect these documents to be written as use case models or user stories. The latter are becoming more popular these days, and chances are you are using them in your project. Yet clients, especially business users, are used to seeing and reviewing the 'traditional' requirements documents upfront, whereas projects increasingly rest on the use case narratives or the user stories.
In this two-part paper, I give you a Python implementation that takes any document (in text format) as input and extracts:
- Actors
- Use cases
When you make a UML use case model, not all Actors are immediately obvious, identified or known to you at the start. You and your Business Analysts work hard to dig them out, working closely with the modellers and stakeholders to list these Actors.
Why are Actors important?
1. The human Actors do the UAT for you! So missing these Actors is never a good idea. Many of them are your direct stakeholders and approvers.
2. Non-human Actors are even more important. These range from hardware interfaces to components to nodes and more! They become part of manual and automated test cases and also sit in the non-functional requirement space.
Why Generate UML Models?
For requirement documents confined to 30-odd pages, reading through and marking the Actors may not be much of a problem. It is when you run a large project, where the average requirement document surpasses 50 pages, or when you have hundreds of components each described in as many pages, that the task becomes truly daunting and challenging.
Hence, identifying and documenting the Actors is crucial. This is what I will focus on in Part 1. In the second and final part, I will cover use case extraction and bring it all together.
A little note:
Most project documents are saved in .docx or HTML formats. Please take time to save these in text format before you use the code, as I have excluded this conversion step, given the format variations.
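If your requirements arrive as HTML exports, one stdlib-only way to sketch the conversion step I excluded is a small `HTMLParser` subclass. This is a minimal illustration, not the code the tool uses; the class and function names here are my own:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML document, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style> blocks

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)


# Example: convert an exported requirements page to plain text
sample = "<html><body><h1>Scope</h1><p>The approver reviews the charter.</p></body></html>"
print(html_to_text(sample))  # "Scope The approver reviews the charter."
```

You would then write the result to requirements.txt for the main script to pick up. For .docx files you would need a third-party package such as python-docx instead.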
Applies To
- Any document but especially the ones mentioned at the start.
Input
- Requirements document in text format
Output
- Actors from the requirements document
Benefits
- Dramatic productivity improvements
- Faster use case modelling
- Comprehensive list of Actors beyond what the mind can recall
When to Use
- Whenever you do not have time (which is true for most of us!)
- Large projects with large documents (and unbelievable timelines)
Code
import re
#
import tkinter as tk
from tkinter import *
from tkinter import ttk
import pandas as pd
#
from nltk import pos_tag
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
# Instantiate tk inter for cleaner display
win = Tk()
# Random screen size,enough to show controls
win.geometry("410x270")
frm = ttk.Frame(win)
frm.grid()
the_label = Label(frm, text="Here are potential Actors in your document for your Use Case Model. \n"
"Another vantage point: These are Entities or concepts in your domain space.\n")
the_label.pack()
ctr = 1
listbox = Listbox(frm, width=50)
# This function outputs list of actors into a CSV file
def generate_actors():
    win.title("Generating list of Actors, please wait...")
    df = pd.DataFrame(the_schema)
    df.to_csv("actors.csv", index=False)
    win.title("Check 'actors.csv' in the same folder as the app.")
# Create a button
the_button = Button(frm, text="Export Actor List to CSV", relief=tk.RAISED, command=generate_actors)
the_button.pack()
# open the source requirement document
with open("requirements.txt", "rt", encoding="utf-8") as source_file:
    text_in_which_to_search = source_file.read().lower()
# Clean the text. Ideally use a stopword approach for extensibility
# but for now this will suffice, to make the point
interim_text = text_in_which_to_search. \
replace("\n", " ").replace("\xa0", " ").replace("\t", " ").replace(".", " ")
# Deduplicate tokens and strip stray punctuation left over from the cleaning
unique_interim_text = " ".join(
    token.strip("'<>,") for token in set(interim_text.split(" ")) if token
)
# We will find proper nouns, personal pronouns and possessive pronouns
# Generate tokens that are potential candidate actors
tagged_text = pos_tag(word_tokenize(unique_interim_text))
# Lemmatize to simplify the variations of words and reach the root form
the_lemmatizer = WordNetLemmatizer()
# open the actor file as dataframe
# this file contains the strings that are proven actors in the software world
# you can add more nouns to this file if you know the actors in your domain and want
# them found pre-emptively
df_actors = pd.read_csv("sb_actor_tags.csv")
actor_token = []
actor_true_false = []
# this dictionary will help generate dataframe quickly and easily
the_schema = {"actor_token": actor_token, "actor_yes_no": actor_true_false}
# iterate and search in the document
for the_key in df_actors["actor_tag"]:
    # use re.escape so tags such as "end-user" match literally, then
    # search for the first instance of commonly "called" Actors
    # that are already part of a pre-defined vocabulary
    pattern = re.compile(re.escape(the_key))
    matches = pattern.finditer(unique_interim_text)
    for found in matches:
        if len(the_key) > 2:
            actor_token.append(" Known Actor: " + the_key.title())
            listbox.insert(END, " Known Actor: " + the_key.title())
            actor_true_false.append(True)
        # break right after the first instance is found;
        # remove the break if you want all instances
        break
# Look for all nouns and pronouns. These 'Things' tend to be Actors!
for t in tagged_text:
    # A simple membership test over the noun/pronoun POS tags
    if t[1] in ("NN", "NNS", "NNP", "NNPS", "PRP", "PRP$"):
        # Lemmatize to reduce variations and get to the root word
        potential_actor = the_lemmatizer.lemmatize(t[0])
        # ignore two-character strings that may crop up, just in case;
        # you can use stop words if you want to reduce code size
        if len(potential_actor) > 2:
            actor_token.append(potential_actor)
            actor_true_false.append(True)
            listbox.insert(END, " Additional Actor: " + potential_actor.title())
listbox.configure(borderwidth=3)
listbox.pack()
# Start the message loop and show the window
win.mainloop()
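Stripped of the Tkinter display, the vocabulary-matching core of the script can be sketched as a small standalone function. This is my condensed restatement for illustration, with a stand-in subset of sb_actor_tags.csv as the vocabulary:

```python
import re


def find_known_actors(text: str, vocabulary: list[str]) -> list[str]:
    """Return the vocabulary entries that occur at least once in the text."""
    found = []
    lowered = text.lower()
    for tag in vocabulary:
        # escape the tag so entries such as "end-user" match literally,
        # and skip very short tags, mirroring the len > 2 check above
        if len(tag) > 2 and re.search(re.escape(tag), lowered):
            found.append(tag.title())
    return found


# Stand-in subset of the sb_actor_tags.csv vocabulary
vocab = ["user", "approver", "system", "end-user"]
doc = "The System notifies the approver when an end-user submits a form."
print(find_known_actors(doc, vocab))  # ['User', 'Approver', 'System', 'End-User']
```

Note that, like the substring search in the main script, this matches "user" inside "end-user"; add word-boundary anchors (`r"\b" + re.escape(tag) + r"\b"`) if you want whole-word matches only.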
Errata
Apologies to one and all for not including the file this code reads, which lists the Actors commonly known across our industry. Here is the content of sb_actor_tags.csv:
actor_tag
person
human
man
woman
system
user
end-user
end user
screen
client
customer
consumer
reviewer
approver
initiator
applicant
computer
pc
personal computer
laptop
mobile
cell phone
phone
tablet
monitor
actor
analyst
programmer
tester
project manager
project
program
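As mentioned in the code comments, you can pre-seed this vocabulary with Actors from your own domain. A minimal stdlib sketch of appending new tags without creating duplicates might look like this (the function name and the example tags are my own, for illustration):

```python
import csv
from pathlib import Path


def add_actor_tags(csv_path: str, new_tags: list[str]) -> list[str]:
    """Append new actor tags to the vocabulary CSV, skipping duplicates.

    Returns the full tag list after the update.
    """
    path = Path(csv_path)
    existing = []
    if path.exists():
        with path.open(newline="") as f:
            reader = csv.DictReader(f)
            existing = [row["actor_tag"] for row in reader]
    else:
        # create the file with the expected header
        path.write_text("actor_tag\n")
    added = [t for t in new_tags if t not in existing]
    with path.open("a", newline="") as f:
        writer = csv.writer(f)
        for tag in added:
            writer.writerow([tag])
    return existing + added


# Example: add domain-specific actors for a payments project
tags = add_actor_tags("sb_actor_tags.csv", ["merchant", "payment gateway"])
```

Re-running the same call is safe: tags already present in the file are skipped.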