Android Keyboard-Like Predictive Text with Python
Ever since we started using smartphones and tablets, we have gotten used to a simple but powerful feature: text prediction, i.e., suggesting the next word. This feature needs no introduction, for it has been around on many popular search engines.
We take this ability for granted and expect ever better predictions. We also get frustrated when we are anticipating a certain word but the keyboard predicts some other word altogether! It isn’t that the keyboard is at fault. The issue lies in:
1) The corpus the keyboard was trained on
2) Type of algorithm used
3) How frequently the corpus is updated
4) Whether keyboard ‘actively learns’ what you input
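That fourth point, a keyboard that actively learns, can be as simple as a bigram frequency table that updates every time the user types. Here is a minimal sketch of the idea (the class name ActiveLearner and the sample sentence are my own illustration, not taken from any real keyboard):

```python
from collections import defaultdict, Counter

class ActiveLearner:
    """A bigram frequency table that keeps learning as the user types."""

    def __init__(self):
        self.freq = defaultdict(Counter)

    def learn(self, text):
        # update the counts for every adjacent word pair in the new input
        words = text.split()
        for first, second in zip(words, words[1:]):
            self.freq[first][second] += 1

    def predict(self, word, k=3):
        # return up to k most frequent followers of the given word
        return [w for w, _ in self.freq[word].most_common(k)]

model = ActiveLearner()
model.learn("add to cart add to wishlist add to cart")
print(model.predict("to"))  # ['cart', 'wishlist']
```

Every call to learn refines the counts, so the more the user types, the better the suggestions get.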
This feature has several names in the computing world:
· Type ahead
· Auto complete
· Quick fill
· What next
· Text Picker
And so on.
What I propose in this paper is a simple text-prediction implementation that you can optimize for the user experience and for domain-specific input.
For example, if you are implementing an app for the retail space, you can train the keyboard on a corpus that belongs to the retail world. Words like [‘shop’, ‘price’, ‘cart’, ‘discount’, ‘add to’,…] are likely to appear in the documents you see or read. If you write sports articles, then words like [‘play’, ‘time’, ‘team’, ‘break’, ‘captain’,…] will likely figure in them.
Why am I saying all this?
The idea is that your keyboard can learn exactly what is relevant to the domain of your implementation (or perhaps consume all the text that ever existed, if you do not want a domain-specific implementation).
You can create a lightweight, domain-centric implementation, or additionally fold in related domains.
For example, when you are implementing a rock-music-friendly keyboard, you can benefit from a pop-music corpus. (Fans and purists may argue that the legendary Metallica may never record a duet with the legendary Madonna, but that is a conversation for ….)
A domain-specific implementation is a better idea. Why?
In most custom application software, the domain is defined well in advance by your client. Industry- or domain-independent apps are different, in that they must be as free of any particular vocabulary or corpus as possible; that is how the typical one-keyboard-fits-all on your phone operates.
Let us bungee jump into the code. It is light and frugal.
1. Take a text file from your domain. I recommend you copy and paste the text from all of your files into one single text file. If you don’t have one, just play with the one I have given here. (I have used a paragraph from this paper to underscore a domain-centric keyboard implementation!)
2. Make bigrams. You can optionally clean your text of unwanted punctuation and the like; I have skipped that step for brevity.
3. Use Pandas to structure the text into features. I have created two columns, one for each word of the bigram.
4. Use scikit-learn for encoding and classification.
5. Encode the labels.
6. (Generate interim, periodic CSV files for tracebacks).
7. Classify with any classifier. I have used a Decision Tree: it is powerful and tunable. A neural network is another possibility.
8. All done!
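The optional clean-up mentioned in step 2, which I skipped above for brevity, could look like this (a minimal sketch; the regex and the lowercasing policy are my assumptions, so tune them for your corpus):

```python
import re

text = "Shop now! Add to cart, check the price."
# strip punctuation and normalize case before building the bigrams
cleaned = re.sub(r"[^\w\s]", "", text).lower()
words = cleaned.split()
# the same pairs that nltk.util.ngrams(words, 2) would yield
bigrams = list(zip(words, words[1:]))
print(bigrams[:3])  # [('shop', 'now'), ('now', 'add'), ('add', 'to')]
```

Whether to lowercase, keep apostrophes, or preserve hashtags is a per-domain decision; a retail corpus and a social-media corpus will want different rules.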
Here is the code:
import numpy as np, pandas as pd
from nltk.util import ngrams
from sklearn import preprocessing, tree
#
# Use the code filename as the prefix for all the files the code generates
filename = str(__file__).split("/")[-1].split(".")[0]
#
with open("any.txt", "rt") as f:
    contents = f.read()
#
schema = ["word_1", "word_2"]
df_data = []
the_ngrams = ngrams(contents.split(), 2)
# iterate through the bigram tuples
for an_item in the_ngrams:
    df_data.append([an_item[0], an_item[1]])
df2 = pd.DataFrame(data=df_data, columns=schema)
df2.to_csv(filename + "_struct_text.csv", index=False)
# encode the labels -- one encoder per column, so that we can still
# invert the predictions for the second column later
encoders = {}
for col in df2.columns:
    print(f"Column: '{col}', datatype: {df2[col].dtype}")
    # encoding is needed for object (string) types
    if df2[col].dtype == "object":
        encoders[col] = preprocessing.LabelEncoder()
        df2[col] = encoders[col].fit_transform(df2[col])
#
print("Writing encoded features...")
df2.to_csv(filename + "_num_encoded.csv", index=False)
#
X = df2["word_1"]
Y = df2["word_2"]
# We have a single feature, so we must reshape, else not needed :-)
X = np.array(X).reshape(-1, 1)
#
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
# predict the word that follows whichever word was encoded as 0
predicted = clf.predict([[0]])
print(f"Next word on the keyboard may be: {encoders['word_2'].inverse_transform(predicted)}")
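One caveat: clf.predict([[0]]) asks for the word that follows whichever word happened to be encoded as 0. For a keyboard, you want to feed in the word the user just typed and get back readable, ranked suggestions. A minimal, self-contained sketch of that lookup (the suggest helper, the two-encoder layout, and the toy corpus are my own additions, not part of the code above), using predict_proba to rank the candidates:

```python
import numpy as np
from sklearn import preprocessing, tree

corpus = "add to cart add to wishlist price of cart".split()
pairs = list(zip(corpus, corpus[1:]))

# one encoder per column: map a typed word in, map a prediction back out
enc_in = preprocessing.LabelEncoder().fit([a for a, _ in pairs])
enc_out = preprocessing.LabelEncoder().fit([b for _, b in pairs])

X = enc_in.transform([a for a, _ in pairs]).reshape(-1, 1)
Y = enc_out.transform([b for _, b in pairs])

clf = tree.DecisionTreeClassifier().fit(X, Y)

def suggest(word, k=3):
    """Return up to k next-word candidates, most probable first."""
    if word not in enc_in.classes_:
        return []
    code = enc_in.transform([word]).reshape(-1, 1)
    proba = clf.predict_proba(code)[0]
    top = np.argsort(proba)[::-1][:k]
    return [enc_out.classes_[i] for i in top if proba[i] > 0]

print(suggest("to"))
```

For “to”, the toy corpus makes ‘cart’ and ‘wishlist’ equally likely, so both come back; the tie-break order depends on the encoder’s label ordering.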
And here is the text in any.txt:
What I propose in this paper is a simple implementation wherein you can dramatically optimize the user experience and optimize the domain you are implementing.
Disclaimer: All copyrights and trademarks belong to their respective companies and owners. The purpose of this paper is educational only and the views herein are my own.