Natural language processing (NLP) comprises techniques that enable us to solve various tasks such as internet search, document categorization, and automatic question answering (think of smart assistants such as Siri and Alexa).
In previous chapters we have been working with numerical data such as surface areas and distances, but what if we want to integrate comments from previous owners? Something like “under a flight route and very noisy” could be important for determining the price of a mökki. But computers are big calculators, which means they only work with numbers, so we need a way to represent text as numbers.
Bag of words
The Bag of Words (BoW) model is one of the simplest text encoding methods. It represents text as a set of word frequencies or occurrences, disregarding grammar and word order. Each unique word in the dataset is assigned an index, and texts are represented as numerical vectors based on word counts. The motivation for ignoring word order is that the number of possible words is already huge, and it is challenging enough to learn models that involve thousands of parameters. Models in which “this little Piggy” means something different from “Piggy little this” tend to require even more parameters.
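To make the order-insensitivity concrete, here is a minimal sketch. The helper name `bag_of_words` is hypothetical, and it assumes words are simply separated by whitespace; it only illustrates that two phrases with the same words in a different order get the same representation.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase, whitespace-separated words, ignoring their order."""
    return Counter(text.lower().split())


original = bag_of_words("this little Piggy")
shuffled = bag_of_words("Piggy little this")

# Both phrases contain the same words the same number of times,
# so their bag-of-words representations compare equal.
print(original == shuffled)
```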
import pandas as pd
import re
from collections import Counter
from IPython.display import display, Markdown

def text_to_word_table(text: str):
    lines = text.split('\n')  # Split text into lines

    # Extract words, ignoring punctuation
    def extract_words(line):
        return re.findall(r'\b\w+\b', line.lower())

    all_words = sorted(set(word for line in lines for word in extract_words(line)))  # Unique words

    data = []
    for line in lines:
        word_counts = Counter(extract_words(line))  # Count occurrences in the line
        data.append([word_counts.get(word, 0) for word in all_words])  # Row for DataFrame

    df = pd.DataFrame(data, columns=all_words)
    # Display as Markdown in Quarto
    display(Markdown(df.to_markdown(index=False)))
    return data

# Example usage
text = """This little piggy went to market,
This little piggy stayed home,
This little piggy had roast beef,
This little piggy had none,
And this little piggy cried "Wee! Wee! Wee!" all the way home."""
data_array = text_to_word_table(text)
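| all | and | beef | cried | had | home | little | market | none | piggy | roast | stayed | the | this | to | way | wee | went |
|-----|-----|------|-------|-----|------|--------|--------|------|-------|-------|--------|-----|------|----|-----|-----|------|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 3 | 0 |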
With this array we can then find which lines are most alike:
import numpy as np

def find_nearest_pair(data):
    N = len(data)
    dist = np.empty((N, N), dtype=float)  # uninitialized N x N array, one row and one column per line
    for i in range(N):
        for j in range(N):
            # Sum of the absolute differences for each word, e.g. for the first two lines:
            # data[i] = [0 0 0 0 0 0 1 1 0 1 0 0 0 1 1 0 0 1]
            # data[j] = [0 0 0 0 0 1 1 0 0 1 0 1 0 1 0 0 0 0]
            # i-j  -> [0 0 0 0 0 -1 0 1 0 0 0 -1 0 0 1 0 0 1]
            # abs  -> [0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 1]
            # sum  -> 5
            dist[i, j] = np.sum(np.abs(np.array(data[i]) - np.array(data[j])))
    np.fill_diagonal(dist, np.inf)  # discard the diagonal because it compares a line with itself
    min_indices = np.unravel_index(np.argmin(dist), dist.shape)  # indices of the lowest value in the array
    return min_indices

nearest_pair = find_nearest_pair(data_array)
The most similar lines are at indices (2, 3). However, this method gives the same importance to “and” as to “piggy”, and those two words are obviously not equally important for finding similarities.
Tf-idf
The technique called by the cumbersome name Term Frequency Inverse Document Frequency (tf-idf) places more weight on occurrences of infrequent words compared to common words like ‘a’, ‘the’, ‘is’, and so on.
Calculate the frequency (the number of occurrences divided by document length) for each word in your collection of documents. This is the “term frequency”, or \(tf\). (Note: ignore punctuation and capitalization when doing this.)
Calculate how many documents each word appears in, and divide this by the total number of documents. This is the “document frequency”, or \(df\). Since we wish to assign less weight to common words, we will use the inverse of this, \(1/df\).
There are different ways to combine these two numbers to assign weights to each word. The most common is the product of the term frequency and the logarithm of the inverse of the document frequency: \(\text{tf-idf} = tf \times \log(1/df)\).
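The three steps above can be sketched as a single function. This is only an illustration under stated assumptions: the function name `tf_idf` and its parameter names are hypothetical, and the base-10 logarithm is a choice made here, since the formula does not fix a base.

```python
import math


def tf_idf(term_count: int, doc_length: int, docs_with_term: int, n_docs: int) -> float:
    """Return tf * log10(1 / df) for one word in one document (base 10 is an assumption)."""
    tf = term_count / doc_length     # step 1: term frequency
    df = docs_with_term / n_docs     # step 2: document frequency
    return tf * math.log10(1 / df)   # step 3: combine the two


# A word that appears in every document gets no weight, because log10(1 / 1) = 0.
common = tf_idf(term_count=1, doc_length=5, docs_with_term=10, n_docs=10)

# The same count for a word found in only one document gets a positive weight.
rare = tf_idf(term_count=1, doc_length=5, docs_with_term=1, n_docs=10)

print(common, rare)
```

Note that `docs_with_term` must be at least 1, otherwise \(df\) is zero and the inverse is undefined.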
Example:
from collections import Counter
import math

document_1 = "He really, really loves coffee"
document_2 = "My sister dislikes coffee"
document_3 = "My sister loves tea"
corpus = [document_1, document_2, document_3]

# words = distinct words in corpus
words = set()
for doc in corpus:
    words.update(doc.lower().replace(',', '').split())
words = sorted(words)  # Sort words for consistent indexing

# Compute term frequency
tf = []
for doc in corpus:
    word_list = doc.lower().replace(',', '').split()
    word_count = Counter(word_list)
    doc_length = len(word_list)
    tf.append({word: word_count[word] / doc_length for word in words})

# Compute document frequency
corpus_word_count = Counter()
for doc in corpus:
    corpus_word_count.update(set(doc.lower().replace(',', '').split()))
doc_count = len(corpus)
df = {word: corpus_word_count[word] / doc_count for word in words}

# Compute term frequency - inverse document frequency: tf * log10(1 / df)
for i, doc in enumerate(corpus):
    print("document " + str(i + 1))
    for word in words:
        print("\t" + word + ": tf-idf = " + str(tf[i][word] * math.log10(1 / df[word])))
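As a check by hand, here is the formula \(tf \times \log(1/df)\) applied to two words of the first document, “He really, really loves coffee” (five words). The base-10 logarithm is an assumption taken from the `math.log10` call in the code; the values are rounded. “really” occurs twice and only in this document, while “coffee” occurs once and appears in two of the three documents:

\[
\text{tf-idf}(\text{really}) = \frac{2}{5} \times \log_{10}\!\left(\frac{1}{1/3}\right) = 0.4 \times \log_{10}(3) \approx 0.191
\]

\[
\text{tf-idf}(\text{coffee}) = \frac{1}{5} \times \log_{10}\!\left(\frac{1}{2/3}\right) = 0.2 \times \log_{10}(1.5) \approx 0.035
\]

The rarer and more frequent word “really” ends up with roughly five times the weight of “coffee”.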
Now we can combine this weighting with the nearest-pair search from the first example. Let’s do it with Humpty Dumpty:
import math
from collections import Counter

import pandas as pd
from IPython.display import display, Markdown

text = '''Humpty Dumpty sat on a wall
Humpty Dumpty had a great fall
all the king's horses and all the king's men
couldn't put Humpty together again'''

def main(text):
    # 1. Split the text into words and lines
    docs = [line.lower().split() for line in text.split('\n')]
    # Extract unique words
    words = set(word for doc in docs for word in doc)
    words = sorted(words)  # Sort words for consistent indexing

    # 2. Compute term frequency for each document (line)
    tf = []
    for doc in docs:
        word_count = Counter(doc)  # create the word count per line
        doc_length = len(doc)
        tf.append({word: word_count[word] / doc_length for word in words})

    # 3. Compute document frequency (DF): the fraction of lines that contain each word
    corpus_word_count = Counter()
    for doc in docs:
        corpus_word_count.update(set(doc))
    doc_count = len(docs)
    df = {word: corpus_word_count[word] / doc_count for word in words}

    # Compute TF-IDF for each document
    tf_idf = []
    for i, doc in enumerate(docs):
        tf_idf_vector = {}
        for word in words:
            # The 1 + df[word] term avoids division by zero. Note that df[word] is a
            # fraction here, so this weight differs from the plain log10(1 / df).
            tf_idf_vector[word] = tf[i][word] * math.log10(doc_count / (1 + df[word]))
        tf_idf.append(tf_idf_vector)

    print("We can now see that instead of ones")
    print("we have a calculated value per word with more rare words having a higher value: ")
    table = pd.DataFrame(tf_idf)
    display(Markdown(table.to_markdown(index=False)))

    # 4. Calculate distances between each pair of lines to find the closest ones
    def calculate_distance(vec1, vec2):
        return sum(abs(vec1[word] - vec2[word]) for word in words)

    min_distance = float('inf')
    closest_pair = (None, None)
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            dist = calculate_distance(tf_idf[i], tf_idf[j])
            print(f"dist of line {i} & {j} = {dist}")
            if dist < min_distance:
                min_distance = dist
                closest_pair = (i, j)

    # Output the closest lines
    print("\nThe two most similar lines are:")
    print(f"Line {closest_pair[0] + 1}: {' '.join(docs[closest_pair[0]])}")
    print(f"Line {closest_pair[1] + 1}: {' '.join(docs[closest_pair[1]])}")

# Run the main function
main(text)
We can now see that instead of ones
we have a calculated value per word with more rare words having a higher value:
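| a | again | all | and | couldn’t | dumpty | fall | great | had | horses | humpty | king’s | men | on | put | sat | the | together | wall |
|---|-------|-----|-----|----------|--------|------|-------|-----|--------|--------|--------|-----|----|-----|-----|-----|----------|------|
| 0.0709948 | 0 | 0 | 0 | 0 | 0.0709948 | 0 | 0 | 0 | 0 | 0.059837 | 0 | 0 | 0.0841917 | 0 | 0.0841917 | 0 | 0 | 0.0841917 |
| 0.0709948 | 0 | 0 | 0 | 0 | 0.0709948 | 0.0841917 | 0.0841917 | 0.0841917 | 0 | 0.059837 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0.112256 | 0.0561278 | 0 | 0 | 0 | 0 | 0 | 0.0561278 | 0 | 0.112256 | 0.0561278 | 0 | 0 | 0 | 0.112256 | 0 | 0 |
| 0 | 0.10103 | 0 | 0 | 0.10103 | 0 | 0 | 0 | 0 | 0 | 0.0718044 | 0 | 0 | 0 | 0.10103 | 0 | 0 | 0.10103 | 0 |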
dist of line 0 & 1 = 0.505149978319906
dist of line 0 & 2 = 0.959551535344231
dist of line 0 & 3 = 0.8106519473280271
dist of line 1 & 2 = 0.9595515353442308
dist of line 1 & 3 = 0.8106519473280271
dist of line 2 & 3 = 0.9810743495041644
The two most similar lines are:
Line 1: humpty dumpty sat on a wall
Line 2: humpty dumpty had a great fall
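The smallest distance in the output can be reproduced by hand. The first two lines share “humpty”, “dumpty” and “a” with identical weights, so those words contribute nothing. They differ in six words (“sat”, “on”, “wall” versus “had”, “great”, “fall”), each of which occurs once in a six-word line and in one of the four lines, so \(tf = 1/6\) and \(df = 0.25\). Using the weighting from the code, \(tf \times \log_{10}(N / (1 + df))\) with \(N = 4\):

\[
d(\text{line 0}, \text{line 1}) = 6 \times \frac{1}{6} \times \log_{10}\!\left(\frac{4}{1 + 0.25}\right) = \log_{10}(3.2) \approx 0.505
\]

This matches the printed value for lines 0 and 1.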
This marks an important step in text processing, allowing us to represent language in a structured way for various tasks. While BoW and TF-IDF provide a simple yet effective approach to text classification and information retrieval, they have limitations in capturing meaning and context. To address these challenges, we turn to embeddings—a more advanced technique that maps words or entire texts into dense numerical vectors, preserving semantic relationships and unlocking even more powerful applications in natural language processing.