4-Working with text (NLP)

elementsofai.com

Published

March 7, 2025

Natural language processing (NLP) comprises techniques that enable us to solve tasks such as internet search, document categorization, and automatic question answering (think of smart assistants such as Siri and Alexa).

In the previous chapter we worked with numerical data such as surface areas and distances, but what if we want to take comments from previous owners into account? Something like “under a flight route and very noisy” could be important in determining the price of a mökki. But computers are big calculators, which means they only work with numbers, so we need a way to represent text as numbers.

Bag of words

The Bag of Words (BoW) model is one of the simplest text encoding methods. It represents text as a collection of word counts, disregarding grammar and word order. Each unique word in the dataset is assigned an index, and each text is represented as a numerical vector of word counts. The motivation for ignoring word order is that the number of possible words is already huge, and it is challenging enough to learn models that involve thousands of parameters. Models in which “this little Piggy” means something different from “Piggy little this” tend to require even more parameters.
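
To make the "word order is ignored" point concrete, here is a minimal sketch. The helper name `bag_of_words` is only an illustration (it is not part of the course code), and it assumes that splitting on whitespace is good enough for these short phrases.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase words; the order they appeared in is discarded."""
    return Counter(text.lower().split())

first = bag_of_words("this little Piggy")
second = bag_of_words("Piggy little this")

# Both phrases contain the same words the same number of times,
# so their bag-of-words representations are identical.
print(first == second)  # True
```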

import pandas as pd
import re
from collections import Counter
from IPython.display import display, Markdown

def text_to_word_table(text: str):
    lines = text.split('\n')  # Split text into lines
    
    # Extract words, ignoring punctuation
    def extract_words(line):
        return re.findall(r'\b\w+\b', line.lower())
    
    all_words = sorted(set(word for line in lines for word in extract_words(line)))  # Unique words
    
    data = []
    for line in lines:
        word_counts = Counter(extract_words(line))  # Count occurrences in the line
        data.append([word_counts.get(word, 0) for word in all_words])  # Row for DataFrame
    
    df = pd.DataFrame(data, columns=all_words)
    
    # Display as Markdown in Quarto
    display(Markdown(df.to_markdown(index=False)))

    return data 

# Example usage
text="""This little piggy went to market,
This little piggy stayed home,
This little piggy had roast beef,
This little piggy had none,
And this little piggy cried "Wee! Wee! Wee!" all the way home."""
data_array = text_to_word_table(text)

all	and	beef	cried	had	home	little	market	none	piggy	roast	stayed	the	this	to	way	wee	went
0	0	0	0	0	0	1	1	0	1	0	0	0	1	1	0	0	1
0	0	0	0	0	1	1	0	0	1	0	1	0	1	0	0	0	0
0	0	1	0	1	0	1	0	0	1	1	0	0	1	0	0	0	0
0	0	0	0	1	0	1	0	1	1	0	0	0	1	0	0	0	0
1	1	0	1	0	1	1	0	0	1	0	0	1	1	0	1	3	0

With this array we can then find which lines are the most alike:

import numpy as np

def find_nearest_pair(data):
    N = len(data)
    dist = np.empty((N, N), dtype=float)  # uninitialized N x N array; every entry is overwritten below

    for i in range(N):
        for j in range(N):
            dist[i, j] = np.sum(np.abs(np.array(data[i]) - np.array(data[j])))  # sum of the absolute differences for each word:
            # data[i] = [0 0 0 0 0 0 1 1 0 1 0 0 0 1 1 0 0 1]
            # data[j] = [0 0 0 0 0 1 1 0 0 1 0 1 0 1 0 0 0 0]
            # i-j -> [ 0  0  0  0  0 -1  0  1  0  0  0 -1  0  0  1  0  0  1]
            # abs -> [0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 1]
            # sum -> 5

    np.fill_diagonal(dist, np.inf) # discard diagonal because it's the comparison of a line with itself

    min_indices = np.unravel_index(np.argmin(dist), dist.shape) #gets the indices of the lowest value of the array
    return min_indices
   

nearest_pair = find_nearest_pair(data_array)
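
As an aside, the nested loops in `find_nearest_pair` could probably be replaced by NumPy broadcasting. The sketch below is one way to do that, not the course's method; the name `counts` is an assumption, and its rows are copied by hand from the word-count table above so that the block runs on its own.

```python
import numpy as np

# Word-count rows copied from the table above (one row per line of the rhyme)
counts = np.array([
    [0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 3, 0],
])

# Broadcasting compares every row with every other row in one step:
# the intermediate array has shape (5, 5, 18), and summing over the
# last axis gives the 5 x 5 matrix of summed absolute differences.
dist = np.abs(counts[:, None, :] - counts[None, :, :]).sum(axis=2).astype(float)

np.fill_diagonal(dist, np.inf)  # ignore the distance of a line to itself
i, j = np.unravel_index(np.argmin(dist), dist.shape)
print(int(i), int(j))  # 2 3
```

For five short lines the difference in speed is irrelevant; the broadcasting form mainly pays off when there are many documents.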

The most similar lines are at indices (2, 3). However, this method gives “and” the same importance as “piggy”, even though the two are obviously not equally useful for finding similarities.

Tf-idf

The technique called by the cumbersome name Term Frequency Inverse Document Frequency (tf-idf) places more weight on occurrences of infrequent words compared to common words like ‘a’, ‘the’, ‘is’, and so on.

Calculate the frequency (the number of occurrences divided by document length) for each word in your collection of documents. This is the “term frequency”, or \(tf\). (Note: ignore punctuation and capitalization when doing this.)
Calculate how many documents each word appears in, and divide this by the total number of documents. This is the “document frequency”, or \(df\). Since we wish to assign less weight to common words, we will use the inverse of this, \(1/df\).
There are different ways to combine these two numbers to assign weights to each word. The most common is the product of the term frequency and the logarithm of the inverse of the document frequency: \(\text{tf-idf} = tf \times \log(1/df)\).

Example:

from collections import Counter
import math

document_1="He really, really loves coffee"
document_2="My sister dislikes coffee"
document_3="My sister loves tea"

corpus=[document_1,document_2,document_3]

# words = distinct words in corpus
words = set()
for doc in corpus:
    words.update(doc.lower().replace(',', '').split())

words = sorted(words)  # Sort words for consistent indexing

# Compute term frequency
tf = []
for doc in corpus:
    word_list = doc.lower().replace(',', '').split()
    word_count = Counter(word_list)
    doc_length = len(word_list)
    tf.append({word: word_count[word] / doc_length for word in words})


# Compute document frequency
corpus_word_count = Counter()
for doc in corpus:
    corpus_word_count.update(set(doc.lower().replace(',', '').split()))

doc_count = len(corpus)
df = {word: corpus_word_count[word] / doc_count for word in words}

# Compute term frequency - inverse document frequency: tf * log10(1 / df)
for i, doc in enumerate(corpus):
    print(f"document {i + 1}")
    for word in words:
        tf_idf = tf[i][word] * math.log10(1 / df[word])
        print(f"\t{word}: tf-idf = {tf_idf:.4f}")

document 1
    coffee: tf-idf = 0.0352
    dislikes: tf-idf = 0.0000
    he: tf-idf = 0.0954
    loves: tf-idf = 0.0352
    my: tf-idf = 0.0000
    really: tf-idf = 0.1908
    sister: tf-idf = 0.0000
    tea: tf-idf = 0.0000
document 2
    coffee: tf-idf = 0.0440
    dislikes: tf-idf = 0.1193
    he: tf-idf = 0.0000
    loves: tf-idf = 0.0000
    my: tf-idf = 0.0440
    really: tf-idf = 0.0000
    sister: tf-idf = 0.0440
    tea: tf-idf = 0.0000
document 3
    coffee: tf-idf = 0.0000
    dislikes: tf-idf = 0.0000
    he: tf-idf = 0.0000
    loves: tf-idf = 0.0440
    my: tf-idf = 0.0440
    really: tf-idf = 0.0000
    sister: tf-idf = 0.0440
    tea: tf-idf = 0.1193
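
As a sanity check on the output above, the value for “really” in document 1 can be reproduced by hand. This is only a worked example of the formula; the variable names are made up for illustration.

```python
import math

tf_really = 2 / 5  # "really" occurs twice among the five words of document 1
df_really = 1 / 3  # it appears in one of the three documents

# tf-idf = tf * log10(1 / df)
tf_idf_really = tf_really * math.log10(1 / df_really)
print(round(tf_idf_really, 4))  # 0.1908
```

Because “really” is both frequent within document 1 and absent from the other documents, it gets the highest weight in that document.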

Now we can combine tf-idf with the nearest-pair search from the first example. Let’s try it on Humpty Dumpty:

import math
from collections import Counter

import pandas as pd
from IPython.display import display, Markdown

text = '''Humpty Dumpty sat on a wall
Humpty Dumpty had a great fall
all the king's horses and all the king's men
couldn't put Humpty together again'''

def main(text):
    # 1. Split the text into words and lines
    docs = [line.lower().split() for line in text.split('\n')]

    # Extract unique words
    words = set(word for doc in docs for word in doc)

    words = sorted(words)  # Sort words for consistent indexing

    # 2. Compute term frequency for each document (line)
    tf = []
    for doc in docs:
        word_count = Counter(doc) # create the word count per line
        doc_length = len(doc)
        tf.append({word: word_count[word] / doc_length for word in words})

    # 3. Compute document frequency (DF): the fraction of lines each word appears in
    corpus_word_count = Counter()
    for doc in docs:
        corpus_word_count.update(set(doc))

    doc_count = len(docs)
    df = {word: corpus_word_count[word] / doc_count for word in words}

    # Compute TF-IDF for each document: tf * log10(1 / df)
    tf_idf = []
    for i, doc in enumerate(docs):
        tf_idf_vector = {}
        for word in words:
            # df[word] is never zero here because every word comes from the corpus
            tf_idf_vector[word] = tf[i][word] * math.log10(1 / df[word])
        tf_idf.append(tf_idf_vector)

    print("We can now see that instead of ones")
    print("we have a calculated value per word, with rarer words having a higher value:")
    table = pd.DataFrame(tf_idf)  # separate name so the df dictionary is not overwritten
    display(Markdown(table.to_markdown(index=False)))

    # 4. Calculate distances between each line to find the closest ones
    def calculate_distance(vec1, vec2):
        return sum(abs(vec1[word] - vec2[word]) for word in words)

    min_distance = float('inf')
    closest_pair = (None, None)

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            dist = calculate_distance(tf_idf[i], tf_idf[j])
            print(f"dist of line {i} & {j} = {dist:.4f}")
            if dist < min_distance:
                min_distance = dist
                closest_pair = (i, j)

    # Output the closest lines
    print("\nThe two most similar lines are:")
    print(f"Line {closest_pair[0] + 1}: {' '.join(docs[closest_pair[0]])}")
    print(f"Line {closest_pair[1] + 1}: {' '.join(docs[closest_pair[1]])}")

# Run the main function
main(text)

We can now see that instead of ones
we have a calculated value per word, with rarer words having a higher value:

a	again	all	and	couldn’t	dumpty	fall	great	had	horses	humpty	king’s	men	on	put	sat	the	together	wall
0.0501717	0	0	0	0	0.0501717	0	0	0	0	0.0208231	0	0	0.100343	0	0.100343	0	0	0.100343
0.0501717	0	0	0	0	0.0501717	0.100343	0.100343	0.100343	0	0.0208231	0	0	0	0	0	0	0	0
0	0	0.133791	0.0668956	0	0	0	0	0	0.0668956	0	0.133791	0.0668956	0	0	0	0.133791	0	0
0	0.120412	0	0	0.120412	0	0	0	0	0	0.0249877	0	0	0	0.120412	0	0	0.120412	0

dist of line 0 & 1 = 0.6021
dist of line 0 & 2 = 1.0243
dist of line 0 & 3 = 0.8872
dist of line 1 & 2 = 1.0243
dist of line 1 & 3 = 0.8872
dist of line 2 & 3 = 1.1087

The two most similar lines are:
Line 1: humpty dumpty sat on a wall
Line 2: humpty dumpty had a great fall

This marks an important step in text processing: language is now represented in a structured, numerical form that can be used for various tasks. While BoW and tf-idf provide a simple yet effective approach to text classification and information retrieval, they have limitations in capturing meaning and context. To address these limitations, we turn to embeddings, a more advanced technique that maps words or entire texts into dense numerical vectors that preserve semantic relationships and unlock even more powerful applications in natural language processing.