2-Probabilities

elementsofai.com

Published

February 25, 2025

Probability is the tool we use to deal with uncertainty. For example, when you can’t be certain that an input is correct, you can use probability to find the candidate that is most likely to be the correct one.

Refresh on probabilities

Here is data on the proportion of fishers in the population of the Nordic countries:

Country	Population	Fishers	Proportion of fishers
Denmark	5,615,000	1,891	0.034%
Finland	5,439,000	2,652	0.049%
Iceland	324,000	3,800	1.173%
Norway	5,080,000	11,611	0.229%
Sweden	9,609,000	1,757	0.018%
TOTAL	26,067,000	21,711	0.083%

When all outcomes are equally likely, a probability is calculated by dividing the number of favorable outcomes by the total number of possible outcomes.

If a lottery winner is drawn at random from the whole population, the probability that the winner is from a given country is calculated by the following expression:

\(P(country)=population(country)÷population(total)\)

or

\(P(Denmark)=5615000÷26067000=0.2154\) or 21.5%

If we know that the winner is a fisher, the calculation needs to take that into account and looks like this:

\(P(Denmark ∣ fisher)=fishers(Denmark)÷fishers(total)\)

or

\(P(Denmark ∣ fisher)=1891÷21711=0.087\) or 8.7%
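As a quick check of the two calculations above, here is a minimal sketch that computes both probabilities for every country in the table. The dictionary names `population` and `fishers` are illustrative choices, not part of the course material.

```python
# Data copied from the table above
population = {"Denmark": 5_615_000, "Finland": 5_439_000, "Iceland": 324_000,
              "Norway": 5_080_000, "Sweden": 9_609_000}
fishers = {"Denmark": 1_891, "Finland": 2_652, "Iceland": 3_800,
           "Norway": 11_611, "Sweden": 1_757}

total_population = sum(population.values())
total_fishers = sum(fishers.values())

# P(country): chance that a randomly drawn person is from that country
p_country = {c: n / total_population for c, n in population.items()}
# P(country | fisher): the same question, restricted to fishers only
p_country_given_fisher = {c: n / total_fishers for c, n in fishers.items()}

for c in population:
    print(f"{c}: P = {p_country[c]:.1%}, P given fisher = {p_country_given_fisher[c]:.1%}")
```

Knowing that the winner is a fisher shifts most of the probability towards Norway and Iceland, even though Sweden has the largest population.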

The following code tries to guess the nationality of the winner when we know that the winner is a fisher and we know their gender (either female or male):

countries = ['Denmark', 'Finland', 'Iceland', 'Norway', 'Sweden']
populations = [5615000, 5439000, 324000, 5080000, 9609000]
male_fishers = [1822, 2575, 3400, 11291, 1731]
female_fishers = [69, 77, 400, 320, 26] 

def guess(winner_gender):
    if winner_gender == 'female':
        fishers = female_fishers
    else:
        fishers = male_fishers

    # pick the country with the largest share of fishers of this gender
    best_country = None
    biggest = 0.0
    total = sum(fishers)
    for i, country in enumerate(countries):
        # P(country | fisher, gender), expressed as a percentage
        prob = fishers[i] / total * 100
        if prob > biggest:
            biggest = prob
            best_country = country

    return (best_country, biggest)

def main():
    country, fraction = guess("male")
    print("if the winner is male, my guess is he's from %s; probability %.2f%%" % (country, fraction))
    country, fraction = guess("female")
    print("if the winner is female, my guess is she's from %s; probability %.2f%%" % (country, fraction))

main()

if the winner is male, my guess is he's from Norway; probability 54.23%
if the winner is female, my guess is she's from Iceland; probability 44.84%

Monte Carlo

Monte Carlo methods are a class of algorithms that rely on repeated random sampling to estimate numerical results. In probability, they are used to approximate complex distributions, integrals, or expectations when direct computation is impractical.

The core idea:

1. Generate random samples from a probability distribution.
2. Perform calculations on those samples (e.g., estimate an expected value).
3. Aggregate results to approximate the desired probability or numerical outcome.

For example, to approximate π, you randomly throw many darts at a square with a circle inscribed in it and count how many land inside the circle. The fraction that lands inside estimates the probability of hitting the circle, which equals π/4 (the ratio of the two areas), so multiplying that fraction by 4 approximates π.
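The dart example can be sketched in a few lines. This is only an illustration, assuming darts land uniformly on the square from -1 to 1 in both directions with a unit circle inscribed in it. The circle covers π/4 of the square, so the fraction of darts inside it is multiplied by 4. The function name `estimate_pi` and the fixed seed are illustrative choices.

```python
import numpy as np

def estimate_pi(n_darts, seed=0):
    rng = np.random.default_rng(seed)
    # throw darts uniformly at the square [-1, 1] x [-1, 1]
    x = rng.uniform(-1.0, 1.0, size=n_darts)
    y = rng.uniform(-1.0, 1.0, size=n_darts)
    # a dart is inside the unit circle when x^2 + y^2 <= 1
    inside = (x**2 + y**2) <= 1.0
    # the fraction inside approximates the area ratio pi/4
    return 4.0 * inside.mean()

print(estimate_pi(1_000_000))
```

The estimate gets closer to π as the number of darts grows, but only slowly: the error shrinks roughly with the square root of the number of samples.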

Monte Carlo is widely used in finance, physics, AI, and risk analysis.

The following code simulates a random sequence of 10,000 zeros and ones in which each digit is a one with probability 2/3, counts the occurrences of five consecutive ones (11111), and prints the result. The probability that 11111 starts at any given position is \((2/3)^5\), and there are 9996 possible starting positions, so the result of this code should be close to \(9996×(2/3)^5 ≈ 1316.3\)

import numpy as np

def generate(p1):
    # generates 10000 random zeros and ones where the probability of one is p1
    seq = np.random.choice([0, 1], p=[1-p1, p1], size=10000)
    return seq

def count(seq):
    # counts the number of occurrences of 5 consecutive ones ("11111")
    # overlapping occurrences are allowed: 111111 counts as 2 occurrences
    count = 0
    for i in range(len(seq) - 4):
        if seq[i] == 1 and seq[i+1] == 1 and seq[i+2] == 1 and seq[i+3] == 1 and seq[i+4] == 1:
            count += 1
    return count


def main(p1):
    seq = generate(p1)
    return count(seq)

print(main(2/3))

This is a Monte Carlo simulation because it relies on random sampling to estimate the frequency of an event—in this case, the occurrence of five consecutive ones in a random binary sequence.
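A single run gives a noisy count, so one way to see the estimate settle is to average over many runs. The sketch below does that and compares the average with the expected count, computed from the 9996 possible starting positions in a sequence of length 10,000. The helper names `count_runs` and `average_count`, the number of runs, and the seed are assumptions made for illustration.

```python
import numpy as np

def count_runs(seq):
    # position i starts a run when seq[i], ..., seq[i+4] are all ones;
    # shifted slices let overlapping runs count separately
    ones = np.asarray(seq) == 1
    hits = ones[:-4] & ones[1:-3] & ones[2:-2] & ones[3:-1] & ones[4:]
    return int(hits.sum())

def average_count(p1, n_runs=200, n=10_000, seed=0):
    rng = np.random.default_rng(seed)
    counts = []
    for _ in range(n_runs):
        seq = rng.choice([0, 1], p=[1 - p1, p1], size=n)
        counts.append(count_runs(seq))
    return float(np.mean(counts))

expected = (10_000 - 4) * (2 / 3) ** 5   # about 1316.3
print(average_count(2 / 3), expected)
```

The average over many runs should land much closer to the expected value than any single run does.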

The Bayes Rule

Bayes’ Rule is a formula used to update probabilities based on new evidence. It states:

\(P(A∣B)= \frac{P(B∣A)P(A)}{P(B)}\)

It’s useful in medical tests, spam filtering, and AI models to refine predictions based on prior knowledge.

In our fisher scenario from before, we could write \(P(Denmark∣fisher)= \frac{P(fisher∣Denmark)P(Denmark)}{P(fisher)}\)

The probability that a person is Danish, given that they are a fisher, is equal to the probability of a person being a fisher given that they are Danish, multiplied by the probability of a person being Danish, divided by the probability of a person being a fisher.

P(fisher ∣ Denmark) = 0.034% - probability that someone is a fisher given that they’re Danish
P(Denmark) = 21.5% - probability that a person is Danish (only considering Nordic countries)
P(fisher) = 21711 / 26067000 = 0.083% - probability that a person is a fisher (only considering Nordic countries)

\(P(Denmark ∣ fisher)=0.00034×0.215÷0.00083=0.088\) (the difference from the earlier 8.7% comes from rounding the inputs)
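To confirm that the small gap really is a rounding effect, the same calculation can be repeated with unrounded values. This is a minimal sketch using the figures from the table; the variable names are illustrative.

```python
population_denmark = 5_615_000
population_total = 26_067_000
fishers_denmark = 1_891
fishers_total = 21_711

p_fisher_given_denmark = fishers_denmark / population_denmark
p_denmark = population_denmark / population_total
p_fisher = fishers_total / population_total

# Bayes' rule with unrounded inputs
bayes = p_fisher_given_denmark * p_denmark / p_fisher
# direct calculation from the fisher counts
direct = fishers_denmark / fishers_total

print(f"{bayes:.4f} {direct:.4f}")
```

Both values agree (about 0.0871), because the population terms cancel and Bayes' rule reduces to the direct ratio of fisher counts.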

In our case the first method is the easier way to obtain this result, but sometimes we don’t have the full data. For example, in a medical scenario we may know the probability of the effect (a test result) given its cause (a medical condition), but not the other way around.

Example:

@John37330190 started following you on Instagram. You don’t want to have creepy bots following you. To decide whether you should block the new follower, you decide to use the Bayes rule!

We “know” that the probability that a follower is a bot is 10%, so pbot=0.1. The probability that the username of a bot account includes an 8-digit number is about 80%, so p8_bot=0.8, and the probability that a human has a username with an 8-digit number is 5%, so p8_human=0.05.

We can then calculate the probability that a user (human or bot) has an 8-digit number in their username:

\(P(8digits) = P(8digits | bot) * P(bot) + P(8digits | human) * P(human)\)

\(P(8digits) = p8\_bot * pbot + p8\_human * (1-pbot) = 0.8 * 0.1 + 0.05 * 0.9 = 0.125\)

We can determine the probability it’s a bot given it has 8 digits:

\(P(bot | 8digits) = P(8digits | bot) * P(bot) / P(8digits)\)

def bot8(pbot, p8_bot, p8_human):
    p8 =  p8_bot * pbot + p8_human * (1-pbot)
    pbot_8 = p8_bot*pbot/p8
    print(pbot_8)

# you can change these values to test your program with different values
pbot = 0.1
p8_bot = 0.8
p8_human = 0.05

bot8(pbot, p8_bot, p8_human)

0.64

Naive Bayes classifier

One of the most useful applications of the Bayes rule is the so-called Naive Bayes classifier. It is a machine learning technique that can be used to classify objects such as text documents into two or more classes.

For a spam filter, the quantity of interest is the probability of a message being spam given the words it contains. If this probability is high, then the filter may automatically delete the message or put it into a junk mail folder:

\(P(spam∣words)=P(words ∣ spam)P(spam)÷P(words)\)

The idea is to use a large collection of spam messages to estimate the frequency of each word in them, which can be used as P(words ∣ spam). The same is done for non-spam messages, which are often called “ham”, to estimate P(words ∣ ham). As you may notice, the Bayes rule formula above doesn’t explicitly include the latter term, but it is needed to calculate P(words), which refers to the word frequencies in all messages (either ham or spam).

This is how to classify an email as spam or not:

1. Start with the odds 1:1, which means that the probability of spam is 0.5.
2. Calculate the so-called likelihood ratio as r=P(word ∣ spam)÷P(word ∣ ham)
3. Multiply the current odds by r
4. Repeat steps 2 and 3 until all words have been processed
5. Transform the odds into a probability: if \(odds=x:y\), then \(probability=x÷(x+y)\)

For example, after processing multiple emails we estimate that:

P(million|spam)=0.0016285
P(million|ham)=0.0003198
P(conferences|spam)=0.0000100
P(conferences|ham)=0.0000391

The likelihood ratio of each word is:

\(P(million ∣ spam)÷P(million ∣ ham)=0.0016285÷0.0003198=5.0923\)

\(P(conferences ∣ spam)÷P(conferences ∣ ham)=0.0000100÷0.0000391=0.2554\)

This means that the word million is about 5 times more likely to appear in spam than in ham, while the word conferences is about 4 times more likely to appear in ham. For the message “million conferences”, starting from odds of 1:1, we multiply the two ratios: 5.0923×0.2554=1.30. With odds = 1.30:1, the probability is 1.30÷(1.30+1)=0.565 or 56.5%
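The procedure can be sketched as a small function. This is an illustration only: the function name `spam_probability` and the dictionaries are assumptions, the starting odds of 1:1 follow the steps listed earlier, and only the two words with known estimates are supported.

```python
# word frequencies estimated above
p_word_given_spam = {"million": 0.0016285, "conferences": 0.0000100}
p_word_given_ham = {"million": 0.0003198, "conferences": 0.0000391}

def spam_probability(message, prior_odds=1.0):
    odds = prior_odds                      # 1.0 means odds of 1:1
    for word in message.split():
        # multiply by the likelihood ratio of each word
        # (a word without an estimate raises KeyError in this sketch)
        odds *= p_word_given_spam[word] / p_word_given_ham[word]
    return odds / (odds + 1.0)             # odds x:1 -> probability x/(x+1)

print(round(spam_probability("million conferences"), 3))
```

Without intermediate rounding the result is roughly 0.566, slightly above the 56.5% obtained by rounding the odds to 1.30 first.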

Example:

We want to determine whether a six-sided die is fair or loaded based on a sequence of rolls. A fair die has an equal probability of rolling any number from 1 to 6, while a loaded die is biased to roll 6 half of the time.

We will roll the die 10 times and record the results, then analyse them with a naive Bayes classifier to tell whether the die is normal or loaded.

Remember that the odds express how much more likely the die is to be loaded than normal, so each likelihood ratio is p2 / p1. If you use p1 / p2 instead, swap the boolean values so that the function returns False when odds > 1.

import numpy as np

p1 = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]   # normal
p2 = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]   # loaded

def roll(loaded):
    if loaded:
        print("rolling a loaded die")
        p = p2
    else:
        print("rolling a normal die")
        p = p1

    # roll the dice 10 times
    # add 1 to get dice rolls from 1 to 6 instead of 0 to 5
    sequence = np.random.choice(6, size=10, p=p) + 1 
    for value in sequence:   # avoid reusing the function name "roll"
        print("rolled %d" % value)
        
    return sequence

def bayes(sequence):
    odds = 1.0           # start with odds 1:1
    for value in sequence:
        # multiply by the likelihood ratio P(value | loaded) / P(value | normal)
        odds = odds * (p2[value-1] / p1[value-1])
    # odds above 1 mean "loaded" explains the rolls better than "normal"
    return odds > 1

sequence = roll(True)
if bayes(sequence):
    print("I think loaded")
else:
    print("I think normal")

rolling a loaded die
rolled 1
rolled 1
rolled 2
rolled 6
rolled 4
rolled 6
rolled 5
rolled 3
rolled 6
rolled 6
I think loaded
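The classifier only reports a yes/no answer, but the same odds can be turned into a probability. The sketch below applies this to the ten rolls printed above, assuming prior odds of 1:1 as before. The names `p_normal`, `p_loaded` and `loaded_probability` are illustrative.

```python
p_normal = [1/6] * 6
p_loaded = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]

def loaded_probability(sequence):
    odds = 1.0                              # prior odds 1:1 (loaded : normal)
    for value in sequence:
        odds *= p_loaded[value - 1] / p_normal[value - 1]
    return odds, odds / (odds + 1.0)        # odds and the matching probability

rolls = [1, 1, 2, 6, 4, 6, 5, 3, 6, 6]      # the sequence shown in the output above
odds, prob = loaded_probability(rolls)
print(f"odds {odds:.2f}:1, probability loaded {prob:.1%}")
```

Each 6 multiplies the odds by 3 and every other roll multiplies them by 0.6, so four sixes in ten rolls give odds of about 3.78:1, or roughly a 79% probability that the die is loaded.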