Foundations of Probability in Python - Datacamp course | Solution
Course Description
Probability is the study of regularities that emerge in the outcomes of random experiments. In this course, you’ll learn about fundamental probability concepts like random variables (starting with the classic coin flip example) and how to calculate mean and variance, probability distributions, and conditional probability. We’ll also explore two very important results in probability: the law of large numbers and the central limit theorem. Since probability is at the core of data science and machine learning, these concepts will help you understand and apply models more robustly. Chances are everywhere, and the study of probability will change the way you see the world. Let’s get random!
Chapter-1
1. Let’s start flipping coins
A coin flip is the classic example of a random experiment. The possible outcomes are heads or tails. This type of experiment, known as a Bernoulli or binomial trial, allows us to study problems with two possible outcomes, like “yes” or “no” and “vote” or “no vote.” This chapter introduces Bernoulli experiments, binomial distributions to model multiple Bernoulli trials, and probability simulations with the scipy library.
Flipping coins
This exercise requires the bernoulli object from the scipy.stats library to simulate the two possible outcomes from a coin flip, 1 (“heads”) or 0 (“tails”), and the numpy library (loaded as np) to set the random generator seed.
You’ll use the bernoulli.rvs() function to simulate coin flips using the size argument.
You will set the random seed so you can reproduce the results for the random experiment in each exercise.
From each experiment, you will get the values of each coin flip. You can add the coin flips to get the number of heads after flipping 10 coins using the sum() function.
1. Import bernoulli from scipy.stats, set the seed with np.random.seed(). Simulate 1 flip, with a 35% chance of heads.
# numpy is needed for np.random.seed()
import numpy as np
from scipy.stats import bernoulli
np.random.seed(42)
coin_flip = bernoulli.rvs(p=.35, size=1)
print(coin_flip)
Use bernoulli.rvs() and sum() to get the number of heads after 10 coin flips with 35% chance of getting heads.
# Import the bernoulli object from scipy.stats
from scipy.stats import bernoulli
# Set the random seed to reproduce the results
np.random.seed(42)
# Simulate ten coin flips and get the number of heads
ten_coin_flips = bernoulli.rvs(p=.35, size=10)
coin_flips_sum = sum(ten_coin_flips)
print(coin_flips_sum)
Using bernoulli.rvs() and sum(), try to get the number of heads after 5 flips with a 50% chance of getting heads.
# Import the bernoulli object from scipy.stats
from scipy.stats import bernoulli
# Set the random seed to reproduce the results
np.random.seed(42)
# Simulate five coin flips and get the number of heads
five_coin_flips = bernoulli.rvs(p=.50, size=5)
coin_flips_sum = sum(five_coin_flips)
print(coin_flips_sum)
Using binom to flip even more coins
Previously, you simulated 10 coin flips with a 35% chance of getting heads using bernoulli.rvs().
This exercise loads the binom object from scipy.stats so you can use binom.rvs() to simulate 20 trials of 10 coin flips with a 35% chance of getting heads on each coin flip.
Use the binom.rvs() function to simulate 20 trials of 10 coin flips with a 35% chance of getting heads.
# Import the binom object from scipy.stats
from scipy.stats import binom
# Set the random seed to reproduce the results
np.random.seed(42)
# Simulate 20 trials of 10 coin flips
draws = binom.rvs(n=10, p=.35, size=20)
print(draws)
Predicting the probability of defects
Any situation with exactly two possible outcomes can be modeled with binomial random variables. For example, you could model if someone likes or dislikes a product, or if they voted or not.
Let’s model whether or not a component from a supplier comes with a defect. From the thousands of components that we got from a supplier, we are going to take a sample of 50, selected randomly. The agreed and accepted defect rate is 2%.
We import the binom object from scipy.stats.
Recall that:
binom.pmf() calculates the probability of having exactly k heads out of n coin flips.
binom.cdf() calculates the probability of having k heads or less out of n coin flips.
binom.sf() calculates the probability of having more than k heads out of n coin flips.
Question
Let’s answer a simple question before we start calculating probabilities:
What is the probability of getting more than 20 heads from a fair coin after 30 coin flips?
Possible Answers
· binom.pmf(k=20, n=30, p=0.5)
· 1 - binom.pmf(k=20, n=30, p=0.5)
· binom.sf(k=20, n=30, p=0.5) (Answer)
· binom.cdf(k=20, n=30, p=0.5)
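A quick way to convince yourself of the answer is to compute all three functions for the same k and see how they relate. The sketch below only uses the values from the question (n=30, p=0.5, k=20); the variable names are my own.

```python
from scipy.stats import binom

n, p, k = 30, 0.5, 20  # values taken from the question above

prob_exactly_k = binom.pmf(k=k, n=n, p=p)    # P(X == 20)
prob_at_most_k = binom.cdf(k=k, n=n, p=p)    # P(X <= 20)
prob_more_than_k = binom.sf(k=k, n=n, p=p)   # P(X > 20)

# cdf and sf split the outcomes at k, so together they cover everything
print(prob_at_most_k + prob_more_than_k)

# cdf(k) is the running sum of the pmf from 0 up to and including k
pmf_running_sum = sum(binom.pmf(k=i, n=n, p=p) for i in range(k + 1))
print(pmf_running_sum, prob_at_most_k)

# "More than 20 heads" is therefore sf(20), not 1 - pmf(20)
print(prob_more_than_k, 1 - prob_exactly_k)
```

Note that 1 - binom.pmf(k=20, ...) is the probability of getting anything other than exactly 20 heads, which is why it is not the right choice here.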
# Probability of getting exactly 1 defective component
prob_one_defect = binom.pmf(k=1, n=50, p=0.02)
print(prob_one_defect)
# Probability of not getting any defective components
prob_no_defects = binom.pmf(k=0, n=50, p=0.02)
print(prob_no_defects)
# Probability of getting 2 or less defective components
prob_two_or_less_defects = binom.cdf(k=2, n=50, p=0.02)
print(prob_two_or_less_defects)
Predicting employment status
Consider a survey about employment that contains the question “Are you employed?” It is known that 65% of respondents will answer “yes.” Eight survey responses have been collected.
We load the binom object from scipy.stats with the following code: from scipy.stats import binom
Answer the following questions using pmf(), cdf(), and sf().
Calculate the probability of getting exactly 5 yes responses.
# Calculate the probability of getting exactly 5 yes responses
prob_five_yes = binom.pmf(k=5, n=8, p=0.65)
print(prob_five_yes)
Calculate the probability of getting 3 or fewer no responses.
# A "no" has probability 1 - 0.65 = 0.35, so count the no responses directly
prob_three_or_less_no = binom.cdf(k=3, n=8, p=0.35)
print(prob_three_or_less_no)
Calculate the probability of getting more than 3 yes responses using binom.sf().
prob_more_than_three_yes = binom.sf(k=3, n=8, p=0.65)
print(prob_more_than_three_yes)
Predicting burglary conviction rate
four_solved = binom.pmf(k=4, n=9, p=0.20)
print(four_solved)
more_than_three_solved = binom.sf(k=3, n=9, p=0.2)
print(more_than_three_solved)
two_or_three_solved = binom.pmf(k=2, n=9, p=0.2) + binom.pmf(k=3, n=9, p=0.2)
print(two_or_three_solved)
tail_probabilities = binom.cdf(k=1, n=9, p=0.2) + binom.sf(k=7, n=9, p=0.2)
print(tail_probabilities)
How do you calculate the expected value and the variance from a binomial distribution with parameters n=10 and p=0.25?
Possible Answers
binom.stats(n=10, p=0.25) (Answer)
binom.pmf(k=5, n=10, p=0.25)
describe(binom.rvs(n=10, p=0.25, size=100)).mean
describe(binom.rvs(n=10, p=0.25, size=100)).variance
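To see why binom.stats() is the right call, here is a small sketch comparing it with the textbook formulas for a binomial distribution, mean = n*p and variance = n*p*(1-p). The variable names are my own.

```python
from scipy.stats import binom

n, p = 10, 0.25

# binom.stats() returns the theoretical mean and variance by default
mean, variance = binom.stats(n=n, p=p)
print(float(mean), float(variance))

# The same values from the closed-form formulas
formula_mean = n * p
formula_variance = n * p * (1 - p)
print(formula_mean, formula_variance)
```

The describe(...) options only estimate these values from a random sample of 100 draws, so they change from run to run and are close to, but not equal to, the theoretical numbers.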
Calculating the sample mean
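# describe() returns summary statistics (mean, variance, ...) of a sample
from scipy.stats import describe
sample_of_100_flips = binom.rvs(n=1, p=0.5, size=100)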
sample_mean_100_flips = describe(sample_of_100_flips).mean
print(sample_mean_100_flips)
sample_mean_1000_flips = describe(binom.rvs(n=1, p=0.5, size=1000)).mean
sample_mean_2000_flips = describe(binom.rvs(n=1, p=0.5, size=2000)).mean
print(sample_mean_2000_flips)
Checking the result
sample = binom.rvs(n=10, p=0.3, size=2000)
sample_describe = describe(sample)
mean = 10*0.3
variance = mean*(1-0.3)
binom_stats = binom.stats(n=10, p=0.3)
print(sample_describe.mean, sample_describe.variance, mean, variance, binom_stats)
Calculating the mean and variance of a sample
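# Lists that collect the mean and variance of each small sample
averages = []
variances = []
for i in range(0, 1500):
    # Draw a sample of 10 values and record its mean and variance
    sample = binom.rvs(n=10, p=0.25, size=10)
    averages.append(describe(sample).mean)
    variances.append(describe(sample).variance)
print("Mean {}".format(describe(averages).mean))
print("Variance {}".format(describe(variances).mean))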
print(binom.stats(n=10, p=0.25))
Chapter-2
Any overlap?
When you calculate probabilities for multiple events the most important thing to notice is the relation between the events, in particular, if there is any overlap between any two events.
· What would be the formula to calculate the probability of A or B, given that A and B are not mutually exclusive?
Possible Answers
· P(A and B) = P(A) + P(B)
· P(A or B) = P(A) - P(B) + P(A and B)
· P(A or B) = P(A)P(B) - P(A and B)
· P(A or B) = P(A) + P(B) - P(A and B) (Answer)
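The addition rule can be checked by brute force on a small sample space. The sketch below builds a standard 52-card deck, counts the cards that are a Jack or a spade directly, and compares that with P(A) + P(B) - P(A and B). The helper prob() and the deck representation are my own illustration, not part of the course code.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs


def prob(event):
    """Probability of an event as favourable cards / total cards."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))


def is_jack(card):
    return card[0] == "J"


def is_spade(card):
    return card[1] == "spades"


# Count "Jack or spade" directly
p_or_direct = prob(lambda c: is_jack(c) or is_spade(c))

# Addition rule: subtract the overlap (the Jack of spades) once
p_and = prob(lambda c: is_jack(c) and is_spade(c))
p_or_formula = prob(is_jack) + prob(is_spade) - p_and

print(p_or_direct, p_or_formula)
```

Without the subtraction, the Jack of spades would be counted twice, once as a Jack and once as a spade.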
Measuring a sample
# Count how many times you got 2 heads from the sample data
count_2_heads = find_repeats(sample_of_two_coin_flips).counts[2]
# Divide the number of heads by the total number of draws
prob_2_heads = count_2_heads / 1000
# Display the result
print(prob_2_heads)
# Get the relative frequency from sample_of_two_coin_flips
# Set numbins as 3
# Extract frequency
rel_freq = relfreq(sample_of_two_coin_flips, numbins=3).frequency
print(rel_freq)
# Probability of getting 0, 1, or 2 from the distribution
probabilities = binom.pmf([0,1,2], n=2, p=0.5)
print(probabilities)
Joint probabilities
# Individual probabilities
P_Eng_works = 0.99
P_GearB_works = 0.995
# Joint probability calculation
P_both_works = P_Eng_works * P_GearB_works
print(P_both_works)
# Individual probabilities
P_Eng_fails = 0.01
P_Eng_works = 0.99
P_GearB_fails = 0.005
P_GearB_works = 0.995
# Joint probability calculation
P_only_GearB_fails = P_GearB_fails * P_Eng_works
P_only_Eng_fails = P_Eng_fails * P_GearB_works
# Calculate result
P_one_fails = P_only_GearB_fails + P_only_Eng_fails
print(P_one_fails)
# Individual probabilities
P_Eng_fails = 0.01
P_Eng_works = 0.99
P_GearB_fails = 0.005
P_GearB_works = 0.995
# Joint probability calculation
P_EngW_GearBW = P_Eng_works * P_GearB_works
P_EngF_GearBF = P_Eng_fails * P_GearB_fails
# Calculate result
P_fails_or_works = P_EngW_GearBW + P_EngF_GearBF
print(P_fails_or_works)
Deck of cards
# Ace probability
P_Ace = 4/52
# Not Ace probability
P_not_Ace = 1 - P_Ace
print(P_not_Ace)
# Figure probabilities
P_Hearts = 13/52
P_Diamonds = 13/52
# Probability of red calculation
P_Red = P_Hearts + P_Diamonds
print(P_Red)
# Figure probabilities
P_Jack = 4/52
P_Spade = 13/52
# Joint probability
P_Jack_n_Spade = 1/52
# Probability of Jack or spade
P_Jack_or_Spade = P_Jack + P_Spade - P_Jack_n_Spade
print(P_Jack_or_Spade)
# Figure probabilities
P_King = 4/52
P_Queen = 4/52
# Joint probability
P_King_n_Queen = 0
# Probability of King or Queen
P_King_or_Queen = P_King + P_Queen - P_King_n_Queen
print(P_King_or_Queen)
Delayed flights
# Needed quantities
On_time = 241
Total_departures = 276
# Probability calculation
P_On_time = On_time / Total_departures
print(P_On_time)
# Needed quantities
P_On_time = 241 / 276
# Probability calculation
P_Delayed = 1 - P_On_time
print(P_Delayed)
# Needed quantities
Delayed_on_Tuesday = 24
On_Tuesday = 138
# Probability calculation
P_Delayed_g_Tuesday = Delayed_on_Tuesday / On_Tuesday
print(P_Delayed_g_Tuesday)
# Needed quantities
Delayed_on_Friday = 11
On_Friday = 138
# Probability calculation
P_Delayed_g_Friday = Delayed_on_Friday / On_Friday
print(P_Delayed_g_Friday)
Contingency table
# Individual probabilities
P_Red = 26/52
P_Red_n_Ace = 2/52
# Conditional probability calculation
P_Ace_given_Red = P_Red_n_Ace / P_Red
print(P_Ace_given_Red)
# Individual probabilities
P_Ace = 4/52
P_Ace_n_Black = 2/52
# Conditional probability calculation
P_Black_given_Ace = P_Ace_n_Black / P_Ace
print(P_Black_given_Ace)
# Individual probabilities
P_Black = 26/52
P_Black_n_Non_ace = 24/52
# Conditional probability calculation
P_Non_ace_given_Black = P_Black_n_Non_ace / P_Black
print(P_Non_ace_given_Black)
# Individual probabilities
P_Non_ace = 48/52
P_Non_ace_n_Red = 24/52
# Conditional probability calculation
P_Red_given_Non_ace = P_Non_ace_n_Red / P_Non_ace
print(P_Red_given_Non_ace)
More cards
P_first_Jack = 4/52
P_Jack_given_Jack = 3/51
# Joint probability calculation
P_two_Jacks = P_first_Jack * P_Jack_given_Jack
print(P_two_Jacks)
# Needed probabilities
P_Spade = 13/52
P_Spade_n_Ace = 1/52
# Conditional probability calculation
P_Ace_given_Spade = P_Spade_n_Ace / P_Spade
print(P_Ace_given_Spade)
# Needed probabilities
P_Face_card = 12/52
P_Face_card_n_Queen = 4/52
# Conditional probability calculation
P_Queen_given_Face_card = P_Face_card_n_Queen / P_Face_card
print(P_Queen_given_Face_card)
Formula 1 engines
# Needed probabilities
P_A = 0.7
P_last5000_g_A = 0.99
P_B = 0.3
P_last5000_g_B = 0.95
# Total probability calculation
P_last_5000 = P_A * P_last5000_g_A + P_B * P_last5000_g_B
print(P_last_5000)
Voters
# Individual probabilities
P_X = 0.43
# Conditional probabilities
P_Support_g_X = 0.53
# Total probability calculation
P_X_n_Support = P_X * P_Support_g_X
print(P_X_n_Support)
# Individual probabilities
P_Z = 0.32
# Conditional probabilities
P_Support_g_Z = 0.32
P_NoSupport_g_Z = 1 - P_Support_g_Z
# Total probability calculation
P_Z_n_NoSupport = P_Z * P_NoSupport_g_Z
print(P_Z_n_NoSupport)
# Individual probabilities
P_X = 0.43
P_Y = 0.25
P_Z = 0.32
# Conditional probabilities
P_Support_g_X = 0.53
P_Support_g_Y = 0.67
P_Support_g_Z = 0.32
# Total probability calculation
P_Support = P_X * P_Support_g_X + P_Y * P_Support_g_Y + P_Z * P_Support_g_Z
print(P_Support)
Conditioning
In many situations we find events that depend on other events, and we are interested in calculating probabilities taking into consideration such relations.
On the other hand, many problems can be studied by classifying the elements in our sample space. It is easier to work with classifications that do not have elements in common — i.e., that do not overlap. Using this context, please answer the following question:
· Why is Bayes’ rule important?
Possible Answers
· It allows you to calculate conditional probabilities.
· It allows you to calculate conditional probabilities for events that can be partitioned into non-overlapping parts. (Answer)
· It allows you to calculate joint probabilities.
· It was discovered by the Presbyterian minister Thomas Bayes.
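Here is a minimal sketch of Bayes' rule over a partition: the denominator is the total probability of the evidence, summed over the non-overlapping parts. The function name bayes_posteriors and the dictionary layout are my own; the numbers mirror the factories exercise further down (three vendors with different damage rates).

```python
def bayes_posteriors(priors, likelihoods):
    """Return P(part | evidence) for every part of a partition.

    priors:      P(part) for each non-overlapping part (should sum to 1)
    likelihoods: P(evidence | part) for each part
    """
    # Law of total probability: P(evidence) = sum of P(part) * P(evidence | part)
    evidence = sum(priors[part] * likelihoods[part] for part in priors)
    # Bayes' rule for each part
    return {part: priors[part] * likelihoods[part] / evidence for part in priors}


priors = {"V1": 0.5, "V2": 0.25, "V3": 0.25}
likelihoods = {"V1": 0.01, "V2": 0.02, "V3": 0.03}

posteriors = bayes_posteriors(priors, likelihoods)
for part, value in posteriors.items():
    print(part, round(value, 4))
```

Because the parts do not overlap and cover every case, the posteriors always sum to 1, which is a handy sanity check.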
Factories and parts
# Individual probabilities & conditional probabilities
P_V1 = 0.5
P_V2 = 0.25
P_V3 = 0.25
P_D_g_V1 = 0.01
P_D_g_V2 = 0.02
P_D_g_V3 = 0.03
# Probability of Damaged
P_Damaged = (P_V1 * P_D_g_V1) + (P_V2 * P_D_g_V2) + (P_V3 * P_D_g_V3)
# Bayes’ rule for P(V1|D)
P_V1_g_D = (P_V1 * P_D_g_V1) / P_Damaged
print(P_V1_g_D)
# Individual probabilities & conditional probabilities
P_V1 = 0.5
P_V2 = 0.25
P_V3 = 0.25
P_D_g_V1 = 0.01
P_D_g_V2 = 0.02
P_D_g_V3 = 0.03
# Probability of Damaged
P_Damaged = (P_V1 * P_D_g_V1) + (P_V2 * P_D_g_V2) + (P_V3 * P_D_g_V3)
# Bayes’ rule for P(V2|D)
P_V2_g_D = (P_V2 * P_D_g_V2) / P_Damaged
print(P_V2_g_D)
# Individual probabilities & conditional probabilities
P_V1 = 0.5
P_V2 = 0.25
P_V3 = 0.25
P_D_g_V1 = 0.01
P_D_g_V2 = 0.02
P_D_g_V3 = 0.03
# Probability of Damaged
P_Damaged = (P_V1 * P_D_g_V1) + (P_V2 * P_D_g_V2) + (P_V3 * P_D_g_V3)
# Bayes’ rule for P(V3|D)
P_V3_g_D = (P_V3 * P_D_g_V3) / P_Damaged
print(P_V3_g_D)
Swine flu blood test
# Probability of having Swine_flu
P_Swine_flu = 1./9000
# Probability of not having Swine_flu
P_no_Swine_flu = 1 - P_Swine_flu
# Probability of being positive given that you have Swine_flu
P_Positive_g_Swine_flu = 1
# Probability of being positive given that you do not have Swine_flu
P_Positive_g_no_Swine_flu = 0.01
# Probability of Positive
P_Positive = (P_Swine_flu * P_Positive_g_Swine_flu) + (P_no_Swine_flu * P_Positive_g_no_Swine_flu)
# Bayes’ rule for P(Swine_flu|Positive)
P_Swine_flu_g_Positive = (P_Swine_flu * P_Positive_g_Swine_flu) / P_Positive
print(P_Swine_flu_g_Positive)
# Individual probabilities & conditional probabilities
P_Swine_flu = 1./350
P_no_Swine_flu = 1 - P_Swine_flu
P_Positive_g_Swine_flu = 1
P_Positive_g_no_Swine_flu = 0.01
# Probability of Positive
P_Positive = (P_Swine_flu * P_Positive_g_Swine_flu) + (P_no_Swine_flu * P_Positive_g_no_Swine_flu)
# Bayes’ rule for P(Swine_flu|Positive)
P_Swine_flu_g_Positive = (P_Swine_flu * P_Positive_g_Swine_flu) / P_Positive
print(P_Swine_flu_g_Positive)
# Individual probabilities & conditional probabilities
P_Swine_flu = 1./350
P_no_Swine_flu = 1 - P_Swine_flu
P_Positive_g_Swine_flu = 1
P_Positive_g_no_Swine_flu = 0.02
# Probability of Positive
P_Positive = P_Swine_flu * P_Positive_g_Swine_flu + P_no_Swine_flu * P_Positive_g_no_Swine_flu
# Bayes’ rule for P(Swine_flu|Positive)
P_Swine_flu_g_Positive = (P_Swine_flu * P_Positive_g_Swine_flu) / P_Positive
print(P_Swine_flu_g_Positive)
Chapter-3
Range of values
Suppose the scores on a given academic test are normally distributed, with a mean of 65 and standard deviation of 10.
What would be the range of scores two standard deviations from the mean?
Possible Answers
· 10 and 65
· 55 and 75
· 45 and 85 (Answer)
· 35 and 95
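The arithmetic behind the answer is just mean plus or minus two standard deviations. A tiny sketch, using only the numbers from the question (variable names are my own):

```python
mean, std = 65, 10

# Two standard deviations below and above the mean
lower = mean - 2 * std
upper = mean + 2 * std
print(lower, upper)  # 45 85

# Converting the endpoints back to z-scores confirms they sit at -2 and +2
z_lower = (lower - mean) / std
z_upper = (upper - mean) / std
print(z_lower, z_upper)
```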
Plotting normal distributions
# Import norm, matplotlib.pyplot, and seaborn
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn as sns
# Create the sample using norm.rvs()
sample = norm.rvs(loc=3.15, scale=1.5, size=10000, random_state=13)
# Plot the sample
sns.distplot(sample)
plt.show()
Within three standard deviations
The heights of every employee in a company have been measured, and they are distributed normally with a mean of 168 cm and a standard deviation of 12 cm.
· What is the probability of getting a height within three standard deviations of the mean?
Possible Answers
· 68%
· 95%
· 99.7% (Answer)
· 30%
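The 68%, 95%, and 99.7% options come from the so-called empirical rule for normal distributions. The sketch below recomputes the three values with norm.cdf() for the heights in the question (mean 168 cm, standard deviation 12 cm); the dictionary name within is my own.

```python
from scipy.stats import norm

mean, std = 168, 12

# Probability of falling within k standard deviations of the mean
within = {}
for k in (1, 2, 3):
    lower, upper = mean - k * std, mean + k * std
    within[k] = (norm.cdf(upper, loc=mean, scale=std)
                 - norm.cdf(lower, loc=mean, scale=std))
    print(k, lower, upper, round(within[k], 4))
```

The result does not depend on the particular mean and standard deviation: any normal distribution gives the same three probabilities.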
Restaurant spending example
# Probability of spending $3 or less
spending = norm.cdf(3, loc=3.15, scale=1.5)
print(spending)
# Probability of spending more than $5
spending = norm.sf(5, loc=3.15, scale=1.5)
print(spending)
# Probability of spending more than $2.15 and $4.15 or less
spending_4 = norm.cdf(4.15, loc=3.15, scale=1.5)
spending_2 = norm.cdf(2.15, loc=3.15, scale=1.5)
print(spending_4 - spending_2)
# Probability of spending $2.15 or less or more than $4.15
spending_2 = norm.cdf(2.15, loc=3.15, scale=1.5)
spending_over_4 = norm.sf(4.15, loc=3.15, scale=1.5)
print(spending_2 + spending_over_4)
Smartphone battery example
# Probability that battery will last less than 3 hours
less_than_3h = norm.cdf(3, loc=5, scale=1.5)
print(less_than_3h)
# Probability that battery will last more than 3 hours
more_than_3h = norm.sf(3, loc=5, scale=1.5)
print(more_than_3h)
# Probability that battery will last between 5 and 7 hours
P_less_than_7h = norm.cdf(7, loc=5, scale=1.5)
P_less_than_5h = norm.cdf(5, loc=5, scale=1.5)
print(P_less_than_7h - P_less_than_5h)
Adults’ heights example
# Values one standard deviation from mean height for females
interval = norm.interval(0.68, loc=65, scale=3.5)
print(interval)
# Value where the tallest males fall with 0.01 probability
tallest = norm.ppf(0.99, loc=70, scale=4)
print(tallest)
# Probability of being taller than 73 inches for males and females
P_taller_male = norm.sf(73, loc=70, scale=4)
P_taller_female = norm.sf(73, loc=65, scale=3.5)
print(P_taller_male, P_taller_female)
# Probability of being shorter than 61 inches for males and females
P_shorter_male = norm.cdf(61, loc=70, scale=4)
P_shorter_female = norm.cdf(61, loc=65, scale=3.5)
print(P_shorter_male, P_shorter_female)
ATM example
# Import poisson from scipy.stats
from scipy.stats import poisson
# Probability of more than 1 customer
probability = poisson.sf(k=1, mu=1)
# Print the result
print(probability)
Highway accidents example
# Import the poisson object
from scipy.stats import poisson
# Probability of 5 accidents any day
P_five_accidents = poisson.pmf(k=5, mu=2)
# Print the result
print(P_five_accidents)
# Import the poisson object
from scipy.stats import poisson
# Probability of having 4 or 5 accidents on any day
P_less_than_6 = poisson.cdf(k=5, mu=2)
P_less_than_4 = poisson.cdf(k=3, mu=2)
# Print the result
print(P_less_than_6 - P_less_than_4)
# Import the poisson object
from scipy.stats import poisson
# Probability of more than 3 accidents any day
P_more_than_3 = poisson.sf(k=3, mu=2)
# Print the result
print(P_more_than_3)
# Import the poisson object
from scipy.stats import poisson
# Number of accidents with 0.75 probability
accidents = poisson.ppf(q=0.75, mu=2)
# Print the result
print(accidents)
Generating and plotting Poisson distributions
# Import poisson, matplotlib.pyplot, and seaborn
from scipy.stats import poisson
import matplotlib.pyplot as plt
import seaborn as sns
# Create the sample
sample = poisson.rvs(mu=2, size=10000, random_state=13)
# Plot the sample
sns.distplot(sample, kde=False)
plt.show()
Catching salmon example
# Import geom from scipy.stats
from scipy.stats import geom
# Getting a salmon on the third attempt
probability = geom.pmf(k=3, p=0.0333)
# Print the result
print(probability)
# Probability of getting a salmon in less than 5 attempts
probability = geom.cdf(k=4, p=0.0333)
# Print the result
print(probability)
# Probability of getting a salmon in less than 21 attempts
probability = geom.cdf(k=20, p=0.0333)
# Print the result
print(probability)
# Attempts for 0.9 probability of catching a salmon
attempts = geom.ppf(q=0.9, p=0.0333)
# Print the result
print(attempts)
Free throws example
# Import geom from scipy.stats
from scipy.stats import geom
# Probability of missing first and scoring on second throw
probability = geom.pmf(k=2, p=0.3)
# Print the result
print(probability)
Generating and plotting geometric distributions
# Import geom, matplotlib.pyplot, and seaborn
from scipy.stats import geom
import matplotlib.pyplot as plt
import seaborn as sns
# Create the sample
sample = geom.rvs(p=0.3, size=10000, random_state=13)
# Plot the sample
sns.distplot(sample, bins = np.linspace(0,20,21), kde=False)
plt.show()
Chapter-4
Generating a sample
# Import the binom object
from scipy.stats import binom
# Generate a sample of 250 newborn children
sample = binom.rvs(n=1, p=0.505, size=250, random_state=42)
# Show the sample values
print(sample)
Calculating the sample mean
# Print the sample mean of the first 10 samples
print(describe(sample[0:10]).mean)
# Print the sample mean of the first 50 samples
print(describe(sample[0:50]).mean)
# Print the sample mean of the first 250 samples
print(describe(sample[0:250]).mean)
Plotting the sample mean
# Calculate sample mean and store it on averages array
averages = []
for i in range(2, 251):
    averages.append(describe(sample[0:i]).mean)
# Calculate sample mean and store it on averages array
averages = []
for i in range(2, 251):
    averages.append(describe(sample[0:i]).mean)
# Add population mean line and sample mean plot
plt.axhline(binom.mean(n=1, p=0.505), color='red')
plt.plot(averages, '-')
# Calculate sample mean and store it on averages array
averages = []
for i in range(2, 251):
    averages.append(describe(sample[0:i]).mean)
# Add population mean line and sample mean plot
plt.axhline(binom.mean(n=1, p=0.505), color='red')
plt.plot(averages, '-')
# Add legend
plt.legend(("Population mean", "Sample mean"), loc='upper right')
plt.show()
Sample means
# Create list for sample means
sample_means = []
for _ in range(1500):
    # Take 20 values from the population
    sample = np.random.choice(population, 20)
# Create list for sample means
sample_means = []
for _ in range(1500):
    # Take 20 values from the population
    sample = np.random.choice(population, 20)
    # Calculate the sample mean
    sample_means.append(describe(sample).mean)
# Create list for sample means
sample_means = []
for _ in range(1500):
    # Take 20 values from the population
    sample = np.random.choice(population, 20)
    # Calculate the sample mean
    sample_means.append(describe(sample).mean)
# Plot the histogram
plt.hist(sample_means)
plt.xlabel("Sample mean values")
plt.ylabel("Frequency")
plt.show()
Question
Inspecting the plot, what is the distribution of the sample mean?
Possible Answers
· Same as the generated sample
· Binomial
· Normal (Answer)
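Besides eyeballing the histogram, the central limit theorem also predicts the center and spread of the sample means: they should cluster around the population mean with a standard deviation close to the population standard deviation divided by the square root of the sample size. The sketch below checks this numerically. It assumes a skewed geometric population generated with NumPy's default_rng, which is my own choice and not necessarily the population used in the exercise.

```python
import numpy as np

rng = np.random.default_rng(42)

# A clearly non-normal (right-skewed) population, assumed for illustration
population = rng.geometric(p=0.5, size=1000)

sample_size = 20
# Draw 1500 samples (with replacement) and keep the mean of each one
sample_means = np.array(
    [rng.choice(population, sample_size).mean() for _ in range(1500)]
)

# Center: sample means gather around the population mean
print(population.mean(), sample_means.mean())

# Spread: roughly population std / sqrt(sample size)
expected_spread = population.std() / np.sqrt(sample_size)
print(expected_spread, sample_means.std())
```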
Sample means follow a normal distribution
# Generate the population
population = geom.rvs(p=0.5, size=1000)
# Create list for sample means
sample_means = []
for _ in range(3000):
    # Take 20 values from the population
    sample = np.random.choice(population, 20)
    # Calculate the sample mean
    sample_means.append(describe(sample).mean)
# Plot the histogram
plt.hist(sample_means)
plt.show()
# Generate the population
population = poisson.rvs(mu=2, size=1000)
# Create list for sample means
sample_means = []
for _ in range(1500):
    # Take 20 values from the population
    sample = np.random.choice(population, 20)
    # Calculate the sample mean
    sample_means.append(describe(sample).mean)
# Plot the histogram
plt.hist(sample_means)
plt.show()
Adding dice rolls
# Configure random generator
np.random.seed(42)
# Generate the sample
sample1 = roll_dice(2000)
# Plot the sample
plt.hist(sample1, bins=range(1, 8), width=0.9)
plt.show()
# Configure random generator
np.random.seed(42)
# Generate two samples of 2000 dice rolls
sample1 = roll_dice(2000)
sample2 = roll_dice(2000)
# Add the first two samples
sum_of_1_and_2 = np.add(sample1, sample2)
# Plot the sum
plt.hist(sum_of_1_and_2, bins=range(2, 14), width=0.9)
plt.show()
# Configure random generator
np.random.seed(42)
# Generate the samples
sample1 = roll_dice(2000)
sample2 = roll_dice(2000)
sample3 = roll_dice(2000)
# Add the first two samples
sum_of_1_and_2 = np.add(sample1, sample2)
# Add the first two with the third sample
sum_of_3_samples = np.add(sum_of_1_and_2, sample3)
# Plot the result
plt.hist(sum_of_3_samples, bins=range(3, 20), width=0.9)
plt.show()
Fitting a model
# Import the linregress() function
from scipy.stats import linregress
# Get the model parameters
slope, intercept, r_value, p_value, std_err = linregress(hours_of_study, scores)
# Print the linear model parameters
print('slope:', slope)
print('intercept:', intercept)
Predicting test scores
# Get the predicted test score for given hours of study
score = slope*10 + intercept
print('score:', score)
# Get the predicted test score for given hours of study
score = slope*9 + intercept
print('score:', score)
# Get the predicted test score for given hours of study
score = slope*12 + intercept
print('score:', score)
Studying residuals
# Scatterplot of hours of study and test scores
plt.scatter(hours_of_study_A, test_scores_A)
# Plot of hours_of_study_values_A and predicted values
plt.plot(hours_of_study_values_A, model_A.predict(hours_of_study_values_A))
plt.title(“Model A”, fontsize=25)
plt.show()
# Calculate the residuals
residuals_A = model_A.predict(hours_of_study_A) - test_scores_A
# Make a scatterplot of residuals of model_A
plt.scatter(hours_of_study_A, residuals_A)
# Add reference line and title and show plot
plt.hlines(0, 0, 30, colors='r', linestyles='--')
plt.title(“Residuals plot of Model A”, fontsize=25)
plt.show()
# Scatterplot of hours of study and test scores
plt.scatter(hours_of_study_B, test_scores_B)
# Plot of hours_of_study_values_B and predicted values
plt.plot(hours_of_study_values_B, model_B.predict(hours_of_study_values_B))
plt.title(“Model B”, fontsize=25)
plt.show()
# Calculate the residuals
residuals_B = model_B.predict(hours_of_study_B) - test_scores_B
# Make a scatterplot of residuals of model_B
plt.scatter(hours_of_study_B, residuals_B)
# Add reference line and title and show plot
plt.hlines(0, 0, 30, colors='r', linestyles='--')
plt.title(“Residuals plot of Model B”, fontsize=25)
plt.show()
Fitting a logistic model
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression
# sklearn logistic model
model = LogisticRegression(C=1e9)
model.fit(hours_of_study, outcomes)
# Get parameters
beta1 = model.coef_[0][0]
beta0 = model.intercept_[0]
# Print parameters
print(beta1, beta0)
Predicting if students will pass
# Specify values to predict
hours_of_study_test = [[10], [11], [12], [13], [14]]
# Pass values to predict
predicted_outcomes = model.predict(hours_of_study_test)
print(predicted_outcomes)
# Set value in array
value = np.asarray(11).reshape(-1,1)
# Probability of passing the test with 11 hours of study
print("Probability of passing test ", model.predict_proba(value)[:,1])
Passing two tests
# Specify values to predict
hours_of_study_test_A = [[6], [7], [8], [9], [10]]
# Pass values to predict
predicted_outcomes_A = model_A.predict(hours_of_study_test_A)
print(predicted_outcomes_A)
# Specify values to predict
hours_of_study_test_B = [[3], [4], [5], [6]]
# Pass values to predict
predicted_outcomes_B = model_B.predict(hours_of_study_test_B)
print(predicted_outcomes_B)
# Set value in array
value_A = np.asarray(8.6).reshape(-1,1)
# Probability of passing test A with 8.6 hours of study
print("The probability of passing test A with 8.6 hours of study is ", model_A.predict_proba(value_A)[:,1])
# Set value in array
value_B = np.asarray(4.7).reshape(-1,1)
# Probability of passing test B with 4.7 hours of study
print("The probability of passing test B with 4.7 hours of study is ", model_B.predict_proba(value_B)[:,1])
# Print the hours required to have 0.5 probability on model_A
print("Minimum hours of study for test A are ", -model_A.intercept_/model_A.coef_)
# Print the hours required to have 0.5 probability on model_B
print("Minimum hours of study for test B are ", -model_B.intercept_/model_B.coef_)
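Why does -intercept/coef give the hours needed for a 0.5 probability? A logistic model predicts p = 1 / (1 + exp(-(beta0 + beta1 * hours))), and p equals 0.5 exactly when beta0 + beta1 * hours = 0. The sketch below illustrates this with made-up coefficients; beta0 and beta1 here are hypothetical values, not the fitted parameters of model_A or model_B.

```python
import math

# Hypothetical coefficients, chosen only for illustration
beta0, beta1 = -10.0, 1.25


def pass_probability(hours):
    """Logistic curve: probability of passing after the given study hours."""
    return 1 / (1 + math.exp(-(beta0 + beta1 * hours)))


# The linear part beta0 + beta1 * hours is zero at this point
threshold = -beta0 / beta1
print(threshold, pass_probability(threshold))

# With a positive slope, studying longer than the threshold pushes p above 0.5
print(pass_probability(threshold + 1), pass_probability(threshold - 1))
```

This is also why scikit-learn's predict() switches from 0 to 1 around that number of hours: by default it picks the class whose predicted probability is larger.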
# Probability calculation for each value of study_hours
prob_passing_A = model_A.predict_proba(study_hours_A.reshape(-1,1))[:,1]
prob_passing_B = model_B.predict_proba(study_hours_B.reshape(-1,1))[:,1]
# Calculate the probability of passing both tests
prob_passing_A_and_B = prob_passing_A * prob_passing_B
# Maximum probability value
max_prob = max(prob_passing_A_and_B)
# Position where we get the maximum value
max_position = np.where(prob_passing_A_and_B == max_prob)[0][0]
# Study hours for each test
print("Study {:1.0f} hours for the first and {:1.0f} hours for the second test and you will pass both tests with {:01.2f} probability.".format(study_hours_A[max_position], study_hours_B[max_position], max_prob))