Monthly Archives: September 2018

Module #5 Random Variables & Probability Distributions

This week we looked at random variables and probability distributions.

The first assignment was to take two sets of values with their probabilities and determine the standard deviation of each.

The R code I used for the first set is:

t1 <- data.frame(x=c(0, 1, 2, 3), p=c(0.5, 0.2, 0.15, 0.10))
t1mu <- sum(t1$x * t1$p)              # expected value E[X]
t1var <- sum((t1$x - t1mu)^2 * t1$p)  # variance E[(X - mu)^2]
t1sd <- sqrt(t1var)                   # standard deviation
t1sd

The standard deviation was reported as 1.013903.

The code for the second set was:

t2 <- data.frame(x=c(1, 3, 5, 4), p=c(0.10, 0.2, 0.6, 0.2))
t2mu <- sum(t2$x * t2$p)
t2var <- sum((t2$x - t2mu)^2 * t2$p)
t2sd <- sqrt(t2var)
t2sd

The standard deviation for the second set was 1.369306.

The standard deviation was higher for the second set, probably because the second set attached a high probability to its largest value and also covered a wider range of values.
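
Since both sets go through the same mean/variance steps, the calculation can be wrapped in a small helper (a sketch; discrete_sd is my own name, not part of the assignment):

```r
# Standard deviation of a discrete random variable,
# given its values x and their probabilities p
discrete_sd <- function(x, p) {
  mu <- sum(x * p)             # expected value E[X]
  sqrt(sum((x - mu)^2 * p))    # sqrt of the variance E[(X - mu)^2]
}

discrete_sd(c(0, 1, 2, 3), c(0.5, 0.2, 0.15, 0.10))  # 1.013903
discrete_sd(c(1, 3, 5, 4), c(0.10, 0.2, 0.6, 0.2))   # 1.369306
```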

The second part of the assignment called for finding P(X = 0) for a binomial distribution with 4 trials and a success probability of 0.12.

The code I used to solve the problem was:

dbinom(0, 4, 0.12)

The result for P(X = 0) was 0.5996954.
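
As a sanity check, P(X = 0) for a binomial is just (1 - p)^n, since every trial has to fail, so the dbinom result can be reproduced by hand:

```r
# P(X = 0) = (1 - p)^n: all 4 trials fail
(1 - 0.12)^4        # 0.5996954
dbinom(0, 4, 0.12)  # same value
```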

The final part of the assignment called for generating 20 Poisson random numbers and calculating the mean and the variance of the results. Lambda wasn’t provided, so I used 3. The code I ran was:

poisvec <- rpois(20, 3)
mean(poisvec)
var(poisvec)

I ran the code a few times to see how much it would vary. Both values should be around 3, because for a Poisson distribution the mean and the variance both equal lambda.

On the first run I got a mean of 2.8 and a variance of 3.22; on the second, a mean of 3.2 and a variance of 4.69; on the third, a mean of 3.05 and a variance of 2.16.

With only 20 values per run, the sample variance apparently has plenty of room to wander.
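
One way to see that spread more systematically would be to repeat the experiment many times and watch how the sample variance scatters around its target of 3 (a sketch using replicate(); the seed is arbitrary):

```r
set.seed(42)  # arbitrary seed, just for reproducibility
vars <- replicate(1000, var(rpois(20, 3)))
range(vars)  # individual runs can land well away from 3...
mean(vars)   # ...but the long-run average sits close to lambda
```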

Posted in Advanced Statistics and Analytics

Module #4 Probability Theory

For the first part of this week’s assignment, we were given a table of 90 events.

A1. Probability of Event A -> 30/90 -> 33.3%

A2. Probability of Event B -> 30/90 -> 33.3%

A3. Probability of Event A or B -> 50/90 -> 55.6%

A4. P(A or B) = P(A) + P(B) -> That equality doesn’t hold because of the events that include both A and B. A and B each have the same probability of occurring, but the total of their probabilities (66.7%) is higher than the probability of “A or B” (55.6%) because of the ten events where A and B are both true. The Addition Rule accounts for this overlap: P(A or B) = P(A) + P(B) – P(A and B). In other words, A and B are not mutually exclusive.
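
The Addition Rule arithmetic can be double-checked in R with the counts described above (30 events for A, 30 for B, 10 in both, 90 total):

```r
n_total <- 90
n_a <- 30; n_b <- 30; n_both <- 10

# Addition Rule: P(A or B) = P(A) + P(B) - P(A and B)
p_a_or_b <- n_a / n_total + n_b / n_total - n_both / n_total
p_a_or_b                       # 0.5555556, i.e. 50/90
n_a / n_total + n_b / n_total  # 0.6666667: overcounts the 10 shared events
```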

The second part concerns the probability that the weatherman is correct about whether it will rain on a wedding day in Arizona.

B1. The answer given is very nearly true. It is technically off at the third decimal place, but that is a rounding issue in the arithmetic rather than a flaw in the method presented, so I’d call it true to two decimal places.

B2. The formatting of the assignment looks like it leaves out a “/”, so for clarity I’ll write the formula to calculate P(A1|B) (using Bayes’ Theorem) on one line:

P(A1|B) = P(A1) * P(B|A1) / ( P(A1) * P(B|A1) + P(A2) * P(B|A2))

As the explanation in the assignment notes, that results in this equation:

P(A1|B) = 0.014 * 0.9 / (0.014 * 0.9 + 0.986 * 0.1)

By my calculations, rather than resulting in 0.111, the result of that calculation is 0.113.

Thus:

P(A1|B) = 0.113

And the probability of it raining on the wedding day is 11.3%.

I do agree that it feels like a strange result if one only looks at the probabilities for P(B|A1) and P(B|A2). Thinking about it further, though, the result follows from how rare rainfall is over the year. If the weatherman predicts rain on 10% of the days when it doesn’t rain, that works out to about 36 false rain predictions per year; given that it only rains 5 days out of the year, the weatherman’s accuracy rate on rain predictions truly is terrible.
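
The 0.111 vs. 0.113 discrepancy mentioned in B1 appears to come from rounding the prior: with P(A1) = 0.014 the formula gives 0.113, but with the unrounded prior of 5/365 rainy days it gives exactly 1/9, which matches the assignment’s 0.111 (bayes_rain is my own helper name):

```r
# P(rain | rain predicted) via Bayes' Theorem,
# with P(B|A1) = 0.9 and P(B|A2) = 0.1 as given
bayes_rain <- function(p_rain) {
  p_rain * 0.9 / (p_rain * 0.9 + (1 - p_rain) * 0.1)
}

bayes_rain(0.014)    # ~0.11331, which rounds to 0.113
bayes_rain(5 / 365)  # 0.1111111 -> the assignment's 0.111
```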

Posted in Uncategorized

Module #3 Bivariate Analysis

The assignment this week involved examining a data set of pre-boarding screener turnover and security violations detected at US airports between 1988 and 1999. The instructor selected 20 data points at random and asked us to check for correlations.

The first task was to describe the association between the screener data and violations data. Based on the number crunching in the questions that followed, it appears that there was a fairly strong positive correlation between screener turnover and the number of security violations detected.

The next tasks were to calculate Pearson’s sample correlation coefficient and Spearman’s rank coefficient, and then to create a scatterplot of the data. The code for those tasks follows (after importing the data from the assignment page into my RStudio workspace):

cor.test(mod3$screeners, mod3$violations)
cor.test(mod3$screeners, mod3$violations, method="spearman")
plot(mod3$screeners, mod3$violations)

Pearson’s coefficient: 0.8375321

Spearman’s coefficient (rho): 0.7575423
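
Spearman’s rho is simply Pearson’s correlation applied to the ranks, which is easy to verify on a small sample (made-up numbers here, since the assignment’s mod3 data isn’t reproduced in this post):

```r
x <- c(180, 250, 100, 310, 220)  # hypothetical screener turnover values
y <- c(12, 18, 5, 25, 10)        # hypothetical violation counts

cor(x, y, method = "spearman")   # built-in Spearman: 0.9
cor(rank(x), rank(y))            # Pearson on the ranks: also 0.9
```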

Scatterplot: (plot of screeners vs. violations; image not reproduced here)

Posted in Advanced Statistics and Analytics

Module #2 Descriptive Statistics and Introduction to Open Source R

It turns out I’m too late posting this to get credit, but I’ll post it anyway because the work is done. Note to self: remember to check due time as well as due date.

The assignment was to compute various descriptive statistics for two vectors and then compare them. My code:

x <- c(10, 2, 3, 2, 4, 2, 5)
y <- c(20, 12, 13, 12, 14, 12, 15)

# Took the function to find the mode from:
# https://stackoverflow.com/a/8189441
# (base R's mode() reports a value's storage type, like "numeric",
# not the statistical mode, hence the custom function)
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Coefficient of variation (CV): sd as a percentage of the mean
get_cv <- function(x) {
  sd(x) / mean(x) * 100
}

cat("For X, mean:", mean(x), "median:", median(x), "mode:", get_mode(x), "\n")
cat("For Y, mean:", mean(y), "median:", median(y), "mode:", get_mode(y), "\n")

cat("--\n")

# Assuming the question meant quantile(), given that was the function
# described - otherwise I'd use the IQR() function to find the
# interquartile range. Note that get_cv() reports the coefficient of
# variation, not the variance (var() would give the variance itself).
cat("For X, range:", range(x), "quantile:", quantile(x), "CV:", get_cv(x), "standard deviation:", sd(x), "\n")
cat("For Y, range:", range(y), "quantile:", quantile(y), "CV:", get_cv(y), "standard deviation:", sd(y), "\n")

The result:

For X, range: 2 10 quantile: 2 2 3 4.5 10 CV: 72.16878 standard deviation: 2.886751
For Y, range: 12 20 quantile: 12 12 13 14.5 20 CV: 20.61965 standard deviation: 2.886751

The most interesting result when comparing the two datasets was that the standard deviation was identical for both sets. In hindsight it has to be: y is just x with 10 added to every value, so the spread (and therefore the variance and standard deviation) is unchanged. Only the coefficient of variation differs between the sets, because it divides the spread by the mean.
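
Since y is exactly x with 10 added to every value, a quick check shows which of these statistics are shift-invariant and which are not:

```r
x <- c(10, 2, 3, 2, 4, 2, 5)
y <- x + 10  # same data, shifted by a constant

var(x); var(y)         # both 8.333333: variance ignores a constant shift
sd(x); sd(y)           # both 2.886751
sd(x) / mean(x) * 100  # 72.16878: the CV divides by the mean...
sd(y) / mean(y) * 100  # 20.61965: ...so shifting the mean changes it
```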

The assignment was straightforward, mostly because of previous courses using R – I’ve used these functions before, apart from the coefficient of variation and the custom mode function I picked up from the Internet.

Posted in Advanced Statistics and Analytics