
Module #10 Introduction to ANOVA

This week’s assignment was to apply ANOVA to a dataset of patient pain ratings recorded at different stress levels. While on the test drug, patients rated their pain on a scale from 1 to 10 (with 10 being the most pain), and their stress state was recorded as high, moderate, or low.

The null hypothesis is that there is no difference between the means of the pain ratings at different stress levels.

To test this, I copied the table of data from the assignment website. I then reshaped the dataset with the gather() function from the tidyr package, converting it to two columns – one for the stress level and one for the pain rating. Finally, I ran aov() to get the ANOVA results and TukeyHSD() to compare specific pairs of stress groups.

My code follows.

library("tidyr")
# Read the table and then convert it to two columns detailing stress and pain
migraine <- read.table("G:/week10data.txt", header=TRUE)
migraine_clean <- gather(migraine, stress, pain, factor_key=TRUE)

# Get ANOVA information for the data
migraine_aov <- aov(migraine_clean$pain ~ migraine_clean$stress)

# Print summary information from ANOVA
summary(migraine_aov)
# Print comparisons between groups to determine where variance lies
TukeyHSD(migraine_aov)

The output of the ANOVA information was:

                      Df Sum Sq Mean Sq F value   Pr(>F)
migraine_clean$stress  2  82.11   41.06   21.36 4.08e-05 ***
Residuals             15  28.83    1.92
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The p value is very low (well below 0.05) and the F value is high, so the null hypothesis can be rejected: mean pain differs between at least some of the stress groups.

(To be sure that F was high enough to show a real difference, I ran qf() with the degrees of freedom of 2 and 15. At the 95% probability level the critical F value is 3.682, so the observed F of 21.36 does exceed the critical level.)
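That check can be reproduced directly in R; the 0.95 probability and the 2 and 15 degrees of freedom come straight from the ANOVA table above:

# Critical F value at the 95% level for 2 and 15 degrees of freedom
qf(0.95, df1 = 2, df2 = 15)   # about 3.68; the observed F of 21.36 exceeds this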

To find where the differences lie, I used TukeyHSD(). The output of the TukeyHSD() function was:

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = migraine_clean$pain ~ migraine_clean$stress)

$`migraine_clean$stress`
                                 diff       lwr        upr     p adj
moderate_stress-high_stress -1.166667 -3.245845  0.9125117 0.3382642
low_stress-high_stress      -5.000000 -7.079178 -2.9208216 0.0000440
low_stress-moderate_stress  -3.833333 -5.912512 -1.7541550 0.0006586

There was no significant difference in mean pain between the moderate- and high-stress groups, but both comparisons involving the low-stress group were significant. Pain levels therefore appear to differ significantly between low stress and each of the other two levels, while there is little difference between moderate and high stress.
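If a visual check is helpful, the confidence intervals from the Tukey comparison can also be plotted directly (a quick sketch; this uses the plot method that base R provides for TukeyHSD objects):

# Plot the 95% family-wise confidence intervals for each pairwise comparison
plot(TukeyHSD(migraine_aov))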


Module #9: t-test for independent means

This week we looked at t-tests for independent samples. The dataset listed students along with their gender and the number of times they raised their hands in class.

To get the values I needed, I first put the data into a text file, read it into R, then set up vectors for boys and girls and ran t.test() on them:

# Read the data, then split the hand-raising counts by gender
students <- read.table("G:/week9data.txt", header=TRUE)
girls <- students[students$Gender==1,2]
boys <- students[students$Gender==2,2]
# Run a Welch two-sample t-test comparing the two groups
t.test(girls, boys)
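For reference, t.test() also accepts a formula, which avoids building the two vectors by hand. This is only a sketch: I'm assuming the count column in the file is named Hands, which may not match the actual header.

# Same Welch test via the formula interface (Hands is an assumed column name)
t.test(Hands ~ Gender, data = students)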

The result of t.test() was:

	Welch Two Sample t-test

data:  girls and boys
t = 2.8651, df = 6.7687, p-value = 0.02505
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.6805904 7.3765524
sample estimates:
mean of x mean of y
 7.600000  3.571429
 

1. The means are:
girls: 7.6
boys: 3.571

2. The degrees of freedom value is 6.769.

3. The t-test statistic is 2.865.

4. The p value is 0.025.

5. The p value is below the alpha value of 0.05, so the result is statistically significant: the difference in mean hand-raising between girls and boys is unlikely to be due to chance alone.

6. To find the critical t value for a p of 0.01 and the degrees of freedom indicated by the samples, I used the qt() function:

qt(0.01, 6.769)

Then I took the absolute value of the result. This indicates that the t statistic would have to exceed 3.027 in absolute value to be statistically significant at a p value of 0.01.
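Putting that together as a single check, using the same degrees of freedom reported by t.test():

# Critical |t| at p = 0.01 for 6.769 degrees of freedom
abs(qt(0.01, 6.769))   # about 3.03
# The observed t of 2.8651 falls short of this stricter cutoff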


Module #7: Confidence Interval Estimation and Introduction to Fundamentals of Hypothesis Testing

This week we looked at estimating confidence intervals and formulating test hypotheses.

1. The code I used with the given variables to find the confidence interval is:

testmean <- 85
testsd <- 8
testsize <- 64
testconf <- 0.95
testforz <- testconf + (1 - testconf) / 2
testerr <- qnorm(testforz)*testsd/sqrt(testsize)
testmean - testerr
testmean + testerr

The calculated interval was from 83.040 to 86.960.
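Since the same calculation repeats across several of these questions, it could be wrapped in a small helper function. This is only a sketch; z_conf_int is a name I'm introducing, not part of the assignment.

# z-based confidence interval for a mean, given sd, sample size, and confidence level
z_conf_int <- function(m, s, n, conf) {
  z <- qnorm(conf + (1 - conf) / 2)
  err <- z * s / sqrt(n)
  c(lower = m - err, upper = m + err)
}
z_conf_int(85, 8, 64, 0.95)   # about 83.04 to 86.96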

2. I used similar code for this question:

testmean <- 125
testsd <- 24
testsize <- 36
testconf <- 0.99
testforz <- testconf + (1 - testconf) / 2
testerr <- qnorm(testforz)*testsd/sqrt(testsize)
testmean - testerr
testmean + testerr

The calculated interval was from 114.697 to 135.303.

3a. I used this code to calculate the confidence interval for the paint measurements:

testmean <- 0.99
testsd <- 0.02
testsize <- 50
testconf <- 0.95
testforz <- testconf + (1 - testconf) / 2
testerr <- qnorm(testforz)*testsd/sqrt(testsize)
testmean - testerr
testmean + testerr

The resulting interval was from 0.984 to 0.996.

3b. Given that the target value, 1, was not in the 95% confidence interval, the shop manager does appear to have a good reason to complain to the company he bought the paint from.

4a. The code I used to get the confidence interval is:

testmean <- 1.67
testsd <- 0.32
testsize <- 20
testconf <- 0.95
testforz <- testconf + (1 - testconf) / 2
testerr <- qnorm(testforz)*testsd/sqrt(testsize)
testmean - testerr
testmean + testerr

The interval was from 1.53 to 1.81.

4b. The store owner now has a range of plausible values for the mean: he can be 95% confident that the mean value of all the cards in the store lies between those two values.

5. The code I used to find the sample size required for the given error range is:

testsd <- 15
testconf <- 0.95
testforz <- testconf + (1 - testconf) / 2
testerr <- 5
qnorm(testforz)^2 * testsd^2 / testerr^2

The result was 34.573, so I rounded up to get a required sample size of 35.
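The rounding up can also be done directly in R with ceiling(); the 0.975 below is the same testforz value used above for the 95% level:

# Required sample size, rounded up to the next whole observation
ceiling(qnorm(0.975)^2 * 15^2 / 5^2)   # 35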

6. The average male student’s shoe size is 10, so:

mean = 10

The falsifying statement is that the mean is not 10:

mean != 10

The original statement asserts equality, so it becomes the null hypothesis, and the falsifying statement becomes the alternative hypothesis.

Null hypothesis: The average male student’s shoe size is 10.
Alternative hypothesis: The average male student’s shoe size is not 10.


Module #6 Sampling & Confidence Interval Estimation

This week we looked at sampling data and estimates over normal distributions.

A. The first section looked at a record of ice cream purchases during an academic year for each of five housemates (8, 14, 16, 10, 11).

a. First I was to calculate the mean of the population. R code:

pop <- c(8, 14, 16, 10, 11)
mean(pop)

The mean is 11.8.

b. Next I took a random sample of 2 of the values from the population.

subpop <- sample(pop, 2)

I ended up with (8, 16).
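Because sample() draws at random, the result changes on every run. Calling set.seed() beforehand makes the draw reproducible; the seed value below is arbitrary and won't necessarily reproduce my (8, 16) draw.

# Fix the random number generator state so the sample can be reproduced
set.seed(42)
subpop <- sample(pop, 2)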

c. Next, the mean and standard deviation of my random sample.

mean(subpop)
sd(subpop)

mean: 12
standard deviation: 5.657

d. For contrast, the same stats for the original population:

popmean <- mean(pop)
popsd <- sd(pop)

mean: 11.8
standard deviation: 3.194

The mean of my sample was pretty close to that of the original population, but the standard deviation was very different, which is not surprising given a sample of only two values.

B. The second section concerned a sample of size 100 drawn from a population with a proportion of 0.95.

1. To determine whether the sample proportion can be treated as normally distributed, you multiply the sample size n by both p (the population proportion) and q (1 – p). The normal approximation holds if both n*p and n*q are greater than 5.

To get the values, I used this code:

p <- 0.95
q <- 1 - p
n <- 100
n * p
n * q

n*p was 95 and n*q was 5.

This sample, then, falls just short of the requirement: n*p clears the threshold easily, but n*q equals exactly 5 rather than exceeding it, so the normal approximation does not quite apply.

b. The smallest sample size for which this p yields an approximately normal distribution is 101, the first size for which n*q exceeds 5 (specifically, 101 * 0.05 = 5.05).
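That answer can also be checked with a quick search in R:

# Smallest n (searching up to 200) for which n * q exceeds 5
q <- 0.05
min(which((1:200) * q > 5))   # 101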
