Table of Contents
1 - Data Basics - Data Structures and Properties
- Data Collection Principles
- Experiments
2 - Examining and Summarizing numerical Data
- Measuring the Center of Distributions - Mean & Median
- Measuring spread around the mean of distributions - Variance & Standard Deviation
- Visualizing numerical Data - Scatterplot, Dot Plot, Box Plot and Histogram
- Considering Categorical Data
- Visualizing Categorical Data
3 - Probability
- Defining Probability (Basic Theory)
- Conditional Probability - Probability Trees
- Sampling from a small proportion
- Random Variables
- Discrete Random Variables
- Continuous Random Variables
4 - Probability Distributions
- Normal Distribution
- Geometric Distribution
- Binomial Distribution
- Negative Binomial Distribution
- Poisson Distribution
5 - Foundations of Inference
- Point Estimates (PE) and Sampling Variability
- Confidence Intervals - Range around the PE
- Hypothesis Testing - Intro using numerical data (means)
- Inference for single proportions
6 - Difference of two Proportions
- Testing for goodness of fit using chi-square
- Testing for independence in two-way tables
7 - Inference for numerical Data
- One-Sample Means with t-distribution (paired)
- Paired Data
- Difference of two means
- Power calculations for a difference between two means
- Comparing many means with ANOVA
8 - Introduction to Linear Regression
- Line Fitting, residuals, and correlation
- Fitting a line by least squares regression
- Types of Outliers in linear Regression
- Inference for linear regression
9 - Multiple and Logistic Regression
- Multiple Regression
- Model Selection
- Checking model conditions using graphs
- Logistic Regression
1 - Data Basics - Data Structures and Properties
In a table, what are variables and what are observations?
What are the two types of variables?
Assign each variable to the corresponding type:
What type of variable are telephone area codes?
What is the difference between nominal and ordinal categorical variables?
- ordinal = the categories (level) have a certain order
- nominal = no special type of ordering
What are the two relationships that variables can have?
- Variables that show a connection with each other = associated (dependent)
- Variables that show no connection with each other = independent
What is the relationship between explanatory and response variables?
- We suspect that one variable might causally affect another variable
- The first is then labelled the explanatory variable and the second the response variable
What is an observational study?
Simply observing and collecting data, without interfering, to draw possible conclusions
- The results of your study might show correlation but not causation
- Behind the variables you are looking at, there could be confounding variables in the background that you didn't consider.
What is an experiment?
Researchers conduct a study, e.g. to investigate the possibility of a causal connection.
- We have treatment and control groups
- We have clearly defined explanatory and response variables
What is the difference between association and causation?
- Associated: Two variables show a connection → also called dependent variables
- Independent: Two variables that are not associated / show no connection
⇒ Observational studies cannot imply causation, they can show some kind of association.
⇒ Only randomized experiments can infer causation.
What is the main difference between observational studies and experiments?
Most experiments use random assignment, while observational studies do not and just collect data
Data Collection Principles
What is anecdotal evidence and what is the difference to statistical methods?
- Data collected anecdotally might be biased and not objective. Be careful
What is a census?
Sampling the entire population
What is exploratory analysis?
- Noticing something interesting in your sample, e.g. some proportion
What is inference?
- Drawing conclusions about the entire population based on your sample observation
What characteristic does your sample need to have in order to draw a valid inference?
- The sample needs to be representative of the entire population
- There are certain biases that you need to be aware of when sampling
What are four biases that could make your sample non-representative?
What is the voluntary response bias?
The resulting sample tends to overrepresent individuals who have strong opinions. A share of people will not respond, and those are likely the people who don't care.
What is a convenience sample?
- The survey you conduct consists of people who are more easily accessible
- Often results in undercoverage
What is undercoverage bias?
Undercoverage bias is a type of sampling bias that occurs when some parts of your research population are not adequately represented in your survey sample. In other words, undercoverage happens when a significant group in your research population has an almost-zero probability of getting selected into the research sample.
- For example, let’s say you’re conducting a product evaluation survey via Formplus to find out what users think about a product. To accurately gather data for this research, you’ll need to collect feedback from both new and existing users of the product.
What is non-response bias?
About 30% of a chosen random sample is likely to not even respond. It is then questionable whether the results are valid
What is confounding (variable)?
- A third variable, besides the explanatory and response variables, that is correlated with both
- E.g. sun exposure, which is related to both sunscreen usage and skin cancer
Difference between prospective and retrospective studies?
Prospective: Collecting information as events unfold. Follow individuals over a certain time span and track changes
Retrospective: Collecting data on events that have already taken place, e.g. from research that has already been done.
What are the 4 most common sampling methods?
Simple random sampling (SRS)
= sample independent subjects from the entire population
Stratified sampling
- Divide the entire population into groups (strata) which hold similar observations.
- Take an SRS from EACH stratum
Cluster sampling
- Divide the population into clusters (groups that are similar to one another)
- Take an SRS of the clusters and then sample all observations within the selected clusters
Multistage sampling
- After taking an SRS of the clusters, also take an SRS of the observations within each selected cluster
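The four sampling methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of four districts with ten households each is made up, and the function names are hypothetical helpers, not part of any library.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n units without replacement; every unit is equally likely."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Take a simple random sample from EACH stratum."""
    return {name: rng.sample(units, n_per_stratum) for name, units in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters, then keep every unit in the chosen clusters."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

def multistage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Randomly pick clusters, then take a simple random sample inside each one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 4 districts with 10 households each.
districts = {f"district_{d}": [f"d{d}_h{h}" for h in range(10)] for d in range(4)}
everyone = [h for households in districts.values() for h in households]

srs = simple_random_sample(everyone, 8, rng)
strat = stratified_sample(districts, 2, rng)
clus = cluster_sample(districts, 2, rng)
multi = multistage_sample(districts, 2, 3, rng)
```

Here the districts play the role of strata in one call and of clusters in another; in practice strata are chosen so that units within a group are similar, which is a modelling decision the code cannot make for you.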
What are methods for simple random sampling?
- Random digit table → Depending on the population size (tens, hundreds, thousands, ...), choose how many digits to read at a time, ignoring any number that is greater than the population size
- Lottery drawing
- Random number generator
A cluster sample: clustering the area into district neighborhoods and accidentally selecting only clusters of large-home neighborhoods will not yield representative results.
What are the principles of experimental design?
- Controlling: Control differences between the test groups so that they do not distort the results
- Randomization: Random assignment of research participants to the groups, to reduce false results
- Replication: The larger the sample, the more accurate the result is
- Blocking: Splitting up groups into blocks, as some people might have other variables that affect their response. E.g. both groups get a block of high-risk and a block of low-risk participants. Gender is another possible block, if it could impact the results of your study.
What are blinded / double-blinded experiments?
blinded = participants do not know whether they are part of the treatment or the control group
double-blinded = the examiners / trackers also do not know which group the participant in front of them is part of
What is matched pairs design?
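Each participant (or each pair of similar participants) receives both treatment and control, e.g. by switching control and treatment group participants after a certain amount of time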
Overview on expectations from a study.
2 - Examining and Summarizing numerical Data
Measuring the Center of Distributions - Mean & Median
What is the mean?
Mean, also average, is one way to measure the center of some kind of data distribution
How do you calculate the sample mean?
How do we denote the population mean?
The sample mean is a .... and serves as a ... of the entire population
The sample mean is one form of a sample statistic and serves as a point estimate for the population mean.
- The quality of that estimate depends on various factors
What is the median?
→ Imagine a survey where a lot of people answered with 1. Then you would have many more entries on the left, also shifting the median to the left.
The median is also referred to as the ...th percentile?
What are the percentiles corresponding to the first and third quartiles?
First quartile (Q1) = 25th percentile → Midpoint of the lower 50% of the data points
Third quartile (Q3) = 75th percentile → Midpoint of the upper 50% of the data points
What is the IQR?
= Interquartile Range
- difference between the midpoint of the upper 50% of the data (Q3) and the midpoint of the lower 50% of the data (Q1): IQR = Q3 - Q1
Categorize Mean, Median, IQR and Standard Deviation based on robustness
- Robust: Median and IQR
- Outliers and extreme data points might shift the median one number to either side, but you would need to add a lot of new numbers on either side to substantially change the middle number of the whole data set (median) and the upper and lower 50% (IQR)
- Not robust: Mean and Standard Deviation
- Adding extreme values to the numerator of the mean calculation will make the fraction return a much larger or smaller number
- With the mean changing a lot, the SD calculation also changes a lot. Large outliers in particular will affect the standard deviation → But that is exactly what the SD is supposed to show you
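A small Python sketch can make the robustness difference concrete. It assumes quartiles are taken as the medians of the lower and upper half of the sorted data (as described in these notes; other software may use slightly different quartile rules), and the data set is made up.

```python
import statistics

def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper half of the sorted data."""
    data = sorted(values)
    half = len(data) // 2
    return statistics.median(data[:half]), statistics.median(data[-half:])

def summary(values):
    """Center and spread, once with robust and once with non-robust statistics."""
    q1, q3 = quartiles(values)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),      # sample SD (divides by n - 1)
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }

data = [2, 3, 4, 5, 6, 7, 8]      # hypothetical observations
with_outlier = data + [100]       # same data plus one extreme value

before = summary(data)
after = summary(with_outlier)
```

With this data the median moves from 5 to 5.5 and the IQR stays at 4, while the mean jumps from 5 to 16.875 and the SD grows more than tenfold.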
What kind statistics should you use for center and spread for (1) skewed data sets and (2) symmetric data sets?
Measuring spread around the mean of distributions - Variance & Standard Deviation
What is the variance and how do you calculate it?
= roughly the average squared deviation from the mean
This is the formula for the sample variance → Why n-1? Dividing by n would systematically underestimate the actual population variance, because the deviations are measured from the sample mean rather than from the true mean. Dividing by n-1 yields slightly larger values that correct for this, so the estimate is more accurate.
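The formula itself is not reproduced in these notes. For reference, the standard definition of the sample variance and standard deviation, with x-bar as the sample mean and n as the sample size, is:

```latex
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2,
\qquad
s = \sqrt{s^2}
```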
Why do we use the squared deviation in the variance calculation?
- to get rid of negative values, which result when the mean is larger than the observation → Whether the deviation is positive or negative, any distance from the mean then counts towards the spread instead of cancelling out, giving us a better estimate of the overall deviation
- Larger deviations are weighted even more heavily.
What is the standard deviation and how do we calculate it?
Visualizing numerical Data - Scatterplot, Dot Plot, Box Plot and Histogram
What are scatterplots useful for?
Visualizing the relationship between two numerical variables
- identifying association or independence
What are dot plots used for?
Useful for visualizing just ONE variable.
- Use darker colors where more observations are
- Use stacked Dot Plots to show the same data a little different
What are histograms used for?
- A bin width of 1 would basically be like a dot plot where the dots are connected
What are the modality types of a distribution (esp. histogram)?
- Single Peak = unimodal
- Several Peaks = bi or multi-modal
- No peaks = uniform
What are the different skewness types of distributions?
- Right skewed = tail to the right
- Left Skewed = tail to the left
Explain the structure of a box plot? What does the box represent? What does the thick line in the middle represent? What do outside points and the bars mean?
- The Box covers the IQR → the middle 50% of all the data
- Thick line in the box → Median of entire data set
- Lines away from box → Whiskers
- max upper whisker = Q3 + 1.5 * IQR
- max lower whisker = Q1 - 1.5 * IQR
- Outside points → Outliers
- Every data point outside of the max upper and lower whisker reach
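The box plot ingredients can be computed by hand, as in this Python sketch. It assumes quartiles are the medians of the lower and upper half of the sorted data (plotting libraries may use a different quartile rule), and the data set is invented for illustration.

```python
import statistics

def box_plot_summary(values):
    """Median, quartiles, whisker ends, and outliers for a box plot."""
    data = sorted(values)
    half = len(data) // 2
    q1 = statistics.median(data[:half])
    q3 = statistics.median(data[-half:])
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr   # maximum reach of the lower whisker
    upper_limit = q3 + 1.5 * iqr   # maximum reach of the upper whisker
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    return {
        "median": statistics.median(data),
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        # Whiskers end at the most extreme data points still inside the limits.
        "whiskers": (min(inside), max(inside)),
        "outliers": [x for x in data if x < lower_limit or x > upper_limit],
    }

result = box_plot_summary([1, 3, 4, 5, 5, 6, 7, 8, 9, 30])
```

For this data the box spans 4 to 8, the whisker limits are -2 and 14, so the whiskers stop at 1 and 9 and the value 30 is drawn as an outlier.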
What is a common operation to transform extremely skewed data?
- Log transformation
- outliers become far less prominent
- right side is log(number of games attended)
- results of an analysis of log data might be hard to interpret.
What are commonly extremely skewed data sets?
Salary and Housing prices
Considering Categorical Data
What is a contingency table?
- A two-way table that summarizes data and shows the relationship between TWO or more categorical variables
- The table helps in determining conditional probabilities quite easily.
A chi-square test can be conducted on contingency tables to test whether or not a relationship exists between variables. These effects are defined as relationships between rows and columns.
Visualizing Categorical Data
What is a bar plot used for? What do you call a bar plot showing proportions?
What is the difference between bar plot and histogram?
- bar plots are for categorical variables while histograms are for numerical data
- the x axis of histogram is a number line, the order of the bars cannot be changed, as the order depends on the distribution
- The categories in a bar plot can be arranged in any order. The height of each bar shows a frequency or relative frequency on the y-axis; the position on the x-axis carries no numerical meaning.
What kind of bar-plots can we use if there are two or more variables in the contingency table?
- Stacked bar plots → Where on the x-axis there are now the row observations
- the stack reaches a height of the total amount of observations
- Side-by-side bar plots → Still the variables are split on the x-axis but the frequencies are not stacked but rather next to each other
- adding up the heights of both frequencies will give you the total
- Standardized stacked bar plot → Stacked plot for relative frequencies
- The height will exactly be 1 or 100%
3 - Probability
Defining Probability (Basic Theory)
What is a random process?
What is probability and what are the two main mathematical rules it needs to follow?
What is the law of large numbers?
What are disjoint or mutually exclusive outcomes?
- Outcomes that CANNOT happen at the same time
- you cannot get heads and tails in one flip
Does the sum of the probabilities of two disjoint events always add up to one?
- Not necessarily; there may be more than 2 events in the sample space, so the two probabilities can sum to less than 1
If two sets are disjoint, what is the probability of the events A and B both happening in one experiment?
0, as the two sets share no outcomes, making it impossible for both to occur at once.
What is the opposite?
Does the sum of two complementary events always add up to 1?
Yes, that is the definition of complementary sets. They always add up to 1
What is the addition rule and for what operation on sets is it used?
What is a probability distribution?
A distribution listing all possible events and the corresponding probabilities
- Can be a table format
- or a distribution curve, if the data allows it
What are the three basic rules for a probability distribution?
What is a sample space?
What are complementary events?
What are independent processes?
How do you check for independence between events?
What is the product rule for independent events?
Conditional Probability - Probability Trees
What is conditional probability and what is the formula to calculate it?
- Two or more dependent events, where we want to know the probability of one event given that the other event has occurred (its probability changes when the other event occurs)
What is the general multiplication formula for events that may be dependent?
- We multiply both sides of the conditional probability formula by P(B) and thus can calculate the probability of A and B
How can we use the conditional probability formula to prove the independence condition?
What is Bayes' Theorem? How is it different from the conditional probability formula?
- the numerator is just a general expression for a two-step path on the tree below. P(A) is on one of the first branches and P(B|A) is on the second branches
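The formulas referred to in this subsection are not reproduced in these notes. In standard notation, with A^c denoting the complement of A, they read as follows (a reference sketch, not necessarily the exact form used in the course):

```latex
\begin{align*}
  P(A \mid B) &= \frac{P(A \text{ and } B)}{P(B)}
    && \text{(conditional probability)} \\
  P(A \text{ and } B) &= P(A \mid B)\, P(B)
    && \text{(general multiplication rule)} \\
  P(A \mid B) &= P(A) \;\Longrightarrow\; P(A \text{ and } B) = P(A)\, P(B)
    && \text{(independence)} \\
  P(A \mid B) &= \frac{P(B \mid A)\, P(A)}
                      {P(B \mid A)\, P(A) + P(B \mid A^{c})\, P(A^{c})}
    && \text{(Bayes' Theorem)}
\end{align*}
```

In Bayes' Theorem the denominator is P(B) written as the sum of all two-step tree paths that end in B.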
Sampling from a small proportion
What is sampling with replacement?
What is sampling without replacement?
What are random variables?
What are continuous and discrete random variables?
What is a fair game?
A fair game is defined as a game that costs as much as its expected payout, i.e. expected profit is 0.
Discrete Random Variables
How do you calculate Mean (Expected Value) of discrete random variables?
Discrete random variable = X can only take countable, separate values (e.g. whole numbers, no fractions in between)
How do you calculate the variability of values around the expected value of a random variable?
- We first need to compute the expected value
- we can use that to compute the variance and standard deviation
What does a linear combination of random variables look like? How do you compute the expected value or mean of such a combination?
- simply add up the expected value of each random variable
How do you calculate the variance of the sum of random variables?
- Calculate the individual variances using the formula from above
- Add the variances together
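The steps above can be sketched in Python. The game is hypothetical (pay 2 to play, win 10 with probability 0.2), and adding the variances assumes the two plays are independent.

```python
import math

def expected_value(dist):
    """E(X) = sum of x * P(X = x) over all possible values x."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var(X) = sum of (x - E(X))^2 * P(X = x)."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

# Hypothetical game: profit is +8 with probability 0.2 and -2 with probability 0.8.
profit = {8: 0.2, -2: 0.8}

mu = expected_value(profit)   # about 0, so the game is fair
var = variance(profit)        # about 16
sd = math.sqrt(var)           # about 4

# Sum of two INDEPENDENT plays: expected values add, and variances add.
mu_sum = mu + mu
var_sum = var + var
sd_sum = math.sqrt(var_sum)   # sqrt(32): standard deviations do NOT simply add
```

Note that the standard deviation of the sum is about 5.66, not 4 + 4 = 8, which is why the variances are added first and the square root is taken afterwards.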
How do you calculate the variance if you have a range of numbers given for the variance?
- If you want to calculate the variance of the sum, you need to add the lower bounds and the upper bounds of each random variable's variance together. The result gives you a range of possible variances for the sum
Why do the random variables need to be independent of each other?
Continuous Random Variables
Why can you not calculate the exact probability of continuous random variables?
You can't count continuous values, so there is no way to assign a probability to one exact value. If you tried, the probability of that specific number would be ~ 0%
E.g. exactly 15: there are infinitely many values such as 14.9999999 and 15.00000001 right next to it
What is a probability density function?
A curve with a total area of 1 beneath it.
How do you calculate probabilities of continuous random variables using a density function for the normal distribution?
In a normal distribution curve we know the probabilities of certain regions:
- The area from 1 SD below the mean to 1 SD above the mean is ~ 68.2%, which means that each half is about 34%
- The area on each side of the mean is 50%
4 - Probability Distributions
What are mean and SD also called?
→ If µ = 0 and σ = 1 we talk about the standard normal distribution
What is the Z-Score and what do we use it for?
The Z-score of an observation is defined as the number of standard deviations it falls above or below the mean
- the numerator determines the difference between your observed value and the mean of the distribution
- The denominator then puts this distance in relation to the standard deviation of your distribution
Only under which conditions can we apply the z-score?
A Z-score can be computed for any distribution, but translating it into a percentile with the normal model only works if the distribution is (nearly) normal
What kind of observations are considered to be unusual?
What is a percentile? How do we graphically represent it?
- The percentage of observations that fall BELOW a given data point → This data point can e.g be a Z-Score on a normal distribution
- Graphically the percentile is the area below the probability distribution TO THE LEFT of the observation
What is the R code to compute the percentile / area below the curve?
How do you compute the cutoff point for e.g. the lowest 3% of the observations of a normal distribution? Given is a mean of 98.2 and an SD of 0.73.
- Lookup or compute the Z-Score corresponding to 3% or 0.03, which is -1.88
- plug all the values into the Z-Score formula and solve for X
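The two steps can be checked numerically. These notes refer to R for such computations; the sketch below uses Python's standard library (`statistics.NormalDist`) instead, which is a substitution made here for illustration.

```python
from statistics import NormalDist

mean, sd = 98.2, 0.73

# Step 1: Z-score that has 3% of the standard normal area to its left.
z = NormalDist().inv_cdf(0.03)

# Step 2: solve Z = (x - mean) / sd for x.
cutoff = mean + z * sd

# Sanity check: the percentile of the cutoff should be 0.03 again.
percentile = NormalDist(mean, sd).cdf(cutoff)

print(round(z, 2), round(cutoff, 2))
```

The Z-score rounds to -1.88 and the cutoff to about 96.83, so the lowest 3% of observations lie below roughly 96.83.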
How many data points fall within 1, 2, and 3 standard deviations of the mean?
How large is the IQR for normal distributions?
In a standard normal distribution (with mean 0 and standard deviation 1), the first and third quartiles are located at -0.67448 and +0.67448 (about two thirds of an SD on either side of the mean). Thus the interquartile range is IQR = Q3 - Q1 = 0.67448 - (-0.67448) = 1.34896
Under which condition do we talk about a Bernoulli Random Variable?
One trial can only have two possible outcomes: success or failure
What is a geometric distribution? What formula can we use to calculate it?
In a geometric distribution X describes the waiting time (e.g. number of trials) until the first success for
- independent and identically distributed Bernoulli Random Variables
What does independence between trials mean?
The outcomes of the trials do not affect each other
What does identical mean?
the probability for a success in each trial is the same
How do we compute the mean number of trials until a success and the standard deviation?
What calculation do you use if you want to know the probability of a success occurring within a certain range of trials rather than exactly at one certain trial?
- We use the general formula of the geometric distribution
- Then we plug in each value of n in the range, each in a separate calculation
- Lastly we add together all the individual probabilities.
If the question is: Find the probability of a success within the first three trials: Instead of calculating and adding the probabilities for each of the first three trials, you can use the complement: compute the probability of no success in the first three trials and subtract it from 1
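Both routes give the same answer, as this Python sketch shows. The formulas are the standard ones for the geometric distribution (they are not reproduced in these notes), and the success probability p = 0.35 is an arbitrary value chosen for illustration.

```python
import math

def geometric_pmf(n, p):
    """P(first success on trial n) = (1 - p)^(n - 1) * p."""
    return (1 - p) ** (n - 1) * p

def geometric_cdf(n, p):
    """P(first success within the first n trials) = 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n

p = 0.35  # hypothetical success probability

# Route 1: add the probabilities for trials 1, 2 and 3.
by_summing = sum(geometric_pmf(k, p) for k in (1, 2, 3))

# Route 2: complement of "no success in the first three trials".
by_complement = geometric_cdf(3, p)

# Mean and standard deviation of the number of trials until the first success.
mean_trials = 1 / p
sd_trials = math.sqrt((1 - p) / p ** 2)
```

The complement route scales better: for "within the first 20 trials" it is still a single calculation.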
What is the binomial distribution?
What is the formula? How would you describe the formula in words?
- In words you can calculate it as follows:
# of scenarios × P(single scenario), where mathematically:
- The actual formula is:
What is another way of writing n,k in brackets?
⇒ which means, out of n events choose k.
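The "number of scenarios times probability of a single scenario" idea can be written out in Python. The formula is the standard binomial one (it is not reproduced in these notes); `math.comb` is Python's counterpart of "n choose k", and the numbers are chosen only for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n trials)."""
    scenarios = math.comb(n, k)                    # number of ways to place k successes
    single_scenario = p ** k * (1 - p) ** (n - k)  # probability of one such ordering
    return scenarios * single_scenario

# Example: exactly 2 successes in 4 trials with p = 0.5.
ways = math.comb(4, 2)
prob = binomial_pmf(2, 4, 0.5)

# The probabilities over all possible k add up to 1.
total = sum(binomial_pmf(k, 4, 0.5) for k in range(5))
```

Here there are 6 scenarios, each with probability 0.0625, giving 0.375 in total.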
What is the R code for the choose function of the binomial distribution?
- where x = total trials and y = successes
What are the 4 conditions that need to be met for a binomial distribution to be applicable?
- Trials must be independent
- Number of trials must be fixed
- Each trial outcome must be classified as a success or failure
- The probability of a success p must be the same for each trial (Bernoulli)
What is the probability that 2 randomly chosen people share a birthday? What is the probability that at least 2 people out of 366 share a birthday?
How do you compute the mean and standard deviation?
How do you compute the range (interval) outside of which observations are considered unusual in binomial distributions?
- Compute mean and SD of your distribution
- Recall that all observations further than 2 SD away from the mean are considered unusual
- As we computed ONE standard deviation, you can compute the interval borders by multiplying the SD by 2 and then adding it to and subtracting it from the mean
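These steps can be sketched in Python using the standard binomial mean n*p and standard deviation sqrt(n*p*(1-p)). The values n = 100 and p = 0.5 are made up for illustration.

```python
import math

def binomial_mean_sd(n, p):
    """Mean and standard deviation of a binomial distribution."""
    return n * p, math.sqrt(n * p * (1 - p))

def usual_range(n, p):
    """Interval of mean +/- 2 SD; observations outside are considered unusual."""
    mean, sd = binomial_mean_sd(n, p)
    return mean - 2 * sd, mean + 2 * sd

def is_unusual(k, n, p):
    low, high = usual_range(n, p)
    return k < low or k > high

n, p = 100, 0.5                  # hypothetical: 100 fair coin flips
mean, sd = binomial_mean_sd(n, p)
low, high = usual_range(n, p)
```

For 100 fair coin flips the mean is 50 and the SD is 5, so anything below 40 or above 60 heads would count as unusual.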
What is another method to determine whether the observation compared to the binomial distribution is unusual?
What effect does the size of your sample n have on the shape of the binomial distribution?
The larger n, the more normal the distribution becomes → The mean of the distribution tends towards the actual population mean
What sample size is required to view a binomial distribution as normal?
Recall the wording for this:
How many different ways are there to arrange 0 successes and n failures in n trials? (1 way.) How many different ways are there to arrange n successes and 0 failures in n trials? (1 way.)
How many ways can you arrange one success and n - 1 failures in n trials? How many ways can you arrange n - 1 successes and one failure in n trials?
Negative Binomial Distribution
What is the negative binomial distribution?
The probability of observing the kth success specifically on the nth trial.
What are the conditions to be met?
- eyes on point No.4!
What is the formula?
- similar ideas as to regular binomial distribution: (probability of one sequence) * (number of possible sequences)
- We are excluding the column all the way to the right.
How is the negative binomial distribution different from the binomial distribution?
- binomial: the number of trials is fixed and we instead consider the number of successes
- negative: we examine how many trials it takes to observe a fixed number of successes and require the last observation to be a success
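A short Python sketch of the standard negative binomial formula (which is not reproduced in these notes) shows the role of the last trial: because it must be a success, only the first n - 1 trials can be rearranged. The example numbers are invented.

```python
import math

def negative_binomial_pmf(n, k, p):
    """P(the k-th success occurs exactly on trial n)."""
    # The last trial is fixed as a success, so only k - 1 successes
    # are placed among the first n - 1 trials.
    sequences = math.comb(n - 1, k - 1)
    one_sequence = p ** k * (1 - p) ** (n - k)
    return sequences * one_sequence

# 2nd success exactly on the 4th trial, with p = 0.5.
second_on_fourth = negative_binomial_pmf(4, 2, 0.5)

# With k = 1 this reduces to the geometric distribution.
first_on_third = negative_binomial_pmf(3, 1, 0.5)
```

The first example gives 3 * 0.0625 = 0.1875; the second gives 0.125, the same as the geometric probability of a first success on trial 3.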
What is this kind of distribution used for?
What is the rate of a Poisson distribution?
What is the formula? How do you compute mean and standard deviation?
What are generalized linear models?
Adjusting the rate for different times, as e.g. traffic is very different between normal times and rush hour.
We are basically trying to improve the result of our Poisson distribution
How do you identify a Poisson distribution?
- The event being evaluated is really rare
- E.g. a person making a typo, who normally makes one about once an hour
5 - Foundations of Inference
Point Estimates (PE) and Sampling Variability
What is a population parameter?
the actual value of a parameter for an often very large population, which is hard to examine directly
What is a sample statistic and what is it used for?
- Sample statistics are point estimates of the actual population parameter.
- this statistic varies with each sample that we take from the entire population
What is the margin of error?
The amount by which the sample statistic deviates from the actual population parameter
How can you transfer a "known" population parameter of e.g US citizens in R? Parameter is 0.8 or 80%
How do you sample and compute p-hat using R?
What does the Central Limit Theorem for single sample proportions tell us?
What is the effect of the sample size on the Standard Error?
- as n increases, SE will decrease
- with larger samples, our point estimates of the actual population proportion become more consistent.
- the variability among the point estimates will decrease.
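The effect of n on the standard error can be checked with a small simulation. These notes refer to R for this; the sketch below uses Python's standard library instead. It assumes the standard formula SE = sqrt(p(1-p)/n) for a sample proportion and reuses the population proportion of 0.8 mentioned above; the sample sizes and seed are arbitrary.

```python
import math
import random
import statistics

def standard_error(p, n):
    """SE of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

def simulate_p_hat(p, n, rng):
    """Draw one sample of size n and return its sample proportion p-hat."""
    successes = sum(rng.random() < p for _ in range(n))
    return successes / n

p = 0.8                      # "known" population proportion
rng = random.Random(1)       # fixed seed for reproducibility

# Theoretical SE for growing sample sizes: it shrinks as n grows.
se_by_n = {n: standard_error(p, n) for n in (50, 200, 800)}

# Empirical check: spread of 1000 simulated p-hats at n = 200.
p_hats = [simulate_p_hat(p, 200, rng) for _ in range(1000)]
observed_spread = statistics.stdev(p_hats)
```

Quadrupling the sample size halves the standard error, and the spread of the simulated point estimates should land close to the theoretical SE for n = 200.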
What are necessary conditions for the central limit theorem to hold?
- Independence = independent sample observations; hard to prove, but assumed if:
- random sampling was applied
- if sampled without replacement, the sample size is smaller than 10% of the entire population
- Sample size (success-failure condition)
- n*p ≥ 10 & n*(1-p) ≥ 10
- there must be at least 10 EXPECTED successes and failures in the observed sample
- we generally use the actual population parameter; however, as this is not often given, we most often use the point estimate p-hat.
What happens if np or n(1-p) is < 10?
What does the central limit theorem for the distribution of sample means tell us?
What are the conditions to apply the central limit theorem to sampling distributions?
Confidence Intervals - Range around the PE
What is a confidence interval?
A plausible range of values that is likely to contain the population parameter
- using a sample proportion to estimate the population parameter is like fishing with a spear, while using a confidence interval is like fishing with a net
What does 95% confident mean?
- we are taking many samples from a population and we are building a confidence interval around each of them
- 95% of those intervals that we build would contain the true population proportion (p)
If we want to be more certain that we capture the population parameter with our interval, how should we adjust the interval?
- Make it wider
What do we change in order to adjust the width of the interval?
- Adjust the margin of error, which is the Z* × SE section of the CI calculation
- More specifically we can adjust the Z* value
- 95% confidence interval ⇒ z = 1.96
- 98% confidence interval ⇒ z = 2.33
⇒ As z gets larger, the margin of error becomes larger and thus the interval becomes wider. Thus the interval gives us more certainty that it contains the actual proportion.
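The link between confidence level, z* and interval width can be sketched in Python. It assumes the usual interval for a single proportion, p-hat ± z* × SE with SE = sqrt(p-hat(1 - p-hat)/n); the sample values (p-hat = 0.6, n = 400) are hypothetical.

```python
import math
from statistics import NormalDist

def z_star(confidence):
    """Critical value that leaves (1 - confidence) / 2 in each tail."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def proportion_ci(p_hat, n, confidence=0.95):
    """Confidence interval: point estimate +/- z* x SE."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin_of_error = z_star(confidence) * se
    return p_hat - margin_of_error, p_hat + margin_of_error

z95 = z_star(0.95)
z98 = z_star(0.98)

ci95 = proportion_ci(0.6, 400, 0.95)   # hypothetical sample
ci98 = proportion_ci(0.6, 400, 0.98)
```

The critical values round to 1.96 and 2.33, and the 98% interval comes out wider than the 95% interval for the same sample.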
Hypothesis Testing - Intro using numerical data (means)
What are the steps of conducting a hypothesis test?
What are the two error types in hypothesis tests?
What relation needs to be between the final p-value of our test and our significance level alpha in order to reject H0?
How does enlarging our significance level alpha affect a type one error?
- A 5% significance level means that there is a 5% chance of rejecting H0 when it is actually true → Thus there is a 5% chance of making a Type 1 error.
- The larger alpha is, the higher the probability of making a Type 1 error (H0 rejected, even though true) becomes.
- The significance level determines the "size" of the area below our z-score curve which we define as the rejection area for point estimates.
What is the parameter of interest?
usually the population parameter which we are trying to estimate with our sample
What are the conditions for being able to apply a hypothesis test?
- Independence between each sample
- random sample
- sample size is less than 10% of entire population
- Success-Failure condition
What is the test statistic?
- evaluation of how many standard deviations of the sampling distribution (standard errors) the observed sample mean is from the mean of the hypothesized sampling distribution
What are the two types of test statistics you can compute?
What is the p-value?
- the probability of observing a test statistic at least as extreme, assuming that the Null Hypothesis is true
- p-values lower than alpha provide statistical evidence that the Null Hypothesis can be rejected
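A minimal Python sketch of the Z test statistic and its p-value follows. It assumes a normal sampling distribution and uses the generic form (point estimate - null value) / SE; the sample numbers (mean 52, SD 10, n = 100, null value 50) are invented, and the function name is a hypothetical helper.

```python
import math
from statistics import NormalDist

def z_test(point_estimate, null_value, se, alternative="two-sided"):
    """Return the Z test statistic and its p-value under H0."""
    z = (point_estimate - null_value) / se
    std_normal = NormalDist()
    if alternative == "greater":
        p_value = 1 - std_normal.cdf(z)             # upper tail only
    elif alternative == "less":
        p_value = std_normal.cdf(z)                 # lower tail only
    else:
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # both tails
    return z, p_value

# Hypothetical sample: mean 52, SD 10, n = 100; H0: mu = 50.
se = 10 / math.sqrt(100)
z, p_value = z_test(52, 50, se)

alpha = 0.05
reject_h0 = p_value < alpha
```

Here the sample mean is 2 standard errors above the null value, the two-sided p-value is about 0.0455, and H0 is rejected at the 5% level but would not be at the 1% level.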
Under which condition should we use really small or large significance levels?
- Small significance level: if making a Type 1 error (falsely rejecting H0, even though it's true) is very costly.
- With a very small significance level, rejecting H0 becomes very unlikely
- Vice versa for expensive Type 2 errors: use a larger significance level
What is the difference between one sided and two sided hypothesis test? How do you define hypothesis for either one?
Inference for single proportions
What is the parameter of interest and the point estimate with single sample proportions?
- Parameter of interest = single population proportion for some kind of characteristic from the population
- Point estimate = Single proportion of sampled Americans who match some kind of characteristic
How do we compute the standard error of a sample proportion?
How do you determine the right sample size so that the Margin of Error is below some threshold?
- make use of the margin of error part from the confidence interval calculation
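Solving the margin of error z* × sqrt(p(1-p)/n) for n gives n ≥ z*² × p(1-p) / ME². A Python sketch under that assumption follows; using p = 0.5 when no estimate is available is the common worst-case choice, since it maximizes p(1-p).

```python
import math
from statistics import NormalDist

def required_sample_size(margin, confidence=0.95, p=0.5):
    """Smallest n with z* * sqrt(p * (1 - p) / n) <= margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Always round UP: rounding down would push the margin above the threshold.
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

n_needed = required_sample_size(0.03)              # 3 percentage points, 95%
n_with_estimate = required_sample_size(0.03, p=0.8)
```

For a 3-point margin at 95% confidence the worst case requires 1068 observations; a prior estimate further from 0.5 lowers the requirement.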
What is the difference in the success-failure condition and standard error calculation between confidence intervals and hypothesis tests for single proportions?
- for CIs we use the sample proportion (p-hat); for hypothesis tests we use the null value of p
6 - Difference of two Proportions
What is the parameter of interest and the point estimate for the difference between two proportions?
- Parameter of interest = The actual difference between the proportions of two independent populations
- Point estimate = Observed difference between the proportions of the two independent samples
The CI and HT formulas remain the same. What is the only thing within the formulas that needs to be computed differently? What values for p do we use? What is the special case of HT?
- The standard error: we add the squared standard errors (variances) of both proportions and take the square root
- which we then use for the Margin of Error in confidence intervals, and
- the denominator of the Z-score calculation during hypothesis tests
What are the conditions to compute the difference between two proportions?
- Independence within groups
- both groups (populations) need to be sampled randomly
- if they are sampled without replacement, the sample size should be less than 10% of the entire population
- Independence between groups
- Both groups are independent of each other
- Success-failure for both groups
- in both samples there need to be at least 10 successes and 10 failures
If we assume that H0 indicates NO DIFFERENCE between two proportions, the null value for the difference p1 - p2 (not p-hat) would be 0, but H0 gives no value for the proportions themselves. Without that we cannot compute the expected successes and failures. What do we use to solve this problem?
The pooled proportion:
- Finding the pooled proportion for the two groups
- Means finding the proportion of total successes among the total number of observations
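The pooled proportion and the resulting test statistic can be sketched in Python. It assumes the standard pooled standard error sqrt(p(1-p)/n1 + p(1-p)/n2) with the pooled proportion plugged in for p; the counts are hypothetical.

```python
import math

def pooled_two_proportion_z(successes_1, n_1, successes_2, n_2):
    """Pooled proportion, pooled SE, and Z statistic for H0: p1 - p2 = 0."""
    p_hat_1 = successes_1 / n_1
    p_hat_2 = successes_2 / n_2
    # Total successes among the total number of observations.
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    se = math.sqrt(pooled * (1 - pooled) / n_1 + pooled * (1 - pooled) / n_2)
    # The null value is 0, so the numerator is just the point estimate.
    z = (p_hat_1 - p_hat_2) / se
    return pooled, se, z

# Hypothetical data: 60 of 100 successes in group 1, 40 of 100 in group 2.
pooled, se, z = pooled_two_proportion_z(60, 100, 40, 100)

# Success-failure check with the pooled proportion (at least 10 expected each).
condition_met = all(
    n * q >= 10 for n in (100, 100) for q in (pooled, 1 - pooled)
)
```

With these counts the pooled proportion is 0.5 and the observed difference of 0.2 lies about 2.83 standard errors from 0.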
Why do we only mention the difference in proportions in the numerator of the z-score calculation for HT?
- Normally the numerator of the Z-score calculation is: the point estimate - the null value
- as said in the question before, if we assume that H0 indicates no difference between the proportions, our null value would actually be 0.
⇒ Thus we just leave it out and simply write the point estimate. In the case of the difference between two proportions, the corresponding formula is:
Direct comparison SE calculation for PROPORTION of one sample and difference between two samples:
- when working with proportions:
- if doing hypothesis tests, p comes from the null hypothesis
- if constructing confidence intervals, we use p-hat instead (sample proportion)
Testing for goodness of fit using chi-square
What is the Null and Alternative Hypothesis for the difference between observed and expected counts?
H0: there is NO inconsistency between the observed and expected counts. The observed counts follow the same distribution as the expected counts
HA: There IS inconsistency between the observed and expected counts. The observed counts do not follow the same distribution as the expected counts.
What is a goodness of fit test?
We quantify HOW different the observed counts are from the expected ones
- large deviations provide strong evidence for the alternative hypothesis
- How well does the observed data fit the expected data?
What is the chi-square statistic and what exactly do we compute with it?
Why do we square the chi-statistic?
How do we describe the shape of the chi-square distribution?
- by determining the degrees of freedom
Recall: What is the key difference between normal, T, F and Chi distributions?
→ Chi-square distribution: one parameter (df) that influences the shape, center and spread of the distribution.
What is the R code to determine the P-values corresponding to a Chi-Statistic?
Does the p-value corresponding to the chi-statistic shade the tail area above or below the test statistic?
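A sketch of the chi-square statistic and its p-value in Python (the notes ask for R; in R the equivalent call would be `pchisq(stat, df, lower.tail = FALSE)`). The p-value is the UPPER tail area, because only large deviations count as evidence against H0. The helper names and the counts are hypothetical, and the tail area is computed with a plain series expansion of the incomplete gamma function so that only the standard library is needed.

```python
import math


def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_upper_tail(stat, df):
    """Area under the chi-square curve ABOVE the statistic (the p-value)."""
    if stat <= 0:
        return 1.0
    a, x = df / 2.0, stat / 2.0
    # Series for the regularized lower incomplete gamma function P(a, x).
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= x / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return 1.0 - lower


# Hypothetical observed and expected counts for four categories.
observed = [205, 26, 25, 19]
expected = [198, 19.25, 33, 24.75]
stat = chi_square_statistic(observed, expected)
p_value = chi_square_upper_tail(stat, df=len(observed) - 1)
print(stat, p_value)
```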
What are the conditions to apply a chi-square statistic?
Testing for independence in two-way tables
What are H0 and HA for testing independence using Chi-Square?
H0: The two variables in the two-way table are independent
HA: The two variables are dependent.
How do we compute the df for two-way tables?
How do we compute the expected counts in two-way tables?
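The notes leave both answers open. A minimal sketch using the usual textbook formulas, which are an assumption here and not taken from the notes: expected count = (row total × column total) / table total, and df = (rows - 1) × (columns - 1). The table values are hypothetical.

```python
def expected_counts(table):
    """Expected count per cell: row total * column total / table total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def two_way_df(table):
    """Degrees of freedom: (number of rows - 1) * (number of columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)


# Hypothetical 2 x 3 table of observed counts.
table = [[10, 20, 30],
         [20, 20, 20]]
expected = expected_counts(table)
df = two_way_df(table)
print(expected, df)
```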
7 - Inference for numerical Data
One-Sample Means with t-distribution (paired)
Under which condition do we use t-distribution instead of normal distribution?
- If the sample size is very small and
- the population standard deviation is unknown
How is a t-dist. different from a normal one? What happens if the df changes?
- t distributions have thicker tails, which means that observations are more likely to fall beyond 2 SDs from the mean (where they would get classified as unusual under the normal distribution)
- This resolves our issue of not being able to reliably estimate the standard error due to the small sample size and not being given the SD of the population
- with increasing df the t-distribution becomes more and more normal shaped. This is why a sample size of more than 30 is the usual rule of thumb for using the normal distribution instead.
How do you compute the df value?
sample size - 1
What is the SE formula for t-distributions of one-sample means?
- just as for normal distributions
What are the conditions to apply a t-distribution?
How does the CI and test-statistic (HT) calculations change?
- instead of Z-table scores we are now looking at t-tables with the corresponding df value.
What is paired data? What do we usually do to analyze such data?
- Two sets of observations have a special correspondence
- we sample from ONE population but look at TWO variables that are likely to be dependent
- We often compute the difference in outcomes of each pair of observations and test whether the average difference in the population is different from 0 (based on the sample average difference)
- Always subtract in the same order!
What is our parameter of interest and point estimate?
parameter of interest: Average difference between the two examined variables of the population
point estimate: average difference between the two examined variables in the sample
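The paired-data procedure can be sketched as follows: compute the differences, then treat them as a one-sample t problem. The function name and the before/after values are hypothetical; the p-value would still have to be looked up in a t-table with the returned df.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(before, after, null_value=0.0):
    """One-sample t-statistic on the pairwise differences (after - before)."""
    # Always subtract in the same order.
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    x_bar_diff = mean(diffs)             # point estimate
    se = stdev(diffs) / math.sqrt(n)     # same SE formula as for one-sample means
    t = (x_bar_diff - null_value) / se
    return x_bar_diff, se, t, n - 1      # df = sample size - 1


# Hypothetical paired measurements.
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 14, 14]
x_bar_diff, se, t, df = paired_t_statistic(before, after)
print(x_bar_diff, se, t, df)
```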
What are the hypotheses for testing whether there is an average difference between the paired observations?
Difference of two means
When do we get two means?
If we sample TWO samples and keep track of ONE variable for both. In this case the mean of each sample
What is our parameter of interest and point estimate?
Parameter of Interest: Difference between the population means of the variable in the two groups
Point estimate: Difference between the mean of the variable of one sample and the mean of the variable of the other sample
What are the conditions for testing the difference between two means?
How do you compute SE? How do you compute df?
- Recall that the SE formula is similar to the one-sample mean SE calculation
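A sketch of the SE and df calculation for the difference of two means. It assumes the common textbook formulas SE = sqrt(s1²/n1 + s2²/n2) and the conservative df = min(n1 - 1, n2 - 1); statistical software typically uses a more exact df formula instead. All names and numbers are hypothetical.

```python
import math


def diff_means_se(s1, n1, s2, n2):
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def conservative_df(n1, n2):
    """Conservative degrees of freedom: the smaller sample size minus 1."""
    return min(n1 - 1, n2 - 1)


def diff_means_t(x_bar1, s1, n1, x_bar2, s2, n2, null_value=0.0):
    """t-statistic for H0: mu1 - mu2 = null_value."""
    point_estimate = x_bar1 - x_bar2
    return (point_estimate - null_value) / diff_means_se(s1, n1, s2, n2)


# Hypothetical summary statistics of two independent samples.
se = diff_means_se(s1=3, n1=36, s2=4, n2=64)
df = conservative_df(36, 64)
t = diff_means_t(10.0, 3, 36, 8.5, 4, 64)
print(se, df, t)
```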
What is the equivalent confidence interval for a two-sided hypothesis test with alpha = 0.05?
95% confidence interval (a 90% confidence interval corresponds to a one-sided test with alpha = 0.05)
What is the confidence interval formula here?
Direct comparison SE calculation for MEAN of one sample and difference between two sample means
- when working with means, it's very rare that the population standard deviation is known, so we usually use s
Power calculations for a difference between two means
What is the power of test?
- The probability of correctly rejecting H0 (correctly accepting HA), which is the complement of the probability of a Type 2 Error
- 1 - P(type II error)
What is the effect size?
- The required extremeness of our sample point estimate that would lead us to rejecting H0.
- The cutoff point of the significance level that we set on both sides of the distribution
- Thus in context the required difference between two sample means
How do you compute the effect size?
- Compute a 95% confidence interval around the center of the distribution assuming that H0 is true
- Determine the "point estimate" → here 0, as we assume that the difference between means for H0 is 0
- Determine the z-value corresponding to the confidence interval → if alpha = 0.05 then 95% CI
- Compute the Standard Error using the formula for the difference between two means
- Determine the interval borders, which then correspond to your cutoff points at the 97.5th and 2.5th percentile, by plugging everything into the CI formula: point estimate ± z* × SE
Explain graphically and mathematically the area corresponding to a Type II error.
- the area below the HA curve up to the critical value (H0 rejection) we just computed in the last step.
- The area in which we would fail to reject H0, even though H0 is false. So everything that does not fall beyond the critical values of the H0 distribution.
How do you graphically represent the Power of a test?
- 1 - area of Type II error
- the area below the HA curve starting at the critical value of the H0 distribution (yellow area)
How do you compute the exact power?
- compute the z-score of the new distribution with the new center
- the shape of both distributions is the same, thus the SD remains the same
- the mean will be whatever your task provides you with
- x will be either the right or left critical value we just calculated, depending on whether the new HA center is to the right or left of the H0 center
- Thus in this specific example the power of the test is 42.07%
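The power calculation can be sketched with the normal model as follows. The inputs (standard deviation 12, 100 observations per group, true difference of -3) are assumptions chosen for illustration; the notes do not state the numbers behind their example, so the result only lands near, not exactly on, the figure quoted above.

```python
import math
from statistics import NormalDist


def power_two_means(sd, n_per_group, true_diff, alpha=0.05):
    """Power of a two-sided test for a difference of two means (normal model)."""
    se = math.sqrt(sd ** 2 / n_per_group + sd ** 2 / n_per_group)
    z_star = NormalDist().inv_cdf(1 - alpha / 2)
    # Cutoff points of the H0 distribution, which is centered at 0.
    lower_cut, upper_cut = -z_star * se, z_star * se
    # HA distribution: same shape (same SE), new center at the true difference.
    alternative = NormalDist(mu=true_diff, sigma=se)
    # Power = area of the HA curve that falls in the rejection regions.
    power = alternative.cdf(lower_cut) + (1 - alternative.cdf(upper_cut))
    return se, (lower_cut, upper_cut), power


se, cutoffs, power = power_two_means(sd=12, n_per_group=100, true_diff=-3)
print(se, cutoffs, power)
```

The probability of a Type II error is then `1 - power`.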
What are ways to increase the power of a test?
- Increase alpha → Expanding the orange area
- Problem: Chance for making a Type I error increases
- increase sample size
- decrease the area for a type II error as curves have less overlap
- Less variability in the data set
- will also cause the distributions to be narrower → thus less overlap
- The true parameter (HA) is further away from H0
- also causing less overlap
How do you calculate the sample size required for a minimum power requirement? (e.g 80% instead of 40%)
- Mark the required power area on the alternative distribution
- Find the Z-value corresponding to the required power (here the 80th percentile, 0.8, which gives z ≈ 0.84)
⇒ The distance from the alternative center to the critical value of the H0 distribution is 0.84 times the SE of that distribution (null and observed distribution SEs are the same)
- Set up an equation making use of the fact that the sum of those two SE lengths (1.96 × SE + 0.84 × SE) is equal to the difference between the centers of both distributions
- solve this equation for SE by dividing by 2.8
- Make use of the SE formula for the difference between two means. For s plug in the given sample standard deviation. Solve for n
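The sample size steps can be sketched like this, assuming equal group sizes and the same standard deviation in both groups. The inputs (sd 12, difference of 3) are hypothetical. Because the code uses unrounded z-values instead of 1.96 + 0.84 = 2.8, the result can differ by one observation from a hand calculation.

```python
import math
from statistics import NormalDist


def n_per_group_for_power(sd, true_diff, power=0.80, alpha=0.05):
    """Smallest n per group that reaches the required power (normal model)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    # (z_alpha + z_power) * SE must equal the distance between the two centers.
    se_required = abs(true_diff) / (z_alpha + z_power)
    # SE = sqrt(sd^2 / n + sd^2 / n)  =>  n = 2 * sd^2 / SE^2
    n = 2 * sd ** 2 / se_required ** 2
    return math.ceil(n)


n_required = n_per_group_for_power(sd=12, true_diff=3)
print(n_required)
```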
Comparing many means with ANOVA
When do we start to use ANOVA? What statistic do we use?
- If we are comparing the means of 3 or more samples
- We then use a statistic called F-Statistic
What are H0 and HA?
What are the conditions to apply ANOVA?
- Independent observations within and between groups
- Data from less than 10% of entire population
- interpret whether data is independent
- Observations within each group are nearly normal
- especially important when the sample sizes are small
- Variability across the groups should be about equal
- especially when the sample sizes differ between the groups
How do we use ANOVA and F-Statistics to draw inference from data?
- large test statistics still lead to small p-values. For both Z and F this is the same, as large test statistics indicate unlikely observations
- The F-statistic becomes very large when there is high variability between groups relative to the variability within groups.
- If F is large enough we can reject H0 and thus conclude that the means of the samples are not all equal, i.e. at least one of them is different.
What is and how do we compute the total sum of squares (SST)?
- The total sum of squares provides us with the total variability of all data points across all samples
- within each group AND between the groups
- the numerator when calculating variance
- Compute the grand mean
- Compute the mean of the observations from each sample
- Then add those means up and divide them by the number of samples (this shortcut only works if all samples have the same size; otherwise average all data points directly)
⇒ (2 + 4 + 6) / 3 = 4 → grand mean
- Subtract the grand mean from EACH data point and square it. Add this computation up for all the data points from all samples
How do you compute the degrees of freedom? How do you compute the variance?
- DF: mn - 1 (m = number of samples, n = number of data points per sample)
- Question: How many data points are required to compute the LAST MISSING data point, given the grand mean?
- With the degrees of freedom you can then compute the variance
- Divide the total sum of squares by the degrees of freedom
- here e.g. 30/8 = 3.75
What is and how do you compute the sum of squares within (SSW), also called sum of squared errors (SSE)?
- How far are the data points within each sample from their own sample tendency (sample mean)?
- Distance of the data points of each sample from that sample's specific mean (we computed that sample mean in the first step), squared and summed up
How do you compute degrees of freedom and then variance of the SSW?
- Degrees of freedom: (n-1) * m
- Question: How many data points are required in EACH sample to compute the LAST MISSING data point in EACH sample, given the mean of each sample?
- Variance: Divide the SSW by the degrees of freedom
- This will actually provide us with the denominator of the F-Statistic, called mean square error
What is and how do we compute the sum of squares between (SSB or SSG)?
- How much of the total sum of squares variability comes from the variability BETWEEN the sample means
- For EACH data point don't use the actual data point but rather the mean of the corresponding sample
- For EACH data point subtract the grand mean from the sample mean and square that
- Repeat for each sample and sum up all the calculations
How do you compute the degrees of freedom for SSG? How do you compute the variance?
- Degrees of freedom: m - 1, where m is the number of samples
- Question: How many means would you need in order to compute the LAST MISSING mean, given the grand mean?
- Variance: Divide the SSG by the degrees of freedom
- this is actually the mean square between groups (MSG), which is the numerator of our F statistic
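Overview on degree of freedom calculations:
- Total (SST): mn - 1
- Within (SSW): m(n - 1)
- Between (SSG): m - 1
⇒ the within and between degrees of freedom add up to the total: m(n - 1) + (m - 1) = mn - 1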
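What does the F-Statistic formula look like with the sums we computed before?
- F = (SSG / df between) / (SSW / df within), i.e. the mean square between groups divided by the mean square error
- Both numerator and denominator are basically chi-square statistics → So F could be viewed as a fraction of two chi-square statistics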
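The whole ANOVA computation (SST, SSW, SSB, degrees of freedom and F) can be sketched as follows. The data set is an assumption: three samples of three points each, chosen so that the sample means are 2, 4 and 6 and the total sum of squares is 30, matching the numbers quoted in these notes.

```python
def one_way_anova(groups):
    """Sums of squares, degrees of freedom and F-statistic for one-way ANOVA."""
    all_points = [x for group in groups for x in group]
    grand_mean = sum(all_points) / len(all_points)
    group_means = [sum(group) / len(group) for group in groups]

    # SST: every data point against the grand mean.
    sst = sum((x - grand_mean) ** 2 for x in all_points)
    # SSW: every data point against its own sample mean.
    ssw = sum((x - m) ** 2 for group, m in zip(groups, group_means) for x in group)
    # SSB: every data point replaced by its sample mean, against the grand mean.
    ssb = sum(len(group) * (m - grand_mean) ** 2
              for group, m in zip(groups, group_means))

    df_total = len(all_points) - 1
    df_within = len(all_points) - len(groups)
    df_between = len(groups) - 1
    f_stat = (ssb / df_between) / (ssw / df_within)
    return {"sst": sst, "ssw": ssw, "ssb": ssb, "df_total": df_total,
            "df_within": df_within, "df_between": df_between, "f": f_stat}


# Assumed data with sample means 2, 4 and 6 (grand mean 4).
result = one_way_anova([[3, 2, 1], [5, 3, 4], [5, 6, 7]])
print(result)
```

For this data SST = SSW + SSB holds (30 = 6 + 24), which is a useful check on any hand calculation.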
How do we find the p-value corresponding to the F-Statistic with a 10% significance level using a table?
- Don't look at the actual fraction value but rather at the two degrees of freedom from the numerator and denominator.
- This table corresponds to an F-statistic based on a significance level of 10%
- As we computed an F-statistic of 12, we actually have strong statistical evidence that H0 can be rejected.
⇒ Any value greater than 3.46 for F would indicate a p-value smaller than 10%, thus would allow us to reject H0
What are multiple comparisons?
The scenario of testing many pairs of groups
What is the Bonferroni correction?
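The notes leave this answer open. A minimal sketch, assuming the standard definition: when K pairwise comparisons are made, each one is tested at the stricter significance level alpha / K, where K = k(k - 1) / 2 for k groups. The function name is hypothetical.

```python
def bonferroni_alpha(alpha, num_groups):
    """Number of pairwise comparisons and the corrected significance level."""
    num_comparisons = num_groups * (num_groups - 1) // 2
    return num_comparisons, alpha / num_comparisons


# Four groups lead to six pairwise comparisons.
num_comparisons, corrected_alpha = bonferroni_alpha(0.05, 4)
print(num_comparisons, corrected_alpha)
```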
8 - Introduction to Linear Regression
Line Fitting, residuals, and correlation
What are residuals?
What is correlation? What is a correlation coefficient? What values can it take on?
- The strength of the linear association between two variables
- how close the data points are to the linear approximation line → how well does that line represent the relationship between x and y
- takes values between -1 (perfect negative) and 1 (perfect positive)
- if 1 → an upward sloping line can perfectly represent the relationship between x and y
- If -1 → a downward sloping line can perfectly represent the relationship between x and y
What are positive and negative correlations?
Positive = upwards sloping
Negative = downwards sloping
What is the formula to compute the correlation coefficient?
- Compute the sample mean
- Compute the sample standard deviation
- subtract the sample mean from each data point and square the result
- sum those squares, divide that whole sum by one less than the sample size and take the square root
- For all data points compute the z-scores for x and y
- then apply the formula: r = 1/(n - 1) × sum of (z-score of x × z-score of y)
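The steps above can be sketched in a few lines. The four data points are an assumption, chosen so that the mean of x is 2 and the mean of y is 3 as in the example discussed in these notes.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation coefficient as the averaged product of z-scores."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample SDs (division by n - 1)
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)


# Assumed data points (mean of x = 2, mean of y = 3).
xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]
r = correlation(xs, ys)
print(r)
```

Points that lie on the means (z-score of 0 in x or y) contribute nothing to the sum, which is why only the first and last point matter in this small data set.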
Why do we multiply z-scores?
- For each data point we want to know how many standard deviations it is away from the means of both the x-values and the y-values.
- Here: x = 2 is the sample mean (vertical line); y = 3 is the other mean (horizontal line)
- the z-score for the first point wrt. x will be negative, as it is below the mean
- the z-score wrt. y will also be negative
⇒ two negative values yield a positive product + as the point is about one standard deviation away from the mean in both x and y, the product of the z-scores will be close to 1
⇒ not only visually but also computationally this point will contribute to a high correlation coefficient.
In a scatterplot what axis is usually the explanatory and what the response variable?
X-axis = explanatory variable
Y-axis = response variable
Fitting a line by least squares regression
What is the formula for a least squares line? What is a guaranteed point that you can use to get the y-intercept?
- the least squares line definitely goes through the point (mean of x, mean of y)
- the slope is generally described as the change in y over the change in x
- however, we need to keep in mind that there are residuals and the least squares line is only an estimate
- thus we take the standard deviation of y over the standard deviation of x and multiply it by the correlation coefficient we just computed: slope = r × s_y / s_x ⇒ This multiplication changes the numerator and thus the slope of the final line (here r = 0.946)
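A sketch of the line fit, assuming the usual formulas slope = r × s_y / s_x and intercept = mean of y - slope × mean of x (the line passes through the point of means). The data points are the same assumed example with means 2 and 3.

```python
from statistics import mean, stdev


def least_squares_line(xs, ys):
    """Slope and intercept of the least squares line via r, s_x and s_y."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    slope = r * s_y / s_x
    # The line is guaranteed to pass through (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Assumed data points (mean of x = 2, mean of y = 3).
slope, intercept = least_squares_line([1, 2, 2, 3], [1, 2, 3, 6])
print(slope, intercept)
```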
What conditions does the bivariate data need to fulfill in order to apply a least squares line?
- Linearity between explanatory and response variable
- e.g not:
- Nearly normal residuals → Most of the residuals' values should be centered around 0, following a normal distribution (use a histogram to see distribution)
- Constant variability → over the whole data set the variability should not increase with increasing x and y values
- e.g not:
What is the basic interpretation of the slope of a regression line?
For a one-unit increase in x, y is expected to change on average by the value of the slope
How do you compute the total squared error of a least squares line?
How do you compute the root mean square deviation (or standard deviation of the residuals)?
- Compute y-hat
- compute the squared difference between y-hat and the actual y values
- Sum those together, divide the sum by n-2 and take the square root of that
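The total squared error and the standard deviation of the residuals can be sketched as follows. The function names and data are hypothetical; the slope is computed here with the equivalent sums-of-products formula to keep the block short.

```python
import math


def fit_line(xs, ys):
    """Least squares slope and intercept from sums of products."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def squared_error(xs, ys):
    """Total squared error: sum of squared residuals (y - y-hat)."""
    slope, intercept = fit_line(xs, ys)
    y_hats = [intercept + slope * x for x in xs]
    return sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))


def residual_sd(xs, ys):
    """Standard deviation of the residuals: sqrt(squared error / (n - 2))."""
    return math.sqrt(squared_error(xs, ys) / (len(xs) - 2))


# Assumed data points, for illustration only.
xs, ys = [1, 2, 2, 3], [1, 2, 3, 6]
sse = squared_error(xs, ys)
sd_residuals = residual_sd(xs, ys)
print(sse, sd_residuals)
```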
What is a prediction?
Using the linear regression model to estimate the value of a response variable for a certain value of the explanatory variable
What is extrapolation?
Using the linear regression model to estimate values outside the realm of the given data → could be the intercept for a value of 0 for the explanatory variable
What is R^2? How do you compute it?
- the square of the correlation coefficient
- tells us what percent of the variability in the response (y) variable is explained by the model
- how much of the total variability is eliminated by using least squares regression instead of predicting without regression (e.g. without x-values)
- Without using regression on the x variable, our most reasonable estimate would be to simply predict the average of the y values → measure the residuals based on the mean line of y
- compared to linear regression
What is the relationship between the squared error of the regression line and the coefficient of determination?
- If the squared error of the regression line is small → the line is a good fit for the data
- the fraction (squared error of the line / total squared error around the mean of y) used to compute the c.o.d. will become very small
- Subtracting that fraction from 1 will yield a value close to 1
- in context that means that a large percentage of the variability in the y-values can be described by the change occurring in x
- and vice versa
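The relationship between the two squared errors and the coefficient of determination can be sketched as R^2 = 1 - (squared error of the line / squared error around the mean of y). The function name and data are hypothetical.

```python
def r_squared(xs, ys):
    """Coefficient of determination from the two squared errors."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    # Squared error when predicting with the regression line.
    se_line = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Squared error when always predicting the mean of y (no regression).
    se_mean = sum((y - y_bar) ** 2 for y in ys)
    return 1 - se_line / se_mean


# Assumed data points, for illustration only.
r2 = r_squared([1, 2, 2, 3], [1, 2, 3, 6])
print(r2)
```

For this data the result equals the square of the correlation coefficient (25/28), as the first bullet of the R^2 question states.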
Types of Outliers in linear Regression
What are influential points?
Outliers that lie horizontally far away from the center of the cloud and have a big influence on the slope of the regression line.
What is the effect of removing outliers on r, r-sq and the slope?
- r and r-sq can increase or decrease, but both simultaneously
- the slope will change, but the change is dependent on the influence of the outlier
Inference for linear regression
What are the conditions for making inferences on the slope of a regression line?
- Normal distribution of y values for any given x value
- Equal variance → Each of the normal distributions should have similar spread
- Random → data from well-designed random sample or experiment
What are we testing with a hypothesis test and linear regression?
- For H0, we assume that there is neither a positive nor a negative relationship between explanatory and response variable
- In other words the slope of the regression line is 0
- Assuming H0, we can then compute a t-statistic and the corresponding p-value, which gives us the probability of observing a regression line with a slope at least as extreme as the one we constructed using the data from ONE specific sample
What is SE Coef?
The standard error of the coefficient we care about
- the coefficient we care about is the one in front of the explanatory variable
- this coefficient is the slope, thus the SE Coef is the standard deviation of the sampling distribution of the slope of the regression line
How do you construct a confidence interval to determine the range which includes the actual slope of the population regression line?
Point Estimate +/- t*(df) * SE Coef (of the coefficient we care about)
- df = n - 2
How do you compute the t-value in a hypothesis test?
- t = (b1 - 0) / SE Coef
- where b1 is the slope of our regression line and, as we assume that the true slope of the regression line is 0, we subtract 0 from that
- we divide that by the SE Coef of the variable we care about.
How can we use the construction of a confidence interval to reject or fail to reject H0?
- We assume H0 is true and for linear regression we assume thus a slope of 0
- If we now construct a confidence interval based on the sample data and this interval does not contain 0, we have statistical evidence that H0 can be rejected.
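Statistical software normally reports SE Coef directly. As a sketch of what happens behind it, the block below assumes the standard formula SE(b1) = s / sqrt(sum of (x - mean of x)²), where s is the standard deviation of the residuals. The critical value t* must come from a t-table; 4.303 is the value for df = 2 and 95% confidence. Names and data are hypothetical.

```python
import math


def slope_inference(xs, ys, t_star):
    """t-statistic for H0: slope = 0 and a confidence interval for the slope."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))       # df = n - 2
    se_b1 = s / math.sqrt(sxx)         # the "SE Coef" of the slope
    t = (b1 - 0) / se_b1               # null value: slope of 0
    ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)
    return b1, se_b1, t, ci


# Assumed data with n = 4, so df = 2 and t* = 4.303 for a 95% interval.
b1, se_b1, t, ci = slope_inference([1, 2, 2, 3], [1, 2, 3, 6], t_star=4.303)
print(b1, se_b1, t, ci)
```

With so few points the interval contains 0 and the t-statistic stays below t*, so both approaches agree that H0 cannot be rejected at the 5% level.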
9 - Multiple and Logistic Regression
When do we have multiple regression?
If there is still one response variable, but more than one predictor (explanatory variable)
- there are often many variables that can impact an outcome.
- Looking at one specific variable, the impact on the response variable can change a lot if we also consider other variables
What is collinearity?
The fact that two predictor variables are correlated
- ideally, we only want to add predictors to our study that are independent from each other
- thus experimental design is necessary for that
What does a multiple regression line formula look like?
Recall the basic formula to compute R^2
- Variability in residuals can be computed as:
- Variability in the outcome concerns the variability in the sample data for the response variable (y-values) and can be computed as:
How do we need to adjust the formula to better explain the observed variance?
Looking at the computer regression output, how do you identify the reference explanatory variable?
- The variable that is not shown in the table
- here on top of volume, we consider both hardcover and paperback as explanatory variable for book weight. As there is no row for cover:hard, this is our reference value
What is the effect on R and R^2 when adding more variables?
- If the added variable provides new information, R and R^2 increase → our model becomes a better explanation for the spread in y values that we observe
- If the added variable does not provide new information, there will be (almost) no change; R^2 never decreases when a variable is added
Is adjusted R^2 greater or smaller than normal R^2?
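The notes leave the adjustment formula open. A minimal sketch, assuming the standard definition adjusted R^2 = 1 - (1 - R^2) × (n - 1) / (n - k - 1), where n is the number of observations and k the number of predictors. The function name and inputs are hypothetical.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for a model with k predictors fitted on n observations."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Hypothetical model: R^2 = 0.5, 21 observations, 4 predictors.
adj = adjusted_r_squared(0.5, n=21, k=4)
print(adj)
```

Since (n - 1) / (n - k - 1) is greater than 1 for any k ≥ 1, the adjusted value is smaller than the normal R^2 whenever R^2 is below 1.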
What does model selection help us with?
Eliminating variables from a model that are less important and do not provide any significant information
- this improves the overall prediction quality of the model
What are the hypotheses for testing the significance of a predictor?
How can we use the p-value returned from the statistical software to determine the significance of a variable?
- we can eliminate variables with high p-values (or t-values close to 0)
- t-values close to 0 indicate that the slope observed for this certain variable is quite likely assuming that H0 is true
- t-values close to 0 yield high p-values, as the tail area beyond them under the t-curve is large
How does backward-selection work?
- We start with the full model including all predictors and step by step remove the variable with the highest p-value, refitting the model after each removal
- we stop as soon as all variables with p-values far away from 0 are removed.
How does forward-selection work?
- we start with the model only including variables with p-values of 0 or very close to 0
- we then step by step add new variables with p-values close to 0
- for each model compute adjusted R^2
- stop adding variables once adjusted R^2 does not seem to increase significantly anymore.
- in context: added variables do not contribute to a better explanation of the variability in y-values anymore
Checking model conditions using graphs
What are 4 requirements to construct multiple regression models?
- Residuals of the model are nearly normal
- Variability of the residuals is nearly constant
- Residuals are independent
- Each variable is linearly related to the outcome
What graph can you use to check normality of residuals? What are the axes?
- most of the residuals should be centered around a distance of 0 from the actual curve
- be on the lookout for outliers, as they will heavily change the graph above.
What graph can you use to check for constant variability of residuals?
- the variability of residuals should remain the same even if we increase the values on the x-axis.
- the variability should neither increase nor decrease
What graph can you use to check independence between residuals?
What graphs can you use to check the linear relationship between predictor and response variable? What graph is suited for categorical, what graph for numerical data?
When are normally distributed residuals impossible?
The response variable is categorical with two levels
- logistic regression is used to address this problem with binary variables
What are odds and how do you compute them?
What is logistic regression?
- type of generalized linear model for response variables where regular multiple regression does not work
- our response variable Yi takes the value 1 with a probability of pi and a value of 0 with a probability of 1 - pi
- we model the probability pi in relation to the predictor variables
What method do we use to transform the regular regression equation?
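The notes leave both answers open. A minimal sketch, assuming the standard definitions: odds = p / (1 - p), and the logit transformation log(p / (1 - p)), which maps a probability between 0 and 1 onto the whole number line so that it can be modeled with a linear equation. The function names are hypothetical.

```python
import math


def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)


def logit(p):
    """Logit transformation: the natural log of the odds."""
    return math.log(odds(p))


def inverse_logit(log_odds):
    """Back-transformation from log-odds to a probability."""
    return 1 / (1 + math.exp(-log_odds))


# A probability of 0.8 corresponds to odds of about 4 (4 to 1).
example_odds = odds(0.8)
round_trip = inverse_logit(logit(0.3))
print(example_odds, logit(0.5), round_trip)
```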
What are the 2 conditions to construct a logistic regression model?
- The model relating the parameter pi to the predictors x1i, x2i, x3i,... must closely resemble the true relationship between the parameter and the predictors
- The outcome for each case Yi must be independent of the outcomes for the other cases.