class: center, middle, inverse, title-slide

.title[
# Sampling Distributions
🐱
]

.author[
### S. Mason Garrison
]

---
layout: true

<div class="my-footer">
  <span>
    <a href="https://psychmethods.github.io/coursenotes/" target="_blank">Methods in Psychological Research</a>
  </span>
</div>

---
class: middle

# Sampling Distributions

---
class: middle

# Road Map

- Sampling distributions: what and why
- The sampling distribution of the mean
- Laws that describe distribution shape
  - Law of Large Numbers
  - Central Limit Theorem
- Using sampling distributions for inference

---

# Sampling Distribution

- Simple Idea:
  - The distribution of a statistic
  - Compare it to the distribution of a given variable

--

- Formal Definition:
  - A sampling distribution is the distribution of values taken by a statistic across all possible samples of a given size, drawn from a particular population

---

## All statistics have a sampling distribution

- Mean, mode, third quartile, upper extreme, etc.
- Each statistic has its own sampling distribution
- Terminology:
  - `\(\mu_{x}\)`, `\(\sigma_{x}\)` = parameters of the raw score distribution
  - `\(\mu_{\bar{x}}\)`, `\(\sigma_{\bar{x}}\)` = parameters of the sampling distribution of that particular statistic

---

## Population Distributions are different

- Be careful: The population distribution describes the individuals that make up the population.
- A sampling distribution describes how a statistic varies across many samples from the population.

---

## Sampling distribution: Illustration

<img src="data:image/png;base64,#../img/distribution_flow_chart.png" width="95%" style="display: block; margin: auto;" />

---

# Example: Sampling Distribution of the Mean

- SAT Scores at WFU
  - `\(\mu_{x}\)` Population mean = 1500
  - `\(\sigma_{x}\)` Population standard deviation = 100

<br>

- Sampling distribution of the mean of 25 WFU student scores
  - `\(\mu_{\bar{x}} = \mu_{x} = 1500\)`
  - `\(\sigma_{\bar{x}} = \frac{\sigma_{x}}{\sqrt{n}} = \frac{100}{\sqrt{25}} = 20\)`

---

# General Equation

.pull-left-narrow[
- `\(\mu_{\bar{x}} = \mu_{x}\)`
- `\(\sigma_{\bar{x}} = \frac{\sigma_{x}}{\sqrt{n}}\)`
- `\(\sigma^2_{\bar{x}} = \frac{\sigma^2_{x}}{n}\)`
]

.pull-right-wide[
- Because the mean of the statistic `\(\bar{x}\)` is always equal to the mean `\(\mu\)` of the population
  - (that is, the sampling distribution of the mean is centered at `\(\mu\)`),
- we say the statistic `\(\bar{x}\)` is an unbiased estimator of the parameter `\(\mu\)`
- Note: in any particular sample, `\(\bar{x}\)` may fall above or below `\(\mu\)`
]

---

# Example: Sampling Distribution of the Mean

.pull-left[
- Population mean `\(\mu\)` = 25
- Population standard deviation `\(\sigma\)` = 10

<img src="data:image/png;base64,#sampling-distributions_files/figure-html/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto;" />
]

.pull-right[
- SRS of size 10: `\(\bar{x}\)` = 26.91
- SRS of size 10: `\(\bar{x}\)` = 26.96
- SRS of size 10: `\(\bar{x}\)` = 24.49
- SRS of size 10: `\(\bar{x}\)` = 20.83

<img src="data:image/png;base64,#sampling-distributions_files/figure-html/unnamed-chunk-4-1.png" width="90%" style="display: block; margin: auto;" />
]

---

# Standard Deviation of the Sampling Distribution

- Because the standard deviation of the sampling distribution of `\(\bar{x}\)` is
`$$\sigma_{\bar{x}} = \frac{\sigma_{x}}{\sqrt{n}}$$`
- averages are less variable than individual observations, and
- averages are less variable than the results of small samples.
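---

# Checking the Formulas by Simulation

- A minimal simulation sketch (not part of the original course code): it assumes the Normal population from the previous example, with `\(\mu = 25\)` and `\(\sigma = 10\)`, draws many SRSs of size 10, and checks that the simulated sample means are centered at `\(\mu\)` with standard deviation close to `\(\sigma/\sqrt{n}\)`.

``` r
# Sketch: simulate the sampling distribution of the mean
# for SRSs of size n = 10 from a Normal(25, 10) population
set.seed(123)
n     <- 10      # sample size
reps  <- 10000   # number of simulated samples
xbars <- replicate(reps, mean(rnorm(n, mean = 25, sd = 10)))

mean(xbars)   # close to mu = 25
sd(xbars)     # close to sigma / sqrt(n) = 10 / sqrt(10), about 3.16
10 / sqrt(n)  # theoretical standard deviation of the sampling distribution
```

- Doubling `n` to 20 does not halve `sd(xbars)`; quadrupling it to 40 does, which previews the `\(\sqrt{n}\)` point on the next slide.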
---

# Standard Deviation of the Sampling Distribution

- Not only is the standard deviation of the distribution of `\(\bar{x}\)` smaller than the standard deviation of individual observations,
  - it gets smaller as we take larger samples.
- The results of large samples are less variable than the results of small samples.
- Although the standard deviation of the distribution of `\(\bar{x}\)` gets smaller,
  - it does so at the rate of `\(\sqrt{n}\)`, not `\(n\)`.
- To cut the sampling distribution’s standard deviation in half, for instance,
  - you must take a sample four times as large, not just twice as large.

---

# Wrapping Up...

---

# This Time...

- Laws that describe the shape of distributions

---

# Laws that describe the shape of distributions

- Law of Large Numbers
- Central Limit Theorem

---

# Law of Large Numbers

- Although a single statistic, like `\(\bar{x}\)`, is rarely identical to the population mean, we can still use it as a reasonable estimate of the population mean.
- Why?

--

- If we keep taking larger and larger samples, the statistic `\(\bar{x}\)` is guaranteed to get closer and closer to the parameter `\(\mu\)`

---

# Law of Large Numbers

.pull-left[
<img src="data:image/png;base64,#sampling-distributions_files/figure-html/unnamed-chunk-5-1.png" width="90%" style="display: block; margin: auto;" />
]

.pull-right[
- As the size of the sample increases, the sample mean gets closer to the population mean
- Note: This is a probabilistic statement, not a deterministic one
]

---

# Law of Large Numbers

- As the size of the sample increases, the sample mean gets closer to the population mean
- The sample mean is an unbiased estimator of the population mean
- The larger the sample size, the more likely the sample mean is to be close to the population mean
- Example: Rolling a die
  - The average of a large number of independent observations from the same population is likely to be close to the population mean
  - The average of a small number of independent observations from the same population is not likely to be close to the population mean
- Note: This is a probabilistic statement, not a deterministic one

---

# Law of Large Numbers in Action

<img src="data:image/png;base64,#sampling-distributions_files/figure-html/unnamed-chunk-6-1.png" width="95%" style="display: block; margin: auto;" />

---

# Central Limit Theorem

- Similarly, the Central Limit Theorem describes the shape of a sampling distribution.
- Draw an SRS of size `\(n\)` from any population with mean `\(\mu\)` and finite standard deviation `\(\sigma\)`.
- The central limit theorem says that when `\(n\)` is large, the sampling distribution of the sample mean `\(\bar{x}\)` is approximately Normal:
  - `\(\bar{x}\)` is distributed as `\(N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)\)`

---

# Central Limit Theorem

- This holds no matter what the shape of the raw score distribution is.
- The central limit theorem allows us to use normal probability calculations to answer questions about sample means from many observations,
  - even when the population distribution is not normal (e.g., skewed, uniform, bimodal)

---

# Reiterate

- With a simple random sample, the sampling distribution of the mean is (approximately) normal if:
  - The raw score distribution is normal; OR
  - The sample size is “large”
    - `\(n = 30\)` is usually large enough
- The central limit theorem does not apply when the sample size is small
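---

# CLT: A Quick Simulation

- A minimal sketch (not part of the original course code): it assumes a strongly right-skewed exponential population with mean 1 and standard deviation 1, and shows that the means of large SRSs from it are nonetheless approximately normally distributed, as the central limit theorem predicts.

``` r
# Sketch: sample means from a skewed (exponential) population
set.seed(456)
n     <- 40      # a "large" sample size
reps  <- 10000   # number of simulated samples
xbars <- replicate(reps, mean(rexp(n, rate = 1)))  # population mean = 1, sd = 1

hist(xbars, breaks = 50,
     main = "Means of SRSs (n = 40) from a right-skewed population")
mean(xbars)  # close to the population mean of 1
sd(xbars)    # close to 1 / sqrt(40), about 0.16
```

- The individual observations are far from normal, but their means pile up symmetrically around `\(\mu = 1\)`.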
---

# Example: CLT

- The distribution of household income in the U.S. is strongly right-skewed, with a mean of about $54,000 and a standard deviation of about $65,000.
- .question[If we take a random sample of 45 households, what is the probability that the mean household income for the sample is more than $60,000?]

--

- Because the sample size is large, we can use the central limit theorem to answer this question.

--

.pull-left[
- The sampling distribution of the mean is approximately normal with:
  - Mean = $54,000
  - Standard deviation = `\(\frac{65000}{\sqrt{45}} \approx\)` 9689.63
]

--

.pull-right[
- We want to find `\(P(\bar{x} > 60000)\)`
]

---

.pull-left-wide[
- We want to find `\(P(\bar{x} > 60000)\)`
- `\(Z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{60000 - 54000}{65000/\sqrt{45}} \approx\)` 0.62

.medi[

``` r
round((60000 - 54000)/(65000/sqrt(45)),4)
```

```
## [1] 0.6192
```
]]

.pull-right-narrow.small[
> Btw... AI guessed: `\(\approx 0.41\)`
]

---

.pull-left-wide[
- `\(P(\bar{x} > 60000) = P(Z > 0.62)\)`
- `\(= 1 - P(Z < 0.62)\)`

.medi[

``` r
round(pnorm((60000 - 54000)/(65000/sqrt(45))),4)
```

```
## [1] 0.7321
```
]

- 1 - 0.7321 = 0.2679

.medi[

``` r
round(1-pnorm((60000 - 54000)/(65000/sqrt(45))),4)
```

```
## [1] 0.2679
```
]
]

.pull-right-narrow.small[
> AI guessed: 0.3409
]

---

# Example: CLT

.pull-left-narrow[
<img src="data:image/png;base64,#sampling-distributions_files/figure-html/unnamed-chunk-10-1.png" width="90%" style="display: block; margin: auto;" />
]

.pull-right-wide[
- Based on service records from the past year, the time (in hours) that a technician requires to complete preventative maintenance on an air conditioner follows a distribution that is strongly right-skewed, with its most likely outcomes close to 0. The mean time is `\(\mu = 1\)` hour and the standard deviation is `\(\sigma = 1\)` hour.
]

- .question[Your company will service an SRS of 70 air conditioners. You have budgeted 1.1 hours per unit. Will this be enough?]

--

The central limit theorem states that the sampling distribution of the mean time spent working on the 70 units has:

- Mean = 1 hour
- Standard deviation: `\(\sigma_{\bar{x}} = \frac{\sigma_{x}}{\sqrt{n}} = \frac{1}{\sqrt{70}} \approx\)` 0.12
- and is approximately normal because the sample size is large.

---

# Will this be enough?

.pull-left-wide[
- We want to find `\(P(\bar{x} > 1.1)\)`
- The sampling distribution of the mean time spent working is approximately `\(N(1, 0.12)\)` since `\(n = 70 \ge 30\)`.
]

.pull-right-narrow[
<img src="data:image/png;base64,#sampling-distributions_files/figure-html/clt_schematic-1.png" width="95%" style="display: block; margin: auto auto auto 0;" />
]

--

$$ Z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{1.1 - 1}{\frac{1}{\sqrt{70}}} \approx 0.8367 $$

``` r
round((1.1 - 1)/(1/sqrt(70)),4)
```

```
## [1] 0.8367
```

---

$$ P(\bar{x} > 1.1) = P(Z > 0.8367) $$

$$ = 1 - P(Z < 0.8367) = 0.2014 $$

``` r
round(1-pnorm((1.1 - 1)/(1/sqrt(70))),4)
```

```
## [1] 0.2014
```

If you budget 1.1 hours per unit, there is a 20% chance the technicians will not complete the work within the budgeted time.

---

# Wrapping Up...

---

# Sampling distributions & statistical significance

---

# Statistical significance

.pull-left[
- We have looked carefully at the sampling distribution of a sample mean.
- However, any statistic we can calculate from a sample will have a sampling distribution.
- The sampling distribution of a statistic allows us to determine how likely a particular value of the statistic is.
- For example, here is the sampling distribution of the sample standard deviation and variance for samples of size 5 from a normal population with mean 0 and standard deviation 1.
]

.pull-right[
<img src="data:image/png;base64,#sampling-distributions_files/figure-html/unnamed-chunk-13-1.png" width="90%" height="20%" style="display: block; margin: auto;" />

<img src="data:image/png;base64,#sampling-distributions_files/figure-html/unnamed-chunk-14-1.png" width="90%" height="20%" style="display: block; margin: auto;" />
]
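---

# Simulating the Sampling Distribution of a Spread Statistic

- A minimal sketch of the kind of simulation behind plots like these (not the course's own code, and the cutoff at the end is an arbitrary illustration): draw 1,000 samples of size 5 from `\(N(0, 1)\)`, compute each sample's standard deviation and variance, and use the simulated distribution to ask how rare a particular value would be.

``` r
# Sketch: sampling distributions of the sample SD and variance, n = 5
set.seed(789)
samples    <- replicate(1000, rnorm(5, mean = 0, sd = 1), simplify = FALSE)
sample_sd  <- sapply(samples, sd)
sample_var <- sapply(samples, var)

hist(sample_sd,  breaks = 30, main = "Sample SDs (n = 5)")
hist(sample_var, breaks = 30, main = "Sample variances (n = 5)")

# How often does the sample variance exceed an (arbitrary) cutoff?
mean(sample_var > 4)
```

- Counting how many simulated statistics fall beyond a cutoff is exactly the logic used to call an observed value statistically significant, as the next slides discuss.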
---

# The sampling distribution of a sample statistic is determined by:

- the particular sample statistic we are interested in,
- the distribution of the population of individual values from which the sample statistic is computed, and
- the method by which samples are selected from the population.

---

# Sampling distributions and statistical significance

- The sampling distribution allows us to determine the probability of observing any particular value of the sample statistic in another such sample from the population.
- We said that an observed effect so large that it would rarely occur by chance is called statistically significant.

---

# Consider This…

- We may decide, based on our observed set of 1000 samples, that because we saw only 2 with variances above 200, observing a variance that large is a statistically significant event.

---

# Wrapping Up...
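---

# Appendix: Law of Large Numbers in Code

- A minimal sketch revisiting the Law of Large Numbers slides (not part of the original course code; the die-rolling setup is just an illustration): as more rolls accumulate, the running mean settles down near the population mean of 3.5.

``` r
# Sketch: running mean of die rolls converging toward 3.5
set.seed(101)
rolls        <- sample(1:6, size = 5000, replace = TRUE)
running_mean <- cumsum(rolls) / seq_along(rolls)

plot(running_mean, type = "l",
     xlab = "Number of rolls", ylab = "Mean of rolls so far")
abline(h = 3.5, lty = 2)  # population mean of a fair die
```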