Normal Distributions and Rescaling

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://psychmethods.github.io/coursenotes/" target="_blank">Methods in Psychological Research</a>
</span>
</div>

---

# Normal Distribution

---

## Normal Distribution

- A Normal distribution is a bell-shaped curve that models many natural and social phenomena. 
- It is defined mathematically by the formula:

`$f(x)= \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2}}$`

- The formula has two parameters:
    - `$\mu$`: the mean, locating the center of the distribution.
    - `$\sigma$`: the standard deviation, determining the spread.
- The *standard normal* is a special case with `$\mu = 0$` and `$\sigma = 1$`.
- With these values the density simplifies to `$f(z)= \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}z^{2}}$`.
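
As a quick check of the formula, the sketch below codes the density by hand and compares it with R's built-in `dnorm()`. The name `normal_density` is made up for this illustration.

``` r
# Hand-coded Normal density, following the formula above
# (normal_density is an illustrative name, not a built-in function)
normal_density <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma)^2)
}

x <- c(-1, 0, 2.5)
# Should agree with dnorm() for any mean and SD
all.equal(normal_density(x, mu = 1, sigma = 2), dnorm(x, mean = 1, sd = 2))
# Height of the standard normal at its center: 1 / sqrt(2 * pi)
normal_density(0)
```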

---

## Visualizing the Standard Normal

- The mean `$\mu$` is located at the center of the symmetric curve and is the same as the median. 
- Changing `$\mu$` (without changing `$\sigma$`) moves the Normal curve along the horizontal axis without changing its variability.

.small[
<img src="data:image/png;base64,#rnorms_files/figure-html/norm-1.png" width="55%" style="display: block; margin: auto;" />
]

---

``` r
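#### Normal Distribution
# Assumed setup (the original setup chunk is not shown on the slides):
# a grid of x values, the standard normal density, and a color palette
X <- seq(-10, 10, length.out = 500)
HX <- dnorm(X)
norm_colors <- c("#e41a1c", "#377eb8", "#4daf4a", "#984ea3")
locations <- c(2, 4, -2)

# Display the normal distributions with various means
# (the dashed curve is the standard normal, for reference)
plot(X, HX, type = "l", lty = 2, xlab = "x value",
     ylab = "Density",
     main = "Normal Distributions with Different Means",
     xlim = c(-5, 7))

for (i in seq_along(locations)) {
  lines(X, dnorm(X, mean = locations[i]),
        lwd = 2,
        col = norm_colors[i])
}
legend("topright",
       legend = paste("mu =", locations),
       col = norm_colors[seq_along(locations)], lwd = 2, bty = "n")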
```

<img src="data:image/png;base64,#rnorms_files/figure-html/unnamed-chunk-2-1.png" width="40%" style="display: block; margin: auto;" />

---

# Normal curves with different standard deviations

.pull-left[
- The standard deviation `$\sigma$` controls the variability of a Normal curve. 
- When the standard deviation is larger, the area under the normal curve is less concentrated about the mean.
- The standard deviation is the distance from the center to the change-of-curvature (inflection) points: they lie `$\sigma$` units on either side of `$\mu$`.
]
.pull-right.small[
<img src="data:image/png;base64,#rnorms_files/figure-html/zsd-1.png" width="90%" style="display: block; margin: auto;" />
]
---

``` r
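# Display the normal distributions with various standard deviations
# (X, HX, and norm_colors are assumed to be defined as for the previous plot;
#  norm_colors needs at least four entries)
sds <- c(.5, 2, 4, 6)
plot(X, HX, type = "l", lty = 2, xlab = "x value",
     ylab = "Density", main = "Comparison of Normal Distributions", xlim = c(-10, 10))

for (i in seq_along(sds)) {
  # Index colors by position, not by the SD value itself
  lines(X, dnorm(X, sd = sds[i]), lwd = 1, col = norm_colors[i])
}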
```

<img src="data:image/png;base64,#rnorms_files/figure-html/unnamed-chunk-3-1.png" width="40%" style="display: block; margin: auto;" />
---

# Empirical rule (68–95–99.7)

.pull-left[
- In the Normal distribution, with mean `$\mu$`  and standard deviation `$\sigma$`:
    - approximately `$68\%$` of the observations fall within 1 `$\sigma$` of `$\mu$`
    - approximately `$95\%$` of the observations fall within 2 `$\sigma$` of `$\mu$`
    - approximately `$99.7\%$` of the observations fall within 3 `$\sigma$` of `$\mu$`
- This property is often called the `68-95-99.7 Rule`.
]
.pull-right[
<img src="data:image/png;base64,#../img/normal.png" width="95%" style="display: block; margin: auto;" />
]
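
The rule's percentages can be checked against the standard Normal with `pnorm()`. This is a minimal sketch; the variable names are chosen here for illustration.

``` r
# Area within k standard deviations of the mean, for k = 1, 2, 3
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)
round(within_k_sd, 4)
# 0.6827 0.9545 0.9973
```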

---

# Worked Example (ITBS)

.pull-left[
- The Iowa Test of Basic Skills (ITBS) is a standardized test used in many U.S. schools to assess students' academic skills.
- We will use the vocabulary subtest scores for seventh-grade students in Gary, Indiana.
- Vocabulary scores are approximately Normal: `N(6.84, 1.55)`.
- In this distribution, `$\mu = 6.84$` and `$\sigma = 1.55$`
]

---

# Worked Example (ITBS)

.pull-left[
- Sketch the Normal density curve for this distribution.
  - Q. What percent of ITBS scores are between 3.74 and 9.94?
  - Q. What percent of ITBS scores are below 3.74?
]
--

<img src="data:image/png;base64,#../img/norms.png" width="55%" style="display: block; margin: auto;" />

---
# Worked Example (ITBS)

## Check your understanding
.question[What percent of the scores is above 5.29?]
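
After working the three ITBS questions with the 68-95-99.7 rule, the answers can be compared with exact Normal areas. The sketch below assumes the N(6.84, 1.55) model stated earlier; exact areas differ slightly from the rule's rounded figures.

``` r
mu <- 6.84
sigma <- 1.55

# Between 3.74 and 9.94 (mu - 2*sigma to mu + 2*sigma): the rule says about 95%
p_between <- pnorm(9.94, mean = mu, sd = sigma) - pnorm(3.74, mean = mu, sd = sigma)
# Below 3.74 (more than 2 SD below the mean): the rule says about 2.5%
p_below <- pnorm(3.74, mean = mu, sd = sigma)
# Above 5.29 (mu - sigma): the rule says about 84%
p_above <- pnorm(5.29, mean = mu, sd = sigma, lower.tail = FALSE)

round(c(between = p_between, below = p_below, above = p_above), 3)
```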

---
# Standard Normal in practice

The Normal distribution is an approximate model: useful, but not exact.
- Often reasonable for:
    - Physical features
    - Psychological features
    - Performance measures
- Not appropriate for:
    - Skewed variables (e.g. income)
    - Any count variable (number of kids, mistakes on an exam)

---

# Real Data Overlays

.pull-left[
- Many, but not all, variables are approximately Normal.
- Below are histograms from datasets we've used, with Normal overlays using sample mean and SD.
]

--
.small.pull-right[
<img src="data:image/png;base64,#rnorms_files/figure-html/example-1.png" width="90%" style="display: block; margin: auto;" />
]
---

# Height of children (Galton)
.small[

``` r
library(HistData)
library(ggplot2)
library(tidyverse)
# Bin width in inches. The Normal curve is scaled by n * bw so that it is
# on the same count scale as the histogram bars.
bw <- 1
Galton %>%
  ggplot(aes(x = child)) +
  geom_histogram(fill = "#e41a1c", binwidth = bw, color = "white") +
  stat_function(
    fun = function(x, mean, sd, n) n * bw * dnorm(x, mean, sd),
    args = with(Galton, list(mean = mean(child), sd = sd(child), n = length(child)))) +
  labs(x = "Heights of children", y = "Count",
       title = "Galton children's heights with Normal overlay") +
  theme_minimal()
```

<img src="data:image/png;base64,#rnorms_files/figure-html/unnamed-chunk-8-1.png" width="45%" style="display: block; margin: auto;" />
]

---

# IMDb movie ratings (movies dataset)
.pull-left.midi[

``` r
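library(ggplot2)
library(ggplot2movies)
data(movies)

# Histogram of ratings, overlaid with a frequency polygon of
# simulated Normal values that share the sample mean and SD
ggmovie <- ggplot(movies,
                  aes(x = rating)) +
  geom_histogram(fill = "blue") +
  geom_freqpoly(aes(
    x = rnorm(length(rating))*sd(rating) + mean(rating)),
    color = "black") + # freqpoly lines take color, not fill
  scale_x_continuous("IMDb Movie Ratings") +
  theme_minimal()
ggmovie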
```
]
.pull-right[
<img src="data:image/png;base64,#rnorms_files/figure-html/unnamed-chunk-9-1.png" width="90%" style="display: block; margin: auto;" />
]
---

# Temperature in Nottingham (nottem)

---

# Standard Normal quick facts

- Normal distribution facts
    - Symmetric about the mean
        - Mean = Median = Mode
    - 50% of the area lies above the mean (above zero for the standard Normal)
    - Total area under the curve is 1.0 (or 100%)
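
These facts can be confirmed numerically for the standard Normal. A minimal sketch using base R only; the variable names are illustrative.

``` r
# Half of the area lies above the mean (zero for the standard Normal)
area_above_zero <- pnorm(0, lower.tail = FALSE)
area_above_zero

# The total area under the density curve is 1 (numerical integration)
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value
total_area

# Symmetry: the density is the same at -a and +a
all.equal(dnorm(-1.3), dnorm(1.3))
```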
    
---

# Area under the Normal curve

---

# Wrapping Up...

---

# Rescaling

---

# Rescaling

- All Normal distributions are the same 
  - if we measure in units of size `$\sigma$` from the mean `$\mu$` as center.
- We can convert any variable into the same metric as the standard Normal.
- Changing to these units is called standardizing (a form of rescaling).

---

# Z-Score

- Z-score describes the location of the raw score in terms of distance from the mean, 
  - measured in standard deviations
- A z-score tells us how many standard deviations a given value (or score) is from the mean

--
.pull-right[
- Population
- `$z_{i}$` = `$\frac{x_{i}-\mu}{\sigma}$`
]

---

# Advantages of standardization
.pull-left[
- Allows us to compare scores on a common metric
    - The origin is 0 (the mean)
    - The unit is 1 (one standard deviation)
    - '+' values lie above the mean
    - '-' values lie below the mean
]

.pull-right[
- We can compare across measurement scales
    - Shape of the distribution does NOT CHANGE
- We can go from z-scores to raw scores
]
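
A small sketch of both directions, standardizing and then going back to raw scores. The raw scores below are made-up values used only for illustration.

``` r
# Illustrative raw scores (made-up values, not course data)
x <- c(52, 60, 61, 68, 74, 85)

# Standardize: subtract the mean, divide by the standard deviation
z <- (x - mean(x)) / sd(x)

# z-scores have mean 0 and SD 1 (up to rounding error)
round(c(mean = mean(z), sd = sd(z)), 10)

# scale() performs the same computation (it returns a one-column matrix)
all.equal(as.numeric(scale(x)), z)

# Go back from z-scores to raw scores
x_back <- mean(x) + z * sd(x)
all.equal(x_back, x)
```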

---

# Demonstration

``` r
library(ggplot2movies)

# Raw vs scaled
data(movies)
variable <- movies$rating
# Raw data
head(variable, 10)
```

```
##  [1] 6.4 6.0 8.2 8.2 3.4 4.3 5.3 6.7 6.6 6.0
```

``` r
# Rescaling
head(scale(variable), 10)
```

```
##              [,1]
##  [1,]  0.30079877
##  [2,]  0.04323788
##  [3,]  1.45982279
##  [4,]  1.45982279
##  [5,] -1.63090793
##  [6,] -1.05139592
##  [7,] -0.40749368
##  [8,]  0.49396944
##  [9,]  0.42957922
## [10,]  0.04323788
```

- Mean = 5.93 (SD  = 1.55)

| Raw| Z_Score|
|---:|-------:|
| 6.4|    0.30|
| 6.0|    0.04|
| 8.2|    1.46|
| 8.2|    1.46|
| 3.4|   -1.63|
| 4.3|   -1.05|
| 5.3|   -0.41|
| 6.7|    0.49|
| 6.6|    0.43|
| 6.0|    0.04|

---

# Density: raw vs scaled

``` r
plot(density(variable)) # no scaling
```

<img src="data:image/png;base64,#rnorms_files/figure-html/unnamed-chunk-13-1.png" width="90%" style="display: block; margin: auto;" />

``` r
plot(density(scale(variable))) # with scaling
```

<img src="data:image/png;base64,#rnorms_files/figure-html/unnamed-chunk-14-1.png" width="90%" style="display: block; margin: auto;" />

---

# Z-Score
- A z-score is a type of standard score
- It gives us information about the location of a score relative to the "average" deviation of all scores
- A z-score is the number of standard deviations a score is above or below the mean of the scores in a distribution
- A raw score is a score before it has been converted into a z-score
- Raw scores on very different variables can be converted into z-scores and directly compared

---

# Worked Z-Score Problem (IQ)

- IQ test scores of 31 7th-grade girls in a Midwest school district. 
<br>

---

# Check approximate Normality

A) We expect IQ scores to be approximately Normal.

- Make a stem plot to check that there are no major departures from normality.

---

# Mean and SD

B) Find the mean and standard deviation
--

- Mean = `$\bar{x} = \frac{\sum x_{i}}{n}$` = 3281/31 = 105.84
- SD = `$s = \sqrt{\frac{\sum^{n}_{i=1}(x_{i}-\bar{x})^{2}}{n-1}}$` = `$\sqrt{\frac{\sum^{n}_{i=1}(x_{i}-105.84)^{2}}{30}}$` = 14.27

---

# Within one SD

C) What proportion of scores are within one standard deviation of the mean?
- One SD above mean = 105.84 + 14.27 = 120.11
- One SD below mean = 105.84 - 14.27 = 91.57
- 23/31 = 0.74

---

# Within two SD

D) What proportion of scores are within TWO standard deviations of the mean?
- TWO SD above mean = 105.84 + 2*(14.27) = 134.38
- TWO SD below mean = 105.84 - 2*(14.27) = 77.3
- 29/31 = 0.935

---

# Compare to exact Normal

E) What would these proportions be in an exactly Normal distribution?

-  +/- One SD?

``` r
area_within_one_sd <- 0.8413 - 0.500
print(paste("Area within one SD:", area_within_one_sd))
```

```
## [1] "Area within one SD: 0.3413"
```

``` r
total_area <- area_within_one_sd * 2
print(paste("Total area within ±1 SD:", total_area))
```

```
## [1] "Total area within ±1 SD: 0.6826"
```

---

## Cumulative Proportions

- The table below gives cumulative proportions for the standard Normal distribution.
- The cumulative proportion for a value `$z$` equals `$P(Z \le z)$` under the standard Normal.
- In other words, the cumulative proportion for a value `$x$` in a distribution is the proportion of observations in the distribution that are less than or equal to `$x$`.
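
In R, `pnorm()` returns cumulative proportions directly. The sketch below reuses the ITBS model N(6.84, 1.55) from the earlier worked example and shows that standardizing first gives the same answer as working on the raw scale.

``` r
mu <- 6.84     # ITBS vocabulary mean (from the worked example)
sigma <- 1.55  # ITBS vocabulary SD
x <- 5.29

# Standardize, then look up P(Z <= z) under the standard Normal
z <- (x - mu) / sigma
p_standard <- pnorm(z)

# Same cumulative proportion computed on the raw scale
p_raw <- pnorm(x, mean = mu, sd = sigma)

round(c(z = z, p_standard = p_standard, p_raw = p_raw), 4)
```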

---

The area between -2 and +2 standard deviations from the mean in a Normal distribution:

``` r
area_within_two_sd <- 0.9773 - 0.500
print(paste("Area within two SD:", area_within_two_sd))
```

```
## [1] "Area within two SD: 0.4773"
```

``` r
total_area_two_sd <- area_within_two_sd * 2
print(paste("Total area within ±2 SD:", total_area_two_sd))
```

```
## [1] "Total area within ±2 SD: 0.9546"
```
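
For a side-by-side view, the sketch below puts the observed IQ proportions (23/31 and 29/31) next to the exact Normal areas, computed with `pnorm()` instead of table values. The object names are illustrative.

``` r
# Observed proportions from the IQ sample of 31 scores
observed <- c(one_sd = 23 / 31, two_sd = 29 / 31)

# Exact areas under a Normal curve within 1 and 2 SD of the mean
exact <- c(one_sd = pnorm(1) - pnorm(-1),
           two_sd = pnorm(2) - pnorm(-2))

round(rbind(observed, exact), 3)
```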

---

## Starting from a proportion (SAT example)

- SAT reading scores for a recent year are distributed according to an N(504, 111) distribution. 
- How high must a student score to be in the top 10% of the distribution?
- In other words, what score corresponds to the cumulative proportion 0.90 below it?

---

### Normal Calculations

- How high must a student score in order to be in the top 10% of the distribution?
  - Look up the cumulative proportion closest to 0.90 in the table (the top 10% leaves 0.90 below it).
  - Find the corresponding standardized score.
  - The value you seek is that many standard deviations from the mean.
  
---

```
##     z  X0.07  X0.08  X0.09
## 1 1.1 0.8790 0.8810 0.8830
## 2 1.2 0.8980 0.8997 0.9015
## 3 1.3 0.9147 0.9162 0.9177
```

```
## [1] "z-score for top 10%: 1.28"
```

---

- We need to “unstandardize” the z-score to find the observed value (x):

``` r
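# Calculate the actual score
mean_score <- 504
sd_score <- 111
# z-score with cumulative proportion 0.90 (about 1.28);
# defined here so the chunk runs on its own
z_score <- qnorm(0.90)

required_score <- mean_score + z_score * sd_score

print(paste("Required score for top 10%:", round(required_score, 2)))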
```

```
## [1] "Required score for top 10%: 646.25"
```

- A student would have to score at least 646.25 to be in the top 10% of the distribution of SAT reading scores for this particular year (646.08 if the rounded table value `$z = 1.28$` is used).

---

## "Backward" Normal Calculations

Steps for using the table when given a Normal proportion:

1. State the problem in terms of the given proportion. Draw a picture that shows the Normal value, `$x$`, that you want in relation to the cumulative proportion.
2. Use the table, the fact that the total area under the curve is 1, and the given area under the standard Normal curve to find the corresponding `$z$`-value.
3. Unstandardize `$z$` to solve the problem in terms of a non-standard Normal variable `$x$`.
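
The three steps can be sketched in R, with `qnorm()` standing in for the table lookup. The numbers come from the SAT example, N(504, 111); the variable names are illustrative.

``` r
# Step 1: "top 10%" means a cumulative proportion of 0.90 below the cutoff
p_below <- 1 - 0.10

# Step 2: find the z-value with that cumulative proportion
z <- qnorm(p_below)

# Step 3: unstandardize to the SAT scale
x <- 504 + z * 111

# qnorm() can also do steps 2 and 3 in one call
x_direct <- qnorm(p_below, mean = 504, sd = 111)

round(c(z = z, x = x, x_direct = x_direct), 2)
```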

---

## Functions in R for Normal Distribution

- `dnorm()`: gives the density (the height of the curve)
- `pnorm()`: gives the cumulative distribution function
  - Computes the probability that a normally distributed random number is less than or equal to a given value
- `qnorm()`: gives the quantile function
  - The inverse of `pnorm()`: give it a probability and it returns the value whose cumulative proportion matches that probability
- `rnorm()`: generates random deviates

---

## Illustrations with R

Let's demonstrate these functions:

``` r
# Density at the mean of a standard normal distribution
dnorm(0)
```

```
## [1] 0.3989423
```

``` r
# Cumulative probability at 1 SD above the mean
pnorm(1)
```

```
## [1] 0.8413447
```
---

## Illustrations with R

Let's demonstrate these functions:

``` r
# Value at the 90th percentile of a standard normal
qnorm(0.9)
```

```
## [1] 1.281552
```

``` r
# Generate 5 random numbers from N(0,1)
rnorm(5)
```

```
## [1] -0.1608782 -1.8855510 -0.3830155  1.3031392  0.7325694
```
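
Two follow-up checks, offered as a sketch: `qnorm()` undoes `pnorm()`, and setting a seed makes `rnorm()` reproducible. The seed value 123 is an arbitrary choice.

``` r
# pnorm() and qnorm() are inverses of each other
p <- pnorm(1.5)
qnorm(p)  # returns 1.5 (up to floating-point error)

# Fix the seed so the same "random" numbers come back every time
set.seed(123)
first_draw <- rnorm(5)
set.seed(123)
second_draw <- rnorm(5)
identical(first_draw, second_draw)
```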

---
class: middle
# Wrapping Up...