Descriptive Statistics: Numeric Edition

class: center, middle, inverse, title-slide

.title[
# Descriptive Statistics: Numeric Edition
]
.author[
### S. Mason Garrison
]

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://psychmethods.github.io/coursenotes/" target="_blank">Methods in Psychological Research</a>
</span>
</div>

---

class: middle

# Characteristics of distributions

---

## Describing distributions with numbers 
.pull-left[
- Hundreds of descriptive statistics exist
- Goal: Describe the data with a single number that represents an entire distribution
- Taxonomy of descriptive statistics (a.k.a. Key characteristics)
  - Shape
  - Center
  - Spread
  - Unusual observations
] .pull-right[

]

---

# Taxonomy

- shape:
  - skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
  - modality: unimodal, bimodal, multimodal, uniform
- center: mean (`mean`), median (`median`), mode (not always useful)
- spread: range (`range`), standard deviation (`sd`), inter-quartile range (`IQR`)
- unusual observations: outliers, extreme values

---

# Center

- Center (or Central tendency or Location) aims to capture the center of the distribution.
- Display the normal distributions with various means

.small[
<img src="data:image/png;base64,#summary_files/figure-html/cent-1.png" width="55%" style="display: block; margin: auto;" />
]
---

.small[

``` r
x <- seq(-80, 80, length=1000)
hx <- dnorm(x)
colors <- c("red", "blue",
            "green", "purple", "black")

plot(x, hx, type="l", lty=2, xlab="x value",
		 ylab="Density", main="Distributions with Different Means",xlim=c(-7, 7))
location<-c(2,4,-2,-4)
labels=paste0("Mean = ",location)
labels[length(labels)+1]="Mean = 0"

for (i in 1:length(location)){
	lines(x, dnorm(x,mean=location[i]), lwd=1, col=colors[i])
}

legend("topright", inset=.05, title="Distributions",
  labels, lwd=2, lty=c(1, 1, 1, 1, 2), col=colors)
```

<img src="data:image/png;base64,#summary_files/figure-html/unnamed-chunk-3-1.png" width="30%" style="display: block; margin: auto;" />
]

---

# Spread

- Spread (a.k.a. variability) describes how far the data spread out from that center
    - Low variance: a narrow distribution, with the bulk of the mass near the center
    - High variance: a wide distribution, with the mass spread far from the center

---

# Normal distributions w/ various standard deviations

.small[
<img src="data:image/png;base64,#summary_files/figure-html/spread-1.png" width="75%" style="display: block; margin: auto;" />
]

---

.small[

``` r
plot(x, hx, type="l", lty=2, xlab="x value",
		 ylab="Density", main="Distributions with Different Standard Deviations",xlim=c(-7, 7))
spread=c(.5,2,4)
labels=paste0("SD = ",spread)
labels[length(labels)+1]="SD = 1"
for (i in 1:length(spread)){
	lines(x, dnorm(x,sd=spread[i]), lwd=1, col=colors[i])
} 
legend("topright", inset=.05, title="Distributions",
  labels, lwd=2, lty=c(1, 1, 1, 2), col=colors[c(1:3,5)])
```

<img src="data:image/png;base64,#summary_files/figure-html/unnamed-chunk-4-1.png" width="30%" style="display: block; margin: auto;" />
]

---

# Skew / Asymmetry
.pull-left[
- Skewness is a measure of symmetry, or more precisely, the lack of symmetry. 
- A distribution, or data set, is symmetric 
  - if it looks the same to the left and right of the center point.
- The direction of skew is named for the longer tail of the distribution
]
.pull-right[
- Positively skewed (right skewed)
    - Bulk of the mass is on the left
- Negatively skewed (left skewed)
    - Bulk of the mass is on the right
]
<img src="data:image/png;base64,#../img/skew.gif" width="40%" style="display: block; margin: auto;" />
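---

# Skew: a numeric check

A minimal sketch, assuming the moment-based definition of skewness (third central moment divided by the second central moment raised to the 3/2 power). Packages differ in the exact estimator they use, and `skew_by_hand` and both toy vectors are hypothetical names and values chosen only for illustration.

``` r
# Moment-based skewness (one common definition; an assumption here)
skew_by_hand <- function(x) {
  d <- x - mean(x)                 # deviations from the mean
  mean(d^3) / mean(d^2)^(3/2)      # third moment over second moment^(3/2)
}

symmetric    <- c(1, 2, 3, 4, 5)   # mirror image around 3
right_skewed <- c(1, 1, 1, 2, 10)  # long tail to the right

skew_by_hand(symmetric)            # zero: no skew
skew_by_hand(right_skewed)         # positive: right (positive) skew
skew_by_hand(-right_skewed)        # negative: left (negative) skew
```

The sign follows the longer tail: flipping the data flips the sign.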

---

# Kurtosis – peakedness

- Kurtosis is a measure of whether the data are 
    - heavy-tailed or light-tailed relative to a normal distribution. 
- Data sets with high kurtosis tend to have heavy tails, or outliers. 
- Data sets with low kurtosis tend to have light tails, or lack of outliers. 
    - A uniform (flat) distribution is a common light-tailed example.
    
---

# Kurtosis – peakedness
.pull-left[
- Low k
    - Platykurtic
- Normal
    - Mesokurtic
- High k
    - Leptokurtic
]
<br>
.pull-right[
<img src="data:image/png;base64,#../img/KurtosisPict.jpg" width="90%" style="display: block; margin: auto;" />
]
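
---

# Kurtosis: a numeric check

A rough sketch, assuming the moment-based definition (fourth central moment divided by the squared second central moment) and the usual convention of subtracting 3 so that a normal distribution scores 0. `excess_kurtosis` and the toy vectors are hypothetical, and packages may apply different small-sample corrections.

``` r
# Moment-based excess kurtosis (an assumed definition, not a package API)
excess_kurtosis <- function(x) {
  d <- x - mean(x)                  # deviations from the mean
  mean(d^4) / mean(d^2)^2 - 3       # normal distribution is the 0 reference
}

flat         <- c(1, 2, 3, 4, 5)                    # evenly spread, light tails
heavy_tailed <- c(-10, -1, 0, 0, 0, 0, 0, 1, 10)    # mostly central, two extremes

excess_kurtosis(flat)           # negative: lighter tails than a normal
excess_kurtosis(heavy_tailed)   # positive: heavier tails than a normal
```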

---

# Specific Measures

- Measures of Central Tendency
    - Mean
    - Median
    - Mode

---

# Central tendency: Mean

- Mean (`$\mu; \bar{X}$`)
    - arithmetic average
    - `$\bar{X}$` is used for samples
    - Mu (`$\mu$`) is used for population

`$\bar{X}= \frac{1}{n}  \sum^{n}_{i=1}x_{i}$`
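
---

# Central tendency: Mean

The formula above can be sketched directly in R. The `scores` vector holds toy values chosen only for illustration, and `mean_by_hand` is a hypothetical name.

``` r
# Toy scores (hypothetical values, chosen only for illustration)
scores <- c(2, 4, 4, 5, 10)

n <- length(scores)                # number of observations
mean_by_hand <- sum(scores) / n    # add every score, divide by n

mean_by_hand                       # same value as the built-in below
mean(scores)
```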

---

# Central tendency: Mean

.medi[#Properties]

- Is the balance point of the distribution (in terms of center of mass)
--
    
    `$\sum^{n}_{i=1}(x_{i}-\bar{x})=0$`
    
--

- Least squares property
      - The sum of squared deviations about the mean is smaller than about any other value

--
- Highly sensitive to outliers (extreme scores)
  - Weakness: the mean is a poor measure of central tendency in highly skewed distributions
- Is not a robust statistic (a robust statistic is insensitive to outliers; a non-robust one is sensitive to them)
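
---

# Central tendency: Mean

A small sketch of the three properties, using toy scores that are hypothetical and chosen so the arithmetic is easy to check by hand. `ss` is a helper name introduced only for this example.

``` r
scores <- c(2, 4, 4, 5, 10)          # toy scores; mean is 5, median is 4

# Balance point: deviations about the mean cancel out
sum(scores - mean(scores))

# Least squares: squared deviations are smaller about the mean
ss <- function(center) sum((scores - center)^2)
ss(mean(scores))                     # about the mean
ss(median(scores))                   # about the median (larger)

# Sensitivity to outliers: change one score from 10 to 100
with_outlier <- c(2, 4, 4, 5, 100)
mean(with_outlier)                   # pulled far toward the outlier
median(with_outlier)                 # unchanged
```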
  
---

# Central tendency: Mean
- Very good with quantitative data (interval and ratio data)
    - Especially for bell-shaped distributions
    - Very popular statistic

---

# Horse kick Data
.pull-left[

``` r
library(pscl)
data(prussian)

# horse kick fatalities, one count per corps per year
prussian$y 
```

```
##   [1] 0 2 2 1 0 0 1 1 0 3 0 2 1 0 0 1 0 1 0 1 0 0 0 2 0 3 0 2 0 0
##  [31] 0 1 1 1 0 2 0 3 1 0 0 0 0 2 0 2 0 0 1 1 0 0 2 1 1 0 0 2 0 0
##  [61] 0 0 0 1 1 1 2 0 2 0 0 0 1 0 1 2 1 0 0 0 0 1 0 1 1 1 1 0 0 0
##  [91] 0 1 0 0 0 0 1 1 0 0 0 0 0 0 2 1 0 0 1 0 0 1 0 1 1 1 1 1 1 0
## [121] 0 0 1 0 2 0 0 1 2 0 1 1 3 1 1 1 0 3 0 0 1 0 1 0 0 0 1 0 1 1
## [151] 0 0 2 0 0 2 1 0 2 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 1 0 1
## [181] 0 0 0 0 0 2 1 1 1 0 2 1 1 0 1 2 0 1 0 0 0 0 1 1 0 1 0 2 0 2
## [211] 0 0 0 0 2 1 3 0 1 1 0 0 0 0 2 4 0 1 3 0 1 1 1 1 2 1 3 1 3 1
## [241] 1 1 2 1 1 3 0 4 0 1 0 3 2 1 0 2 1 1 0 0 0 1 0 0 0 0 0 1 0 1
## [271] 1 0 0 0 2 2 0 0 0 0
```

``` r
variable<-prussian$y 
```
]
.pull-right[

``` r
# Mean
mean(variable)
```

```
## [1] 0.7
```
]
---

# Central Tendency: Median
- Median (Md)
- Def: central score in a distribution
    - If *n is odd* then 
        - Md = value of the `$\frac{n+1}{2}$`th ordered score.
    - If *n is even* then 
        - Md = average of the `$\frac{n}{2}$`th and `$(\frac{n}{2}+1)$`th ordered scores.
        
---

# Central Tendency: Median
- Properties
    - Balance point of scores
    - Highly robust to outliers (less sensitive than the mean to outliers)
    - Sum of absolute deviations about the median is smaller than about any other constant (c)
--
<br>
  `$\sum^{n}_{i=1}\left|x_{i}-c\right|$` is smallest when `$c = Md$`
- Often used for ordinal data

``` r
median(variable)
```

```
## [1] 0
```
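
---

# Central Tendency: Median

A minimal sketch of the absolute-deviation property, searching over a grid of candidate centers. The toy `scores`, the grid, and the helper `sad` are all hypothetical choices made for illustration.

``` r
scores <- c(2, 4, 4, 5, 10)                    # toy scores; median is 4

# Sum of absolute deviations about a candidate center
sad <- function(center) sum(abs(scores - center))

candidates <- seq(0, 12, by = 0.5)             # grid of possible centers
totals <- sapply(candidates, sad)

candidates[which.min(totals)]                  # the winning center
median(scores)                                 # matches the median

sad(median(scores))                            # smaller ...
sad(mean(scores))                              # ... than about the mean
```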

---

# Calculating by "hand"

.pull-left.small[
- Sample Size?

``` r
(sample.size <- length(variable))
```

```
## [1] 280
```

- Even or Odd?
  - Test if number is divisible by 2.
  - If yes, then even
  - Else, is odd

``` r
sample.size %% 2 == 0 
```

```
## [1] TRUE
```
]
.pull-right.small[
- Sort our values

``` r
variable <- sort(variable)
```

- if odd, grab the midpoint value (sample size + 1) / 2

``` r
variable[(sample.size + 1)/2]
```

```
## [1] 0
```

- if even, grab the average of the midpoint values sample size / 2 and sample size / 2 + 1

``` r
(variable[sample.size/2] + variable[1+sample.size/2])/2
```

```
## [1] 0
```
]
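
---

# Calculating by "hand"

The steps above can be collected into one function. This is a sketch, not a replacement for `median()`: `median_by_hand` is a hypothetical helper, and it assumes a numeric vector with no missing values.

``` r
# Median by hand (assumes numeric input with no NAs)
median_by_hand <- function(x) {
  x <- sort(x)                          # order the scores
  n <- length(x)                        # sample size
  if (n %% 2 == 0) {
    (x[n / 2] + x[n / 2 + 1]) / 2       # even: average the two middle scores
  } else {
    x[(n + 1) / 2]                      # odd: take the single middle score
  }
}

median_by_hand(c(5, 1, 3))              # odd n
median_by_hand(c(4, 1, 3, 2))           # even n
```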

---

# Central Tendency: Mode
.pull-left-narrow.small[
- Mode: most common score
- Local modes are the highest points within a subset of the distribution; there can be several

``` r
#Mode
mode(variable) # That's not mode!
```

```
## [1] "numeric"
```
]
.pull-right-wide[

``` r
# Function to examine mode
Mode <- function(x) {
	ux <- unique(x)   #finds all unique values
	ux[which.max(tabulate(match(x, ux)))] #returns the value which is most frequent
}
Mode(variable)
```

```
## [1] 0
```
]
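
---

# Central Tendency: Mode

When two or more values tie for most frequent, a single-value mode hides the tie. A possible sketch that returns every tied value is shown here; `all_modes` is a hypothetical helper and assumes numeric data.

``` r
# Return every value that ties for the highest frequency
all_modes <- function(x) {
  counts <- table(x)                                  # frequency of each value
  as.numeric(names(counts)[counts == max(counts)])    # keep all top-count values
}

all_modes(c(0, 0, 1, 1, 2))   # two modes: 0 and 1
all_modes(c(3, 1, 3, 2))      # one mode: 3
```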

---

# Relationship between mean, median, and mode
.pull-left[
- When we have a symmetric unimodal distribution
    - Mean=median=mode
    ]
--
.pull-left[
- positively skewed
    - mode < median < mean
- negatively skewed
    - mean < median < mode
]
.pull-right[
<img src="data:image/png;base64,#../img/mmmskew.jpg" width="90%" style="display: block; margin: auto;" />
]
---

.pull-left-narrow[

``` r
hist(variable)
```

<img src="data:image/png;base64,#summary_files/figure-html/hist-1.png" width="90%" style="display: block; margin: auto;" />
]
.pull-right-wide.small[

``` r
library(Hmisc)
describe(variable)
```

```
## variable 
##        n  missing distinct     Info     Mean  pMedian      Gmd 
##      280        0        5    0.828      0.7      0.5   0.8752 
##                                         
## Value          0     1     2     3     4
## Frequency    144    91    32    11     2
## Proportion 0.514 0.325 0.114 0.039 0.007
## 
## For the frequency table, variable is rounded to the nearest 0
```
]
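---

# Checking the ordering

A quick sketch that rebuilds the counts from the frequency table above, so it runs without the `pscl` package. The name `kicks` is hypothetical; the values come only from that table. Because the data are discrete, the mode and median tie, but the mean still sits to the right, as expected for a positively skewed distribution.

``` r
# Rebuild the data from the frequency table: value 0 occurs 144 times, etc.
kicks <- rep(0:4, times = c(144, 91, 32, 11, 2))

length(kicks)                                  # sample size
mean(kicks)                                    # mean
median(kicks)                                  # median
as.numeric(names(which.max(table(kicks))))     # mode (most frequent value)
```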
---
# Measures of Spread around the Median

- Range
    - maximum value - minimum value
    - `$max(x_{i})-min(x_{i})$`
    - Non-robust to outliers
--
- Quartiles
    - Lower quartile (Q1), 25th percentile
    - Second quartile (Q2), 50th percentile / median
    - Upper quartile (Q3), 75th percentile
--
- Interquartile Range (IQR)
    - Q3-Q1
    - Sometimes called h-spread; h = hinges
    
---

# R Examples
.medium[

``` r
# Range
range(variable)
```

```
## [1] 0 4
```

``` r
max(variable) - min(variable)
```

```
## [1] 4
```

``` r
summary(variable) # five-number summary (quartiles) plus the mean
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0     0.7     1.0     4.0
```
]
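
---

# R Examples

Quartiles and the IQR can be sketched the same way. To keep the block self-contained, the counts are rebuilt from the frequency table shown earlier under the hypothetical name `kicks`. Note that `quantile()` uses R's default interpolation rule (type 7); other rules can give slightly different quartiles.

``` r
# Rebuild the data from the frequency table (no pscl needed)
kicks <- rep(0:4, times = c(144, 91, 32, 11, 2))

# Q1, Q2 (median), and Q3
quartiles <- quantile(kicks, probs = c(0.25, 0.50, 0.75))
quartiles

# Interquartile range: Q3 - Q1, by hand and with the built-in
unname(quartiles[3] - quartiles[1])
IQR(kicks)
```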

---

# Spread around the Mean

- Variance 
`$(\sigma^{2}; s^{2})$`
    - Measure of spread around the mean
    - Goal of the measure: use every score

`$\sigma^{2}$` = `$\frac{\sum^{N}_{i=1}(x_{i}-\mu)^{2}}{N}$`
    `$s^{2}$` = `$\frac{\sum^{n}_{i=1}(x_{i}-\bar{x})^{2}}{n-1}$`
- Standard Deviation 
`$(\sigma; s)$`
    - `$\sigma$` = `$\sqrt{\sigma^{2}}$`
    - s = `$\sqrt{s^{2}}$`
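
---

# Spread around the Mean

A worked sketch of both formulas on toy scores. The values are hypothetical and chosen so the sum of squares is easy to verify by hand; the variable names are introduced only for this example.

``` r
scores <- c(2, 4, 4, 5, 10)             # toy scores; mean is 5
n  <- length(scores)
ss <- sum((scores - mean(scores))^2)    # sum of squared deviations

pop_var    <- ss / n                    # sigma^2 formula: scores are the whole population
sample_var <- ss / (n - 1)              # s^2 formula: what var() computes
sample_sd  <- sqrt(sample_var)          # s: what sd() computes

c(pop_var = pop_var, sample_var = sample_var, sample_sd = sample_sd)
var(scores)
sd(scores)
```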
    
---

# Bessel's Correction
- `$s^{2}$` is nearly the average squared deviation
- `$s^{2}$` divides by n - 1 instead of n
    - Otherwise we get biased estimates
    - This adjustment is called Bessel's correction
- In our formula, we are using the sample mean (`$\bar{x}$`) instead of the true mean (`$\mu$`)
    - Each deviation `$x_{i} - \bar{x}$` differs from `$x_{i} - \mu$` by `$\bar{x} - \mu$`, which makes the sum of squared deviations too small on average.
    - This is why we divide by n - 1 instead of n.
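---

# Bessel's Correction

A simulation sketch of the bias, under assumptions chosen only for illustration: normal data with a true variance of 4, samples of size 5, 10,000 replications, and an arbitrary seed. `one_sample` is a hypothetical helper.

``` r
set.seed(1)                           # arbitrary seed, for reproducibility
true_var <- 4                         # population variance (sd = 2)
n <- 5                                # small samples make the bias visible

# Draw one sample and estimate the variance both ways
one_sample <- function() {
  x  <- rnorm(n, mean = 0, sd = sqrt(true_var))
  ss <- sum((x - mean(x))^2)          # squared deviations about the sample mean
  c(divide_by_n = ss / n, divide_by_n_minus_1 = ss / (n - 1))
}

estimates <- replicate(10000, one_sample())   # 2 x 10000 matrix of estimates
rowMeans(estimates)                           # average estimate for each divisor
```

On average, dividing by n should land near 4 * (n - 1) / n = 3.2, while dividing by n - 1 should land near the true value of 4.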
---

# Properties
- The standard deviation is in raw score units
- Neither the variance nor the standard deviation is robust to outliers
   
---

# R Examples

``` r
# Variance
var(variable)
```

```
## [1] 0.762724
```

``` r
# Standard Deviation
sqrt(var(variable))
```

```
## [1] 0.8733407
```

``` r
sd(variable)
```

```
## [1] 0.8733407
```

---

# Wrapping Up...

<br><br>
![](data:image/png;base64,#../img/centralbears.jpg)