class: center, middle, inverse, title-slide

.title[
# Descriptive Statistics: Numeric Edition
]
.author[
### S. Mason Garrison
]

---
layout: true

<div class="my-footer">
<span>
<a href="https://psychmethods.github.io/coursenotes/" target="_blank">Methods in Psychological Research</a>
</span>
</div>

---
class: middle

# Characteristics of distributions

---

## Describing distributions with numbers

.pull-left[
- Hundreds of descriptive statistics exist
- Goal: Describe the data with a single number that represents an entire distribution
- Taxonomy of descriptive statistics (a.k.a. Key characteristics)
  - Shape
  - Center
  - Spread
  - Unusual observations
]

.pull-right[
<img src="data:image/png;base64,#summary_files/figure-html/unnamed-chunk-2-1.png" width="90%" style="display: block; margin: auto;" />
]

---

# Taxonomy

- shape:
  - skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
  - modality: unimodal, bimodal, multimodal, uniform
- center: mean (`mean`), median (`median`), mode (not always useful)
- spread: range (`range`), standard deviation (`sd`), inter-quartile range (`IQR`)
- unusual observations: outliers, extreme values

---

# Center

- Center (or Central tendency or Location) aims to capture the center of the distribution.
- Normal distributions with various means:

.small[
<img src="data:image/png;base64,#summary_files/figure-html/cent-1.png" width="55%" style="display: block; margin: auto;" />
]

---

.small[

``` r
# Grid of x values and the standard normal density (mean 0, sd 1)
x <- seq(-80, 80, length.out = 1000)
hx <- dnorm(x)
colors <- c("red", "blue", "green", "purple", "black")
# Dashed reference curve: the standard normal
plot(x, hx, type = "l", lty = 2, xlab = "x value", ylab = "Density",
     main = "Distributions with Different Means", xlim = c(-7, 7))
location <- c(2, 4, -2, -4)
labels <- paste0("Mean = ", location)
labels[length(labels) + 1] <- "Mean = 0"
# One solid curve per shifted mean
for (i in seq_along(location)) {
  lines(x, dnorm(x, mean = location[i]), lwd = 1, col = colors[i])
}
legend("topright", inset = .05, title = "Distributions", labels,
       lwd = 2, lty = c(1, 1, 1, 1, 2), col = colors)
```

<img src="data:image/png;base64,#summary_files/figure-html/unnamed-chunk-3-1.png" width="30%" style="display: block; margin: auto;" />
]

---

# Spread

- Spread (a.k.a. variability) describes how far the data are spread out from the center
- A low-variance distribution is narrow, with the bulk of its mass near the center
- A high-variance distribution is wide, with its mass spread far from the center

---

# Normal distributions w/ various standard deviations

.small[
<img src="data:image/png;base64,#summary_files/figure-html/spread-1.png" width="75%" style="display: block; margin: auto;" />
]

---

.small[

``` r
# Reuses x, hx, and colors from the previous chunk
plot(x, hx, type = "l", lty = 2, xlab = "x value", ylab = "Density",
     main = "Distributions with Different Standard Deviations", xlim = c(-7, 7))
spread <- c(.5, 2, 4)
labels <- paste0("SD = ", spread)
labels[length(labels) + 1] <- "SD = 1"
# One solid curve per standard deviation
for (i in seq_along(spread)) {
  lines(x, dnorm(x, sd = spread[i]), lwd = 1, col = colors[i])
}
# Four legend entries; the last one (SD = 1) is the dashed reference curve
legend("topright", inset = .05, title = "Distributions", labels,
       lwd = 2, lty = c(1, 1, 1, 2), col = colors[c(1:3, 5)])
```

<img src="data:image/png;base64,#summary_files/figure-html/unnamed-chunk-4-1.png" width="30%" style="display: block; margin: auto;" />
]

---

# Skew / Asymmetry

.pull-left[
- Skewness is a measure of symmetry, or more precisely, the lack of symmetry.
- A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
- The direction of skew refers to the side of the longer tail of the distribution
]

.pull-right[
- Positively skewed (right skewed)
  - Bulk of the mass is on the left
- Negatively skewed (left skewed)
  - Bulk of the mass is on the right
]

<img src="data:image/png;base64,#../img/skew.gif" width="40%" style="display: block; margin: auto;" />

---

# Kurtosis – peakedness

- Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
- Data sets with high kurtosis tend to have heavy tails, or outliers.
- Data sets with low kurtosis tend to have light tails, or lack of outliers.
- A uniform (flat) distribution is an extreme light-tailed case.

---

# Kurtosis – peakedness

.pull-left[
- Low k
  - Platykurtic
- Normal
  - Mesokurtic
- High k
  - Leptokurtic
]

<br>

.pull-right[
<img src="data:image/png;base64,#../img/KurtosisPict.jpg" width="90%" style="display: block; margin: auto;" />
]

---

# Specific Measures

- Measures of Central Tendency
  - Mean
  - Median
  - Mode

---

# Central tendency: Mean

- `\(Mean (\mu; \bar{X})\)`
  - arithmetic average
  - `\(\bar{X}\)` is used for samples
  - Mu (`\(\mu\)`) is used for populations

`\(\bar{X}= \frac{1}{n} \sum^{n}_{i=1}x_{i}\)`

---

# Central tendency: Mean

.medi[#Properties]
- Is the balance point of the distribution (in terms of center of mass)

--

`\(\sum^{n}_{i=1}(x_{i}-\bar{x})=0\)`

--

- Least squares property
  - The sum of squared deviations about the mean is smaller than about any other value

--

- Highly sensitive to outliers (extreme scores)
  - (weakness: the mean is a poor measure of central tendency in highly skewed distributions)
- Is not a robust statistic (low robustness = sensitive to outliers; high robustness = not sensitive to outliers)

---

# Central tendency: Mean

- Very good with quantitative data (interval and ratio data), especially bell-shaped distributions
- Very popular statistic

---

# Horse kick Data

.pull-left[

``` r
library(pscl)
data(prussian)
#horse kick fatalities by year
prussian$y
```

```
##   [1] 0 2 2 1 0 0 1 1 0 3 0 2 1 0 0 1 0 1 0 1 0 0 0 2 0 3 0 2 0 0
##  [31] 0 1 1 1 0 2 0 3 1 0 0 0 0 2 0 2 0 0 1 1 0 0 2 1 1 0 0 2 0 0
##  [61] 0 0 0 1 1 1 2 0 2 0 0 0 1 0 1 2 1 0 0 0 0 1 0 1 1 1 1 0 0 0
##  [91] 0 1 0 0 0 0 1 1 0 0 0 0 0 0 2 1 0 0 1 0 0 1 0 1 1 1 1 1 1 0
## [121] 0 0 1 0 2 0 0 1 2 0 1 1 3 1 1 1 0 3 0 0 1 0 1 0 0 0 1 0 1 1
## [151] 0 0 2 0 0 2 1 0 2 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 1 0 1
## [181] 0 0 0 0 0 2 1 1 1 0 2 1 1 0 1 2 0 1 0 0 0 0 1 1 0 1 0 2 0 2
## [211] 0 0 0 0 2 1 3 0 1 1 0 0 0 0 2 4 0 1 3 0 1 1 1 1 2 1 3 1 3 1
## [241] 1 1 2 1 1 3 0 4 0 1 0 3 2 1 0 2 1 1 0 0 0 1 0 0 0 0 0 1 0 1
## [271] 1 0 0 0 2 2 0 0 0 0
```

``` r
variable <- prussian$y
```
]

.pull-right[

``` r
# Mean
mean(variable)
```

```
## [1] 0.7
```
]

---

# Central Tendency: Median

- Median (Md)
  - Def: central score in a distribution
- If *n is odd* then
  - Md = value of the `\(\frac{n+1}{2}\)`th item in the sorted data.
- If *n is even* then
  - Md = average of the `\(\frac{n}{2}\)`th and `\(\left(\frac{n}{2}+1\right)\)`th items in the sorted data.

---

# Central Tendency: Median

- Properties
  - Balance point of the scores by count: half fall at or below it, half at or above it
  - Highly robust to outliers (less sensitive than the mean to outliers)
  - The sum of absolute deviations about the median is smaller than about any other constant (c)

--

<br>
`\(\sum^{n}_{i=1}\left|x_{i}-c\right|\)`

- Often used for ordinal data

``` r
median(variable)
```

```
## [1] 0
```

---

# Calculating by "hand"

.pull-left.small[
- Sample Size?

``` r
(sample.size <- length(variable))
```

```
## [1] 280
```

- Even or Odd?
  - Test if number is divisible by 2.
  - If yes, then even
  - Else, is odd

``` r
sample.size %% 2 == 0
```

```
## [1] TRUE
```
]

.pull-right.small[
- Sort our values

``` r
variable <- sort(variable)
```

- If odd, grab the midpoint value at position (sample size + 1) / 2

``` r
# Position (n + 1) / 2 is the middle item when n is odd
variable[(sample.size + 1)/2]
```

```
## [1] 0
```

- If even, grab the average of the two midpoint values at positions sample size / 2 and sample size / 2 + 1

``` r
(variable[sample.size/2] + variable[1+sample.size/2])/2
```

```
## [1] 0
```
]

---

# Central Tendency: Mode

.pull-left-narrow.small[
- Mode: most common score
- Local modes are the highest points within a subset of the distribution; there can be several

``` r
#Mode
mode(variable) # That's not mode!
```

```
## [1] "numeric"
```
]

.pull-right-wide[

``` r
# Function to examine mode
Mode <- function(x) {
  ux <- unique(x) #finds all unique values
  ux[which.max(tabulate(match(x, ux)))] #returns the value which is most frequent
}
Mode(variable)
```

```
## [1] 0
```
]

---

# Relationship between mean, median, and mode

.pull-left[
- When we have a symmetric unimodal distribution
  - Mean=median=mode
]

--

.pull-left[
- positively skewed
  - mode < median < mean
- negatively skewed
  - mean < median < mode
]

.pull-right[
<img src="data:image/png;base64,#../img/mmmskew.jpg" width="90%" style="display: block; margin: auto;" />
]

---

.pull-left-narrow[

``` r
hist(variable)
```

<img src="data:image/png;base64,#summary_files/figure-html/hist-1.png" width="90%" style="display: block; margin: auto;" />
]

.pull-right-wide.small[

``` r
library(Hmisc)
describe(variable)
```

```
## variable 
##        n  missing distinct     Info     Mean  pMedian      Gmd 
##      280        0        5    0.828      0.7      0.5   0.8752 
## 
## Value          0     1     2     3     4
## Frequency    144    91    32    11     2
## Proportion 0.514 0.325 0.114 0.039 0.007
## 
## For the frequency table, variable is rounded to the nearest 0
```
]

---

# Measures of Spread around the Median

- Range
  - maximum value minus minimum value
  - `\(max(x_{i})-min(x_{i})\)`
  - Non-robust to outliers

--

- Quartiles
  - Lower quartile (Q1), 25th percentile
  - Second quartile
(Q2), 50th percentile / median
  - Upper quartile (Q3), 75th percentile

--

- Interquartile Range (IQR)
  - Q3-Q1
  - Sometimes called h-spread; h = hinges

---

# R Examples

.medium[

``` r
# Range
range(variable)
```

```
## [1] 0 4
```

``` r
max(variable) - min(variable)
```

```
## [1] 4
```

``` r
summary(variable) # Five-number summary (quartiles) plus the mean
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0     0.7     1.0     4.0
```
]

---

# Spread around the Mean

- Variance `\((\sigma^{2}; s^{2})\)`
  - Measure of spread around the mean
  - The goal of the measure is to use every score

`\(\sigma^{2}\)` = `\(\frac{\sum^{N}_{i=1}(x_{i}-\mu)^{2}}{N}\)`

`\(s^{2}\)` = `\(\frac{\sum^{n}_{i=1}(x_{i}-\bar{x})^{2}}{n-1}\)`

- Standard Deviation `\((\sigma; s)\)`
  - `\(\sigma\)` = `\(\sqrt{\sigma^{2}}\)`
  - s = `\(\sqrt{s^{2}}\)`

---

# Bessel's Correction

- `\(s^{2}\)` is nearly the average squared deviation
- `\(s^{2}\)` divides by n-1 instead of n
  - Otherwise we get biased estimates
- This adjustment is called Bessel's correction
- In our formula, we are using the sample mean (`\(\bar{x}\)`) instead of the true mean (`\(\mu\)`)
  - this results in underestimating each `\(x_{i} - \mu\)` by `\(\bar{x} - \mu\)`, so the squared deviations sum to too little on average.
  - This is why we divide by n-1 instead of n.

---

# Properties

- The standard deviation is in raw score units (the variance is in squared units)
- Neither the variance nor the standard deviation is robust to outliers

---

# R Examples

``` r
# Variance
var(variable)
```

```
## [1] 0.762724
```

``` r
# Standard Deviation
sqrt(var(variable))
```

```
## [1] 0.8733407
```

``` r
sd(variable)
```

```
## [1] 0.8733407
```

---

# Wrapping Up...

<br><br>
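
As a recap, here is a minimal sketch that recomputes the spread measures by hand and compares them with R's built-ins. It assumes the horse-kick counts can be rebuilt from the frequency table printed by `describe()` (so `pscl` is not required here). The names `ss`, `var_biased`, `var_sample`, and `iqr_hand` are illustrative and not part of the original slides.

``` r
# Assumption: rebuild the counts from the frequency table
# (144 zeros, 91 ones, 32 twos, 11 threes, 2 fours)
variable <- rep(0:4, times = c(144, 91, 32, 11, 2))

n <- length(variable)   # sample size
x_bar <- mean(variable) # sample mean

# Sum of squared deviations about the sample mean
ss <- sum((variable - x_bar)^2)

var_biased <- ss / n        # divides by n: too small on average
var_sample <- ss / (n - 1)  # Bessel's correction: divides by n - 1

# The corrected version should agree with var() and sd()
all.equal(var_sample, var(variable))
all.equal(sqrt(var_sample), sd(variable))

# Interquartile range by hand: Q3 - Q1 (default quantile type)
q <- quantile(variable, probs = c(0.25, 0.75))
iqr_hand <- unname(q[2] - q[1])
all.equal(iqr_hand, IQR(variable))
```

With n = 280 the two variance versions differ only slightly; the correction matters most in small samples.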