class: center, middle, inverse, title-slide

.title[
# Introduction to Robust Statistics and Outliers
]
.author[
### S. Mason Garrison
]

---
layout: true

<div class="my-footer">
<span>
<a href="https://psychmethods.github.io/coursenotes/" target="_blank">Methods in Psychological Research</a>
</span>
</div>

---
class: middle

# Understanding Outliers and Robust Statistics

---

## Course Outline

1. Introduction to Outliers
2. Impact of Outliers on Statistics
3. Identifying Outliers
4. Robust Statistics
5. Handling Outliers
6. Visualizing Data and Outliers
7. Real-world Examples
8. Wrap-up and Discussion

---

## What is an outlier?

.pull-left[
**out-li-er** noun:

1. Something that is situated away from or classed differently from a main or related body
2. A statistical observation that is markedly different in value from the others of the sample
]

.pull-right[
"An important kind of deviation is an outlier, an individual value that falls outside the overall pattern."
]

---

<img src="data:image/png;base64,#../img/gladwell.png" width="100%" style="display: block; margin: auto;" />

---

## Types of Outliers

1. Error outliers
   - Due to measurement or recording mistakes
   - Example: Typing 1000 instead of 100
2. Natural outliers
   - Unusual but valid data points
   - Example: A 7-foot-tall person in a height study
3. Influential outliers
   - Significantly affect statistical analyses
   - Example: A billionaire in an income study

---

## Why Do Outliers Matter?
- They can significantly affect our statistical calculations
- Might represent errors in data collection
- Could be interesting cases worth studying
- Can lead to incorrect conclusions if not handled properly

---

# Robustness

- Robust statistics are less sensitive to outliers
- Most common statistics are highly sensitive to outliers:
  - Mean
  - Standard deviation
  - Correlation

---

<img src="data:image/png;base64,#../img/huff.jpg" width="50%" style="display: block; margin: auto;" />

---

## Example: Impact of Outliers on Mean

Consider these exam scores: 70, 75, 80, 85, 90

What happens if we add one very high score: 200?

.pull-left[
Without 200:

``` r
scores1 <- c(70, 75, 80, 85, 90)
mean(scores1)
```

```
## [1] 80
```
]

.pull-right[
With 200:

``` r
scores2 <- c(scores1, 200)
mean(scores2)
```

```
## [1] 100
```
]

---

## Impact on Other Statistics

Let's look at how the outlier affects other measures:

.pull-left[
Without 200:

``` r
median(scores1)
```

```
## [1] 80
```

``` r
sd(scores1)
```

```
## [1] 7.905694
```
]

.pull-right[
With 200:

``` r
median(scores2)
```

```
## [1] 82.5
```

``` r
sd(scores2)
```

```
## [1] 49.49747
```
]

Which measures changed more dramatically?

---

## Class Activity: Creating Outliers

In pairs, create a dataset of 5 numbers between 1 and 10. Then:

1. Calculate the mean and median
2. Add an "outlier" of 100 to your dataset
3. Recalculate the mean and median
4. Discuss which measure changed more and why

<!-- (5 minutes for activity, 5 minutes for class discussion) -->

---

# What Should We Do with Outliers?

- Remove them?
- Keep them?
- Ignore them?
- Use different statistical methods?

---

# Example

- The data shows Tukey’s scores for the last 5 math tests: 88, 90, 55, 94, and 89.
- Identify the outlier in the data set.
- Then determine how the outlier affects the mean, median, and mode of the data

---

# Calculations: Mean

.pull-left[
- Mean with outlier: (88 + 90 + 55 + 94 + 89) / 5 = 83.2
]

.pull-right[
- Mean without outlier (88, 90, 94, 89): (88 + 90 + 94 + 89) / 4 = 90.25
]

---

# Calculations: Median

.pull-left[
- Median with outlier: 55, 88, *89*, 90, 94
- The median is 89
]

.pull-right[
- Median without outlier: 88, *89*, *90*, 94
- The median is (89 + 90) / 2 = 89.5
]

---

# So What DO We Do with Outliers?

- Do we remove them?
- Do we keep them?
- Do we ignore them?

--

- It’s up to the researcher

---

## Handling Outliers: Considerations

- If you remove them:
  - Document why you removed them
  - Was it a clear error? An impossible value?
- If you keep them:
  - Easier, but be aware they might change your results
  - Consider using robust statistical methods

---
class: middle

# Robust Statistics

---

## 5-Number Summary

A robust way to summarize data:

1. Minimum
2. First Quartile (Q1)
3. Median
4. Third Quartile (Q3)
5. Maximum

The median and quartiles in this summary are less affected by extreme values!

---

## 5-Number Summary Table

- Summarizes a distribution
- Old way: mean, standard deviation
- Tukey's way: lower extreme, lower hinge, median, upper hinge, upper extreme
- Commonly referred to as: min, q1, median, q3, max

<img src="data:image/png;base64,#../img/hinges.gif" width="100%" style="display: block; margin: auto;" />

---

## Calculating 5-Number Summary in R

Let's use our exam scores example:

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   70.00   76.25   82.50  100.00   88.75  200.00
```

Notice how the median and quartiles are less affected by the outlier than the mean.
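
The summary above can be reproduced with a short sketch like the one below. It assumes the six exam scores from the earlier example; the name `hinges` is illustrative and not from the original slides. `summary()` uses R's default quantile method, while `fivenum()` returns Tukey's hinges, which can differ from the quartiles in small samples.

```r
# Exam scores from the earlier example, including the outlier of 200
scores2 <- c(70, 75, 80, 85, 90, 200)

# Min, Q1, median, mean, Q3, max (quartiles use R's default quantile method)
summary(scores2)

# Tukey's five-number summary: lower extreme, lower hinge, median,
# upper hinge, upper extreme
hinges <- fivenum(scores2)
hinges
```

For these six scores the hinges come out as 75 and 90, while `summary()` reports quartiles of 76.25 and 88.75. Both are reasonable; they simply use different conventions for splitting a small sample.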
---

## Visualizing Data: Boxplot

- Graphical representation of the 5-number summary
- Helps identify potential outliers

---

## Understanding Boxplots

- The "box" represents the middle 50% of the data
- The line in the box is the median
- "Whiskers" usually extend to the smallest and largest values that are not flagged as outliers
- Points beyond the whiskers are potential outliers

---

## Creating a Boxplot in R

Let's visualize our exam scores:

<img src="data:image/png;base64,#outliers_files/figure-html/unnamed-chunk-10-1.png" width="90%" style="display: block; margin: auto;" />

---

## Identifying Outliers: IQR Method

IQR = Interquartile Range = Q3 - Q1

Potential outliers are:

- Below: Q1 - 1.5 * IQR
- Above: Q3 + 1.5 * IQR
- This creates "fences" beyond which values are considered potential outliers.

---

## Example: Applying the IQR Method

Let's apply this to our exam scores:

```
## Lower fence: 57.5
```

```
## Upper fence: 107.5
```

Is 200 an outlier according to this method?

---

# Other Rules for Identifying Outliers

- More than 2.5 standard deviations from the mean
- More complex algorithms
- [Illustration with Rshiny](http://projects.rajivshah.com/shiny/outlier/)

---

## Beeswarm Boxplot

A fun variant of the boxplot that shows individual data points:

---

## Creating a Beeswarm Boxplot in R

<img src="data:image/png;base64,#outliers_files/figure-html/unnamed-chunk-12-1.png" width="90%" style="display: block; margin: auto;" />

---

## Why Use Robust Methods?

1. They're less affected by outliers
2. They give a more typical picture of your data
3. They can help prevent incorrect conclusions

Remember: The goal is to understand your data, not just calculate numbers!

---

## Robust vs. Non-Robust Methods

| What we want to know | Non-Robust | Robust |
|----------------------|------------|--------|
| Typical value | Mean | Median |
| Spread of data | Standard Deviation | Interquartile Range |
| Relationship between variables | Pearson Correlation | Spearman Correlation |

---

## Real-World Example: Movie Ratings

Imagine you're analyzing movie ratings (1-10 scale):

Movie A: 7, 8, 8, 9, 10

Movie B: 1, 8, 8, 9, 10

.pull-left[
Mean ratings:

```
## Movie A: 8.4
```

```
## Movie B: 7.2
```
]

.pull-right[
Median ratings:

```
## Movie A: 8
```

```
## Movie B: 8
```
]

- Which method gives a better picture of typical ratings?

---

# If Time Permits:

## Class Discussion: Outliers in Different Fields

In small groups, discuss:

1. In your major, what kinds of outliers might you encounter?
2. How might outliers affect conclusions in your field of study?
3. Can you think of situations where outliers might be the most interesting part of the data?

---

## Wrapping Up: Key Points

1. Outliers can meaningfully impact our analyses
2. Always visualize your data to spot potential outliers
3. Consider why outliers might occur in your data
4. Be transparent about how you handle outliers
5. Robust methods can provide valuable insights, especially with messy real-world data
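
---

## Reproducing the IQR Fences in R

The fences from the exam-score example can be reproduced with a minimal sketch like this one. The variable names (`q1`, `q3`, `lower_fence`, `upper_fence`, `flagged`) are illustrative and not from the original slides, and the quartiles assume R's default `quantile()` method.

```r
# Exam scores from the earlier example, including the outlier of 200
scores2 <- c(70, 75, 80, 85, 90, 200)

# Quartiles using R's default quantile method
q1 <- quantile(scores2, 0.25, names = FALSE)
q3 <- quantile(scores2, 0.75, names = FALSE)
iqr <- q3 - q1  # equivalent to IQR(scores2)

# Fences at 1.5 * IQR beyond the quartiles
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
cat("Lower fence:", lower_fence, "\n")
cat("Upper fence:", upper_fence, "\n")

# Values outside the fences are potential outliers
flagged <- scores2[scores2 < lower_fence | scores2 > upper_fence]
flagged
```

With these scores only 200 falls outside the fences, so it is the one value flagged for a closer look. Flagging is a prompt to investigate, not an automatic reason to delete.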