class: center, middle, inverse, title-slide

.title[
# Sampling in Action: The M&M Challenge
]
.author[
### S. Mason Garrison
]

---
layout: true

<div class="my-footer">
<span>
<a href="https://psychmethods.github.io/coursenotes/" target="_blank">Methods in Psychological Research</a>
</span>
</div>

---
class: middle

# Sampling in Action: The M&M Challenge

---

## Roadmap

- Activity: count and enter M&M colors (15 min)
- Class analysis: live charts from your data (20 min)
- Concepts: why sample size and design matter (10 min)
- Wrap-up and takeaways (5 min)

---

# Quick refresher: what is sampling?

.pull-left[
- We study a part (sample) to learn about the whole (population).
- Samples differ from each other; that’s normal variability.
- Our goal today: estimate each color’s percentage in the wider “population” of candies.
- We’ll see that combining more bags leads to more stable class estimates.
]

.pull-right[
<img src="data:image/png;base64,#../img/Population_versus_sample_(statistics).png" width="100%" style="display: block; margin: auto;" />
]

---

## M&M Sampling Activity

- Objective: Demonstrate sampling principles using M&M's
  - Hands-on experience with data collection and analysis

--

- Materials:
  - Small packages of plain M&M's (one per student)
  - Napkins for sorting

--

- Outcome: start with your bag, then build to a class-wide estimate

---

## M&M Sampling Procedure

- Distribute M&M packages and materials

--

- Sort M&M's by color onto your napkin

--

- Count each color: Blue, Orange, Green, Red, Yellow, Brown.

--

- Enter your counts (raw numbers; 0 if a color is absent).

--

- Hypothesize population color distribution

--

- Compare with a partner: are your percentages similar?

--

- We’ll pool the entire class and visualize the results
  - using Google Sheets (and some R magic)

---
class: middle

# What to expect...
---

# Analysis in Action

- What we'll get from the class data

<img src="data:image/png;base64,#sampling_files/figure-html/unnamed-chunk-3-1.png" width="65%" style="display: block; margin: auto;" />

---

# Source Code

.tiny[

``` r
library(dplyr)     # %>%, mutate(), summarise(), across()
library(tidyr)     # pivot_longer()
library(ggplot2)   # plotting
library(gridExtra) # only needed for the optional side-by-side layout below

set.seed(123) # For reproducibility

# Stand-in for the deck's shared plot theme; `thm` is defined outside this
# chunk in the original slides, so this definition is an assumption
thm <- theme_minimal()

# Define the students and colors
students <- c("Tukey", "Gauss", "Noether", "Fisher", "Bayes", "Pearson",
              "Student", "Fiducial", "Neyman", "Cochran")
base_cols <- c(
  Blue = "#0072CE", Orange = "#FF7F00", Green = "#3CB043",
  Red = "#E41A1C", Yellow = "#FFD700", Brown = "#7B3F00"
)
colors <- names(base_cols)

# Simulate the total number of M&Ms for each student
bag_sizes <- sample(15:20, length(students), replace = TRUE)

# Simulate the counts of each color for each student by splitting each bag's
# total across the six colors, so every row sums to that student's bag size.
# (The earlier `sample(1:bag_sizes, ...)` used only the first bag size.)
# Equal color probabilities are a simplifying assumption.
color_counts <- t(sapply(
  bag_sizes,
  function(n) rmultinom(1, size = n, prob = rep(1, length(colors)))
))

# Create the dataframe (one row per student, one column per color)
df_syn <- data.frame(Name = students, color_counts)
colnames(df_syn)[-1] <- colors

# Calculate the totals and percentages
df_syn <- df_syn %>%
  mutate(Total = Blue + Orange + Green + Red + Yellow + Brown)

df_long_syn <- df_syn %>%
  pivot_longer(Blue:Brown, names_to = "Color", values_to = "Count") %>%
  group_by(Name) %>%
  mutate(Prop = Count / sum(Count)) %>%
  ungroup()

# Plotting the data: one stacked bar per bag
p_syn_bags <- ggplot(df_long_syn, aes(Name, Prop, fill = Color)) +
  geom_col() +
  scale_fill_manual(values = base_cols) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Each bag’s color composition (synthetic)", x = NULL, y = "Share") +
  thm +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Overall distribution of M&Ms (all six colors, including Brown)
overall_distribution <- df_syn %>%
  select(Blue:Brown) %>%
  summarise(across(everything(), sum)) %>%
  pivot_longer(cols = everything(), names_to = "Color", values_to = "Count")

# Pooled totals by color across all bags
p_syn_total <- df_syn %>%
  summarise(across(Blue:Brown, sum)) %>%
  pivot_longer(everything(), names_to = "Color", values_to = "Total") %>%
  ggplot(aes(Color, Total, fill = Color)) +
  geom_col() +
  scale_fill_manual(values = base_cols, guide = "none") +
  labs(title = "Total M&Ms by color (synthetic)", x = NULL, y = "Total") +
  thm

p_syn_bags
```

``` r
#gridExtra::grid.arrange(p_syn_bags, p_syn_total, ncol = 2) # Display both plots
#grid.arrange(stacked_plot, overall_plot, ncol = 2)
```
]

---

# Data Collection

.pull-left[
.center[
<img src="data:image/png;base64,#sampling_files/figure-html/unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" />

Scan to input your data!

.footnote[https://docs.google.com/spreadsheets/d/1D4i8e0pTrqwLk_FjMFtkimqhmtOrBf-X6OU9RT57m_Q]
]]

--

.pull-right[
Checklist:

- Each row = one bag (add your name/initials)
- Enter raw counts (not percentages)
- Include all six colors (use 0 if needed)
- Submit once (no duplicates)

Discussion prompts:

- Which colors have the widest intervals? What determines the width?
- How would doubling N change these intervals?
]

---

## Analysis in action

Discussion prompts:

- Which colors have the widest intervals? What determines the width?
- How would doubling N change these intervals?

---

# What do our samples look like?
<img src="data:image/png;base64,#sampling_files/figure-html/unnamed-chunk-6-1.png" width="65%" style="display: block; margin: auto;" />

---

# Class pooled counts

<img src="data:image/png;base64,#sampling_files/figure-html/unnamed-chunk-7-1.png" width="55%" style="display: block; margin: auto;" />

---

# Sample Size Effects

.pull-left-narrow[
<table class="table table-striped table-hover table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Sample size </th>
   <th style="text-align:left;"> Estimate stability </th>
   <th style="text-align:left;"> Bag-to-bag variation </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;font-weight: bold;"> Individual </td>
   <td style="text-align:left;"> Low </td>
   <td style="text-align:left;"> High </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> Paired </td>
   <td style="text-align:left;"> Medium </td>
   <td style="text-align:left;"> Medium </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> Class-wide </td>
   <td style="text-align:left;"> High </td>
   <td style="text-align:left;"> Low </td>
  </tr>
</tbody>
</table>
]

--

.pull-right[
- One bag can look quite different from another.
- Pooling more bags smooths out that variation.
- If we doubled the number of bags again, what would you expect to see in the pooled chart?
]

---

<img src="data:image/png;base64,#sampling_files/figure-html/unnamed-chunk-8-1.png" width="90%" height="65%" style="display: block; margin: auto;" />

---

# How our estimate stabilizes

<img src="data:image/png;base64,#sampling_files/figure-html/unnamed-chunk-9-1.png" width="90%" height="65%" style="display: block; margin: auto;" />

---
class: middle

# Advanced Sampling Concepts

---

## Relating to Sampling Methods

.pull-left[
- Simple random sampling
  - Each M&M package as a random sample
- Stratified sampling
  - If we sorted M&M bags by production date,
  - Could this improve representativeness?
]

--

.pull-right[
- Cluster sampling
  - If we sampled entire boxes of M&M packages
  - Potential production batch effects?
- Systematic sampling
  - If we selected every nth M&M package from the production line
  - Could this introduce cyclical biases?
]

---

# Potential Biases in M&M Sampling

.pull-left[
- Production process biases
  - Color distribution variations between factories
  - Akin to sampling frame bias in surveys
- Selection bias
  - If students choose their favorite color of package
  - Akin to non-random sample selection in research
]

.pull-right[
- Measurement bias
  - Errors in counting or recording M&M colors
  - Akin to survey response errors
- Non-response bias
  - If some students don't participate, or eat their M&M's before counting
  - Akin to survey non-respondents
]

---

# Importance of Representative Samples

.pull-left[
- What if we only sampled from one factory?
- Implications for psychological research
  - Generalizing from sample to population
  - External validity of research findings
]

--

.pull-right[
- Strategies for improving representativeness
  - Increasing sample size (more M&M packages)
  - Diversifying sample sources (different stores, batches)
  - Random selection procedures
  - Weighting techniques for unequal probability samples
]

---

# Wrapping Up...

.pull-left[
## Key Takeaways

1. Sampling lets us learn about populations with manageable data.
2. More data (more bags) leads to more stable estimates.
3. How we sample matters: source, selection, and recording affect results.
4. We can make smarter designs to improve generalization.
]

.pull-right[

.center[.footnote[Source: https://www.reddit.com/r/dataisbeautiful/comments/10wxrh9/oc_distribution_of_mms_by_color_in_3lb_mms_jar/]
]
]
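
---

# Sketch: why more bags stabilize the estimate

Takeaway 2 and the "doubling N" discussion prompt can be checked with a short simulation. This is a minimal sketch in base R, not the code behind the class charts. The equal color shares, the 40 bags, and the 18 candies per bag are illustrative assumptions, and the names `true_p`, `n_bags`, `bag_size`, `blue_pooled`, and `se_pooled` are made up for this example.

``` r
set.seed(2024) # arbitrary seed, for reproducibility

# Hypothetical "true" color shares: equal shares are an assumption,
# not the manufacturer's actual mix
colors <- c("Blue", "Orange", "Green", "Red", "Yellow", "Brown")
true_p <- rep(1 / length(colors), length(colors))

n_bags   <- 40 # assumed number of bags in the class
bag_size <- 18 # assumed candies per bag

# Draw every bag at once: one row per color, one column per bag
bags <- rmultinom(n_bags, size = bag_size, prob = true_p)
rownames(bags) <- colors

# Share of blue within each single bag
blue_single <- bags["Blue", ] / bag_size

# Share of blue after pooling the first k bags, for k = 1, ..., n_bags
k <- seq_len(n_bags)
blue_pooled <- cumsum(bags["Blue", ]) / (bag_size * k)

# Theoretical standard error of the pooled share after k bags:
# sqrt(p * (1 - p) / N), where N = bag_size * k candies
se_pooled <- sqrt(true_p[1] * (1 - true_p[1]) / (bag_size * k))

# Compare single-bag spread with the pooled estimate at a few pool sizes
checkpoints <- c(1, 5, 10, 20, 40)
print(range(blue_single))
print(data.frame(
  bags_pooled = checkpoints,
  blue_share  = round(blue_pooled[checkpoints], 3),
  std_error   = round(se_pooled[checkpoints], 3)
))
```

Under these assumptions the standard error depends only on the total number of candies pooled, so doubling N shrinks it by a factor of about 1.41 (the square root of 2), and halving it takes four times as many bags.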