Sampling in Action: The M&M Challenge

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://psychmethods.github.io/coursenotes/" target="_blank">Methods in Psychological Research</a>
</span>
</div>

---

# Sampling in Action: The M&M Challenge

---

## Roadmap

- Activity: count and enter M&M colors (15 min)
- Class analysis: live charts from your data (20 min)
- Concepts: why sample size and design matter (10 min)
- Wrap-up and takeaways (5 min)

---

# Quick refresher: what is sampling?

.pull-left[
- We study a part (sample) to learn about the whole (population).
- Samples differ from each other; that’s normal variability.
- Our goal today: estimate each color’s percentage in the wider “population” of candies.
- We’ll see that combining more bags leads to more stable class estimates.
]

.pull-right[
<img src="data:image/png;base64,#../img/Population_versus_sample_(statistics).png" width="100%" style="display: block; margin: auto;" />
]
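
---

## Sketch: from counts to percentages

A minimal sketch of the estimate described above, assuming two made-up bags (the counts are illustrative, not class data). It shows how two samples can disagree and how pooling them gives a single estimate.

``` r
# Hypothetical counts for two bags (illustrative numbers only)
bag_a <- c(Blue = 4, Orange = 3, Green = 3, Red = 2, Yellow = 3, Brown = 1)
bag_b <- c(Blue = 2, Orange = 5, Green = 2, Red = 3, Yellow = 2, Brown = 3)

# Share of each color within each bag (each sample's own estimate)
prop_a <- bag_a / sum(bag_a)
prop_b <- bag_b / sum(bag_b)

# Pooled estimate: add the counts first, then divide by the combined total
pooled <- (bag_a + bag_b) / sum(bag_a + bag_b)

round(100 * rbind(prop_a, prop_b, pooled), 1)
```

Pooling adds counts rather than averaging percentages, so larger bags contribute more to the estimate.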

---

## M&M Sampling Activity

- Objective: Demonstrate sampling principles using M&M's
  - Hands-on experience with data collection and analysis
--

- Materials:
  - Small packages of plain M&M's (one per student)
  - Napkins for sorting
--

- Outcome: start with your bag, then build to a class-wide estimate

---

## M&M Sampling Procedure

- Distribute M&M packages and materials
--

- Sort M&M's by color onto your napkin
--

- Count each color: Blue, Orange, Green, Red, Yellow, Brown.
--

- Enter your counts (raw numbers; 0 if a color is absent).
--

- Hypothesize the population's color distribution
--

- Compare with a partner: are your percentages similar?
--

- We’ll pool the entire class and visualize the results
  - using Google Sheets (and some R magic)
---

# What to expect...

---

# Analysis in Action

- What we'll get from the class data

---

# Source Code

``` r
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
set.seed(123) # For reproducibility

# Define the students and the colors
students <- c("Tukey", "Gauss", "Noether", "Fisher", 
              "Bayes", "Pearson", "Student", 
              "Fiducial", "Neyman", "Cochran")
base_cols <- c(
  Blue   = "#0072CE",
  Orange = "#FF7F00",
  Green  = "#3CB043",
  Red    = "#E41A1C",
  Yellow = "#FFD700",
  Brown  = "#7B3F00"
)
colors <- names(base_cols)

# Plot theme: `thm` is not defined in this chunk, so a minimal theme is assumed here
thm <- theme_minimal()

# Simulate the total number of M&Ms in each student's bag
bag_sizes <- sample(15:20, length(students), replace = TRUE)

# Simulate the count of each color per bag; each row sums to that bag's size
# (equal color probabilities are a simplifying assumption)
color_counts <- t(sapply(bag_sizes, function(n) {
  rmultinom(1, size = n, prob = rep(1, length(colors)))
}))

# Create the data frame
df_syn <- data.frame(Name = students, 
                     color_counts)

colnames(df_syn)[-1] <- colors

# Add each bag's total
df_syn <- df_syn %>%
  mutate(Total = Blue + Orange + Green + Red + Yellow + Brown)

# Long format, with each color's share within its bag
df_long_syn <- df_syn %>%
  pivot_longer(Blue:Brown, names_to="Color", values_to="Count") %>%
  group_by(Name) %>% mutate(Prop = Count/sum(Count)) %>% ungroup()

# Plot each bag's composition
p_syn_bags <- ggplot(df_long_syn, aes(Name, Prop, fill = Color)) +
  geom_col() +
  scale_fill_manual(values = base_cols) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Each bag’s color composition (synthetic)", x = NULL, y = "Share") +
  thm + theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Overall distribution of M&Ms across all six colors
overall_distribution <- df_syn %>%
  select(all_of(colors)) %>%
  summarise(across(everything(), sum)) %>%
  pivot_longer(cols = everything(), names_to = "Color", values_to = "Count")

# Plot the pooled totals by color
p_syn_total <- df_syn %>%
  summarise(across(Blue:Brown, sum)) %>%
  pivot_longer(everything(), names_to = "Color", values_to = "Total") %>%
  ggplot(aes(Color, Total, fill = Color)) +
  geom_col() +
  scale_fill_manual(values = base_cols, guide = "none") +
  labs(title = "Total M&Ms by color (synthetic)", x = NULL, y = "Total") +
  thm
p_syn_bags
```

``` r
# To display both plots side by side:
# gridExtra::grid.arrange(p_syn_bags, p_syn_total, ncol = 2)
```


---

# Data Collection

.pull-left[
.center[
<img src="data:image/png;base64,#sampling_files/figure-html/unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" />

Scan to input your data!
]
]

.pull-right[
Checklist:
- Each row = one bag (add your name/initials)
- Enter raw counts (not percentages)
- Include all six colors (use 0 if needed)
- Submit once (no duplicates)

Discussion prompts:
- Which colors have the widest ranges? What determines the width?
- How would doubling N change these intervals?
]
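
---

## Sketch: checking entries against the checklist

A minimal sketch of how the checklist could be verified before analysis. The `entries` data frame and its column names are hypothetical; the real sheet may be laid out differently.

``` r
colors <- c("Blue", "Orange", "Green", "Red", "Yellow", "Brown")

# Hypothetical entries: "CD" submitted twice, "EF" entered proportions
entries <- data.frame(
  Name   = c("AB", "CD", "CD", "EF"),
  Blue   = c(4, 3, 3, 0.25),
  Orange = c(3, 5, 5, 0.20),
  Green  = c(3, 2, 2, 0.15),
  Red    = c(2, 0, 0, 0.15),
  Yellow = c(3, 4, 4, 0.15),
  Brown  = c(1, 3, 3, 0.10)
)

counts <- as.matrix(entries[, colors])

# Raw counts are non-negative whole numbers with no missing colors
is_count <- apply(counts, 1, function(x) all(!is.na(x) & x >= 0 & x == round(x)))

# Repeated submissions under the same name (the first one is kept)
is_dup <- duplicated(entries$Name)

clean <- entries[is_count & !is_dup, ]
clean
```

This check cannot catch percentages entered as whole numbers (for example 25 instead of 4), so a glance at the row totals is still worthwhile.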

---

## Analysis in action

Discussion prompts:
- Which colors have the widest intervals? What determines the width?
- How would doubling N change these intervals?
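
---

## Sketch: what sets the interval width?

A minimal sketch for the prompts above, assuming a normal-approximation (Wald) 95% interval for a proportion; the slides do not specify which interval the class charts use. The values of `p_hat` and `n` are hypothetical.

``` r
# Width of a normal-approximation (Wald) interval for a proportion
ci_width <- function(p, n, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  2 * z * sqrt(p * (1 - p) / n)
}

p_hat <- 0.20   # hypothetical share for one color
n     <- 170    # hypothetical number of candies pooled

w1 <- ci_width(p_hat, n)
w2 <- ci_width(p_hat, 2 * n)

w2 / w1   # doubling N scales the width by 1 / sqrt(2), about 0.71
```

Under this formula the width depends on two things: the share itself (widest near 50%) and the number of candies counted.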

---

# What do our samples look like?

---

# Class pooled counts

---

# Sample Size Effects

.pull-left-narrow[
<table class="table table-striped table-hover table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Sample size </th>
   <th style="text-align:left;"> Estimate stability </th>
   <th style="text-align:left;"> Bag-to-bag variation </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;font-weight: bold;"> Individual </td>
   <td style="text-align:left;"> Low </td>
   <td style="text-align:left;"> High </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> Paired </td>
   <td style="text-align:left;"> Medium </td>
   <td style="text-align:left;"> Medium </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> Class-wide </td>
   <td style="text-align:left;"> High </td>
   <td style="text-align:left;"> Low </td>
  </tr>
</tbody>
</table>
]
--

.pull-right[
- One bag can look quite different from another.
- Pooling more bags smooths out that variation.
- If we doubled the number of bags again, what would you expect to see in the pooled chart?
]
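
---

## Sketch: simulating the sample size effect

A minimal simulation of the pattern in the table, under stated assumptions: every bag holds 17 candies and the true Blue share is 1/6. Both numbers are illustrative, not manufacturer figures.

``` r
set.seed(42)  # reproducibility

candies_per_bag <- 17     # assumed bag size
p_blue <- 1 / 6           # assumed true share, for illustration only

# Pooled Blue share for many hypothetical groups of `n_bags` bags
pooled_share <- function(n_bags, reps = 2000) {
  n <- n_bags * candies_per_bag
  rbinom(reps, size = n, prob = p_blue) / n
}

# Standard deviation of the estimate at each level of pooling
spread <- sapply(c(Individual = 1, Paired = 2, `Class-wide` = 20),
                 function(k) sd(pooled_share(k)))
round(spread, 3)
```

The spread shrinks as more bags are pooled, which is the "Low / Medium / High" stability pattern in numeric form.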

---

# How our estimate stabilizes
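
A minimal sketch of a running estimate, using simulated bags rather than class data. It assumes bag sizes of 15 to 20 candies and a true Blue share of 1/6, both chosen only for illustration.

``` r
set.seed(1)

n_bags    <- 30
bag_sizes <- sample(15:20, n_bags, replace = TRUE)        # assumed bag sizes
blue      <- rbinom(n_bags, size = bag_sizes, prob = 1/6) # assumed true share

# Pooled Blue share after the first 1, 2, ..., n_bags bags
running_share <- cumsum(blue) / cumsum(bag_sizes)

plot(seq_len(n_bags), running_share, type = "b",
     xlab = "Bags pooled", ylab = "Estimated Blue share")
abline(h = 1/6, lty = 2)  # the assumed true share
```

Early points tend to swing widely; later points move less because each new bag is a smaller fraction of the pooled total.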

---

# Advanced Sampling Concepts

---

## Relating to Sampling Methods

.pull-left[
- Simple random sampling
    - Each M&M package treated as a random draw from all packages
- Stratified sampling
    - If we grouped M&M bags by production date and sampled from each group,
    - Could this improve representativeness?
]
--

.pull-right[
- Cluster sampling
  - If we sampled entire boxes of M&M packages
  - Potential production batch effects?

- Systematic sampling
  - If we selected every nth M&M package from the production line
  - Could this introduce cyclical biases?
]
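
---

## Sketch: four ways to draw the bags

A minimal sketch of the four designs, assuming a hypothetical sampling frame of 120 bags packed 10 per box across 3 production weeks. The frame and its column names are invented for illustration.

``` r
set.seed(7)

# Hypothetical sampling frame: 120 bags, 10 per box, 40 per production week
frame <- data.frame(
  bag  = 1:120,
  box  = rep(1:12, each = 10),
  week = rep(c("wk1", "wk2", "wk3"), each = 40)
)

# Simple random sampling: every bag has the same chance of selection
srs <- frame[sample(nrow(frame), 12), ]

# Stratified sampling: draw 4 bags at random within each production week
strat <- do.call(rbind, lapply(split(frame, frame$week),
                               function(d) d[sample(nrow(d), 4), ]))

# Cluster sampling: pick 2 boxes at random and take every bag inside them
clus <- frame[frame$box %in% sample(unique(frame$box), 2), ]

# Systematic sampling: random start, then every 10th bag
start <- sample(10, 1)
sys   <- frame[seq(start, nrow(frame), by = 10), ]
```

In this frame the systematic step (10) matches the box size, so `sys` always takes the bag in the same position of every box, a small example of how a cyclical pattern could bias the sample.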

---

# Potential Biases in M&M Sampling

.pull-left[
- Production process biases
    - Color distribution variations between factories
    - Akin to sampling frame bias in surveys
    
- Selection bias
  - If students choose a package because they like its color
  - Akin to non-random sample selection in research    
  ]
  
.pull-right[

- Measurement bias
  - Errors in counting or recording M&M colors
  - Akin to survey response errors

- Non-response bias
  - If some students don't participate, or eat their M&M's before counting
  - Akin to survey non-respondents
]

---

# Importance of Representative Samples

.pull-left[
- What if we only sampled from one factory?
- Implications for psychological research
    - Generalizing from sample to population
    - External validity of research findings
    ]
    
--

.pull-right[
- Strategies for improving representativeness
  - Increasing sample size (more M&M packages)
  - Diversifying sample sources (different stores, batches)
  - Random selection procedures
  - Weighting techniques for unequal probability samples
]
---

# Wrapping Up...

.pull-left[
## Key Takeaways
1. Sampling lets us learn about populations with manageable data.
2. More data (more bags) leads to more stable estimates.
3. How we sample matters: source, selection, and recording affect results.
4. We can make smarter designs to improve generalization.
]

.pull-right[
![M&Ms](data:image/png;base64,#../img/reddit.png)
.center[.footnote[Source: https://www.reddit.com/r/dataisbeautiful/comments/10wxrh9/oc_distribution_of_mms_by_color_in_3lb_mms_jar/]
]
]