class: center, middle, inverse, title-slide

.title[
# Bivariate Relationships
]
.author[
### S. Mason Garrison
]

---
layout: true

<div class="my-footer">
<span>
<a href="https://psychmethods.github.io/coursenotes/" target="_blank">Methods in Psychological Research</a>
</span>
</div>

---
class: middle

# Relationships Between Variables

---
# Bivariate

- So far, we have been analyzing summary statistics that describe aspects of a single list of numbers
- Frequently, however, we are interested in how variables behave together
  - Measuring the relationship (correlation)
  - Modeling the relationship (regression)

---
# Motivating Example: Smoking and Lung Capacity

.pull-left[
- Suppose, for example, we wanted to investigate the relationship between cigarette smoking and lung capacity
- We might ask a group of people about their smoking habits and measure their lung capacities
]

.pull-right[
<table class="table table-striped table-hover" style="color: black; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:right;"> cigarettes </th>
<th style="text-align:right;"> lung_capacity </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 45 </td>
</tr>
<tr>
<td style="text-align:right;"> 5 </td>
<td style="text-align:right;"> 42 </td>
</tr>
<tr>
<td style="text-align:right;"> 10 </td>
<td style="text-align:right;"> 33 </td>
</tr>
<tr>
<td style="text-align:right;"> 15 </td>
<td style="text-align:right;"> 31 </td>
</tr>
<tr>
<td style="text-align:right;"> 20 </td>
<td style="text-align:right;"> 29 </td>
</tr>
</tbody>
</table>
]

---
# Visualizing

.pull-left-narrow[
- With R or any other statistics software, we can produce a scatterplot like this one.
]

.pull-right-wide[

``` r
plot(smoking)
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/unnamed-chunk-2-1.png" width="90%" style="display: block; margin: auto;" />
]

---
# Or this one

.pull-left[
- A scatterplot shows the relationship between two quantitative variables that are measured on the same individuals.
  - The values of one variable appear on the horizontal axis,
  - and the values of the other variable appear on the vertical axis.
- Each individual in the data appears as a point in the plot, located by the values of both variables for that individual.
]

.pull-right[

``` r
library(car)
scatterplot(lung_capacity ~ cigarettes, data = smoking)
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto;" />
]

---
# Or this one

.pull-left[
- Always plot the explanatory variable,
  - if there is one, on the horizontal axis (the x-axis) of a scatterplot
- If there is no explanatory-response distinction,
  - either variable can go on the horizontal axis.

.small[

``` r
library(ggplot2)
library(ggExtra)

# classic plot
p <- ggplot(smoking,
            aes(x = cigarettes, y = lung_capacity,
                color = cigarettes, size = cigarettes)) +
  geom_point() +
  theme_minimal() +
  theme(legend.position = "none")
```
]
]

.pull-right[
<img src="data:image/png;base64,#11_Bivariate_files/figure-html/unnamed-chunk-4-1.png" width="85%" style="display: block; margin: auto;" />
]

---
# Or this one

.pull-left-narrow[
- We can see from the graph that as smoking goes up, lung capacity tends to go down.
]

.pull-right-wide[

``` r
# with marginal histogram
ggMarginal(p, type = "histogram")
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/hist-1.png" width="75%" style="display: block; margin: auto;" />
]

---
# Or this one

.pull-left-narrow[
- Here, the two variables covary in opposite directions.
- This is a negative relationship.
]

.pull-right-wide[

``` r
# marginal density
ggMarginal(p, type = "density")
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/density-1.png" width="75%" style="display: block; margin: auto;" />
]

---
# Scatter Plots

.pull-left[
- In any graph of data, look for the overall pattern and for striking deviations from that pattern.
- You can describe the overall pattern of a scatterplot by the
  - direction (positive or negative),
  - form, and
  - strength of the relationship.
]

.pull-right[

``` r
# marginal boxplot
ggMarginal(p, type = "boxplot")
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/boxplot-1.png" width="90%" style="display: block; margin: auto;" />
]

- More examples: https://www.statmethods.net/graphs/scatterplot.html

---
# Direction of Association

.pull-left[
- Two variables are **positively associated** when above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together.
- Two variables are **negatively associated** when above-average values of one tend to accompany below-average values of the other, and vice versa.
]

--

.pull-right[
<img src="data:image/png;base64,#11_Bivariate_files/figure-html/direction_association_plot-1.png" width="65%" style="display: block; margin: auto;" />
]

---
# Strength of the Relationship

- We now examine two statistics for quantifying how variables covary:
  - covariance and
  - correlation.

--

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/strength-1.png" width="55%" style="display: block; margin: auto;" />

---
class: middle

# Covariance

---
# Covariance

- Covariance is a measure of how much two random variables vary together.
- When two variables covary in opposite directions, as smoking and lung capacity do,

--

  - values tend to be on opposite sides of their group means.
  - That is, when smoking is above its group mean,
  - lung capacity tends to be below its group mean.

--

- Consequently, by averaging the product of deviation scores, we can obtain a measure of how the variables vary together.

---
# Sample Covariance

- Sample covariance between X and Y is an estimate of the average (cross-)product of deviation scores in the population

`\(s_{x,y}=\frac{1}{n-1}\sum\limits^{n}_{i=1}(x_{i}-\bar{x})(y_{i}-\bar{y})\)`

- A more computationally convenient formula:

`\(s_{x,y}=\frac{1}{n-1}(\sum\limits^{n}_{i=1}x_{i}y_{i}-\frac{\sum^{n}_{i=1}x_{i}\sum^{n}_{i=1}y_{i}}{n})\)`

- Useful fact: the variance of a variable is its covariance with itself.

`\(s^{2}_{x}=s_{x,x}=\frac{1}{n-1}\sum\limits^{n}_{i=1}((x_{i}-\bar{x})(x_{i}-\bar{x}))\)`

---
# Computing Covariance

- If we want to compute the covariance for smoking and lung capacity, we can save ourselves some heartache by letting R do the arithmetic.
- Rather than computing each step by hand, we use R to find each deviation from the mean for x and y

.midi[

``` r
covariance_calc <- data.frame(X = smoking$cigarettes, Y = smoking$lung_capacity)
# find x_i - x_bar
covariance_calc$dx <- covariance_calc$X - mean(covariance_calc$X)
# find y_i - y_bar
covariance_calc$dy <- covariance_calc$Y - mean(covariance_calc$Y)
```
]

- and then find their products

.midi[

``` r
# cross multiply x deviations and y deviations
covariance_calc$dxdy <- covariance_calc$dx * covariance_calc$dy
# cross multiply X and Y
covariance_calc$xy <- covariance_calc$X * covariance_calc$Y
```
]

---
# Computing Covariance

.pull-left[
- Giving these values

``` r
kable(covariance_calc)
```

|  X|  Y|  dx| dy| dxdy|  xy|
|--:|--:|---:|--:|----:|---:|
|  0| 45| -10|  9|  -90|   0|
|  5| 42|  -5|  6|  -30| 210|
| 10| 33|   0| -3|    0| 330|
| 15| 31|   5| -5|  -25| 465|
| 20| 29|  10| -7|  -70| 580|
]

--

.pull-right[
- We'd then sum the dxdy column to get -215, and compute the covariance as:

`\(s_{x,y}=\frac{1}{n-1}\sum\limits^{n}_{i=1}(x_{i}-\bar{x})(y_{i}-\bar{y})= \frac{-215}{5-1}=-53.75\)`
]

---
# Alternative Computation

- We can instead compute `\(\sum x=50\)`, `\(\sum y=180\)`, `\(\sum xy=1585\)`, and n, and use that easier equation

`\(s_{x,y}=\frac{1}{n-1}(\sum x_{i}y_{i}-\frac{\sum x_{i}\sum y_{i}}{n})\)`

`\(=\frac{1}{5-1}(1585-\frac{50\times180}{5})\)`

`\(=\frac{1}{4}(1585-\frac{9000}{5})\)`

`\(=\frac{1}{4}(1585-1800)\)`

`\(=\frac{1}{4}(-215)\)`

`\(=-53.75\)`

---
# Covariance in R

``` r
cov(smoking)
```

```
##               cigarettes lung_capacity
## cigarettes         62.50        -53.75
## lung_capacity     -53.75         50.00
```

``` r
cov(smoking)[1,2]
```

```
## [1] -53.75
```

``` r
var(smoking$cigarettes)
```

```
## [1] 62.5
```

---
# Covariance

- Although covariance is an *extremely* important concept and is the cornerstone of many advanced methods (ANOVA, ANCOVA, SEM, regression, etc.), it has some limitations:
  - it has interpretation problems, just like variance
  - it isn't on a meaningful scale
  - it tells us whether a relationship is positive or negative,
  - but not much more than that.

---
# Covariance Scaling

- Is -53.75 a strong relationship?
  - It depends on the scale
- If we convert cigarettes into packs, the relationship hasn't changed,
  - but the covariance has...

--

``` r
smoking$packs <- smoking$cigarettes / 20
cov(smoking$packs, smoking$lung_capacity)
```

```
## [1] -2.6875
```

- The value is, in a sense, "polluted by the metric of the numbers."
- Depending on the scale of the data, the absolute value of the covariance can be very large or very small

---
# Standardizing Covariance

- We can take the scale out of the covariance.
- What happens if we use z-scores instead of raw deviations?
  - Remember that z-scores are also a measure of deviation
- The result is called the (Pearson) correlation coefficient

---
# Correlation Coefficient

- The sample correlation is the sum of cross-products of z-scores divided by n-1:

`\(r_{x,y}=\frac{1}{n-1}\sum\limits^{n}_{i=1}(Z_{x_{i}}\times Z_{y_{i}})\)`

- The population correlation is the sum of cross-products of z-scores divided by n:

`\(\rho_{x,y}=\frac{1}{n}\sum\limits^{n}_{i=1}(Z_{x_{i}}\times Z_{y_{i}})\)`

---
# Formulae for Correlation

- We can think of a correlation coefficient as a covariance with the standard deviations factored out:

`\(r_{x,y}=\frac{s_{x,y}}{s_{x}\times s_{y}}\)`

- Conversely, we can think of covariance as a correlation with the standard deviations put back in:

`\(s_{x,y}=r_{x,y}s_{x}s_{y}\)`
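
---
# Correlation in R

- A quick check of `\(r_{x,y}=\frac{s_{x,y}}{s_{x}\times s_{y}}\)` against R's built-in `cor()`
  - (a minimal sketch, assuming the five-row smoking data from earlier; the values in the comments are approximate)

``` r
# covariance divided by the product of the standard deviations
cov(smoking$cigarettes, smoking$lung_capacity) /
  (sd(smoking$cigarettes) * sd(smoking$lung_capacity))  # about -0.96

# built-in correlation: same value
cor(smoking$cigarettes, smoking$lung_capacity)          # about -0.96
```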

---
# Properties of r

.pull-left[
- Notation
  - Sample statistic: r
  - Population parameter: `\(\rho\)`
    - pronounced "row," but spelled rho
- Transformation
  - r is the same regardless of the units of measurement.
  - Height in inches and height in centimeters lead to the same r
]

.pull-right[
<img src="data:image/png;base64,#11_Bivariate_files/figure-html/galhist-1.png" width="95%" style="display: block; margin: auto;" />
]

---
# Properties of r

- r ranges from -1 to 1 [-1,1]
- |r| indicates the size of the relationship
  - r>0 indicates a positive relationship (0,1]
  - r<0 indicates a negative relationship [-1,0)
- r is symmetric
  - `\(r_{x,y}=r_{y,x}\)`

---
# Correlation Magnitude

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/mag-1.png" width="80%" style="display: block; margin: auto;" />

---
# Correlation Interpretation

.pull-left-wide.medi[
- Perfect
  - Exactly -1. A perfect downhill (negative) linear relationship
  - Exactly +1. A perfect uphill (positive) linear relationship
- Strong
  - -0.70. A strong downhill (negative) linear relationship
  - +0.70. A strong uphill (positive) linear relationship
- Moderate
  - -0.50. A moderate downhill (negative) relationship
  - +0.50. A moderate uphill (positive) relationship
- Small
  - -0.30. A small downhill (negative) linear relationship
  - +0.30. A small uphill (positive) linear relationship
]

.pull-right-narrow.medi[
- Develop your correlation intuition
  - [guessthecorrelation.com](http://guessthecorrelation.com/)
]

---
# Linear Relationship

- The correlation quantifies the linear association.

--

- If your data aren't linear, then the correlation can be misleading.

--

- If your data have outliers, then the correlation can be misleading.

---
# Anscombe's Quartet

- Anscombe's Quartet comprises four datasets that have nearly identical simple descriptive statistics, yet appear very different when graphed.

.pull-left[
<img src="data:image/png;base64,#11_Bivariate_files/figure-html/aqc-1.png" width="90%" style="display: block; margin: auto;" />
]

--

- These data were constructed by the statistician Francis Anscombe (1973).
- The quartet demonstrates the importance of graphing data before analyzing it, and the effect of outliers on statistical properties.
- It's intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough."

---
# Correlation Cautions

- Correlation requires that both variables be quantitative, so that it makes sense to do the arithmetic indicated by the formula for r.
- Correlation does not describe curved relationships between variables, no matter how strong the relationship between them is.
- Correlation is strongly affected by a few outlying observations.
- Correlation is not a complete summary of two-variable data.

---
# More Correlation Cautions

- Restriction of range
  - artificially lowers r,
  - so r underestimates `\(\rho\)`
- Combining groups can hide relationships when the groups have different relationships
  - Simpson's Paradox

---
# Simpson's Paradox

Simpson's Paradox occurs when a trend appears in different groups of data but disappears or reverses when these groups are combined. Let's explore this with a quick simulated sketch, then two examples:

1. Extraversion and Job Performance
2. Extraversion, Education, and Salary
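
---
# Simpson's Paradox: A Minimal Simulation

.medi[
- A minimal sketch, using made-up data (the group structure and coefficients are hypothetical, chosen only to make the sign flip visible):

``` r
set.seed(42)
# two groups; within each, y goes down as x goes up
g1 <- data.frame(x = rnorm(50, mean = 3), group = "A")
g1$y <- 10 - 2 * g1$x + rnorm(50)
g2 <- data.frame(x = rnorm(50, mean = 6), group = "B")
g2$y <- 25 - 2 * g2$x + rnorm(50)
pooled <- rbind(g1, g2)

cor(g1$x, g1$y)          # negative within group A (about -0.9)
cor(g2$x, g2$y)          # negative within group B (about -0.9)
cor(pooled$x, pooled$y)  # positive once the groups are combined
```

- Each group shows a negative relationship, but the pooled correlation is positive because group B sits higher on both x and y.
]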

---
# Example 1: Extraversion and Job Performance

.pull-left-wide[
<img src="data:image/png;base64,#11_Bivariate_files/figure-html/simpsons_example1-1.png" width="90%" style="display: block; margin: auto;" />
]

--

.pull-right-narrow[
- Left: Overall, there's a slight positive relationship between Extraversion and Performance.
- Right: When separated by job type, we see negative relationships within each group.
- This illustrates Simpson's Paradox: the trend in the overall data differs from the trends in the subgroups.
]

---
# Example 2: Extraversion, Education, and Salary

.pull-left-wide[
<img src="data:image/png;base64,#11_Bivariate_files/figure-html/simpsons_example2-1.png" width="90%" style="display: block; margin: auto;" />
]

--

.pull-right-narrow[
- Left: Overall, there's a positive relationship between Extraversion and Salary.
- Right: When separated by education level, we see negative relationships within each group.
- This is another example of Simpson's Paradox: the overall trend is reversed when we look at subgroups.
]

---
# Simpson's Paradox: Illustrated

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/Simposon-1.png" width="90%" style="display: block; margin: auto;" /><img src="data:image/png;base64,#11_Bivariate_files/figure-html/Simposon-2.png" width="90%" style="display: block; margin: auto;" />

---
# Final Comments on the Correlation

.pull-left-wide[
- The correlation coefficient is neither robust nor resistant.
  - Not robust, because strong nonlinear relationships between the two variables may not be recognized.
  - Not resistant, because it is sensitive to outlying points.
- It is the most sensitive summary statistic we have seen thus far.
]

<img src="data:image/png;base64,#../img/nonline.png" width="40%" style="display: block; margin: auto;" />

---
# Shared Variance: `\(R^{2}\)`

- How much do those correlated variables have in common?
  - [http://rpsychologist.com/d3/correlation/](http://rpsychologist.com/d3/correlation/)
- Common variance
  - i.e., shared variance
- Coefficient of determination
  - the percentage of variability between the two variables that has been accounted for
  - the remaining `\(1-R^{2}\)` of the variability is still unaccounted for
  - (a short computation appears in the appendix)

---
class: center, middle

# Wrapping Up...

---
# Magnitude Code

.small[

``` r
# Load necessary libraries
library(ggplot2)
library(MASS)
library(gridExtra)

# Function to create correlated data and plot
create_plot <- function(correlation) {
  Sigma <- matrix(c(1, correlation, correlation, 1), 2, 2)
  data <- as.data.frame(mvrnorm(n = 100, mu = c(0, 0), Sigma = Sigma))
  colnames(data) <- c("x", "y")
  plot <- ggplot(data, aes(x = x, y = y)) +
    geom_point(shape = 1) +
    ggtitle(paste("Correlation =", correlation)) +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 12, hjust = 0.5, face = "bold"),
      axis.text = element_blank(),
      axis.ticks = element_blank(),
      panel.grid = element_blank(),
      axis.title = element_blank()
    ) +
    xlim(-3, 3) +
    ylim(-3, 3) +
    coord_fixed(ratio = 1)
  return(plot)
}

# List of correlations
correlations <- c(-0.99, -0.75, -0.5, -0.25, 0.99, 0.75, 0.5, 0.25)

# Create list of plots
plots <- lapply(correlations, create_plot)

# Arrange plots in a grid
grid.arrange(grobs = plots, ncol = 4)
```

```
## Warning: Removed 1 row containing missing values or values outside the
## scale range (`geom_point()`).
## Removed 1 row containing missing values or values outside the
## scale range (`geom_point()`).
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/unnamed-chunk-11-1.png" width="90%" style="display: block; margin: auto;" />
]

---
# Association Code

.small[

``` r
library(ggplot2)
library(gridExtra)

# Create data for positive association
set.seed(123) # for reproducibility
pos_data <- data.frame(
  x = 1:10,
  y = 1:10 + rnorm(10, sd = 0.5)
)

# Create data for negative association
neg_data <- data.frame(
  x = 1:10,
  y = 10:1 + rnorm(10, sd = 0.5)
)

# Plot for positive association
pos_plot <- ggplot(pos_data, aes(x, y)) +
  geom_point(color = "blue", size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  ggtitle("Positive Association") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

# Plot for negative association
neg_plot <- ggplot(neg_data, aes(x, y)) +
  geom_point(color = "blue", size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  ggtitle("Negative Association") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

# Combine plots
grid.arrange(pos_plot, neg_plot, ncol = 2)
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/unnamed-chunk-12-1.png" width="90%" style="display: block; margin: auto;" />
]

---
# Strength Code

.small[

``` r
library(ggplot2)
library(gridExtra)
set.seed(123) # for reproducibility

# Function to generate data with different strengths of relationship
generate_data <- function(n, strength) {
  x <- rnorm(n)
  y <- strength * x + rnorm(n, sd = sqrt(1 - strength^2))
  data.frame(x = x, y = y)
}

# Generate data
weak_data <- generate_data(100, 0.3)
moderate_data <- generate_data(100, 0.6)
strong_data <- generate_data(100, 0.9)

# Create plots
plot_data <- function(data, title) {
  ggplot(data, aes(x, y)) +
    geom_point(color = "blue", alpha = 0.6) +
    geom_smooth(method = "lm", se = FALSE, color = "red") +
    ggtitle(title) +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5))
}

weak_plot <- plot_data(weak_data, "Weak Relationship")
moderate_plot <- plot_data(moderate_data, "Moderate Relationship")
strong_plot <- plot_data(strong_data, "Strong Relationship")

# Combine plots
grid.arrange(weak_plot, moderate_plot, strong_plot, ncol = 3)
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/unnamed-chunk-13-1.png" width="90%" style="display: block; margin: auto;" />
]

---
# Simpson's Code

.small[

``` r
library(scatterplot3d)
library(scales) # for rescale()
set.seed(123)
n = 1000
Education = rbinom(n, 2, 0.5)
Extraversion = rnorm(n) + Education
Salary = Education * 2 + rnorm(n) - Extraversion * 0.3
Salary = sample(10000:11000, 1) + rescale(Salary, to = c(0, 100000))
Extraversion = rescale(Extraversion, to = c(0, 7))
Education = factor(Education, labels = c("Low", "Medium", "High"))
data <- data.frame(Salary, Extraversion, Education)
attach(data)

s3d1 <- scatterplot3d(Salary, Education, Extraversion, pch = 16,
                      highlight.3d = TRUE, type = "h", main = "3D Scatterplot")
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/unnamed-chunk-14-1.png" width="90%" style="display: block; margin: auto;" />

``` r
s3d2 <- scatterplot3d(Salary, Extraversion, Education, pch = 16,
                      highlight.3d = TRUE, type = "h", main = "3D Scatterplot")
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/unnamed-chunk-14-2.png" width="90%" style="display: block; margin: auto;" />

``` r
detach(data)
```
]

---
# Anscombe's Code

.small[

``` r
cor1 <- format(cor(anscombe$x1, anscombe$y1), digits = 4)
cor2 <- format(cor(anscombe$x2, anscombe$y2), digits = 4)
cor3 <- format(cor(anscombe$x3, anscombe$y3), digits = 4)
cor4 <- format(cor(anscombe$x4, anscombe$y4), digits = 4)

# define the OLS regression lines
line1 <- lm(y1 ~ x1, data = anscombe)
line2 <- lm(y2 ~ x2, data = anscombe)
line3 <- lm(y3 ~ x3, data = anscombe)
line4 <- lm(y4 ~ x4, data = anscombe)

circle.size = 5
colors = list('red', '#0066CC', '#4BB14B', '#FCE638')

# plot1
plot1 <- ggplot(anscombe, aes(x = x1, y = y1)) +
  geom_point(size = circle.size, pch = 21, fill = colors[[1]]) +
  geom_abline(intercept = line1$coefficients[1], slope = line1$coefficients[2]) +
  annotate("text", x = 12, y = 5, label = paste("correlation = ", cor1))

# plot2
plot2 <- ggplot(anscombe, aes(x = x2, y = y2)) +
  geom_point(size = circle.size, pch = 21, fill = colors[[2]]) +
  geom_abline(intercept = line2$coefficients[1], slope = line2$coefficients[2]) +
  annotate("text", x = 12, y = 3, label = paste("correlation = ", cor2))

# plot3
plot3 <- ggplot(anscombe, aes(x = x3, y = y3)) +
  geom_point(size = circle.size, pch = 21, fill = colors[[3]]) +
  geom_abline(intercept = line3$coefficients[1], slope = line3$coefficients[2]) +
  annotate("text", x = 12, y = 6, label = paste("correlation = ", cor3))

# plot4
plot4 <- ggplot(anscombe, aes(x = x4, y = y4)) +
  geom_point(size = circle.size, pch = 21, fill = colors[[4]]) +
  geom_abline(intercept = line4$coefficients[1], slope = line4$coefficients[2]) +
  annotate("text", x = 15, y = 6, label = paste("correlation = ", cor4))

grid.arrange(plot1, plot2, plot3, plot4,
             top = 'Anscombe Quartet -- Correlation Demonstration')
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/unnamed-chunk-15-1.png" width="90%" style="display: block; margin: auto;" />
]
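
---
# Shared Variance Code

.small[
A minimal sketch of the `\(R^{2}\)` computation from the Shared Variance slide, assuming the five-row smoking data from earlier (the values in the comments are approximate).

``` r
# correlation between cigarettes smoked and lung capacity
r <- cor(smoking$cigarettes, smoking$lung_capacity)
r        # about -0.96

# coefficient of determination: proportion of shared variance
r^2      # about 0.92

# proportion of the variability still unaccounted for
1 - r^2  # about 0.08
```
]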