class: center, middle, inverse, title-slide

.title[
# Bivariate Relationships
]
.author[
### S. Mason Garrison
]

---
layout: true

<div class="my-footer">
<span>
<a href="https://psychmethods.github.io/coursenotes/" target="_blank">Methods in Psychological Research</a>
</span>
</div>

---
class: middle

# Relationships Between Variables

---
# Bivariate

- So far, we have been analyzing summary statistics that describe aspects of a single list of numbers
- Frequently, however, we are interested in how variables behave together
  - Measuring the relationship (correlation)
  - Modeling the relationship (regression)

---
# Motivating Example: Smoking and Lung Capacity

.pull-left[
- Suppose, for example, we wanted to investigate the relationship between cigarette smoking and lung capacity
- We might ask a group of people about their smoking habits and measure their lung capacities
]

.pull-right[
<table class="table table-striped table-hover" style="color: black; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:right;"> cigarettes </th>
<th style="text-align:right;"> lung_capacity </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 45 </td>
</tr>
<tr>
<td style="text-align:right;"> 5 </td>
<td style="text-align:right;"> 42 </td>
</tr>
<tr>
<td style="text-align:right;"> 10 </td>
<td style="text-align:right;"> 33 </td>
</tr>
<tr>
<td style="text-align:right;"> 15 </td>
<td style="text-align:right;"> 31 </td>
</tr>
<tr>
<td style="text-align:right;"> 20 </td>
<td style="text-align:right;"> 29 </td>
</tr>
</tbody>
</table>
]

---
# Visualizing

.pull-left-narrow[
- With R or any other statistics software, we can produce a scatterplot like this one.
]

.pull-right-wide[

``` r
plot(smoking)
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/unnamed-chunk-2-1.png" width="90%" style="display: block; margin: auto;" />
]

---
# Or this one

.pull-left[
- A scatterplot shows the relationship between two quantitative variables that are measured on the same individuals.
  - The values of one variable appear on the horizontal axis,
  - and the values of the other variable appear on the vertical axis.
- Each individual in the data appears as a point in the plot, located by the values of both variables for that individual.
]

.pull-right[

``` r
library(car)
scatterplot(lung_capacity ~ cigarettes, data = smoking)
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto;" />
]

---
# Or this one

.pull-left[
- Always plot the explanatory variable,
  - if there is one, on the horizontal axis (the x-axis) of a scatterplot
- If there is no explanatory-response distinction,
  - either variable can go on the horizontal axis.

.small[

``` r
library(ggplot2)
library(ggExtra)

# classic plot
p <- ggplot(smoking,
            aes(x = cigarettes, y = lung_capacity,
                color = cigarettes, size = cigarettes)) +
  geom_point() +
  theme_minimal() +
  theme(legend.position = "none")
```
]
]

.pull-right[
<img src="data:image/png;base64,#11_Bivariate_files/figure-html/unnamed-chunk-4-1.png" width="85%" style="display: block; margin: auto;" />
]

---
# Or this one

.pull-left-narrow[
- We can see from the graph that as smoking goes up, lung capacity tends to go down.
]

.pull-right-wide[

``` r
# with marginal histogram
ggMarginal(p, type = "histogram")
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/hist-1.png" width="75%" style="display: block; margin: auto;" />
]

---
# Or this one

.pull-left-narrow[
- Here, the two variables covary in opposite directions.
- This is a negative relationship.
]

.pull-right-wide[

``` r
# marginal density
ggMarginal(p, type = "density")
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/density-1.png" width="75%" style="display: block; margin: auto;" />
]

---
# Scatter Plots

.pull-left[
- In any graph of data, look for the overall pattern and for striking deviations from that pattern.
- You can describe the overall pattern of a scatterplot by the
  - direction (positive or negative),
  - form, and
  - strength of the relationship.
]

.pull-right[

``` r
# marginal boxplot
ggMarginal(p, type = "boxplot")
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/boxplot-1.png" width="90%" style="display: block; margin: auto;" />
]

- More examples: https://www.statmethods.net/graphs/scatterplot.html

---
# Direction of Association

.pull-left[
- Two variables are **positively associated** when above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together.
- Two variables are **negatively associated** when above-average values of one tend to accompany below-average values of the other, and vice versa.
]

--

.pull-right[
<img src="data:image/png;base64,#11_Bivariate_files/figure-html/direction_association_plot-1.png" width="65%" style="display: block; margin: auto;" />
]

---
# Strength of the Relationship

- We now examine two statistics for quantifying how variables covary:
  - covariance and
  - correlation.

--

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/strength-1.png" width="55%" style="display: block; margin: auto;" />

---
class: middle

# Covariance

---
# Covariance

- Covariance is a measure of how much two random variables vary together.
- When two variables covary in opposite directions, as smoking and lung capacity do,

--

  - values tend to be on opposite sides of their group means.
  - That is, when smoking is above its group mean,
  - lung capacity tends to be below its group mean.

--

- Consequently, by averaging the product of deviation scores, we can obtain a measure of how the variables vary together.

---
# Sample Covariance

- Sample covariance between X and Y is an estimate of the average (cross-)product of deviation scores in the population

`\(s_{x,y}=\frac{1}{n-1}\sum\limits^{n}_{i=1}(x_{i}-\bar{x})(y_{i}-\bar{y})\)`

- A more computationally convenient formula:

`\(s_{x,y}=\frac{1}{n-1}(\sum\limits^{n}_{i=1}x_{i}y_{i}-\frac{\sum^{n}_{i=1}x_{i}\sum^{n}_{i=1}y_{i}}{n})\)`

- Useful fact: the variance of a variable is its covariance with itself.

`\(s^{2}_{x}=s_{x,x}=\frac{1}{n-1}\sum\limits^{n}_{i=1}((x_{i}-\bar{x})(x_{i}-\bar{x}))\)`

---
# Computing Covariance

- If we want to compute the covariance for smoking and lung capacity, we can save ourselves some heartache by letting R do the arithmetic.
- Rather than computing each step by hand, we use R to find each deviation from the mean for x and y

.midi[

``` r
covariance_calc <- data.frame(X = smoking$cigarettes, Y = smoking$lung_capacity)
# find x_i - x_bar
covariance_calc$dx <- covariance_calc$X - mean(covariance_calc$X)
# find y_i - y_bar
covariance_calc$dy <- covariance_calc$Y - mean(covariance_calc$Y)
```
]

- and then find their products

.midi[

``` r
# cross multiply x deviations and y deviations
covariance_calc$dxdy <- covariance_calc$dx * covariance_calc$dy
# cross multiply X and Y
covariance_calc$xy <- covariance_calc$X * covariance_calc$Y
```
]

---
# Computing Covariance

.pull-left[
- Giving these values

``` r
kable(covariance_calc)
```

|  X|  Y|  dx| dy| dxdy|  xy|
|--:|--:|---:|--:|----:|---:|
|  0| 45| -10|  9|  -90|   0|
|  5| 42|  -5|  6|  -30| 210|
| 10| 33|   0| -3|    0| 330|
| 15| 31|   5| -5|  -25| 465|
| 20| 29|  10| -7|  -70| 580|
]

--

.pull-right[
- We'd then sum the dxdy column to get -215, and compute the covariance as:

`\(s_{x,y}=\frac{1}{n-1}\sum\limits^{n}_{i=1}(x_{i}-\bar{x})(y_{i}-\bar{y})= \frac{-215}{5-1}=-53.75\)`
]

---
# Alternative Computation

- We can instead compute `\(\sum x=50\)`, `\(\sum y=180\)`, `\(\sum xy=1585\)`, and n, and use that easier equation

`\(s_{x,y}=\frac{1}{n-1}(\sum x_{i}y_{i}-\frac{\sum x_{i}\sum y_{i}}{n})\)`

`\(=\frac{1}{5-1}(1585-\frac{50\times180}{5})\)`

`\(=\frac{1}{4}(1585-\frac{9000}{5})\)`

`\(=\frac{1}{4}(1585-1800)\)`

`\(=\frac{1}{4}(-215)\)`

`\(=-53.75\)`

---
# Covariance in R

``` r
cov(smoking)
```

```
##               cigarettes lung_capacity
## cigarettes         62.50        -53.75
## lung_capacity     -53.75         50.00
```

``` r
cov(smoking)[1,2]
```

```
## [1] -53.75
```

``` r
var(smoking$cigarettes)
```

```
## [1] 62.5
```

---
# Covariance

- Although covariance is an *extremely* important concept and is the cornerstone of many advanced methods (ANOVA, ANCOVA, SEM, regression, etc.), it has some limitations:
  - it has interpretation problems, just like variance
  - it isn't on a meaningful scale
  - it tells us whether a relationship is positive or negative,
  - but not much more than that.

---
# Covariance Scaling

- Is -53.75 a strong relationship?
  - It depends on the scale
- If we convert cigarettes into packs, the relationship hasn't changed,
  - but the covariance has...

--

``` r
smoking$packs <- smoking$cigarettes / 20
cov(smoking$packs, smoking$lung_capacity)
```

```
## [1] -2.6875
```

- The value is, in a sense, "polluted by the metric of the numbers."
- Depending on the scale of the data, the absolute value of the covariance can be very large or very small

---
# Standardizing Covariance

- We can take the scale out of the covariance.
- What happens if we use z-scores instead of raw deviations?
  - Remember that z-scores are also a measure of deviation
- The result is called the (Pearson) correlation coefficient

---
# Correlation Coefficient

- The sample correlation is the sum of cross-products of z-scores divided by n-1:

`\(r_{x,y}=\frac{1}{n-1}\sum\limits^{n}_{i=1}(Z_{x_{i}}\times Z_{y_{i}})\)`

- The population correlation is the sum of cross-products of z-scores divided by n:

`\(\rho_{x,y}=\frac{1}{n}\sum\limits^{n}_{i=1}(Z_{x_{i}}\times Z_{y_{i}})\)`

---
# Formulae for Correlation

- We can think of a correlation coefficient as a covariance with the standard deviations factored out:

`\(r_{x,y}=\frac{s_{x,y}}{s_{x}\times s_{y}}\)`

- Conversely, we can think of covariance as a correlation with the standard deviations put back in:

`\(s_{x,y}=r_{x,y}s_{x}s_{y}\)`
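
---
# Correlation in R

- A quick check of `\(r_{x,y}=\frac{s_{x,y}}{s_{x}\times s_{y}}\)` against R's built-in `cor()`
  - (a minimal sketch, assuming the five-row smoking data from earlier; the values in the comments are approximate)

``` r
# covariance divided by the product of the standard deviations
cov(smoking$cigarettes, smoking$lung_capacity) /
  (sd(smoking$cigarettes) * sd(smoking$lung_capacity))  # about -0.96

# built-in correlation: same value
cor(smoking$cigarettes, smoking$lung_capacity)          # about -0.96
```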

---
# Properties of r

.pull-left[
- Notation
  - Sample statistic: r
  - Population parameter: `\(\rho\)`
    - pronounced "row," but spelled rho
- Transformation
  - r is the same regardless of the units of measurement.
  - Height in inches and height in centimeters lead to the same r
]

.pull-right[
<img src="data:image/png;base64,#11_Bivariate_files/figure-html/galhist-1.png" width="95%" style="display: block; margin: auto;" />
]

---
# Properties of r

- r ranges from -1 to 1 [-1,1]
- |r| indicates the size of the relationship
  - r>0 indicates a positive relationship (0,1]
  - r<0 indicates a negative relationship [-1,0)
- r is symmetric
  - `\(r_{x,y}=r_{y,x}\)`

---
# Correlation Magnitude

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/mag-1.png" width="80%" style="display: block; margin: auto;" />

---
# Correlation Interpretation

.pull-left-wide.medi[
- Perfect
  - Exactly -1. A perfect downhill (negative) linear relationship
  - Exactly +1. A perfect uphill (positive) linear relationship
- Strong
  - -0.70. A strong downhill (negative) linear relationship
  - +0.70. A strong uphill (positive) linear relationship
- Moderate
  - -0.50. A moderate downhill (negative) relationship
  - +0.50. A moderate uphill (positive) relationship
- Small
  - -0.30. A small downhill (negative) linear relationship
  - +0.30. A small uphill (positive) linear relationship
]

.pull-right-narrow.medi[
- Develop your correlation intuition
  - [guessthecorrelation.com](http://guessthecorrelation.com/)
]

---
# Linear Relationship

- The correlation quantifies the linear association.

--

- If your data aren't linear, then the correlation can be misleading.

--

- If your data have outliers, then the correlation can be misleading.

---
# Anscombe's Quartet

- Anscombe's Quartet comprises four datasets that have nearly identical simple descriptive statistics, yet appear very different when graphed.

.pull-left[
<img src="data:image/png;base64,#11_Bivariate_files/figure-html/aqc-1.png" width="90%" style="display: block; margin: auto;" />
]

--

- These data were constructed by the statistician Francis Anscombe (1973).
- The quartet demonstrates the importance of graphing data before analyzing it, and the effect of outliers on statistical properties.
- It's intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough."

---
# Correlation Cautions

- Correlation requires that both variables be quantitative, so that it makes sense to do the arithmetic indicated by the formula for r.
- Correlation does not describe curved relationships between variables, no matter how strong the relationship between them is.
- Correlation is strongly affected by a few outlying observations.
- Correlation is not a complete summary of two-variable data.

---
# More Correlation Cautions

- Restriction of range
  - artificially lowers r,
  - so r underestimates `\(\rho\)`
- Combining groups can hide relationships when the groups have different relationships
  - Simpson's Paradox

---
# Simpson's Paradox

Simpson's Paradox occurs when a trend appears in different groups of data but disappears or reverses when these groups are combined. Let's explore this with a quick simulated sketch, then two examples:

1. Extraversion and Job Performance
2. Extraversion, Education, and Salary
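
---
# Simpson's Paradox: A Minimal Simulation

.medi[
- A minimal sketch, using made-up data (the group structure and coefficients are hypothetical, chosen only to make the sign flip visible):

``` r
set.seed(42)
# two groups; within each, y goes down as x goes up
g1 <- data.frame(x = rnorm(50, mean = 3), group = "A")
g1$y <- 10 - 2 * g1$x + rnorm(50)
g2 <- data.frame(x = rnorm(50, mean = 6), group = "B")
g2$y <- 25 - 2 * g2$x + rnorm(50)
pooled <- rbind(g1, g2)

cor(g1$x, g1$y)          # negative within group A (about -0.9)
cor(g2$x, g2$y)          # negative within group B (about -0.9)
cor(pooled$x, pooled$y)  # positive once the groups are combined
```

- Each group shows a negative relationship, but the pooled correlation is positive because group B sits higher on both x and y.
]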

---
# Example 1: Extraversion and Job Performance

.pull-left-wide[
<img src="data:image/png;base64,#11_Bivariate_files/figure-html/simpsons_example1-1.png" width="90%" style="display: block; margin: auto;" />
]

--

.pull-right-narrow[
- Left: Overall, there's a slight positive relationship between Extraversion and Performance.
- Right: When separated by job type, we see negative relationships within each group.
- This illustrates Simpson's Paradox: the trend in the overall data differs from the trends in the subgroups.
]

---
# Example 2: Extraversion, Education, and Salary

.pull-left-wide[
<img src="data:image/png;base64,#11_Bivariate_files/figure-html/simpsons_example2-1.png" width="90%" style="display: block; margin: auto;" />
]

--

.pull-right-narrow[
- Left: Overall, there's a positive relationship between Extraversion and Salary.
- Right: When separated by education level, we see negative relationships within each group.
- This is another example of Simpson's Paradox: the overall trend is reversed when we look at subgroups.
]

---
# Simpson's Paradox: Illustrated

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/Simposon-1.png" width="90%" style="display: block; margin: auto;" /><img src="data:image/png;base64,#11_Bivariate_files/figure-html/Simposon-2.png" width="90%" style="display: block; margin: auto;" />

---
# Final Comments on the Correlation

.pull-left-wide[
- The correlation coefficient is neither robust nor resistant.
  - Not robust, because strong nonlinear relationships between the two variables may not be recognized.
  - Not resistant, because it is sensitive to outlying points.
- It is the most sensitive summary statistic we have seen thus far.
]

<img src="data:image/png;base64,#../img/nonline.png" width="40%" style="display: block; margin: auto;" />

---
# Shared Variance: `\(R^{2}\)`

- How much do those correlated variables have in common?
  - [http://rpsychologist.com/d3/correlation/](http://rpsychologist.com/d3/correlation/)
- Common variance
  - i.e., shared variance
- Coefficient of determination
  - the percentage of variability between the two variables that has been accounted for
  - the remaining `\(1-R^{2}\)` of the variability is still unaccounted for
  - (a short computation appears in the appendix)

---
class: center, middle

# Wrapping Up...

---
# Magnitude Code

.small[

``` r
# Load necessary libraries
library(ggplot2)
library(MASS)
library(gridExtra)

# Function to create correlated data and plot
create_plot <- function(correlation) {
  Sigma <- matrix(c(1, correlation, correlation, 1), 2, 2)
  data <- as.data.frame(mvrnorm(n = 100, mu = c(0, 0), Sigma = Sigma))
  colnames(data) <- c("x", "y")
  plot <- ggplot(data, aes(x = x, y = y)) +
    geom_point(shape = 1) +
    ggtitle(paste("Correlation =", correlation)) +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 12, hjust = 0.5, face = "bold"),
      axis.text = element_blank(),
      axis.ticks = element_blank(),
      panel.grid = element_blank(),
      axis.title = element_blank()
    ) +
    xlim(-3, 3) +
    ylim(-3, 3) +
    coord_fixed(ratio = 1)
  return(plot)
}

# List of correlations
correlations <- c(-0.99, -0.75, -0.5, -0.25, 0.99, 0.75, 0.5, 0.25)

# Create list of plots
plots <- lapply(correlations, create_plot)

# Arrange plots in a grid
grid.arrange(grobs = plots, ncol = 4)
```

```
## Warning: Removed 1 row containing missing values or values outside the
## scale range (`geom_point()`).
## Removed 1 row containing missing values or values outside the
## scale range (`geom_point()`).
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/unnamed-chunk-11-1.png" width="90%" style="display: block; margin: auto;" />
]

---
# Association Code

.small[

``` r
library(ggplot2)
library(gridExtra)

# Create data for positive association
set.seed(123) # for reproducibility
pos_data <- data.frame(
  x = 1:10,
  y = 1:10 + rnorm(10, sd = 0.5)
)

# Create data for negative association
neg_data <- data.frame(
  x = 1:10,
  y = 10:1 + rnorm(10, sd = 0.5)
)

# Plot for positive association
pos_plot <- ggplot(pos_data, aes(x, y)) +
  geom_point(color = "blue", size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  ggtitle("Positive Association") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

# Plot for negative association
neg_plot <- ggplot(neg_data, aes(x, y)) +
  geom_point(color = "blue", size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  ggtitle("Negative Association") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

# Combine plots
grid.arrange(pos_plot, neg_plot, ncol = 2)
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/unnamed-chunk-12-1.png" width="90%" style="display: block; margin: auto;" />
]

---
# Strength Code

.small[

``` r
library(ggplot2)
library(gridExtra)
set.seed(123) # for reproducibility

# Function to generate data with different strengths of relationship
generate_data <- function(n, strength) {
  x <- rnorm(n)
  y <- strength * x + rnorm(n, sd = sqrt(1 - strength^2))
  data.frame(x = x, y = y)
}

# Generate data
weak_data <- generate_data(100, 0.3)
moderate_data <- generate_data(100, 0.6)
strong_data <- generate_data(100, 0.9)

# Create plots
plot_data <- function(data, title) {
  ggplot(data, aes(x, y)) +
    geom_point(color = "blue", alpha = 0.6) +
    geom_smooth(method = "lm", se = FALSE, color = "red") +
    ggtitle(title) +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5))
}

weak_plot <- plot_data(weak_data, "Weak Relationship")
moderate_plot <- plot_data(moderate_data, "Moderate Relationship")
strong_plot <- plot_data(strong_data, "Strong Relationship")

# Combine plots
grid.arrange(weak_plot, moderate_plot, strong_plot, ncol = 3)
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/unnamed-chunk-13-1.png" width="90%" style="display: block; margin: auto;" />
]

---
# Simpson's Code

.small[

``` r
library(scatterplot3d)
library(scales) # for rescale()
set.seed(123)
n = 1000
Education = rbinom(n, 2, 0.5)
Extraversion = rnorm(n) + Education
Salary = Education * 2 + rnorm(n) - Extraversion * 0.3
Salary = sample(10000:11000, 1) + rescale(Salary, to = c(0, 100000))
Extraversion = rescale(Extraversion, to = c(0, 7))
Education = factor(Education, labels = c("Low", "Medium", "High"))
data <- data.frame(Salary, Extraversion, Education)
attach(data)

s3d1 <- scatterplot3d(Salary, Education, Extraversion, pch = 16,
                      highlight.3d = TRUE, type = "h", main = "3D Scatterplot")
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/unnamed-chunk-14-1.png" width="90%" style="display: block; margin: auto;" />

``` r
s3d2 <- scatterplot3d(Salary, Extraversion, Education, pch = 16,
                      highlight.3d = TRUE, type = "h", main = "3D Scatterplot")
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/unnamed-chunk-14-2.png" width="90%" style="display: block; margin: auto;" />

``` r
detach(data)
```
]

---
# Anscombe's Code

.small[

``` r
cor1 <- format(cor(anscombe$x1, anscombe$y1), digits = 4)
cor2 <- format(cor(anscombe$x2, anscombe$y2), digits = 4)
cor3 <- format(cor(anscombe$x3, anscombe$y3), digits = 4)
cor4 <- format(cor(anscombe$x4, anscombe$y4), digits = 4)

# define the OLS regression lines
line1 <- lm(y1 ~ x1, data = anscombe)
line2 <- lm(y2 ~ x2, data = anscombe)
line3 <- lm(y3 ~ x3, data = anscombe)
line4 <- lm(y4 ~ x4, data = anscombe)

circle.size = 5
colors = list('red', '#0066CC', '#4BB14B', '#FCE638')

# plot1
plot1 <- ggplot(anscombe, aes(x = x1, y = y1)) +
  geom_point(size = circle.size, pch = 21, fill = colors[[1]]) +
  geom_abline(intercept = line1$coefficients[1], slope = line1$coefficients[2]) +
  annotate("text", x = 12, y = 5, label = paste("correlation = ", cor1))

# plot2
plot2 <- ggplot(anscombe, aes(x = x2, y = y2)) +
  geom_point(size = circle.size, pch = 21, fill = colors[[2]]) +
  geom_abline(intercept = line2$coefficients[1], slope = line2$coefficients[2]) +
  annotate("text", x = 12, y = 3, label = paste("correlation = ", cor2))

# plot3
plot3 <- ggplot(anscombe, aes(x = x3, y = y3)) +
  geom_point(size = circle.size, pch = 21, fill = colors[[3]]) +
  geom_abline(intercept = line3$coefficients[1], slope = line3$coefficients[2]) +
  annotate("text", x = 12, y = 6, label = paste("correlation = ", cor3))

# plot4
plot4 <- ggplot(anscombe, aes(x = x4, y = y4)) +
  geom_point(size = circle.size, pch = 21, fill = colors[[4]]) +
  geom_abline(intercept = line4$coefficients[1], slope = line4$coefficients[2]) +
  annotate("text", x = 15, y = 6, label = paste("correlation = ", cor4))

grid.arrange(plot1, plot2, plot3, plot4,
             top = 'Anscombe Quartet -- Correlation Demonstration')
```

<img src="data:image/png;base64,#11_Bivariate_files/figure-html/unnamed-chunk-15-1.png" width="90%" style="display: block; margin: auto;" />
]
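
---
# Shared Variance Code

.small[
A minimal sketch of the `\(R^{2}\)` computation from the Shared Variance slide, assuming the five-row smoking data from earlier (the values in the comments are approximate).

``` r
# correlation between cigarettes smoked and lung capacity
r <- cor(smoking$cigarettes, smoking$lung_capacity)
r        # about -0.96

# coefficient of determination: proportion of shared variance
r^2      # about 0.92

# proportion of the variability still unaccounted for
1 - r^2  # about 0.08
```
]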