class: center, middle, inverse, title-slide .title[ # Regression: Understanding Relationships Between Variables
📈 ]
.author[
### S. Mason Garrison
]

---
layout: true

<div class="my-footer">
<span>
<a href="https://psychmethods.github.io/coursenotes/" target="_blank">Methods in Psychological Research</a>
</span>
</div>

---
class: middle

# Regression: Understanding Relationships Between Variables

---

# Roadmap

- Introduction to Regression
- The Regression Line
- Least Squares Method
- Interpreting Regression Results
- Residuals and Residual Plots
- Assumptions and Limitations
- Real-World Applications
- Practice and Hands-on Analysis?

---
class: center, middle, inverse

# Introduction to Regression

---

# What is Regression?

- So far, we've explored how variables move together using bivariate statistics.
- However, we haven't considered these variables within a system where we have:
  - DV (dependent variable): The outcome we're interested in.
  - IV (independent variable): The predictor or factor we think influences the DV.

--

- Regression is a useful statistical method that allows us to understand how the DV changes as the IV changes.
- It helps in modeling and analyzing relationships between variables.

---

# Example: Smoking and Lung Capacity

.pull-left[
- Let's revisit our earlier example examining the relationship between smoking and lung capacity.
- We're interested in how the number of cigarettes smoked affects lung capacity.

``` r
# Create the data frame
smoking <- data.frame(
  cigarettes = c(0, 5, 10, 15, 20),
  lung_capacity = c(45, 42, 33, 31, 29))
```
]
.pull-right[
<table class="table table-striped table-hover" style="color: black; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> cigarettes </th>
   <th style="text-align:right;"> lung_capacity </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 45 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 42 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;"> 33 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 15 </td>
   <td style="text-align:right;"> 31 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 29 </td>
  </tr>
</tbody>
</table>
]

---

# Interpreting Smoking and Lung Capacity

.pull-left[
- When we plot these data, we observe a **negative linear association** between the number of cigarettes smoked and lung capacity.
- The **correlation coefficient** between lung capacity and cigarette consumption is `\(-0.962\)`.
  - This value indicates a strong negative relationship.
- The **coefficient of determination** (`\(r^2\)`) is `\(0.924\)`, meaning that approximately `\(92.4\%\)` of the variation in lung capacity can be explained by cigarette consumption.
]

--

.pull-right[
- Knowing the number of cigarettes someone smokes allows us to predict their lung capacity.
- The higher the correlation (and, by extension, the variance explained), the more accurate our predictions *ought* to be.

<img src="data:image/png;base64,#12_Regression_files/figure-html/scatterplot-1.png" width="90%" style="display: block; margin: auto;" />
]

---
class: center, middle, inverse

# The Regression Line

---

# What is a Regression Line?

- Simple explanation:
  - We draw a line through the points in a scatter plot to summarize the relationship.

--

- Technical explanation:
  - We fit a linear model of the relationship between `\(x\)` and `\(y\)`.

--

- A regression line is a straight line that describes how a response variable `\(y\)` changes as an explanatory variable `\(x\)` changes.
- We often use a regression line to predict the value of `\(y\)` for a given value of `\(x\)`,
  - when we believe the relationship between `\(x\)` and `\(y\)` is linear.
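---

# Sketch: Drawing a Regression Line in R

- As a quick sketch (reusing the `smoking` data frame we created earlier), we can let base R fit and draw the line for us:
  - `lm()` fits the linear model, and `abline()` overlays the fitted line on a scatter plot.

``` r
# Scatter plot of the smoking data
plot(smoking$cigarettes, smoking$lung_capacity,
     xlab = "Cigarettes", ylab = "Lung Capacity",
     main = "Smoking and Lung Capacity")

# Fit the linear model and overlay the fitted line
abline(lm(lung_capacity ~ cigarettes, data = smoking))
```

- We'll unpack exactly how this line gets chosen in the slides that follow.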
---

# Equation of a Line

.pull-left[
- We can describe this line in a familiar equation.
- Suppose that `\(y\)` is a response variable (plotted on the vertical axis) and `\(x\)` is an explanatory variable (plotted on the horizontal axis).
- A straight line relating `\(x\)` to `\(y\)` has an equation of the form `\(y = a + bx\)` (or `\(y = mx + b\)`).
]
.pull-right[
- Example where `\(a = 1\)` and `\(b = 3\)`

<img src="data:image/png;base64,#12_Regression_files/figure-html/line-plot-ggplot-1.png" width="90%" style="display: block; margin: auto;" />
]

---

# Components of the Equation

`\(y = a + bx\)` (or `\(y = mx + b\)`)

- In this equation,
  - `\(b\)` `\((m)\)` is the slope:
    - The amount by which `\(y\)` changes when `\(x\)` increases by one unit.
  - `\(a\)` `\((b)\)` is the intercept:
    - The value of `\(y\)` when `\(x = 0\)`.
- With those two values, any straight line can be defined
  - within the Cartesian plane (except lines parallel to the `\(y\)`-axis).

---

# Understanding the Slope

- Positive slope:
  - As `\(x\)` increases, `\(y\)` increases.
- Negative slope:
  - As `\(x\)` increases, `\(y\)` decreases.
- The magnitude of the slope indicates the steepness of the line.
  - A larger absolute value of `\(b\)` means a steeper line.

---

# Understanding the Intercept

- The intercept `\((a)\)` represents the starting value of `\(y\)` when `\(x = 0\)`.
- It determines where the line crosses the y-axis.
- Changing the intercept shifts the line up or down without altering its slope.

---

# Visualizing Different Slopes

.pull-left[
- Here, we demonstrate lines with different positive slopes.
- A higher slope means a steeper line.

<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto;" />
]

--

.pull-right[
- In this plot, we demonstrate lines with different negative slopes.
- Negative slopes show a downward trend as `\(x\)` increases.

<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-4-1.png" width="90%" style="display: block; margin: auto;" />
]

---

# Visualizing Different Intercepts

.pull-left[
- This plot demonstrates how changes in the intercept shift the line vertically.
- A higher intercept means the line starts higher on the y-axis.

<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-5-1.png" width="90%" style="display: block; margin: auto;" />
]

--

.pull-right[
- In this plot, we explore lines with negative intercepts, shifting the line down on the y-axis.
- A more negative intercept places the line lower on the y-axis.

<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-6-1.png" width="90%" style="display: block; margin: auto;" />
]

---
class: center, middle, inverse

# Least Squares Method

---

# Objective: Find the "best-fitting" line

.pull-left-narrow[
- When we have a set of data points, we often want to find a line that **best fits** the data.
- To eliminate the subjectivity of drawing a line through the data by eye,
  - we need an objective way to draw the line.
]

--

.pull-right-wide[
<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-7-1.png" width="90%" style="display: block; margin: auto;" />
]

--

- The least squares method is the most popular approach (a quick numeric illustration follows).
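---

# Sketch: Comparing Candidate Lines

- Before formalizing the criterion, here is a minimal sketch: we score any candidate line `\(y = a + bx\)` by its sum of squared residuals, and compare an eyeballed guess against the line `lm()` finds. (The guessed values `a = 45`, `b = -1` are made up purely for illustration.)

``` r
# Sum of squared residuals for a candidate line y = a + b*x
sse <- function(a, b, x, y) sum((y - (a + b * x))^2)

# An eyeballed guess (values chosen only for illustration)
sse(45, -1, smoking$cigarettes, smoking$lung_capacity)    # 25

# The least-squares line found by lm()
fit <- lm(lung_capacity ~ cigarettes, data = smoking)
sse(coef(fit)[1], coef(fit)[2],
    smoking$cigarettes, smoking$lung_capacity)            # 15.1
```

- No other straight line can score below the least-squares line on this criterion.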
---

# Goal of Least Squares

- The least-squares regression line of `\(y\)` on `\(x\)` is the line that makes the **sum of the squares** of the vertical distances of the data points from the line as small as possible.
- **Mathematically**, we aim to:

$$
\text{minimize } \sum_{i=1}^{n} (y_i - (a + bx_i))^2
$$

- Sometimes the principle of least squares is described as minimizing the sum of the:
  - squares,
  - squared residuals, or
  - squared errors.

---

# In other words...

- The least-squares regression line of `\(y\)` on `\(x\)` is the line that makes
  - the sum of the squares of the vertical distances of the data points from the line as small as possible.

--

- The goal is to minimize the difference between actual and predicted values of the dependent variable `\(y\)`
  - `\(\min(\sum\limits^{n}_{i=1}(y_{i}-(a+bx_{i}))^{2})\)`
  - `\(\min(\sum\limits^{n}_{i=1}(y_{i}-\widehat{y}_{i})^{2})\)`
  - `\(\min(\sum\limits^{n}_{i=1}(e_{i})^{2})\)`

--

- This method ensures that the total error between the data points and the regression line is as small as possible.

---

.question[Why squares?]

- Squaring emphasizes larger errors and avoids negative residuals canceling out positive ones.

---

# Mathematical Formulation

- We have data on an explanatory variable `\(x\)` and a response variable `\(y\)` for `\(n\)` individuals.
- From the data, calculate the means `\(\bar{x}\)` and `\(\bar{y}\)`
  - and the standard deviations `\(s_{x}\)` and `\(s_{y}\)` of the two variables and
  - their correlation `\(r\)`.
- The least-squares regression line is the line:

$$
`\begin{align*}
\widehat{y}= a + bx
\end{align*}`
$$

.pull-left[
- with slope:

$$
`\begin{align*}
b&=r \frac{s_{y}}{s_{x}}
\end{align*}`
$$
]

--

.pull-right[
- and intercept:

$$
`\begin{align*}
a&=\bar{y}-b\bar{x}
\end{align*}`
$$
]

--

- The least-squares regression line always passes through `\((\bar{x}, \bar{y})\)` and `\((0, a)\)` on the graph of `\(y\)` against `\(x\)`.

---
class: center, middle, inverse

# Interpreting Regression Results

---

# Calculating Regression Line for Smoking Data

- Using our smoking data, we can estimate the slope and intercept.
- **Compute the necessary statistics**:

``` r
# Means
mean_x <- mean(smoking$cigarettes)
mean_y <- mean(smoking$lung_capacity)

# Standard deviations
sd_x <- sd(smoking$cigarettes)
sd_y <- sd(smoking$lung_capacity)

# Correlation
r_xy <- cor(smoking$cigarettes, smoking$lung_capacity)
```

---

# Calculating Regression Line for Smoking Data

- Calculate slope (b):

$$
`\begin{align*}
b &= r \left( \dfrac{s_y}{s_x} \right) \\
&= r_{\text{lung capacity, cigarettes}} \times \frac{s_{\text{lung capacity}}}{s_{\text{cigarettes}}}\\
&= -0.962 \times \dfrac{7.071}{7.906} \\
&= -0.962 \times 0.894\\
&= -0.86
\end{align*}`
$$

---

### Calculating Regression Line for Smoking Data: Intercept

- **Calculate intercept (a)**:

$$
`\begin{align*}
a &= \bar{y} - b \bar{x} \\
&= \overline{\text{lung capacity}} - b \times \overline{\text{cigarettes}}\\
&= \overline{\text{lung capacity}} - (-0.86) \times \overline{\text{cigarettes}}\\
&= 36 - (-0.86) \times 10 \\
&= 44.6
\end{align*}`
$$

---

# Combined Regression Equation

- The **regression equation** for our data is:

$$
\hat{y} = a + bx = 44.6 - 0.86x
$$

- Interpretation
  - **Intercept (a)**: The predicted lung capacity when no cigarettes are smoked: 44.6
  - **Slope (b)**: For each additional cigarette smoked, we would expect a 0.86-liter decline in lung capacity.
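---

### Sketch: Checking the Hand Calculation in R

- As a quick check (reusing `mean_x`, `mean_y`, `sd_x`, `sd_y`, and `r_xy` computed earlier), the formula-based estimates should match what `lm()` reports:

``` r
# Slope and intercept from the least-squares formulas
b <- r_xy * (sd_y / sd_x)
a <- mean_y - b * mean_x
c(intercept = a, slope = b)    # 44.6 and -0.86

# lm() applies the same least-squares criterion
coef(lm(lung_capacity ~ cigarettes, data = smoking))
```

- Both routes give the identical line, which is reassuring: `lm()` is simply doing this arithmetic for us.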
---

# Interpreting the Regression Equation

- The intercept makes a prediction for the `\(y\)` outcome when `\(x\)` is 0.
  - Here, that means that the expected (predicted) lung capacity for a non-smoker is 44.6.
- The slope gives us the predicted change in outcome for a 1-unit increase in `\(x\)`.
  - For every 1 additional cigarette, we would expect lung capacity to decline by 0.86.

---

### Using the Regression Equation to Make Predictions

- **Predict lung capacity for a 5-cigarette smoker**:
  - If we want to predict lung capacity for a 5-cigarette smoker, we use the regression equation to predict `\(\widehat{y}\)`.

$$
`\begin{align*}
\widehat{y} &= 44.6 - 0.86x\\
&= 44.6 - 0.86 \times 5\\
&= 44.6 - 4.3\\
&= 40.3
\end{align*}`
$$

- Given our equation, we would predict that a 5-cigarette smoker would have a lung capacity of 40.3 liters.

---

# Two Regression Lines

- The distinction between explanatory variables (x) and response variables (y) is essential in regression.

`\(\widehat{y}=a_{yx}+b_{yx}x\)`

`\(\widehat{x}=a_{xy}+b_{xy}y\)`

`\(b_{yx} \neq b_{xy}\)`

- These `\(b\)` coefficients are *not* the same.
- The equation is not symmetric.
- The slope of the regression line for predicting `\(y\)` from `\(x\)` is not the same as the slope of the regression line for predicting `\(x\)` from `\(y\)`.

---

### Relationship Between Regression and Correlation

- There is a close connection between **correlation** and the **slope** of the least-squares regression line.
- The **slope** is calculated using the correlation coefficient:

$$
b = r \left( \dfrac{s_y}{s_x} \right)
$$

- **Key Points**:
  - The **slope `\((b)\)`** and the **correlation coefficient `\((r)\)`** always have the **same sign**.
    - A positive correlation results in a positive slope.
    - A negative correlation results in a negative slope.
  - The magnitude of the slope depends on both the correlation and the standard deviations of `\(x\)` and `\(y\)`.

---

### Standardizing Variables

- If we **standardize** `\(x\)` and `\(y\)` (convert them to z-scores):
- **Standardized variables**:

`$$z_{x} = \dfrac{x - \bar{x}}{s_x}$$`

`$$z_{y} = \dfrac{y - \bar{y}}{s_y}$$`

- The regression equation becomes:

$$
\hat{z}_y = r z_x
$$

- **Interpretation**:
  - The slope of the regression line of standardized variables is the **correlation coefficient** `\((r)\)`.
  - The intercept is **zero** because standardized variables have a mean of zero.

---

### Geometric Interpretation

- **Standardizing** shifts the origin to the mean, and the `\(x\)` and `\(y\)` axes are rescaled so that `\(sd = 1\)`.
- In the standardized coordinate system:
  - The regression line passes through the origin (0,0).
  - The slope is equal to the correlation coefficient `\((r)\)`.
  - `\(\widehat{z_{y}} = r_{xy}z_{x}\)`

---

### Implications

- The correlation coefficient not only measures the strength and direction of a linear relationship but also directly influences the slope of the regression line.
- Understanding this relationship helps in interpreting the regression results and in assessing how changes in `\(x\)` are associated with changes in `\(y\)`.
- Along the regression line, a change of 1 standard deviation in `\(x\)` corresponds to a change of `\(r\)` standard deviations in `\(y\)` (verified in R on the next slide).
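---

### Sketch: Standardized Slope Equals r

- A minimal sketch of the claim above: if we z-score both variables with `scale()`, the fitted slope should come out at the correlation, about `\(-0.962\)` here, with an intercept of (essentially) zero.

``` r
# Standardize both variables to z-scores
z_cig  <- as.numeric(scale(smoking$cigarettes))
z_lung <- as.numeric(scale(smoking$lung_capacity))

# Slope = r (about -0.962); intercept = 0 (up to floating-point error)
coef(lm(z_lung ~ z_cig))
```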
---
class: center, middle, inverse

## Residuals and Residual Plots

---

### Residual Calculation

- A residual is the difference between an observed value of the response variable and the value predicted by the regression line.
- That is, a residual is the prediction error that remains after we have chosen the regression line:
  - residual = observed `\(y\)` - predicted `\(y\)`
  - residual = `\(y - \widehat{y}\)`
- Residuals represent "leftover" variation in the response after fitting the regression line.

--

- Calculated for each data point, residuals help assess the fit of the regression model.
- The residuals from the least-squares line have a special property:
  - The **mean** of the least-squares residuals is always **zero**.

---

# Residual Calculation

.small[.pull-left[

``` r
# Recall
(smoke_regression <- lm(lung_capacity ~ cigarettes, data = smoking))
```

```
## 
## Call:
## lm(formula = lung_capacity ~ cigarettes, data = smoking)
## 
## Coefficients:
## (Intercept)   cigarettes  
##       44.60        -0.86
```

``` r
prediction <- data.frame(cigarettes = 5)

# Predicted Lung Capacity for 5 Cigarettes
(yhat <- predict(smoke_regression, prediction))
```

```
##    1 
## 40.3
```

``` r
# Actual Lung Capacity for 5
(yact <- smoking$lung_capacity[smoking$cigarettes == 5])
```

```
## [1] 42
```

``` r
# Difference is the Residual
yact - yhat
```

```
##   1 
## 1.7
```
]]
.small[.pull-right[

``` r
# Get the Residuals for all the values
(smoke_regression.resid <- resid(smoke_regression))
```

```
##    1    2    3    4    5 
##  0.4  1.7 -3.0 -0.7  1.6
```

``` r
# Plotting
plot(smoking$cigarettes, smoke_regression.resid,
     ylab = "Residuals", xlab = "Cigarettes",
     main = "Residual Plot of Smoking Data")
abline(0, 0)
```

<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-10-1.png" width="90%" style="display: block; margin: auto;" />
]]
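---

### Sketch: Building the Comparison Table

- The verification table on the next slide can be assembled straight from the fitted model; a small sketch using `fitted()` and `resid()` on `smoke_regression`:

``` r
# Actual vs. predicted lung capacity, plus residuals
verify <- data.frame(
  cigarettes              = smoking$cigarettes,
  lung_capacity           = smoking$lung_capacity,
  predicted_lung_capacity = fitted(smoke_regression),
  residuals               = resid(smoke_regression))
verify
```

- Note that the residuals column sums to zero, as promised.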
---

### Verification with Actual Data

- Let's compare the predicted values with actual data:

<table class="table table-striped table-hover" style="color: black; margin-left: auto; margin-right: auto;">
<caption>Actual vs. Predicted Lung Capacity</caption>
 <thead>
  <tr>
   <th style="text-align:right;"> cigarettes </th>
   <th style="text-align:right;"> lung_capacity </th>
   <th style="text-align:right;"> predicted_lung_capacity </th>
   <th style="text-align:right;"> residuals </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 45 </td>
   <td style="text-align:right;"> 44.6 </td>
   <td style="text-align:right;"> 0.4 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 42 </td>
   <td style="text-align:right;"> 40.3 </td>
   <td style="text-align:right;"> 1.7 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;"> 33 </td>
   <td style="text-align:right;"> 36.0 </td>
   <td style="text-align:right;"> -3.0 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 15 </td>
   <td style="text-align:right;"> 31 </td>
   <td style="text-align:right;"> 31.7 </td>
   <td style="text-align:right;"> -0.7 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 29 </td>
   <td style="text-align:right;"> 27.4 </td>
   <td style="text-align:right;"> 1.6 </td>
  </tr>
</tbody>
</table>

---

# Residual Plots

- Plot the residual `\(y-\widehat{y}\)` against the `\(x\)` value.

.pull-left[
- A good residual plot looks flat:
  - The residuals are randomly scattered around the line `\(y=0\)`.

<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-11-1.png" width="90%" style="display: block; margin: auto;" />
]

--

.pull-right[
- A concerning residual plot shows a relationship between the residuals and `\(x\)`.

<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-12-1.png" width="90%" style="display: block; margin: auto;" />
]

---
class: center, middle, inverse

## Assumptions and Limitations

---

### Assumptions of Linear Regression

1. **Linearity**:
   - The relationship between the independent and dependent variables is linear.
2. **Independence**:
   - Observations are independent of each other.
3. **Homoscedasticity**:
   - The residuals have constant variance at every level of `\(x\)`.
4. **Normality of Residuals**:
   - The residuals are normally distributed.

- **Violations** of these assumptions can affect the validity of the regression results.

---

### Alternative Line Estimates

- **MAD (Minimize Absolute Deviations) Regression**:
  - Minimizes the **sum of the absolute deviations**: `\(\min(\sum\limits^{n}_{i=1}\lvert y_{i}-\widehat{y}_{i}\rvert)\)`
- **LMS (Least Median of Squares)**:
  - Minimizes the median of the squared residuals.
- **Ridge Regression**:
  - Adds a penalty term to the least squares to prevent overfitting.
- **Maximum Likelihood Methods**:
  - Estimates parameters by maximizing the likelihood function.

---
class: center, middle, inverse

## Real-World Applications

---

### Beyond Two Variables

- The residual plot may suggest that other variables are at play.
- **Multiple Regression**:
  - Regression can be used to predict `\(y\)` from multiple `\(x\)` variables.
  - Allows for modeling more complex relationships.
- Example:
  - Predicting lung capacity based on cigarettes smoked, age, and exercise level.
- While multiple regression is powerful, it is beyond the scope of our class.

---

# Next time is a hands-on lab using...

---
class: center, middle, inverse

## Excel