class: center, middle, inverse, title-slide .title[ # Regression: Understanding Relationships Between Variables
📈 ]
.author[
### S. Mason Garrison
]

---
layout: true

<div class="my-footer">
<span>
<a href="https://psychmethods.github.io/coursenotes/" target="_blank">Methods in Psychological Research</a>
</span>
</div>

---
class: middle

# Regression: Understanding Relationships Between Variables

---

# Roadmap

- Introduction to Regression
- The Regression Line
- Least Squares Method
- Interpreting Regression Results
- Residuals and Residual Plots
- Assumptions and Limitations
- Real-World Applications
- Practice and Hands-on Analysis?

---
class: center, middle, inverse

# Introduction to Regression

---

# What is Regression?

- So far, we've explored how variables move together using bivariate statistics.
- However, we haven't considered these variables within a system where we have:
  - DV (dependent variable): The outcome we're interested in.
  - IV (independent variable): The predictor or factor we think influences the DV.

--

- Regression is a useful statistical method that allows us to understand how the DV changes as the IV changes.
- It helps in modeling and analyzing relationships between variables.

---

# Example: Smoking and Lung Capacity

.pull-left[
- Let's revisit our earlier example examining the relationship between smoking and lung capacity.
- We're interested in how the number of cigarettes smoked affects lung capacity.

``` r
# Create the data frame
smoking <- data.frame(
  cigarettes = c(0, 5, 10, 15, 20),
  lung_capacity = c(45, 42, 33, 31, 29))
```
]
.pull-right[
<table class="table table-striped table-hover" style="color: black; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> cigarettes </th>
   <th style="text-align:right;"> lung_capacity </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 45 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 42 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;"> 33 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 15 </td>
   <td style="text-align:right;"> 31 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 29 </td>
  </tr>
</tbody>
</table>
]

---

# Interpreting Smoking and Lung Capacity

.pull-left[
- When we plot these data, we observe a **negative linear association** between the number of cigarettes smoked and lung capacity.
- The **correlation coefficient** between lung capacity and cigarette consumption is `\(-0.962\)`.
  - This value indicates a strong negative relationship.
- The **coefficient of determination** (`\(r^2\)`) is `\(0.924\)`, meaning that approximately `\(92.4\%\)` of the variation in lung capacity can be explained by cigarette consumption.
]

--

.pull-right[
- Knowing the number of cigarettes someone smokes allows us to predict their lung capacity.
- The higher the correlation (and, by extension, the variance explained), the more accurate our predictions *ought* to be.

<img src="data:image/png;base64,#12_Regression_files/figure-html/scatterplot-1.png" width="90%" style="display: block; margin: auto;" />
]

---
class: center, middle, inverse

# The Regression Line

---

# What is a Regression Line?

- Simple explanation:
  - We draw a line through the points in a scatter plot to summarize the relationship.

--

- Technical explanation:
  - We fit a linear model of the relationship between `\(x\)` and `\(y\)`.

--

- A regression line is a straight line that describes how a response variable `\(y\)` changes as an explanatory variable `\(x\)` changes.
- We often use a regression line to predict the value of `\(y\)` for a given value of `\(x\)`,
  - when we believe the relationship between `\(x\)` and `\(y\)` is linear.
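---

# Sketch: Drawing a Regression Line in R

- As a quick sketch (reusing the `smoking` data frame we created earlier), we can let base R fit and draw the line for us:
  - `lm()` fits the linear model, and `abline()` overlays the fitted line on a scatter plot.

``` r
# Scatter plot of the smoking data
plot(smoking$cigarettes, smoking$lung_capacity,
     xlab = "Cigarettes", ylab = "Lung Capacity",
     main = "Smoking and Lung Capacity")

# Fit the linear model and overlay the fitted line
abline(lm(lung_capacity ~ cigarettes, data = smoking))
```

- We'll unpack exactly how this line gets chosen in the slides that follow.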
---

# Equation of a Line

.pull-left[
- We can describe this line in a familiar equation.
- Suppose that `\(y\)` is a response variable (plotted on the vertical axis) and `\(x\)` is an explanatory variable (plotted on the horizontal axis).
- A straight line relating `\(x\)` to `\(y\)` has an equation of the form `\(y = a + bx\)` (or `\(y = mx + b\)`).
]
.pull-right[
- Example where `\(a = 1\)` and `\(b = 3\)`

<img src="data:image/png;base64,#12_Regression_files/figure-html/line-plot-ggplot-1.png" width="90%" style="display: block; margin: auto;" />
]

---

# Components of the Equation

`\(y = a + bx\)` (or `\(y = mx + b\)`)

- In this equation,
  - `\(b\)` `\((m)\)` is the slope:
    - The amount by which `\(y\)` changes when `\(x\)` increases by one unit.
  - `\(a\)` `\((b)\)` is the intercept:
    - The value of `\(y\)` when `\(x = 0\)`.
- With those two values, any straight line can be defined
  - within the Cartesian plane (except lines parallel to the `\(y\)`-axis).

---

# Understanding the Slope

- Positive slope:
  - As `\(x\)` increases, `\(y\)` increases.
- Negative slope:
  - As `\(x\)` increases, `\(y\)` decreases.
- The magnitude of the slope indicates the steepness of the line.
  - A larger absolute value of `\(b\)` means a steeper line.

---

# Understanding the Intercept

- The intercept `\((a)\)` represents the starting value of `\(y\)` when `\(x = 0\)`.
- It determines where the line crosses the y-axis.
- Changing the intercept shifts the line up or down without altering its slope.

---

# Visualizing Different Slopes

.pull-left[
- Here, we demonstrate lines with different positive slopes.
- A higher slope means a steeper line.

<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto;" />
]

--

.pull-right[
- In this plot, we demonstrate lines with different negative slopes.
- Negative slopes show a downward trend as `\(x\)` increases.

<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-4-1.png" width="90%" style="display: block; margin: auto;" />
]

---

# Visualizing Different Intercepts

.pull-left[
- This plot demonstrates how changes in the intercept shift the line vertically.
- A higher intercept means the line starts higher on the y-axis.

<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-5-1.png" width="90%" style="display: block; margin: auto;" />
]

--

.pull-right[
- In this plot, we explore lines with negative intercepts, shifting the line down on the y-axis.
- A more negative intercept places the line lower on the y-axis.

<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-6-1.png" width="90%" style="display: block; margin: auto;" />
]

---
class: center, middle, inverse

# Least Squares Method

---

# Objective: Find the "best-fitting" line

.pull-left-narrow[
- When we have a set of data points, we often want to find a line that **best fits** the data.
- To eliminate the subjectivity of drawing a line through the data by eye,
  - we need an objective way to draw the line.
]

--

.pull-right-wide[
<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-7-1.png" width="90%" style="display: block; margin: auto;" />
]

--

- The least squares method is the most popular approach (a quick numeric illustration follows).
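---

# Sketch: Comparing Candidate Lines

- Before formalizing the criterion, here is a minimal sketch: we score any candidate line `\(y = a + bx\)` by its sum of squared residuals, and compare an eyeballed guess against the line `lm()` finds. (The guessed values `a = 45`, `b = -1` are made up purely for illustration.)

``` r
# Sum of squared residuals for a candidate line y = a + b*x
sse <- function(a, b, x, y) sum((y - (a + b * x))^2)

# An eyeballed guess (values chosen only for illustration)
sse(45, -1, smoking$cigarettes, smoking$lung_capacity)    # 25

# The least-squares line found by lm()
fit <- lm(lung_capacity ~ cigarettes, data = smoking)
sse(coef(fit)[1], coef(fit)[2],
    smoking$cigarettes, smoking$lung_capacity)            # 15.1
```

- No other straight line can score below the least-squares line on this criterion.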
---

# Goal of Least Squares

- The least-squares regression line of `\(y\)` on `\(x\)` is the line that makes the **sum of the squares** of the vertical distances of the data points from the line as small as possible.
- **Mathematically**, we aim to:

$$
\text{minimize } \sum_{i=1}^{n} (y_i - (a + bx_i))^2
$$

- Sometimes the principle of least squares is described as minimizing the sum of the:
  - squares,
  - squared residuals, or
  - squared errors.

---

# In other words...

- The least-squares regression line of `\(y\)` on `\(x\)` is the line that makes
  - the sum of the squares of the vertical distances of the data points from the line as small as possible.

--

- The goal is to minimize the difference between actual and predicted values of the dependent variable `\(y\)`
  - `\(\min(\sum\limits^{n}_{i=1}(y_{i}-(a+bx_{i}))^{2})\)`
  - `\(\min(\sum\limits^{n}_{i=1}(y_{i}-\widehat{y}_{i})^{2})\)`
  - `\(\min(\sum\limits^{n}_{i=1}(e_{i})^{2})\)`

--

- This method ensures that the total error between the data points and the regression line is as small as possible.

---

.question[Why squares?]

- Squaring emphasizes larger errors and avoids negative residuals canceling out positive ones.

---

# Mathematical Formulation

- We have data on an explanatory variable `\(x\)` and a response variable `\(y\)` for `\(n\)` individuals.
- From the data, calculate the means `\(\bar{x}\)` and `\(\bar{y}\)`
  - and the standard deviations `\(s_{x}\)` and `\(s_{y}\)` of the two variables and
  - their correlation `\(r\)`.
- The least-squares regression line is the line:

$$
`\begin{align*}
\widehat{y}= a + bx
\end{align*}`
$$

.pull-left[
- with slope:

$$
`\begin{align*}
b&=r \frac{s_{y}}{s_{x}}
\end{align*}`
$$
]

--

.pull-right[
- and intercept:

$$
`\begin{align*}
a&=\bar{y}-b\bar{x}
\end{align*}`
$$
]

--

- The least-squares regression line always passes through `\((\bar{x}, \bar{y})\)` and `\((0, a)\)` on the graph of `\(y\)` against `\(x\)`.

---
class: center, middle, inverse

# Interpreting Regression Results

---

# Calculating Regression Line for Smoking Data

- Using our smoking data, we can estimate the slope and intercept.
- **Compute the necessary statistics**:

``` r
# Means
mean_x <- mean(smoking$cigarettes)
mean_y <- mean(smoking$lung_capacity)

# Standard deviations
sd_x <- sd(smoking$cigarettes)
sd_y <- sd(smoking$lung_capacity)

# Correlation
r_xy <- cor(smoking$cigarettes, smoking$lung_capacity)
```

---

# Calculating Regression Line for Smoking Data

- Calculate slope (b):

$$
`\begin{align*}
b &= r \left( \dfrac{s_y}{s_x} \right) \\
&= r_{\text{lung capacity, cigarettes}} \times \frac{s_{\text{lung capacity}}}{s_{\text{cigarettes}}}\\
&= -0.962 \times \dfrac{7.071}{7.906} \\
&= -0.962 \times 0.894\\
&= -0.86
\end{align*}`
$$

---

### Calculating Regression Line for Smoking Data: Intercept

- **Calculate intercept (a)**:

$$
`\begin{align*}
a &= \bar{y} - b \bar{x} \\
&= \overline{\text{lung capacity}} - b \times \overline{\text{cigarettes}}\\
&= \overline{\text{lung capacity}} - (-0.86) \times \overline{\text{cigarettes}}\\
&= 36 - (-0.86) \times 10 \\
&= 44.6
\end{align*}`
$$

---

# Combined Regression Equation

- The **regression equation** for our data is:

$$
\hat{y} = a + bx = 44.6 - 0.86x
$$

- Interpretation
  - **Intercept (a)**: The predicted lung capacity when no cigarettes are smoked: 44.6
  - **Slope (b)**: For each additional cigarette smoked, we would expect a 0.86-liter decline in lung capacity.
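---

### Sketch: Checking the Hand Calculation in R

- As a quick check (reusing `mean_x`, `mean_y`, `sd_x`, `sd_y`, and `r_xy` computed earlier), the formula-based estimates should match what `lm()` reports:

``` r
# Slope and intercept from the least-squares formulas
b <- r_xy * (sd_y / sd_x)
a <- mean_y - b * mean_x
c(intercept = a, slope = b)    # 44.6 and -0.86

# lm() applies the same least-squares criterion
coef(lm(lung_capacity ~ cigarettes, data = smoking))
```

- Both routes give the identical line, which is reassuring: `lm()` is simply doing this arithmetic for us.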
---

# Interpreting the Regression Equation

- The intercept makes a prediction for the `\(y\)` outcome when `\(x\)` is 0.
  - Here, that means that the expected (predicted) lung capacity for a non-smoker is 44.6.
- The slope gives us the predicted change in outcome for a 1-unit increase in `\(x\)`.
  - For every 1 additional cigarette, we would expect lung capacity to decline by 0.86.

---

### Using the Regression Equation to Make Predictions

- **Predict lung capacity for a 5-cigarette smoker**:
  - If we want to predict lung capacity for a 5-cigarette smoker, we use the regression equation to predict `\(\widehat{y}\)`.

$$
`\begin{align*}
\widehat{y} &= 44.6 - 0.86x\\
&= 44.6 - 0.86 \times 5\\
&= 44.6 - 4.3\\
&= 40.3
\end{align*}`
$$

- Given our equation, we would predict that a 5-cigarette smoker would have a lung capacity of 40.3 liters.

---

# Two Regression Lines

- The distinction between explanatory variables (x) and response variables (y) is essential in regression.

`\(\widehat{y}=a_{yx}+b_{yx}x\)`

`\(\widehat{x}=a_{xy}+b_{xy}y\)`

`\(b_{yx} \neq b_{xy}\)`

- These `\(b\)` coefficients are *not* the same.
- The equation is not symmetric.
- The slope of the regression line for predicting `\(y\)` from `\(x\)` is not the same as the slope of the regression line for predicting `\(x\)` from `\(y\)`.

---

### Relationship Between Regression and Correlation

- There is a close connection between **correlation** and the **slope** of the least-squares regression line.
- The **slope** is calculated using the correlation coefficient:

$$
b = r \left( \dfrac{s_y}{s_x} \right)
$$

- **Key Points**:
  - The **slope `\((b)\)`** and the **correlation coefficient `\((r)\)`** always have the **same sign**.
    - A positive correlation results in a positive slope.
    - A negative correlation results in a negative slope.
  - The magnitude of the slope depends on both the correlation and the standard deviations of `\(x\)` and `\(y\)`.

---

### Standardizing Variables

- If we **standardize** `\(x\)` and `\(y\)` (convert them to z-scores):
- **Standardized variables**:

`$$z_{x} = \dfrac{x - \bar{x}}{s_x}$$`

`$$z_{y} = \dfrac{y - \bar{y}}{s_y}$$`

- The regression equation becomes:

$$
\hat{z}_y = r z_x
$$

- **Interpretation**:
  - The slope of the regression line of standardized variables is the **correlation coefficient** `\((r)\)`.
  - The intercept is **zero** because standardized variables have a mean of zero.

---

### Geometric Interpretation

- **Standardizing** shifts the origin to the mean, and the `\(x\)` and `\(y\)` axes are rescaled so that `\(sd = 1\)`.
- In the standardized coordinate system:
  - The regression line passes through the origin (0,0).
  - The slope is equal to the correlation coefficient `\((r)\)`.
  - `\(\widehat{z_{y}} = r_{xy}z_{x}\)`

---

### Implications

- The correlation coefficient not only measures the strength and direction of a linear relationship but also directly influences the slope of the regression line.
- Understanding this relationship helps in interpreting the regression results and in assessing how changes in `\(x\)` are associated with changes in `\(y\)`.
- Along the regression line, a change of 1 standard deviation in `\(x\)` corresponds to a change of `\(r\)` standard deviations in `\(y\)` (verified in R on the next slide).
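---

### Sketch: Standardized Slope Equals r

- A minimal sketch of the claim above: if we z-score both variables with `scale()`, the fitted slope should come out at the correlation, about `\(-0.962\)` here, with an intercept of (essentially) zero.

``` r
# Standardize both variables to z-scores
z_cig  <- as.numeric(scale(smoking$cigarettes))
z_lung <- as.numeric(scale(smoking$lung_capacity))

# Slope = r (about -0.962); intercept = 0 (up to floating-point error)
coef(lm(z_lung ~ z_cig))
```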
---
class: center, middle, inverse

## Residuals and Residual Plots

---

### Residual Calculation

- A residual is the difference between an observed value of the response variable and the value predicted by the regression line.
- That is, a residual is the prediction error that remains after we have chosen the regression line:
  - residual = observed `\(y\)` - predicted `\(y\)`
  - residual = `\(y - \widehat{y}\)`
- Residuals represent "leftover" variation in the response after fitting the regression line.

--

- Calculated for each data point, residuals help assess the fit of the regression model.
- The residuals from the least-squares line have a special property:
  - The **mean** of the least-squares residuals is always **zero**.

---

# Residual Calculation

.small[.pull-left[

``` r
# Recall
(smoke_regression <- lm(lung_capacity ~ cigarettes, data = smoking))
```

```
## 
## Call:
## lm(formula = lung_capacity ~ cigarettes, data = smoking)
## 
## Coefficients:
## (Intercept)   cigarettes  
##       44.60        -0.86
```

``` r
prediction <- data.frame(cigarettes = 5)

# Predicted Lung Capacity for 5 Cigarettes
(yhat <- predict(smoke_regression, prediction))
```

```
##    1 
## 40.3
```

``` r
# Actual Lung Capacity for 5
(yact <- smoking$lung_capacity[smoking$cigarettes == 5])
```

```
## [1] 42
```

``` r
# Difference is the Residual
yact - yhat
```

```
##   1 
## 1.7
```
]]
.small[.pull-right[

``` r
# Get the Residuals for all the values
(smoke_regression.resid <- resid(smoke_regression))
```

```
##    1    2    3    4    5 
##  0.4  1.7 -3.0 -0.7  1.6
```

``` r
# Plotting
plot(smoking$cigarettes, smoke_regression.resid,
     ylab = "Residuals", xlab = "Cigarettes",
     main = "Residual Plot of Smoking Data")
abline(0, 0)
```

<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-10-1.png" width="90%" style="display: block; margin: auto;" />
]]
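---

### Sketch: Building the Comparison Table

- The verification table on the next slide can be assembled straight from the fitted model; a small sketch using `fitted()` and `resid()` on `smoke_regression`:

``` r
# Actual vs. predicted lung capacity, plus residuals
verify <- data.frame(
  cigarettes              = smoking$cigarettes,
  lung_capacity           = smoking$lung_capacity,
  predicted_lung_capacity = fitted(smoke_regression),
  residuals               = resid(smoke_regression))
verify
```

- Note that the residuals column sums to zero, as promised.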
---

### Verification with Actual Data

- Let's compare the predicted values with actual data:

<table class="table table-striped table-hover" style="color: black; margin-left: auto; margin-right: auto;">
<caption>Actual vs. Predicted Lung Capacity</caption>
 <thead>
  <tr>
   <th style="text-align:right;"> cigarettes </th>
   <th style="text-align:right;"> lung_capacity </th>
   <th style="text-align:right;"> predicted_lung_capacity </th>
   <th style="text-align:right;"> residuals </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 45 </td>
   <td style="text-align:right;"> 44.6 </td>
   <td style="text-align:right;"> 0.4 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 42 </td>
   <td style="text-align:right;"> 40.3 </td>
   <td style="text-align:right;"> 1.7 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;"> 33 </td>
   <td style="text-align:right;"> 36.0 </td>
   <td style="text-align:right;"> -3.0 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 15 </td>
   <td style="text-align:right;"> 31 </td>
   <td style="text-align:right;"> 31.7 </td>
   <td style="text-align:right;"> -0.7 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 29 </td>
   <td style="text-align:right;"> 27.4 </td>
   <td style="text-align:right;"> 1.6 </td>
  </tr>
</tbody>
</table>

---

# Residual Plots

- Plot the residual `\(y-\widehat{y}\)` against the `\(x\)` value.

.pull-left[
- A good residual plot looks flat:
  - The residuals are randomly scattered around the line `\(y=0\)`.

<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-11-1.png" width="90%" style="display: block; margin: auto;" />
]

--

.pull-right[
- A concerning residual plot shows a relationship between the residuals and `\(x\)`.

<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-12-1.png" width="90%" style="display: block; margin: auto;" />
]

---
class: center, middle, inverse

## Assumptions and Limitations

---

### Assumptions of Linear Regression

1. **Linearity**:
   - The relationship between the independent and dependent variables is linear.
2. **Independence**:
   - Observations are independent of each other.
3. **Homoscedasticity**:
   - The residuals have constant variance at every level of `\(x\)`.
4. **Normality of Residuals**:
   - The residuals are normally distributed.

- **Violations** of these assumptions can affect the validity of the regression results.

---

### Alternative Line Estimates

- **MAD (Minimize Absolute Deviations) Regression**:
  - Minimizes the **sum of the absolute deviations**: `\(\min(\sum\limits^{n}_{i=1}\lvert y_{i}-\widehat{y}_{i}\rvert)\)`
- **LMS (Least Median of Squares)**:
  - Minimizes the median of the squared residuals.
- **Ridge Regression**:
  - Adds a penalty term to the least squares to prevent overfitting.
- **Maximum Likelihood Methods**:
  - Estimates parameters by maximizing the likelihood function.

---
class: center, middle, inverse

## Real-World Applications

---

### Beyond Two Variables

- The residual plot may suggest that other variables are at play.
- **Multiple Regression**:
  - Regression can be used to predict `\(y\)` from multiple `\(x\)` variables.
  - Allows for modeling more complex relationships.
- Example:
  - Predicting lung capacity based on cigarettes smoked, age, and exercise level.
- While multiple regression is powerful, it is beyond the scope of our class.

---

# Next time is a hands-on lab using...

---
class: center, middle, inverse

## Excel