Correlation & Regression
• A correlation and regression analysis investigates the relationship between two (or more) quantitative variables of interest.
• The goal of such an investigation is typically to estimate (predict) the value of one variable based on the observed value of the other variable (or variables).

Quantitative Variables
• Dependent Variable (Y)
  • the variable being predicted
  • called the response variable
• Independent Variable (X)
  • the variable used to explain or predict Y
  • called the explanatory or predictor variable

Correlation & Regression
• Correlation
  • Addresses the questions: “Is there a relationship between X and Y?” and “If so, how strong is it?”
• Regression
  • Addresses the question: “What is the relationship between X and Y?”

Simple Linear Relationship
• A linear (straight-line) relationship between Y and a single X.
• The form of the equation is Y = b₀ + b₁X, where b₀ is the y-intercept and b₁ is the slope.
• A scatter plot of X versus Y is useful for spotting linear relationships, and obvious departures from linearity.

Correlation
• A correlation exists between two variables when they are related in some way.
• Linear Correlation Coefficient (r)
  • measures the strength of the linear relationship between X and Y
• Properties of r
  • −1 ≤ r ≤ 1
  • r = 1 for a perfect positive linear relationship
  • r = −1 for a perfect negative linear relationship
  • r = 0 if there is no linear relationship

Sample Correlation Coefficient
• A statistic that is useful for estimating the linear correlation coefficient:

  r = (nΣxy − (Σx)(Σy)) / √[(nΣx² − (Σx)²)(nΣy² − (Σy)²)]

Coefficient of Determination
• The coefficient of determination is the proportion of variability in Y that can be explained by its linear relationship to X.
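The summation formula above can be sketched directly in code. This is a minimal illustration, not from the slides: the function name and the small data sets are made up to show the two perfect-relationship properties of r.

```python
import math

def sample_correlation(x, y):
    """Sample correlation coefficient via the summation formula:
    r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    syy = sum(yi * yi for yi in y)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2)
    )

# Perfect positive linear relationship (y = 2x) -> r = 1
print(sample_correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
# Perfect negative linear relationship -> r = -1
print(sample_correlation([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0
```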
• Computed by squaring the sample correlation coefficient (r²).

Hypothesis Testing of the Linear Correlation Coefficient
• Appropriate hypotheses:

  H₀: ρ = 0 (no linear relationship)
  H₁: ρ ≠ 0 (linear relationship)

Testing r
• Test statistic:

  t = r / √[(1 − r²)/(n − 2)],  df = n − 2

• Rejection region (3 cases of H₁):
  1. Two-tailed: for H₁: ρ ≠ 0, reject H₀ if |t| ≥ t₍α/2₎
  2. Left-tailed: for H₁: ρ < 0, reject H₀ if t ≤ −t₍α₎
  3. Right-tailed: for H₁: ρ > 0, reject H₀ if t ≥ t₍α₎

Simple Linear Regression
• The least squares regression line is our “best” line for explaining the relationship between Y and X.
  • It minimizes the squared error (the distance between the observed values and the values predicted by the line).
  • The predicted value of Y for any value of X can be found by plugging that value in for X in the least squares regression line.

Simple Linear Regression Line
• The equation is ŷ = b₀ + b₁x, where

  b₁ = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²)
  b₀ = ȳ − b₁x̄

Proper Use of Correlation & Regression
• Correlation does not imply causation.
• Simple linear regression is appropriate only if the data cluster about a line.
• Do not extrapolate beyond the range of the observed X values.
• Do not apply the model to other populations.
• For multiple regression, the size of the parameters does not indicate importance.

Effect of Extreme Values
• Extreme values can have a very large effect on correlation and regression analysis.
• Influential outliers can substantially change the fitted model.
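The least squares formulas and the t test statistic above can be sketched as follows. The data values are illustrative inventions (roughly y ≈ 2x), and the function names are my own, not from the slides.

```python
import math

def least_squares(x, y):
    """Least squares line y-hat = b0 + b1*x:
    b1 = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2),  b0 = ybar - b1*xbar."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b0 = sy / n - b1 * sx / n   # b0 = ybar - b1 * xbar
    return b0, b1

def t_statistic(r, n):
    """Test statistic for H0: rho = 0:
    t = r / sqrt((1 - r^2) / (n - 2)),  df = n - 2."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = least_squares(x, y)
y_hat = b0 + b1 * 3.5          # prediction: plug X = 3.5 into the line
```

Compare |t| against the critical value t₍α/2₎ with n − 2 degrees of freedom to decide whether to reject H₀.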