 # Data Analysis & Interpretation 3.2: Testing a Basic Linear Regression Model

## Week 2

This week’s assignment asks you to test a basic linear regression model for the association between your primary explanatory variable and a response variable.

1) If you have a categorical explanatory variable, make sure one of your categories is coded “0” and generate a frequency table for this variable to check your coding. If you have a quantitative explanatory variable, center it so that the mean = 0 (or really close to 0) by subtracting the mean, and then calculate the mean to check your centering.

2) Test a linear regression model and summarize the results in a couple of sentences. Make sure to include statistical results (regression coefficients and p-values) in your summary.

## Crime Rate vs. Employee Rate: Centering the Explanatory Variable

The explanatory variable in this case is Crime Rate, and the response variable is Employee Rate. I center Crime Rate by subtracting its mean from each value. The mean of the centered variable should then be 0, or very close to 0.

In :
```
# center Crime Rate by subtracting its mean from every value
cr_mean = sub1['Crime Rate'].mean()
sub1['CrimeRate_Centered'] = sub1['Crime Rate'] - cr_mean
```

In :

```
# check centering
sub1['CrimeRate_Centered'].mean()
```
Out:
`-7.550377883057303e-13`

The mean of CrimeRate_Centered is about -7.55e-13 (-0.000000000000755). It is not exactly 0 only because of floating-point rounding, so it is sufficiently close to 0.
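To illustrate what centering does to the model, here is a minimal sketch on made-up numbers (not the assignment's dataset). It uses only the standard library, and the helper `fit_line` is a hypothetical name introduced for this example. Under these assumptions, centering the explanatory variable leaves the slope unchanged and moves the intercept to the mean of the response variable.

```python
from statistics import fmean


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical values, chosen only for illustration.
crime_rate = [120.0, 340.0, 560.0, 910.0, 1500.0]
employee_rate = [150.0, 171.0, 168.0, 205.0, 260.0]

# Center the explanatory variable by subtracting its mean.
crime_mean = fmean(crime_rate)
crime_rate_centered = [x - crime_mean for x in crime_rate]

raw_intercept, raw_slope = fit_line(crime_rate, employee_rate)
centered_intercept, centered_slope = fit_line(crime_rate_centered, employee_rate)

print("slopes:", raw_slope, centered_slope)
print("centered intercept vs. mean of response:",
      centered_intercept, fmean(employee_rate))
```

The two slopes should match, and the centered intercept should equal the mean of the response, which is what makes the intercept easier to interpret after centering.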

## Remove Outliers from Employee Rate

In :
```
# keep only rows with Employee Rate below 1200 (treats values of 1200 or more as outliers)
sub11 = sub1.loc[sub1['Employee Rate'] < 1200, ['CrimeRate_Centered', 'Employee Rate']].copy()
```
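When dropping outliers, it can help to report how many rows the filter removes. The following is a small sketch on a hypothetical DataFrame named `demo`. Only the 1200 cutoff comes from the cell above; the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical rows; the real data lives in `sub1`.
demo = pd.DataFrame({
    "CrimeRate_Centered": [-50.0, 10.0, 40.0, 300.0],
    "Employee Rate": [140.0, 180.0, 1350.0, 2600.0],
})

# Boolean mask: True for rows we keep.
keep = demo["Employee Rate"] < 1200
filtered = demo.loc[keep].copy()

n_dropped = int((~keep).sum())
print(f"dropped {n_dropped} of {len(demo)} rows")
```

Applying the same mask to the real data would show how much of the sample the 1200 cutoff excludes.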
In :
```
# run model
import statsmodels.formula.api as smf

# the formula interface needs column names without spaces
sub11.columns = ['CrimeRate_Centered', 'Employee_Rate']
centered_mod1 = smf.ols(formula='Employee_Rate ~ CrimeRate_Centered', data=sub11).fit()
```
In :
```
print(centered_mod1.summary())
```
```
                            OLS Regression Results
==============================================================================
Dep. Variable:          Employee_Rate   R-squared:                       0.199
Model:                            OLS   Adj. R-squared:                  0.198
Method:                 Least Squares   F-statistic:                     513.6
Date:                Sat, 29 Sep 2018   Prob (F-statistic):          9.18e-102
Time:                        11:04:11   Log-Likelihood:                -12648.
No. Observations:                2074   AIC:                         2.530e+04
Df Residuals:                    2072   BIC:                         2.531e+04
Df Model:                           1
Covariance Type:            nonrobust
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept            175.9691      2.366     74.370      0.000     171.329     180.609
CrimeRate_Centered     0.0875      0.004     22.662      0.000       0.080       0.095
==============================================================================
Omnibus:                     1101.035   Durbin-Watson:                   1.502
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            11161.902
Skew:                           2.296   Prob(JB):                         0.00
Kurtosis:                      13.396   Cond. No.                         613.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

## Summary

Using the fitted coefficients, the equation of our best-fit line is:

Employee_Rate = 175.97 + (0.0875 * CrimeRate_Centered)
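As a quick way to read this equation, the sketch below plugs centered Crime Rate values into it. The coefficients are copied from the OLS summary; the function name `predict_employee_rate` and the example input of 100 are assumptions made for illustration.

```python
# Coefficients copied from the OLS summary table.
INTERCEPT = 175.9691
SLOPE = 0.0875


def predict_employee_rate(crime_rate_centered):
    """Predicted Employee_Rate for a given centered Crime Rate value."""
    return INTERCEPT + SLOPE * crime_rate_centered


# A centered value of 0 corresponds to the mean Crime Rate.
at_mean = predict_employee_rate(0.0)

# A centered value of 100 is 100 units above the mean Crime Rate.
above_mean = predict_employee_rate(100.0)

print(at_mean, above_mean)
```

A centered value of 0 returns the intercept itself, and each additional unit of Crime Rate adds 0.0875 to the prediction.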

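The slope for CrimeRate_Centered is positive and statistically significant (coefficient = 0.0875, p < 0.001): each one-unit increase in Crime Rate is associated with an increase of about 0.0875 in Employee Rate. Because Crime Rate is centered, the intercept (175.97, p < 0.001) is the predicted Employee Rate at the mean Crime Rate. The large F-statistic (513.6, p = 9.18e-102) tells us that the model fits significantly better than an intercept-only model, and its p-value is far below our level of significance (alpha), 0.05. However, the R-squared value of 0.199 tells us that this model accounts for only about 20% of the variability in the response variable (Employee_Rate).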
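With a single explanatory variable, the overall F-test and the t-test on the slope are the same test, so F should equal the square of the slope's t-value, and F can also be recovered from R-squared and the residual degrees of freedom. The sketch below checks both identities against the numbers printed in the summary table. Small differences are expected because the table values are rounded.

```python
# Values copied from the OLS summary table.
t_slope = 22.662    # t for CrimeRate_Centered
f_stat = 513.6      # F-statistic
r_squared = 0.199   # R-squared
df_resid = 2072     # Df Residuals

# Identity 1: with one predictor, F = t^2.
f_from_t = t_slope ** 2

# Identity 2: with one predictor, F = R^2 / (1 - R^2) * df_resid.
f_from_r2 = r_squared / (1 - r_squared) * df_resid

print(f"F reported: {f_stat}")
print(f"F from t^2: {f_from_t:.1f}")
print(f"F from R^2: {f_from_r2:.1f}")
```

Both reconstructions land close to the reported 513.6, which confirms that the F-statistic here measures the strength of the linear association rather than a difference between group means.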