## Week 2

This week’s assignment asks you to test a basic linear regression model for the association between your primary explanatory variable and a response variable.

1) If you have a categorical explanatory variable, make sure one of your categories is coded “0” and generate a frequency table for this variable to check your coding. If you have a quantitative explanatory variable, center it so that the mean = 0 (or really close to 0) by subtracting the mean, and then calculate the mean to check your centering.

2) Test a linear regression model and summarize the results in a couple of sentences. Make sure to include statistical results (regression coefficients and p-values) in your summary.

## Crime Rate vs. Employee Rate: Centering the Explanatory Variable

The explanatory variable in this case is Crime Rate. To center it, I subtract the variable's mean from each value. The mean of the centered variable should then be 0, or very close to 0.

```
# center Crime Rate by subtracting its mean from every value
cr_mean = sub1['Crime Rate'].mean()
sub1['CrimeRate_Centered'] = sub1['Crime Rate'] - cr_mean
```

```
# check centering
sub1['CrimeRate_Centered'].mean()
```

The mean of CrimeRate_Centered is -0.00000000000075503778, which is sufficiently close to 0. The tiny nonzero residue is floating-point rounding error.
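The centering check can also be written as an explicit tolerance test rather than an eyeball comparison. The following is a minimal sketch using only the standard library; the values in `crime_rate` are made up for illustration and are not the data used in this post.

```python
import math
from statistics import fmean

# Hypothetical crime-rate values, for illustration only (not the post's data)
crime_rate = [120.5, 340.2, 89.9, 410.0, 255.4]

# Center by subtracting the mean from every value
cr_mean = fmean(crime_rate)
centered = [x - cr_mean for x in crime_rate]
centered_mean = fmean(centered)

# Rounding can leave a tiny nonzero residue, so compare with a tolerance
is_centered = math.isclose(centered_mean, 0.0, abs_tol=1e-9)
print(is_centered)
```

A tolerance check like this avoids having to judge by eye whether a long string of zeros is "close enough".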

## Remove Outliers from Employee Rate

```
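# remove outliers: keep only rows with Employee Rate below 1200
# (.loc with .copy() avoids chained indexing and gives an independent DataFrame)
sub11 = sub1.loc[sub1['Employee Rate'] < 1200, ['CrimeRate_Centered', 'Employee Rate']].copy()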
```
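One thing to keep in mind: because the centering was done before the outliers were dropped, the mean of the centered variable in the filtered data may no longer be exactly 0. The sketch below illustrates the effect with made-up rows (the numbers are hypothetical, chosen only to make the shift visible).

```python
from statistics import fmean

# Hypothetical (crime_rate, employee_rate) rows, for illustration only
rows = [(100.0, 150.0), (200.0, 180.0), (300.0, 1500.0), (400.0, 210.0)]

# Center crime rate using the mean of ALL rows
cr_mean = fmean(cr for cr, _ in rows)
centered_rows = [(cr - cr_mean, emp) for cr, emp in rows]

# Drop rows whose employee rate is at or above the 1200 cutoff
kept = [(c, emp) for c, emp in centered_rows if emp < 1200]

mean_before = fmean(c for c, _ in centered_rows)  # zero by construction
mean_after = fmean(c for c, _ in kept)            # shifted once a row is dropped
print(mean_before, mean_after)
```

If an exactly centered predictor matters for interpreting the intercept, one option is to filter first and center afterwards.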

```
# run model
import statsmodels.formula.api as smf

# rename columns so they can be used directly in the formula (no spaces)
sub11.columns = ['CrimeRate_Centered', 'Employee_Rate']
centered_mod1 = smf.ols(formula='Employee_Rate ~ CrimeRate_Centered', data=sub11).fit()
```

```
print(centered_mod1.summary())
```

## Summary

Using the coefficients from the model summary, the formula of the best-fit line is:

Employee_Rate = 175.97 + (0.0875 * CrimeRate_Centered)

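The high F-statistic indicates that the model with CrimeRate_Centered fits significantly better than an intercept-only model; in other words, the slope is significantly different from 0. The p-value is well below our significance level (alpha) of 0.05. The slope of 0.0875 means that each one-unit increase in Crime Rate is associated with an increase of 0.0875 in the predicted Employee_Rate, and the intercept of 175.97 is the predicted Employee_Rate when CrimeRate_Centered is 0, that is, at the mean Crime Rate used for centering. The R-squared value of 0.199 tells us that the model accounts for only about 20% of the variability in the response variable (Employee_Rate).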
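To see why centering changes the intercept but not the slope, here is a small least-squares sketch written with the standard library only. The helper `fit_line` and the `x`/`y` values are hypothetical, introduced just for this illustration; they are not the model or data above.

```python
from statistics import fmean

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data, for illustration only
x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [12.0, 15.0, 21.0, 24.0, 28.0]

# Center the predictor on its own mean
x_mean = fmean(x)
x_centered = [xi - x_mean for xi in x]

raw_intercept, raw_slope = fit_line(x, y)
cen_intercept, cen_slope = fit_line(x_centered, y)

# Slope is unchanged; the centered intercept equals the mean of y
print(raw_slope, cen_slope)
print(cen_intercept, fmean(y))
```

With a centered predictor, the intercept becomes the predicted response at the average value of the predictor, which is usually easier to interpret than the prediction at a raw value of 0.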