Data Analysis & Interpretation 3.2: Testing a Basic Linear Regression Model

Week 2

This week’s assignment asks you to test a basic linear regression model for the association between your primary explanatory variable and a response variable.

1) If you have a categorical explanatory variable, make sure one of your categories is coded “0” and generate a frequency table for this variable to check your coding. If you have a quantitative explanatory variable, center it so that the mean = 0 (or really close to 0) by subtracting the mean, and then calculate the mean to check your centering.

2) Test a linear regression model and summarize the results in a couple of sentences. Make sure to include statistical results (regression coefficients and p-values) in your summary.


Crime Rate vs. Employee Rate: Centering the Explanatory Variable

The explanatory variable in this case is Crime Rate. I center it by subtracting the variable's mean from each value, so the mean of the centered variable should be 0 (or very close to 0).


In [118]:
# center Crime Rate
cr_mean = sub1['Crime Rate'].mean()
sub1['CrimeRate_Centered'] = sub1['Crime Rate'] - cr_mean

In [119]:

# check centering

The mean of CrimeRate_Centered is -0.00000000000075503778 (about -7.6e-13), which is sufficiently close to 0; the tiny residue is floating-point rounding error.
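The check amounts to taking the mean of the centered values and confirming it is 0 within rounding error. As a minimal, dependency-free sketch (the numbers below are made up for illustration and are not the Crime Rate data):

```python
import math
from statistics import fmean

# Hypothetical values, for illustration only (not the actual Crime Rate data)
values = [120.5, 310.0, 87.25, 452.1, 199.9]

# Center by subtracting the mean from every value
mean_value = fmean(values)
centered = [v - mean_value for v in values]

# The centered mean is 0 only up to floating-point rounding error,
# so compare with a tolerance instead of testing for exact equality
centered_mean = fmean(centered)
is_centered = math.isclose(centered_mean, 0.0, abs_tol=1e-9)
print(is_centered)
```

With the data frame used here, the equivalent check would presumably be the mean of the CrimeRate_Centered column.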


Remove Outliers from Employee Rate

In [120]:
# remove outliers > 1200
sub11 = sub1[['CrimeRate_Centered','Employee Rate']][sub1['Employee Rate']<1200]

In [121]:
# run model
sub11.columns = ['CrimeRate_Centered','Employee_Rate'] # need to remove spaces from column headers
centered_mod1 = smf.ols(formula='Employee_Rate ~ CrimeRate_Centered', data=sub11).fit()

In [122]:
# print the model summary
print(centered_mod1.summary())
                            OLS Regression Results                            
Dep. Variable:          Employee_Rate   R-squared:                       0.199
Model:                            OLS   Adj. R-squared:                  0.198
Method:                 Least Squares   F-statistic:                     513.6
Date:                Sat, 29 Sep 2018   Prob (F-statistic):          9.18e-102
Time:                        11:04:11   Log-Likelihood:                -12648.
No. Observations:                2074   AIC:                         2.530e+04
Df Residuals:                    2072   BIC:                         2.531e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
Intercept            175.9691      2.366     74.370      0.000     171.329     180.609
CrimeRate_Centered     0.0875      0.004     22.662      0.000       0.080       0.095
Omnibus:                     1101.035   Durbin-Watson:                   1.502
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            11161.902
Skew:                           2.296   Prob(JB):                         0.00
Kurtosis:                      13.396   Cond. No.                         613.

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


Using the coefficients, the equation of the best-fit line is:

Employee_Rate = 175.97 + (0.0875 * CrimeRate_Centered)
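To make the equation concrete, here is a small sketch that plugs values into it. The helper name predict_employee_rate is hypothetical, and the inputs (0 and 100) are arbitrary illustrations; the coefficients are copied from the summary table.

```python
import math

# Coefficients copied from the OLS summary table
INTERCEPT = 175.9691
SLOPE = 0.0875

def predict_employee_rate(crime_rate_centered):
    """Predicted Employee_Rate for a centered crime rate (hypothetical helper)."""
    return INTERCEPT + SLOPE * crime_rate_centered

# A centered value of 0 means the crime rate equals its mean,
# so the prediction is just the intercept
at_mean = predict_employee_rate(0.0)

# A crime rate 100 units above its mean adds 100 * 0.0875 = 8.75
above_mean = predict_employee_rate(100.0)

print(at_mean, above_mean)
```

Because the explanatory variable is centered, the intercept can be read directly as the predicted Employee Rate at the average Crime Rate.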

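The large F-statistic (513.6, p = 9.18e-102) tells us that the model fits significantly better than an intercept-only model; with a single predictor, that means the slope for CrimeRate_Centered is significantly different from 0. The slope's p-value is reported as 0.000 (p < 0.001), far below our significance level (alpha) of 0.05. The coefficient of 0.0875 means that each one-unit increase in Crime Rate is associated with an increase of about 0.0875 in Employee Rate, and because Crime Rate is centered, the intercept of 175.97 is the expected Employee Rate when Crime Rate is at its mean. The R-squared value of 0.199 tells us that this model accounts for only about 20% of the variability in the response variable (Employee_Rate).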
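Several numbers in the summary table are tied together, which gives a quick sanity check on the output. The sketch below recomputes them from the rounded values in the table, so small rounding differences are expected; the 1.96 critical value is a normal approximation that I am assuming is adequate at this sample size.

```python
import math

# Values copied from the OLS summary table
t_slope = 22.662            # t-statistic for CrimeRate_Centered
f_stat = 513.6              # model F-statistic
df_resid = 2072             # residual degrees of freedom
coef, std_err = 0.0875, 0.004

# With one predictor, F equals the squared t-statistic of the slope
f_from_t = t_slope ** 2

# R-squared can be recovered from F and the degrees of freedom (Df Model = 1)
r2_from_f = f_stat / (f_stat + df_resid)

# Approximate 95% confidence interval for the slope
ci_low = coef - 1.96 * std_err
ci_high = coef + 1.96 * std_err

print(round(f_from_t, 1), round(r2_from_f, 3), round(ci_low, 3), round(ci_high, 3))
```

The recomputed values agree with the reported F-statistic (513.6), R-squared (0.199), and confidence interval (0.080 to 0.095) after rounding.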



