Analyzing Coefficients - V2#

  • RENAME: we started to analyze the coefficients in lesson 01-v2. Consider a new name for this, like "Iterating on Our Coefficients" or "Thoughtful Selection of Coefficients", etc.

Lesson Objectives#

By the end of this lesson, students will be able to:

  • Extract and visualize coefficients in more helpful formats.

  • Interpret coefficients for raw data vs scaled data.

  • Use coefficient values to inform modeling choices (for insights).

  • Encode nominal categories as ordinal (based on the target).

  • Determine which version of the coefficients would be best for extracting insights and recommendations for a stakeholder.

Our Previous Results#

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns


## Reviewing the options used
pd.set_option('display.max_columns',100)
pd.set_option('display.max_rows',100)
pd.set_option('display.float_format', lambda x: f"{x:,.2f}")

## Customization Options
plt.style.use(['fivethirtyeight','seaborn-talk'])
mpl.rcParams['figure.facecolor']='white'

## additional required imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector, ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn import metrics
import joblib

Code/Model From Previous Lesson#

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

## Customization Options
plt.style.use(['fivethirtyeight','seaborn-talk'])
mpl.rcParams['figure.facecolor']='white'

## additional required imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector, ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn import metrics

SEED = 321
np.random.seed(SEED)
## Load in the student performance dataset and display the head and info
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vS6xDKNpWkBBdhZSqepy48bXo55QnRv1Xy6tXTKYzZLMPjZozMfYhHQjAcC8uj9hQ/pub?output=xlsx"

df = pd.read_excel(url,sheet_name='student-mat')
df.info()
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   school      395 non-null    object 
 1   sex         395 non-null    object 
 2   age         395 non-null    float64
 3   address     395 non-null    object 
 4   famsize     395 non-null    object 
 5   Pstatus     395 non-null    object 
 6   Medu        395 non-null    float64
 7   Fedu        395 non-null    float64
 8   Mjob        395 non-null    object 
 9   Fjob        395 non-null    object 
 10  reason      395 non-null    object 
 11  guardian    395 non-null    object 
 12  traveltime  395 non-null    float64
 13  studytime   395 non-null    float64
 14  failures    395 non-null    float64
 15  schoolsup   395 non-null    object 
 16  famsup      395 non-null    object 
 17  paid        395 non-null    object 
 18  activities  395 non-null    object 
 19  nursery     395 non-null    object 
 20  higher      395 non-null    object 
 21  internet    395 non-null    object 
 22  romantic    395 non-null    object 
 23  famrel      395 non-null    float64
 24  freetime    395 non-null    float64
 25  goout       395 non-null    float64
 26  Dalc        395 non-null    float64
 27  Walc        395 non-null    float64
 28  health      395 non-null    float64
 29  absences    395 non-null    float64
 30  G1          395 non-null    float64
 31  G2          395 non-null    float64
 32  G3          395 non-null    float64
dtypes: float64(16), object(17)
memory usage: 102.0+ KB
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian traveltime studytime failures schoolsup famsup paid activities nursery higher internet romantic famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18.00 U GT3 A 4.00 4.00 at_home teacher course mother 2.00 2.00 0.00 yes no no no yes yes no no 4.00 3.00 4.00 1.00 1.00 3.00 6.00 5.00 6.00 6.00
1 GP F 17.00 U GT3 T 1.00 1.00 at_home other course father 1.00 2.00 0.00 no yes no no no yes yes no 5.00 3.00 3.00 1.00 1.00 3.00 4.00 5.00 5.00 6.00
2 GP F 15.00 U LE3 T 1.00 1.00 at_home other other mother 1.00 2.00 3.00 yes no yes no yes yes yes no 4.00 3.00 2.00 2.00 3.00 3.00 10.00 7.00 8.00 10.00
3 GP F 15.00 U GT3 T 4.00 2.00 health services home mother 1.00 3.00 0.00 no yes yes yes yes yes yes yes 3.00 2.00 2.00 1.00 1.00 5.00 2.00 15.00 14.00 15.00
4 GP F 16.00 U GT3 T 3.00 3.00 other other home father 1.00 2.00 0.00 no yes yes no yes yes no no 4.00 3.00 2.00 1.00 2.00 5.00 4.00 6.00 10.00 10.00
# ### Train Test Split
## Make x and y variables
y = df['G3'].copy()
X = df.drop(columns=['G3']).copy()

## train-test-split with random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=SEED)


# ### Preprocessing + ColumnTransformer

## make categorical & numeric selectors
cat_sel = make_column_selector(dtype_include='object')
num_sel = make_column_selector(dtype_include='number')

## make pipelines for categorical vs numeric data
cat_pipe = make_pipeline(SimpleImputer(strategy='constant',
                                       fill_value='MISSING'),
                         OneHotEncoder(drop='if_binary', sparse=False))

num_pipe = make_pipeline(SimpleImputer(strategy='mean'))

## make the preprocessing column transformer
preprocessor = make_column_transformer((num_pipe, num_sel),
                                       (cat_pipe,cat_sel),
                                      verbose_feature_names_out=False)

## fit column transformer and run get_feature_names_out
preprocessor.fit(X_train)
feature_names = preprocessor.get_feature_names_out()

X_train_df = pd.DataFrame(preprocessor.transform(X_train), 
                          columns = feature_names, index = X_train.index)


X_test_df = pd.DataFrame(preprocessor.transform(X_test), 
                          columns = feature_names, index = X_test.index)
X_test_df.head(3)
age Medu Fedu traveltime studytime failures famrel freetime goout Dalc Walc health absences G1 G2 school_MS sex_M address_U famsize_LE3 Pstatus_T Mjob_at_home Mjob_health Mjob_other Mjob_services Mjob_teacher Fjob_at_home Fjob_health Fjob_other Fjob_services Fjob_teacher reason_course reason_home reason_other reason_reputation guardian_father guardian_mother guardian_other schoolsup_yes famsup_yes paid_yes activities_yes nursery_yes higher_yes internet_yes romantic_yes
58 15.00 1.00 2.00 1.00 2.00 0.00 4.00 3.00 2.00 1.00 1.00 5.00 2.00 9.00 10.00 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00 1.00 0.00 1.00 1.00 1.00 1.00 0.00
338 18.00 3.00 3.00 1.00 4.00 0.00 5.00 3.00 3.00 1.00 1.00 1.00 7.00 16.00 15.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00 1.00 1.00 0.00
291 17.00 4.00 3.00 1.00 3.00 0.00 4.00 2.00 2.00 1.00 2.00 3.00 0.00 15.00 15.00 0.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 1.00 1.00 0.00 1.00 1.00 1.00 0.00
from sklearn.metrics import r2_score, mean_squared_error

def evaluate_linreg(model, X_train, y_train, X_test, y_test,
                    get_coeffs=True, coeffs_name="Coefficients"):
    """Print train/test R^2 and RMSE; optionally return the coefficients as a Series."""
    results = []

    y_hat_train = model.predict(X_train)
    r2_train = r2_score(y_train, y_hat_train)
    rmse_train = mean_squared_error(y_train, y_hat_train, squared=False)
    results.append({'Data':'Train', 'R^2':r2_train, "RMSE": rmse_train})

    y_hat_test = model.predict(X_test)
    r2_test = r2_score(y_test, y_hat_test)
    rmse_test = mean_squared_error(y_test, y_hat_test, squared=False)
    results.append({'Data':'Test', 'R^2':r2_test, "RMSE": rmse_test})

    results_df = pd.DataFrame(results).round(3).set_index('Data')
    results_df.loc['Delta'] = results_df.loc['Test'] - results_df.loc['Train']
    results_df = results_df.T

    print(results_df)

    if get_coeffs:
        coeffs = pd.Series(model.coef_, index=X_train.columns)
        ## only include the intercept when the model actually fit one
        if model.intercept_ != 0:
            coeffs.loc['intercept'] = model.intercept_
        coeffs.name = coeffs_name
        return coeffs
from sklearn.linear_model import LinearRegression

## fitting a linear regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train_df, y_train)
coeffs_orig = evaluate_linreg(lin_reg, X_train_df, y_train, X_test_df,y_test,
                             coeffs_name='Original')
coeffs_orig
Data  Train  Test  Delta
R^2    0.85  0.81  -0.04
RMSE   1.83  1.85   0.02
age                 -0.22
Medu                 0.29
Fedu                -0.18
traveltime           0.19
studytime           -0.22
failures            -0.10
famrel               0.31
freetime             0.02
goout               -0.02
Dalc                -0.21
Walc                 0.26
health               0.03
absences             0.05
G1                   0.14
G2                   0.99
school_MS            0.38
sex_M               -0.01
address_U            0.15
famsize_LE3          0.01
Pstatus_T            0.38
Mjob_at_home        -0.05
Mjob_health         -0.02
Mjob_other           0.16
Mjob_services        0.12
Mjob_teacher        -0.21
Fjob_at_home         0.33
Fjob_health          0.38
Fjob_other          -0.11
Fjob_services       -0.54
Fjob_teacher        -0.05
reason_course       -0.09
reason_home         -0.42
reason_other         0.30
reason_reputation    0.21
guardian_father     -0.20
guardian_mother      0.07
guardian_other       0.13
schoolsup_yes        0.34
famsup_yes           0.16
paid_yes             0.16
activities_yes      -0.34
nursery_yes         -0.21
higher_yes           0.82
internet_yes        -0.09
romantic_yes        -0.29
intercept           -0.95
Name: Original, dtype: float64
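  • Before iterating, recall how these raw values are read. The intercept is the model’s predicted final grade when every feature equals 0 (an impossible student here, e.g., age = 0), and each coefficient is the change in predicted G3 for a 1-unit increase in that feature, holding the rest constant. A minimal sketch, using the lin_reg and coeffs_orig from above:

## the intercept is the predicted G3 when ALL features are 0
print(f"Baseline prediction (all features = 0): {lin_reg.intercept_:.2f}")

## each coefficient is the change in predicted G3 per 1-unit increase, holding others constant
print(f"+1 point of G2 -> {coeffs_orig['G2']:+.2f} points of predicted G3")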

Iterating On Our Model#

Removing the Intercept#

First, we can remove the intercept from our model, which forces the LinearRegression to explain the entire final grade using only the features, instead of being free to calculate whatever baseline value would help the model.

## fitting a linear regression model
lin_reg = LinearRegression(fit_intercept=False)
lin_reg.fit(X_train_df, y_train)
coeffs_no_int = evaluate_linreg(lin_reg, X_train_df, y_train, X_test_df,y_test,
                               coeffs_name='No Intercept')
coeffs_no_int.sort_values()
Data  Train  Test  Delta
R^2    0.85  0.81  -0.04
RMSE   1.83  1.85   0.02
Fjob_services       -0.73
reason_home         -0.66
guardian_father     -0.52
Mjob_teacher        -0.40
activities_yes      -0.34
reason_course       -0.33
Fjob_other          -0.31
romantic_yes        -0.29
guardian_mother     -0.25
Mjob_at_home        -0.24
Fjob_teacher        -0.24
age                 -0.22
studytime           -0.22
Mjob_health         -0.21
Dalc                -0.21
nursery_yes         -0.21
guardian_other      -0.19
Fedu                -0.18
failures            -0.10
internet_yes        -0.09
Mjob_services       -0.07
Mjob_other          -0.03
reason_reputation   -0.03
goout               -0.02
sex_M               -0.01
famsize_LE3          0.01
freetime             0.02
health               0.03
absences             0.05
reason_other         0.06
Fjob_at_home         0.13
G1                   0.14
address_U            0.15
famsup_yes           0.16
paid_yes             0.16
Fjob_health          0.18
traveltime           0.19
Walc                 0.26
Medu                 0.29
famrel               0.31
schoolsup_yes        0.34
school_MS            0.38
Pstatus_T            0.38
higher_yes           0.82
G2                   0.99
Name: No Intercept, dtype: float64

To Intercept or Not To Intercept?#

compare = pd.concat([coeffs_orig, coeffs_no_int],axis=1)
compare = compare.sort_values('Original')
compare['Diff'] = compare['No Intercept'] - compare['Original']
compare
Original No Intercept Diff
intercept -0.95 NaN NaN
Fjob_services -0.54 -0.73 -0.19
reason_home -0.42 -0.66 -0.24
activities_yes -0.34 -0.34 0.00
romantic_yes -0.29 -0.29 -0.00
age -0.22 -0.22 -0.00
studytime -0.22 -0.22 -0.00
Dalc -0.21 -0.21 0.00
nursery_yes -0.21 -0.21 0.00
Mjob_teacher -0.21 -0.40 -0.19
guardian_father -0.20 -0.52 -0.32
Fedu -0.18 -0.18 -0.00
Fjob_other -0.11 -0.31 -0.19
failures -0.10 -0.10 0.00
internet_yes -0.09 -0.09 0.00
reason_course -0.09 -0.33 -0.24
Mjob_at_home -0.05 -0.24 -0.19
Fjob_teacher -0.05 -0.24 -0.19
Mjob_health -0.02 -0.21 -0.19
goout -0.02 -0.02 0.00
sex_M -0.01 -0.01 0.00
famsize_LE3 0.01 0.01 -0.00
freetime 0.02 0.02 -0.00
health 0.03 0.03 0.00
absences 0.05 0.05 -0.00
guardian_mother 0.07 -0.25 -0.32
Mjob_services 0.12 -0.07 -0.19
guardian_other 0.13 -0.19 -0.32
G1 0.14 0.14 -0.00
address_U 0.15 0.15 0.00
famsup_yes 0.16 0.16 -0.00
paid_yes 0.16 0.16 0.00
Mjob_other 0.16 -0.03 -0.19
traveltime 0.19 0.19 0.00
reason_reputation 0.21 -0.03 -0.24
Walc 0.26 0.26 0.00
Medu 0.29 0.29 0.00
reason_other 0.30 0.06 -0.24
famrel 0.31 0.31 -0.00
Fjob_at_home 0.33 0.13 -0.19
schoolsup_yes 0.34 0.34 0.00
Fjob_health 0.38 0.18 -0.19
school_MS 0.38 0.38 0.00
Pstatus_T 0.38 0.38 -0.00
higher_yes 0.82 0.82 -0.00
G2 0.99 0.99 -0.00
  • At this point, there is a valid argument for using either model as the basis for our stakeholder recommendations.

  • As long as you are comfortable explaining the intercept as the baseline grade (when all features are 0), then it is not difficult to express the findings to a stakeholder.

  • Let’s see if either version looks better when visualized.

fig, axes = plt.subplots(ncols=2,figsize=(10,8),sharey=True)

compare['Original'].plot(kind='barh',color='blue',ax=axes[0],title='Coefficients + Intercept')
compare['No Intercept'].plot(kind='barh',color='green',ax=axes[1],title='No Intercept')

[ax.axvline(0,color='black',lw=1) for ax in axes]
fig.tight_layout()
../../_images/02_v2_Analyzing_Coefficients -RENAME_20_0.png
compare['Diff'].plot(kind='barh',figsize=(4,10),color='red',title='Coeff Change W/out Intercept');
../../_images/02_v2_Analyzing_Coefficients -RENAME_21_0.png
  • We can see that by removing the intercept from our model, which had a value of -0.95, we have changed the values of several, but not all, of the other coefficients.

  • Notice that, in this case, when our model lost a negative baseline value (the intercept), many of the other coefficients shifted in the negative direction. While this will not always be the case, it does demonstrate how our model has to change the coefficient values when it can no longer calculate a starting grade before factoring in the features.
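  • One way to see where the intercept “went”: every complete one-hot group (Mjob, Fjob, reason, guardian) has columns that sum to 1 for each row, so the model can fold a constant into each group without changing predictions. A quick sketch of that check, using the compare frame from above:

## each non-binary one-hot group shifted by a single constant...
groups = ['Mjob', 'Fjob', 'reason', 'guardian']
shifts = {g: compare.loc[compare.index.str.startswith(g + '_'), 'Diff'].mean()
          for g in groups}
print(shifts)

## ...and those per-group shifts add up to (approximately) the removed intercept (-0.95)
print(f"Total shift: {sum(shifts.values()):.2f}")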

Scaling Our Features#

  • Since our preprocessed features are already entirely numeric, we could simply scale the processed X_train/X_test variables with a new scaler.

    • Note: for more complicated datasets, we would want to create a new preprocessor that adds the scaler to the numeric pipeline (which is what we do below).
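For reference, the “simple” version described in the first bullet would look something like this sketch (X_train_alt/X_test_alt are hypothetical names; note this approach would also rescale the 0/1 dummy columns, which is one reason to rebuild the preprocessor instead):

## sketch: scale the already-encoded frames directly with a new StandardScaler
## (this would standardize the 0/1 dummy columns too)
scaler = StandardScaler()
X_train_alt = pd.DataFrame(scaler.fit_transform(X_train_df),
                           columns=X_train_df.columns, index=X_train_df.index)
X_test_alt = pd.DataFrame(scaler.transform(X_test_df),
                          columns=X_test_df.columns, index=X_test_df.index)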

# ### Preprocessing + ColumnTransformer
num_pipe_scale = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())

## make the preprocessing column transformer
preprocessor_scale = make_column_transformer((num_pipe_scale, num_sel),
                                       (cat_pipe,cat_sel),
                                      verbose_feature_names_out=False)

## fit column transformer and run get_feature_names_out
preprocessor_scale.fit(X_train)
feature_names = preprocessor_scale.get_feature_names_out()

X_train_scaled = pd.DataFrame(preprocessor_scale.transform(X_train), 
                          columns = feature_names, index = X_train.index)


X_test_scaled = pd.DataFrame(preprocessor_scale.transform(X_test), 
                          columns = feature_names, index = X_test.index)
X_test_scaled.head(3)
age Medu Fedu traveltime studytime failures famrel freetime goout Dalc Walc health absences G1 G2 school_MS sex_M address_U famsize_LE3 Pstatus_T Mjob_at_home Mjob_health Mjob_other Mjob_services Mjob_teacher Fjob_at_home Fjob_health Fjob_other Fjob_services Fjob_teacher reason_course reason_home reason_other reason_reputation guardian_father guardian_mother guardian_other schoolsup_yes famsup_yes paid_yes activities_yes nursery_yes higher_yes internet_yes romantic_yes
58 -1.30 -1.61 -0.49 -0.65 -0.01 -0.44 0.07 -0.17 -0.94 -0.52 -0.98 1.03 -0.47 -0.56 -0.17 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00 1.00 0.00 1.00 1.00 1.00 1.00 0.00
338 1.01 0.22 0.43 -0.65 2.38 -0.44 1.18 -0.17 -0.05 -0.52 -0.98 -1.81 0.17 1.53 1.13 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00 1.00 1.00 0.00
291 0.24 1.14 0.43 -0.65 1.18 -0.44 0.07 -1.16 -0.94 -0.52 -0.22 -0.39 -0.72 1.23 1.13 0.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 1.00 1.00 0.00 1.00 1.00 1.00 0.00
## fitting a linear regression model
lin_reg = LinearRegression(fit_intercept=False)
lin_reg.fit(X_train_scaled, y_train)
coeffs_scaled = evaluate_linreg(lin_reg, X_train_scaled, y_train, X_test_scaled,y_test)
coeffs_scaled
Data  Train  Test  Delta
R^2    0.85  0.81  -0.04
RMSE   1.83  1.85   0.02
age                 -0.29
Medu                 0.32
Fedu                -0.19
traveltime           0.14
studytime           -0.18
failures            -0.07
famrel               0.28
freetime             0.02
goout               -0.02
Dalc                -0.19
Walc                 0.34
health               0.04
absences             0.42
G1                   0.47
G2                   3.81
school_MS            0.38
sex_M               -0.01
address_U            0.15
famsize_LE3          0.01
Pstatus_T            0.38
Mjob_at_home         1.88
Mjob_health          1.91
Mjob_other           2.10
Mjob_services        2.05
Mjob_teacher         1.73
Fjob_at_home         2.26
Fjob_health          2.31
Fjob_other           1.82
Fjob_services        1.40
Fjob_teacher         1.89
reason_course        2.33
reason_home          2.00
reason_other         2.72
reason_reputation    2.63
guardian_father      3.03
guardian_mother      3.30
guardian_other       3.36
schoolsup_yes        0.34
famsup_yes           0.16
paid_yes             0.16
activities_yes      -0.34
nursery_yes         -0.21
higher_yes           0.82
internet_yes        -0.09
romantic_yes        -0.29
Name: Coefficients, dtype: float64
fig, ax = plt.subplots(figsize=(5,8))
coeffs_scaled.sort_values().plot(kind='barh')

ax.axvline(0,color='black',lw=1);
../../_images/02_v2_Analyzing_Coefficients -RENAME_27_1.png

đź“Ś TO DO#

  • visualize and discuss the scaled coefficients

  • select scaled vs not scaled
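  • As a starting point for that discussion, one could rank the scaled coefficients by absolute size; with standardized numeric features, magnitudes are more directly comparable. A sketch using coeffs_scaled from above:

## rank scaled coefficients by absolute magnitude to surface the strongest effects
ranked = coeffs_scaled.reindex(coeffs_scaled.abs().sort_values(ascending=False).index)
print(ranked.head(10))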

Revisiting Our Business Case#

  • Thus far, we have done all of our modeling under the assumption that we want to predict how well current students will do in their final year.

  • However, the stakeholder likely cares more about identifying how students will perform at the very beginning of their Year 1.

    • Let’s keep this in mind and remove any features that we would not have known when the student was at the beginning of Year 1.

Modeling - For New Students#

  • We must remove:

    • G1: We wouldn’t know year 1 grades yet.

    • G2: We wouldn’t know year 1 grades yet.

  • We should probably remove:

    • paid: We would not know if students paid for extra classes in the subject yet.

      • Though we may be able to find out if they are WILLING to pay for extra classes.

    • activities: We would not know if the student was involved in extracurriculars at this school yet.

      • Though we may be able to ask students if they INTEND to be involved in activities.

  • We may want to remove:

    • absences:

      • We wouldn’t have absences from the current school, though we likely could get absences from their previous school.

    • Dalc: Work day alcohol consumption. Hopefully, the students who have not entered high school yet will not already be consuming alcohol.

    • Walc: weekend alcohol consumption. Hopefully, the students who have not entered high school yet will not already be consuming alcohol.

As you can see, some of the features are obviously inappropriate to include, but many of them are a bit more ambiguous.

  • Always think of your stakeholder’s goals/problem statement when deciding what features to include in your model/analysis.

  • When in doubt, contact and ask your stakeholder about the choice(s) you are considering!
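If we did decide to drop the more ambiguous features as well, the stricter version might look like this sketch (df_strict is a hypothetical name; the code below drops only G1 and G2):

## hypothetical stricter feature set, per the discussion above
must_drop  = ['G1', 'G2']             # future information we cannot know for new students
maybe_drop = ['paid', 'activities']   # only knowable after the student enrolls
df_strict = df.drop(columns=must_drop + maybe_drop)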

DECIDE IF USING SCALED#

Unscaled#

## remove cols that MUST be removed.

df_mvp = df.drop(columns=['G1','G2'])

# ### Train Test Split
## Make x and y variables
y = df_mvp['G3'].copy()
X = df_mvp.drop(columns=['G3']).copy()

## train-test-split with random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=SEED)



## fit column transformer and run get_feature_names_out
preprocessor.fit(X_train)
feature_names = preprocessor.get_feature_names_out()

X_train_df = pd.DataFrame(preprocessor.transform(X_train), 
                          columns = feature_names, index = X_train.index)


X_test_df = pd.DataFrame(preprocessor.transform(X_test), 
                          columns = feature_names, index = X_test.index)
X_test_df.head(3)
age Medu Fedu traveltime studytime failures famrel freetime goout Dalc Walc health absences school_MS sex_M address_U famsize_LE3 Pstatus_T Mjob_at_home Mjob_health Mjob_other Mjob_services Mjob_teacher Fjob_at_home Fjob_health Fjob_other Fjob_services Fjob_teacher reason_course reason_home reason_other reason_reputation guardian_father guardian_mother guardian_other schoolsup_yes famsup_yes paid_yes activities_yes nursery_yes higher_yes internet_yes romantic_yes
58 15.00 1.00 2.00 1.00 2.00 0.00 4.00 3.00 2.00 1.00 1.00 5.00 2.00 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00 1.00 0.00 1.00 1.00 1.00 1.00 0.00
338 18.00 3.00 3.00 1.00 4.00 0.00 5.00 3.00 3.00 1.00 1.00 1.00 7.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00 1.00 1.00 0.00
291 17.00 4.00 3.00 1.00 3.00 0.00 4.00 2.00 2.00 1.00 2.00 3.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 1.00 1.00 0.00 1.00 1.00 1.00 0.00
## fitting a linear regression model
lin_reg = LinearRegression(fit_intercept=False)
lin_reg.fit(X_train_df, y_train)
coeffs_mvp = evaluate_linreg(lin_reg, X_train_df, y_train, X_test_df,y_test)
coeffs_mvp
Data  Train  Test  Delta
R^2    0.30  0.04  -0.26
RMSE   3.91  4.12   0.21
age                 -0.34
Medu                 0.58
Fedu                -0.06
traveltime          -0.13
studytime            0.29
failures            -1.79
famrel               0.12
freetime             0.27
goout               -0.70
Dalc                -0.33
Walc                 0.36
health              -0.21
absences             0.05
school_MS            0.61
sex_M                1.00
address_U            0.60
famsize_LE3          0.29
Pstatus_T           -0.12
Mjob_at_home         2.75
Mjob_health          4.74
Mjob_other           2.74
Mjob_services        3.59
Mjob_teacher         1.42
Fjob_at_home         3.21
Fjob_health          3.34
Fjob_other           2.23
Fjob_services        2.39
Fjob_teacher         4.06
reason_course        2.82
reason_home          3.60
reason_other         4.56
reason_reputation    4.26
guardian_father      4.43
guardian_mother      4.88
guardian_other       5.93
schoolsup_yes       -1.08
famsup_yes          -1.07
paid_yes             0.41
activities_yes       0.06
nursery_yes         -0.13
higher_yes           1.38
internet_yes         0.76
romantic_yes        -1.39
Name: Coefficients, dtype: float64
  • As we can see above, NOT including the grades from Years 1 and 2 dramatically hurts our model’s ability to predict the final grade.

Scaled#

## fit column transformer and run get_feature_names_out
preprocessor_scale.fit(X_train)
feature_names = preprocessor_scale.get_feature_names_out()

X_train_scaled = pd.DataFrame(preprocessor_scale.transform(X_train), 
                          columns = feature_names, index = X_train.index)


X_test_scaled = pd.DataFrame(preprocessor_scale.transform(X_test), 
                          columns = feature_names, index = X_test.index)
X_test_scaled.head(3)
## fitting a linear regression model
lin_reg = LinearRegression(fit_intercept=False)
lin_reg.fit(X_train_scaled, y_train)
coeffs_mvp_scaled = evaluate_linreg(lin_reg, X_train_scaled, y_train,
                                    X_test_scaled,y_test, 
                                    coeffs_name="Scaled Coefficients")
coeffs_mvp_scaled
Data  Train  Test  Delta
R^2    0.30  0.04  -0.26
RMSE   3.91  4.12   0.21
age                 -0.44
Medu                 0.63
Fedu                -0.06
traveltime          -0.09
studytime            0.24
failures            -1.29
famrel               0.11
freetime             0.27
goout               -0.78
Dalc                -0.29
Walc                 0.47
health              -0.30
absences             0.38
school_MS            0.61
sex_M                1.00
address_U            0.60
famsize_LE3          0.29
Pstatus_T           -0.12
Mjob_at_home         1.65
Mjob_health          3.64
Mjob_other           1.64
Mjob_services        2.49
Mjob_teacher         0.33
Fjob_at_home         2.12
Fjob_health          2.24
Fjob_other           1.13
Fjob_services        1.30
Fjob_teacher         2.97
reason_course        1.45
reason_home          2.22
reason_other         3.19
reason_reputation    2.89
guardian_father      2.61
guardian_mother      3.05
guardian_other       4.10
schoolsup_yes       -1.08
famsup_yes          -1.07
paid_yes             0.41
activities_yes       0.06
nursery_yes         -0.13
higher_yes           1.38
internet_yes         0.76
romantic_yes        -1.39
Name: Scaled Coefficients, dtype: float64

Final Comparison#

compare = pd.concat([coeffs_mvp, coeffs_mvp_scaled],axis=1)
compare
Coefficients Scaled Coefficients
age -0.34 -0.44
Medu 0.58 0.63
Fedu -0.06 -0.06
traveltime -0.13 -0.09
studytime 0.29 0.24
failures -1.79 -1.29
famrel 0.12 0.11
freetime 0.27 0.27
goout -0.70 -0.78
Dalc -0.33 -0.29
Walc 0.36 0.47
health -0.21 -0.30
absences 0.05 0.38
school_MS 0.61 0.61
sex_M 1.00 1.00
address_U 0.60 0.60
famsize_LE3 0.29 0.29
Pstatus_T -0.12 -0.12
Mjob_at_home 2.75 1.65
Mjob_health 4.74 3.64
Mjob_other 2.74 1.64
Mjob_services 3.59 2.49
Mjob_teacher 1.42 0.33
Fjob_at_home 3.21 2.12
Fjob_health 3.34 2.24
Fjob_other 2.23 1.13
Fjob_services 2.39 1.30
Fjob_teacher 4.06 2.97
reason_course 2.82 1.45
reason_home 3.60 2.22
reason_other 4.56 3.19
reason_reputation 4.26 2.89
guardian_father 4.43 2.61
guardian_mother 4.88 3.05
guardian_other 5.93 4.10
schoolsup_yes -1.08 -1.08
famsup_yes -1.07 -1.07
paid_yes 0.41 0.41
activities_yes 0.06 0.06
nursery_yes -0.13 -0.13
higher_yes 1.38 1.38
internet_yes 0.76 0.76
romantic_yes -1.39 -1.39
compare = compare.sort_values('Coefficients')
compare['Diff'] = compare['Scaled Coefficients'] - compare['Coefficients']
compare
Coefficients Scaled Coefficients Diff
failures -1.79 -1.29 0.50
romantic_yes -1.39 -1.39 -0.00
schoolsup_yes -1.08 -1.08 0.00
famsup_yes -1.07 -1.07 -0.00
goout -0.70 -0.78 -0.09
age -0.34 -0.44 -0.10
Dalc -0.33 -0.29 0.04
health -0.21 -0.30 -0.09
nursery_yes -0.13 -0.13 -0.00
traveltime -0.13 -0.09 0.03
Pstatus_T -0.12 -0.12 -0.00
Fedu -0.06 -0.06 -0.00
absences 0.05 0.38 0.33
activities_yes 0.06 0.06 0.00
famrel 0.12 0.11 -0.01
freetime 0.27 0.27 0.00
famsize_LE3 0.29 0.29 -0.00
studytime 0.29 0.24 -0.05
Walc 0.36 0.47 0.11
paid_yes 0.41 0.41 0.00
Medu 0.58 0.63 0.05
address_U 0.60 0.60 0.00
school_MS 0.61 0.61 0.00
internet_yes 0.76 0.76 0.00
sex_M 1.00 1.00 0.00
higher_yes 1.38 1.38 -0.00
Mjob_teacher 1.42 0.33 -1.10
Fjob_other 2.23 1.13 -1.10
Fjob_services 2.39 1.30 -1.10
Mjob_other 2.74 1.64 -1.10
Mjob_at_home 2.75 1.65 -1.10
reason_course 2.82 1.45 -1.37
Fjob_at_home 3.21 2.12 -1.10
Fjob_health 3.34 2.24 -1.10
Mjob_services 3.59 2.49 -1.10
reason_home 3.60 2.22 -1.37
Fjob_teacher 4.06 2.97 -1.10
reason_reputation 4.26 2.89 -1.37
guardian_father 4.43 2.61 -1.83
reason_other 4.56 3.19 -1.37
Mjob_health 4.74 3.64 -1.10
guardian_mother 4.88 3.05 -1.83
guardian_other 5.93 4.10 -1.83
fig, axes = plt.subplots(ncols=2,figsize=(10,8),sharey=True)

compare['Coefficients'].plot(kind='barh',color='blue',ax=axes[0],title="Raw")
compare['Scaled Coefficients'].plot(kind='barh',color='green',ax=axes[1],title='Scaled')

[ax.axvline(0,color='black',lw=1) for ax in axes]
fig.tight_layout()
../../_images/02_v2_Analyzing_Coefficients -RENAME_44_0.png
  • Notice how VERY different our coefficients are now that we have removed the students’ grades from the prior 2 years!

BOOKMARK: interpret new coeffs

Selecting Our Final Model for Extracting Insights#

  • Out of all of the variants we have tried, the best one to use going forward is our unscaled model on the reduced, new-student feature set (once again, you can make an argument for using the scaled version as well).

coeffs_mvp.sort_values()
failures            -1.79
romantic_yes        -1.39
schoolsup_yes       -1.08
famsup_yes          -1.07
goout               -0.70
age                 -0.34
Dalc                -0.33
health              -0.21
nursery_yes         -0.13
traveltime          -0.13
Pstatus_T           -0.12
Fedu                -0.06
absences             0.05
activities_yes       0.06
famrel               0.12
freetime             0.27
famsize_LE3          0.29
studytime            0.29
Walc                 0.36
paid_yes             0.41
Medu                 0.58
address_U            0.60
school_MS            0.61
internet_yes         0.76
sex_M                1.00
higher_yes           1.38
Mjob_teacher         1.42
Fjob_other           2.23
Fjob_services        2.39
Mjob_other           2.74
Mjob_at_home         2.75
reason_course        2.82
Fjob_at_home         3.21
Fjob_health          3.34
Mjob_services        3.59
reason_home          3.60
Fjob_teacher         4.06
reason_reputation    4.26
guardian_father      4.43
reason_other         4.56
Mjob_health          4.74
guardian_mother      4.88
guardian_other       5.93
Name: Coefficients, dtype: float64
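  • To address the BOOKMARK above, here is a sketch of how these values could be turned into stakeholder-friendly statements (interpret_coeff is a hypothetical helper, not part of sklearn; shown for numeric features, where a 1-unit change is meaningful):

## hypothetical helper: translate a coefficient into a plain-language statement
def interpret_coeff(name, value):
    direction = 'increases' if value > 0 else 'decreases'
    return (f"Each 1-unit increase in '{name}' {direction} the predicted "
            f"final grade (G3) by {abs(value):.2f} points, all else equal.")

for feat in ['failures', 'studytime', 'goout']:
    print(interpret_coeff(feat, coeffs_mvp[feat]))
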
  • In the next lesson, we will focus on another type of model-based value.