Insights for Stakeholders#

Lesson Objectives#

By the end of this lesson, students will be able to:

  • Define the stakeholder and their business problem that will be the guiding framework for this week’s lessons.

  • Identify the target and features for a machine learning model to use for insights

  • Understand how different feature/target choices can provide a different spin/perspective on the problem.

The Stakeholder#

  • We’ve been hired by a school district that wants to use data science to identify and support high school students at risk of poor performance.

  • They have provided us with data on several hundred of their former students and want to identify students who are at risk of poor grades/performance by year 3.

    • The school included 3 years of grades(presumable grades 10-12), which are labeled as (G1-G3).

    • They sent us an excel file with 3 sheets:

      • student-mat: grades for Math (the student-mat sheet)

      • student-por: grades for Portuguese

      • README: data dictionary

  • The goal is to identify these students and provide additional support/tutoring to improve their academic performance.

Our Task#

  • Develop machine-learning models to predict student performance in their final year (year 3).

  • We will then use our model(s) to extract insights into which students are most at-risk for poor performance.

  • We will provide a summary of our findings and 3 recommendations to the school on how to best identify these at-risk students.

We will be focusing on the Math grades to start.

Stakeholder Considerations#

  • Before diving into modeling, let’s examine the data the school district has provided us and let’s brainstorm/discuss our approach for our models.

EDA#

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

## Customization Options
# pd.set_option("display.max_columns",100)
plt.style.use(['seaborn-talk'])
mpl.rcParams['figure.facecolor']='white'
# import dataframe_image as dfi
# url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vS4sHmGyjSFri2HksxLxhLyFaBrewkuzVXvmhEMalxNxEQPZMVUxVkkOMgpCNFGdYJV0L-2bW59aM_M/pub?output=csv"
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vS6xDKNpWkBBdhZSqepy48bXo55QnRv1Xy6tXTKYzZLMPjZozMfYhHQjAcC8uj9hQ/pub?output=xlsx"

df = pd.read_excel(url,sheet_name='student-mat')
df.info()
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   school      395 non-null    object 
 1   sex         395 non-null    object 
 2   age         395 non-null    float64
 3   address     395 non-null    object 
 4   famsize     395 non-null    object 
 5   Pstatus     395 non-null    object 
 6   Medu        395 non-null    float64
 7   Fedu        395 non-null    float64
 8   Mjob        395 non-null    object 
 9   Fjob        395 non-null    object 
 10  reason      395 non-null    object 
 11  guardian    395 non-null    object 
 12  traveltime  395 non-null    float64
 13  studytime   395 non-null    float64
 14  failures    395 non-null    float64
 15  schoolsup   395 non-null    object 
 16  famsup      395 non-null    object 
 17  paid        395 non-null    object 
 18  activities  395 non-null    object 
 19  nursery     395 non-null    object 
 20  higher      395 non-null    object 
 21  internet    395 non-null    object 
 22  romantic    395 non-null    object 
 23  famrel      395 non-null    float64
 24  freetime    395 non-null    float64
 25  goout       395 non-null    float64
 26  Dalc        395 non-null    float64
 27  Walc        395 non-null    float64
 28  health      395 non-null    float64
 29  absences    395 non-null    float64
 30  G1          395 non-null    float64
 31  G2          395 non-null    float64
 32  G3          395 non-null    float64
dtypes: float64(16), object(17)
memory usage: 102.0+ KB
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18.0 U GT3 A 4.0 4.0 at_home teacher ... 4.0 3.0 4.0 1.0 1.0 3.0 6.0 5.0 6.0 6.0
1 GP F 17.0 U GT3 T 1.0 1.0 at_home other ... 5.0 3.0 3.0 1.0 1.0 3.0 4.0 5.0 5.0 6.0
2 GP F 15.0 U LE3 T 1.0 1.0 at_home other ... 4.0 3.0 2.0 2.0 3.0 3.0 10.0 7.0 8.0 10.0
3 GP F 15.0 U GT3 T 4.0 2.0 health services ... 3.0 2.0 2.0 1.0 1.0 5.0 2.0 15.0 14.0 15.0
4 GP F 16.0 U GT3 T 3.0 3.0 other other ... 4.0 3.0 2.0 1.0 2.0 5.0 4.0 6.0 10.0 10.0

5 rows × 33 columns

# increasing the col width so we can read the description
pd.options.display.max_colwidth = 200
data_dict = pd.read_excel(url,sheet_name='README',usecols=[0,1,2],
                          skiprows=1,index_col=0)
data_dict
Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
#
1.0 school - student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)
2.0 sex - student's sex (binary: "F" - female or "M" - male)
3.0 age - student's age (numeric: from 15 to 22)
4.0 address - student's home address type (binary: "U" - urban or "R" - rural)
5.0 famsize - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
6.0 Pstatus - parent's cohabitation status (binary: "T" - living together or "A" - apart)
7.0 Medu mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education
8.0 Fedu father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education
9.0 Mjob - mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
10.0 Fjob - father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
11.0 reason - reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
12.0 guardian - student's guardian (nominal: "mother", "father" or "other")
13.0 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14.0 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15.0 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16.0 schoolsup - extra educational support (binary: yes or no)
17.0 famsup - family educational support (binary: yes or no)
18.0 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19.0 activities - extra-curricular activities (binary: yes or no)
20.0 nursery - attended nursery school (binary: yes or no)
21.0 higher - wants to take higher education (binary: yes or no)
22.0 internet - Internet access at home (binary: yes or no)
23.0 romantic - with a romantic relationship (binary: yes or no)
24.0 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25.0 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26.0 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27.0 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28.0 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29.0 health - current health status (numeric: from 1 - very bad to 5 - very good)
30.0 absences - number of school absences (numeric: from 0 to 93)
= = =
# Info these grades are related with the course subject, Math or Portuguese:
31.0 G1 - first period grade (numeric: from 0 to 20)
31.0 G2 - second period grade (numeric: from 0 to 20)
32.0 G3 - final grade (numeric: from 0 to 20, output target)
NaN NaN NaN
Note Additional note: :there are several (382) students that belong to both datasets .
Note Additional note: These students can be identified by searching for identical attributes that characterize each student, as shown in the annexed R file.

Exploratory Data Analysis#

  • We need to get an overview of this dataset and we’d like to get it quickly. While it is good practice to manually perform the EDA steps, we can also leverage additional tools/packages for quick dataset inspection.

def summarize_df(df_):
    df = df_.copy()
    report = pd.DataFrame({             
            'dtype':df.dtypes,
             '# null': df.isna().sum(),
             'null (%)': df.isna().sum()/len(df)*100,
        'nunique':df.nunique(),
        "min":df.min(),
        'max':df.max()
             })
    report.index.name='Column'
    return report.reset_index()

summarize_df(df)
Column dtype # null null (%) nunique min max
0 school object 0 0.0 2 GP MS
1 sex object 0 0.0 2 F M
2 age float64 0 0.0 8 15.0 22.0
3 address object 0 0.0 2 R U
4 famsize object 0 0.0 2 GT3 LE3
5 Pstatus object 0 0.0 2 A T
6 Medu float64 0 0.0 5 0.0 4.0
7 Fedu float64 0 0.0 5 0.0 4.0
8 Mjob object 0 0.0 5 at_home teacher
9 Fjob object 0 0.0 5 at_home teacher
10 reason object 0 0.0 4 course reputation
11 guardian object 0 0.0 3 father other
12 traveltime float64 0 0.0 4 1.0 4.0
13 studytime float64 0 0.0 4 1.0 4.0
14 failures float64 0 0.0 4 0.0 3.0
15 schoolsup object 0 0.0 2 no yes
16 famsup object 0 0.0 2 no yes
17 paid object 0 0.0 2 no yes
18 activities object 0 0.0 2 no yes
19 nursery object 0 0.0 2 no yes
20 higher object 0 0.0 2 no yes
21 internet object 0 0.0 2 no yes
22 romantic object 0 0.0 2 no yes
23 famrel float64 0 0.0 5 1.0 5.0
24 freetime float64 0 0.0 5 1.0 5.0
25 goout float64 0 0.0 5 1.0 5.0
26 Dalc float64 0 0.0 5 1.0 5.0
27 Walc float64 0 0.0 5 1.0 5.0
28 health float64 0 0.0 5 1.0 5.0
29 absences float64 0 0.0 34 0.0 75.0
30 G1 float64 0 0.0 17 3.0 19.0
31 G2 float64 0 0.0 17 0.0 19.0
32 G3 float64 0 0.0 18 0.0 20.0

Determining Our Target#

  • The stakeholder wants to identify students based on their predicted performance in their final year.

    • “G1”,”G2”,”G3” are the student final score at the end of years 1-3.

  • One approach we could take is treat this as a regression task, with G3 as our target.

    • This would allow us to predict the exact final grade of each student.

Types<?> of Model-Based Insights#

There are several approaches available to us for modeling-based insights.

First, we will examine big-picture insights for the entire model/dataset. Some options include:

  • Use a Linear Regression to extract coefficients.

    • Unscaled: exact effect on final grade of each feature.

    • Scaled: features that are the most important for final grade.

  • Use tree-based regression models for Feature Importance.

    • Which features are the most helpful for predicting grade? (built-in importance)

    • Which features damage the model’s performance the most when shuffled? (permutation importance).

We could also treat this as a classification task by creating a “at-risk” or “under-performing” classification column based on grades. - We can then leverage additional model explanation tools to understand which features make a student more likely to under-perform.

Second, we will select stakeholder-appropriate visualizations to summarize our findings.

  • We can leverage additional explanation packages to help us illustrate how a specific student’s features influence their predicted performance.

EDA#

  • Before we dive into modeling, let’s take a moment for some EDA visualizations to help us get a sense of our target.

Visualizing Grades#

ax = sns.histplot(data=df,x='G3');
ax.set_title("Distribution of Final Grades")
Text(0.5, 1.0, 'Distribution of Final Grades')
../../_images/Insights for Stakeholders_23_1.png
  • Lets compare all 3 years of grades on one histogram. Since each grade is a separate column, one approach we could take is to use the histplot function 3 times, once for each column.

## Use histplot 3 times
sns.histplot(data=df,x='G3',color='r',alpha=0.9,kde=True,label='G3');
sns.histplot(data=df,x='G2',color='b',alpha=0.7,kde=True,label='G2');
sns.histplot(data=df,x='G1',color='slategray',alpha=0.6,kde=True,label='G1');
plt.title('Compare Years 1-3 (Original Vers)')
plt.legend();
../../_images/Insights for Stakeholders_25_0.png
  • Another way we can leverage seaborn’s functionality is to make a new version of our dataframe where the columns “G1”-“G3” will be transposed so that we will have a 1 column with all of the grades and another column with the label for which year the grade was from.

## Let's turn the integer index into a col for student-id
df = df.reset_index()
df = df.rename({'index':'student'},axis=1)
df
student school sex age address famsize Pstatus Medu Fedu Mjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 0 GP F 18.0 U GT3 A 4.0 4.0 at_home ... 4.0 3.0 4.0 1.0 1.0 3.0 6.0 5.0 6.0 6.0
1 1 GP F 17.0 U GT3 T 1.0 1.0 at_home ... 5.0 3.0 3.0 1.0 1.0 3.0 4.0 5.0 5.0 6.0
2 2 GP F 15.0 U LE3 T 1.0 1.0 at_home ... 4.0 3.0 2.0 2.0 3.0 3.0 10.0 7.0 8.0 10.0
3 3 GP F 15.0 U GT3 T 4.0 2.0 health ... 3.0 2.0 2.0 1.0 1.0 5.0 2.0 15.0 14.0 15.0
4 4 GP F 16.0 U GT3 T 3.0 3.0 other ... 4.0 3.0 2.0 1.0 2.0 5.0 4.0 6.0 10.0 10.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
390 390 MS M 20.0 U LE3 A 2.0 2.0 services ... 5.0 5.0 4.0 4.0 5.0 4.0 11.0 9.0 9.0 9.0
391 391 MS M 17.0 U LE3 T 3.0 1.0 services ... 2.0 4.0 5.0 3.0 4.0 2.0 3.0 14.0 16.0 16.0
392 392 MS M 21.0 R GT3 T 1.0 1.0 other ... 5.0 5.0 3.0 3.0 3.0 3.0 3.0 10.0 8.0 7.0
393 393 MS M 18.0 R LE3 T 3.0 2.0 services ... 4.0 4.0 1.0 3.0 4.0 5.0 0.0 11.0 12.0 10.0
394 394 MS M 19.0 U LE3 T 1.0 1.0 other ... 3.0 2.0 3.0 3.0 3.0 5.0 5.0 8.0 9.0 9.0

395 rows × 34 columns

melted = pd.melt(df,id_vars='student', value_vars=['G1','G2','G3'],var_name='PrevYear',
                value_name="Grade")
melted
student PrevYear Grade
0 0 G1 5.0
1 1 G1 5.0
2 2 G1 7.0
3 3 G1 15.0
4 4 G1 6.0
... ... ... ...
1180 390 G3 9.0
1181 391 G3 16.0
1182 392 G3 7.0
1183 393 G3 10.0
1184 394 G3 9.0

1185 rows × 3 columns

ax = sns.histplot(data=melted,hue='PrevYear', x='Grade',kde=True)
ax.set_title('Comparing Years 1-3 (Melted Vers)')
Text(0.5, 1.0, 'Comparing Years 1-3 (Melted Vers)')
../../_images/Insights for Stakeholders_29_1.png
  • Visualizing all 3 grades at the same time, we can see there is actually a gap in the middle of the scores. It seems no student ever receives a grade around 10. This may be something to investigate further in the future.

  • It also looks like The # of students with scores near 0 increased each year.

    • Is there a better way we can see how students performance changed each year?

Visualizing Progress By Grade Groups#

  • Let’s define groups for each year’s scores by binning the scores.

df.describe()[['G1','G2','G3']]
G1 G2 G3
count 395.000000 395.000000 395.000000
mean 10.908861 10.713924 10.415190
std 3.319195 3.761505 4.581443
min 3.000000 0.000000 0.000000
25% 8.000000 9.000000 8.000000
50% 11.000000 11.000000 11.000000
75% 13.000000 13.000000 14.000000
max 19.000000 19.000000 20.000000
## defining bins and their labels
bins = [0,8,14,20]

bin_labels = ['1. low','2. med','3. high']
## Visualizing bin cutoffs
ax = sns.histplot(data=melted,hue='PrevYear', x='Grade',kde=True)
ax.set_title('Comparing Years 1-3 (Melted Vers)')
ax.axvspan(bins[0],bins[1],zorder=0,color='lightcoral',label='Low',alpha=0.6)
ax.axvspan(bins[1],bins[2],zorder=0,color='skyblue', label='Med',alpha=0.6)
ax.axvspan(bins[2],bins[3],zorder=0,color='limegreen', label='High',alpha=0.6)
ax.set_xlim(left=0,right=20)
(0.0, 20.0)
../../_images/Insights for Stakeholders_35_1.png
  • ADD A BRIEF EXPLANATION

pd.cut(df['G1'], bins,right=False,include_lowest=True)#,labels=bin_labels)
0        [0, 8)
1        [0, 8)
2        [0, 8)
3      [14, 20)
4        [0, 8)
         ...   
390     [8, 14)
391    [14, 20)
392     [8, 14)
393     [8, 14)
394     [8, 14)
Name: G1, Length: 395, dtype: category
Categories (3, interval[int64, left]): [[0, 8) < [8, 14) < [14, 20)]
bins
[0, 8, 14, 20]
df['First-Year Group'] = pd.cut(df['G1'],bins,labels=bin_labels, right=True,include_lowest=True)
df['Final-Year Group'] = pd.cut(df['G3'],bins,labels=bin_labels,right=True,include_lowest=True)
df
student school sex age address famsize Pstatus Medu Fedu Mjob ... goout Dalc Walc health absences G1 G2 G3 First-Year Group Final-Year Group
0 0 GP F 18.0 U GT3 A 4.0 4.0 at_home ... 4.0 1.0 1.0 3.0 6.0 5.0 6.0 6.0 1. low 1. low
1 1 GP F 17.0 U GT3 T 1.0 1.0 at_home ... 3.0 1.0 1.0 3.0 4.0 5.0 5.0 6.0 1. low 1. low
2 2 GP F 15.0 U LE3 T 1.0 1.0 at_home ... 2.0 2.0 3.0 3.0 10.0 7.0 8.0 10.0 1. low 2. med
3 3 GP F 15.0 U GT3 T 4.0 2.0 health ... 2.0 1.0 1.0 5.0 2.0 15.0 14.0 15.0 3. high 3. high
4 4 GP F 16.0 U GT3 T 3.0 3.0 other ... 2.0 1.0 2.0 5.0 4.0 6.0 10.0 10.0 1. low 2. med
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
390 390 MS M 20.0 U LE3 A 2.0 2.0 services ... 4.0 4.0 5.0 4.0 11.0 9.0 9.0 9.0 2. med 2. med
391 391 MS M 17.0 U LE3 T 3.0 1.0 services ... 5.0 3.0 4.0 2.0 3.0 14.0 16.0 16.0 2. med 3. high
392 392 MS M 21.0 R GT3 T 1.0 1.0 other ... 3.0 3.0 3.0 3.0 3.0 10.0 8.0 7.0 2. med 1. low
393 393 MS M 18.0 R LE3 T 3.0 2.0 services ... 1.0 3.0 4.0 5.0 0.0 11.0 12.0 10.0 2. med 2. med
394 394 MS M 19.0 U LE3 T 1.0 1.0 other ... 3.0 3.0 3.0 5.0 5.0 8.0 9.0 9.0 1. low 2. med

395 rows × 36 columns

df[df['Final-Year Group'].isna()]
student school sex age address famsize Pstatus Medu Fedu Mjob ... goout Dalc Walc health absences G1 G2 G3 First-Year Group Final-Year Group

0 rows × 36 columns

df['Final-Year Group'].value_counts(dropna=False)
2. med     220
1. low     102
3. high     73
Name: Final-Year Group, dtype: int64
melted = pd.melt(df,id_vars=['student','First-Year Group','Final-Year Group'], 
                 value_vars=['G1','G2','G3'],var_name='PrevYear',
                value_name="Grade")
melted
student First-Year Group Final-Year Group PrevYear Grade
0 0 1. low 1. low G1 5.0
1 1 1. low 1. low G1 5.0
2 2 1. low 2. med G1 7.0
3 3 3. high 3. high G1 15.0
4 4 1. low 2. med G1 6.0
... ... ... ... ... ...
1180 390 2. med 2. med G3 9.0
1181 391 2. med 3. high G3 16.0
1182 392 2. med 1. low G3 7.0
1183 393 2. med 2. med G3 10.0
1184 394 1. low 2. med G3 9.0

1185 rows × 5 columns

bins_info = [('1. low',{ 'bins':(bins[0],bins[1]), 
                    'line_color':'darkred','span_color': 'lightcoral'}),
             ('2. med',{ 'bins':(bins[1],bins[2]),
                    'line_color':'blue','span_color':'skyblue'}),
              ('3. high',{ 'bins':(bins[2],bins[3]),
                      'line_color':'forestgreen','span_color':'limegreen'})
            ]
bins_info
[('1. low',
  {'bins': (0, 8), 'line_color': 'darkred', 'span_color': 'lightcoral'}),
 ('2. med', {'bins': (8, 14), 'line_color': 'blue', 'span_color': 'skyblue'}),
 ('3. high',
  {'bins': (14, 20), 'line_color': 'forestgreen', 'span_color': 'limegreen'})]
## slice out list of line colors
line_colors = [v[1]['line_color'] for v in bins_info]
line_colors
['darkred', 'blue', 'forestgreen']
## Plot visual using Final year Groups for colors
ax = pd.plotting.parallel_coordinates(df,'Final-Year Group',cols=['G1','G2','G3'],
                                 color= line_colors,sort_labels=False,alpha=0.5)
ax.set_title("Year 1-3 grades - Colored by Final Grade Groups");

## Annotate
for group, grp_dict in bins_info:
    ax.axhspan(grp_dict['bins'][0],grp_dict['bins'][1],
               color=grp_dict['span_color'], alpha=0.5, label=group)

ax.set_ylim(bottom=0,top=20)
(0.0, 20.0)
../../_images/Insights for Stakeholders_46_1.png
  • Looking at our parallel coordinates plot - grouped by Final Grades, we can see that there are students who were in the low group for G3 that had been in the med-group range. So students whose performance dropped from average to below average over the years.

  • We can also see that there were some students in the high group whose G1 scores were in the med range, showing that they improved their performance over years 1-3.

  • We also notice that there are a lot of zeroes that appear in G2 and G3, what we can see all that lines that were 0 for G2 stayed at 0 for G3.

Next Steps#

  • Now that we have an idea of the target we are analyzing, we are ready to move onto modeling!

  • We will start by exploring answering our stakeholder’s question(s) with regression modeling.

  • Let’s get started!

APPENDIX#

raise Exception("stop here")
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Input In [24], in <cell line: 1>()
----> 1 raise Exception("stop here")

Exception: stop here
## exaple from https://pandas.pydata.org/docs/reference/api/pandas.plotting.parallel_coordinates.html
## want a version for g1-g3
df2 = pd.read_csv(
    'https://raw.github.com/pandas-dev/'
    'pandas/main/pandas/tests/io/data/csv/iris.csv'
)
df2
pd.plotting.parallel_coordinates(df2, 'Name')