Insights for Stakeholders#

Lesson Objectives#

By the end of this lesson, students will be able to:

Define the stakeholder and their business problem that will be the guiding framework for this week’s lessons.
Identify the target and features for a machine learning model to use for insights
Understand how different feature/target choices can provide a different spin/perspective on the problem.

The Stakeholder#

We’ve been hired by a school district that wants to use data science to identify and support high school students at risk of poor performance.
They have provided us with data on several hundred of their former students and want to identify students who are at risk of poor grades/performance by year 3.
- The school included 3 years of grades(presumable grades 10-12), which are labeled as (G1-G3).
- They sent us an excel file with 3 sheets:
  - student-mat: grades for Math (the student-mat sheet)
  - student-por: grades for Portuguese
  - README: data dictionary
The goal is to identify these students and provide additional support/tutoring to improve their academic performance.

Our Task#

Develop machine-learning models to predict student performance in their final year (year 3).
We will then use our model(s) to extract insights into which students are most at-risk for poor performance.
We will provide a summary of our findings and 3 recommendations to the school on how to best identify these at-risk students.

We will be focusing on the Math grades to start.

Stakeholder Considerations#

Before diving into modeling, let’s examine the data the school district has provided us and let’s brainstorm/discuss our approach for our models.

EDA#

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

## Customization Options
# pd.set_option("display.max_columns",100)
plt.style.use(['seaborn-talk'])
mpl.rcParams['figure.facecolor']='white'

# import dataframe_image as dfi

# url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vS4sHmGyjSFri2HksxLxhLyFaBrewkuzVXvmhEMalxNxEQPZMVUxVkkOMgpCNFGdYJV0L-2bW59aM_M/pub?output=csv"
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vS6xDKNpWkBBdhZSqepy48bXo55QnRv1Xy6tXTKYzZLMPjZozMfYhHQjAcC8uj9hQ/pub?output=xlsx"

df = pd.read_excel(url,sheet_name='student-mat')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 school      395 non-null    object 
 sex         395 non-null    object 
 age         395 non-null    float64
 address     395 non-null    object 
 famsize     395 non-null    object 
 Pstatus     395 non-null    object 
 Medu        395 non-null    float64
 Fedu        395 non-null    float64
 Mjob        395 non-null    object 
 Fjob        395 non-null    object 
reason      395 non-null    object 
guardian    395 non-null    object 
traveltime  395 non-null    float64
studytime   395 non-null    float64
failures    395 non-null    float64
schoolsup   395 non-null    object 
famsup      395 non-null    object 
paid        395 non-null    object 
activities  395 non-null    object 
nursery     395 non-null    object 
higher      395 non-null    object 
internet    395 non-null    object 
romantic    395 non-null    object 
famrel      395 non-null    float64
freetime    395 non-null    float64
goout       395 non-null    float64
Dalc        395 non-null    float64
Walc        395 non-null    float64
health      395 non-null    float64
absences    395 non-null    float64
G1          395 non-null    float64
G2          395 non-null    float64
G3          395 non-null    float64
dtypes: float64(16), object(17)
memory usage: 102.0+ KB

	school	sex	age	address	famsize	Pstatus	Medu	Fedu	Mjob	Fjob	...	famrel	freetime	goout	Dalc	Walc	health	absences	G1	G2	G3
0	GP	F	18.0	U	GT3	A	4.0	4.0	at_home	teacher	...	4.0	3.0	4.0	1.0	1.0	3.0	6.0	5.0	6.0	6.0
1	GP	F	17.0	U	GT3	T	1.0	1.0	at_home	other	...	5.0	3.0	3.0	1.0	1.0	3.0	4.0	5.0	5.0	6.0
2	GP	F	15.0	U	LE3	T	1.0	1.0	at_home	other	...	4.0	3.0	2.0	2.0	3.0	3.0	10.0	7.0	8.0	10.0
3	GP	F	15.0	U	GT3	T	4.0	2.0	health	services	...	3.0	2.0	2.0	1.0	1.0	5.0	2.0	15.0	14.0	15.0
4	GP	F	16.0	U	GT3	T	3.0	3.0	other	other	...	4.0	3.0	2.0	1.0	2.0	5.0	4.0	6.0	10.0	10.0

5 rows × 33 columns

# increasing the col width so we can read the description
pd.options.display.max_colwidth = 200
data_dict = pd.read_excel(url,sheet_name='README',usecols=[0,1,2],
                          skiprows=1,index_col=0)
data_dict

	Attributes	for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
#
1.0	school	- student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)
2.0	sex	- student's sex (binary: "F" - female or "M" - male)
3.0	age	- student's age (numeric: from 15 to 22)
4.0	address	- student's home address type (binary: "U" - urban or "R" - rural)
5.0	famsize	- family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
6.0	Pstatus	- parent's cohabitation status (binary: "T" - living together or "A" - apart)
7.0	Medu	mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 ‚Äì 5th to 9th grade, 3 ‚Äì secondary education or 4 ‚Äì higher education
8.0	Fedu	father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 ‚Äì 5th to 9th grade, 3 ‚Äì secondary education or 4 ‚Äì higher education
9.0	Mjob	- mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
10.0	Fjob	- father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
11.0	reason	- reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
12.0	guardian	- student's guardian (nominal: "mother", "father" or "other")
13.0	traveltime	- home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14.0	studytime	- weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15.0	failures	- number of past class failures (numeric: n if 1<=n<3, else 4)
16.0	schoolsup	- extra educational support (binary: yes or no)
17.0	famsup	- family educational support (binary: yes or no)
18.0	paid	- extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19.0	activities	- extra-curricular activities (binary: yes or no)
20.0	nursery	- attended nursery school (binary: yes or no)
21.0	higher	- wants to take higher education (binary: yes or no)
22.0	internet	- Internet access at home (binary: yes or no)
23.0	romantic	- with a romantic relationship (binary: yes or no)
24.0	famrel	- quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25.0	freetime	- free time after school (numeric: from 1 - very low to 5 - very high)
26.0	goout	- going out with friends (numeric: from 1 - very low to 5 - very high)
27.0	Dalc	- workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28.0	Walc	- weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29.0	health	- current health status (numeric: from 1 - very bad to 5 - very good)
30.0	absences	- number of school absences (numeric: from 0 to 93)
=	=	=
#	Info	these grades are related with the course subject, Math or Portuguese:
31.0	G1	- first period grade (numeric: from 0 to 20)
31.0	G2	- second period grade (numeric: from 0 to 20)
32.0	G3	- final grade (numeric: from 0 to 20, output target)
NaN	NaN	NaN
Note	Additional note:	:there are several (382) students that belong to both datasets .
Note	Additional note:	These students can be identified by searching for identical attributes that characterize each student, as shown in the annexed R file.

Exploratory Data Analysis#

We need to get an overview of this dataset and we’d like to get it quickly. While it is good practice to manually perform the EDA steps, we can also leverage additional tools/packages for quick dataset inspection.

def summarize_df(df_):
    df = df_.copy()
    report = pd.DataFrame({             
            'dtype':df.dtypes,
             '# null': df.isna().sum(),
             'null (%)': df.isna().sum()/len(df)*100,
        'nunique':df.nunique(),
        "min":df.min(),
        'max':df.max()
             })
    report.index.name='Column'
    return report.reset_index()

summarize_df(df)

	Column	dtype	nunique	min	max
0	school	object	2	GP	MS
1	sex	object	2	F	M
2	age	float64	8	15.0	22.0
3	address	object	2	R	U
4	famsize	object	2	GT3	LE3
5	Pstatus	object	2	A	T
6	Medu	float64	5	0.0	4.0
7	Fedu	float64	5	0.0	4.0
8	Mjob	object	5	at_home	teacher
9	Fjob	object	5	at_home	teacher
10	reason	object	4	course	reputation
11	guardian	object	3	father	other
12	traveltime	float64	4	1.0	4.0
13	studytime	float64	4	1.0	4.0
14	failures	float64	4	0.0	3.0
15	schoolsup	object	2	no	yes
16	famsup	object	2	no	yes
17	paid	object	2	no	yes
18	activities	object	2	no	yes
19	nursery	object	2	no	yes
20	higher	object	2	no	yes
21	internet	object	2	no	yes
22	romantic	object	2	no	yes
23	famrel	float64	5	1.0	5.0
24	freetime	float64	5	1.0	5.0
25	goout	float64	5	1.0	5.0
26	Dalc	float64	5	1.0	5.0
27	Walc	float64	5	1.0	5.0
28	health	float64	5	1.0	5.0
29	absences	float64	34	0.0	75.0
30	G1	float64	17	3.0	19.0
31	G2	float64	17	0.0	19.0
32	G3	float64	18	0.0	20.0

Determining Our Target#

The stakeholder wants to identify students based on their predicted performance in their final year.
- “G1”,”G2”,”G3” are the student final score at the end of years 1-3.
One approach we could take is treat this as a regression task, with G3 as our target.
- This would allow us to predict the exact final grade of each student.

`Types<?>` of Model-Based Insights#

There are several approaches available to us for modeling-based insights.

First, we will examine big-picture insights for the entire model/dataset. Some options include:

Use a Linear Regression to extract coefficients.
- Unscaled: exact effect on final grade of each feature.
- Scaled: features that are the most important for final grade.
Use tree-based regression models for Feature Importance.
- Which features are the most helpful for predicting grade? (built-in importance)
- Which features damage the model’s performance the most when shuffled? (permutation importance).

We could also treat this as a classification task by creating a “at-risk” or “under-performing” classification column based on grades. - We can then leverage additional model explanation tools to understand which features make a student more likely to under-perform.

Second, we will select stakeholder-appropriate visualizations to summarize our findings.

We can leverage additional explanation packages to help us illustrate how a specific student’s features influence their predicted performance.

EDA#

Before we dive into modeling, let’s take a moment for some EDA visualizations to help us get a sense of our target.

Visualizing Grades#

ax = sns.histplot(data=df,x='G3');
ax.set_title("Distribution of Final Grades")

Text(0.5, 1.0, 'Distribution of Final Grades')

../../_images/Insights for Stakeholders_23_1.png

Lets compare all 3 years of grades on one histogram. Since each grade is a separate column, one approach we could take is to use the histplot function 3 times, once for each column.

## Use histplot 3 times
sns.histplot(data=df,x='G3',color='r',alpha=0.9,kde=True,label='G3');
sns.histplot(data=df,x='G2',color='b',alpha=0.7,kde=True,label='G2');
sns.histplot(data=df,x='G1',color='slategray',alpha=0.6,kde=True,label='G1');
plt.title('Compare Years 1-3 (Original Vers)')
plt.legend();

../../_images/Insights for Stakeholders_25_0.png

Another way we can leverage seaborn’s functionality is to make a new version of our dataframe where the columns “G1”-“G3” will be transposed so that we will have a 1 column with all of the grades and another column with the label for which year the grade was from.

## Let's turn the integer index into a col for student-id
df = df.reset_index()
df = df.rename({'index':'student'},axis=1)
df

	student	school	sex	age	address	famsize	Pstatus	Medu	Fedu	Mjob	...	famrel	freetime	goout	Dalc	Walc	health	absences	G1	G2	G3
0	0	GP	F	18.0	U	GT3	A	4.0	4.0	at_home	...	4.0	3.0	4.0	1.0	1.0	3.0	6.0	5.0	6.0	6.0
1	1	GP	F	17.0	U	GT3	T	1.0	1.0	at_home	...	5.0	3.0	3.0	1.0	1.0	3.0	4.0	5.0	5.0	6.0
2	2	GP	F	15.0	U	LE3	T	1.0	1.0	at_home	...	4.0	3.0	2.0	2.0	3.0	3.0	10.0	7.0	8.0	10.0
3	3	GP	F	15.0	U	GT3	T	4.0	2.0	health	...	3.0	2.0	2.0	1.0	1.0	5.0	2.0	15.0	14.0	15.0
4	4	GP	F	16.0	U	GT3	T	3.0	3.0	other	...	4.0	3.0	2.0	1.0	2.0	5.0	4.0	6.0	10.0	10.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
390	390	MS	M	20.0	U	LE3	A	2.0	2.0	services	...	5.0	5.0	4.0	4.0	5.0	4.0	11.0	9.0	9.0	9.0
391	391	MS	M	17.0	U	LE3	T	3.0	1.0	services	...	2.0	4.0	5.0	3.0	4.0	2.0	3.0	14.0	16.0	16.0
392	392	MS	M	21.0	R	GT3	T	1.0	1.0	other	...	5.0	5.0	3.0	3.0	3.0	3.0	3.0	10.0	8.0	7.0
393	393	MS	M	18.0	R	LE3	T	3.0	2.0	services	...	4.0	4.0	1.0	3.0	4.0	5.0	0.0	11.0	12.0	10.0
394	394	MS	M	19.0	U	LE3	T	1.0	1.0	other	...	3.0	2.0	3.0	3.0	3.0	5.0	5.0	8.0	9.0	9.0

395 rows × 34 columns

melted = pd.melt(df,id_vars='student', value_vars=['G1','G2','G3'],var_name='PrevYear',
                value_name="Grade")
melted

	student	PrevYear	Grade
0	0	G1	5.0
1	1	G1	5.0
2	2	G1	7.0
3	3	G1	15.0
4	4	G1	6.0
...	...	...	...
1180	390	G3	9.0
1181	391	G3	16.0
1182	392	G3	7.0
1183	393	G3	10.0
1184	394	G3	9.0

1185 rows × 3 columns

ax = sns.histplot(data=melted,hue='PrevYear', x='Grade',kde=True)
ax.set_title('Comparing Years 1-3 (Melted Vers)')

Text(0.5, 1.0, 'Comparing Years 1-3 (Melted Vers)')

../../_images/Insights for Stakeholders_29_1.png

Visualizing all 3 grades at the same time, we can see there is actually a gap in the middle of the scores. It seems no student ever receives a grade around 10. This may be something to investigate further in the future.
It also looks like The # of students with scores near 0 increased each year.
- Is there a better way we can see how students performance changed each year?

Visualizing Progress By Grade Groups#

Let’s define groups for each year’s scores by binning the scores.

df.describe()[['G1','G2','G3']]

	G1	G2	G3
count	395.000000	395.000000	395.000000
mean	10.908861	10.713924	10.415190
std	3.319195	3.761505	4.581443
min	3.000000	0.000000	0.000000
25%	8.000000	9.000000	8.000000
50%	11.000000	11.000000	11.000000
75%	13.000000	13.000000	14.000000
max	19.000000	19.000000	20.000000

## defining bins and their labels
bins = [0,8,14,20]

bin_labels = ['1. low','2. med','3. high']

## Visualizing bin cutoffs
ax = sns.histplot(data=melted,hue='PrevYear', x='Grade',kde=True)
ax.set_title('Comparing Years 1-3 (Melted Vers)')
ax.axvspan(bins[0],bins[1],zorder=0,color='lightcoral',label='Low',alpha=0.6)
ax.axvspan(bins[1],bins[2],zorder=0,color='skyblue', label='Med',alpha=0.6)
ax.axvspan(bins[2],bins[3],zorder=0,color='limegreen', label='High',alpha=0.6)
ax.set_xlim(left=0,right=20)

(0.0, 20.0)

../../_images/Insights for Stakeholders_35_1.png

ADD A BRIEF EXPLANATION

pd.cut(df['G1'], bins,right=False,include_lowest=True)#,labels=bin_labels)

      [0, 8)
      [0, 8)
      [0, 8)
    [14, 20)
      [0, 8)
         ...   
   [8, 14)
  [14, 20)
   [8, 14)
   [8, 14)
   [8, 14)
Name: G1, Length: 395, dtype: category
Categories (3, interval[int64, left]): [[0, 8) < [8, 14) < [14, 20)]

bins

[0, 8, 14, 20]

df['First-Year Group'] = pd.cut(df['G1'],bins,labels=bin_labels, right=True,include_lowest=True)

df['Final-Year Group'] = pd.cut(df['G3'],bins,labels=bin_labels,right=True,include_lowest=True)
df

	student	school	sex	age	address	famsize	Pstatus	Medu	Fedu	Mjob	...	goout	Dalc	Walc	health	absences	G1	G2	G3	First-Year Group	Final-Year Group
0	0	GP	F	18.0	U	GT3	A	4.0	4.0	at_home	...	4.0	1.0	1.0	3.0	6.0	5.0	6.0	6.0	1. low	1. low
1	1	GP	F	17.0	U	GT3	T	1.0	1.0	at_home	...	3.0	1.0	1.0	3.0	4.0	5.0	5.0	6.0	1. low	1. low
2	2	GP	F	15.0	U	LE3	T	1.0	1.0	at_home	...	2.0	2.0	3.0	3.0	10.0	7.0	8.0	10.0	1. low	2. med
3	3	GP	F	15.0	U	GT3	T	4.0	2.0	health	...	2.0	1.0	1.0	5.0	2.0	15.0	14.0	15.0	3. high	3. high
4	4	GP	F	16.0	U	GT3	T	3.0	3.0	other	...	2.0	1.0	2.0	5.0	4.0	6.0	10.0	10.0	1. low	2. med
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
390	390	MS	M	20.0	U	LE3	A	2.0	2.0	services	...	4.0	4.0	5.0	4.0	11.0	9.0	9.0	9.0	2. med	2. med
391	391	MS	M	17.0	U	LE3	T	3.0	1.0	services	...	5.0	3.0	4.0	2.0	3.0	14.0	16.0	16.0	2. med	3. high
392	392	MS	M	21.0	R	GT3	T	1.0	1.0	other	...	3.0	3.0	3.0	3.0	3.0	10.0	8.0	7.0	2. med	1. low
393	393	MS	M	18.0	R	LE3	T	3.0	2.0	services	...	1.0	3.0	4.0	5.0	0.0	11.0	12.0	10.0	2. med	2. med
394	394	MS	M	19.0	U	LE3	T	1.0	1.0	other	...	3.0	3.0	3.0	5.0	5.0	8.0	9.0	9.0	1. low	2. med

395 rows × 36 columns

df[df['Final-Year Group'].isna()]

	student	school	sex	age	address	famsize	Pstatus	Medu	Fedu	Mjob	...	goout	Dalc	Walc	health	absences	G1	G2	G3	First-Year Group	Final-Year Group

0 rows × 36 columns

df['Final-Year Group'].value_counts(dropna=False)

med     220
low     102
high     73
Name: Final-Year Group, dtype: int64

melted = pd.melt(df,id_vars=['student','First-Year Group','Final-Year Group'], 
                 value_vars=['G1','G2','G3'],var_name='PrevYear',
                value_name="Grade")
melted

	student	First-Year Group	Final-Year Group	PrevYear	Grade
0	0	1. low	1. low	G1	5.0
1	1	1. low	1. low	G1	5.0
2	2	1. low	2. med	G1	7.0
3	3	3. high	3. high	G1	15.0
4	4	1. low	2. med	G1	6.0
...	...	...	...	...	...
1180	390	2. med	2. med	G3	9.0
1181	391	2. med	3. high	G3	16.0
1182	392	2. med	1. low	G3	7.0
1183	393	2. med	2. med	G3	10.0
1184	394	1. low	2. med	G3	9.0

1185 rows × 5 columns

bins_info = [('1. low',{ 'bins':(bins[0],bins[1]), 
                    'line_color':'darkred','span_color': 'lightcoral'}),
             ('2. med',{ 'bins':(bins[1],bins[2]),
                    'line_color':'blue','span_color':'skyblue'}),
              ('3. high',{ 'bins':(bins[2],bins[3]),
                      'line_color':'forestgreen','span_color':'limegreen'})
            ]
bins_info

[('1. low',
  {'bins': (0, 8), 'line_color': 'darkred', 'span_color': 'lightcoral'}),
 ('2. med', {'bins': (8, 14), 'line_color': 'blue', 'span_color': 'skyblue'}),
 ('3. high',
  {'bins': (14, 20), 'line_color': 'forestgreen', 'span_color': 'limegreen'})]

## slice out list of line colors
line_colors = [v[1]['line_color'] for v in bins_info]
line_colors

['darkred', 'blue', 'forestgreen']

## Plot visual using Final year Groups for colors
ax = pd.plotting.parallel_coordinates(df,'Final-Year Group',cols=['G1','G2','G3'],
                                 color= line_colors,sort_labels=False,alpha=0.5)
ax.set_title("Year 1-3 grades - Colored by Final Grade Groups");

## Annotate
for group, grp_dict in bins_info:
    ax.axhspan(grp_dict['bins'][0],grp_dict['bins'][1],
               color=grp_dict['span_color'], alpha=0.5, label=group)

ax.set_ylim(bottom=0,top=20)

(0.0, 20.0)

../../_images/Insights for Stakeholders_46_1.png

Looking at our parallel coordinates plot - grouped by Final Grades, we can see that there are students who were in the low group for G3 that had been in the med-group range. So students whose performance dropped from average to below average over the years.
We can also see that there were some students in the high group whose G1 scores were in the med range, showing that they improved their performance over years 1-3.
We also notice that there are a lot of zeroes that appear in G2 and G3, what we can see all that lines that were 0 for G2 stayed at 0 for G3.

Next Steps#

Now that we have an idea of the target we are analyzing, we are ready to move onto modeling!
We will start by exploring answering our stakeholder’s question(s) with regression modeling.
Let’s get started!

APPENDIX#

raise Exception("stop here")

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Input In [24], in <cell line: 1>()
----> 1 raise Exception("stop here")

Exception: stop here

## exaple from https://pandas.pydata.org/docs/reference/api/pandas.plotting.parallel_coordinates.html
## want a version for g1-g3
df2 = pd.read_csv(
    'https://raw.github.com/pandas-dev/'
    'pandas/main/pandas/tests/io/data/csv/iris.csv'
)
df2
pd.plotting.parallel_coordinates(df2, 'Name')

Stack 5: Data Visualization - Model Insights

Insights for Stakeholders

Contents

Insights for Stakeholders#

Lesson Objectives#

The Stakeholder#

Our Task#

Stakeholder Considerations#

EDA#

Exploratory Data Analysis#

Determining Our Target#

`Types<?>` of Model-Based Insights#

EDA#

Visualizing Grades#

Visualizing Progress By Grade Groups#

Next Steps#

APPENDIX#

Stack 5: Data Visualization - Model Insights

Insights for Stakeholders

Contents

Insights for Stakeholders#

Lesson Objectives#

The Stakeholder#

Our Task#

Stakeholder Considerations#

EDA#

Exploratory Data Analysis#

Determining Our Target#

Types<?> of Model-Based Insights#

EDA#

Visualizing Grades#

Visualizing Progress By Grade Groups#

Next Steps#

APPENDIX#

`Types<?>` of Model-Based Insights#