{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Student Performance Indicator\n", "- Understanding the problem statement\n", "- Data Collection\n", "- Data Checks\n", "- Exploratory Data Analysis\n", "- Data Pre-processing\n", "- Model Training\n", "- Choose the best model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1) Problem Statement\n", "- This project examines how a student's performance (test score) is affected by other variables such as gender, ethnicity, parental education, and test preparation course.\n", "\n", "### 2) Data Collection\n", "- Data Source - https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977\n", "- The data consists of 8 columns and 1000 rows.\n", "\n", "#### 2.1 Import Data and Required Packages" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import CSV data as Pandas DataFrame" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('data/stud.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Show top and bottom records" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " gender race_ethnicity parental_level_of_education lunch \\\n", "0 female group B bachelor's degree standard \n", "1 female group C some college standard \n", "2 female group B master's degree standard \n", "3 male group A associate's degree free/reduced \n", "4 male group C some college standard \n", "\n", " test_preparation_course math_score reading_score writing_score \n", "0 none 72 72 74 \n", "1 completed 69 90 88 \n", "2 none 90 95 93 \n", "3 none 47 57 44 \n", "4 none 76 78 75 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(5)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " gender race_ethnicity parental_level_of_education lunch \\\n", "995 female group E master's degree standard \n", "996 male group C high school free/reduced \n", "997 female group C high school free/reduced \n", "998 female group D some college standard \n", "999 female group D some college free/reduced \n", "\n", " test_preparation_course math_score reading_score writing_score \n", "995 completed 88 99 95 \n", "996 none 62 55 55 \n", "997 completed 59 71 65 \n", "998 completed 68 78 77 \n", "999 none 77 86 86 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2 Data Information\n", "\n", "\n", "- gender: sex of the student\n", "- race_ethnicity: ethnicity group of the student\n", "- parental_level_of_education: parents' highest level of education\n", "- lunch: type of lunch the student receives (standard or free/reduced)\n", "- test_preparation_course: whether the course was completed before the test (completed or none)\n", "- math_score, reading_score, writing_score: test scores in each subject" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Shape of the dataset" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1000, 8)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3) Data Checks\n", "- Check missing values\n", "- Check duplicates\n", "- Check data types\n", "- Check the number of unique values in each column\n", "- Check statistics of the dataset\n", "- Check the categories present in each categorical column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.1 Missing Values" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "gender 0.0\n", "race_ethnicity 0.0\n", "parental_level_of_education 0.0\n", "lunch 0.0\n", "test_preparation_course 0.0\n", "math_score 
0.0\n", "reading_score 0.0\n", "writing_score 0.0\n", "dtype: float64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isnull().mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Observation\n", "- There are no missing values in any column." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2 Check Duplicates" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.duplicated().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Observation\n", "- There are no duplicate rows." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3 Check Data Types" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 1000 entries, 0 to 999\n", "Data columns (total 8 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 gender 1000 non-null object\n", " 1 race_ethnicity 1000 non-null object\n", " 2 parental_level_of_education 1000 non-null object\n", " 3 lunch 1000 non-null object\n", " 4 test_preparation_course 1000 non-null object\n", " 5 math_score 1000 non-null int64 \n", " 6 reading_score 1000 non-null int64 \n", " 7 writing_score 1000 non-null int64 \n", "dtypes: int64(3), object(5)\n", "memory usage: 62.6+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Observation\n", "- 5 categorical (object) columns\n", "- 3 numerical (int64) columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.4 Number of Unique Values" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "gender 2\n", "race_ethnicity 5\n", "parental_level_of_education 6\n", "lunch 2\n", "test_preparation_course 2\n", "math_score 81\n", 
"reading_score 72\n", "writing_score 77\n", "dtype: int64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Observation\n", "- gender, lunch and test_preparation_course are binary columns, i.e. each has 2 unique values.\n", "- race_ethnicity has 5 unique values and parental_level_of_education has 6 unique values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.5 Check Statistics of Dataset" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " count mean std min 25% 50% 75% max\n", "math_score 1000.0 66.089 15.163080 0.0 57.00 66.0 77.0 100.0\n", "reading_score 1000.0 69.169 14.600192 17.0 59.00 70.0 79.0 100.0\n", "writing_score 1000.0 68.054 15.195657 10.0 57.75 69.0 79.0 100.0" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Observation\n", "- math_score has a minimum of 0, while reading_score and writing_score have minimums of 17 and 10.\n", "- The three score columns have similar distributions: means between 66.1 and 69.2 and standard deviations between 14.6 and 15.2." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "categorical_cols: ['gender', 'race_ethnicity', 'parental_level_of_education', 'lunch', 'test_preparation_course']\n", "length of categorical columns: 5\n", "\n", "numerical_cols: ['math_score', 'reading_score', 'writing_score']\n", "length of numerical columns: 3\n" ] } ], "source": [ "# Define categorical (object dtype) and numerical columns\n", "\n", "categorical_cols = [col for col in df.columns if df[col].dtype == 'O']\n", "numerical_cols = [col for col in df.columns if col not in categorical_cols]\n", "\n", "print(f\"categorical_cols: {categorical_cols}\")\n", "print(f\"length of categorical columns: {len(categorical_cols)}\\n\")\n", "\n", "print(f\"numerical_cols: {numerical_cols}\")\n", "print(f\"length of numerical columns: {len(numerical_cols)}\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.6 Categories of Categorical Columns\n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Categories of gender ['female' 'male']: \n", "\n", "Categories of race_ethnicity ['group B' 'group C' 'group A' 'group D' 'group E']: \n", "\n", "Categories of parental_level_of_education [\"bachelor's degree\" 'some college' \"master's degree\" \"associate's degree\"\n", " 'high school' 'some high 
school']: \n", "\n", "Categories of lunch ['standard' 'free/reduced']: \n", "\n", "Categories of test_preparation_course ['none' 'completed']: \n", "\n" ] } ], "source": [ "for col in categorical_cols:\n", " print(\"Categories of {} {}: \\n\".format(col,df[col].unique())) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4) Exploratory Data Analysis" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " gender race_ethnicity parental_level_of_education lunch \\\n", "0 female group B bachelor's degree standard \n", "1 female group C some college standard \n", "2 female group B master's degree standard \n", "3 male group A associate's degree free/reduced \n", "4 male group C some college standard \n", "\n", " test_preparation_course math_score reading_score writing_score \n", "0 none 72 72 74 \n", "1 completed 69 90 88 \n", "2 none 90 95 93 \n", "3 none 47 57 44 \n", "4 none 76 78 75 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of students with full marks in Maths: 7\n", "Number of students with full marks in Writing: 14\n", "Number of students with full marks in Reading: 17\n" ] } ], "source": [ "reading_full = df[df['reading_score'] == 100].shape[0]\n", "writing_full = df[df['writing_score'] == 100].shape[0]\n", "math_full = df[df['math_score'] == 100].shape[0]\n", "\n", "\n", "print(f\"Number of students with full marks in Maths: {math_full}\")\n", "print(f\"Number of students with full marks in Writing: {writing_full}\")\n", "print(f\"Number of students with full marks in Reading: {reading_full}\")" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of students with marks of 30 or less in Maths: 16\n", "Number of students with marks of 30 or less in Writing: 10\n", "Number of students with marks of 30 or less in Reading: 8\n" ] } ], "source": [ "reading_less_30 = df[df['reading_score'] <= 30].shape[0]\n", "writing_less_30 = df[df['writing_score'] <= 30].shape[0]\n", "math_less_30 = df[df['math_score'] <= 30].shape[0]\n", "\n", "\n", "print(f\"Number of students with marks of 30 or less in Maths: {math_less_30}\")\n", "print(f\"Number of students with marks of 30 or less in Writing: {writing_less_30}\")\n", "print(f\"Number of students with marks of 30 or less in Reading: {reading_less_30}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Observation\n", "- Students performed best in reading: 17 scored full marks and only 8 scored 30 or less.\n", "- Students performed worst in maths: only 7 scored full marks and 16 scored 30 or less.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.9" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }