{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.7.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# Incremental/Online Training with Scikit-learn and Production with Skops\nIn this tutorial, we will train online learning models with Scikit-learn, look at differences between warm and cold start and use skops to improve reproducibility!","metadata":{}},{"cell_type":"code","source":"!pip install datasets\n!pip install skops","metadata":{"_kg_hide-input":true,"execution":{"iopub.status.busy":"2022-12-01T10:31:20.498705Z","iopub.execute_input":"2022-12-01T10:31:20.499374Z","iopub.status.idle":"2022-12-01T10:31:48.092799Z","shell.execute_reply.started":"2022-12-01T10:31:20.499234Z","shell.execute_reply":"2022-12-01T10:31:48.091058Z"},"trusted":true},"execution_count":1,"outputs":[{"name":"stdout","text":"Requirement already satisfied: datasets in /opt/conda/lib/python3.7/site-packages (2.1.0)\nRequirement already satisfied: tqdm>=4.62.1 in /opt/conda/lib/python3.7/site-packages (from datasets) (4.64.0)\nRequirement already satisfied: requests>=2.19.0 in /opt/conda/lib/python3.7/site-packages (from datasets) (2.27.1)\nRequirement already satisfied: xxhash in /opt/conda/lib/python3.7/site-packages (from datasets) (3.0.0)\nRequirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.7/site-packages (from datasets) (1.21.6)\nRequirement already satisfied: packaging in /opt/conda/lib/python3.7/site-packages (from datasets) (21.3)\nRequirement already satisfied: importlib-metadata in /opt/conda/lib/python3.7/site-packages (from datasets) (4.11.4)\nRequirement already satisfied: multiprocess in /opt/conda/lib/python3.7/site-packages (from datasets) (0.70.13)\nRequirement already satisfied: 
aiohttp in /opt/conda/lib/python3.7/site-packages (from datasets) (3.8.1)\nRequirement already satisfied: fsspec[http]>=2021.05.0 in /opt/conda/lib/python3.7/site-packages (from datasets) (2022.5.0)\nRequirement already satisfied: huggingface-hub<1.0.0,>=0.1.0 in /opt/conda/lib/python3.7/site-packages (from datasets) (0.7.0)\nRequirement already satisfied: pyarrow>=5.0.0 in /opt/conda/lib/python3.7/site-packages (from datasets) (8.0.0)\nRequirement already satisfied: responses<0.19 in /opt/conda/lib/python3.7/site-packages (from datasets) (0.18.0)\nRequirement already satisfied: dill in /opt/conda/lib/python3.7/site-packages (from datasets) (0.3.5.1)\nRequirement already satisfied: pandas in /opt/conda/lib/python3.7/site-packages (from datasets) (1.3.5)\nRequirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.7/site-packages (from huggingface-hub<1.0.0,>=0.1.0->datasets) (6.0)\nRequirement already satisfied: filelock in /opt/conda/lib/python3.7/site-packages (from huggingface-hub<1.0.0,>=0.1.0->datasets) (3.6.0)\nRequirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/lib/python3.7/site-packages (from huggingface-hub<1.0.0,>=0.1.0->datasets) (4.1.1)\nRequirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.7/site-packages (from packaging->datasets) (3.0.9)\nRequirement already satisfied: charset-normalizer~=2.0.0 in /opt/conda/lib/python3.7/site-packages (from requests>=2.19.0->datasets) (2.0.12)\nRequirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests>=2.19.0->datasets) (1.26.9)\nRequirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.19.0->datasets) (3.3)\nRequirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests>=2.19.0->datasets) (2022.6.15)\nRequirement already satisfied: asynctest==0.13.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets) 
(0.13.0)\nRequirement already satisfied: frozenlist>=1.1.1 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets) (1.3.0)\nRequirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets) (21.4.0)\nRequirement already satisfied: aiosignal>=1.1.2 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets) (1.2.0)\nRequirement already satisfied: multidict<7.0,>=4.5 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets) (6.0.2)\nRequirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets) (4.0.2)\nRequirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets) (1.7.2)\nRequirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->datasets) (3.8.0)\nRequirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.7/site-packages (from pandas->datasets) (2.8.2)\nRequirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.7/site-packages (from pandas->datasets) (2022.1)\nRequirement already satisfied: six>=1.5 in /opt/conda/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas->datasets) (1.16.0)\n\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. 
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n\u001b[0mCollecting skops\n Downloading skops-0.3.0-py3-none-any.whl (58 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.8/58.8 kB\u001b[0m \u001b[31m2.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hRequirement already satisfied: tabulate>=0.8.8 in /opt/conda/lib/python3.7/site-packages (from skops) (0.8.9)\nRequirement already satisfied: typing-extensions>=3.7 in /opt/conda/lib/python3.7/site-packages (from skops) (4.1.1)\nCollecting huggingface-hub>=0.10.1\n Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m182.4/182.4 kB\u001b[0m \u001b[31m7.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hRequirement already satisfied: scikit-learn>=0.24 in /opt/conda/lib/python3.7/site-packages (from skops) (1.0.2)\nRequirement already satisfied: requests in /opt/conda/lib/python3.7/site-packages (from huggingface-hub>=0.10.1->skops) (2.27.1)\nRequirement already satisfied: packaging>=20.9 in /opt/conda/lib/python3.7/site-packages (from huggingface-hub>=0.10.1->skops) (21.3)\nRequirement already satisfied: filelock in /opt/conda/lib/python3.7/site-packages (from huggingface-hub>=0.10.1->skops) (3.6.0)\nRequirement already satisfied: importlib-metadata in /opt/conda/lib/python3.7/site-packages (from huggingface-hub>=0.10.1->skops) (4.11.4)\nRequirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.7/site-packages (from huggingface-hub>=0.10.1->skops) (6.0)\nRequirement already satisfied: tqdm in /opt/conda/lib/python3.7/site-packages (from huggingface-hub>=0.10.1->skops) (4.64.0)\nRequirement already satisfied: joblib>=0.11 in /opt/conda/lib/python3.7/site-packages (from scikit-learn>=0.24->skops) (1.1.0)\nRequirement already satisfied: scipy>=1.1.0 in /opt/conda/lib/python3.7/site-packages (from 
scikit-learn>=0.24->skops) (1.7.3)\nRequirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from scikit-learn>=0.24->skops) (3.1.0)\nRequirement already satisfied: numpy>=1.14.6 in /opt/conda/lib/python3.7/site-packages (from scikit-learn>=0.24->skops) (1.21.6)\nRequirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.7/site-packages (from packaging>=20.9->huggingface-hub>=0.10.1->skops) (3.0.9)\nRequirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->huggingface-hub>=0.10.1->skops) (3.8.0)\nRequirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests->huggingface-hub>=0.10.1->skops) (1.26.9)\nRequirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests->huggingface-hub>=0.10.1->skops) (2022.6.15)\nRequirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests->huggingface-hub>=0.10.1->skops) (3.3)\nRequirement already satisfied: charset-normalizer~=2.0.0 in /opt/conda/lib/python3.7/site-packages (from requests->huggingface-hub>=0.10.1->skops) (2.0.12)\nInstalling collected packages: huggingface-hub, skops\n Attempting uninstall: huggingface-hub\n Found existing installation: huggingface-hub 0.7.0\n Uninstalling huggingface-hub-0.7.0:\n Successfully uninstalled huggingface-hub-0.7.0\n\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\ncached-path 1.1.3 requires huggingface-hub<0.8.0,>=0.0.12, but you have huggingface-hub 0.11.1 which is incompatible.\u001b[0m\u001b[31m\n\u001b[0mSuccessfully installed huggingface-hub-0.11.1 skops-0.3.0\n\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. 
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n\u001b[0m","output_type":"stream"}]},{"cell_type":"code","source":"import pandas as pd\nimport datasets\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.linear_model import LinearRegression\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.ensemble import HistGradientBoostingRegressor\nimport sklearn.neural_network\nimport numpy as np","metadata":{"execution":{"iopub.status.busy":"2022-12-01T10:31:48.096112Z","iopub.execute_input":"2022-12-01T10:31:48.096713Z","iopub.status.idle":"2022-12-01T10:31:49.955513Z","shell.execute_reply.started":"2022-12-01T10:31:48.096642Z","shell.execute_reply":"2022-12-01T10:31:49.954231Z"},"trusted":true},"execution_count":2,"outputs":[]},{"cell_type":"code","source":"# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python\n# For example, here's several helpful packages to load\n\nimport numpy as np # linear algebra\nimport pandas as pd # data processing, CSV file I/O (e.g. 
pd.read_csv)\n\n# Input data files are available in the read-only \"../input/\" directory\n# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n\nimport os\nfor dirname, _, filenames in os.walk('/kaggle/input'):\n for filename in filenames:\n print(os.path.join(dirname, filename))\n\n# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using \"Save & Run All\" \n# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session","metadata":{"execution":{"iopub.status.busy":"2022-12-01T10:31:49.957048Z","iopub.execute_input":"2022-12-01T10:31:49.957951Z","iopub.status.idle":"2022-12-01T10:31:49.965242Z","shell.execute_reply.started":"2022-12-01T10:31:49.957910Z","shell.execute_reply":"2022-12-01T10:31:49.964020Z"},"trusted":true},"execution_count":3,"outputs":[]},{"cell_type":"markdown","source":"We will use a dataset from datasets library.","metadata":{}},{"cell_type":"code","source":"dataset = datasets.load_dataset(\"scikit-learn/auto-mpg\")\ndf = pd.DataFrame(dataset[\"train\"])","metadata":{"execution":{"iopub.status.busy":"2022-12-01T10:31:49.968676Z","iopub.execute_input":"2022-12-01T10:31:49.969228Z","iopub.status.idle":"2022-12-01T10:31:51.003435Z","shell.execute_reply.started":"2022-12-01T10:31:49.969175Z","shell.execute_reply":"2022-12-01T10:31:51.002169Z"},"trusted":true},"execution_count":4,"outputs":[{"name":"stdout","text":"Downloading and preparing dataset csv/scikit-learn--auto-mpg to /root/.cache/huggingface/datasets/csv/scikit-learn--auto-mpg-a20aa45e3b31b7e3/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...\n","output_type":"stream"},{"output_type":"display_data","data":{"text/plain":"Downloading data files: 0%| | 0/1 [00:00\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n 
\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
    mpg  cylinders  displacement horsepower  weight  acceleration  model year  origin                   car name
0  18.0          8         307.0        130    3504          12.0          70       1  chevrolet chevelle malibu
1  15.0          8         350.0        165    3693          11.5          70       1          buick skylark 320
2  18.0          8         318.0        150    3436          11.0          70       1         plymouth satellite
3  16.0          8         304.0        150    3433          12.0          70       1              amc rebel sst
4  17.0          8         302.0        140    3449          10.5          70       1                ford torino
\n"},"metadata":{}}]},{"cell_type":"code","source":"df.drop(\"car name\", inplace = True, axis = 1)\ndf.drop(df[df.horsepower == \"?\"].index, inplace=True)","metadata":{"execution":{"iopub.status.busy":"2022-12-01T10:31:51.034855Z","iopub.execute_input":"2022-12-01T10:31:51.035356Z","iopub.status.idle":"2022-12-01T10:31:51.055931Z","shell.execute_reply.started":"2022-12-01T10:31:51.035307Z","shell.execute_reply":"2022-12-01T10:31:51.053551Z"},"trusted":true},"execution_count":6,"outputs":[]},{"cell_type":"code","source":"y = df[\"mpg\"]\nX = df.loc[:, df.columns != \"mpg\"]\nX_train, X_test, y_train, y_test = train_test_split(X, y)","metadata":{"execution":{"iopub.status.busy":"2022-12-01T10:31:51.057934Z","iopub.execute_input":"2022-12-01T10:31:51.058420Z","iopub.status.idle":"2022-12-01T10:31:51.069612Z","shell.execute_reply.started":"2022-12-01T10:31:51.058371Z","shell.execute_reply":"2022-12-01T10:31:51.067852Z"},"trusted":true},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":"## Model Training\n\nScikit-learn provides a couple of model training procedures. In this notebook, we'll go through partial_fit and neural networks with warm and cold starts. partial_fit lets you keep fitting the model on new data, so the fit adapts to recent data, which helps with data drift. However, in my opinion, given that it's super easy and fast to fit models in scikit-learn, it's always better to call fit on the model rather than partial_fit. You can also set warm_start to True if you want to keep the trainable parameters instead of resetting them. In the runs below, a repeated fit with warm_start set to True only adjusts the parameters slightly, as does partial_fit.
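The incremental idea described above can be sketched as follows. This is a minimal illustration, not the notebook's own method: it assumes synthetic data and uses `SGDRegressor` as a stand-in estimator, and the names `incremental_fit`, `X_demo` and `y_demo` are hypothetical. Each `partial_fit` call continues from the coefficients left by the previous call instead of starting over.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor


def incremental_fit(X, y, batch_size=50, seed=0):
    # Feed the data to the model one mini-batch at a time.
    # Each partial_fit call continues from the current coefficients.
    model = SGDRegressor(random_state=seed)
    history = []
    for start in range(0, len(X), batch_size):
        stop = start + batch_size
        model.partial_fit(X[start:stop], y[start:stop])
        # Keep a copy of the coefficients after every batch.
        history.append(model.coef_.copy())
    return model, history


# Synthetic regression data (an assumption made for this sketch only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=500)

model, history = incremental_fit(X_demo, y_demo)

# Largest coefficient error after the first and after the last batch.
first_error = np.abs(history[0] - true_coef).max()
last_error = np.abs(history[-1] - true_coef).max()
print(len(history), first_error, last_error)
```

Because nothing is reset between calls, the coefficients move toward the generating values as more batches arrive. Calling `fit` on this estimator instead would restart the optimisation on every call unless `warm_start=True` is set.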
\n\nSee more in the references section of this notebook.","metadata":{}},{"cell_type":"markdown","source":"## Warm Start vs Cold Start\nThe difference between warm and cold starts (that is, the warm_start parameter being True or False) is that a cold start resets the trainable parameters before the next call to fit, while a warm start keeps them.\n\nWe'll fit the same model twice with a cold start. Note below how the parameters change after the second fit (simply because everything is reset before each fit).","metadata":{}},{"cell_type":"code","source":"np.random.seed(0)\n\ncold_model = sklearn.neural_network.MLPRegressor(hidden_layer_sizes=(), warm_start=False, max_iter=10000)\ncold_model.fit(X_train,y_train)\nprint(cold_model.coefs_, cold_model.intercepts_)\ncold_model.fit(X_train,y_train)\nprint(cold_model.coefs_, cold_model.intercepts_)","metadata":{"execution":{"iopub.status.busy":"2022-12-01T10:31:51.071816Z","iopub.execute_input":"2022-12-01T10:31:51.072372Z","iopub.status.idle":"2022-12-01T10:31:54.027268Z","shell.execute_reply.started":"2022-12-01T10:31:51.072328Z","shell.execute_reply":"2022-12-01T10:31:54.025723Z"},"trusted":true},"execution_count":8,"outputs":[{"name":"stdout","text":"[array([[-0.04595779],\n [ 0.00919602],\n [ 0.00347475],\n [-0.00763128],\n [ 0.23989295],\n [ 0.5130012 ],\n [ 0.59278955]])] [array([0.89936677])]\n[array([[-0.75949183],\n [ 0.04861275],\n [-0.03188835],\n [-0.00909946],\n [ 0.39644745],\n [ 0.53561677],\n [ 1.26656764]])] [array([0.05829677])]\n","output_type":"stream"}]},{"cell_type":"markdown","source":"We call partial_fit on the cold-start model and observe only a slight change in the parameters, as nothing is reset.","metadata":{}},{"cell_type":"code","source":"cold_model.partial_fit(X_train,y_train)\nprint(cold_model.coefs_, 
cold_model.intercepts_)","metadata":{"execution":{"iopub.status.busy":"2022-12-01T10:31:54.029037Z","iopub.execute_input":"2022-12-01T10:31:54.029440Z","iopub.status.idle":"2022-12-01T10:31:54.040863Z","shell.execute_reply.started":"2022-12-01T10:31:54.029404Z","shell.execute_reply":"2022-12-01T10:31:54.039130Z"},"trusted":true},"execution_count":9,"outputs":[{"name":"stdout","text":"[array([[-0.7596341 ],\n [ 0.04846873],\n [-0.03190048],\n [-0.00918513],\n [ 0.3961202 ],\n [ 0.53545913],\n [ 1.26641392]])] [array([0.05808994])]\n","output_type":"stream"}]},{"cell_type":"markdown","source":"We can observe the slight change in parameters with warm start.","metadata":{}},{"cell_type":"code","source":"warm_model = sklearn.neural_network.MLPRegressor(hidden_layer_sizes=(), warm_start=True, max_iter=10000)\nwarm_model.fit(X_train,y_train)\nprint(warm_model.coefs_, warm_model.intercepts_)\nwarm_model.fit(X_train,y_train)\nprint(warm_model.coefs_, warm_model.intercepts_)","metadata":{"execution":{"iopub.status.busy":"2022-12-01T10:31:54.044591Z","iopub.execute_input":"2022-12-01T10:31:54.045077Z","iopub.status.idle":"2022-12-01T10:31:57.283027Z","shell.execute_reply.started":"2022-12-01T10:31:54.045040Z","shell.execute_reply":"2022-12-01T10:31:57.281906Z"},"trusted":true},"execution_count":10,"outputs":[{"name":"stdout","text":"[array([[ 0.23472605],\n [ 0.02534595],\n [ 0.04413377],\n [-0.01048141],\n [ 0.99828938],\n [ 0.32689782],\n [ 1.40256776]])] [array([1.27939597])]\n[array([[ 0.23500473],\n [ 0.02563303],\n [ 0.04441742],\n [-0.01019369],\n [ 0.9965225 ],\n [ 0.32718517],\n [ 1.40078723]])] [array([1.27966795])]\n","output_type":"stream"}]},{"cell_type":"code","source":"warm_model.partial_fit(X_train,y_train)\nprint(warm_model.coefs_, 
warm_model.intercepts_)","metadata":{"execution":{"iopub.status.busy":"2022-12-01T10:31:57.284349Z","iopub.execute_input":"2022-12-01T10:31:57.285228Z","iopub.status.idle":"2022-12-01T10:31:57.294349Z","shell.execute_reply.started":"2022-12-01T10:31:57.285193Z","shell.execute_reply":"2022-12-01T10:31:57.293179Z"},"trusted":true},"execution_count":11,"outputs":[{"name":"stdout","text":"[array([[ 0.23398246],\n [ 0.02463401],\n [ 0.04342073],\n [-0.01119203],\n [ 0.9953371 ],\n [ 0.32620675],\n [ 1.39966974]])] [array([1.27862898])]\n","output_type":"stream"}]},{"cell_type":"markdown","source":"## partial_fit with Passive Aggressive Regressor","metadata":{}},{"cell_type":"markdown","source":"As the scikit-learn docs put it:\n> The passive-aggressive algorithms are a family of algorithms for large-scale learning. They are similar to the Perceptron in that they do not require a learning rate. However, contrary to the Perceptron, they include a regularization parameter C.\n\nThe passive side of the algorithm keeps the model unchanged if the prediction is correct, and the aggressive side changes the model if the prediction is incorrect.","metadata":{}},{"cell_type":"markdown","source":"# Split the data again","metadata":{}},{"cell_type":"code","source":"y = df[\"mpg\"]\nX = df.loc[:, df.columns != \"mpg\"]\nX_train, X_test, y_train, y_test = train_test_split(X, y)\n","metadata":{"execution":{"iopub.status.busy":"2022-12-01T10:31:57.296057Z","iopub.execute_input":"2022-12-01T10:31:57.296380Z","iopub.status.idle":"2022-12-01T10:31:57.308137Z","shell.execute_reply.started":"2022-12-01T10:31:57.296351Z","shell.execute_reply":"2022-12-01T10:31:57.306985Z"},"trusted":true},"execution_count":12,"outputs":[]},{"cell_type":"markdown","source":"Fit PassiveAggressiveRegressor","metadata":{}},{"cell_type":"code","source":"import sklearn.metrics\nfrom sklearn.linear_model import PassiveAggressiveRegressor\nmse_list = []\nreg = PassiveAggressiveRegressor()\nreg.partial_fit(X_train, y_train)\nfirst_pred = 
reg.predict(X_test)\nprint(f\"MSE is: {sklearn.metrics.mean_squared_error(first_pred, y_test)}\")","metadata":{"execution":{"iopub.status.busy":"2022-12-01T10:31:57.310230Z","iopub.execute_input":"2022-12-01T10:31:57.311133Z","iopub.status.idle":"2022-12-01T10:31:57.328333Z","shell.execute_reply.started":"2022-12-01T10:31:57.311094Z","shell.execute_reply":"2022-12-01T10:31:57.326994Z"},"trusted":true},"execution_count":13,"outputs":[{"name":"stdout","text":"MSE is: 242.58051951172263\n","output_type":"stream"}]},{"cell_type":"code","source":"reg = PassiveAggressiveRegressor()\ndata = []\nc_cold, c_warm = 0.1, 0.01\nfor i, x in X_train.iterrows():\n reg.partial_fit([x.values], [y_train[i]])\n data.append({\n 'c0': reg.intercept_[0],\n 'c1': reg.coef_.flatten()[0],\n 'c2': reg.coef_.flatten()[1],\n 'mse_test': np.mean((reg.predict(X_test.values) - y_test)**2),\n 'i': i\n })\n if i == 20:\n reg.C = c_warm\ndf_stats = pd.DataFrame(data)","metadata":{"execution":{"iopub.status.busy":"2022-12-01T10:31:57.330269Z","iopub.execute_input":"2022-12-01T10:31:57.330752Z","iopub.status.idle":"2022-12-01T10:31:57.687552Z","shell.execute_reply.started":"2022-12-01T10:31:57.330711Z","shell.execute_reply":"2022-12-01T10:31:57.686475Z"},"trusted":true},"execution_count":14,"outputs":[]},{"cell_type":"code","source":"df_stats","metadata":{"execution":{"iopub.status.busy":"2022-12-01T10:31:57.689024Z","iopub.execute_input":"2022-12-01T10:31:57.690031Z","iopub.status.idle":"2022-12-01T10:31:57.708597Z","shell.execute_reply.started":"2022-12-01T10:31:57.689978Z","shell.execute_reply":"2022-12-01T10:31:57.707049Z"},"trusted":true},"execution_count":15,"outputs":[{"execution_count":15,"output_type":"execute_result","data":{"text/plain":" c0 c1 c2 mse_test i\n0 0.000001 0.000012 0.000443 224.463539 0\n1 0.000003 0.000018 0.000598 240.792368 108\n2 0.000005 0.000024 0.000753 531.940399 352\n3 0.000005 0.000026 0.000788 627.109381 167\n4 0.000003 0.000017 0.000501 226.493117 60\n.. ... ... 
... ... ...\n289 0.000132 -0.000018 -0.020651 261.402379 214\n290 0.000135 -0.000005 -0.020280 333.857844 183\n291 0.000135 -0.000007 -0.020347 255.845520 340\n292 0.000134 -0.000009 -0.020401 209.395166 234\n293 0.000138 0.000007 -0.020044 1102.661045 349\n\n[294 rows x 5 columns]","text/html":"
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
           c0        c1        c2     mse_test    i
0    0.000001  0.000012  0.000443   224.463539    0
1    0.000003  0.000018  0.000598   240.792368  108
2    0.000005  0.000024  0.000753   531.940399  352
3    0.000005  0.000026  0.000788   627.109381  167
4    0.000003  0.000017  0.000501   226.493117   60
..        ...       ...       ...          ...  ...
289  0.000132 -0.000018 -0.020651   261.402379  214
290  0.000135 -0.000005 -0.020280   333.857844  183
291  0.000135 -0.000007 -0.020347   255.845520  340
292  0.000134 -0.000009 -0.020401   209.395166  234
293  0.000138  0.000007 -0.020044  1102.661045  349
\n

294 rows × 5 columns

\n
"},"metadata":{}}]},{"cell_type":"markdown","source":"## Hosting our model\nWe will use the skops library, which lets us push our models to the Hugging Face Hub! It saves the model, generates a config file containing the settings needed to reproduce the experiment, and more! (Model cards including feature importances, model hyperparameters, metrics and plots are coming very soon 🤩)\n\n**I ran this notebook on my local machine because skops follows the PyData stack (scikit-learn & friends), which supports Python 3.8, so I uploaded the model [here](https://huggingface.co/merve/passive-agressive-regressor) from my local machine 🙂**\n\nThe Hugging Face Hub needs a token for authentication, so I will pass the token through getpass.","metadata":{}},{"cell_type":"code","source":"from getpass import getpass\ntoken = getpass(f\"Pass the Hub token:\")","metadata":{"execution":{"iopub.status.busy":"2022-12-01T10:33:45.467673Z","iopub.execute_input":"2022-12-01T10:33:45.468255Z","iopub.status.idle":"2022-12-01T10:33:48.052455Z","shell.execute_reply.started":"2022-12-01T10:33:45.468211Z","shell.execute_reply":"2022-12-01T10:33:48.051063Z"},"trusted":true},"execution_count":16,"outputs":[{"output_type":"stream","name":"stdin","text":"Pass the Hub token: ·····································\n"}]},{"cell_type":"markdown","source":"We will now save this model, use skops to initialize a Hugging Face Hub repository, let it create a configuration file to reproduce our experiment, serialize the model to the repository and push it to the Hub!","metadata":{}},{"cell_type":"code","source":"from skops import hub_utils\nfrom tempfile import mkstemp, mkdtemp\nimport os\nimport pickle\n\n# save the model to a temporary file\n_, pkl_name = mkstemp(prefix=\"skops\")\nwith open(pkl_name, mode=\"bw\") as f:\n pickle.dump(reg, file=f)\n\n# initialize the repository \nlocal_repo = mkdtemp(prefix=\"skops\")\nhub_utils.init(model=pkl_name, \n task=\"tabular-regression\",\n requirements=[\"scikit-learn\"], \n dst=local_repo,\n data=X_train)\n\n# see 
what's inside the repository\nprint(os.listdir(local_repo))","metadata":{"execution":{"iopub.status.busy":"2022-12-01T10:34:57.198457Z","iopub.execute_input":"2022-12-01T10:34:57.199065Z","iopub.status.idle":"2022-12-01T10:34:57.211532Z","shell.execute_reply.started":"2022-12-01T10:34:57.199021Z","shell.execute_reply":"2022-12-01T10:34:57.210095Z"},"trusted":true},"execution_count":19,"outputs":[{"name":"stdout","text":"['config.json', 'skops47mqlzp0']\n","output_type":"stream"}]},{"cell_type":"markdown","source":"The repository currently contains the model file and a `config.json` that holds the task type, example data, requirements and more. These are useful for enabling the Inference API on the 🤗Hub. \nWe will now add a model card.","metadata":{}},{"cell_type":"code","source":"from skops import card\nfrom pathlib import Path\n# I will pass the model and metadata from config file\nmodel_card = card.Card(reg, metadata=card.metadata_from_config(Path(local_repo)))","metadata":{"execution":{"iopub.status.busy":"2022-12-01T10:41:23.624801Z","iopub.execute_input":"2022-12-01T10:41:23.625317Z","iopub.status.idle":"2022-12-01T10:41:23.631649Z","shell.execute_reply.started":"2022-12-01T10:41:23.625272Z","shell.execute_reply":"2022-12-01T10:41:23.630653Z"},"trusted":true},"execution_count":24,"outputs":[]},{"cell_type":"markdown","source":"Let's add some information to our card.","metadata":{}},{"cell_type":"code","source":"link = \"https://www.kaggle.com/code/unofficialmerve/incremental-online-training-with-scikit-learn/\"\ndescription = f\"This is a passive-agressive regression model used for continuous training. Find the notebook [here]({link})\"\nlimitations = \"This model is not ready to be used in production. 
It's trained to predict MPG a car spends based on it's attributes.\"\nmodel_card.add(model_description=description,\n limitations=limitations,\n )","metadata":{"execution":{"iopub.status.busy":"2022-12-01T10:48:00.772527Z","iopub.execute_input":"2022-12-01T10:48:00.773093Z","iopub.status.idle":"2022-12-01T10:48:00.785659Z","shell.execute_reply.started":"2022-12-01T10:48:00.773054Z","shell.execute_reply":"2022-12-01T10:48:00.783967Z"},"trusted":true},"execution_count":25,"outputs":[{"execution_count":25,"output_type":"execute_result","data":{"text/plain":"Card(\n model=PassiveAggressiveRegressor(C=0.01),\n metadata.library_name=sklearn,\n metadata.tags=['sklearn', 'skops', 'tabular-regression'],\n metadata.model_file=skops47mqlzp0,\n metadata.widget={...},\n model_description='This is a passiv...online-training-with-scikit-learn/)',\n limitations=\"This model is not read...r spends based on it\\'s attributes.\",\n)"},"metadata":{}}]},{"cell_type":"markdown","source":"We can now save the model card to the local repository.","metadata":{}},{"cell_type":"code","source":"model_card.save(Path(local_repo) / \"README.md\")","metadata":{"execution":{"iopub.status.busy":"2022-12-01T10:52:18.076151Z","iopub.execute_input":"2022-12-01T10:52:18.076576Z","iopub.status.idle":"2022-12-01T10:52:18.102635Z","shell.execute_reply.started":"2022-12-01T10:52:18.076544Z","shell.execute_reply":"2022-12-01T10:52:18.101491Z"},"trusted":true},"execution_count":29,"outputs":[]},{"cell_type":"markdown","source":"*Let's push the model to 🤗Hub!*","metadata":{}},{"cell_type":"code","source":"# Push to Hub!\nfrom skops import hub_utils\nhub_utils.push(\n repo_id=\"scikit-learn/passive-agressive-regressor\",\n source=local_repo,\n token=token,\n create_remote=True\n)\n# create_remote creates a remote repository if it doesn't 
exist","metadata":{"execution":{"iopub.status.busy":"2022-12-01T10:52:54.235466Z","iopub.execute_input":"2022-12-01T10:52:54.235916Z","iopub.status.idle":"2022-12-01T10:52:55.748498Z","shell.execute_reply.started":"2022-12-01T10:52:54.235882Z","shell.execute_reply":"2022-12-01T10:52:55.747009Z"},"trusted":true},"execution_count":30,"outputs":[]},{"cell_type":"markdown","source":"You can find the model [here](https://huggingface.co/scikit-learn/passive-agressive-regressor). We will soon introduce automatic creation of model cards. 🙂","metadata":{}},{"cell_type":"markdown","source":"# References\n- [This stackoverflow discussion](https://stackoverflow.com/questions/38052342/what-is-the-difference-between-partial-fit-and-warm-start) is full of helpful answers.\n- Sklearn docs on [PassiveAggressiveRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PassiveAggressiveRegressor.html)\n- I got a lot of help from [calmcode's chapter on partial_fit](https://calmcode.io/partial_fit/introduction.html). 🙂\n- See [skops](https://github.com/skops-dev/skops/) here and watch for new features we're adding!","metadata":{}}]}