import streamlit as st
from PIL import Image
import os

st.title('Machine Learning Operations Pipeline')
st.markdown(""" | |
# Machine Learning Operations (MLOps) Pipeline Documentation | |
This is the documentation covering each of the steps included in Bioma AI's time-series-forecasting MLOps Pipeline. | |
## Sequential MLOps Steps | |
The information flow of the pipeline will closely resemble that of a regression machine learning task. The model development will consist of sequential steps: | |
1. Ingestion,
2. Transformation,
3. Training,
4. Evaluation, and
5. Registration.
""")
img = Image.open(os.path.join('experimentation_mlops',
                              'mlops',
                              'pics',
                              'pipeline.png'))
st.image(img, caption="MLOps Pipeline for Bioma AI")
st.markdown(""" | |
## 1. Ingestion | |
Our pipeline involves extracting raw datasets from the internet (S3 Buckets and other cloud services), the assumed dataset is of one of the following file types: csv, json, parquet or xlsx. The extracted data is saved as an artifact which can help in documentation purposes. | |
In the case of time series forecasting, the data ingestion process is tasked on receiving data from a specific format and converting it to a Pandas Dataframe for further processing. The data will be downloaded from the web by issuing a request, the data will then be converted into parquet before being written as a Pandas dataframe. The parquet file will be saved as an artifact for the purpose of documentation. | |
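
A minimal sketch of this flow, assuming a CSV source; the URL, artifact path, and exact download logic are placeholders and may differ in the actual pipeline:

```python
from io import BytesIO

import pandas as pd
import requests

def ingest(url: str, artifact_path: str = "ingested.parquet") -> pd.DataFrame:
    # Download the raw dataset from the web.
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse the raw bytes into a DataFrame (CSV assumed here).
    df = pd.read_csv(BytesIO(response.content))

    # Persist a Parquet copy as the documentation artifact.
    df.to_parquet(artifact_path)
    return df
```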
## 2. Transformation

The data is split chronologically, according to its timeframe, into train, validation, and test sets. The user can customize each set's proportion.

Various statistical transformations can be applied to a selection of columns; both the columns and the methods are customizable. The methods considered include the following (a sketch of this step appears after the list):
1. Logarithmic
2. Natural Logarithmic
3. Standardization
4. Identity
5. Logarithmic Difference
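
A minimal sketch of the split and the transforms above, assuming a chronologically ordered DataFrame; the proportion defaults and the column-to-method mapping are illustrative:

```python
import numpy as np
import pandas as pd

def split_by_time(df: pd.DataFrame, train: float = 0.7, val: float = 0.15):
    # Chronological split: earliest rows for training, latest for testing.
    n = len(df)
    i, j = int(n * train), int(n * (train + val))
    return df.iloc[:i], df.iloc[i:j], df.iloc[j:]

TRANSFORMS = {
    "log10": np.log10,                                  # logarithmic
    "ln": np.log,                                       # natural logarithmic
    "standardize": lambda s: (s - s.mean()) / s.std(),
    "identity": lambda s: s,
    "log_diff": lambda s: np.log(s).diff(),             # logarithmic difference
}

def transform(df: pd.DataFrame, spec: dict) -> pd.DataFrame:
    # `spec` maps column name -> method name; both are user-customizable.
    out = df.copy()
    for col, method in spec.items():
        out[col] = TRANSFORMS[method](out[col])
    return out
```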
## 3. Training

The training process can be broken down into two types according to the number of variates being predicted: univariate or multivariate.

Predictors are either:

1. Endogenous features (changes in the target's value affect the predictor's value, and vice versa), or
2. Exogenous features (changes in the predictor's value affect the target's value, but not the other way around).
<ol type="a"> | |
<li>Static Exogenous</li> | |
Static variables such as one-hot encoding for a categorical class identifier. | |
<li>Historical Exogenous</li> | |
Exogenous features that their historical data is only known of. | |
<li>Future Exogenous</li> | |
Exogenous features that their data is known of when making the prediction on that time in the future. | |
</ol> | |
Endogenous features are predicted jointly with the target feature. Exogenous features are not predicted; they are only used to predict the target variable.

In short: multivariate forecasting treats predictors as endogenous features, while multivariable forecasting treats predictors as exogenous features, owing to its univariate nature.
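
The forecasting library used by the pipeline is not fixed here; as an illustration, the sketch below fits a plain autoregressive linear model in which lagged copies of the target act as the endogenous terms and exogenous columns enter as predictors without ever being forecast themselves. `train_df` and the column names are placeholders:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def make_design(df: pd.DataFrame, target: str, exog: list, n_lags: int = 3):
    # Lagged copies of the target are the autoregressive (endogenous) terms;
    # exogenous columns are used as-is and are never predicted.
    X = pd.DataFrame({f"{target}_lag{k}": df[target].shift(k)
                      for k in range(1, n_lags + 1)})
    X[exog] = df[exog]
    y = df[target]
    mask = X.notna().all(axis=1)
    return X[mask], y[mask]

# Univariate target with exogenous predictors (the multivariable setting).
X_train, y_train = make_design(train_df, target="demand",
                               exog=["is_holiday", "price"])
model = LinearRegression().fit(X_train, y_train)
```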
## 4. Evaluation

In the evaluation step, the trained models predict on out-of-training data. Ideally, this step produces outputs such as visualizations and error metrics for arbitrary datasets.
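
A minimal sketch of such an evaluation, assuming the model and design matrix from the training sketch; the metric selection and artifact filename are illustrative:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def evaluate(model, X_test, y_test) -> dict:
    y_pred = model.predict(X_test)

    # Visualization artifact: predicted vs. actual over the test window.
    fig, ax = plt.subplots()
    ax.plot(np.asarray(y_test), label="actual")
    ax.plot(y_pred, label="predicted")
    ax.legend()
    fig.savefig("evaluation.png")

    # Standard regression error metrics.
    return {
        "mae": mean_absolute_error(y_test, y_pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, y_pred))),
    }
```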
## 5. Registration

Registration saves the best-performing model (e.g., the one with the lowest error on the evaluation set), making it easy to retrieve for inference later on.
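
One way to implement this is with MLflow's model registry, assuming `model` and `metrics` come from the previous steps; the registered model name is a placeholder:

```python
import mlflow
import mlflow.sklearn

# Log the winning model and register it under a single name so that
# inference code can always pull the latest version.
with mlflow.start_run():
    mlflow.log_metrics(metrics)  # e.g., the dict returned by evaluate()
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="bioma-forecaster",  # placeholder name
    )

# Later: retrieve the registered model for inference.
loaded = mlflow.sklearn.load_model("models:/bioma-forecaster/latest")
```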
""") |