import streamlit as st
from PIL import Image
import os

st.title('Machine Learning Operations Pipeline')

st.markdown("""
# Machine Learning Operations (MLOps) Pipeline Documentation

This is the documentation covering each of the steps in Bioma AI's time-series-forecasting MLOps pipeline.

## Sequential MLOps Steps

The information flow of the pipeline closely resembles that of a regression machine learning task. Model development consists of five sequential steps:

1. Ingestion
2. Transformation
3. Training
4. Evaluation
5. Registration
""")

img = Image.open(os.path.join('experimentation_mlops', 'mlops', 'pics', 'pipeline.png'))
st.image(img, caption="MLOps Pipeline for Bioma AI")

st.markdown("""
## 1. Ingestion

The pipeline extracts raw datasets from the internet (S3 buckets and other cloud services). The dataset is assumed to be one of the following file types: csv, json, parquet, or xlsx. The extracted data is saved as an artifact for documentation purposes.

In the case of time series forecasting, the ingestion step receives data in one of these formats and converts it to a Pandas DataFrame for further processing. The data is downloaded from the web by issuing a request, then converted to parquet before being loaded into a Pandas DataFrame. The parquet file is saved as an artifact for documentation purposes.

## 2. Transformation

According to the timeframe of the time-series data, the data is split into training, validation, and test sets. The user can customize the proportion of each set.

Various statistical transformations are applied to a selection of columns; both the columns and the methods are customizable. The methods considered include:

1. Logarithmic
2. Natural Logarithmic
3. Standardization
4. Identity
5. Logarithmic Difference

## 3. Training

The training process falls into one of two types according to the number of variates being predicted: univariate or multivariate.

Predictors are either:

1. Endogenous features (changes in the target's value affect the predictor's value, or the other way around), or
2. Exogenous features (changes in the predictor's value affect the target's value, but not the other way around)
   1. Static Exogenous: static variables, such as a one-hot encoding of a categorical class identifier.
   2. Historical Exogenous: exogenous features for which only historical data is known.
   3. Future Exogenous: exogenous features whose values are already known for the future time being predicted.
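
The three exogenous categories above differ mainly in which values are available at prediction time. The following is a minimal sketch of that idea, not the pipeline's actual implementation: `ExogenousSpec`, `inputs_at_forecast_time`, and all column names are hypothetical, the data is made up, and plain Python records stand in for a Pandas DataFrame to keep the example dependency-free.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ExogenousSpec:
    # Hypothetical grouping of predictor columns into the three categories.
    static: list = field(default_factory=list)      # constant for the whole series
    historical: list = field(default_factory=list)  # known only up to the forecast origin
    future: list = field(default_factory=list)      # also known over the forecast horizon


def inputs_at_forecast_time(rows, spec, target, origin):
    # Split rows at `origin` and keep only the columns known on each side.
    history_cols = ['date', target] + spec.static + spec.historical + spec.future
    # After the origin, the target and historical exogenous values are unknown.
    horizon_cols = ['date'] + spec.static + spec.future
    history = [{c: r[c] for c in history_cols} for r in rows if r['date'] <= origin]
    horizon = [{c: r[c] for c in horizon_cols} for r in rows if r['date'] > origin]
    return history, horizon


# Made-up example data: one row per day, with unknown future values set to None.
rows = [
    {'date': date(2024, 1, d), 'sales': s, 'is_flagship': 1, 'web_visits': v, 'is_holiday': h}
    for d, s, v, h in [(1, 10.0, 100, 0), (2, 12.0, 120, 0), (3, 11.0, 90, 0),
                       (4, 13.0, 130, 0), (5, None, None, 1), (6, None, None, 0)]
]
spec = ExogenousSpec(static=['is_flagship'], historical=['web_visits'], future=['is_holiday'])
history, horizon = inputs_at_forecast_time(rows, spec, target='sales', origin=date(2024, 1, 4))

print(len(history), sorted(history[0]))  # rows and columns available up to the origin
print(len(horizon), sorted(horizon[0]))  # rows and columns available for the horizon
```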

Endogenous features are predicted jointly with the target. Exogenous features are not predicted; they are only used to predict the target variable.

In short: multivariate predictions use predictors as endogenous features, while multivariable predictions use predictors as exogenous features, because only the target is predicted and the task remains univariate.

## 4. Evaluation

In the evaluation step, the trained models make predictions on out-of-training data. Ideally, this step produces outputs such as visualizations and error metrics for arbitrary datasets.

## 5. Registration

Registration saves the model with the highest accuracy, making it easy to retrieve for inference later on.

References:

- [1] [mlflow/recipes-regression-template](https://github.com/mlflow/recipes-regression-template/tree/main?tab=readme-ov-file#installation)
- [2] [MLflow deployment using Docker, EC2, S3, and RDS](https://aws.plainenglish.io/set-up-mlflow-on-aws-ec2-using-docker-s3-and-rds-90d96798e555)
""")