molinari135 committed on
Commit
a1a7d89
1 Parent(s): 91a6159

Initial commit

Dockerfile ADDED
@@ -0,0 +1,31 @@
1
+ FROM python:3.12-slim
2
+
3
+ ARG WORKDIR=/app
4
+ WORKDIR $WORKDIR
5
+
6
+ RUN python -m pip install --upgrade pip==23.3.1
7
+ RUN apt-get update && apt-get install -y --no-install-recommends \
8
+ build-essential && \
9
+ rm -rf /var/lib/apt/lists/*
10
+
11
+ COPY product_return_prediction $WORKDIR/product_return_prediction
12
+
13
+ # COPY product_return_prediction/api.py $WORKDIR/product_return_prediction
14
+ # COPY product_return_prediction/config.py $WORKDIR/product_return_prediction
15
+ # COPY product_return_prediction/dataset.py $WORKDIR/product_return_prediction
16
+ # COPY product_return_prediction/features.py $WORKDIR/product_return_prediction
17
+
18
+ COPY README.md $WORKDIR/
19
+ COPY requirements.txt $WORKDIR/
20
+ COPY pyproject.toml $WORKDIR/
21
+
22
+ COPY data/external/inventory.tsv $WORKDIR/data/external/
23
+ COPY models/scaler.pkl $WORKDIR/models/
24
+ COPY models/svm.pkl $WORKDIR/models/
25
+
26
+ RUN pip install --no-cache-dir -r requirements.txt
27
+ # RUN pip install --no-cache-dir .
28
+
29
+ EXPOSE 7860
30
+
31
+ CMD ["uvicorn", "product_return_prediction.api:app", "--host", "0.0.0.0", "--port", "7860", "--reload"]
LICENSE ADDED
@@ -0,0 +1,10 @@
1
+
2
+ The MIT License (MIT)
3
+ Copyright (c) 2024, Molinari-Pinto-Tanzi
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
6
+
7
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
8
+
9
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
10
+
README.md DELETED
@@ -1,12 +0,0 @@
1
- ---
2
- title: Product Return Prediction
3
- emoji: 🏆
4
- colorFrom: green
5
- colorTo: red
6
- sdk: docker
7
- pinned: false
8
- license: mit
9
- short_description: 'No'
10
- ---
11
-
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
data/README.md ADDED
@@ -0,0 +1,158 @@
1
+ <!-- ---
2
+ # For reference on dataset card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1
3
+ # Doc / guide: https://huggingface.co/docs/hub/datasets-cards
4
+ {{ card_data }}
5
+ --- -->
6
+
7
+ # Sales Dataset Card
8
+
9
+ <!-- Provide a quick summary of the dataset. -->
10
+ This dataset contains 28,692 rows of data related to Emporio Armani's e-commerce order lines. Each row represents an order line, detailing whether the product was returned or not. The dataset includes features such as:
11
+
12
+ - **Date and Time Information**: Year (Gregorian), month (Gregorian), month name, and the exact purchase date (in date format).
13
+ - **Customer Information**: Store ID of the customer associated with the transaction.
14
+ - **Order Line Details**: Order number and order line number to uniquely identify each purchase.
15
+ - **Geographical Information**: Country where the purchase was made.
16
+ - **Product Details**: Variant code (WCS), Item Brand model, Fabric, Color, a combination of model, fabric, and color, and the product composition. Additionally, the dataset specifies the product's top category, type, and subtype, along with its gender target, age range, and a link to the product image.
17
+ - **Return Information**: Return reason group and detailed reason for the return (if applicable).
18
+ - **Financial and Quantitative Data**: Net sales (value and units), return value, and return units for each transaction.
19
+
20
+ The dataset is slightly unbalanced, with only 23% of transactions involving returned products.
21
+
22
+
23
+ <!-- ## Dataset Details -->
24
+
25
+ <!-- ### Dataset Description -->
26
+
27
+ <!-- Provide a longer summary of what this dataset is. -->
28
+
29
+ <!-- - **Curated by**: Molinari-Pinto-Tanzi -->
30
+ <!-- - **Funded by**: Armani -->
31
+ <!-- - **Shared by [optional]:** {{ shared_by | default("[More Information Needed]", true)}}
32
+ - **Language(s) (NLP):** {{ language | default("[More Information Needed]", true)}} -->
33
+ <!-- - **License:** {{ license | default("[More Information Needed]", true)}} -->
34
+
35
+ <!-- ## Dataset Sources -->
36
+
37
+ <!-- Provide the basic links for the dataset. -->
38
+
39
+ <!-- - **GitHub Repository**: [Product Return Prediction on GitHub](https://github.com/se4ai2425-uniba/product-return-prediction) -->
40
+ <!-- - **DagsHub Repository**: [Product Return Prediction on DagsHub](https://dagshub.com/se4ai2425-uniba/product-return-prediction) -->
41
+ <!-- - **Demo [optional]:** {{ demo | default("[More Information Needed]", true)}} -->
42
+
43
+ ## Uses
44
+
45
+ <!-- Address questions around how the dataset is intended to be used. -->
46
+ The dataset can be used to support a variety of use cases in the field of e-commerce analytics and machine learning. Some of the main intended uses are:
47
+
48
+ - **Predictive Modeling**: Training and evaluating machine learning models for binary classification tasks, such as predicting the likelihood of a product being returned based on its characteristics and transaction details.
49
+ - **Exploratory Data Analysis (EDA)**: Analyzing patterns and trends in product returns to identify factors that influence customer behavior, such as product type, composition, or age range.
50
+ - **Feature Engineering**: Using the dataset to develop and test new features for predictive models, such as aggregating return reasons or combining product composition data.
51
+ - **Unbalanced Data Research**: Studying machine learning techniques and strategies to handle imbalanced datasets, as the target variable is not evenly distributed.
52
+
53
+ ### Direct Use
54
+
55
+ <!-- This section describes suitable use cases for the dataset. -->
56
+ The dataset can be used in the following cases:
57
+ - Train a **binary classification** model to predict if a product will be returned
58
+ - Train a **regression** model to predict the probability that a product will be returned
59
+ - Train a **multi-class classification** model to predict the reason for a return
60
+
61
+
62
+ ## Dataset Structure
63
+
64
+ <!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
65
+
66
+ The dataset presents the following features:
67
+
68
+ | Feature name | Description |
69
+ | --- | --- |
70
+ | Year Gregorian | Gregorian year of the purchase (e.g.: `2023`) |
71
+ | Month Gregorian | Gregorian month of the purchase (e.g.: `01/2023`, indicating January 2023) |
72
+ | Month Gregorian Name | Abbreviated Gregorian month name of the purchase (e.g.: `Jan`) |
73
+ | Date (Date Format) | Date of the purchase (e.g.: `2023-01-02`, indicating the 2nd of January 2023) |
74
+ | Customer Store ID | Numerical code that identifies the user |
75
+ | Order Number | Alphanumerical code that identifies the receipt to which the purchase belongs |
76
+ | Order Line Number | Integer corresponding to the position of the product within the receipt (if the product is part of a receipt containing 5 products, the order line number is a value between 1 and 5) |
77
+ | Country | Country in which the product was purchased |
78
+ | Variant WCS | Alternative identifier of the receipt, with 1:1 correspondence with Order Number |
79
+ | Item Brand Model | Alphanumerical code indicating the model of the purchased product |
80
+ | Item Brand Fabric | Alphanumerical code indicating the fabric of the purchased product |
81
+ | Item Brand Colour | Alphanumerical code indicating the colour of the purchased product |
82
+ | Item Brand Model Fabric Colour | Alphanumerical code, the combination of the codes of Model, Fabric, and Colour |
83
+ | Product Composition | Information on the percentage of materials that make up the purchased product (e.g.: `43% COTTON 29% WOOL 28% ACRYLIC`) |
84
+ | Product Top Category | Macrocategory to which the purchased product belongs (e.g.: `READY TO WEAR`) |
85
+ | Product Type | Type of garment corresponding to the product purchased (e.g.: `JACKETS`) |
86
+ | Product Subtype | Subtype of garment corresponding to the product purchased (e.g.: `DOUBLE BREASTED JACKETS`) |
87
+ | Age Range | Value that could be `ADULT`, `JUNIOR` or `BABY` |
88
+ | Product Gender | Value that could be `MALE` or `FEMALE` |
89
+ | Product Image Link | URL of the purchased product's images (about 20% of the products do not have a corresponding link) |
90
+ | Return Reason Group | Set of 9 values, identified by a string, corresponding to the macrocategory of the product return reason. A `#N/A#` value corresponds to an unreturned product |
91
+ | Return Reason | Set of 26 values, identified by a string, corresponding to the specific category of the product return reason. A `#N/A#` value corresponds to an unreturned product |
92
+ | Net Sales (FA) | Value, in Euros, of the product purchased |
93
+ | Net Sales Units (FA) | Value describing whether the product was returned or not (`-1` means the product was returned, otherwise the value is `1`) |
94
+ | Returns Value (FA) | Same value as the net sales; it is populated only if the product is returned |
95
+ | Return Units (FA) | Value is `1.0` only if the product is returned, otherwise it is null |
96
+
97
+ ## Dataset Creation
98
+
99
+ ### Data Collection and Processing
100
+
101
+ <!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
102
+
103
+ A feature engineering pipeline has been applied to the dataset as follows:
104
+
105
+ 1. Added a new column named `Returned` that contains a flag identifying whether a product has been returned, based on the `Return Units (FA)` column
106
+ 2. Removed `Year Gregorian`, `Month Gregorian`, `Month Gregorian Name`, `Country`, `Age Range`, `Product Image Link`, `Returns Value (FA)`, `Returns Units (FA)`, `Return Reason Group` and `Return Reason` because they were not useful for training
107
+ 3. Removed `Variant WCS` to remove additional IDs
108
+ 4. Added a new column named `Product Order Count` that tells the number of products belonging to the same order as the selected product based on `Order Number` and `Order Line Number`
109
+ 5. Added a new column named `Total Order Value` performing the sum of every product belonging to the same order based on `Order Number` and `Net Sales (FA)` columns
110
+ 6. Added a new column named `Main Material` which contains the first material that can be found in the `Product Composition` column
111
+ 7. Added a new column named `Colour Return Percentage` that estimates the return likelihood of a product based on its `Item Brand Model` and `Item Brand Colour`
112
+ - This operation also produced a JSON file that helps obtain known values starting from `Item Brand Model` and `Item Brand Colour`; otherwise a median value is used based on the product's `Product Top Category`
113
+ 8. Added a new column named `Total Customer Purchases` that tells the number of purchases, within the year, made by the customer who purchased that product
114
+ 9. Added a new column named `Total Customer Returns` that tells the number of returns, within the year, made by the customer who purchased that product
115
+ 10. Added a new column named `Customer Return Percentage` that shows the return rate of the customer who bought that product
116
+ 11. Selected only those rows belonging to `READY TO WEAR` as `Product Top Category`
117
+ 12. Removed `Date (Date format)`, `Customer Store ID`, `Order Number`, `Order Line Number`, `Item Brand Model`, `Item Brand Fabric`, `Item Brand Colour`, `Item Brand Model Fabric Colour`, `Product Composition`, `Product Top Category`
118
+
119
+ After performing all these operations, all the categorical features have been converted into numerical ones using a **Target Encoding** technique with smoothing, to avoid partial-ordering issues during training. A `StandardScaler` trained only on the training split has been applied at the end of the whole process to prepare the data for training, evaluation and inference.
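A condensed sketch of the smoothed target encoding described above (the full routine is added later in this commit as `target_encode_columns` in `product_return_prediction/dataset.py`; the column and target names below are only illustrative):

```python
import pandas as pd


def smoothed_target_encode(df: pd.DataFrame, column: str, target: str, smoothing: float = 1.0) -> pd.Series:
    # Weighted average between the per-category target mean and the global target mean:
    # rare categories are pulled towards the global mean, frequent ones keep their own mean.
    global_mean = df[target].mean()
    agg = df.groupby(column)[target].agg(["mean", "count"])
    smooth = (agg["mean"] * agg["count"] + global_mean * smoothing) / (agg["count"] + smoothing)
    return df[column].map(smooth)


# Illustrative usage on the processed dataset:
# df["Product Type"] = smoothed_target_encode(df, "Product Type", "Returned")
```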
120
+
121
+ The new dataset contains the following features:
122
+
123
+ | Feature | Description |
124
+ |---|---|
125
+ | Product Type | Type of garment corresponding to the product purchased (e.g.: `JACKETS`) |
126
+ | Product Subtype | Subtype of garment corresponding to the product purchased (e.g.: `DOUBLE BREASTED JACKETS`) |
127
+ | Product Gender | Value that could be `MALE` or `FEMALE` |
128
+ | Net Sales (FA) | Value, in Euros, of the product purchased |
129
+ | Net Sales Units (FA) | Value describing the number of products purchased or returned (always `1`) |
130
+ | Returned | `1` if the product has been returned, `0` otherwise |
131
+ | Product Order Count | Number of products belonging to the same order |
132
+ | Total Order Value | Sum of every product belonging to the same order, in Euros |
133
+ | Main Material | Material the product is mainly made of |
134
+ | Colour Return Percentage | Likelihood of the product being returned, based on its colour |
135
+ | Total Customer Purchases | Number of purchases made by the user that bought or returned that product |
136
+ | Total Customer Returns | Number of returns made by the user that bought or returned that product |
137
+ | Customer Return Percentage | Likelihood of the product being returned, based on the customer's behaviour |
138
+
139
+ This new dataset has been split into two files, `train.tsv` and `test.tsv`, using an 80-20 split.
140
+
141
+
142
+ ### Personal and Sensitive Information
143
+
144
+ <!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
145
+ The dataset does not contain personal or sensitive information. The only reference to customers is the customer ID associated with orders; no sensitive information about customers is involved.
146
+
147
+ ## Bias, Risks, and Limitations
148
+
149
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
150
+
151
+ The dataset presents an unbalanced distribution of returns and purchases: returns account for only 23% of the records. If this is not taken into account, models trained on this dataset could produce slightly biased results. Additionally, data exploration shows no correlation between features before feature engineering is applied.
152
+
153
+
154
+ ### Recommendations
155
+
156
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
157
+
158
+ Deeper data exploration and feature engineering are suggested to achieve better training results with this dataset.
data/external/.gitignore ADDED
@@ -0,0 +1,2 @@
1
+ /inventory.csv
2
+ /inventory.tsv
data/external/.gitkeep ADDED
File without changes
data/external/inventory.csv.dvc ADDED
@@ -0,0 +1,5 @@
1
+ outs:
2
+ - md5: b2d0ccf46d96499bcaa47052bb57bbba
3
+ size: 4413122
4
+ hash: md5
5
+ path: inventory.csv
models/.gitignore ADDED
@@ -0,0 +1,3 @@
1
+ /log_reg.pkl
2
+ /svm.pkl
3
+ /scaler.pkl
models/.gitkeep ADDED
File without changes
models/README.md ADDED
@@ -0,0 +1,85 @@
1
+ # Model Card for Product Return Prediction
2
+
3
+ ## Model Details
4
+
5
+ - **person or organization developing model**: team product-return-prediction
6
+ - **model date**: 24/11/2024
7
+ - **model version**: v1.4
8
+ - **model type**: Support Vector Machine
9
+
10
+ <!-- algorithm description -->
11
+ This model is a **Support Vector Machine** classifier designed to predict whether a product will be returned, based on various product and transaction features. Hyperparameters (C, kernel type and gamma) are chosen using a grid search with 10-fold cross-validation.
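A minimal sketch of the hyperparameter search described above; the exact grids used for v1.4 are not part of this commit, so the values below are only illustrative:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative search space over C, kernel type and gamma, with 10-fold cross-validation
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
}
search = GridSearchCV(SVC(probability=True), param_grid, cv=10)
# search.fit(X_train, y_train)        # X_train, y_train: the scaled training split
# best_model = search.best_estimator_
```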
12
+
13
+ ## Intended Use
14
+
15
+ ### Primary Intended Uses
16
+
17
+ <!-- description of the model's use -->
18
+ The purpose of the model is to assist e-commerce owners (Armani) in identifying likely returns among purchases, so that inventories can be reorganized to optimize product handling and transportation costs.
19
+
20
+ ### Primary Intended Users
21
+
22
+ <!-- description of the users -->
23
+ The model was developed for Armani. Specifically, the purpose is to support professional figures involved in logistics, product management, and marketing.
24
+
25
+ <!-- ### out-of scope use cases -->
26
+
27
+ ## Factors
28
+
29
+ ### Relevant Factors
30
+
31
+ <!-- factors to consider -->
32
+ Some factors to consider when using the model are the following:
33
+
34
+ - **product features**: characteristics like model, fabric, colour, composition, and product category may have a significant impact on the likelihood of a product being returned
35
+ - **imbalanced classes**: the class imbalance is a relevant factor that may affect the model's ability to predict the minority class (returns) accurately
36
+
37
+ ### Decision Thresholds
38
+
39
+ <!-- description of selected thresholds -->
40
+ The default decision threshold for the SVM model is 0.5, where probabilities greater than or equal to 0.5 indicate a "returned" prediction, and probabilities below 0.5 indicate "not returned."
41
+
42
+ ## Train and Test Data
43
+
44
+ ### Dataset Description
45
+
46
+ - **dataset**: *German Sales 2023 EA*
47
+
48
+ The model was trained and tested on this dataset, following appropriate splitting and pre-processing steps.
49
+
50
+ ### Split
51
+
52
+ Dataset splitting is as follows:
53
+ - **training**: 80%
54
+ - **validation and test**: 20%
55
+
56
+ The split is performed using the corresponding scikit-learn function. The chosen random state is 42.
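A simplified sketch of this split; note that `split_data` in `product_return_prediction/dataset.py` (added in this commit) splits on unique customer IDs, so one customer's orders never end up in both sets. The input path below is only illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/processed/features.tsv", sep="\t")  # illustrative path to the processed dataset

# 80/20 split on customer IDs with the fixed random state
train_ids, test_ids = train_test_split(df["Customer Store ID"].unique(), test_size=0.2, random_state=42)
train_df = df[df["Customer Store ID"].isin(train_ids)]
test_df = df[df["Customer Store ID"].isin(test_ids)]
```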
57
+
58
+ ### Pre-processing
59
+
60
+ To adapt the data to the binary classification task, and to a numerical model such as an SVM, the dataset underwent an extensive pre-processing phase. The pre-processing steps are the following:
61
+
62
+ 1. Dataset conversion from Excel to TSV
63
+ 2. Specific columns removal from dataframe
64
+ 3. Train and test data splitting
65
+ 4. Train and save scaler
66
+ 5. Scaling data with a pre-trained scaler
67
+ 6. Target encoding of categorical columns
68
+ 7. Preparation of inventory with sales data
69
+ 8. Population of missing values
70
+ 9. Calculation and application of return percentages by color
71
+ 10. Final cleaning and processing
72
+
73
+ ## Quantitative Analysis
74
+
75
+ | | PRECISION | RECALL | F1-SCORE | Support |
76
+ |-----------|-----------|-----------|-----------|-----------|
77
+ | No return | 0.95 | 0.95 | 0.95 | 2086 |
78
+ | Return | 0.89 | 0.90 | 0.89 | 960 |
79
+ | Accuracy | | | | 0.93 |
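For reference, a report of this shape can be produced with scikit-learn; a small sketch, assuming the held-out labels and the fitted SVM's predictions are available:

```python
from sklearn.metrics import classification_report


def report(y_test, y_pred) -> str:
    # 0 = "No return" (negative class), 1 = "Return" (positive class)
    return classification_report(y_test, y_pred, target_names=["No return", "Return"])
```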
80
+
81
+
82
+ <!-- ### unitary results -->
83
+
84
+ <!-- ### intersectional results -->
85
+
product_return_prediction/__init__.py ADDED
@@ -0,0 +1 @@
1
+ from product_return_prediction import config # noqa: F401
product_return_prediction/api.py ADDED
@@ -0,0 +1,144 @@
1
+ from fastapi import FastAPI, HTTPException
2
+ from pydantic import BaseModel, Field
3
+ import pandas as pd
4
+ import json
5
+ import pickle
6
+ from pathlib import Path
7
+ from product_return_prediction.dataset import prepare_inventory, scale_data_with_trained_scaler
8
+ from product_return_prediction.config import MODELS_DIR, EXTERNAL_DATA_DIR
9
+
10
+ app = FastAPI(
11
+ title="Product Return Prediction API",
12
+ description="This API predicts whether a product will be returned based on products and user behavior.",
13
+ version="0.1.0"
14
+ )
15
+
16
+
17
+ class ProductRequest(BaseModel):
18
+ models: list[str] = Field(
19
+ ...,
20
+ example=["01CA9T", "0NG3DT"]
21
+ )
22
+ fabrics: list[str] = Field(
23
+ ...,
24
+ example=["0130C", "02003"]
25
+ )
26
+ colours: list[str] = Field(
27
+ ...,
28
+ example=["922", "999"]
29
+ )
30
+ total_customer_purchases: int = Field(
31
+ ...,
32
+ example=1
33
+ )
34
+ total_customer_returns: int = Field(
35
+ ...,
36
+ example=0
37
+ )
38
+
39
+
40
+ def load_json(file_path: Path) -> dict:
41
+ """Load a JSON file and return its content."""
42
+ try:
43
+ with open(file_path, 'r') as f:
44
+ return json.load(f)
45
+ except Exception as e:
46
+ raise HTTPException(status_code=500, detail=f"Error reading JSON file {file_path}: {e}")
47
+
48
+
49
+ def filter_inventory_by_combinations(inventory: pd.DataFrame, models: list, fabrics: list, colours: list) -> pd.DataFrame:
50
+ """Filter inventory based on the product combinations."""
51
+ filtered_inventory = pd.DataFrame()
52
+ for model, fabric, colour in zip(models, fabrics, colours):
53
+ matching_rows = inventory[
54
+ (inventory['Item Brand Model'] == model) & (inventory['Item Brand Fabric'] == fabric) & (inventory['Item Brand Colour'] == colour)
55
+ ]
56
+ filtered_inventory = pd.concat([filtered_inventory, matching_rows])
57
+ return filtered_inventory
58
+
59
+
60
+ def load_model(model_path: Path):
61
+ """Load the trained model and scaler."""
62
+ try:
63
+ with open(model_path, 'rb') as f:
64
+ model = pickle.load(f)
65
+ return model
66
+ except Exception as e:
67
+ raise HTTPException(status_code=500, detail=f"Error loading model: {e}")
68
+
69
+
70
+ def apply_scaling(data: pd.DataFrame, scaler) -> pd.DataFrame:
71
+ """Scale the data using the pre-trained scaler."""
72
+ try:
73
+ return scale_data_with_trained_scaler(data, scaler)
74
+ except Exception as e:
75
+ raise HTTPException(status_code=500, detail=f"Error scaling data: {e}")
76
+
77
+
78
+ def make_predictions(model, scaled_data: pd.DataFrame):
79
+ """Make predictions using the trained model."""
80
+ try:
81
+ predictions = model.predict(scaled_data)
82
+ probabilities = model.predict_proba(scaled_data)
83
+ return predictions, probabilities
84
+ except Exception as e:
85
+ raise HTTPException(status_code=500, detail=f"Error making predictions: {e}")
86
+
87
+
88
+ def prepare_inventory_data(filtered_inventory: pd.DataFrame, total_customer_purchases: int, total_customer_returns: int) -> pd.DataFrame:
89
+ """Prepare and filter inventory data based on provided sales and percentages."""
90
+
91
+ prepared_inventory = prepare_inventory(filtered_inventory)
92
+
93
+ num_items = len(filtered_inventory)
94
+ prepared_inventory['Product Order Count'] = num_items
95
+ prepared_inventory['Total Order Value'] = prepared_inventory['Net Sales Units (FA)'].sum()
96
+ if total_customer_purchases != 0:
97
+ prepared_inventory['Customer Return Percentage'] = (total_customer_returns / total_customer_purchases) * 100
98
+ else:
99
+ prepared_inventory['Customer Return Percentage'] = 0.0
100
+
101
+ return prepared_inventory
102
+
103
+
104
+ @app.get("/")
105
+ async def root():
106
+ return {
107
+ "message": "Welcome to the Product Return Prediction API! Use /predict to make predictions."
108
+ }
109
+
110
+
111
+ @app.post("/predict/")
112
+ async def predict(products: ProductRequest):
113
+ inventory_path: Path = EXTERNAL_DATA_DIR / "inventory.tsv"
114
+ model_path: Path = MODELS_DIR / "svm.pkl"
115
+ scaler_file: Path = MODELS_DIR / "scaler.pkl"
116
+
117
+ inventory = pd.read_csv(inventory_path, sep='\t')
118
+
119
+ filtered_inventory = filter_inventory_by_combinations(
120
+ inventory, products.models, products.fabrics, products.colours
121
+ )
122
+
123
+ if filtered_inventory.empty:
124
+ raise HTTPException(status_code=404, detail="No matching products found")
125
+
126
+ prepared_inventory = prepare_inventory_data(
127
+ filtered_inventory, products.total_customer_purchases, products.total_customer_returns
128
+ )
129
+
130
+ model = load_model(model_path)
131
+
132
+ scaled_inventory = apply_scaling(prepared_inventory, scaler_file)
133
+ predictions, probabilities = make_predictions(model, scaled_inventory)
134
+
135
+ result = [
136
+ {
137
+ "product": f"{row[0]}-{row[1]}-{row[2]}",
138
+ "prediction": "Return" if pred == 1 else "No Return",
139
+ "confidence": f"{prob.max():.2f}"
140
+ }
141
+ for row, pred, prob in zip(filtered_inventory.itertuples(index=False), predictions, probabilities)
142
+ ]
143
+
144
+ return {"predictions": result}
product_return_prediction/app.py ADDED
@@ -0,0 +1,79 @@
1
+ import gradio as gr
2
+ import requests
3
+
4
+ # FastAPI endpoint URL
5
+ API_URL = "http://localhost:8000/predict/"
6
+
7
+
8
+ # Gradio Interface function
9
+ def predict_return(selected_products, total_customer_purchases, total_customer_returns):
10
+ # Input validation for returns (must be <= purchases)
11
+ if total_customer_returns > total_customer_purchases:
12
+ return "Error: Total returns cannot be greater than total purchases."
13
+
14
+ # Prepare the request data
15
+ models = []
16
+ fabrics = []
17
+ colours = []
18
+
19
+ for selected_product in selected_products:
20
+ # Split each selected product into model, fabric, and color
21
+ model, fabric, color = selected_product.split("-")
22
+ models.append(model)
23
+ fabrics.append(fabric)
24
+ colours.append(color)
25
+
26
+ # Prepare the data to send to the API
27
+ data = {
28
+ "models": models,
29
+ "fabrics": fabrics,
30
+ "colours": colours,
31
+ "total_customer_purchases": total_customer_purchases,
32
+ "total_customer_returns": total_customer_returns
33
+ }
34
+
35
+ print(data)
36
+
37
+ try:
38
+ # Make the POST request to the FastAPI endpoint
39
+ response = requests.post(API_URL, json=data)
40
+ response.raise_for_status() # Raise an error for bad responses
41
+
42
+ # Get the predictions and return them
43
+ result = response.json()
44
+ predictions = result.get('predictions', [])
45
+
46
+ if not predictions:
47
+ return "Error: No predictions found."
48
+
49
+ # Format the output to display nicely
50
+ formatted_result = "\n".join([f"Prediction: {pred['prediction']} | Confidence: {pred['confidence']}%" for pred in predictions])
51
+ return formatted_result
52
+
53
+ except requests.exceptions.RequestException as e:
54
+ return f"Error: {str(e)}"
55
+
56
+
57
+ # Predefined list of model-fabric-color combinations
58
+ combinations = [
59
+ "01CA9T-0130C-922",
60
+ "0NG3DT-02003-999",
61
+ "3R1F67-1JCYZ-0092",
62
+ "211740-3R419-06935",
63
+ "6R1J75-1DQSZ-0943"
64
+ ]
65
+
66
+ # Gradio interface elements
67
+ interface = gr.Interface(
68
+ fn=predict_return, # Function that handles the prediction logic
69
+ inputs=[
70
+ gr.CheckboxGroup(choices=combinations, label="Select Products"), # Allow multiple product selections
71
+ gr.Slider(0, 10, step=1, label="Total Customer Purchases", value=0),
72
+ gr.Slider(0, 10, step=1, label="Total Customer Returns", value=0)
73
+ ],
74
+ outputs="text", # Display predictions as text
75
+ live=True # To enable the interface to interact live
76
+ )
77
+
78
+ # Launch the Gradio interface
79
+ interface.launch()
product_return_prediction/config.py ADDED
@@ -0,0 +1,49 @@
1
+ from pathlib import Path
2
+
3
+ from dotenv import load_dotenv
4
+ from loguru import logger
5
+
6
+ # Load environment variables from .env file if it exists
7
+ load_dotenv()
8
+
9
+ # Paths
10
+ PROJ_ROOT = Path(__file__).resolve().parents[1]
11
+ logger.info(f"PROJ_ROOT path is: {PROJ_ROOT}")
12
+
13
+ DATA_DIR = PROJ_ROOT / "data"
14
+ RAW_DATA_DIR = DATA_DIR / "raw"
15
+ INTERIM_DATA_DIR = DATA_DIR / "interim"
16
+ PROCESSED_DATA_DIR = DATA_DIR / "processed"
17
+ EXTERNAL_DATA_DIR = DATA_DIR / "external"
18
+
19
+ CATEGORICAL_DATA_DIR = PROCESSED_DATA_DIR / "cat_dataset"
20
+ NUMERICAL_DATA_DIR = PROCESSED_DATA_DIR / "num_dataset"
21
+
22
+ CATEGORICAL_TRAIN_DATA_FILE = CATEGORICAL_DATA_DIR / "train.tsv"
23
+ CATEGORICAL_VAL_DATA_FILE = CATEGORICAL_DATA_DIR / "val.tsv"
24
+ CATEGORICAL_TEST_DATA_FILE = CATEGORICAL_DATA_DIR / "test.tsv"
25
+
26
+ NUMERICAL_TRAIN_DATA_FILE = NUMERICAL_DATA_DIR / "train.tsv"
27
+ NUMERICAL_VAL_DATA_FILE = NUMERICAL_DATA_DIR / "val.tsv"
28
+ NUMERICAL_TEST_DATA_FILE = NUMERICAL_DATA_DIR / "test.tsv"
29
+
30
+ LABELS_DIR = INTERIM_DATA_DIR / "labels"
31
+
32
+ MODELS_DIR = PROJ_ROOT / "models"
33
+
34
+ REPORTS_DIR = PROJ_ROOT / "reports"
35
+ FIGURES_DIR = REPORTS_DIR / "figures"
36
+
37
+ RANDOM_SEED = 42
38
+
39
+ TARGET_COLUMN = "Returned"
40
+
41
+ # If tqdm is installed, configure loguru with tqdm.write
42
+ # https://github.com/Delgan/loguru/issues/135
43
+ try:
44
+ from tqdm import tqdm
45
+
46
+ logger.remove(0)
47
+ logger.add(lambda msg: tqdm.write(msg, end=""), colorize=True)
48
+ except ModuleNotFoundError:
49
+ pass
product_return_prediction/dataset.py ADDED
@@ -0,0 +1,506 @@
1
+ import pickle
2
+ import json
3
+ from pathlib import Path
4
+
5
+ import pandas as pd
6
+ import typer
7
+ from loguru import logger
8
+ from sklearn.model_selection import train_test_split
9
+ from sklearn.preprocessing import StandardScaler
10
+
11
+ from product_return_prediction.features import add_main_material, extract_ready_to_wear
12
+ from product_return_prediction.config import (
13
+ PROCESSED_DATA_DIR,
14
+ MODELS_DIR,
15
+ RANDOM_SEED,
16
+ TARGET_COLUMN,
17
+ EXTERNAL_DATA_DIR,
18
+ INTERIM_DATA_DIR,
19
+ RAW_DATA_DIR
20
+ )
21
+
22
+ app = typer.Typer()
23
+
24
+
25
+ # TODO The input file must be the path to an Excel file (.xlsx)
26
+ # TODO The output file must be the path where the resulting TSV file will be saved
27
+ def xlsx_to_tsv(input_file: Path, output_file: Path):
28
+ """
29
+ Converts an Excel (.xlsx) file to a Tab-Separated Values (.tsv) file.
30
+
31
+ The function reads data from an Excel file, then writes the data to a TSV file
32
+ (using tab as the delimiter). It logs any errors that occur during reading
33
+ or writing the files.
34
+
35
+ Args:
36
+ input_file (Path): The path to the input Excel file (.xlsx).
37
+ output_file (Path): The path where the output TSV file should be saved.
38
+ """
39
+
40
+ try:
41
+ xlsx_data = pd.read_excel(input_file)
42
+ except Exception as e:
43
+ logger.error(f"Error reading {input_file}: {e}")
44
+ return
45
+
46
+ try:
47
+ xlsx_data.to_csv(output_file, sep='\t', index=False)
48
+ except Exception as e:
49
+ logger.error(f"Error writing to {output_file}: {e}")
50
+
51
+
52
+ # TODO The columns to drop must exist in the input dataframe
53
+ def drop_columns(df: pd.DataFrame, columns_to_drop: list) -> pd.DataFrame:
54
+ """
55
+ Removes specified columns from the DataFrame.
56
+
57
+ This function takes a DataFrame and a list of column names to be dropped,
58
+ and returns a new DataFrame with those columns removed.
59
+
60
+ Args:
61
+ df (pd.DataFrame): The input DataFrame from which columns will be removed.
62
+ columns_to_drop (list): A list of column names (strings) to be removed from the DataFrame.
63
+
64
+ Returns:
65
+ pd.DataFrame: A new DataFrame with the specified columns removed.
66
+ """
67
+
68
+ return df.drop(columns=columns_to_drop)
69
+
70
+
71
+ def split_data(df: pd.DataFrame, train_file: Path, test_file: Path, id_column: str = "Customer Store ID"):
72
+ """
73
+ Splits the input DataFrame into training and testing datasets based on unique values
74
+ of a specified column, and saves them as TSV files.
75
+
76
+ Args:
77
+ df (pd.DataFrame): The input DataFrame to be split.
78
+ train_file (Path): The file path where the training dataset should be saved as a TSV file.
79
+ test_file (Path): The file path where the testing dataset should be saved as a TSV file.
80
+ id_column (str): The column name used for splitting the DataFrame into groups.
81
+ """
82
+ unique_ids = df[id_column].unique()
83
+
84
+ train_ids, test_ids = train_test_split(unique_ids, test_size=0.2, random_state=RANDOM_SEED)
85
+
86
+ train_df = df[df[id_column].isin(train_ids)]
87
+ test_df = df[df[id_column].isin(test_ids)]
88
+
89
+ train_df.to_csv(train_file, sep='\t', index=False)
90
+ test_df.to_csv(test_file, sep='\t', index=False)
91
+
92
+ logger.info(f"Training data saved to {train_file}")
93
+ logger.info(f"Testing data saved to {test_file}")
94
+
95
+
96
+ # TODO The scaler file must be the path where the trained scaler will be saved
97
+ def train_and_save_scaler(train_df: pd.DataFrame, scaler_file: Path):
98
+ """
99
+ Trains a scaler on the training data and saves it to a file.
100
+
101
+ This function applies target encoding to specific categorical columns in the training
102
+ dataset, scales the numeric columns using `StandardScaler`, and then saves the trained
103
+ scaler to a file for later use.
104
+
105
+ Args:
106
+ train_df (pd.DataFrame): The training DataFrame containing the data to be scaled.
107
+ scaler_file (Path): The file path where the trained scaler will be saved.
108
+ """
109
+
110
+ scaler = StandardScaler()
111
+
112
+ train_df = target_encode_columns(train_df, [
113
+ 'Product Type', 'Product Subtype', 'Product Gender', 'Main Material'
114
+ ], 'Colour Return Percentage')
115
+
116
+ train_df = scaler.fit_transform(train_df.drop(columns=[TARGET_COLUMN]))
117
+
118
+ with open(scaler_file, 'wb') as f:
119
+ pickle.dump(scaler, f)
120
+ logger.info(f"Scaler trained and saved to {scaler_file}")
121
+
122
+
123
+ # TODO The scaler file must be the path of the scaler in a Pickle (.pkl) format
124
+ def scale_data_with_trained_scaler(df: pd.DataFrame, scaler_file: Path) -> pd.DataFrame:
125
+ """
126
+ Scales the input DataFrame using a previously trained scaler.
127
+
128
+ This function loads a pre-trained `StandardScaler` from a file, applies target encoding
129
+ to specific categorical columns, and then scales the numeric columns in the DataFrame
130
+ using the loaded scaler.
131
+
132
+ Args:
133
+ df (pd.DataFrame): The input DataFrame to be scaled, containing both categorical and numeric features.
134
+ scaler_file (Path): The file path from which the pre-trained scaler will be loaded.
135
+
136
+ Returns:
137
+ pd.DataFrame: A DataFrame with the numeric columns scaled using the loaded scaler.
138
+ """
139
+
140
+ with open(scaler_file, 'rb') as f:
141
+ scaler = pickle.load(f)
142
+
143
+ df = target_encode_columns(df, [
144
+ 'Product Type', 'Product Subtype', 'Product Gender', 'Main Material'
145
+ ], 'Colour Return Percentage')
146
+
147
+ if TARGET_COLUMN in df.columns:
148
+ df = scaler.transform(df.drop(columns=[TARGET_COLUMN]))
149
+ else:
150
+ df = scaler.transform(df)
151
+
152
+ logger.info(f"Data scaled using scaler from {scaler_file}")
153
+
154
+ return df
155
+
156
+
157
+ # TODO The column names and the target must exist in the dataframe
158
+ def target_encode_columns(df: pd.DataFrame, column_names: list, target: str, smoothing: float = 1.0) -> pd.DataFrame:
159
+ """
160
+ Applies target encoding to specified categorical columns in the DataFrame.
161
+
162
+ Target encoding involves replacing the categorical values with the mean of the target variable,
163
+ smoothed by a global mean. This helps to reduce overfitting, especially when dealing with sparse categories.
164
+
165
+ Args:
166
+ df (pd.DataFrame): The input DataFrame containing the columns to encode and the target variable.
167
+ column_names (list): A list of categorical column names in the DataFrame that need to be target-encoded.
168
+ target (str): The name of the target column to calculate the encoding based on.
169
+ smoothing (float, optional): The smoothing factor to control the weight between category mean and global mean. Default is 1.0.
170
+
171
+ Returns:
172
+ pd.DataFrame: The DataFrame with the target-encoded columns. The original columns are overwritten.
173
+
174
+ Example:
175
+ ```python
176
+ df = target_encode_columns(df, ['Product Type', 'Product Subtype'], 'Sales')
177
+ ```
178
+ """
179
+
180
+ for column_name in column_names:
181
+ if column_name in df.columns:
182
+ logger.info(f"Applying Target Encoding to '{column_name}'")
183
+
184
+ # Compute global mean of the target
185
+ global_mean = df[target].mean()
186
+
187
+ # Group by the categorical column and compute target mean and count
188
+ agg = df.groupby(column_name)[target].agg(['mean', 'count'])
189
+ agg.columns = ['mean', 'count']
190
+
191
+ # Apply smoothing: weighted average between category mean and global mean
192
+ agg['smooth_mean'] = (agg['mean'] * agg['count'] + global_mean * smoothing) / (agg['count'] + smoothing)
193
+
194
+ # Map the smoothed means back to the original column (overwrite)
195
+ df[column_name] = df[column_name].map(agg['smooth_mean'])
196
+
197
+ logger.success(f"Target Encoding applied to '{column_name}' and overwritten in place")
198
+ else:
199
+ logger.warning(f"Column '{column_name}' not found in the DataFrame.")
200
+
201
+ return df
202
+
203
+
204
+ # TODO Sales must have the following columns:
205
+ # Item Brand Model, Item Brand Fabric, Net Sales (FA), Product Type, Product Subtype, Product Top Category
206
+ # TODO Inventory must have the following columns:
207
+ # MODEL, FABRIC, COLOUR, MFC, BRAND, item_brand_modelname, item_age_range_category
208
+ # product_brand, composition, product_gender_unified, product_top_category, product_type,
209
+ # product_subtype, sales_season_unified, product_sale_line, image_url_new
210
+ def prepare_inventory(inventory: pd.DataFrame) -> pd.DataFrame:
211
+
212
+ # inventory.rename(columns={
213
+ # 'MODEL': 'Item Brand Model',
214
+ # 'FABRIC': 'Item Brand Fabric',
215
+ # 'COLOUR': 'Item Brand Colour',
216
+ # 'MFC': 'Item Brand Model Fabric Colour',
217
+ # 'BRAND': 'Brand',
218
+ # 'item_brand_modelname': 'Item Brand Model Name',
219
+ # 'item_age_range_category': 'Age Range',
220
+ # 'product_brand': 'Product Brand',
221
+ # 'composition': 'Product Composition',
222
+ # 'product_gender_unified': 'Product Gender',
223
+ # 'product_top_category': 'Product Top Category',
224
+ # 'product_type': 'Product Type',
225
+ # 'product_subtype': 'Product Subtype',
226
+ # 'sales_season_unified': 'Sales Season',
227
+ # 'product_sale_line': 'Product Sale Line',
228
+ # 'image_url_new': 'Product Image Link'
229
+ # }, inplace=True)
230
+
231
+ # inventory = inventory[inventory['Product Brand'] == 'EMPORIO ARMANI']
232
+
233
+ # inventory = drop_columns(inventory, [
234
+ # 'Brand', 'Sales Season', 'Product Sale Line', 'Product Image Link', 'Product Brand'
235
+ # ])
236
+
237
+ # sales['MF'] = sales['Item Brand Model'] + '' + sales['Item Brand Fabric']
238
+ # inventory['MF'] = inventory['Item Brand Model'] + '' + inventory['Item Brand Fabric']
239
+
240
+ # sales['Net Sales (FA)'] = sales['Net Sales (FA)'].abs()
241
+
242
+ # median_prices = sales.groupby('MF')['Net Sales (FA)'].first()
243
+ # inventory['Net Sales (FA)'] = inventory['MF'].map(median_prices)
244
+
245
+ # category_medians = sales.groupby(['Product Type', 'Product Subtype'])['Net Sales (FA)'].median()
246
+
247
+ # top_category_medians = sales.groupby('Product Top Category')['Net Sales (FA)'].median()
248
+
249
+ # inventory['Net Sales (FA)'] = inventory.apply(fill_missing_price, axis=1, category_medians=category_medians)
250
+ # inventory['Net Sales (FA)'] = inventory.apply(fill_missing_price_with_top_category, axis=1, top_category_medians=top_category_medians)
251
+
252
+ # inventory['Product Composition'] = inventory['Product Composition'].str.upper()
253
+ # inventory = add_main_material(inventory)
254
+
255
+ # inventory['Colour Return Percentage'] = 15.0
256
+ # inventory['Net Sales Units (FA)'] = 1
257
+ # inventory['Product Order Count'] = 1
258
+ # inventory['Total Order Value'] = 1
259
+ # inventory['Total Customer Returns'] = 1
260
+ # inventory['Total Customer Purchases'] = 1
261
+ # inventory['Customer Return Percentage'] = 15.0
262
+
263
+ # inventory['Colour Return Percentage'] = inventory.apply(
264
+ # lambda row: json_percentages.get(f"{row['Item Brand Model']} - {row['Item Brand Colour']}", 15.0),
265
+ # axis=1
266
+ # )
267
+
268
+ # inventory = extract_ready_to_wear(inventory)
269
+
270
+ # inventory = drop_columns(inventory, [
271
+ # 'Item Brand Model', 'Item Brand Fabric', 'Item Brand Colour',
272
+ # 'Item Brand Model Fabric Colour', 'Item Brand Model Name', 'Age Range',
273
+ # 'MF', 'Product Composition', 'Product Top Category'
274
+ # ])
275
+
276
+ inventory = drop_columns(inventory, [
277
+ 'Item Brand Model', 'Item Brand Fabric', 'Item Brand Colour'
278
+ ])
279
+
280
+ inventory = inventory.reindex(columns=[
281
+ 'Product Type', 'Product Subtype', 'Product Gender', 'Net Sales (FA)',
282
+ 'Net Sales Units (FA)', 'Product Order Count', 'Total Order Value',
283
+ 'Main Material', 'Colour Return Percentage', 'Total Customer Purchases',
284
+ 'Total Customer Returns', 'Customer Return Percentage'
285
+ ])
286
+
287
+ logger.info(f"Dataset columns: {inventory.columns}")
288
+
289
+ return inventory
290
+
291
+
292
+ def map_inventory(sales: pd.DataFrame, inventory: pd.DataFrame, json_percentages: dict, mapped_inventory_path: Path):
293
+
294
+ """
295
+ Prepares the inventory dataset by processing and enriching it with sales data and additional columns.
296
+
297
+ > This operation works only on a particular formatted file (see dataset documentation)
298
+
299
+ This function performs several transformations to clean and enrich the inventory data:
300
+ - Renames columns for consistency.
301
+ - Filters the inventory to include only the 'EMPORIO ARMANI' brand.
302
+ - Merges sales data with inventory to fill in missing price information based on model and fabric.
303
+ - Fills missing price values based on median prices from the sales data, grouped by product categories.
304
+ - Adds the 'Main Material' column based on the product composition.
305
+ - Assigns default values to certain columns.
306
+ - Calculates the 'Colour Return Percentage' for each inventory item using a provided dictionary.
307
+ - Filters inventory for 'READY TO WEAR' products.
308
+ - Drops unnecessary columns and reorders the remaining columns.
309
+
310
+ Args:
311
+ sales (pd.DataFrame): A DataFrame containing sales data with product details and sales information.
312
+ inventory (pd.DataFrame): A DataFrame containing inventory data to be enriched and transformed.
313
+ json_percentages (dict): A dictionary containing the colour return percentages, with the model and colour as keys.
314
+
315
+ Returns:
316
+ pd.DataFrame: The prepared and enriched inventory DataFrame.
317
+
318
+ Example:
319
+ ```python
320
+ mapped_inventory_path = "data/mapped_inventory.tsv"
321
+ sales_df = pd.read_csv('sales.tsv', sep='\\t')
322
+ inventory_df = pd.read_csv('inventory.tsv', sep='\\t')
323
+
324
+ with open(json_percentage, 'r') as f:
325
+ percentages = json.load(f)
326
+
327
+ inventory = prepare_inventory(sales_df, inventory_df, colour_return_percentages_dict, mapped_inventory_path)
328
+ ```
329
+ """
330
+
331
+ inventory.rename(columns={
332
+ 'MODEL': 'Item Brand Model',
333
+ 'FABRIC': 'Item Brand Fabric',
334
+ 'COLOUR': 'Item Brand Colour',
335
+ 'MFC': 'Item Brand Model Fabric Colour',
336
+ 'BRAND': 'Brand',
337
+ 'item_brand_modelname': 'Item Brand Model Name',
338
+ 'item_age_range_category': 'Age Range',
339
+ 'product_brand': 'Product Brand',
340
+ 'composition': 'Product Composition',
341
+ 'product_gender_unified': 'Product Gender',
342
+ 'product_top_category': 'Product Top Category',
343
+ 'product_type': 'Product Type',
344
+ 'product_subtype': 'Product Subtype',
345
+ 'sales_season_unified': 'Sales Season',
346
+ 'product_sale_line': 'Product Sale Line',
347
+ 'image_url_new': 'Product Image Link'
348
+ }, inplace=True)
349
+
350
+ inventory = inventory[inventory['Product Brand'] == 'EMPORIO ARMANI']
351
+
352
+ inventory = drop_columns(inventory, [
353
+ 'Brand', 'Sales Season', 'Product Sale Line', 'Product Image Link', 'Product Brand'
354
+ ])
355
+
356
+ sales['MF'] = sales['Item Brand Model'] + '' + sales['Item Brand Fabric']
357
+ inventory['MF'] = inventory['Item Brand Model'] + '' + inventory['Item Brand Fabric']
358
+
359
+ sales['Net Sales (FA)'] = sales['Net Sales (FA)'].abs()
360
+
361
+ median_prices = sales.groupby('MF')['Net Sales (FA)'].first()
362
+ inventory['Net Sales (FA)'] = inventory['MF'].map(median_prices)
363
+
364
+ category_medians = sales.groupby(['Product Type', 'Product Subtype'])['Net Sales (FA)'].median()
365
+
366
+ top_category_medians = sales.groupby('Product Top Category')['Net Sales (FA)'].median()
367
+
368
+ inventory['Net Sales (FA)'] = inventory.apply(fill_missing_price, axis=1, category_medians=category_medians)
369
+ inventory['Net Sales (FA)'] = inventory.apply(fill_missing_price_with_top_category, axis=1, top_category_medians=top_category_medians)
370
+
371
+ inventory['Product Composition'] = inventory['Product Composition'].str.upper()
372
+ inventory = add_main_material(inventory)
373
+
374
+ inventory['Colour Return Percentage'] = 15.0
375
+ inventory['Net Sales Units (FA)'] = 1
376
+ inventory['Product Order Count'] = 1
377
+ inventory['Total Order Value'] = 1
378
+ inventory['Total Customer Returns'] = 1
379
+ inventory['Total Customer Purchases'] = 1
380
+ inventory['Customer Return Percentage'] = 15.0
381
+
382
+ inventory['Colour Return Percentage'] = inventory.apply(
383
+ lambda row: json_percentages.get(f"{row['Item Brand Model']} - {row['Item Brand Colour']}", 15.0),
384
+ axis=1
385
+ )
386
+
387
+ inventory = extract_ready_to_wear(inventory)
388
+
389
+ inventory = drop_columns(inventory, [
390
+ 'Item Brand Model Fabric Colour', 'Item Brand Model Name', 'Age Range',
391
+ 'MF', 'Product Composition', 'Product Top Category'
392
+ ])
393
+
394
+ inventory = inventory.reindex(columns=[
395
+ 'Item Brand Model', 'Item Brand Fabric', 'Item Brand Colour',
396
+ 'Product Type', 'Product Subtype', 'Product Gender', 'Net Sales (FA)',
397
+ 'Net Sales Units (FA)', 'Product Order Count', 'Total Order Value',
398
+ 'Main Material', 'Colour Return Percentage', 'Total Customer Purchases',
399
+ 'Total Customer Returns', 'Customer Return Percentage'
400
+ ])
401
+
402
+ inventory.to_csv(mapped_inventory_path, sep='\t', index=False)
403
+
404
+
405
+ # TODO The input row must have Net Sales (FA), Product Type and Product Subtype columns
406
+ def fill_missing_price(row: pd.Series, category_medians: dict):
407
+ """
408
+ Fills missing 'Net Sales (FA)' values based on the median price of the product category.
409
+
410
+ This function checks if the 'Net Sales (FA)' value is missing (NaN) in the row. If it is,
411
+ it attempts to fill the missing value using the median price for the product category,
412
+ which is determined by the combination of 'Product Type' and 'Product Subtype'.
413
+ The median values are provided in the `category_medians` dictionary, where the key is a tuple
414
+ of ('Product Type', 'Product Subtype') and the value is the corresponding median price.
415
+
416
+ Args:
417
+ row (pd.Series): A row of the DataFrame containing product data, including 'Net Sales (FA)',
418
+ 'Product Type', and 'Product Subtype'.
419
+ category_medians (dict): A dictionary with keys as tuples of ('Product Type', 'Product Subtype')
420
+ and values as the median price for that category.
421
+
422
+ Returns:
423
+ float: The 'Net Sales (FA)' value if it is not missing, or the median price for the product category
424
+ if it is missing, or None if no median is found for the category.
425
+ """
426
+
427
+ if pd.isna(row['Net Sales (FA)']):
428
+ product_type = row['Product Type']
429
+ product_subtype = row['Product Subtype']
430
+ return category_medians.get((product_type, product_subtype), None)
431
+ return row['Net Sales (FA)']
432
+
433
+
434
+ # TODO The input row must have Net Sales (FA) and Product Top Category columns
435
+ def fill_missing_price_with_top_category(row: pd.Series, top_category_medians: dict):
436
+ """
437
+ Fills missing 'Net Sales (FA)' values based on the median price of the product's top category.
438
+
439
+ This function checks if the 'Net Sales (FA)' value is missing (NaN) in the row. If it is,
440
+ it attempts to fill the missing value using the median price for the product's 'Product Top Category'.
441
+ The median prices for each 'Product Top Category' are provided in the `top_category_medians` dictionary,
442
+ where the key is the 'Product Top Category' and the value is the corresponding median price.
443
+
444
+ Args:
445
+ row (pd.Series): A row of the DataFrame containing product data, including 'Net Sales (FA)' and
446
+ 'Product Top Category'.
447
+ top_category_medians (dict): A dictionary with keys as 'Product Top Category' and values as the
448
+ median price for that category.
449
+
450
+ Returns:
451
+ float: The 'Net Sales (FA)' value if it is not missing, or the median price for the 'Product Top Category'
452
+ if it is missing, or None if no median is found for the top category.
453
+ """
454
+
455
+ if pd.isna(row['Net Sales (FA)']):
456
+ product_top_category = row['Product Top Category']
457
+ return top_category_medians.get(product_top_category, None)
458
+ return row['Net Sales (FA)']
459
+
460
+
461
+ #####################################################################################
462
+
463
+
464
+ @app.command()
465
+ def main(
466
+ # input_path: Path = PROCESSED_DATA_DIR / "features.tsv",
467
+ scaler_path: Path = MODELS_DIR / "scaler.pkl",
468
+ train_path: Path = PROCESSED_DATA_DIR / "train.tsv",
469
+ sales_path: Path = RAW_DATA_DIR / "sales.xlsx",
470
+ inventory_path: Path = EXTERNAL_DATA_DIR / "inventory.csv",
471
+ json_percentages_file: Path = INTERIM_DATA_DIR / "colour_return_percentage.json"
472
+ # test_path: Path = PROCESSED_DATA_DIR / "test.tsv"
473
+ ):
474
+ # ---- Split dataset into train and test ----
475
+ # try:
476
+ # data = pd.read_csv(input_path, sep='\t')
477
+ # split_data(data, train_path, test_path)
478
+ # except Exception as e:
479
+ # logger.error(f"Error during dataset split: {e}")
480
+ # return
481
+
482
+ # ---- Train and save the scaler ----
483
+ try:
484
+ train_data = pd.read_csv(train_path, sep='\t')
485
+ train_and_save_scaler(train_data, scaler_path)
486
+ except Exception as e:
487
+ logger.error(f"Error during scaler training: {e}")
488
+ return
489
+
490
+ # ---- Prepare inference file ----
491
+ try:
492
+ sales_data = pd.read_excel(sales_path)
493
+ inventory_data = pd.read_csv(inventory_path)
494
+
495
+ with open(json_percentages_file, 'r') as f:
496
+ percentages = json.load(f)
497
+
498
+ mapped_inventory_path = EXTERNAL_DATA_DIR / "inventory.tsv"
499
+ map_inventory(sales_data, inventory_data, percentages, mapped_inventory_path)
500
+ except Exception as e:
501
+ logger.error(f"Error during inventory preparation: {e}")
502
+ return
503
+
504
+
505
+ if __name__ == "__main__":
506
+ app()
product_return_prediction/features.py ADDED
@@ -0,0 +1,401 @@
1
+ from pathlib import Path
2
+ import typer
3
+ import json
4
+ import pandas as pd
5
+ from loguru import logger
6
+
7
+ from product_return_prediction.config import (
8
+ RAW_DATA_DIR, PROCESSED_DATA_DIR, INTERIM_DATA_DIR, TARGET_COLUMN
9
+ )
10
+
11
+ app = typer.Typer()
12
+
13
+
14
+ # TODO The input dataframe must contain Customer Store ID, Item Brand Model Fabric Colour and Return Units (FA) columns
15
+ # TODO The returned dataframe must have an additional column named as the TARGET_COLUMN
16
+ def add_returned(df: pd.DataFrame) -> pd.DataFrame:
17
+ """
18
+ Adds a column with the name of the TARGET_COLUMN to the DataFrame based on the 'Returns Units (FA)' column
19
+ and performs data cleaning operations.
20
+
21
+ The function calculates whether an item has been returned (indicated by a value > 0
22
+ in the 'Returns Units (FA)' column) and groups by 'Customer Store ID' and
23
+ 'Item Brand Model Fabric Colour' to ensure consistency. It then removes rows where
24
+ 'Returns Units (FA)' equals 1.
25
+
26
+ Args:
27
+ df (pd.DataFrame): Input DataFrame containing at least the columns:
28
+ - 'Customer Store ID'
29
+ - 'Item Brand Model Fabric Colour'
30
+ - 'Returns Units (FA)'
31
+
32
+ Returns:
33
+ pd.DataFrame: Modified DataFrame with the following updates:
34
+ - A new column 'Returned' indicating if the item was returned (1) or not (0).
35
+ - Rows where 'Returns Units (FA)' equals 1 are removed.
36
+ """
37
+
38
+ df[TARGET_COLUMN] = df.groupby(
39
+ ['Customer Store ID', 'Item Brand Model Fabric Colour']
40
+ )['Returns Units (FA)'].transform(lambda x: (x.fillna(0) > 0).astype(int))
41
+
42
+ df[TARGET_COLUMN] = df.groupby(
43
+ ['Customer Store ID', 'Item Brand Model Fabric Colour']
44
+ )['Returns Units (FA)'].transform('max')
45
+
46
+ df[TARGET_COLUMN] = df[TARGET_COLUMN].fillna(0).astype(int)
47
+ df = df[df['Returns Units (FA)'] != 1]
48
+
49
+ return df
50
+
51
+
52
+ # TODO The input dataframe must contain the Order Number column
53
+ # TODO The returned dataframe must have an additional column named Product Order Count
54
+ def add_product_order_count(df: pd.DataFrame) -> pd.DataFrame:
55
+ """
56
+ Adds a 'Product Order Count' column to the DataFrame, indicating the number
57
+ of products associated with each order.
58
+
59
+ The function groups the data by 'Order Number', calculates the count of
60
+ products for each order, and merges this information back into the original
61
+ DataFrame as a new column.
62
+
63
+ Args:
64
+ df (pd.DataFrame): Input DataFrame containing the column:
65
+ - 'Order Number': Identifier for each order.
66
+
67
+ Returns:
68
+ pd.DataFrame: Modified DataFrame with an additional column:
69
+ - 'Product Order Count': The count of products per order.
70
+ """
71
+
72
+ order_product_count = df.groupby('Order Number').size().reset_index(name='Product Order Count')
73
+ df = df.merge(order_product_count, on='Order Number', how='left')
74
+ return df
75
+
76
+
77
+ # TODO The input dataframe must contain the Order Number and Net Sales (FA) columns
78
+ # TODO The returned dataframe must have an additional column named Total Order Value
79
+ def add_total_order_value(df: pd.DataFrame) -> pd.DataFrame:
80
+ """
81
+ Adds a 'Total Order Value' column to the DataFrame, representing the total
82
+ net sales value for each order.
83
+
84
+ The function groups the data by 'Order Number', calculates the sum of
85
+ 'Net Sales (FA)' for each order, and merges this information back into the
86
+ original DataFrame as a new column. The total order values are rounded to
87
+ 4 decimal places.
88
+
89
+ Args:
90
+ df (pd.DataFrame): Input DataFrame containing the columns:
91
+ - 'Order Number': Identifier for each order.
92
+ - 'Net Sales (FA)': Net sales value for each item.
93
+
94
+ Returns:
95
+ pd.DataFrame: Modified DataFrame with an additional column:
96
+ - 'Total Order Value': The total net sales value for each order,
97
+ rounded to 4 decimal places.
98
+ """
99
+
100
+ order_total = df.groupby('Order Number')['Net Sales (FA)'].sum().reset_index().round(4)
101
+ order_total = order_total.rename(columns={'Net Sales (FA)': 'Total Order Value'})
102
+ df = df.merge(order_total, on='Order Number', how='left')
103
+ return df
104
+
105
+
106
+ # TODO The input must be a string formatted with percentages followed by material names (e.g. 70% Wool, 30% Nylon)
107
+ # TODO The returned dataframe must be a string, if successfully extracted, otherwise Unknown
108
+ def extract_main_material(composition: str) -> str:
109
+ """
110
+ Extracts the main material from a product composition string.
111
+
112
+ The function identifies the material associated with the first percentage value
113
+ in the input string and returns it. If no percentage or material is found,
114
+ 'Unknown' is returned. Handles cases where the material name spans multiple words.
115
+
116
+ Args:
117
+ composition (str): A string describing the product composition, typically
118
+ containing percentages followed by material names (e.g., "50% Cotton, 50% Polyester").
119
+
120
+ Returns:
121
+ str: The name of the main material (e.g., "Cotton"), or 'Unknown' if the input
122
+ is not a string or if no material can be extracted.
123
+ """
124
+
125
+ if isinstance(composition, str):
126
+ parts = composition.split()
127
+ material = []
128
+ for i, part in enumerate(parts):
129
+ if "%" in part:
130
+ if i + 1 < len(parts):
131
+ material.append(parts[i + 1])
132
+ j = i + 2
133
+ while j < len(parts) and not any(c.isdigit() for c in parts[j]):
134
+ material.append(parts[j])
135
+ j += 1
136
+ break
137
+ return " ".join(material) if material else 'Unknown'
138
+ return 'Unknown'
139
+
140
+
141
+ # TODO The input dataframe must contain the Product Top Category column
142
+ # TODO The returned dataframe must have only records with Product Top Category equal to READY TO WEAR
143
+ def extract_ready_to_wear(df: pd.DataFrame) -> pd.DataFrame:
144
+ """
145
+ Filters the DataFrame to include only rows where the product belongs to
146
+ the 'READY TO WEAR' category.
147
+
148
+ The function selects rows where the 'Product Top Category' column is equal to
149
+ 'READY TO WEAR' and returns the filtered DataFrame.
150
+
151
+ Args:
152
+ df (pd.DataFrame): Input DataFrame containing the column:
153
+ - 'Product Top Category': The top-level category of the product.
154
+
155
+ Returns:
156
+ pd.DataFrame: A filtered DataFrame containing only rows where
157
+ 'Product Top Category' equals 'READY TO WEAR'.
158
+ """
159
+
160
+ df = df[df['Product Top Category'] == 'READY TO WEAR']
161
+ return df
162
+
163
+
164
+ # TODO The input dataframe must contain the Product Composition column
165
+ # TODO The returned dataframe must have an additional column named Main Material
166
+ def add_main_material(df: pd.DataFrame) -> pd.DataFrame:
167
+ """
168
+ Adds a 'Main Material' column to the DataFrame by extracting the primary material
169
+ from the 'Product Composition' column.
170
+
171
+ The function applies the `extract_main_material` function to each value in the
172
+ 'Product Composition' column to determine the main material and stores the result
173
+ in a new column, 'Main Material'.
174
+
175
+ Args:
176
+ df (pd.DataFrame): Input DataFrame containing the column:
177
+ - 'Product Composition': A string describing the composition of the product,
178
+ typically with percentages followed by material names.
179
+
180
+ Returns:
181
+ pd.DataFrame: Modified DataFrame with an additional column:
182
+ - 'Main Material': The extracted primary material from 'Product Composition'.
183
+ If the composition is invalid or a material cannot be determined, the
184
+ value will be 'Unknown'.
185
+ """
186
+
187
+ df['Main Material'] = df['Product Composition'].apply(extract_main_material)
188
+ return df
189
+
190
+
191
+ # TODO The input dataframe must contain Item Brand Model, Item Brand Colour and TARGET_COLUMN columns
192
+ # TODO The returned dataframe must have an additional column named Colour Return Percentage
193
+ # TODO The json_file must be a valid Path object specifying the location to save the JSON file
194
+ # TODO The JSON file must have the number of rows equal to the number of distinct Model-Colour items in the dataframe
195
+ def add_colour_return_percentage(df: pd.DataFrame, json_file: Path) -> pd.DataFrame:
196
+ """
197
+ Adds a 'Colour Return Percentage' column to the DataFrame and saves the results
198
+ as a JSON file. The percentage is calculated for each combination of
199
+ 'Item Brand Model' and 'Item Brand Colour' based on the returned and total counts.
200
+
201
+ Args:
202
+ df (pd.DataFrame): Input DataFrame containing the columns:
203
+ - 'Item Brand Model': The model identifier for the item.
204
+ - 'Item Brand Colour': The color associated with the item.
205
+ - TARGET_COLUMN (e.g., 'Returned'): Indicates whether an item was returned (1) or not (0).
206
+
207
+ json_file (Path): Path to save the JSON file containing the 'Colour Return Percentage'
208
+ for each 'Model - Colour' combination.
209
+
210
+ Returns:
211
+ pd.DataFrame: Modified DataFrame with an additional column:
212
+ - 'Colour Return Percentage': The percentage of returns for each
213
+ 'Item Brand Model' and 'Item Brand Colour' combination, rounded to 1 decimal place.
214
+ """
215
+
216
+ grouped = df.groupby(['Item Brand Model', 'Item Brand Colour'])
217
+ returned_count = grouped[TARGET_COLUMN].sum().reset_index(name='Returned Count')
218
+ total_count = grouped.size().reset_index(name='Total Count')
219
+
220
+ result = pd.merge(returned_count, total_count, on=['Item Brand Model', 'Item Brand Colour'])
221
+ result['Colour Return Percentage'] = (
222
+ (result['Returned Count'] / result['Total Count']) * 100
223
+ ).round(1)
224
+
225
+ df = pd.merge(df, result[
226
+ ['Item Brand Model', 'Item Brand Colour', 'Colour Return Percentage']
227
+ ], on=['Item Brand Model', 'Item Brand Colour'], how='left')
228
+
229
+ # Create a dictionary for JSON output
230
+ colour_return_percentage_json = df.assign(
231
+ Model_Colour=df['Item Brand Model'] + ' - ' + df['Item Brand Colour']
232
+ ).set_index('Model_Colour')['Colour Return Percentage'].to_dict()
233
+
234
+ with open(json_file, 'w') as f:
235
+ json.dump(colour_return_percentage_json, f, indent=4)
236
+
237
+ return df
238
+
239
+
240
+ # TODO The input dataframe must contain Customer Store ID column
241
+ # TODO The returned dataframe must have an additional column named Total Customer Purchases
242
+ def add_total_customer_purchases(df: pd.DataFrame) -> pd.DataFrame:
243
+ """
244
+ Adds a 'Total Customer Purchases' column to the DataFrame, representing the total
245
+ number of purchases made by each customer.
246
+
247
+ The function groups the data by 'Customer Store ID', calculates the total count
248
+ of purchases for each customer, and merges this information back into the original
249
+ DataFrame as a new column.
250
+
251
+ Args:
252
+ df (pd.DataFrame): Input DataFrame containing the column:
253
+ - 'Customer Store ID': The unique identifier for each customer.
254
+
255
+ Returns:
256
+ pd.DataFrame: Modified DataFrame with an additional column:
257
+ - 'Total Customer Purchases': The total number of purchases for each
258
+ customer.
259
+ """
260
+
261
+ total_purchases = df.groupby('Customer Store ID').size().reset_index(name='Total Customer Purchases')
262
+ df = pd.merge(df, total_purchases, on='Customer Store ID', how='left')
263
+ return df
264
+
265
+
266
+ # TODO The input dataframe must contain Customer Store ID and TARGET_COLUMN columns
267
+ # TODO The returned dataframe must have an additional column named Total Customer Returns
268
+ def add_total_customer_returns(df: pd.DataFrame) -> pd.DataFrame:
269
+ """
270
+ Adds a 'Total Customer Returns' column to the DataFrame, representing the total
271
+ number of returns made by each customer.
272
+
273
+ The function groups the data by 'Customer Store ID', calculates the sum of
274
+ the returns (as indicated by the `TARGET_COLUMN`), and merges this information
275
+ back into the original DataFrame as a new column.
276
+
277
+ Args:
278
+ df (pd.DataFrame): Input DataFrame containing the columns:
279
+ - 'Customer Store ID': The unique identifier for each customer.
280
+ - TARGET_COLUMN (e.g., 'Returned'): A column indicating the return status
281
+ of a purchase (typically 1 for returned, 0 for not returned).
282
+
283
+ Returns:
284
+ pd.DataFrame: Modified DataFrame with an additional column:
285
+ - 'Total Customer Returns': The total number of returns made by each
286
+ customer based on the `TARGET_COLUMN`.
287
+ """
288
+
289
+ total_returns = df.groupby('Customer Store ID')[TARGET_COLUMN].sum().reset_index(name='Total Customer Returns')
290
+ df = df.merge(total_returns, on='Customer Store ID', how='left')
291
+ return df
292
+
293
+
294
+ # TODO The input dataframe must contain Customer Store ID, Total Customer Returns and Total Customer Purchases columns
295
+ # TODO The returned dataframe must have an additional column named Customer Return Percentage
296
+ def add_customer_return_percentage(df: pd.DataFrame) -> pd.DataFrame:
297
+ """
298
+ Adds a 'Customer Return Percentage' column to the DataFrame, representing the
299
+ percentage of returns made by each customer relative to their total purchases.
300
+
301
+ The function groups the data by 'Customer Store ID', calculates the total number
302
+ of returns and total purchases for each customer, then computes the return percentage
303
+ and merges this information back into the original DataFrame as a new column.
304
+
305
+ Args:
306
+ df (pd.DataFrame): Input DataFrame containing the columns:
307
+ - 'Customer Store ID': The unique identifier for each customer.
308
+ - 'Total Customer Returns': The total number of returns made by each customer.
309
+ - 'Total Customer Purchases': The total number of purchases made by each customer.
310
+
311
+ Returns:
312
+ pd.DataFrame: Modified DataFrame with an additional column:
313
+ - 'Customer Return Percentage': The percentage of returns made by each customer,
314
+ rounded to 1 decimal place.
315
+ """
316
+
317
+ user_grouped = df.groupby('Customer Store ID')[
318
+ ['Total Customer Returns', 'Total Customer Purchases']
319
+ ].sum().reset_index()
320
+
321
+ user_grouped['Customer Return Percentage'] = (
322
+ (user_grouped['Total Customer Returns'] / user_grouped['Total Customer Purchases']) * 100
323
+ ).round(1)
324
+
325
+ df = pd.merge(df, user_grouped[['Customer Store ID', 'Customer Return Percentage']],
326
+ on='Customer Store ID', how='left')
327
+ return df
328
+
329
+
330
+ @app.command()
331
+ def main(
332
+ input_path: Path = RAW_DATA_DIR / "sales.xlsx",
333
+ train_path: Path = PROCESSED_DATA_DIR / "train.tsv",
334
+ test_path: Path = PROCESSED_DATA_DIR / "test.tsv",
335
+ json_file: Path = INTERIM_DATA_DIR / "colour_return_percentage.json",
336
+ # output_path: Path = PROCESSED_DATA_DIR / "features.tsv",
337
+ ):
338
+ from product_return_prediction.dataset import xlsx_to_tsv, drop_columns, split_data
339
+
340
+ logger.info("Generating features from dataset...")
341
+
342
+ # ---- Prepare data for feature engineering ----
343
+ tsv_path: Path = INTERIM_DATA_DIR / "dataset.tsv"
344
+ xlsx_to_tsv(input_path, tsv_path)
345
+
346
+ df = pd.read_csv(tsv_path, sep='\t')
347
+
348
+ cols_to_drop_1 = [
349
+ 'Year Gregorian',
350
+ 'Month Gregorian',
351
+ 'Month Gregorian Name',
352
+ 'Country',
353
+ 'Variant WCS',
354
+ 'Age Range',
355
+ 'Product Image Link',
356
+ 'Returns Value (FA)',
357
+ 'Returns Units (FA)',
358
+ 'Return Reason Group',
359
+ 'Return Reason'
360
+ ]
361
+
362
+ cols_to_drop_2 = [
363
+ 'Date (Date format)',
364
+ 'Customer Store ID',
365
+ 'Order Number',
366
+ 'Order Line Number',
367
+ 'Item Brand Model',
368
+ 'Item Brand Fabric',
369
+ 'Item Brand Colour',
370
+ 'Item Brand Model Fabric Colour',
371
+ 'Product Composition',
372
+ 'Product Top Category'
373
+ ]
374
+
375
+ # ---- Perform feature engineering pipeline ----
376
+ df = add_returned(df)
377
+ df = drop_columns(df, cols_to_drop_1)
378
+ df = add_product_order_count(df)
379
+ df = add_total_order_value(df)
380
+ df = add_main_material(df)
381
+ df = add_colour_return_percentage(df, json_file)
382
+ df = add_total_customer_purchases(df)
383
+ df = add_total_customer_returns(df)
384
+ df = add_customer_return_percentage(df)
385
+ df = extract_ready_to_wear(df)
386
+ split_data(df, train_path, test_path, id_column="Customer Store ID")
387
+
388
+ train = pd.read_csv(train_path, sep='\t')
389
+ test = pd.read_csv(test_path, sep='\t')
390
+
391
+ train = drop_columns(train, cols_to_drop_2)
392
+ test = drop_columns(test, cols_to_drop_2)
393
+
394
+ train.to_csv(train_path, sep='\t', index=False)
395
+ test.to_csv(test_path, sep='\t', index=False)
396
+
397
+ logger.success("Features generation complete.")
398
+
399
+
400
+ if __name__ == "__main__":
401
+ app()
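To make the order-level helpers above concrete, here is a small illustrative sketch; the toy data is invented, and only `add_product_order_count` and `add_total_order_value` from this file are exercised.

```python
import pandas as pd

from product_return_prediction.features import add_product_order_count, add_total_order_value

# Toy example: two orders spread over three order lines (values are made up).
toy = pd.DataFrame({
    "Order Number": [1, 1, 2],
    "Net Sales (FA)": [100.0, 50.0, 80.0],
})

toy = add_product_order_count(toy)  # order 1 -> 2 products, order 2 -> 1 product
toy = add_total_order_value(toy)    # order 1 -> 150.0, order 2 -> 80.0
print(toy)
```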
product_return_prediction/modeling/__init__.py ADDED
File without changes
product_return_prediction/modeling/eval.py ADDED
@@ -0,0 +1,101 @@
+ import pickle
+ import typer
+ import json
+
+ import seaborn as sns
+ import pandas as pd
+ import matplotlib.pyplot as plt
+
+ from loguru import logger
+ from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
+ from pathlib import Path
+ from codecarbon import EmissionsTracker
+
+ from product_return_prediction.dataset import scale_data_with_trained_scaler
+ from product_return_prediction.config import (
+     MODELS_DIR,
+     PROCESSED_DATA_DIR,
+     TARGET_COLUMN,
+     REPORTS_DIR
+ )
+
+ app = typer.Typer()
+
+
+ def evaluate_model(test_data: pd.DataFrame, scaler_file: Path, model: any, model_name: str):
+     """
+     Evaluates the performance of a trained model on the provided test data. It includes scaling the features
+     using a pre-trained scaler, making predictions, computing accuracy, generating a classification report,
+     and visualizing the confusion matrix.
+
+     This function scales the test data using a pre-trained scaler, applies the trained model to make predictions,
+     and calculates key performance metrics, including accuracy. It then generates a detailed classification report,
+     saves the report to a JSON file, and plots the confusion matrix to visually assess model performance.
+
+     Args:
+         test_data (pd.DataFrame): The test dataset, which includes both features and the target column.
+         scaler_file (Path): Path to the pre-trained scaler file, used to scale the feature columns.
+         model (any): The trained model object, used to make predictions on the test data.
+         model_name (str): The name of the model, used for saving the evaluation report.
+
+     Example:
+         ```python
+         evaluate_model(test_data, scaler_file='scaler.pkl', model=model, model_name='log_reg')
+         ```
+     """
+
+     X_test = test_data.drop(columns=[TARGET_COLUMN]).copy()
+     y_test = test_data[TARGET_COLUMN].copy()
+
+     X_test = scale_data_with_trained_scaler(X_test, scaler_file)
+
+     cc_file = f"{model_name}_emissions.csv"
+     tracker = EmissionsTracker(project_name="eval", output_dir=REPORTS_DIR, output_file=cc_file)
+     tracker.start()
+
+     y_pred = model.predict(X_test)
+
+     tracker.stop()
+
+     accuracy = accuracy_score(y_test, y_pred)
+     logger.info(f"Accuracy: {accuracy * 100:.2f}%")
+
+     report = classification_report(y_test, y_pred)
+     logger.info(f"Classification Report:\n{report}")
+
+     report = classification_report(y_test, y_pred, output_dict=True)
+     with open(REPORTS_DIR / f"{model_name}.json", "w") as json_file:
+         json.dump(report, json_file, indent=4)
+
+     cm = confusion_matrix(y_test, y_pred)
+     sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=model.classes_, yticklabels=model.classes_)
+     plt.title("Confusion Matrix")
+     plt.xlabel("Predicted Labels")
+     plt.ylabel("True Labels")
+
+     # Saving the confusion matrix in the reports/figures directory
+     plt.savefig(REPORTS_DIR / f"figures/cm_{model_name}.png", dpi=300, bbox_inches='tight')
+     plt.close()
+
+
+ @app.command()
+ def main(
+     test_file: Path = PROCESSED_DATA_DIR / "test.tsv",
+     scaler_file: Path = MODELS_DIR / "scaler.pkl",
+     log_reg_model_path: Path = MODELS_DIR / "log_reg.pkl",
+     svm_model_path: Path = MODELS_DIR / "svm.pkl",
+ ):
+     test_data = pd.read_csv(test_file, sep='\t')
+
+     with open(log_reg_model_path, "rb") as f:
+         log_reg = pickle.load(f)
+
+     with open(svm_model_path, "rb") as f:
+         svm = pickle.load(f)
+
+     evaluate_model(test_data, scaler_file, log_reg, "log_reg_eval")
+     evaluate_model(test_data, scaler_file, svm, "svm_eval")
+
+
+ if __name__ == "__main__":
+     app()
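A minimal sketch of calling `evaluate_model` outside the Typer command, assuming the scaler and SVM pickles produced by `train.py` already sit at their default locations:

```python
import pickle
import pandas as pd

from product_return_prediction.config import MODELS_DIR, PROCESSED_DATA_DIR
from product_return_prediction.modeling.eval import evaluate_model

test_data = pd.read_csv(PROCESSED_DATA_DIR / "test.tsv", sep="\t")

with open(MODELS_DIR / "svm.pkl", "rb") as f:
    svm = pickle.load(f)

# Writes {model_name}.json and figures/cm_{model_name}.png under REPORTS_DIR as side effects.
evaluate_model(test_data, MODELS_DIR / "scaler.pkl", svm, "svm_eval")
```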
product_return_prediction/modeling/predict.py ADDED
@@ -0,0 +1,60 @@
+ from pathlib import Path
+
+ import typer
+ import pickle
+ import json
+ import pandas as pd
+ from loguru import logger
+ from codecarbon import EmissionsTracker
+
+ from product_return_prediction.config import MODELS_DIR, INTERIM_DATA_DIR, EXTERNAL_DATA_DIR, REPORTS_DIR, RAW_DATA_DIR
+ from product_return_prediction.dataset import prepare_inventory, scale_data_with_trained_scaler
+
+ app = typer.Typer()
+
+
+ @app.command()
+ def main(
+     sales_path: Path = RAW_DATA_DIR / "sales.xlsx",
+     inventory_path: Path = EXTERNAL_DATA_DIR / "inventory.csv",
+     json_percentage: Path = INTERIM_DATA_DIR / "colour_return_percentage.json",
+     scaler_file: Path = MODELS_DIR / "scaler.pkl",
+     model_path: Path = MODELS_DIR / "svm.pkl",
+ ):
+     sales = pd.read_excel(sales_path)
+     inventory = pd.read_csv(inventory_path)
+
+     with open(json_percentage, 'r') as f:
+         percentages = json.load(f)
+
+     # ---- Prepare inventory data for inference ----
+     inventory = prepare_inventory(sales, inventory, percentages)
+
+     with open(model_path, "rb") as f:
+         model = pickle.load(f)
+
+     # ---- Scale 5 random rows from the inventory ----
+     random_row = inventory.sample(n=5)
+     logger.info(f"Your product:\n {random_row}")
+     random_row = scale_data_with_trained_scaler(random_row, scaler_file)
+
+     # ---- Compute predictions and probabilities ----
+     cc_file = "svm_predict_emissions.csv"
+     tracker = EmissionsTracker(project_name="eval", output_dir=REPORTS_DIR, output_file=cc_file)
+     tracker.start()
+
+     predictions = model.predict(random_row)
+     probabilities = model.predict_proba(random_row)
+
+     tracker.stop()
+
+     for pred, prob in zip(predictions, probabilities):
+         prob_confidence = prob.max()
+         if pred == 1:
+             logger.info(f"The product will be returned with {prob_confidence:.2f} confidence")
+         else:
+             logger.info(f"The product will NOT be returned with {prob_confidence:.2f} confidence")
+
+
+ if __name__ == "__main__":
+     app()
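One note on the confidence values logged above, with a small sketch that reuses the variables from the command (illustrative only):

```python
# For SVC(probability=True), predict_proba is calibrated via Platt scaling, so the
# reported confidence comes from a different path than the label given by model.predict().
probabilities = model.predict_proba(random_row)  # shape (n_rows, 2), columns follow model.classes_
confidence = probabilities.max(axis=1)           # per-row confidence, as logged above
```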
product_return_prediction/modeling/train.py ADDED
@@ -0,0 +1,143 @@
+ import pickle
+ from pathlib import Path
+
+ import dagshub
+ import mlflow
+ import pandas as pd
+ import typer
+ from loguru import logger
+ from sklearn.linear_model import LogisticRegression
+ from sklearn.model_selection import GridSearchCV
+ from sklearn.svm import SVC
+ from codecarbon import EmissionsTracker
+
+ from product_return_prediction.dataset import scale_data_with_trained_scaler
+ from product_return_prediction.config import (
+     MODELS_DIR,
+     PROCESSED_DATA_DIR,
+     TARGET_COLUMN,
+     REPORTS_DIR
+ )
+
+ dagshub.init(repo_owner='se4ai2425-uniba', repo_name='product-return-prediction', mlflow=True)
+
+ app = typer.Typer()
+
+
+ # TODO The training dataset must have the following columns:
+ # Product Type, Product Subtype, Product Gender, Net Sales (FA), Net Sales Units (FA)
+ # TARGET_COLUMN, Product Order Count, Total Order Value, Main Material, Colour Return Percentage
+ # Total Customer Purchases, Total Customer Returns, Customer Return Percentage
+ # TODO The scaler and model paths must be Pickle (.pkl) files
+ def train_log_reg(train_data: pd.DataFrame, scaler_file: Path, model_path: Path):
+     """
+     Trains a Logistic Regression model using the provided training data, applies feature scaling,
+     and saves the trained model to a specified file.
+
+     This function trains a Logistic Regression model using the training data. The feature columns are
+     scaled using a pre-trained scaler before fitting the model. The model is then saved to the specified
+     file path, and the training process is tracked using MLflow.
+
+     Args:
+         train_data (pd.DataFrame): The training data, including features and target column.
+         scaler_file (Path): Path to the pre-trained scaler file, used to scale the feature columns.
+         model_path (Path): Path where the trained Logistic Regression model will be saved.
+     """
+
+     run_name = model_path.stem
+     mlflow.start_run(run_name=run_name)
+     mlflow.sklearn.autolog()
+
+     # Apply scaling to the feature columns (excluding the target column)
+     X_train = train_data.drop(columns=[TARGET_COLUMN]).copy()
+     y_train = train_data[TARGET_COLUMN].copy()
+
+     # Scale X_train using the pre-trained scaler
+     X_train = scale_data_with_trained_scaler(X_train, scaler_file)
+
+     # Initialize the Logistic Regression model
+     model = LogisticRegression(max_iter=1000, class_weight="balanced")
+     logger.info(f"Model: {model}")
+
+     cc_file = "log_reg_train_emissions.csv"
+     tracker = EmissionsTracker(project_name="train", output_dir=REPORTS_DIR, output_file=cc_file)
+     tracker.start()
+
+     # Fit the model to the training data
+     model.fit(X_train, y_train)
+
+     tracker.stop()
+     mlflow.end_run()
+
+     # Save the trained model to disk
+     with open(model_path, "wb") as f:
+         pickle.dump(model, f)
+     logger.success(f"Model saved to {model_path}")
+
+
+ # TODO The training dataset must have the following columns:
+ # Product Type, Product Subtype, Product Gender, Net Sales (FA), Net Sales Units (FA)
+ # TARGET_COLUMN, Product Order Count, Total Order Value, Main Material, Colour Return Percentage
+ # Total Customer Purchases, Total Customer Returns, Customer Return Percentage
+ # TODO The scaler and model paths must be Pickle (.pkl) files
+ def train_svm(train_data: pd.DataFrame, scaler_file: Path, model_path: Path):
+     """
+     Trains a Support Vector Machine (SVM) classifier using the provided training data, applies feature scaling,
+     performs hyperparameter tuning via grid search, and saves the trained model to a specified file.
+
+     This function trains an SVM model with hyperparameter optimization using grid search. The feature columns
+     are scaled using a pre-trained scaler before fitting the model. The trained model is saved to the specified
+     file path, and the training process is tracked using MLflow.
+
+     Args:
+         train_data (pd.DataFrame): The training data, including features and target column.
+         scaler_file (Path): Path to the pre-trained scaler file, used to scale the feature columns.
+         model_path (Path): Path where the trained SVM model will be saved.
+     """
+
+     run_name = model_path.stem
+     mlflow.start_run(run_name=run_name)
+     mlflow.sklearn.autolog()
+
+     X_train = train_data.drop(columns=[TARGET_COLUMN]).copy()
+     y_train = train_data[TARGET_COLUMN].copy()
+
+     X_train = scale_data_with_trained_scaler(X_train, scaler_file)
+
+     param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf"], "gamma": ["scale", "auto"]}
+
+     logger.info("Starting Grid Search for best hyperparameters")
+     grid_search = GridSearchCV(SVC(probability=True), param_grid, scoring="balanced_accuracy", cv=10)
+     grid_search.fit(X_train, y_train)
+     model = grid_search.best_estimator_
+
+     cc_file = "svm_train_emissions.csv"
+     tracker = EmissionsTracker(project_name="train", output_dir=REPORTS_DIR, output_file=cc_file)
+     tracker.start()
+
+     model.fit(X_train, y_train)
+
+     tracker.stop()
+     mlflow.end_run()
+
+     with open(model_path, "wb") as f:
+         pickle.dump(model, f)
+     logger.success(f"Model saved to {model_path}")
+
+
+ @app.command()
+ def main(
+     train_file: Path = PROCESSED_DATA_DIR / "train.tsv",
+     scaler_file: Path = MODELS_DIR / "scaler.pkl",
+     log_reg_model_path: Path = MODELS_DIR / "log_reg.pkl",
+     svm_model_path: Path = MODELS_DIR / "svm.pkl",
+ ):
+     train_data = pd.read_csv(train_file, sep='\t')
+
+     # ---- Train models ----
+     train_log_reg(train_data, scaler_file, log_reg_model_path)
+     train_svm(train_data, scaler_file, svm_model_path)
+
+
+ if __name__ == "__main__":
+     app()
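Since only the best estimator from the grid search is pickled, one way to check which hyperparameters it ended up with is to reload it and inspect its parameters; a sketch, assuming training has already produced `models/svm.pkl`:

```python
import pickle

from product_return_prediction.config import MODELS_DIR

with open(MODELS_DIR / "svm.pkl", "rb") as f:
    svm = pickle.load(f)

# C, kernel and gamma correspond to the param_grid defined in train_svm above.
print({k: v for k, v in svm.get_params().items() if k in ("C", "kernel", "gamma")})
```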
product_return_prediction/plots.py ADDED
@@ -0,0 +1,29 @@
+ from pathlib import Path
+
+ import typer
+ from loguru import logger
+ from tqdm import tqdm
+
+ from product_return_prediction.config import FIGURES_DIR, PROCESSED_DATA_DIR
+
+ app = typer.Typer()
+
+
+ @app.command()
+ def main(
+     # ---- REPLACE DEFAULT PATHS AS APPROPRIATE ----
+     input_path: Path = PROCESSED_DATA_DIR / "dataset.csv",
+     output_path: Path = FIGURES_DIR / "plot.png",
+     # -----------------------------------------
+ ):
+     # ---- REPLACE THIS WITH YOUR OWN CODE ----
+     logger.info("Generating plot from data...")
+     for i in tqdm(range(10), total=10):
+         if i == 5:
+             logger.info("Something happened for iteration 5.")
+     logger.success("Plot generation complete.")
+     # -----------------------------------------
+
+
+ if __name__ == "__main__":
+     app()
pyproject.toml ADDED
@@ -0,0 +1,36 @@
+ [build-system]
+ requires = ["flit_core >=3.2,<4"]
+ build-backend = "flit_core.buildapi"
+
+ [project]
+ name = "product_return_prediction"
+ version = "0.0.1"
+ description = "Analyze past orders and returns to predict which products are more likely to be returned."
+ authors = [
+     { name = "Molinari-Pinto-Tanzi" },
+ ]
+ license = { file = "LICENSE" }
+ readme = "README.md"
+ classifiers = [
+     "Programming Language :: Python :: 3",
+     "License :: OSI Approved :: MIT License"
+ ]
+ requires-python = "~=3.12"
+
+ [tool.black]
+ line-length = 99
+ include = '\.pyi?$'
+ exclude = '''
+ /(
+     \.git
+   | \.venv
+ )/
+ '''
+
+ [tool.ruff.lint.isort]
+ known_first_party = ["product_return_prediction"]
+ force_sort_within_sections = true
+
+ [tool.pytest.ini_options]
+ log_cli = true
+ log_cli_level = "INFO"
requirements.txt ADDED
@@ -0,0 +1,27 @@
+ black
+ codecarbon
+ fastapi
+ flake8
+ ipython
+ isort
+ jupyterlab
+ loguru
+ matplotlib
+ mkdocs
+ notebook
+ numpy
+ pandas
+ pip
+ python-dotenv
+ scikit-learn
+ tqdm
+ typer
+ dvc
+ dvc-gdrive
+ mlflow
+ dagshub
+ great-expectations
+ pytest
+ openpyxl
+ uvicorn
+ seaborn