molinari135 committed on
Commit
a1a7d89
1 Parent(s): 91a6159

Initial commit

Dockerfile ADDED
@@ -0,0 +1,31 @@
1
+ FROM python:3.12-slim
2
+
3
+ ARG WORKDIR=/app
4
+ WORKDIR $WORKDIR
5
+
6
+ RUN python -m pip install --upgrade pip==23.3.1
7
+ RUN apt-get update && apt-get install -y --no-install-recommends \
8
+ build-essential && \
9
+ rm -rf /var/lib/apt/lists/*
10
+
11
+ COPY product_return_prediction $WORKDIR/product_return_prediction
12
+
13
+ # COPY product_return_prediction/api.py $WORKDIR/product_return_prediction
14
+ # COPY product_return_prediction/config.py $WORKDIR/product_return_prediction
15
+ # COPY product_return_prediction/dataset.py $WORKDIR/product_return_prediction
16
+ # COPY product_return_prediction/features.py $WORKDIR/product_return_prediction
17
+
18
+ COPY README.md $WORKDIR/
19
+ COPY requirements.txt $WORKDIR/
20
+ COPY pyproject.toml $WORKDIR/
21
+
22
+ COPY data/external/inventory.tsv $WORKDIR/data/external/
23
+ COPY models/scaler.pkl $WORKDIR/models/
24
+ COPY models/svm.pkl $WORKDIR/models/
25
+
26
+ RUN pip install --no-cache-dir -r requirements.txt
27
+ # RUN pip install --no-cache-dir .
28
+
29
+ EXPOSE 7860
30
+
31
+ CMD ["uvicorn", "product_return_prediction.api:app", "--host", "0.0.0.0", "--port", "7860", "--reload"]
LICENSE ADDED
@@ -0,0 +1,10 @@
1
+
2
+ The MIT License (MIT)
3
+ Copyright (c) 2024, Molinari-Pinto-Tanzi
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
6
+
7
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
8
+
9
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
10
+
README.md DELETED
@@ -1,12 +0,0 @@
1
- ---
2
- title: Product Return Prediction
3
- emoji: 🏆
4
- colorFrom: green
5
- colorTo: red
6
- sdk: docker
7
- pinned: false
8
- license: mit
9
- short_description: 'No'
10
- ---
11
-
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
data/README.md ADDED
@@ -0,0 +1,158 @@
1
+ <!-- ---
2
+ # For reference on dataset card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1
3
+ # Doc / guide: https://huggingface.co/docs/hub/datasets-cards
4
+ {{ card_data }}
5
+ --- -->
6
+
7
+ # Sales Dataset Card
8
+
9
+ <!-- Provide a quick summary of the dataset. -->
10
+ This dataset contains 28,692 rows of data related to Emporio Armani's e-commerce order lines. Each row represents an order line, detailing whether the product was returned or not. The dataset includes features such as:
11
+
12
+ - **Date and Time Information**: Year (Gregorian), month (Gregorian), month name, and the exact purchase date (in date format).
13
+ - **Customer Information**: Store ID of the customer associated with the transaction.
14
+ - **Order Line Details**: Order number and order line number to uniquely identify each purchase.
15
+ - **Geographical Information**: Country where the purchase was made.
16
+ - **Product Details**: Variant code (WCS), Item Brand model, Fabric, Color, a combination of model, fabric, and color, and the product composition. Additionally, the dataset specifies the product's top category, type, and subtype, along with its gender target, age range, and a link to the product image.
17
+ - **Return Information**: Return reason group and detailed reason for the return (if applicable).
18
+ - **Financial and Quantitative Data**: Net sales (value and units), return value, and return units for each transaction.
19
+
20
+ The dataset is slightly unbalanced, with only 23% of transactions involving returned products.
21
+
22
+
23
+ <!-- ## Dataset Details -->
24
+
25
+ <!-- ### Dataset Description -->
26
+
27
+ <!-- Provide a longer summary of what this dataset is. -->
28
+
29
+ <!-- - **Curated by**: Molinari-Pinto-Tanzi -->
30
+ <!-- - **Funded by**: Armani -->
31
+ <!-- - **Shared by [optional]:** {{ shared_by | default("[More Information Needed]", true)}}
32
+ - **Language(s) (NLP):** {{ language | default("[More Information Needed]", true)}} -->
33
+ <!-- - **License:** {{ license | default("[More Information Needed]", true)}} -->
34
+
35
+ <!-- ## Dataset Sources -->
36
+
37
+ <!-- Provide the basic links for the dataset. -->
38
+
39
+ <!-- - **GitHub Repository**: [Product Return Prediction on GitHub](https://github.com/se4ai2425-uniba/product-return-prediction) -->
40
+ <!-- - **DagsHub Repository**: [Product Return Prediction on DagsHub](https://dagshub.com/se4ai2425-uniba/product-return-prediction) -->
41
+ <!-- - **Demo [optional]:** {{ demo | default("[More Information Needed]", true)}} -->
42
+
43
+ ## Uses
44
+
45
+ <!-- Address questions around how the dataset is intended to be used. -->
46
+ The dataset can be used to support a variety of use cases in the field of e-commerce analytics and machine learning. Some of the main intended uses are:
47
+
48
+ - **Predictive Modeling**: Training and evaluating machine learning models for binary classification tasks, such as predicting the likelihood of a product being returned based on its characteristics and transaction details.
49
+ - **Exploratory Data Analysis (EDA)**: Analyzing patterns and trends in product returns to identify factors that influence customer behavior, such as product type, composition, or age range.
50
+ - **Feature Engineering**: Using the dataset to develop and test new features for predictive models, such as aggregating return reasons or combining product composition data.
51
+ - **Unbalanced Data Research**: Studying machine learning techniques and strategies to handle imbalanced datasets, as the target variable is not evenly distributed.
52
+
53
+ ### Direct Use
54
+
55
+ <!-- This section describes suitable use cases for the dataset. -->
56
+ The dataset can be used in the following cases:
57
+ - Train a **binary classification** model to predict if a product will be returned
58
+ - Train a **regression** model to predict the probability that a product will be returned
59
+ - Train a **multi-class classification** model to predict the reason for a return
60
+
61
+
62
+ ## Dataset Structure
63
+
64
+ <!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
65
+
66
+ The dataset presents the following features:
67
+
68
+ | Feature name | Description |
69
+ | --- | --- |
70
+ | Year Gregorian | Gregorian year of the purchase (e.g.: `2023`) |
71
+ | Month Gregorian | Gregorian month of the purchase (e.g.: `01/2023`, indicating January 2023) |
72
+ | Month Gregorian Name | Abbreviated Gregorian month name of the purchase (e.g.: `Jan`) |
73
+ | Date (Date Format) | Date of the purchase (e.g.: `2023-01-02`, indicating the 2nd of January 2023) |
74
+ | Customer Store ID | Numerical code that identifies the user |
75
+ | Order Number | Alphanumerical code that identifies the receipt to which the purchase belongs |
76
+ | Order Line Number | Integer corresponding to the position of the product within the receipt (if the product is part of a receipt containing 5 products, the order line number is a value between 1 and 5) |
77
+ | Country | Country in which the product was purchased |
78
+ | Variant WCS | Alternative identifier of the receipt, with 1:1 correspondence with Order Number |
79
+ | Item Brand Model | Alphanumerical code indicating the model of the purchased product |
80
+ | Item Brand Fabric | Alphanumerical code indicating the fabric of the purchased product |
81
+ | Item Brand Colour | Alphanumerical code indicating the colour of the purchased product |
82
+ | Item Brand Model Fabric Colour | Alphanumerical code, the combination of the codes of Model, Fabric, and Colour |
83
+ | Product Composition | Information on the percentage of materials that make up the purchased product (e.g.: `43% COTTON 29% WOOL 28% ACRYLIC`) |
84
+ | Product Top Category | Macrocategory to which the purchased product belongs (e.g.: `READY TO WEAR`) |
85
+ | Product Type | Type of garment corresponding to the product purchased (e.g.: `JACKETS`) |
86
+ | Product Subtype | Subtype of garment corresponding to the product purchased (e.g.: `DOUBLE BREASTED JACKETS`) |
87
+ | Age Range | Value that could be `ADULT`, `JUNIOR` or `BABY` |
88
+ | Product Gender | Value that could be `MALE` or `FEMALE` |
89
+ | Product Image Link | URL of the purchased product's images (about 20% of the products do not have a corresponding link) |
90
+ | Return Reason Group | Set of 9 values, identified by a string, corresponding to the macrocategory of the product return reason. A `#N/A#` value corresponds to an unreturned product |
91
+ | Return Reason | Set of 26 values, identified by a string, corresponding to the specific category of the product return reason. A `#N/A#` value corresponds to an unreturned product |
92
+ | Net Sales (FA) | Value, in Euros, of the product purchased |
93
+ | Net Sales Units (FA) | Value describing whether the product was returned or not (`-1` means the product was returned, otherwise the value is `1`) |
94
+ | Returns Value (FA) | Same value as the net sales; it is populated only if the product is returned |
95
+ | Return Units (FA) | Value is `1.0` only if the product is returned, otherwise it is null |
96
+
97
+ ## Dataset Creation
98
+
99
+ ### Data Collection and Processing
100
+
101
+ <!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
102
+
103
+ A feature engineering pipeline has been applied to the dataset as follows:
104
+
105
+ 1. Added a new column named `Returned` that contains a flag identifying whether a product has been returned, based on the `Return Units (FA)` column
106
+ 2. Removed `Year Gregorian`, `Month Gregorian`, `Month Gregorian Name`, `Country`, `Age Range`, `Product Image Link`, `Returns Value (FA)`, `Returns Units (FA)`, `Return Reason Group` and `Return Reason` because they were not useful for training
107
+ 3. Removed `Variant WCS` to remove additional IDs
108
+ 4. Added a new column named `Product Order Count` that tells the number of products belonging to the same order as the selected product based on `Order Number` and `Order Line Number`
109
+ 5. Added a new column named `Total Order Value` performing the sum of every product belonging to the same order based on `Order Number` and `Net Sales (FA)` columns
110
+ 6. Added a new column named `Main Material` which contains the first material that can be found in the `Product Composition` column
111
+ 7. Added a new column named `Colour Return Percentage` that estimates the return likelihood of a product based on its `Item Brand Model` and `Item Brand Colour`
112
+ - This operation also produced a JSON file that helps obtain known values starting from `Item Brand Model` and `Item Brand Colour`; otherwise a median value is used based on the product's `Product Top Category`
113
+ 8. Added a new column named `Total Customer Purchases` that tells the number of purchases, within the year, made by the customer who purchased that product
114
+ 9. Added a new column named `Total Customer Returns` that tells the number of returns, within the year, made by the customer who purchased that product
115
+ 10. Added a new column named `Customer Return Percentage` that shows the return rate of the customer who bought that product
116
+ 11. Selected only those rows belonging to `READY TO WEAR` as `Product Top Category`
117
+ 12. Removed `Date (Date format)`, `Customer Store ID`, `Order Number`, `Order Line Number`, `Item Brand Model`, `Item Brand Fabric`, `Item Brand Colour`, `Item Brand Model Fabric Colour`, `Product Composition`, `Product Top Category`
118
+
119
+ After performing all these operations, all the categorical features have been converted into numerical ones using a **Target Encoding** technique with smoothing, to avoid partial-ordering issues during training. A `StandardScaler` trained only on the training split has been applied at the end of the whole process to prepare the data for training, evaluation and inference.
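A condensed sketch of the smoothed target encoding described above (the full routine is added later in this commit as `target_encode_columns` in `product_return_prediction/dataset.py`; the column and target names below are only illustrative):

```python
import pandas as pd


def smoothed_target_encode(df: pd.DataFrame, column: str, target: str, smoothing: float = 1.0) -> pd.Series:
    # Weighted average between the per-category target mean and the global target mean:
    # rare categories are pulled towards the global mean, frequent ones keep their own mean.
    global_mean = df[target].mean()
    agg = df.groupby(column)[target].agg(["mean", "count"])
    smooth = (agg["mean"] * agg["count"] + global_mean * smoothing) / (agg["count"] + smoothing)
    return df[column].map(smooth)


# Illustrative usage on the processed dataset:
# df["Product Type"] = smoothed_target_encode(df, "Product Type", "Returned")
```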
120
+
121
+ The new dataset contains the following features:
122
+
123
+ | Feature | Description |
124
+ |---|---|
125
+ | Product Type | Type of garment corresponding to the product purchased (e.g.: `JACKETS`) |
126
+ | Product Subtype | Subtype of garment corresponding to the product purchased (e.g.: `DOUBLE BREASTED JACKETS`) |
127
+ | Product Gender | Value that could be `MALE` or `FEMALE` |
128
+ | Net Sales (FA) | Value, in Euros, of the product purchased |
129
+ | Net Sales Units (FA) | Value describing the number of products purchased or returned (always `1`) |
130
+ | Returned | `1` if the product has been returned, `0` otherwise |
131
+ | Product Order Count | Number of products belonging to the same order |
132
+ | Total Order Value | Sum of every product belonging to the same order, in Euros |
133
+ | Main Material | Material the product is mainly made of |
134
+ | Colour Return Percentage | Likelihood of the product being returned, based on its colour |
135
+ | Total Customer Purchases | Number of purchases made by the user that bought or returned that product |
136
+ | Total Customer Returns | Number of returns made by the user that bought or returned that product |
137
+ | Customer Return Percentage | Likelihood of the product being returned, based on the customer's behaviour |
138
+
139
+ This new dataset has been split into two files, `train.tsv` and `test.tsv`, using an 80-20 split.
140
+
141
+
142
+ ### Personal and Sensitive Information
143
+
144
+ <!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
145
+ The dataset does not contain personal or sensitive information. The only reference to customers is the customer ID associated with orders; no sensitive information about customers is involved.
146
+
147
+ ## Bias, Risks, and Limitations
148
+
149
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
150
+
151
+ The dataset presents an unbalanced distribution of returns and purchases: returns account for only 23% of the records. If this is not taken into account, models trained on this dataset could produce slightly biased results. Additionally, data exploration shows no correlation between features before feature engineering is applied.
152
+
153
+
154
+ ### Recommendations
155
+
156
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
157
+
158
+ Deeper data exploration and feature engineering are suggested to achieve better training results with this dataset.
data/external/.gitignore ADDED
@@ -0,0 +1,2 @@
1
+ /inventory.csv
2
+ /inventory.tsv
data/external/.gitkeep ADDED
File without changes
data/external/inventory.csv.dvc ADDED
@@ -0,0 +1,5 @@
1
+ outs:
2
+ - md5: b2d0ccf46d96499bcaa47052bb57bbba
3
+ size: 4413122
4
+ hash: md5
5
+ path: inventory.csv
models/.gitignore ADDED
@@ -0,0 +1,3 @@
1
+ /log_reg.pkl
2
+ /svm.pkl
3
+ /scaler.pkl
models/.gitkeep ADDED
File without changes
models/README.md ADDED
@@ -0,0 +1,85 @@
1
+ # Model Card for Product Return Prediction
2
+
3
+ ## Model Details
4
+
5
+ - **person or organization developing model**: team product-return-prediction
6
+ - **model date**: 24/11/2024
7
+ - **model version**: v1.4
8
+ - **model type**: Support Vector Machine
9
+
10
+ <!-- algorithm description -->
11
+ This model is a **Support Vector Machine** classifier designed to predict whether a product will be returned, based on various product and transaction features. Hyperparameters (C, kernel type and gamma) are chosen using a grid search with 10-fold cross-validation.
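A minimal sketch of the hyperparameter search described above; the exact grids used for v1.4 are not part of this commit, so the values below are only illustrative:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative search space over C, kernel type and gamma, with 10-fold cross-validation
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
}
search = GridSearchCV(SVC(probability=True), param_grid, cv=10)
# search.fit(X_train, y_train)        # X_train, y_train: the scaled training split
# best_model = search.best_estimator_
```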
12
+
13
+ ## Intended Use
14
+
15
+ ### Primary Intended Uses
16
+
17
+ <!-- description of the model's use -->
18
+ The purpose of the model is to assist e-commerce owners (Armani) in identifying likely returns among purchases, so that inventories can be reorganized to optimize product handling and transportation costs.
19
+
20
+ ### Primary Intended Users
21
+
22
+ <!-- description of the users -->
23
+ The model was developed for Armani. Specifically, the purpose is to support professional figures involved in logistics, product management, and marketing.
24
+
25
+ <!-- ### out-of scope use cases -->
26
+
27
+ ## Factors
28
+
29
+ ### Relevant Factors
30
+
31
+ <!-- factors to consider -->
32
+ Some factors to consider when using the model are the following:
33
+
34
+ - **product features**: characteristics like model, fabric, colour, composition, and product category may have a significant impact on the likelihood of a product being returned
35
+ - **imbalanced classes**: the class imbalance is a relevant factor that may affect the model's ability to predict the minority class (returns) accurately
36
+
37
+ ### Decision Thresholds
38
+
39
+ <!-- description of selected thresholds -->
40
+ The default decision threshold for the SVM model is 0.5, where probabilities greater than or equal to 0.5 indicate a "returned" prediction, and probabilities below 0.5 indicate "not returned."
41
+
42
+ ## Train and Test Data
43
+
44
+ ### Dataset Description
45
+
46
+ - **dataset**: *German Sales 2023 EA*
47
+
48
+ The model was trained and tested on this dataset, following appropriate splitting and pre-processing steps.
49
+
50
+ ### Split
51
+
52
+ Dataset splitting is as follows:
53
+ - **training**: 80%
54
+ - **validation and test**: 20%
55
+
56
+ The split is performed using the corresponding scikit-learn function. The chosen random state is 42.
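A simplified sketch of this split; note that `split_data` in `product_return_prediction/dataset.py` (added in this commit) splits on unique customer IDs, so one customer's orders never end up in both sets. The input path below is only illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/processed/features.tsv", sep="\t")  # illustrative path to the processed dataset

# 80/20 split on customer IDs with the fixed random state
train_ids, test_ids = train_test_split(df["Customer Store ID"].unique(), test_size=0.2, random_state=42)
train_df = df[df["Customer Store ID"].isin(train_ids)]
test_df = df[df["Customer Store ID"].isin(test_ids)]
```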
57
+
58
+ ### Pre-processing
59
+
60
+ To adapt the data to the binary classification task, and to a numerical model such as an SVM, the dataset underwent an extensive pre-processing phase. The pre-processing steps are the following:
61
+
62
+ 1. Dataset conversion from Excel to TSV
63
+ 2. Specific columns removal from dataframe
64
+ 3. Train and test data splitting
65
+ 4. Train and save scaler
66
+ 5. Scaling data with a pre-trained scaler
67
+ 6. Target encoding of categorical columns
68
+ 7. Preparation of inventory with sales data
69
+ 8. Population of missing values
70
+ 9. Calculation and application of return percentages by color
71
+ 10. Final cleaning and processing
72
+
73
+ ## Quantitative Analysis
74
+
75
+ | | PRECISION | RECALL | F1-SCORE | Support |
76
+ |-----------|-----------|-----------|-----------|-----------|
77
+ | No return | 0.95 | 0.95 | 0.95 | 2086 |
78
+ | Return | 0.89 | 0.90 | 0.89 | 960 |
79
+ | Accuracy | | | | 0.93 |
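For reference, a report of this shape can be produced with scikit-learn; a small sketch, assuming the held-out labels and the fitted SVM's predictions are available:

```python
from sklearn.metrics import classification_report


def report(y_test, y_pred) -> str:
    # 0 = "No return" (negative class), 1 = "Return" (positive class)
    return classification_report(y_test, y_pred, target_names=["No return", "Return"])
```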
80
+
81
+
82
+ <!-- ### unitary results -->
83
+
84
+ <!-- ### intersectional results -->
85
+
product_return_prediction/__init__.py ADDED
@@ -0,0 +1 @@
1
+ from product_return_prediction import config # noqa: F401
product_return_prediction/api.py ADDED
@@ -0,0 +1,144 @@
1
+ from fastapi import FastAPI, HTTPException
2
+ from pydantic import BaseModel, Field
3
+ import pandas as pd
4
+ import json
5
+ import pickle
6
+ from pathlib import Path
7
+ from product_return_prediction.dataset import prepare_inventory, scale_data_with_trained_scaler
8
+ from product_return_prediction.config import MODELS_DIR, EXTERNAL_DATA_DIR
9
+
10
+ app = FastAPI(
11
+ title="Product Return Prediction API",
12
+ description="This API predicts whether a product will be returned based on products and user behavior.",
13
+ version="0.1.0"
14
+ )
15
+
16
+
17
+ class ProductRequest(BaseModel):
18
+ models: list[str] = Field(
19
+ ...,
20
+ example=["01CA9T", "0NG3DT"]
21
+ )
22
+ fabrics: list[str] = Field(
23
+ ...,
24
+ example=["0130C", "02003"]
25
+ )
26
+ colours: list[str] = Field(
27
+ ...,
28
+ example=["922", "999"]
29
+ )
30
+ total_customer_purchases: int = Field(
31
+ ...,
32
+ example=1
33
+ )
34
+ total_customer_returns: int = Field(
35
+ ...,
36
+ example=0
37
+ )
38
+
39
+
40
+ def load_json(file_path: Path) -> dict:
41
+ """Load a JSON file and return its content."""
42
+ try:
43
+ with open(file_path, 'r') as f:
44
+ return json.load(f)
45
+ except Exception as e:
46
+ raise HTTPException(status_code=500, detail=f"Error reading JSON file {file_path}: {e}")
47
+
48
+
49
+ def filter_inventory_by_combinations(inventory: pd.DataFrame, models: list, fabrics: list, colours: list) -> pd.DataFrame:
50
+ """Filter inventory based on the product combinations."""
51
+ filtered_inventory = pd.DataFrame()
52
+ for model, fabric, colour in zip(models, fabrics, colours):
53
+ matching_rows = inventory[
54
+ (inventory['Item Brand Model'] == model) & (inventory['Item Brand Fabric'] == fabric) & (inventory['Item Brand Colour'] == colour)
55
+ ]
56
+ filtered_inventory = pd.concat([filtered_inventory, matching_rows])
57
+ return filtered_inventory
58
+
59
+
60
+ def load_model(model_path: Path):
61
+ """Load the trained model and scaler."""
62
+ try:
63
+ with open(model_path, 'rb') as f:
64
+ model = pickle.load(f)
65
+ return model
66
+ except Exception as e:
67
+ raise HTTPException(status_code=500, detail=f"Error loading model: {e}")
68
+
69
+
70
+ def apply_scaling(data: pd.DataFrame, scaler) -> pd.DataFrame:
71
+ """Scale the data using the pre-trained scaler."""
72
+ try:
73
+ return scale_data_with_trained_scaler(data, scaler)
74
+ except Exception as e:
75
+ raise HTTPException(status_code=500, detail=f"Error scaling data: {e}")
76
+
77
+
78
+ def make_predictions(model, scaled_data: pd.DataFrame):
79
+ """Make predictions using the trained model."""
80
+ try:
81
+ predictions = model.predict(scaled_data)
82
+ probabilities = model.predict_proba(scaled_data)
83
+ return predictions, probabilities
84
+ except Exception as e:
85
+ raise HTTPException(status_code=500, detail=f"Error making predictions: {e}")
86
+
87
+
88
+ def prepare_inventory_data(filtered_inventory: pd.DataFrame, total_customer_purchases: int, total_customer_returns: int) -> pd.DataFrame:
89
+ """Prepare and filter inventory data based on provided sales and percentages."""
90
+
91
+ prepared_inventory = prepare_inventory(filtered_inventory)
92
+
93
+ num_items = len(filtered_inventory)
94
+ prepared_inventory['Product Order Count'] = num_items
95
+ prepared_inventory['Total Order Value'] = prepared_inventory['Net Sales Units (FA)'].sum()
96
+ if total_customer_purchases != 0:
97
+ prepared_inventory['Customer Return Percentage'] = (total_customer_returns / total_customer_purchases) * 100
98
+ else:
99
+ prepared_inventory['Customer Return Percentage'] = 0.0
100
+
101
+ return prepared_inventory
102
+
103
+
104
+ @app.get("/")
105
+ async def root():
106
+ return {
107
+ "message": "Welcome to the Product Return Prediction API! Use /predict to make predictions."
108
+ }
109
+
110
+
111
+ @app.post("/predict/")
112
+ async def predict(products: ProductRequest):
113
+ inventory_path: Path = EXTERNAL_DATA_DIR / "inventory.tsv"
114
+ model_path: Path = MODELS_DIR / "svm.pkl"
115
+ scaler_file: Path = MODELS_DIR / "scaler.pkl"
116
+
117
+ inventory = pd.read_csv(inventory_path, sep='\t')
118
+
119
+ filtered_inventory = filter_inventory_by_combinations(
120
+ inventory, products.models, products.fabrics, products.colours
121
+ )
122
+
123
+ if filtered_inventory.empty:
124
+ raise HTTPException(status_code=404, detail="No matching products found")
125
+
126
+ prepared_inventory = prepare_inventory_data(
127
+ filtered_inventory, products.total_customer_purchases, products.total_customer_returns
128
+ )
129
+
130
+ model = load_model(model_path)
131
+
132
+ scaled_inventory = apply_scaling(prepared_inventory, scaler_file)
133
+ predictions, probabilities = make_predictions(model, scaled_inventory)
134
+
135
+ result = [
136
+ {
137
+ "product": f"{row[0]}-{row[1]}-{row[2]}",
138
+ "prediction": "Return" if pred == 1 else "No Return",
139
+ "confidence": f"{prob.max():.2f}"
140
+ }
141
+ for row, pred, prob in zip(filtered_inventory.itertuples(index=False), predictions, probabilities)
142
+ ]
143
+
144
+ return {"predictions": result}
product_return_prediction/app.py ADDED
@@ -0,0 +1,79 @@
1
+ import gradio as gr
2
+ import requests
3
+
4
+ # FastAPI endpoint URL
5
+ API_URL = "http://localhost:8000/predict/"
6
+
7
+
8
+ # Gradio Interface function
9
+ def predict_return(selected_products, total_customer_purchases, total_customer_returns):
10
+ # Input validation for returns (must be <= purchases)
11
+ if total_customer_returns > total_customer_purchases:
12
+ return "Error: Total returns cannot be greater than total purchases."
13
+
14
+ # Prepare the request data
15
+ models = []
16
+ fabrics = []
17
+ colours = []
18
+
19
+ for selected_product in selected_products:
20
+ # Split each selected product into model, fabric, and color
21
+ model, fabric, color = selected_product.split("-")
22
+ models.append(model)
23
+ fabrics.append(fabric)
24
+ colours.append(color)
25
+
26
+ # Prepare the data to send to the API
27
+ data = {
28
+ "models": models,
29
+ "fabrics": fabrics,
30
+ "colours": colours,
31
+ "total_customer_purchases": total_customer_purchases,
32
+ "total_customer_returns": total_customer_returns
33
+ }
34
+
35
+ print(data)
36
+
37
+ try:
38
+ # Make the POST request to the FastAPI endpoint
39
+ response = requests.post(API_URL, json=data)
40
+ response.raise_for_status() # Raise an error for bad responses
41
+
42
+ # Get the predictions and return them
43
+ result = response.json()
44
+ predictions = result.get('predictions', [])
45
+
46
+ if not predictions:
47
+ return "Error: No predictions found."
48
+
49
+ # Format the output to display nicely
50
+ formatted_result = "\n".join([f"Prediction: {pred['prediction']} | Confidence: {pred['confidence']}%" for pred in predictions])
51
+ return formatted_result
52
+
53
+ except requests.exceptions.RequestException as e:
54
+ return f"Error: {str(e)}"
55
+
56
+
57
+ # Predefined list of model-fabric-color combinations
58
+ combinations = [
59
+ "01CA9T-0130C-922",
60
+ "0NG3DT-02003-999",
61
+ "3R1F67-1JCYZ-0092",
62
+ "211740-3R419-06935",
63
+ "6R1J75-1DQSZ-0943"
64
+ ]
65
+
66
+ # Gradio interface elements
67
+ interface = gr.Interface(
68
+ fn=predict_return, # Function that handles the prediction logic
69
+ inputs=[
70
+ gr.CheckboxGroup(choices=combinations, label="Select Products"), # Allow multiple product selections
71
+ gr.Slider(0, 10, step=1, label="Total Customer Purchases", value=0),
72
+ gr.Slider(0, 10, step=1, label="Total Customer Returns", value=0)
73
+ ],
74
+ outputs="text", # Display predictions as text
75
+ live=True # To enable the interface to interact live
76
+ )
77
+
78
+ # Launch the Gradio interface
79
+ interface.launch()
product_return_prediction/config.py ADDED
@@ -0,0 +1,49 @@
1
+ from pathlib import Path
2
+
3
+ from dotenv import load_dotenv
4
+ from loguru import logger
5
+
6
+ # Load environment variables from .env file if it exists
7
+ load_dotenv()
8
+
9
+ # Paths
10
+ PROJ_ROOT = Path(__file__).resolve().parents[1]
11
+ logger.info(f"PROJ_ROOT path is: {PROJ_ROOT}")
12
+
13
+ DATA_DIR = PROJ_ROOT / "data"
14
+ RAW_DATA_DIR = DATA_DIR / "raw"
15
+ INTERIM_DATA_DIR = DATA_DIR / "interim"
16
+ PROCESSED_DATA_DIR = DATA_DIR / "processed"
17
+ EXTERNAL_DATA_DIR = DATA_DIR / "external"
18
+
19
+ CATEGORICAL_DATA_DIR = PROCESSED_DATA_DIR / "cat_dataset"
20
+ NUMERICAL_DATA_DIR = PROCESSED_DATA_DIR / "num_dataset"
21
+
22
+ CATEGORICAL_TRAIN_DATA_FILE = CATEGORICAL_DATA_DIR / "train.tsv"
23
+ CATEGORICAL_VAL_DATA_FILE = CATEGORICAL_DATA_DIR / "val.tsv"
24
+ CATEGORICAL_TEST_DATA_FILE = CATEGORICAL_DATA_DIR / "test.tsv"
25
+
26
+ NUMERICAL_TRAIN_DATA_FILE = NUMERICAL_DATA_DIR / "train.tsv"
27
+ NUMERICAL_VAL_DATA_FILE = NUMERICAL_DATA_DIR / "val.tsv"
28
+ NUMERICAL_TEST_DATA_FILE = NUMERICAL_DATA_DIR / "test.tsv"
29
+
30
+ LABELS_DIR = INTERIM_DATA_DIR / "labels"
31
+
32
+ MODELS_DIR = PROJ_ROOT / "models"
33
+
34
+ REPORTS_DIR = PROJ_ROOT / "reports"
35
+ FIGURES_DIR = REPORTS_DIR / "figures"
36
+
37
+ RANDOM_SEED = 42
38
+
39
+ TARGET_COLUMN = "Returned"
40
+
41
+ # If tqdm is installed, configure loguru with tqdm.write
42
+ # https://github.com/Delgan/loguru/issues/135
43
+ try:
44
+ from tqdm import tqdm
45
+
46
+ logger.remove(0)
47
+ logger.add(lambda msg: tqdm.write(msg, end=""), colorize=True)
48
+ except ModuleNotFoundError:
49
+ pass
product_return_prediction/dataset.py ADDED
@@ -0,0 +1,506 @@
1
+ import pickle
2
+ import json
3
+ from pathlib import Path
4
+
5
+ import pandas as pd
6
+ import typer
7
+ from loguru import logger
8
+ from sklearn.model_selection import train_test_split
9
+ from sklearn.preprocessing import StandardScaler
10
+
11
+ from product_return_prediction.features import add_main_material, extract_ready_to_wear
12
+ from product_return_prediction.config import (
13
+ PROCESSED_DATA_DIR,
14
+ MODELS_DIR,
15
+ RANDOM_SEED,
16
+ TARGET_COLUMN,
17
+ EXTERNAL_DATA_DIR,
18
+ INTERIM_DATA_DIR,
19
+ RAW_DATA_DIR
20
+ )
21
+
22
+ app = typer.Typer()
23
+
24
+
25
+ # TODO The input file must be the path to an Excel file (.xlsx)
26
+ # TODO The output file must be the path where the resulting TSV file will be saved
27
+ def xlsx_to_tsv(input_file: Path, output_file: Path):
28
+ """
29
+ Converts an Excel (.xlsx) file to a Tab-Separated Values (.tsv) file.
30
+
31
+ The function reads data from an Excel file, then writes the data to a TSV file
32
+ (using tab as the delimiter). It logs any errors that occur during reading
33
+ or writing the files.
34
+
35
+ Args:
36
+ input_file (Path): The path to the input Excel file (.xlsx).
37
+ output_file (Path): The path where the output TSV file should be saved.
38
+ """
39
+
40
+ try:
41
+ xlsx_data = pd.read_excel(input_file)
42
+ except Exception as e:
43
+ logger.error(f"Error reading {input_file}: {e}")
44
+ return
45
+
46
+ try:
47
+ xlsx_data.to_csv(output_file, sep='\t', index=False)
48
+ except Exception as e:
49
+ logger.error(f"Error writing to {output_file}: {e}")
50
+
51
+
52
+ # TODO The columns to drop must exist in the input dataframe
53
+ def drop_columns(df: pd.DataFrame, columns_to_drop: list) -> pd.DataFrame:
54
+ """
55
+ Removes specified columns from the DataFrame.
56
+
57
+ This function takes a DataFrame and a list of column names to be dropped,
58
+ and returns a new DataFrame with those columns removed.
59
+
60
+ Args:
61
+ df (pd.DataFrame): The input DataFrame from which columns will be removed.
62
+ columns_to_drop (list): A list of column names (strings) to be removed from the DataFrame.
63
+
64
+ Returns:
65
+ pd.DataFrame: A new DataFrame with the specified columns removed.
66
+ """
67
+
68
+ return df.drop(columns=columns_to_drop)
69
+
70
+
71
+ def split_data(df: pd.DataFrame, train_file: Path, test_file: Path, id_column: str = "Customer Store ID"):
72
+ """
73
+ Splits the input DataFrame into training and testing datasets based on unique values
74
+ of a specified column, and saves them as TSV files.
75
+
76
+ Args:
77
+ df (pd.DataFrame): The input DataFrame to be split.
78
+ train_file (Path): The file path where the training dataset should be saved as a TSV file.
79
+ test_file (Path): The file path where the testing dataset should be saved as a TSV file.
80
+ id_column (str): The column name used for splitting the DataFrame into groups.
81
+ """
82
+ unique_ids = df[id_column].unique()
83
+
84
+ train_ids, test_ids = train_test_split(unique_ids, test_size=0.2, random_state=RANDOM_SEED)
85
+
86
+ train_df = df[df[id_column].isin(train_ids)]
87
+ test_df = df[df[id_column].isin(test_ids)]
88
+
89
+ train_df.to_csv(train_file, sep='\t', index=False)
90
+ test_df.to_csv(test_file, sep='\t', index=False)
91
+
92
+ logger.info(f"Training data saved to {train_file}")
93
+ logger.info(f"Testing data saved to {test_file}")
94
+
95
+
96
+ # TODO The scaler file must be the path where the trained scaler will be saved
97
+ def train_and_save_scaler(train_df: pd.DataFrame, scaler_file: Path):
98
+ """
99
+ Trains a scaler on the training data and saves it to a file.
100
+
101
+ This function applies target encoding to specific categorical columns in the training
102
+ dataset, scales the numeric columns using `StandardScaler`, and then saves the trained
103
+ scaler to a file for later use.
104
+
105
+ Args:
106
+ train_df (pd.DataFrame): The training DataFrame containing the data to be scaled.
107
+ scaler_file (Path): The file path where the trained scaler will be saved.
108
+ """
109
+
110
+ scaler = StandardScaler()
111
+
112
+ train_df = target_encode_columns(train_df, [
113
+ 'Product Type', 'Product Subtype', 'Product Gender', 'Main Material'
114
+ ], 'Colour Return Percentage')
115
+
116
+ train_df = scaler.fit_transform(train_df.drop(columns=[TARGET_COLUMN]))
117
+
118
+ with open(scaler_file, 'wb') as f:
119
+ pickle.dump(scaler, f)
120
+ logger.info(f"Scaler trained and saved to {scaler_file}")
121
+
122
+
123
+ # TODO The scaler file must be the path of the scaler in a Pickle (.pkl) format
124
+ def scale_data_with_trained_scaler(df: pd.DataFrame, scaler_file: Path) -> pd.DataFrame:
125
+ """
126
+ Scales the input DataFrame using a previously trained scaler.
127
+
128
+ This function loads a pre-trained `StandardScaler` from a file, applies target encoding
129
+ to specific categorical columns, and then scales the numeric columns in the DataFrame
130
+ using the loaded scaler.
131
+
132
+ Args:
133
+ df (pd.DataFrame): The input DataFrame to be scaled, containing both categorical and numeric features.
134
+ scaler_file (Path): The file path from which the pre-trained scaler will be loaded.
135
+
136
+ Returns:
137
+ pd.DataFrame: A DataFrame with the numeric columns scaled using the loaded scaler.
138
+ """
139
+
140
+ with open(scaler_file, 'rb') as f:
141
+ scaler = pickle.load(f)
142
+
143
+ df = target_encode_columns(df, [
144
+ 'Product Type', 'Product Subtype', 'Product Gender', 'Main Material'
145
+ ], 'Colour Return Percentage')
146
+
147
+ if TARGET_COLUMN in df.columns:
148
+ df = scaler.transform(df.drop(columns=[TARGET_COLUMN]))
149
+ else:
150
+ df = scaler.transform(df)
151
+
152
+ logger.info(f"Data scaled using scaler from {scaler_file}")
153
+
154
+ return df
155
+
156
+
157
+ # TODO The column names and the target must exist in the dataframe
158
+ def target_encode_columns(df: pd.DataFrame, column_names: list, target: str, smoothing: float = 1.0) -> pd.DataFrame:
159
+ """
160
+ Applies target encoding to specified categorical columns in the DataFrame.
161
+
162
+ Target encoding involves replacing the categorical values with the mean of the target variable,
163
+ smoothed by a global mean. This helps to reduce overfitting, especially when dealing with sparse categories.
164
+
165
+ Args:
166
+ df (pd.DataFrame): The input DataFrame containing the columns to encode and the target variable.
167
+ column_names (list): A list of categorical column names in the DataFrame that need to be target-encoded.
168
+ target (str): The name of the target column to calculate the encoding based on.
169
+ smoothing (float, optional): The smoothing factor to control the weight between category mean and global mean. Default is 1.0.
170
+
171
+ Returns:
172
+ pd.DataFrame: The DataFrame with the target-encoded columns. The original columns are overwritten.
173
+
174
+ Example:
175
+ ```python
176
+ df = target_encode_columns(df, ['Product Type', 'Product Subtype'], 'Sales')
177
+ ```
178
+ """
179
+
180
+ for column_name in column_names:
181
+ if column_name in df.columns:
182
+ logger.info(f"Applying Target Encoding to '{column_name}'")
183
+
184
+ # Compute global mean of the target
185
+ global_mean = df[target].mean()
186
+
187
+ # Group by the categorical column and compute target mean and count
188
+ agg = df.groupby(column_name)[target].agg(['mean', 'count'])
189
+ agg.columns = ['mean', 'count']
190
+
191
+ # Apply smoothing: weighted average between category mean and global mean
192
+ agg['smooth_mean'] = (agg['mean'] * agg['count'] + global_mean * smoothing) / (agg['count'] + smoothing)
193
+
194
+ # Map the smoothed means back to the original column (overwrite)
195
+ df[column_name] = df[column_name].map(agg['smooth_mean'])
196
+
197
+ logger.success(f"Target Encoding applied to '{column_name}' and overwritten in place")
198
+ else:
199
+ logger.warning(f"Column '{column_name}' not found in the DataFrame.")
200
+
201
+ return df
202
+
203
+
204
+ # TODO Sales must have the following columns:
205
+ # Item Brand Model, Item Brand Fabric, Net Sales (FA), Product Type, Product Subtype, Product Top Category
206
+ # TODO Inventory must have the following columns:
207
+ # MODEL, FABRIC, COLOUR, MFC, BRAND, item_brand_modelname, item_age_range_category
208
+ # product_brand, composition, product_gender_unified, product_top_category, product_type,
209
+ # product_subtype, sales_season_unified, product_sale_line, image_url_new
210
+ def prepare_inventory(inventory: pd.DataFrame) -> pd.DataFrame:
211
+
212
+ # inventory.rename(columns={
213
+ # 'MODEL': 'Item Brand Model',
214
+ # 'FABRIC': 'Item Brand Fabric',
215
+ # 'COLOUR': 'Item Brand Colour',
216
+ # 'MFC': 'Item Brand Model Fabric Colour',
217
+ # 'BRAND': 'Brand',
218
+ # 'item_brand_modelname': 'Item Brand Model Name',
219
+ # 'item_age_range_category': 'Age Range',
220
+ # 'product_brand': 'Product Brand',
221
+ # 'composition': 'Product Composition',
222
+ # 'product_gender_unified': 'Product Gender',
223
+ # 'product_top_category': 'Product Top Category',
224
+ # 'product_type': 'Product Type',
225
+ # 'product_subtype': 'Product Subtype',
226
+ # 'sales_season_unified': 'Sales Season',
227
+ # 'product_sale_line': 'Product Sale Line',
228
+ # 'image_url_new': 'Product Image Link'
229
+ # }, inplace=True)
230
+
231
+ # inventory = inventory[inventory['Product Brand'] == 'EMPORIO ARMANI']
232
+
233
+ # inventory = drop_columns(inventory, [
234
+ # 'Brand', 'Sales Season', 'Product Sale Line', 'Product Image Link', 'Product Brand'
235
+ # ])
236
+
237
+ # sales['MF'] = sales['Item Brand Model'] + '' + sales['Item Brand Fabric']
238
+ # inventory['MF'] = inventory['Item Brand Model'] + '' + inventory['Item Brand Fabric']
239
+
240
+ # sales['Net Sales (FA)'] = sales['Net Sales (FA)'].abs()
241
+
242
+ # median_prices = sales.groupby('MF')['Net Sales (FA)'].first()
243
+ # inventory['Net Sales (FA)'] = inventory['MF'].map(median_prices)
244
+
245
+ # category_medians = sales.groupby(['Product Type', 'Product Subtype'])['Net Sales (FA)'].median()
246
+
247
+ # top_category_medians = sales.groupby('Product Top Category')['Net Sales (FA)'].median()
248
+
249
+ # inventory['Net Sales (FA)'] = inventory.apply(fill_missing_price, axis=1, category_medians=category_medians)
250
+ # inventory['Net Sales (FA)'] = inventory.apply(fill_missing_price_with_top_category, axis=1, top_category_medians=top_category_medians)
251
+
252
+ # inventory['Product Composition'] = inventory['Product Composition'].str.upper()
253
+ # inventory = add_main_material(inventory)
254
+
255
+ # inventory['Colour Return Percentage'] = 15.0
256
+ # inventory['Net Sales Units (FA)'] = 1
257
+ # inventory['Product Order Count'] = 1
258
+ # inventory['Total Order Value'] = 1
259
+ # inventory['Total Customer Returns'] = 1
260
+ # inventory['Total Customer Purchases'] = 1
261
+ # inventory['Customer Return Percentage'] = 15.0
262
+
263
+ # inventory['Colour Return Percentage'] = inventory.apply(
264
+ # lambda row: json_percentages.get(f"{row['Item Brand Model']} - {row['Item Brand Colour']}", 15.0),
265
+ # axis=1
266
+ # )
267
+
268
+ # inventory = extract_ready_to_wear(inventory)
269
+
270
+ # inventory = drop_columns(inventory, [
271
+ # 'Item Brand Model', 'Item Brand Fabric', 'Item Brand Colour',
272
+ # 'Item Brand Model Fabric Colour', 'Item Brand Model Name', 'Age Range',
273
+ # 'MF', 'Product Composition', 'Product Top Category'
274
+ # ])
275
+
276
+ inventory = drop_columns(inventory, [
277
+ 'Item Brand Model', 'Item Brand Fabric', 'Item Brand Colour'
278
+ ])
279
+
280
+ inventory = inventory.reindex(columns=[
281
+ 'Product Type', 'Product Subtype', 'Product Gender', 'Net Sales (FA)',
282
+ 'Net Sales Units (FA)', 'Product Order Count', 'Total Order Value',
283
+ 'Main Material', 'Colour Return Percentage', 'Total Customer Purchases',
284
+ 'Total Customer Returns', 'Customer Return Percentage'
285
+ ])
286
+
287
+ logger.info(f"Dataset columns: {inventory.columns}")
288
+
289
+ return inventory
290
+
291
+
292
+ def map_inventory(sales: pd.DataFrame, inventory: pd.DataFrame, json_percentages: dict, mapped_inventory_path: Path):
293
+
294
+ """
295
+ Prepares the inventory dataset by processing and enriching it with sales data and additional columns.
296
+
297
+ > This operation works only on a particular formatted file (see dataset documentation)
298
+
299
+ This function performs several transformations to clean and enrich the inventory data:
300
+ - Renames columns for consistency.
301
+ - Filters the inventory to include only the 'EMPORIO ARMANI' brand.
302
+ - Merges sales data with inventory to fill in missing price information based on model and fabric.
303
+ - Fills missing price values based on median prices from the sales data, grouped by product categories.
304
+ - Adds the 'Main Material' column based on the product composition.
305
+ - Assigns default values to certain columns.
306
+ - Calculates the 'Colour Return Percentage' for each inventory item using a provided dictionary.
307
+ - Filters inventory for 'READY TO WEAR' products.
308
+ - Drops unnecessary columns and reorders the remaining columns.
309
+
310
+ Args:
311
+ sales (pd.DataFrame): A DataFrame containing sales data with product details and sales information.
312
+ inventory (pd.DataFrame): A DataFrame containing inventory data to be enriched and transformed.
313
+ json_percentages (dict): A dictionary containing the colour return percentages, with the model and colour as keys.
314
+
315
+ Returns:
316
+ pd.DataFrame: The prepared and enriched inventory DataFrame.
317
+
318
+ Example:
319
+ ```python
320
+ mapped_inventory_path = "data/mapped_inventory.tsv"
321
+ sales_df = pd.read_csv('sales.tsv', sep='\\t')
322
+ inventory_df = pd.read_csv('inventory.tsv', sep='\\t')
323
+
324
+ with open(json_percentage, 'r') as f:
325
+ percentages = json.load(f)
326
+
327
+ inventory = prepare_inventory(sales_df, inventory_df, colour_return_percentages_dict, mapped_inventory_path)
328
+ ```
329
+ """
330
+
331
+ inventory.rename(columns={
332
+ 'MODEL': 'Item Brand Model',
333
+ 'FABRIC': 'Item Brand Fabric',
334
+ 'COLOUR': 'Item Brand Colour',
335
+ 'MFC': 'Item Brand Model Fabric Colour',
336
+ 'BRAND': 'Brand',
337
+ 'item_brand_modelname': 'Item Brand Model Name',
338
+ 'item_age_range_category': 'Age Range',
339
+ 'product_brand': 'Product Brand',
340
+ 'composition': 'Product Composition',
341
+ 'product_gender_unified': 'Product Gender',
342
+ 'product_top_category': 'Product Top Category',
343
+ 'product_type': 'Product Type',
344
+ 'product_subtype': 'Product Subtype',
345
+ 'sales_season_unified': 'Sales Season',
346
+ 'product_sale_line': 'Product Sale Line',
347
+ 'image_url_new': 'Product Image Link'
348
+ }, inplace=True)
349
+
350
+ inventory = inventory[inventory['Product Brand'] == 'EMPORIO ARMANI']
351
+
352
+ inventory = drop_columns(inventory, [
353
+ 'Brand', 'Sales Season', 'Product Sale Line', 'Product Image Link', 'Product Brand'
354
+ ])
355
+
356
+ sales['MF'] = sales['Item Brand Model'] + '' + sales['Item Brand Fabric']
357
+ inventory['MF'] = inventory['Item Brand Model'] + '' + inventory['Item Brand Fabric']
358
+
359
+ sales['Net Sales (FA)'] = sales['Net Sales (FA)'].abs()
360
+
361
+ median_prices = sales.groupby('MF')['Net Sales (FA)'].first()
362
+ inventory['Net Sales (FA)'] = inventory['MF'].map(median_prices)
363
+
364
+ category_medians = sales.groupby(['Product Type', 'Product Subtype'])['Net Sales (FA)'].median()
365
+
366
+ top_category_medians = sales.groupby('Product Top Category')['Net Sales (FA)'].median()
367
+
368
+ inventory['Net Sales (FA)'] = inventory.apply(fill_missing_price, axis=1, category_medians=category_medians)
369
+ inventory['Net Sales (FA)'] = inventory.apply(fill_missing_price_with_top_category, axis=1, top_category_medians=top_category_medians)
370
+
371
+ inventory['Product Composition'] = inventory['Product Composition'].str.upper()
372
+ inventory = add_main_material(inventory)
373
+
374
+ inventory['Colour Return Percentage'] = 15.0
375
+ inventory['Net Sales Units (FA)'] = 1
376
+ inventory['Product Order Count'] = 1
377
+ inventory['Total Order Value'] = 1
378
+ inventory['Total Customer Returns'] = 1
379
+ inventory['Total Customer Purchases'] = 1
380
+ inventory['Customer Return Percentage'] = 15.0
381
+
382
+ inventory['Colour Return Percentage'] = inventory.apply(
383
+ lambda row: json_percentages.get(f"{row['Item Brand Model']} - {row['Item Brand Colour']}", 15.0),
384
+ axis=1
385
+ )
386
+
387
+ inventory = extract_ready_to_wear(inventory)
388
+
389
+ inventory = drop_columns(inventory, [
390
+ 'Item Brand Model Fabric Colour', 'Item Brand Model Name', 'Age Range',
391
+ 'MF', 'Product Composition', 'Product Top Category'
392
+ ])
393
+
394
+ inventory = inventory.reindex(columns=[
395
+ 'Item Brand Model', 'Item Brand Fabric', 'Item Brand Colour',
396
+ 'Product Type', 'Product Subtype', 'Product Gender', 'Net Sales (FA)',
397
+ 'Net Sales Units (FA)', 'Product Order Count', 'Total Order Value',
398
+ 'Main Material', 'Colour Return Percentage', 'Total Customer Purchases',
399
+ 'Total Customer Returns', 'Customer Return Percentage'
400
+ ])
401
+
402
+ inventory.to_csv(mapped_inventory_path, sep='\t', index=False)
403
+
404
+
405
+ # TODO The input row must have Net Sales (FA), Product Type and Product Subtype columns
406
+ def fill_missing_price(row: pd.Series, category_medians: dict):
407
+ """
408
+ Fills missing 'Net Sales (FA)' values based on the median price of the product category.
409
+
410
+ This function checks if the 'Net Sales (FA)' value is missing (NaN) in the row. If it is,
411
+ it attempts to fill the missing value using the median price for the product category,
412
+ which is determined by the combination of 'Product Type' and 'Product Subtype'.
413
+ The median values are provided in the `category_medians` dictionary, where the key is a tuple
414
+ of ('Product Type', 'Product Subtype') and the value is the corresponding median price.
415
+
416
+ Args:
417
+ row (pd.Series): A row of the DataFrame containing product data, including 'Net Sales (FA)',
418
+ 'Product Type', and 'Product Subtype'.
419
+ category_medians (dict): A dictionary with keys as tuples of ('Product Type', 'Product Subtype')
420
+ and values as the median price for that category.
421
+
422
+ Returns:
423
+ float: The 'Net Sales (FA)' value if it is not missing, or the median price for the product category
424
+ if it is missing, or None if no median is found for the category.
425
+ """
426
+
427
+ if pd.isna(row['Net Sales (FA)']):
428
+ product_type = row['Product Type']
429
+ product_subtype = row['Product Subtype']
430
+ return category_medians.get((product_type, product_subtype), None)
431
+ return row['Net Sales (FA)']
432
+
433
+
434
+ # TODO The input row must have Net Sales (FA) and Product Top Category columns
435
+ def fill_missing_price_with_top_category(row: pd.Series, top_category_medians: dict):
436
+ """
437
+ Fills missing 'Net Sales (FA)' values based on the median price of the product's top category.
438
+
439
+ This function checks if the 'Net Sales (FA)' value is missing (NaN) in the row. If it is,
440
+ it attempts to fill the missing value using the median price for the product's 'Product Top Category'.
441
+ The median prices for each 'Product Top Category' are provided in the `top_category_medians` dictionary,
442
+ where the key is the 'Product Top Category' and the value is the corresponding median price.
443
+
444
+ Args:
445
+ row (pd.Series): A row of the DataFrame containing product data, including 'Net Sales (FA)' and
446
+ 'Product Top Category'.
447
+ top_category_medians (dict): A dictionary with keys as 'Product Top Category' and values as the
448
+ median price for that category.
449
+
450
+ Returns:
451
+ float: The 'Net Sales (FA)' value if it is not missing, or the median price for the 'Product Top Category'
452
+ if it is missing, or None if no median is found for the top category.
453
+ """
454
+
455
+ if pd.isna(row['Net Sales (FA)']):
456
+ product_top_category = row['Product Top Category']
457
+ return top_category_medians.get(product_top_category, None)
458
+ return row['Net Sales (FA)']
459
+
460
+
461
+ #####################################################################################
462
+
463
+
464
+ @app.command()
465
+ def main(
466
+ # input_path: Path = PROCESSED_DATA_DIR / "features.tsv",
467
+ scaler_path: Path = MODELS_DIR / "scaler.pkl",
468
+ train_path: Path = PROCESSED_DATA_DIR / "train.tsv",
469
+ sales_path: Path = RAW_DATA_DIR / "sales.xlsx",
470
+ inventory_path: Path = EXTERNAL_DATA_DIR / "inventory.csv",
471
+ json_percentages_file: Path = INTERIM_DATA_DIR / "colour_return_percentage.json"
472
+ # test_path: Path = PROCESSED_DATA_DIR / "test.tsv"
473
+ ):
474
+ # ---- Split dataset into train and test ----
475
+ # try:
476
+ # data = pd.read_csv(input_path, sep='\t')
477
+ # split_data(data, train_path, test_path)
478
+ # except Exception as e:
479
+ # logger.error(f"Error during dataset split: {e}")
480
+ # return
481
+
482
+ # ---- Train and save the scaler ----
483
+ try:
484
+ train_data = pd.read_csv(train_path, sep='\t')
485
+ train_and_save_scaler(train_data, scaler_path)
486
+ except Exception as e:
487
+ logger.error(f"Error during scaler training: {e}")
488
+ return
489
+
490
+ # ---- Prepare inference file ----
491
+ try:
492
+ sales_data = pd.read_excel(sales_path)
493
+ inventory_data = pd.read_csv(inventory_path)
494
+
495
+ with open(json_percentages_file, 'r') as f:
496
+ percentages = json.load(f)
497
+
498
+ mapped_inventory_path = EXTERNAL_DATA_DIR / "inventory.tsv"
499
+ map_inventory(sales_data, inventory_data, percentages, mapped_inventory_path)
500
+ except Exception as e:
501
+ logger.error(f"Error during inventory preparation: {e}")
502
+ return
503
+
504
+
505
+ if __name__ == "__main__":
506
+ app()
product_return_prediction/features.py ADDED
@@ -0,0 +1,401 @@
1
+ from pathlib import Path
2
+ import typer
3
+ import json
4
+ import pandas as pd
5
+ from loguru import logger
6
+
7
+ from product_return_prediction.config import (
8
+ RAW_DATA_DIR, PROCESSED_DATA_DIR, INTERIM_DATA_DIR, TARGET_COLUMN
9
+ )
10
+
11
+ app = typer.Typer()
12
+
13
+
14
+ # TODO The input dataframe must contain Customer Store ID, Item Brand Model Fabric Colour and Return Units (FA) columns
15
+ # TODO The returned dataframe must have an additional column named as the TARGET_COLUMN
16
+ def add_returned(df: pd.DataFrame) -> pd.DataFrame:
17
+ """
18
+ Adds a column with the name of the TARGET_COLUMN to the DataFrame based on the 'Returns Units (FA)' column
19
+ and performs data cleaning operations.
20
+
21
+ The function calculates whether an item has been returned (indicated by a value > 0
22
+ in the 'Returns Units (FA)' column) and groups by 'Customer Store ID' and
23
+ 'Item Brand Model Fabric Colour' to ensure consistency. It then removes rows where
24
+ 'Returns Units (FA)' equals 1.
25
+
26
+ Args:
27
+ df (pd.DataFrame): Input DataFrame containing at least the columns:
28
+ - 'Customer Store ID'
29
+ - 'Item Brand Model Fabric Colour'
30
+ - 'Returns Units (FA)'
31
+
32
+ Returns:
33
+ pd.DataFrame: Modified DataFrame with the following updates:
34
+ - A new column 'Returned' indicating if the item was returned (1) or not (0).
35
+ - Rows where 'Returns Units (FA)' equals 1 are removed.
36
+ """
37
+
38
+ df[TARGET_COLUMN] = df.groupby(
39
+ ['Customer Store ID', 'Item Brand Model Fabric Colour']
40
+ )['Returns Units (FA)'].transform(lambda x: (x.fillna(0) > 0).astype(int))
41
+
42
+ df[TARGET_COLUMN] = df.groupby(
43
+ ['Customer Store ID', 'Item Brand Model Fabric Colour']
44
+ )['Returns Units (FA)'].transform('max')
45
+
46
+ df[TARGET_COLUMN] = df[TARGET_COLUMN].fillna(0).astype(int)
47
+ df = df[df['Returns Units (FA)'] != 1]
48
+
49
+ return df
50
+
51
+
52
+ # TODO The input dataframe must contain the Order Number column
53
+ # TODO The returned dataframe must have an additional column named Product Order Count
54
+ def add_product_order_count(df: pd.DataFrame) -> pd.DataFrame:
55
+ """
56
+ Adds a 'Product Order Count' column to the DataFrame, indicating the number
57
+ of products associated with each order.
58
+
59
+ The function groups the data by 'Order Number', calculates the count of
60
+ products for each order, and merges this information back into the original
61
+ DataFrame as a new column.
62
+
63
+ Args:
64
+ df (pd.DataFrame): Input DataFrame containing the column:
65
+ - 'Order Number': Identifier for each order.
66
+
67
+ Returns:
68
+ pd.DataFrame: Modified DataFrame with an additional column:
69
+ - 'Product Order Count': The count of products per order.
70
+ """
71
+
72
+ order_product_count = df.groupby('Order Number').size().reset_index(name='Product Order Count')
73
+ df = df.merge(order_product_count, on='Order Number', how='left')
74
+ return df
75
+
76
+
77
+ # TODO The input dataframe must contain the Order Number and Net Sales (FA) columns
78
+ # TODO The returned dataframe must have an additional column named Total Order Value
79
+ def add_total_order_value(df: pd.DataFrame) -> pd.DataFrame:
80
+ """
81
+ Adds a 'Total Order Value' column to the DataFrame, representing the total
82
+ net sales value for each order.
83
+
84
+ The function groups the data by 'Order Number', calculates the sum of
85
+ 'Net Sales (FA)' for each order, and merges this information back into the
86
+ original DataFrame as a new column. The total order values are rounded to
87
+ 4 decimal places.
88
+
89
+ Args:
90
+ df (pd.DataFrame): Input DataFrame containing the columns:
91
+ - 'Order Number': Identifier for each order.
92
+ - 'Net Sales (FA)': Net sales value for each item.
93
+
94
+ Returns:
95
+ pd.DataFrame: Modified DataFrame with an additional column:
96
+ - 'Total Order Value': The total net sales value for each order,
97
+ rounded to 4 decimal places.
98
+ """
99
+
100
+ order_total = df.groupby('Order Number')['Net Sales (FA)'].sum().reset_index().round(4)
101
+ order_total = order_total.rename(columns={'Net Sales (FA)': 'Total Order Value'})
102
+ df = df.merge(order_total, on='Order Number', how='left')
103
+ return df
104
+
105
+
106
+ # TODO The input must be a string formatted with percentages followed by material names (e.g. 70% Wool, 30% Nylon)
107
+ # TODO The returned dataframe must be a string, if successfully extracted, otherwise Unknown
108
+ def extract_main_material(composition: str) -> str:
109
+ """
110
+ Extracts the main material from a product composition string.
111
+
112
+ The function identifies the material associated with the first percentage value
113
+ in the input string and returns it. If no percentage or material is found,
114
+ 'Unknown' is returned. Handles cases where the material name spans multiple words.
115
+
116
+ Args:
117
+ composition (str): A string describing the product composition, typically
118
+ containing percentages followed by material names (e.g., "50% Cotton, 50% Polyester").
119
+
120
+ Returns:
121
+ str: The name of the main material (e.g., "Cotton"), or 'Unknown' if the input
122
+ is not a string or if no material can be extracted.
123
+ """
124
+
125
+ if isinstance(composition, str):
126
+ parts = composition.split()
127
+ material = []
128
+ for i, part in enumerate(parts):
129
+ if "%" in part:
130
+ if i + 1 < len(parts):
131
+ material.append(parts[i + 1])
132
+ j = i + 2
133
+ while j < len(parts) and not any(c.isdigit() for c in parts[j]):
134
+ material.append(parts[j])
135
+ j += 1
136
+ break
137
+ return " ".join(material) if material else 'Unknown'
138
+ return 'Unknown'
139
+
140
+
141
+ # TODO The input dataframe must contain the Product Top Category column
142
+ # TODO The returned dataframe must have only records with Product Top Category equal to READY TO WEAR
143
+ def extract_ready_to_wear(df: pd.DataFrame) -> pd.DataFrame:
144
+ """
145
+ Filters the DataFrame to include only rows where the product belongs to
146
+ the 'READY TO WEAR' category.
147
+
148
+ The function selects rows where the 'Product Top Category' column is equal to
149
+ 'READY TO WEAR' and returns the filtered DataFrame.
150
+
151
+ Args:
152
+ df (pd.DataFrame): Input DataFrame containing the column:
153
+ - 'Product Top Category': The top-level category of the product.
154
+
155
+ Returns:
156
+ pd.DataFrame: A filtered DataFrame containing only rows where
157
+ 'Product Top Category' equals 'READY TO WEAR'.
158
+ """
159
+
160
+ df = df[df['Product Top Category'] == 'READY TO WEAR']
161
+ return df
162
+
163
+
164
+ # TODO The input dataframe must contain the Product Composition column
165
+ # TODO The returned dataframe must have an additional column named Main Material
166
+ def add_main_material(df: pd.DataFrame) -> pd.DataFrame:
167
+ """
168
+ Adds a 'Main Material' column to the DataFrame by extracting the primary material
169
+ from the 'Product Composition' column.
170
+
171
+ The function applies the `extract_main_material` function to each value in the
172
+ 'Product Composition' column to determine the main material and stores the result
173
+ in a new column, 'Main Material'.
174
+
175
+ Args:
176
+ df (pd.DataFrame): Input DataFrame containing the column:
177
+ - 'Product Composition': A string describing the composition of the product,
178
+ typically with percentages followed by material names.
179
+
180
+ Returns:
181
+ pd.DataFrame: Modified DataFrame with an additional column:
182
+ - 'Main Material': The extracted primary material from 'Product Composition'.
183
+ If the composition is invalid or a material cannot be determined, the
184
+ value will be 'Unknown'.
185
+ """
186
+
187
+ df['Main Material'] = df['Product Composition'].apply(extract_main_material)
188
+ return df
189
+
190
+
191
+ # TODO The input dataframe must contain Item Brand Model, Item Brand Colour and TARGET_COLUMN columns
192
+ # TODO The returned dataframe must have an additional column named Colour Return Percentage
193
+ # TODO The json_file must be a valid Path object specifying the location to save the JSON file
194
+ # TODO The JSON file must have the number of rows equal to the number of distinct Model-Colour items in the dataframe
195
+ def add_colour_return_percentage(df: pd.DataFrame, json_file: Path) -> pd.DataFrame:
196
+ """
197
+ Adds a 'Colour Return Percentage' column to the DataFrame and saves the results
198
+ as a JSON file. The percentage is calculated for each combination of
199
+ 'Item Brand Model' and 'Item Brand Colour' based on the returned and total counts.
200
+
201
+ Args:
202
+ df (pd.DataFrame): Input DataFrame containing the columns:
203
+ - 'Item Brand Model': The model identifier for the item.
204
+ - 'Item Brand Colour': The color associated with the item.
205
+ - TARGET_COLUMN (e.g., 'Returned'): Indicates whether an item was returned (1) or not (0).
206
+
207
+ json_file (Path): Path to save the JSON file containing the 'Colour Return Percentage'
208
+ for each 'Model - Colour' combination.
209
+
210
+ Returns:
211
+ pd.DataFrame: Modified DataFrame with an additional column:
212
+ - 'Colour Return Percentage': The percentage of returns for each
213
+ 'Item Brand Model' and 'Item Brand Colour' combination, rounded to 1 decimal place.
214
+ """
215
+
216
+ grouped = df.groupby(['Item Brand Model', 'Item Brand Colour'])
217
+ returned_count = grouped[TARGET_COLUMN].sum().reset_index(name='Returned Count')
218
+ total_count = grouped.size().reset_index(name='Total Count')
219
+
220
+ result = pd.merge(returned_count, total_count, on=['Item Brand Model', 'Item Brand Colour'])
221
+ result['Colour Return Percentage'] = (
222
+ (result['Returned Count'] / result['Total Count']) * 100
223
+ ).round(1)
224
+
225
+ df = pd.merge(df, result[
226
+ ['Item Brand Model', 'Item Brand Colour', 'Colour Return Percentage']
227
+ ], on=['Item Brand Model', 'Item Brand Colour'], how='left')
228
+
229
+ # Create a dictionary for JSON output
230
+ colour_return_percentage_json = df.assign(
231
+ Model_Colour=df['Item Brand Model'] + ' - ' + df['Item Brand Colour']
232
+ ).set_index('Model_Colour')['Colour Return Percentage'].to_dict()
233
+
234
+ with open(json_file, 'w') as f:
235
+ json.dump(colour_return_percentage_json, f, indent=4)
236
+
237
+ return df
238
+
239
+
240
+ # TODO The input dataframe must contain Customer Store ID column
241
+ # TODO The returned dataframe must have an additional column named Total Customer Purchases
242
+ def add_total_customer_purchases(df: pd.DataFrame) -> pd.DataFrame:
243
+ """
244
+ Adds a 'Total Customer Purchases' column to the DataFrame, representing the total
245
+ number of purchases made by each customer.
246
+
247
+ The function groups the data by 'Customer Store ID', calculates the total count
248
+ of purchases for each customer, and merges this information back into the original
249
+ DataFrame as a new column.
250
+
251
+ Args:
252
+ df (pd.DataFrame): Input DataFrame containing the column:
253
+ - 'Customer Store ID': The unique identifier for each customer.
254
+
255
+ Returns:
256
+ pd.DataFrame: Modified DataFrame with an additional column:
257
+ - 'Total Customer Purchases': The total number of purchases for each
258
+ customer.
259
+ """
260
+
261
+ total_purchases = df.groupby('Customer Store ID').size().reset_index(name='Total Customer Purchases')
262
+ df = pd.merge(df, total_purchases, on='Customer Store ID', how='left')
263
+ return df
264
+
265
+
266
+ # TODO The input dataframe must contain Customer Store ID and TARGET_COLUMN columns
267
+ # TODO The returned dataframe must have an additional column named Total Customer Returns
268
+ def add_total_customer_returns(df: pd.DataFrame) -> pd.DataFrame:
269
+ """
270
+ Adds a 'Total Customer Returns' column to the DataFrame, representing the total
271
+ number of returns made by each customer.
272
+
273
+ The function groups the data by 'Customer Store ID', calculates the sum of
274
+ the returns (as indicated by the `TARGET_COLUMN`), and merges this information
275
+ back into the original DataFrame as a new column.
276
+
277
+ Args:
278
+ df (pd.DataFrame): Input DataFrame containing the columns:
279
+ - 'Customer Store ID': The unique identifier for each customer.
280
+ - TARGET_COLUMN (e.g., 'Returned'): A column indicating the return status
281
+ of a purchase (typically 1 for returned, 0 for not returned).
282
+
283
+ Returns:
284
+ pd.DataFrame: Modified DataFrame with an additional column:
285
+ - 'Total Customer Returns': The total number of returns made by each
286
+ customer based on the `TARGET_COLUMN`.
287
+ """
288
+
289
+ total_returns = df.groupby('Customer Store ID')[TARGET_COLUMN].sum().reset_index(name='Total Customer Returns')
290
+ df = df.merge(total_returns, on='Customer Store ID', how='left')
291
+ return df
292
+
293
+
294
+ # TODO The input dataframe must contain Customer Store ID, Total Customer Returns and Total Customer Purchases columns
295
+ # TODO The returned dataframe must have an additional column named Customer Return Percentage
296
+ def add_customer_return_percentage(df: pd.DataFrame) -> pd.DataFrame:
297
+ """
298
+ Adds a 'Customer Return Percentage' column to the DataFrame, representing the
299
+ percentage of returns made by each customer relative to their total purchases.
300
+
301
+ The function groups the data by 'Customer Store ID', calculates the total number
302
+ of returns and total purchases for each customer, then computes the return percentage
303
+ and merges this information back into the original DataFrame as a new column.
304
+
305
+ Args:
306
+ df (pd.DataFrame): Input DataFrame containing the columns:
307
+ - 'Customer Store ID': The unique identifier for each customer.
308
+ - 'Total Customer Returns': The total number of returns made by each customer.
309
+ - 'Total Customer Purchases': The total number of purchases made by each customer.
310
+
311
+ Returns:
312
+ pd.DataFrame: Modified DataFrame with an additional column:
313
+ - 'Customer Return Percentage': The percentage of returns made by each customer,
314
+ rounded to 1 decimal place.
315
+ """
316
+
317
+ user_grouped = df.groupby('Customer Store ID')[
318
+ ['Total Customer Returns', 'Total Customer Purchases']
319
+ ].sum().reset_index()
320
+
321
+ user_grouped['Customer Return Percentage'] = (
322
+ (user_grouped['Total Customer Returns'] / user_grouped['Total Customer Purchases']) * 100
323
+ ).round(1)
324
+
325
+ df = pd.merge(df, user_grouped[['Customer Store ID', 'Customer Return Percentage']],
326
+ on='Customer Store ID', how='left')
327
+ return df
328
+
329
+
330
+ @app.command()
331
+ def main(
332
+ input_path: Path = RAW_DATA_DIR / "sales.xlsx",
333
+ train_path: Path = PROCESSED_DATA_DIR / "train.tsv",
334
+ test_path: Path = PROCESSED_DATA_DIR / "test.tsv",
335
+ json_file: Path = INTERIM_DATA_DIR / "colour_return_percentage.json",
336
+ # output_path: Path = PROCESSED_DATA_DIR / "features.tsv",
337
+ ):
338
+ from product_return_prediction.dataset import xlsx_to_tsv, drop_columns, split_data
339
+
340
+ logger.info("Generating features from dataset...")
341
+
342
+ # ---- Prepare data for feature engineering ----
343
+ tsv_path: Path = INTERIM_DATA_DIR / "dataset.tsv"
344
+ xlsx_to_tsv(input_path, tsv_path)
345
+
346
+ df = pd.read_csv(tsv_path, sep='\t')
347
+
348
+ cols_to_drop_1 = [
349
+ 'Year Gregorian',
350
+ 'Month Gregorian',
351
+ 'Month Gregorian Name',
352
+ 'Country',
353
+ 'Variant WCS',
354
+ 'Age Range',
355
+ 'Product Image Link',
356
+ 'Returns Value (FA)',
357
+ 'Returns Units (FA)',
358
+ 'Return Reason Group',
359
+ 'Return Reason'
360
+ ]
361
+
362
+ cols_to_drop_2 = [
363
+ 'Date (Date format)',
364
+ 'Customer Store ID',
365
+ 'Order Number',
366
+ 'Order Line Number',
367
+ 'Item Brand Model',
368
+ 'Item Brand Fabric',
369
+ 'Item Brand Colour',
370
+ 'Item Brand Model Fabric Colour',
371
+ 'Product Composition',
372
+ 'Product Top Category'
373
+ ]
374
+
375
+ # ---- Perform feature engineering pipeline ----
376
+ df = add_returned(df)
377
+ df = drop_columns(df, cols_to_drop_1)
378
+ df = add_product_order_count(df)
379
+ df = add_total_order_value(df)
380
+ df = add_main_material(df)
381
+ df = add_colour_return_percentage(df, json_file)
382
+ df = add_total_customer_purchases(df)
383
+ df = add_total_customer_returns(df)
384
+ df = add_customer_return_percentage(df)
385
+ df = extract_ready_to_wear(df)
386
+ split_data(df, train_path, test_path, id_column="Customer Store ID")
387
+
388
+ train = pd.read_csv(train_path, sep='\t')
389
+ test = pd.read_csv(test_path, sep='\t')
390
+
391
+ train = drop_columns(train, cols_to_drop_2)
392
+ test = drop_columns(test, cols_to_drop_2)
393
+
394
+ train.to_csv(train_path, sep='\t', index=False)
395
+ test.to_csv(test_path, sep='\t', index=False)
396
+
397
+ logger.success("Features generation complete.")
398
+
399
+
400
+ if __name__ == "__main__":
401
+ app()
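To make the order-level helpers above concrete, here is a small illustrative sketch; the toy data is invented, and only `add_product_order_count` and `add_total_order_value` from this file are exercised.

```python
import pandas as pd

from product_return_prediction.features import add_product_order_count, add_total_order_value

# Toy example: two orders spread over three order lines (values are made up).
toy = pd.DataFrame({
    "Order Number": [1, 1, 2],
    "Net Sales (FA)": [100.0, 50.0, 80.0],
})

toy = add_product_order_count(toy)  # order 1 -> 2 products, order 2 -> 1 product
toy = add_total_order_value(toy)    # order 1 -> 150.0, order 2 -> 80.0
print(toy)
```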
product_return_prediction/modeling/__init__.py ADDED
File without changes
product_return_prediction/modeling/eval.py ADDED
@@ -0,0 +1,101 @@
+ import pickle
+ import typer
+ import json
+
+ import seaborn as sns
+ import pandas as pd
+ import matplotlib.pyplot as plt
+
+ from loguru import logger
+ from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
+ from pathlib import Path
+ from codecarbon import EmissionsTracker
+
+ from product_return_prediction.dataset import scale_data_with_trained_scaler
+ from product_return_prediction.config import (
+     MODELS_DIR,
+     PROCESSED_DATA_DIR,
+     TARGET_COLUMN,
+     REPORTS_DIR
+ )
+
+ app = typer.Typer()
+
+
+ def evaluate_model(test_data: pd.DataFrame, scaler_file: Path, model: any, model_name: str):
+     """
+     Evaluates the performance of a trained model on the provided test data. It includes scaling the features
+     using a pre-trained scaler, making predictions, computing accuracy, generating a classification report,
+     and visualizing the confusion matrix.
+
+     This function scales the test data using a pre-trained scaler, applies the trained model to make predictions,
+     and calculates key performance metrics, including accuracy. It then generates a detailed classification report,
+     saves the report to a JSON file, and plots the confusion matrix to visually assess model performance.
+
+     Args:
+         test_data (pd.DataFrame): The test dataset, which includes both features and the target column.
+         scaler_file (Path): Path to the pre-trained scaler file, used to scale the feature columns.
+         model (any): The trained model object, used to make predictions on the test data.
+         model_name (str): The name of the model, used for saving the evaluation report.
+
+     Example:
+         ```python
+         evaluate_model(test_data, scaler_file='scaler.pkl', model=model, model_name='log_reg')
+         ```
+     """
+
+     X_test = test_data.drop(columns=[TARGET_COLUMN]).copy()
+     y_test = test_data[TARGET_COLUMN].copy()
+
+     X_test = scale_data_with_trained_scaler(X_test, scaler_file)
+
+     cc_file = f"{model_name}_emissions.csv"
+     tracker = EmissionsTracker(project_name="eval", output_dir=REPORTS_DIR, output_file=cc_file)
+     tracker.start()
+
+     y_pred = model.predict(X_test)
+
+     tracker.stop()
+
+     accuracy = accuracy_score(y_test, y_pred)
+     logger.info(f"Accuracy: {accuracy * 100:.2f}%")
+
+     report = classification_report(y_test, y_pred)
+     logger.info(f"Classification Report:\n{report}")
+
+     report = classification_report(y_test, y_pred, output_dict=True)
+     with open(REPORTS_DIR / f"{model_name}.json", "w") as json_file:
+         json.dump(report, json_file, indent=4)
+
+     cm = confusion_matrix(y_test, y_pred)
+     sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=model.classes_, yticklabels=model.classes_)
+     plt.title("Confusion Matrix")
+     plt.xlabel("Predicted Labels")
+     plt.ylabel("True Labels")
+
+     # Saving the confusion matrix in the reports/figures directory
+     plt.savefig(REPORTS_DIR / f"figures/cm_{model_name}.png", dpi=300, bbox_inches='tight')
+     plt.close()
+
+
+ @app.command()
+ def main(
+     test_file: Path = PROCESSED_DATA_DIR / "test.tsv",
+     scaler_file: Path = MODELS_DIR / "scaler.pkl",
+     log_reg_model_path: Path = MODELS_DIR / "log_reg.pkl",
+     svm_model_path: Path = MODELS_DIR / "svm.pkl",
+ ):
+     test_data = pd.read_csv(test_file, sep='\t')
+
+     with open(log_reg_model_path, "rb") as f:
+         log_reg = pickle.load(f)
+
+     with open(svm_model_path, "rb") as f:
+         svm = pickle.load(f)
+
+     evaluate_model(test_data, scaler_file, log_reg, "log_reg_eval")
+     evaluate_model(test_data, scaler_file, svm, "svm_eval")
+
+
+ if __name__ == "__main__":
+     app()
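A minimal sketch of calling `evaluate_model` outside the Typer command, assuming the scaler and SVM pickles produced by `train.py` already sit at their default locations:

```python
import pickle
import pandas as pd

from product_return_prediction.config import MODELS_DIR, PROCESSED_DATA_DIR
from product_return_prediction.modeling.eval import evaluate_model

test_data = pd.read_csv(PROCESSED_DATA_DIR / "test.tsv", sep="\t")

with open(MODELS_DIR / "svm.pkl", "rb") as f:
    svm = pickle.load(f)

# Writes {model_name}.json and figures/cm_{model_name}.png under REPORTS_DIR as side effects.
evaluate_model(test_data, MODELS_DIR / "scaler.pkl", svm, "svm_eval")
```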
product_return_prediction/modeling/predict.py ADDED
@@ -0,0 +1,60 @@
+ from pathlib import Path
+
+ import typer
+ import pickle
+ import json
+ import pandas as pd
+ from loguru import logger
+ from codecarbon import EmissionsTracker
+
+ from product_return_prediction.config import MODELS_DIR, INTERIM_DATA_DIR, EXTERNAL_DATA_DIR, REPORTS_DIR, RAW_DATA_DIR
+ from product_return_prediction.dataset import prepare_inventory, scale_data_with_trained_scaler
+
+ app = typer.Typer()
+
+
+ @app.command()
+ def main(
+     sales_path: Path = RAW_DATA_DIR / "sales.xlsx",
+     inventory_path: Path = EXTERNAL_DATA_DIR / "inventory.csv",
+     json_percentage: Path = INTERIM_DATA_DIR / "colour_return_percentage.json",
+     scaler_file: Path = MODELS_DIR / "scaler.pkl",
+     model_path: Path = MODELS_DIR / "svm.pkl",
+ ):
+     sales = pd.read_excel(sales_path)
+     inventory = pd.read_csv(inventory_path)
+
+     with open(json_percentage, 'r') as f:
+         percentages = json.load(f)
+
+     # ---- Prepare inventory data for inference ----
+     inventory = prepare_inventory(sales, inventory, percentages)
+
+     with open(model_path, "rb") as f:
+         model = pickle.load(f)
+
+     # ---- Scale 5 random rows from the inventory ----
+     random_row = inventory.sample(n=5)
+     logger.info(f"Your product:\n {random_row}")
+     random_row = scale_data_with_trained_scaler(random_row, scaler_file)
+
+     # ---- Compute predictions and probabilities ----
+     cc_file = "svm_predict_emissions.csv"
+     tracker = EmissionsTracker(project_name="eval", output_dir=REPORTS_DIR, output_file=cc_file)
+     tracker.start()
+
+     predictions = model.predict(random_row)
+     probabilities = model.predict_proba(random_row)
+
+     tracker.stop()
+
+     for pred, prob in zip(predictions, probabilities):
+         prob_confidence = prob.max()
+         if pred == 1:
+             logger.info(f"The product will be returned with {prob_confidence:.2f} confidence")
+         else:
+             logger.info(f"The product will NOT be returned with {prob_confidence:.2f} confidence")
+
+
+ if __name__ == "__main__":
+     app()
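One note on the confidence values logged above, with a small sketch that reuses the variables from the command (illustrative only):

```python
# For SVC(probability=True), predict_proba is calibrated via Platt scaling, so the
# reported confidence comes from a different path than the label given by model.predict().
probabilities = model.predict_proba(random_row)  # shape (n_rows, 2), columns follow model.classes_
confidence = probabilities.max(axis=1)           # per-row confidence, as logged above
```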
product_return_prediction/modeling/train.py ADDED
@@ -0,0 +1,143 @@
+ import pickle
+ from pathlib import Path
+
+ import dagshub
+ import mlflow
+ import pandas as pd
+ import typer
+ from loguru import logger
+ from sklearn.linear_model import LogisticRegression
+ from sklearn.model_selection import GridSearchCV
+ from sklearn.svm import SVC
+ from codecarbon import EmissionsTracker
+
+ from product_return_prediction.dataset import scale_data_with_trained_scaler
+ from product_return_prediction.config import (
+     MODELS_DIR,
+     PROCESSED_DATA_DIR,
+     TARGET_COLUMN,
+     REPORTS_DIR
+ )
+
+ dagshub.init(repo_owner='se4ai2425-uniba', repo_name='product-return-prediction', mlflow=True)
+
+ app = typer.Typer()
+
+
+ # TODO The training dataset must have the following columns:
+ # Product Type, Product Subtype, Product Gender, Net Sales (FA), Net Sales Units (FA)
+ # TARGET_COLUMN, Product Order Count, Total Order Value, Main Material, Colour Return Percentage
+ # Total Customer Purchases, Total Customer Returns, Customer Return Percentage
+ # TODO The scaler and model paths must be Pickle (.pkl) files
+ def train_log_reg(train_data: pd.DataFrame, scaler_file: Path, model_path: Path):
+     """
+     Trains a Logistic Regression model using the provided training data, applies feature scaling,
+     and saves the trained model to a specified file.
+
+     This function trains a Logistic Regression model using the training data. The feature columns are
+     scaled using a pre-trained scaler before fitting the model. The model is then saved to the specified
+     file path, and the training process is tracked using MLflow.
+
+     Args:
+         train_data (pd.DataFrame): The training data, including features and target column.
+         scaler_file (Path): Path to the pre-trained scaler file, used to scale the feature columns.
+         model_path (Path): Path where the trained Logistic Regression model will be saved.
+     """
+
+     run_name = model_path.stem
+     mlflow.start_run(run_name=run_name)
+     mlflow.sklearn.autolog()
+
+     # Apply scaling to the feature columns (excluding the target column)
+     X_train = train_data.drop(columns=[TARGET_COLUMN]).copy()
+     y_train = train_data[TARGET_COLUMN].copy()
+
+     # Scale X_train using the pre-trained scaler
+     X_train = scale_data_with_trained_scaler(X_train, scaler_file)
+
+     # Initialize the Logistic Regression model
+     model = LogisticRegression(max_iter=1000, class_weight="balanced")
+     logger.info(f"Model: {model}")
+
+     cc_file = "log_reg_train_emissions.csv"
+     tracker = EmissionsTracker(project_name="train", output_dir=REPORTS_DIR, output_file=cc_file)
+     tracker.start()
+
+     # Fit the model to the training data
+     model.fit(X_train, y_train)
+
+     tracker.stop()
+     mlflow.end_run()
+
+     # Save the trained model to disk
+     with open(model_path, "wb") as f:
+         pickle.dump(model, f)
+     logger.success(f"Model saved to {model_path}")
+
+
+ # TODO The training dataset must have the following columns:
+ # Product Type, Product Subtype, Product Gender, Net Sales (FA), Net Sales Units (FA)
+ # TARGET_COLUMN, Product Order Count, Total Order Value, Main Material, Colour Return Percentage
+ # Total Customer Purchases, Total Customer Returns, Customer Return Percentage
+ # TODO The scaler and model paths must be Pickle (.pkl) files
+ def train_svm(train_data: pd.DataFrame, scaler_file: Path, model_path: Path):
+     """
+     Trains a Support Vector Machine (SVM) classifier using the provided training data, applies feature scaling,
+     performs hyperparameter tuning via grid search, and saves the trained model to a specified file.
+
+     This function trains an SVM model with hyperparameter optimization using grid search. The feature columns
+     are scaled using a pre-trained scaler before fitting the model. The trained model is saved to the specified
+     file path, and the training process is tracked using MLflow.
+
+     Args:
+         train_data (pd.DataFrame): The training data, including features and target column.
+         scaler_file (Path): Path to the pre-trained scaler file, used to scale the feature columns.
+         model_path (Path): Path where the trained SVM model will be saved.
+     """
+
+     run_name = model_path.stem
+     mlflow.start_run(run_name=run_name)
+     mlflow.sklearn.autolog()
+
+     X_train = train_data.drop(columns=[TARGET_COLUMN]).copy()
+     y_train = train_data[TARGET_COLUMN].copy()
+
+     X_train = scale_data_with_trained_scaler(X_train, scaler_file)
+
+     param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf"], "gamma": ["scale", "auto"]}
+
+     logger.info("Starting Grid Search for best hyperparameters")
+     grid_search = GridSearchCV(SVC(probability=True), param_grid, scoring="balanced_accuracy", cv=10)
+     grid_search.fit(X_train, y_train)
+     model = grid_search.best_estimator_
+
+     cc_file = "svm_train_emissions.csv"
+     tracker = EmissionsTracker(project_name="train", output_dir=REPORTS_DIR, output_file=cc_file)
+     tracker.start()
+
+     model.fit(X_train, y_train)
+
+     tracker.stop()
+     mlflow.end_run()
+
+     with open(model_path, "wb") as f:
+         pickle.dump(model, f)
+     logger.success(f"Model saved to {model_path}")
+
+
+ @app.command()
+ def main(
+     train_file: Path = PROCESSED_DATA_DIR / "train.tsv",
+     scaler_file: Path = MODELS_DIR / "scaler.pkl",
+     log_reg_model_path: Path = MODELS_DIR / "log_reg.pkl",
+     svm_model_path: Path = MODELS_DIR / "svm.pkl",
+ ):
+     train_data = pd.read_csv(train_file, sep='\t')
+
+     # ---- Train models ----
+     train_log_reg(train_data, scaler_file, log_reg_model_path)
+     train_svm(train_data, scaler_file, svm_model_path)
+
+
+ if __name__ == "__main__":
+     app()
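Since only the best estimator from the grid search is pickled, one way to check which hyperparameters it ended up with is to reload it and inspect its parameters; a sketch, assuming training has already produced `models/svm.pkl`:

```python
import pickle

from product_return_prediction.config import MODELS_DIR

with open(MODELS_DIR / "svm.pkl", "rb") as f:
    svm = pickle.load(f)

# C, kernel and gamma correspond to the param_grid defined in train_svm above.
print({k: v for k, v in svm.get_params().items() if k in ("C", "kernel", "gamma")})
```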
product_return_prediction/plots.py ADDED
@@ -0,0 +1,29 @@
+ from pathlib import Path
+
+ import typer
+ from loguru import logger
+ from tqdm import tqdm
+
+ from product_return_prediction.config import FIGURES_DIR, PROCESSED_DATA_DIR
+
+ app = typer.Typer()
+
+
+ @app.command()
+ def main(
+     # ---- REPLACE DEFAULT PATHS AS APPROPRIATE ----
+     input_path: Path = PROCESSED_DATA_DIR / "dataset.csv",
+     output_path: Path = FIGURES_DIR / "plot.png",
+     # -----------------------------------------
+ ):
+     # ---- REPLACE THIS WITH YOUR OWN CODE ----
+     logger.info("Generating plot from data...")
+     for i in tqdm(range(10), total=10):
+         if i == 5:
+             logger.info("Something happened for iteration 5.")
+     logger.success("Plot generation complete.")
+     # -----------------------------------------
+
+
+ if __name__ == "__main__":
+     app()
pyproject.toml ADDED
@@ -0,0 +1,36 @@
+ [build-system]
+ requires = ["flit_core >=3.2,<4"]
+ build-backend = "flit_core.buildapi"
+
+ [project]
+ name = "product_return_prediction"
+ version = "0.0.1"
+ description = "Analyze past orders and returns to predict which products are more likely to be returned."
+ authors = [
+     { name = "Molinari-Pinto-Tanzi" },
+ ]
+ license = { file = "LICENSE" }
+ readme = "README.md"
+ classifiers = [
+     "Programming Language :: Python :: 3",
+     "License :: OSI Approved :: MIT License"
+ ]
+ requires-python = "~=3.12"
+
+ [tool.black]
+ line-length = 99
+ include = '\.pyi?$'
+ exclude = '''
+ /(
+     \.git
+   | \.venv
+ )/
+ '''
+
+ [tool.ruff.lint.isort]
+ known_first_party = ["product_return_prediction"]
+ force_sort_within_sections = true
+
+ [tool.pytest.ini_options]
+ log_cli = true
+ log_cli_level = "INFO"
requirements.txt ADDED
@@ -0,0 +1,27 @@
+ black
+ codecarbon
+ fastapi
+ flake8
+ ipython
+ isort
+ jupyterlab
+ loguru
+ matplotlib
+ mkdocs
+ notebook
+ numpy
+ pandas
+ pip
+ python-dotenv
+ scikit-learn
+ tqdm
+ typer
+ dvc
+ dvc-gdrive
+ mlflow
+ dagshub
+ great-expectations
+ pytest
+ openpyxl
+ uvicorn
+ seaborn