# Sales Dataset Card

This dataset contains 28,692 rows of data related to Emporio Armani's e-commerce order lines. Each row represents an order line, detailing whether the product was returned or not. The dataset includes features such as:

- **Date and Time Information**: Year (Gregorian), month (Gregorian), month name, and the exact purchase date (in date format).
- **Customer Information**: Store ID of the customer associated with the transaction.
- **Order Line Details**: Order number and order line number to uniquely identify each purchase.
- **Geographical Information**: Country where the purchase was made.
- **Product Details**: Variant code (WCS), Item Brand model, Fabric, Color, a combination of model, fabric, and color, and the product composition. Additionally, the dataset specifies the product's top category, type, and subtype, along with its gender target, age range, and a link to the product image.
- **Return Information**: Return reason group and detailed reason for the return (if applicable).
- **Financial and Quantitative Data**: Net sales (value and units), return value, and return units for each transaction.

The dataset is slightly unbalanced, with only 23% of transactions involving returned products.

## Uses

The dataset can be used to support a variety of use cases in the field of e-commerce analytics and machine learning. Some of the main intended uses are:

- **Predictive Modeling**: Training and evaluating machine learning models for binary classification tasks, such as predicting the likelihood of a product being returned based on its characteristics and transaction details.
- **Exploratory Data Analysis (EDA)**: Analyzing patterns and trends in product returns to identify factors that influence customer behavior, such as product type, composition, or age range.
- **Feature Engineering**: Using the dataset to develop and test new features for predictive models, such as aggregating return reasons or combining product composition data.
- **Unbalanced Data Research**: Studying machine learning techniques and strategies to handle imbalanced datasets, as the target variable is not evenly distributed.

### Direct Use

The dataset can be used in the following cases:

- Train a **binary classification** model to predict whether a product will be returned
- Train a **regression** model to predict the probability that a product is returned
- Train a **multi-class classification** model to predict the reason for a return

## Dataset Structure

The dataset presents the following features:

| Feature name | Description |
| --- | --- |
| Year Gregorian | Gregorian year of the purchase (e.g.: `2023`) |
| Month Gregorian | Gregorian month of the purchase (e.g.: `01/2023`, indicating January 2023) |
| Month Gregorian Name | Abbreviated Gregorian name of the month of the purchase (e.g.: `Jan`) |
| Date (Date Format) | Date of the purchase (e.g.: `2023-01-02`, indicating 2 January 2023) |
| Customer Store ID | Numerical code that identifies the user |
| Order Number | Alphanumerical code that identifies the receipt to which the purchase belongs |
| Order Line Number | Integer giving the position of the product within the receipt (if the product is part of a receipt with 5 products, the order line number is a value between 1 and 5) |
| Country | Country in which the product was purchased |
| Variant WCS | Alternative identifier of the receipt, with 1:1 correspondence with Order Number |
| Item Brand Model | Alphanumerical code indicating the model of the purchased product |
| Item Brand Fabric | Alphanumerical code indicating the fabric of the purchased product |
| Item Brand Colour | Alphanumerical code indicating the colour of the purchased product |
| Item Brand Model Fabric Colour | Alphanumerical code, the combination of the codes of Model, Fabric, and Colour |
| Product Composition | Percentages of the materials that make up the purchased product (e.g.: `43% COTTON 29% WOOL 28% ACRYLIC`) |
| Product Top Category | Macrocategory to which the purchased product belongs (e.g.: `READY TO WEAR`) |
| Product Type | Type of garment corresponding to the product purchased (e.g.: `JACKETS`) |
| Product Subtype | Subtype of garment corresponding to the product purchased (e.g.: `DOUBLE BREASTED JACKETS`) |
| Age Range | One of `ADULT`, `JUNIOR` or `BABY` |
| Product Gender | Either `MALE` or `FEMALE` |
| Product Image Link | URL of an image of the purchased product (20% of the products have no link) |
| Return Reason Group | One of 9 string values giving the macrocategory of the product return reason. A `#N/A#` value corresponds to an unreturned product |
| Return Reason | One of 26 string values giving the specific category of the product return reason. A `#N/A#` value corresponds to an unreturned product |
| Net Sales (FA) | Value, in Euros, of the product purchased |
| Net Sales Units (FA) | Value describing whether the product was returned or not (`-1` means the product was returned, otherwise the value is `1`) |
| Returns Value (FA) | Same value as the net sales; populated only if the product is returned |
| Return Units (FA) | Value is `1.0` only if the product is returned, otherwise it is null |

## Dataset Creation

### Data Collection and Processing

A feature engineering pipeline was applied to the dataset as follows:

1. Added a new column named `Returned` that flags whether a product has been returned, based on the `Return Units (FA)` column
2. Removed `Year Gregorian`, `Month Gregorian`, `Month Gregorian Name`, `Country`, `Age Range`, `Product Image Link`, `Returns Value (FA)`, `Return Units (FA)`, `Return Reason Group` and `Return Reason` because they were not useful for training
3. Removed `Variant WCS` to remove additional IDs
4. Added a new column named `Product Order Count` that gives the number of products belonging to the same order as the selected product, based on `Order Number` and `Order Line Number`
5. Added a new column named `Total Order Value` that sums the value of every product belonging to the same order, based on the `Order Number` and `Net Sales (FA)` columns
6. Added a new column named `Main Material` that contains the first material listed in the `Product Composition` column
7. Added a new column named `Colour Return Percentage` that estimates the return likelihood of a product based on its `Item Brand Model` and `Item Brand Colour`
   - This operation also produced a JSON file used to look up known values from `Item Brand Model` and `Item Brand Colour`; when no value is known, a median value for the product's `Product Top Category` is used instead
8. Added a new column named `Total Customer Purchases` that gives the number of purchases, within the year, made by the customer who purchased that product
9. Added a new column named `Total Customer Returns` that gives the number of returns, within the year, made by the customer who purchased that product
10. Added a new column named `Customer Return Percentage` that estimates the likelihood of a return by the customer who bought that product
11. Selected only the rows whose `Product Top Category` is `READY TO WEAR`
12. Removed `Date (Date Format)`, `Customer Store ID`, `Order Number`, `Order Line Number`, `Item Brand Model`, `Item Brand Fabric`, `Item Brand Colour`, `Item Brand Model Fabric Colour`, `Product Composition`, `Product Top Category`

After these operations, all the categorical features were converted into numerical ones using **Target Encoding** with smoothing, to avoid partial ordering issues during training.
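Some of the steps above can be sketched with pandas. This is a minimal illustration, not the pipeline actually used to build the dataset: the toy rows are invented, the regular expression for `Main Material` assumes the `NN% MATERIAL` layout shown in the card's example, and `smoothed_target_encode` with its smoothing weight `m` is a hypothetical helper, since the card does not say which smoothing formula was applied.

```python
import pandas as pd

# Toy order lines using the card's column names; every value is invented.
raw = pd.DataFrame({
    "Order Number": ["A1", "A1", "B2", "C3", "C3", "C3"],
    "Order Line Number": [1, 2, 1, 1, 2, 3],
    "Net Sales (FA)": [100.0, 50.0, 80.0, 40.0, 60.0, 20.0],
    "Return Units (FA)": [1.0, None, None, None, 1.0, None],
    "Product Composition": [
        "43% COTTON 29% WOOL 28% ACRYLIC",
        "100% WOOL",
        "100% COTTON",
        "95% COTTON 5% ELASTANE",
        "100% WOOL",
        "100% SILK",
    ],
})
df = raw.copy()

# Step 1: `Return Units (FA)` is null unless the product was returned.
df["Returned"] = df["Return Units (FA)"].notna().astype(int)

# Step 4: number of order lines that share the same order.
df["Product Order Count"] = (
    df.groupby("Order Number")["Order Line Number"].transform("count")
)

# Step 5: total value of the order each line belongs to.
df["Total Order Value"] = (
    df.groupby("Order Number")["Net Sales (FA)"].transform("sum")
)

# Step 6: first material in the composition string (assumed "NN% NAME ..." layout).
df["Main Material"] = df["Product Composition"].str.extract(
    r"^\s*\d+(?:\.\d+)?%\s*([^\d%]+?)\s*(?=\d|$)", expand=False
)


def smoothed_target_encode(frame, column, target, m=10.0):
    """Hypothetical helper: shrink each category's target mean toward the
    global mean, with `m` acting as a pseudo-count (m-estimate smoothing)."""
    prior = frame[target].mean()
    stats = frame.groupby(column)[target].agg(["sum", "count"])
    encoding = (stats["sum"] + m * prior) / (stats["count"] + m)
    return frame[column].map(encoding)


# Assumed smoothing weight; the card does not report the value used.
df["Main Material Encoded"] = smoothed_target_encode(
    df, "Main Material", "Returned", m=2.0
)
print(df[["Returned", "Product Order Count", "Total Order Value",
          "Main Material", "Main Material Encoded"]])
```

In a real workflow the encoding would normally be computed on the training split only and then mapped onto the test split, so that test labels do not leak into the features.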
A `StandardScaler` fitted only on the training split was applied at the end of the whole process to prepare the data for training, evaluation and inference.

The new dataset contains the following features:

| Feature | Description |
|---|---|
| Product Type | Type of garment corresponding to the product purchased (e.g.: `JACKETS`) |
| Product Subtype | Subtype of garment corresponding to the product purchased (e.g.: `DOUBLE BREASTED JACKETS`) |
| Product Gender | Either `MALE` or `FEMALE` |
| Net Sales (FA) | Value, in Euros, of the product purchased |
| Net Sales Units (FA) | Value describing the number of products purchased or returned (always `1`) |
| Returned | `1` if the product has been returned, `0` otherwise |
| Product Order Count | Number of products belonging to the same order |
| Total Order Value | Sum of the values of all products belonging to the same order, in Euros |
| Main Material | Material the product is mainly made of |
| Colour Return Percentage | Likelihood of the product being returned, based on the colour of the product |
| Total Customer Purchases | Number of purchases made by the user who bought or returned that product |
| Total Customer Returns | Number of returns made by the user who bought or returned that product |
| Customer Return Percentage | Likelihood of the product being returned, based on the customer's behavior |

This new dataset has been split into two files, `train.tsv` and `test.tsv`, using an 80-20 split.

### Personal and Sensitive Information

The dataset does not contain personal or sensitive information. The only reference to customers is the customer ID associated with orders; no sensitive information about customers is included.

## Bias, Risks, and Limitations

The dataset presents an unbalanced distribution of returns and purchases: purchases outnumber returns, which account for only 23% of all records.
Models trained on this dataset could be slightly biased toward the majority class if this imbalance is not taken into account. Additionally, data exploration shows that there is no correlation between features before applying feature engineering.

### Recommendations

Deeper data exploration and further feature engineering are suggested to achieve better training results with this dataset.
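One common way to account for the imbalance described above is to weight each class inversely to its frequency. The sketch below uses the usual "balanced" heuristic, `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the toy label vector are assumptions made for illustration; the card does not say how, or whether, the imbalance was handled.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Hypothetical helper: weight each class by n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels mimicking the card's ratio: 23 returned lines out of 100.
labels = [1] * 23 + [0] * 77
weights = balanced_class_weights(labels)

# The minority class (returned = 1) receives the larger weight.
print(weights)
```

Weights of this kind can usually be passed to a classifier's class-weight or sample-weight option, so that errors on returned products count more during training.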