Sales Dataset Card

This dataset contains 28,692 rows of data related to Emporio Armani's e-commerce order lines. Each row represents an order line, detailing whether the product was returned or not. The dataset includes features such as:

Date and Time Information: Year (Gregorian), month (Gregorian), month name, and the exact purchase date (in date format).
Customer Information: Store ID of the customer associated with the transaction.
Order Line Details: Order number and order line number to uniquely identify each purchase.
Geographical Information: Country where the purchase was made.
Product Details: Variant code (WCS), Item Brand model, Fabric, Color, a combination of model, fabric, and color, and the product composition. Additionally, the dataset specifies the product's top category, type, and subtype, along with its gender target, age range, and a link to the product image.
Return Information: Return reason group and detailed reason for the return (if applicable).
Financial and Quantitative Data: Net sales (value and units), return value, and return units for each transaction.

The dataset is slightly unbalanced, with only 23% of transactions involving returned products.

Uses

The dataset can be used to support a variety of use cases in the field of e-commerce analytics and machine learning. Some of the main intended uses are:

Predictive Modeling: Training and evaluating machine learning models for binary classification tasks, such as predicting the likelihood of a product being returned based on its characteristics and transaction details.
Exploratory Data Analysis (EDA): Analyzing patterns and trends in product returns to identify factors that influence customer behavior, such as product type, composition, or age range.
Feature Engineering: Using the dataset to develop and test new features for predictive models, such as aggregating return reasons or combining product composition data.
Unbalanced Data Research: Studying machine learning techniques and strategies to handle imbalanced datasets, as the target variable is not evenly distributed.

Direct Use

The dataset can be used in the following cases:

Train a binary classification model to predict if a product will be returned
Train a regression model to predict the probability that a product is returned
Train a multi-class classification model to predict the reason for a return

Dataset Structure

The dataset presents the following features:

Feature name	Description
Year Gregorian	Gregorian year of the purchase (e.g.: `2023`)
Month Gregorian	Gregorian month of the purchase (e.g.: `01/2023`, indicating January of the year 2023)
Month Gregorian Name	Abbreviated name of the Gregorian month of the purchase (e.g.: `Jan`)
Date (Date Format)	Date of the purchase (e.g.: `2023-01-02`, indicating the second of January 2023)
Customer Store ID	Numerical code that identifies the user
Order Number	Alphanumerical code that identifies the receipt to which the purchase belongs
Order Line Number	Integer giving the position of the product within the receipt (if the product is part of a receipt with 5 products, its order line number is a value between 1 and 5)
Country	Country in which the product was purchased
Variant WCS	Alternative identifier of the receipt, with 1:1 correspondence with Order Number
Item Brand Model	Alphanumerical code indicating the model of the purchased product
Item Brand Fabric	Alphanumerical code indicating the fabric of the purchased product
Item Brand Colour	Alphanumerical code indicating the colour of the purchased product
Item Brand Model Fabric Colour	Alphanumerical code, the combination of the codes of Model, Fabric, and Colour
Product Composition	Information on the percentage of materials that make up the purchased product (e.g.: `43% COTTON 29% WOOL 28% ACRYLIC`)
Product Top Category	Macrocategory to which the purchased product belongs (e.g.: `READY TO WEAR`)
Product Type	Type of garment corresponding to the product purchased (e.g.: `JACKETS`)
Product Subtype	Subtype of garment corresponding to the product purchased (e.g.: `DOUBLE BREASTED JACKETS`)
Age Range	Value that could be `ADULT`, `JUNIOR` or `BABY`
Product Gender	Value that could be `MALE` or `FEMALE`
Product Image Link	URL of an image of the purchased product (about 20% of the products have no link)
Return Reason Group	Set of 9 values, identified by a string, corresponding to the macrocategory of the product return reason. A `#N/A#` value corresponds to an unreturned product
Return Reason	Set of 26 values, identified by a string, corresponding to the specific category of the product return reason. A `#N/A#` value corresponds to an unreturned product
Net Sales (FA)	Value, in Euros, of the product purchased
Net Sales Units (FA)	Value describing whether the product was returned or not (`-1` means the product was returned, otherwise the value is `1`)
Returns Value (FA)	Same value as the net sales; populated only if the product was returned
Return Units (FA)	Value is `1.0` only if the product is returned, otherwise it is null

Dataset Creation

Data Collection and Processing

A feature engineering pipeline was applied to the dataset as follows:

Added a new column named Returned that contains a flag to identify if a product has been returned based on Return Units (FA) column
Removed Year Gregorian, Month Gregorian, Month Gregorian Name, Country, Age Range, Product Image Link, Returns Value (FA), Return Units (FA), Return Reason Group and Return Reason because they were not useful for training
Removed Variant WCS to remove additional IDs
Added a new column named Product Order Count that gives the number of products in the same order as the selected product, based on Order Number and Order Line Number
Added a new column named Total Order Value that sums the value of every product in the same order, based on the Order Number and Net Sales (FA) columns
Added a new column named Main Material that contains the first material listed in the Product Composition column
Added a new column named Colour Return Percentage that estimates the return likelihood of a product based on its Item Brand Model and Item Brand Colour
- This operation also produced a JSON file for looking up known values from Item Brand Model and Item Brand Colour; when no value is known, a median value based on the Product Top Category of the product is used instead
Added a new column named Total Customer Purchases that gives the number of purchases made within the year by the customer who purchased the product
Added a new column named Total Customer Returns that gives the number of returns made within the year by the customer who purchased the product
Added a new column named Customer Return Percentage that gives the return likelihood for the customer who bought the product
Selected only those rows belonging to READY TO WEAR as Product Top Category
Removed Date (Date Format), Customer Store ID, Order Number, Order Line Number, Item Brand Model, Item Brand Fabric, Item Brand Colour, Item Brand Model Fabric Colour, Product Composition, Product Top Category
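
Several of the steps above can be illustrated with a short sketch. It is not the original pipeline code: it assumes the rows are available as a list of dictionaries keyed by the column names of this card, and the function names `add_order_features` and `main_material` are hypothetical.

```python
import re
from collections import defaultdict


def main_material(composition):
    """Return the first material listed, e.g. '43% COTTON 29% WOOL' -> 'COTTON'."""
    # Assumption: each material is written as '<number>% <NAME>'.
    match = re.search(r"\d+(?:\.\d+)?%\s*([A-Za-z ]+?)(?=\s*\d|$)", composition or "")
    return match.group(1).strip() if match else None


def add_order_features(rows):
    """Return copies of the rows with Returned, order-level and material features."""
    order_count = defaultdict(int)
    order_value = defaultdict(float)
    # First pass: aggregate per order (one row is one order line).
    for row in rows:
        order_count[row["Order Number"]] += 1
        order_value[row["Order Number"]] += row["Net Sales (FA)"]

    enriched = []
    for row in rows:
        new_row = dict(row)
        # Return Units (FA) is 1.0 for returned products and null otherwise.
        new_row["Returned"] = 1 if row["Return Units (FA)"] == 1.0 else 0
        new_row["Product Order Count"] = order_count[row["Order Number"]]
        new_row["Total Order Value"] = order_value[row["Order Number"]]
        new_row["Main Material"] = main_material(row["Product Composition"])
        enriched.append(new_row)
    return enriched
```

With a DataFrame library, the two aggregates would typically be expressed as group-by operations on Order Number; the two-pass loop here only keeps the sketch dependency-free.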

After these operations, all categorical features were converted into numerical ones using target encoding with smoothing, to avoid partial ordering issues during training. Finally, a StandardScaler fitted only on the train split was applied to prepare the data for training, evaluation and inference.
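
The encoding and scaling step can be sketched as follows. The card does not state which smoothing formula or smoothing strength was used, so the additive form below (category mean pulled towards the global mean by a weight `smoothing`) and all function names are assumptions. The point of the sketch is that both the encoder and the scaler learn their statistics from the train split only and are then reused on other data.

```python
from collections import defaultdict
from statistics import mean, pstdev


def fit_target_encoder(categories, targets, smoothing=10.0):
    """Learn a smoothed mean target per category from the train split."""
    global_mean = mean(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Rare categories are shrunk towards the global mean.
    encoding = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return encoding, global_mean


def apply_target_encoder(categories, encoding, global_mean):
    # Categories not seen during fitting fall back to the global mean.
    return [encoding.get(cat, global_mean) for cat in categories]


def fit_standardizer(values):
    """Mean and population standard deviation of a train-split column."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # constant columns are left unscaled
    return mu, sigma


def standardize(values, mu, sigma):
    return [(v - mu) / sigma for v in values]
```

Fitting on the train split and only applying the learned values to the test split avoids leaking target or distribution information from the test data into the features.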

The resulting dataset contains the following features:

Feature	Description
Product Type	Type of garment corresponding to the product purchased (e.g.: `JACKETS`)
Product Subtype	Subtype of garment corresponding to the product purchased (e.g.: `DOUBLE BREASTED JACKETS`)
Product Gender	Value that could be `MALE` or `FEMALE`
Net Sales (FA)	Value, in Euros, of the product purchased
Net Sales Units (FA)	Value describing the number of products purchased or returned (always `1`)
Returned	`1` if the product has been returned, `0` otherwise
Product Order Count	Number of products belonging to the same order
Total Order Value	Sum of every product belonging to the same order, in Euros
Main Material	Material of which the product is mainly made
Colour Return Percentage	Likelihood of the product return based on the colour of the product
Total Customer Purchases	Number of purchases made by the user that bought or returned that product
Total Customer Returns	Number of returns made by the user that bought or returned that product
Customer Return Percentage	Likelihood of the product return based on the customer behavior
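
As an illustration of how Colour Return Percentage and its JSON lookup could work, the sketch below computes a return rate per model and colour pair and a per-category median as the fallback. The JSON layout (`known` and `fallback` keys), the `model|colour` key format, the use of a fraction between 0 and 1, and the function names are all assumptions; the card does not describe the actual file.

```python
import json
from collections import defaultdict
from statistics import median


def _key(model, colour):
    return f"{model}|{colour}"


def build_colour_return_table(rows):
    """Return rate per (model, colour), plus a median rate per top category."""
    counts = defaultdict(lambda: [0, 0])  # key -> [returned, total]
    categories = {}
    for row in rows:
        key = _key(row["Item Brand Model"], row["Item Brand Colour"])
        counts[key][0] += row["Returned"]
        counts[key][1] += 1
        categories[key] = row["Product Top Category"]
    known = {key: returned / total for key, (returned, total) in counts.items()}

    by_category = defaultdict(list)
    for key, rate in known.items():
        by_category[categories[key]].append(rate)
    fallback = {cat: median(rates) for cat, rates in by_category.items()}
    return {"known": known, "fallback": fallback}


def colour_return_percentage(table, model, colour, top_category):
    """Look up a known rate, else the median of the product's top category."""
    key = _key(model, colour)
    if key in table["known"]:
        return table["known"][key]
    return table["fallback"].get(top_category)
```

Because the table holds only strings and numbers, it can be stored with `json.dump` and reloaded with `json.load` at inference time.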

This new dataset has been split into two files, train.tsv and test.tsv, using an 80-20 split.
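
A minimal sketch of such a split is shown below. The card does not say whether the rows were shuffled or stratified, or which seed was used, so the shuffling, the `seed` value and the function names are assumptions.

```python
import csv
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and return (train_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def write_tsv(path, rows, fieldnames):
    """Write a list of dictionaries as a tab-separated file with a header."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
```

Calling `split_rows` once and then `write_tsv` for each part would produce files analogous to train.tsv and test.tsv.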

Personal and Sensitive Information

The dataset does not contain personal or sensitive information. The only reference to customers is the customer ID associated with each order; no sensitive information about customers is included.

Bias, Risks, and Limitations

The dataset has an unbalanced distribution of returns and purchases: returns account for only 23% of all records. Models trained on this dataset could produce slightly biased results if this is not taken into account. Additionally, data exploration shows that there is no correlation between features before applying feature engineering.
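
One common way to account for this imbalance is to weight classes inversely to their frequency. The sketch below is only an illustration with hypothetical function names; it also computes the accuracy of always predicting the majority class, which is the reference point any model trained on this data should beat.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)
```

With 23% returned rows, the returned class receives a weight of about 2.17 and the non-returned class about 0.65, while the majority baseline already reaches 77% accuracy. For this reason, metrics such as precision, recall or F1 on the returned class are usually more informative than accuracy alone.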

Recommendations

Deeper data exploration and feature engineering are suggested to achieve better training results with this dataset.