Sales Dataset Card
This dataset contains 28,692 rows of data related to Emporio Armani's e-commerce order lines. Each row represents an order line, detailing whether the product was returned or not. The dataset includes features such as:
- Date and Time Information: Year (Gregorian), month (Gregorian), month name, and the exact purchase date (in date format).
- Customer Information: Store ID of the customer associated with the transaction.
- Order Line Details: Order number and order line number to uniquely identify each purchase.
- Geographical Information: Country where the purchase was made.
- Product Details: Variant code (WCS), Item Brand model, Fabric, Color, a combination of model, fabric, and color, and the product composition. Additionally, the dataset specifies the product's top category, type, and subtype, along with its gender target, age range, and a link to the product image.
- Return Information: Return reason group and detailed reason for the return (if applicable).
- Financial and Quantitative Data: Net sales (value and units), return value, and return units for each transaction.
The dataset is slightly unbalanced, with only 23% of transactions involving returned products.
Uses
The dataset can be used to support a variety of use cases in the field of e-commerce analytics and machine learning. Some of the main intended uses are:
- Predictive Modeling: Training and evaluating machine learning models for binary classification tasks, such as predicting the likelihood of a product being returned based on its characteristics and transaction details.
- Exploratory Data Analysis (EDA): Analyzing patterns and trends in product returns to identify factors that influence customer behavior, such as product type, composition, or age range.
- Feature Engineering: Using the dataset to develop and test new features for predictive models, such as aggregating return reasons or combining product composition data.
- Unbalanced Data Research: Studying machine learning techniques and strategies to handle imbalanced datasets, as the target variable is not evenly distributed.
Direct Use
The dataset can be used in the following cases:
- Train a binary classification model to predict if a product will be returned
- Train a regression model to predict the probability that a product is returned
- Train a multi-class classification model to predict the reason for a return
Dataset Structure
The dataset presents the following features:
Feature name | Description |
---|---|
Year Gregorian | Gregorian year of the purchase (e.g.: `2023`) |
Month Gregorian | Gregorian month of the purchase (e.g.: `01/2023`, indicating January of the year 2023) |
Month Gregorian Name | Abbreviated Gregorian month name of the purchase (e.g.: `Jan`) |
Date (Date Format) | Date of the purchase (e.g.: `2023-01-02`, indicating 2 January 2023) |
Customer Store ID | Numerical code that identifies the user |
Order Number | Alphanumerical code that identifies the receipt to which the purchase belongs |
Order Line Number | Integer giving the position of the product within the receipt (if the product is part of a receipt containing 5 products, the order line number is a value between 1 and 5) |
Country | Country in which the product was purchased |
Variant WCS | Alternative identifier of the receipt, with 1:1 correspondence with Order Number |
Item Brand Model | Alphanumerical code indicating the model of the purchased product |
Item Brand Fabric | Alphanumerical code indicating the fabric of the purchased product |
Item Brand Colour | Alphanumerical code indicating the colour of the purchased product |
Item Brand Model Fabric Colour | Alphanumerical code, the combination of the codes of Model, Fabric, and Colour |
Product Composition | Percentage of each material that makes up the purchased product (e.g.: `43% COTTON 29% WOOL 28% ACRYLIC`) |
Product Top Category | Macrocategory to which the purchased product belongs (e.g.: `READY TO WEAR`) |
Product Type | Type of garment corresponding to the product purchased (e.g.: `JACKETS`) |
Product Subtype | Subtype of garment corresponding to the product purchased (e.g.: `DOUBLE BREASTED JACKETS`) |
Age Range | One of `ADULT`, `JUNIOR` or `BABY` |
Product Gender | Either `MALE` or `FEMALE` |
Product Image Link | URL of an image of the purchased product (20% of the products do not have a corresponding link) |
Return Reason Group | Set of 9 values, identified by a string, corresponding to the macrocategory of the product return reason. A #N/A# value corresponds to an unreturned product |
Return Reason | Set of 26 values, identified by a string, corresponding to the specific category of the product return reason. A #N/A# value corresponds to an unreturned product |
Net Sales (FA) | Value, in Euros, of the product purchased |
Net Sales Units (FA) | Value describing whether the product was returned or not (-1 means the product is returned, otherwise the value corresponds to 1) |
Returns Value (FA) | Same value as the net sales; populated only if the product is returned |
Return Units (FA) | Value is 1.0 only if the product is returned, otherwise it is null |
Dataset Creation
Data Collection and Processing
A feature engineering pipeline was applied to the dataset as follows:
- Added a new column named `Returned` that contains a flag identifying whether a product has been returned, based on the `Return Units (FA)` column
- Removed `Year Gregorian`, `Month Gregorian`, `Month Gregorian Name`, `Country`, `Age Range`, `Product Image Link`, `Returns Value (FA)`, `Return Units (FA)`, `Return Reason Group` and `Return Reason` because they were not useful for training
- Removed `Variant WCS` to remove additional IDs
- Added a new column named `Product Order Count` that gives the number of products belonging to the same order as the selected product, based on `Order Number` and `Order Line Number`
- Added a new column named `Total Order Value`, the sum of the values of all products belonging to the same order, based on the `Order Number` and `Net Sales (FA)` columns
- Added a new column named `Main Material` which contains the first material listed in the `Product Composition` column
- Added a new column named `Colour Return Percentage` that estimates the return likelihood of a product based on its `Item Brand Model` and `Item Brand Colour`
  - This operation also produced a JSON file used to look up known values from `Item Brand Model` and `Item Brand Colour`; when no known value exists, a median value is obtained from the `Product Top Category` of the product
- Added a new column named `Total Customer Purchases` that gives the number of purchases, within the year, made by the customer who purchased that product
- Added a new column named `Total Customer Returns` that gives the number of returns, within the year, made by the customer who purchased that product
- Added a new column named `Customer Return Percentage` that gives the likelihood of a return by the customer who bought that product
- Selected only those rows with `READY TO WEAR` as `Product Top Category`
- Removed `Date (Date Format)`, `Customer Store ID`, `Order Number`, `Order Line Number`, `Item Brand Model`, `Item Brand Fabric`, `Item Brand Colour`, `Item Brand Model Fabric Colour`, `Product Composition`, `Product Top Category`
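A few of the steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the pipeline actually used: the sample rows are invented, `Product Order Count` is assumed to be the number of order lines sharing an `Order Number`, and `main_material` is a hypothetical helper that takes the first material named in the composition string.

```python
import re
from collections import defaultdict

# Invented sample rows that reuse the column names from the card.
rows = [
    {"Order Number": "A1", "Order Line Number": 1, "Net Sales (FA)": 100.0,
     "Return Units (FA)": None,
     "Product Composition": "43% COTTON 29% WOOL 28% ACRYLIC"},
    {"Order Number": "A1", "Order Line Number": 2, "Net Sales (FA)": 50.0,
     "Return Units (FA)": 1.0,
     "Product Composition": "100% SILK"},
    {"Order Number": "B7", "Order Line Number": 1, "Net Sales (FA)": 80.0,
     "Return Units (FA)": None,
     "Product Composition": "95% COTTON 5% ELASTANE"},
]


def main_material(composition):
    """Hypothetical helper: first material in a '<pct>% <MATERIAL> ...' string."""
    match = re.search(r"\d+(?:\.\d+)?%\s*([^\d%]+)", composition)
    return match.group(1).strip() if match else None


# Aggregate per order: number of order lines and summed net sales.
order_count = defaultdict(int)
order_value = defaultdict(float)
for row in rows:
    order_count[row["Order Number"]] += 1
    order_value[row["Order Number"]] += row["Net Sales (FA)"]

for row in rows:
    # Returned is 1 when Return Units (FA) is populated, 0 when it is null.
    row["Returned"] = 1 if row["Return Units (FA)"] is not None else 0
    row["Product Order Count"] = order_count[row["Order Number"]]
    row["Total Order Value"] = order_value[row["Order Number"]]
    row["Main Material"] = main_material(row["Product Composition"])
```

The same grouping logic maps directly onto a dataframe group-by on `Order Number` if a dataframe library is preferred.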
After these operations, all categorical features were converted into numerical ones using Target Encoding with smoothing, to avoid partial ordering issues during training. Finally, a `StandardScaler` fitted only on the training split was applied to prepare the data for training, evaluation and inference.
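To make the encoding and scaling step concrete, here is a minimal dependency-free sketch. It assumes one common form of smoothed target encoding, `(sum + m * global_mean) / (count + m)`; the smoothing weight `m` and the exact formula used for this dataset are not stated in the card, so both are assumptions. The standardization mimics what a `StandardScaler` does (subtract the mean, divide by the population standard deviation), with statistics computed on the training split only. All data values are invented.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def fit_target_encoding(categories, targets, smoothing=2.0):
    """Learn smoothed per-category target means from training data only."""
    global_mean = fmean(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    mapping = {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }
    return mapping, global_mean


def encode(categories, mapping, global_mean):
    """Unseen categories fall back to the global training mean."""
    return [mapping.get(category, global_mean) for category in categories]


def fit_standardizer(values):
    """Mean and population standard deviation, guarding against zero spread."""
    return fmean(values), pstdev(values) or 1.0


def standardize(values, mean, std):
    return [(value - mean) / std for value in values]


# Invented toy data: Product Type values and the Returned flag.
train_types = ["JACKETS", "JACKETS", "DRESSES", "DRESSES", "DRESSES", "KNITWEAR"]
train_returned = [0, 1, 1, 1, 0, 0]
test_types = ["DRESSES", "SHIRTS"]  # SHIRTS never appears in the training split

mapping, global_mean = fit_target_encoding(train_types, train_returned)
train_encoded = encode(train_types, mapping, global_mean)
test_encoded = encode(test_types, mapping, global_mean)

mean, std = fit_standardizer(train_encoded)         # statistics from train only
train_scaled = standardize(train_encoded, mean, std)
test_scaled = standardize(test_encoded, mean, std)  # reuse the train statistics
```

Fitting both the encoder and the scaler on the training split alone keeps information from the test split out of the features.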
The resulting dataset contains the following features:
Feature | Description |
---|---|
Product Type | Type of garment corresponding to the product purchased (e.g.: `JACKETS`) |
Product Subtype | Subtype of garment corresponding to the product purchased (e.g.: `DOUBLE BREASTED JACKETS`) |
Product Gender | Either `MALE` or `FEMALE` |
Net Sales (FA) | Value, in Euros, of the product purchased |
Net Sales Units (FA) | Value describing the number of products purchased or returned (always `1`) |
Returned | 1 if the product has been returned, 0 otherwise |
Product Order Count | Number of products belonging to the same order |
Total Order Value | Sum of every product belonging to the same order, in Euros |
Main Material | Material the product is mainly made of |
Colour Return Percentage | Likelihood of the product return based on the colour of the product |
Total Customer Purchases | Number of purchases made by the user that bought or returned that product |
Total Customer Returns | Number of returns made by the user that bought or returned that product |
Customer Return Percentage | Likelihood of the product return based on the customer behavior |
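The `Colour Return Percentage` lookup with its median fallback could work roughly as follows. This is a sketch under several assumptions: the layout of the JSON file is not documented, so the `known`/`fallback` structure and the `model|colour` key format are hypothetical; the fallback is assumed to be the median of the known rates within each `Product Top Category`; and rates are expressed as fractions between 0 and 1. The sample rows, including the `ACCESSORIES` category, are invented.

```python
import json
from collections import defaultdict
from statistics import fmean, median

# Invented sample rows; only the column names come from the card.
rows = [
    {"Item Brand Model": "M1", "Item Brand Colour": "C1",
     "Product Top Category": "READY TO WEAR", "Returned": 1},
    {"Item Brand Model": "M1", "Item Brand Colour": "C1",
     "Product Top Category": "READY TO WEAR", "Returned": 0},
    {"Item Brand Model": "M2", "Item Brand Colour": "C9",
     "Product Top Category": "READY TO WEAR", "Returned": 0},
    {"Item Brand Model": "M3", "Item Brand Colour": "C2",
     "Product Top Category": "ACCESSORIES", "Returned": 1},
]

# Return rate for each known (model, colour) pair.
flags = defaultdict(list)
category_of = {}
for row in rows:
    key = f'{row["Item Brand Model"]}|{row["Item Brand Colour"]}'
    flags[key].append(row["Returned"])
    category_of[key] = row["Product Top Category"]
known = {key: fmean(values) for key, values in flags.items()}

# Fallback (assumption): median of the known rates within each top category.
rates_by_category = defaultdict(list)
for key, rate in known.items():
    rates_by_category[category_of[key]].append(rate)
fallback = {cat: median(rates) for cat, rates in rates_by_category.items()}

# Round-trip through JSON to mimic persisting and reloading the lookup file.
lookup = json.loads(json.dumps({"known": known, "fallback": fallback}))


def colour_return_percentage(model, colour, top_category):
    """Known pair -> stored rate; otherwise the category median (or None)."""
    key = f"{model}|{colour}"
    if key in lookup["known"]:
        return lookup["known"][key]
    return lookup["fallback"].get(top_category)
```

At inference time, a call such as `colour_return_percentage("M1", "C1", "READY TO WEAR")` returns the stored rate, while an unseen model/colour pair falls back to the category median.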
This new dataset has been split into two files, `train.tsv` and `test.tsv`, using an 80-20 split.
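For reference, an 80-20 split into tab-separated files can be reproduced along these lines. The card does not say whether the original split was shuffled, seeded or stratified, so the shuffling and the `seed` value here are assumptions, and `split_rows` and `write_tsv` are hypothetical helper names. The rows are invented and the files are written to a temporary directory.

```python
import csv
import random
import tempfile
from pathlib import Path


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and cut off the test fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def write_tsv(path, rows, fieldnames):
    """Write dict rows as a tab-separated file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


# Invented rows with two of the final feature names.
fieldnames = ["Net Sales (FA)", "Returned"]
rows = [{"Net Sales (FA)": 10.0 * i, "Returned": int(i % 4 == 0)} for i in range(10)]

train_rows, test_rows = split_rows(rows)

out_dir = Path(tempfile.mkdtemp())
write_tsv(out_dir / "train.tsv", train_rows, fieldnames)
write_tsv(out_dir / "test.tsv", test_rows, fieldnames)
```

Given the class imbalance, a stratified split that preserves the share of returned rows in both files may be worth considering.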
Personal and Sensitive Information
The dataset does not contain personal or sensitive information. The only reference to customers is the customer ID associated with their orders; no sensitive information about customers is included.
Bias, Risks, and Limitations
The dataset has an unbalanced distribution of purchases and returns: returns account for only 23% of all records. Models trained on this dataset could be slightly biased if this imbalance is not taken into account. Additionally, data exploration shows no correlation between features before feature engineering is applied.
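One common way to account for the imbalance is to weight each class inversely to its frequency. The sketch below uses the heuristic `n_samples / (n_classes * class_count)`, which is the same formula scikit-learn applies for `class_weight="balanced"`. The label list is synthetic and only reproduces the 23% return rate mentioned above; `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Synthetic labels with the 23% return rate reported for the dataset.
labels = [1] * 23 + [0] * 77
weights = balanced_class_weights(labels)
```

With these proportions, returned rows receive a weight of roughly 2.17 and non-returned rows roughly 0.65, so both classes contribute equally to a weighted loss.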
Recommendations
Deeper data exploration and further feature engineering are suggested to achieve better training results with this dataset.