<!-- ---
# For reference on dataset card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/datasets-cards
{{ card_data }}
--- -->

# Sales Dataset Card

<!-- Provide a quick summary of the dataset. -->
This dataset contains 28,692 rows of data related to Emporio Armani's e-commerce order lines. Each row represents an order line, detailing whether the product was returned or not. The dataset includes features such as:

- **Date and Time Information**: Year (Gregorian), month (Gregorian), month name, and the exact purchase date (in date format).
- **Customer Information**: Store ID of the customer associated with the transaction.
- **Order Line Details**: Order number and order line number to uniquely identify each purchase.
- **Geographical Information**: Country where the purchase was made.
- **Product Details**: Variant code (WCS), Item Brand model, Fabric, Color, a combination of model, fabric, and color, and the product composition. Additionally, the dataset specifies the product's top category, type, and subtype, along with its gender target, age range, and a link to the product image.
- **Return Information**: Return reason group and detailed reason for the return (if applicable).
- **Financial and Quantitative Data**: Net sales (value and units), return value, and return units for each transaction.

The dataset is slightly unbalanced, with only 23% of transactions involving returned products.


<!-- ## Dataset Details -->

<!-- ### Dataset Description -->

<!-- Provide a longer summary of what this dataset is. -->

<!-- - **Curated by**: Molinari-Pinto-Tanzi -->
<!-- - **Funded by**: Armani -->
<!-- - **Shared by [optional]:** {{ shared_by | default("[More Information Needed]", true)}}
- **Language(s) (NLP):** {{ language | default("[More Information Needed]", true)}} -->
<!-- - **License:** {{ license | default("[More Information Needed]", true)}} -->

<!-- ## Dataset Sources -->

<!-- Provide the basic links for the dataset. -->

<!-- - **GitHub Repository**: [Product Return Prediction on GitHub](https://github.com/se4ai2425-uniba/product-return-prediction) -->
<!-- - **DagsHub Repository**: [Product Return Prediction on DagsHub](https://dagshub.com/se4ai2425-uniba/product-return-prediction) -->
<!-- - **Demo [optional]:** {{ demo | default("[More Information Needed]", true)}} -->

## Uses

<!-- Address questions around how the dataset is intended to be used. -->
The dataset can be used to support a variety of use cases in the field of e-commerce analytics and machine learning. Some of the main intended uses are:

- **Predictive Modeling**: Training and evaluating machine learning models for binary classification tasks, such as predicting the likelihood of a product being returned based on its characteristics and transaction details.
- **Exploratory Data Analysis (EDA)**: Analyzing patterns and trends in product returns to identify factors that influence customer behavior, such as product type, composition, or age range.
- **Feature Engineering**: Using the dataset to develop and test new features for predictive models, such as aggregating return reasons or combining product composition data.
- **Unbalanced Data Research**: Studying machine learning techniques and strategies to handle imbalanced datasets, as the target variable is not evenly distributed.

### Direct Use

<!-- This section describes suitable use cases for the dataset. -->
The dataset can be used in the following cases:
- Train a **binary classification** model to predict if a product will be returned
- Train a **regression** model to predict the probability that a product is returned
- Train a **multi-class classification** model to predict the reason for a return


## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

The dataset presents the following features:

| Feature name | Description |
| --- | --- |
| Year Gregorian | Gregorian year of the purchase (e.g.: `2023`) |
| Month Gregorian | Gregorian month of the purchase (e.g.: `01/2023`, indicating January of the year 2023) |
| Month Gregorian Name | Gregorian name abbreviation of the month of the purchase (e.g.: `Jan`) |
| Date (Date Format) | Date of the purchase (e.g.: `2023-01-02`, indicating the second of January 2023) |
| Customer Store ID | Numerical code that identifies the user |
| Order Number | Alphanumerical code that identifies the receipt to which the purchase belongs |
| Order Line Number | Integer giving the position of the product within the receipt (if the product is part of a receipt with 5 products, its order line number is a value between 1 and 5) |
| Country | Country in which the product was purchased |
| Variant WCS | Alternative identifier of the receipt, with 1:1 correspondence with Order Number |
| Item Brand Model | Alphanumerical code indicating the model of the purchased product |
| Item Brand Fabric | Alphanumerical code indicating the fabric of the purchased product |
| Item Brand Colour | Alphanumerical code indicating the colour of the purchased product |
| Item Brand Model Fabric Colour | Alphanumerical code, the combination of the codes of Model, Fabric, and Colour |
| Product Composition | Information on the percentage of materials that make up the purchased product (e.g.: `43% COTTON 29% WOOL 28% ACRYLIC`) |
| Product Top Category | Macrocategory to which the purchased product belongs (e.g.: `READY TO WEAR`) |
| Product Type | Type of garment corresponding to the product purchased (e.g.: `JACKETS`) |
| Product Subtype | Subtype of garment corresponding to the product purchased (e.g.: `DOUBLE BREASTED JACKETS`) |
| Age Range | Value that could be `ADULT`, `JUNIOR` or `BABY` |
| Product Gender | Value that could be `MALE` or `FEMALE` |
| Product Image Link | URL of an image of the purchased product (20% of the products have no corresponding link) |
| Return Reason Group | Set of 9 values, identified by a string, corresponding to the macrocategory of the product return reason. A `#N/A#` value corresponds to an unreturned product |
| Return Reason | Set of 26 values, identified by a string, corresponding to the specific category of the product return reason. A `#N/A#` value corresponds to an unreturned product |
| Net Sales (FA) | Value, in Euros, of the product purchased |
| Net Sales Units (FA) | Value describing whether the product was returned or not (`-1` if the product was returned, `1` otherwise) |
| Returns Value (FA) | Same value as the net sales; populated only if the product is returned |
| Return Units (FA) | Value is `1.0` only if the product is returned, otherwise it is null |

## Dataset Creation

### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

The following feature engineering pipeline was applied to the dataset:

1. Added a new column named `Returned` that contains a flag to identify if a product has been returned based on `Return Units (FA)` column
2. Removed `Year Gregorian`, `Month Gregorian`, `Month Gregorian Name`, `Country`, `Age Range`, `Product Image Link`, `Returns Value (FA)`, `Returns Units (FA)`, `Return Reason Group` and `Return Reason` because they were not useful for training
3. Removed `Variant WCS` to remove additional IDs
4. Added a new column named `Product Order Count` that tells the number of products belonging to the same order as the selected product based on `Order Number` and `Order Line Number`
5. Added a new column named `Total Order Value` performing the sum of every product belonging to the same order based on `Order Number` and `Net Sales (FA)` columns
6. Added a new column named `Main Material` which contains the first material that can be found in the `Product Composition` column
7. Added a new column named `Colour Return Percentage` that estimates the return likelihood of a product based on its `Item Brand Model` and `Item Brand Colour`
   - This operation also produced a JSON file used to look up known values by `Item Brand Model` and `Item Brand Colour`; when no value is known, a median value based on the product's `Product Top Category` is used instead
8. Added a new column named `Total Customer Purchases` that gives the number of purchases made, within the year, by the customer who purchased that product
9. Added a new column named `Total Customer Returns` that gives the number of returns made, within the year, by the customer who purchased that product
10. Added a new column named `Customer Return Percentage` that estimates the return likelihood based on the past returns of the customer who bought that product
11. Selected only those rows belonging to `READY TO WEAR` as `Product Top Category`
12. Removed `Date (Date Format)`, `Customer Store ID`, `Order Number`, `Order Line Number`, `Item Brand Model`, `Item Brand Fabric`, `Item Brand Colour`, `Item Brand Model Fabric Colour`, `Product Composition`, `Product Top Category`
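
As an illustration only, steps 1, 4, 5 and 6 above could be sketched with pandas as follows. This is a minimal sketch, not the pipeline's actual code: the sample rows are made up, `add_basic_features` is a hypothetical helper name, and the regular expression assumes that `Product Composition` always follows the `NN% MATERIAL` pattern shown in the feature table.

```python
import pandas as pd

# Made-up sample rows; only the column names come from the dataset card.
raw = pd.DataFrame({
    "Order Number": ["A1", "A1", "B2"],
    "Order Line Number": [1, 2, 1],
    "Net Sales (FA)": [120.0, 80.0, 200.0],
    "Product Composition": [
        "43% COTTON 29% WOOL 28% ACRYLIC",
        "100% SILK",
        "60% WOOL 40% POLYESTER",
    ],
    "Return Units (FA)": [1.0, None, None],
})


def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper covering steps 1, 4, 5 and 6 of the list above."""
    out = df.copy()
    # Step 1: a row is flagged as returned when `Return Units (FA)` is not null.
    out["Returned"] = out["Return Units (FA)"].notna().astype(int)
    # Step 4: number of order lines that share the same order number.
    out["Product Order Count"] = (
        out.groupby("Order Number")["Order Line Number"].transform("count")
    )
    # Step 5: sum of net sales over all lines of the same order.
    out["Total Order Value"] = (
        out.groupby("Order Number")["Net Sales (FA)"].transform("sum")
    )
    # Step 6 (assumed format): the material name after the first percentage.
    out["Main Material"] = out["Product Composition"].str.extract(
        r"\d+%\s*([A-Za-z ]+?)\s*(?:\d|$)", expand=False
    )
    return out


features = add_basic_features(raw)
print(features[["Returned", "Product Order Count", "Total Order Value", "Main Material"]])
```

Using `transform` rather than `agg` keeps one value per order line, so the new columns align with the original rows without a merge.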

After these operations, all categorical features were converted into numerical ones using **Target Encoding** with smoothing, so that no artificial ordering is imposed on the categories during training. A `StandardScaler` fitted only on the training split was then applied to prepare the data for training, evaluation and inference.
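
The card does not state which smoothing formula was used. The sketch below shows one common variant, additive (m-estimate) smoothing, where each category mean is pulled towards the global mean. The function names, the parameter `m` and the sample values are assumptions made for illustration. The encoding is fitted on the training rows only, in line with how the `StandardScaler` is described above.

```python
import pandas as pd


def fit_target_encoding(categories: pd.Series, target: pd.Series, m: float = 10.0):
    """Return a {category: smoothed mean} mapping and the global mean.

    Smoothed mean = (sum of targets + m * global mean) / (count + m).
    `m` is a hypothetical smoothing strength, not a value from the card.
    """
    global_mean = target.mean()
    stats = target.groupby(categories).agg(["sum", "count"])
    smoothed = (stats["sum"] + m * global_mean) / (stats["count"] + m)
    return smoothed.to_dict(), global_mean


def apply_target_encoding(categories: pd.Series, encoding: dict, fallback: float) -> pd.Series:
    # Categories unseen at fit time fall back to the global mean.
    return categories.map(encoding).fillna(fallback)


# Made-up training and test rows.
train = pd.DataFrame({
    "Product Type": ["JACKETS", "JACKETS", "JACKETS", "SHIRTS"],
    "Returned": [1, 1, 0, 0],
})
test = pd.DataFrame({"Product Type": ["SHIRTS", "COATS"]})

# Fit on the training split only, then reuse the mapping on the test split.
encoding, prior = fit_target_encoding(train["Product Type"], train["Returned"], m=2.0)
test_encoded = apply_target_encoding(test["Product Type"], encoding, prior)
print(encoding, test_encoded.tolist())
```

Fitting on the training split alone avoids leaking test labels into the encoded features.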

The new dataset contains the following features:

| Feature | Description |
|---|---|
| Product Type | Type of garment corresponding to the product purchased (e.g.: `JACKETS`) |
| Product Subtype | Subtype of garment corresponding to the product purchased (e.g.: `DOUBLE BREASTED JACKETS`) |
| Product Gender | Value that could be `MALE` or `FEMALE` |
| Net Sales (FA) | Value, in Euros, of the product purchased |
| Net Sales Units (FA) | Value describing the number of products purchased or returned (always `1`) |
| Returned | `1` if the product has been returned, `0` otherwise |
| Product Order Count | Number of products belonging to the same order |
| Total Order Value | Sum of every product belonging to the same order, in Euros |
| Main Material | Material the product is mainly made of |
| Colour Return Percentage | Likelihood of the product return based on the colour of the product |
| Total Customer Purchases | Number of purchases made by the user that bought or returned that product |
| Total Customer Returns | Number of returns made by the user that bought or returned that product |
| Customer Return Percentage | Likelihood of the product return based on the customer behavior |

This new dataset was split into two files, `train.tsv` and `test.tsv`, using an 80-20 split.
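
The card does not say whether the split was shuffled or stratified. As a hedged sketch, a shuffled 80-20 split written to tab-separated files might look like this; `split_80_20`, the seed and the sample data are assumptions, and a temporary directory stands in for the real output location.

```python
import tempfile
from pathlib import Path

import pandas as pd


def split_80_20(df: pd.DataFrame, seed: int = 42):
    """Shuffle the rows, then cut at 80% (hypothetical helper)."""
    shuffled = df.sample(frac=1.0, random_state=seed)
    cut = int(len(shuffled) * 0.8)
    return shuffled.iloc[:cut], shuffled.iloc[cut:]


# Made-up data with two of the processed columns.
data = pd.DataFrame({
    "Net Sales (FA)": [float(i) for i in range(10)],
    "Returned": [0, 0, 0, 1, 0, 0, 0, 1, 0, 1],
})

train, test = split_80_20(data)

# Write and reload as TSV; `sep="\t"` is needed on both sides.
out_dir = Path(tempfile.mkdtemp())
train.to_csv(out_dir / "train.tsv", sep="\t", index=False)
test.to_csv(out_dir / "test.tsv", sep="\t", index=False)
reloaded = pd.read_csv(out_dir / "train.tsv", sep="\t")
print(len(train), len(test), list(reloaded.columns))
```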


### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
The dataset does not contain personal or sensitive information. The only reference to customers is the customer ID associated with orders; no sensitive information about customers is included.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

The dataset has an unbalanced distribution of returns and purchases: returns account for only 23% of all records. Models trained on this dataset could be slightly biased towards the majority class if this imbalance is not taken into account. Additionally, data exploration shows no correlation between features before feature engineering is applied.
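
One common way to account for the imbalance is to weight each class inversely to its frequency. The sketch below computes such weights with the usual "balanced" heuristic, `n_samples / (n_classes * class_count)`. The helper name is hypothetical, and the labels are synthetic, built only to reproduce the 23% return rate stated above.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Synthetic labels: 23 returned (1) and 77 not returned (0).
labels = [1] * 23 + [0] * 77
weights = balanced_class_weights(labels)
print(weights)
```

The resulting dictionary can be passed to estimators that accept per-class weights, so that errors on returned products count more during training.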


### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Deeper data exploration and further feature engineering are suggested to achieve better training results with this dataset.