---
license: mit
language:
- gl
---

<div align="justify">

**Galician text [here](https://huggingface.co/proxectonos/Nos_D2T-gl/edit/main/README.md)**

**Model description**

Fine-tuned text-to-text model for the data-to-text task in Galician. Starting from the [MT5-base](https://huggingface.co/google/mt5-base) multilingual pre-trained model, fine-tuning was applied to obtain a model able to generate textual descriptions from structured data in Galician.

**How to generate texts from tabular data**
+ Open a bash terminal
+ Install [Python 3.10](https://www.python.org/downloads/release/python-3104/)
+ To generate a text from any of the 568 test records of our dataset, run the following command:
```bash
python generate_text.py -i <data_id> -o <output_path>
```
+ The **-i** argument indicates the ID of the table from which the model should generate a text. Only IDs from 0 to 569 are valid, as these correspond to records from the test partition, i.e., examples the model was not trained on.
+ The **-o** argument indicates the path where the file with the generated text will be created. If not provided, the file is created in the current directory.
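+ For instance, `python generate_text.py -i 42 -o ./outputs` would generate the description for test record 42 (the ID here is only an example) and write the resulting file under the `./outputs` path.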

**Training**

The model was trained on the first data-to-text dataset in Galician, released by Proxecto Nós and available in the following repository: https://zenodo.org/record/7661650#.ZA8aZ3bMJro. The dataset is made up of 3,302 records of meteorological prediction tabular data, each paired with a handwritten textual description in Galician.
The chosen base model takes text as input rather than structured data, so we applied a "linearization" process to our tabular data before using it to fine-tune the model. This process consists of transforming the data tables of our dataset into a labelled text format, as follows:

<img src="https://huggingface.co/proxectonos/Nos_D2T-gl/resolve/main/exemplo-linearizacion.png" width="75%" style="margin:auto;">

This way, the text-to-text model can understand the content of the data tables and their related texts in order to learn how to generate descriptive texts from new data tables.
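For illustration, a minimal sketch of such a linearization step is shown below. The field names, values, and `<label> value` format are illustrative assumptions and do not reproduce the exact format used for this model (see the image above for the actual format):

```python
# Minimal linearization sketch: field names, values, and the "<label> value"
# format are illustrative assumptions, not the exact format of the dataset.

def linearize(record: dict) -> str:
    """Turn a flat dict of table fields into a labelled text sequence."""
    return " ".join(f"<{key}> {value}" for key, value in record.items())

example_record = {
    "localidade": "Santiago de Compostela",
    "data": "2023-02-15",
    "ceo": "nubrado",
    "temperatura_min": 6,
    "temperatura_max": 12,
}

print(linearize(example_record))
# <localidade> Santiago de Compostela <data> 2023-02-15 <ceo> nubrado <temperatura_min> 6 <temperatura_max> 12
```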

To fine-tune the base model, the following hyperparameters were used during training (a sketch of the setup is given after the list):
+ Batch size: 8
+ Optimizer: Adam
+ Learning rate: 1e-5
+ Training epochs: 1000
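
As a reference, here is a minimal sketch of how such a fine-tuning setup could look with the Hugging Face `transformers` library. The dataset content, tokenization details, and output path are illustrative assumptions, and the actual training script used for this model may differ:

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

model_name = "google/mt5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy training pair: a linearized data table and its Galician description
# (placeholder content, not taken from the real dataset)
raw = Dataset.from_dict({
    "source": ["<localidade> Santiago <ceo> nubrado <temperatura_max> 12"],
    "target": ["Ceo nubrado en Santiago, con temperaturas máximas de 12 graos."],
})

def tokenize(batch):
    model_inputs = tokenizer(batch["source"], truncation=True)
    labels = tokenizer(text_target=batch["target"], truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_dataset = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="nos_d2t_gl",        # illustrative output path
    per_device_train_batch_size=8,  # batch size: 8
    learning_rate=1e-5,             # learning rate: 1e-5
    num_train_epochs=1000,          # training epochs: 1000
    # transformers optimizes with AdamW by default (a variant of Adam)
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```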

**Model License** 

MIT License

Copyright (c) 2023 Proxecto Nós

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

**Funding**

This research was funded by the project "Nós: o galego na sociedade e economía da intelixencia artificial", the result of an agreement between the Xunta de Galicia and the Universidade de Santiago de Compostela, and by grants ED431G2019/04 and ED431C2022/19 from the Consellaría de Educación, Universidade e Formación Profesional and the European Regional Development Fund (ERDF/FEDER programme).

</div>

**Citation**

If you use this model in your work, please cite as follows:

González Corbelle, Javier; Bugarín Diz, Alberto. 2023. Nos_D2T-gl. URL: https://huggingface.co/proxectonos/Nos_D2T-gl/