Upload 17 files

Files changed (14)

AIRA_FineTuning.ipynb +0 -0
Aira_emissions.csv +1 -1
README.md +28 -37
added_tokens.json +4 -2
config.json +2 -2
generation_config.json +1 -1
model.safetensors +2 -2
optimizer.pt +3 -0
pytorch_model.bin +2 -2
rng_state.pt +3 -0
scheduler.pt +3 -0
special_tokens_map.json +3 -2
tokenizer_config.json +3 -2
training_stats.parquet +2 -2

AIRA_FineTuning.ipynb CHANGED

The diff for this file is too large to render.

Aira_emissions.csv CHANGED

	@@ -1,2 +1,2 @@
1	timestamp,project_name,run_id,duration,emissions,emissions_rate,cpu_power,gpu_power,ram_power,cpu_energy,gpu_energy,ram_energy,energy_consumed,country_name,country_iso_code,region,cloud_provider,cloud_region,os,python_version,codecarbon_version,cpu_count,cpu_model,gpu_count,gpu_model,longitude,latitude,ram_total_size,tracking_mode,on_cloud,pue
2	- 2023-06-26T22:38:01,Aira_emissions,bd08affb-b1e2-4849-8513-a85a02cf0f84,3690.1905386447906,0.0009893192359507477,2.6809435057358087e-07,42.5,296.394,31.30528450012207,0.04356464091208248,0.34052867170535045,0.03207338637952947,0.41616669899696207,Canada,CAN,quebec,,,Linux-5.15.107+-x86_64-with-glibc2.31,3.10.12,2.2.4,12,Intel(R) Xeon(R) CPU @ 2.20GHz,1,1 x NVIDIA A100-SXM4-40GB,-71.2,46.8,83.48075866699219,machine,N,1.0


1	timestamp,project_name,run_id,duration,emissions,emissions_rate,cpu_power,gpu_power,ram_power,cpu_energy,gpu_energy,ram_energy,energy_consumed,country_name,country_iso_code,region,cloud_provider,cloud_region,os,python_version,codecarbon_version,cpu_count,cpu_model,gpu_count,gpu_model,longitude,latitude,ram_total_size,tracking_mode,on_cloud,pue
2	+ 2023-09-05T14:47:28,Aira_emissions,cbb40a59-d1e6-4364-8dc4-b409f729cfa6,6527.2822506427765,0.35849483705483043,5.492252721559995e-05,42.5,333.6809648058492,31.305280208587646,0.077058052478234,0.600908513781982,0.056748852031142355,0.7347154182913581,Singapore,SGP,,,,Linux-5.15.109+-x86_64-with-glibc2.35,3.10.12,2.3.1,12,Intel(R) Xeon(R) CPU @ 2.20GHz,1,1 x NVIDIA A100-SXM4-40GB,103.8547,1.2929,83.48074722290039,machine,N,1.0
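
The derived columns in the added row can be cross-checked against the raw ones. The sketch below assumes the usual CodeCarbon meaning of the fields (energy in kWh, emissions in kg CO2eq, duration in seconds, rate in kg per second); all numbers are copied from the added row above, and the variable names are only illustrative.

```python
import math

# Values copied from the added row of Aira_emissions.csv in this diff.
duration_s = 6527.2822506427765
emissions_kg = 0.35849483705483043
emissions_rate = 5.492252721559995e-05
cpu_energy = 0.077058052478234
gpu_energy = 0.600908513781982
ram_energy = 0.056748852031142355
energy_consumed = 0.7347154182913581

# Assumption: energy_consumed is the sum of the three per-device energies.
total_energy = cpu_energy + gpu_energy + ram_energy
energy_matches = math.isclose(total_energy, energy_consumed, rel_tol=1e-9)

# Assumption: emissions_rate is emissions divided by duration.
rate = emissions_kg / duration_s
rate_matches = math.isclose(rate, emissions_rate, rel_tol=1e-6)

# The README figures (0.35 and 0.73) correspond to truncating,
# not rounding, these values to two decimals.
emissions_truncated = math.floor(emissions_kg * 100) / 100
energy_truncated = math.floor(energy_consumed * 100) / 100

print(energy_matches, rate_matches, emissions_truncated, energy_truncated)
```

Under these assumptions both checks hold for the new row, which suggests the row was written intact.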

README.md CHANGED

@@ -1,14 +1,11 @@
 ---
 license: apache-2.0
 datasets:
-- Dahoas/synthetic-instruct-gptj-pairwise
-- databricks/databricks-dolly-15k
-- HuggingFaceH4/instruction-dataset
 - nicholasKluge/instruct-aira-dataset
 language:
 - pt
 metrics:
-- bleu
 library_name: transformers
 tags:
 - alignment
@@ -18,15 +15,13 @@ tags:
 - assistant
 pipeline_tag: text-generation
 widget:
-- text: <|startoftext|>Olá! Qual o seu nome?<|endoftext|>
   example_title: Olá
-- text: >-
-    <|startoftext|>Você pode me explicar o que é aprendizagem de
-    máquina?<|endoftext|>
-  example_title: Aprendizagem de máquina
-- text: <|startoftext|>Você sabe alguma coisa sobre ética das virtudes<|endoftext|>
-  example_title: Ética das virtudes
-- text: <|startoftext|>O que posso fazer para alegrar minha namorada?<|endoftext|>
   example_title: Conselho
 inference:
   parameters:
@@ -37,14 +32,18 @@ inference:
     max_length: 200
     length_penalty: 0.3
     early_stopping: true
 ---
-# Aira-Instruct-PT-124M (Portuguese)
-`Aira-Instruct-PT-124M` is a instruction-tuned GPT-style model based on [GPT-2](https://huggingface.co/pierreguillou/gpt2-small-portuguese). The model was trained with a dataset composed of `prompt`, `completions`, generated via the [Self-Instruct](https://github.com/yizhongw/self-instruct) framework. `Aira-Instruct-PT-124M` instruction-tuning was achieved via conditional text generation.
-The dataset used to train this model combines the following sources of data: the [`synthetic-instruct-gptj-pairwise`](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise) dataset, the [`databricks_dolly_15k`](https://huggingface.co/datasets/HuggingFaceH4/databricks_dolly_15k) dataset, the [`instruction-dataset`](https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset) dataset, and a subset of [Aira's](https://github.com/Nkluge-correa/Aira-EXPERT) fine-tuning dataset, focused on Q&A related to Ethics, AI, AI safety, and other related topics. The dataset is available in both Portuguese and English.
-Check our gradio-demo in [Spaces](https://huggingface.co/spaces/nicholasKluge/Aira-Demo).
 ## Details
@@ -52,27 +51,19 @@ Check our gradio-demo in [Spaces](https://huggingface.co/spaces/nicholasKluge/Ai
 - **Dataset:** [Instruct-Aira Dataset](https://huggingface.co/datasets/nicholasKluge/instruct-aira-dataset)
 - **Language:** Portuguese
 - **Number of Epochs:** 5
-- **Batch size:** 32
 - **Optimizer:** `torch.optim.AdamW` (warmup_steps = 1e2, learning_rate = 5e-4, epsilon = 1e-8)
 - **GPU:** 1 NVIDIA A100-SXM4-40GB
-- **Emissions:** 0.0009 KgCO2 (Canada)
-- **Total Energy Consumption:** 0.41 kWh
-| Epoch/Loss|Training|Validation|
-|---|---|---|
-| 1 |0.947100|0.774946|
-| 2 |0.737357|0.730962|
-| 3 |0.657410|0.710232|
-| 4 |0.597437|0.705064|
-| 5 |0.551684|0.704830|
-This repository has the notebook used to train this model.
 ## Usage
-Two special tokens are used to mark the user side of the interaction and the model's response:
-`<|startoftext|>`What is a language model?`<|endoftext|>`A language model is a probability distribution over a vocabulary.`<|endoftext|>`
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -80,8 +71,8 @@ import torch
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-tokenizer = AutoTokenizer.from_pretrained('nicholasKluge/Aira-Instruct-PT-124M')
-aira = AutoModelForCausalLM.from_pretrained('nicholasKluge/Aira-Instruct-PT-124M')
 aira.eval()
 aira.to(device)
@@ -110,10 +101,10 @@ for i, response in  enumerate(responses):
 The model will output something like:
 ```markdown
->>> Question: 👤 Olá! Como você se chama?
->>>Response 1: 🤖 Olá! Meu nome é Aira e sou um chatbot projetado para conversar sobre Ética e Segurança da IA. Se você precisar de ajuda com um assunto diferente, por favor, peça "ajuda".
->>>Response 2: 🤖 Olá! Meu nome é Aira e sou um chatbot treinado para responder perguntas sobre Ética e Segurança da IA. Se você precisar de ajuda para navegar em nossa conversa, não hesite em pedir ajuda.
 ```
 ## Limitations
@@ -140,4 +131,4 @@ The model will output something like:
 ## License
-The `Aira-Instruct-PT-124M` is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.

 ---
 license: apache-2.0
 datasets:
 - nicholasKluge/instruct-aira-dataset
 language:
 - pt
 metrics:
+- accuracy
 library_name: transformers
 tags:
 - alignment
 - assistant
 pipeline_tag: text-generation
 widget:
+- text: "<|startofinstruction|>Olá! Como você se chama?<|endofinstruction|>"
   example_title: Olá
+- text: "<|startofinstruction|>Você pode me explicar o que é Aprendizagem de Máquina?<|endofinstruction|>"
+  example_title: Aprendizagem de Máquina
+- text: "<|startofinstruction|>Você sabe alguma coisa sobre Ética das Virtudes?<|endofinstruction|>"
+  example_title: Ética
+- text: "<|startofinstruction|>Como eu posso fazer a minha namorada feliz?<|endofinstruction|>"
   example_title: Conselho
 inference:
   parameters:
     max_length: 200
     length_penalty: 0.3
     early_stopping: true
+co2_eq_emissions:
+  emissions: 0.35
+  source: CodeCarbon
+  training_type: fine-tuning
+  geographical_location: Singapore
+  hardware_used: NVIDIA A100-SXM4-40GB
 ---
+# Aira-2-portuguese-124M
+`Aira-2-portuguese-124M` is the second version of the Aira instruction-tuned series. Aira is an instruction-tuned GPT-style model based on [GPT-2](https://huggingface.co/pierreguillou/gpt2-small-portuguese). The model was trained on a dataset of prompt-completion pairs generated synthetically by prompting already-tuned models (ChatGPT, Llama, Open-Assistant, etc.).
+Check out our Gradio demo in [Spaces](https://huggingface.co/spaces/nicholasKluge/Aira-Demo-Portuguese).
 ## Details
 - **Dataset:** [Instruct-Aira Dataset](https://huggingface.co/datasets/nicholasKluge/instruct-aira-dataset)
 - **Language:** Portuguese
 - **Number of Epochs:** 5
+- **Batch size:** 24
 - **Optimizer:** `torch.optim.AdamW` (warmup_steps = 1e2, learning_rate = 5e-4, epsilon = 1e-8)
 - **GPU:** 1 NVIDIA A100-SXM4-40GB
+- **Emissions:** 0.35 KgCO2 (Singapore)
+- **Total Energy Consumption:** 0.73 kWh
+This repository has the [notebook](AIRA_FineTuning.ipynb) used to train this model.
 ## Usage
+Three special tokens are used to mark the user side of the interaction and the model's response:
+`<|startofinstruction|>`O que é um modelo de linguagem?`<|endofinstruction|>`Um modelo de linguagem é uma distribuição de probabilidade sobre um vocabulário.`<|endofcompletion|>`
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+tokenizer = AutoTokenizer.from_pretrained('nicholasKluge/Aira-2-portuguese-124M')
+aira = AutoModelForCausalLM.from_pretrained('nicholasKluge/Aira-2-portuguese-124M')
 aira.eval()
 aira.to(device)
 The model will output something like:
 ```markdown
+>>> Question: 👤 Qual a capital do Brasil?
+>>>Response 1: 🤖 A capital do Brasil é Brasília.
+>>>Response 2: 🤖 A capital do Brasil é Brasília.
 ```
 ## Limitations
 ## License
+The `Aira-2-portuguese-124M` is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.
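
The prompt layout described in the new Usage section (instruction wrapped in `<|startofinstruction|>` and `<|endofinstruction|>`, completion closed by `<|endofcompletion|>`) can be sketched with plain string handling. This is a minimal illustration, not the notebook's code; the helper names `build_prompt`, `build_training_example`, and `extract_completion` are hypothetical.

```python
# Special tokens as listed in the updated README and tokenizer files.
BOS = "<|startofinstruction|>"
SEP = "<|endofinstruction|>"
EOS = "<|endofcompletion|>"


def build_prompt(instruction: str) -> str:
    """Wrap a user instruction the way the README example does."""
    return f"{BOS}{instruction}{SEP}"


def build_training_example(instruction: str, completion: str) -> str:
    """Full sequence: instruction, then completion, then the end marker."""
    return f"{BOS}{instruction}{SEP}{completion}{EOS}"


def extract_completion(decoded: str) -> str:
    """Keep only the text between the separator and the end marker."""
    after_sep = decoded.split(SEP, 1)[-1]
    return after_sep.split(EOS, 1)[0].strip()


example = build_training_example(
    "Qual a capital do Brasil?", "A capital do Brasil é Brasília."
)
print(build_prompt("Qual a capital do Brasil?"))
print(extract_completion(example))
```

A string built by `build_prompt` is what would be passed to the tokenizer before generation; `extract_completion` is one possible way to post-process decoded output that still contains the special tokens.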

added_tokens.json CHANGED

@@ -1,4 +1,6 @@
 {
-  "<|pad|>": 50258,
-  "<|startoftext|>": 50257
 }

 {
+  "<|endofcompletion|>": 50258,
+  "<|endofinstruction|>": 50259,
+  "<|pad|>": 50260,
+  "<|startofinstruction|>": 50257
 }
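
The new token ids line up with the `vocab_size` change in `config.json` (50259 to 50261). A small check, assuming the base GPT-2 vocabulary has 50257 entries (ids 0 to 50256, consistent with `eos_token_id: 50256` in `generation_config.json`):

```python
# Token ids copied from the new added_tokens.json in this diff.
added_tokens = {
    "<|endofcompletion|>": 50258,
    "<|endofinstruction|>": 50259,
    "<|pad|>": 50260,
    "<|startofinstruction|>": 50257,
}

# Assumption: the base vocabulary holds 50257 entries (ids 0..50256).
BASE_VOCAB_SIZE = 50257

# The added ids should directly follow the base vocabulary without gaps.
ids = sorted(added_tokens.values())
expected_ids = list(range(BASE_VOCAB_SIZE, BASE_VOCAB_SIZE + len(added_tokens)))
ids_are_contiguous = ids == expected_ids

# Resulting vocabulary size, to compare with config.json.
vocab_size = BASE_VOCAB_SIZE + len(added_tokens)

print(ids_are_contiguous, vocab_size)
```

Under that assumption the four added tokens give a vocabulary of 50261 entries, matching the new `config.json`.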

config.json CHANGED

@@ -33,7 +33,7 @@
     }
   },
   "torch_dtype": "float32",
-  "transformers_version": "4.30.2",
   "use_cache": true,
-  "vocab_size": 50259
 }

     }
   },
   "torch_dtype": "float32",
+  "transformers_version": "4.33.0",
   "use_cache": true,
+  "vocab_size": 50261
 }

generation_config.json CHANGED

@@ -2,5 +2,5 @@
   "_from_model_config": true,
   "bos_token_id": 50256,
   "eos_token_id": 50256,
-  "transformers_version": "4.30.2"
 }

   "_from_model_config": true,
   "bos_token_id": 50256,
   "eos_token_id": 50256,
+  "transformers_version": "4.33.0"
 }

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:3905b921577090aec25f844390d7ac19ffc510e41a4d6b8c08168316af58dbae
-size 497780352

 version https://git-lfs.github.com/spec/v1
+oid sha256:531f4ef9cb99dc3c7d2548cd1248f64d890303eb03de562861312c2b86844790
+size 497786496
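
The small growth in checkpoint size is consistent with the two extra vocabulary entries. The sketch below assumes a hidden size of 768 (the usual value for a 124M GPT-2 model, not shown in this diff) and that only the float32 token-embedding matrix grew, stored once because input and output embeddings are tied.

```python
# File sizes copied from the model.safetensors LFS pointers in this diff.
old_size = 497780352
new_size = 497786496

# vocab_size values copied from the config.json diff.
old_vocab_size = 50259
new_vocab_size = 50261

# Assumptions (not stated in the diff): hidden size 768, float32 weights.
HIDDEN_SIZE = 768
BYTES_PER_FLOAT32 = 4

new_rows = new_vocab_size - old_vocab_size
expected_growth = new_rows * HIDDEN_SIZE * BYTES_PER_FLOAT32
actual_growth = new_size - old_size

print(expected_growth, actual_growth)
```

With these assumptions the expected and actual growth are both 6144 bytes; `pytorch_model.bin` grows by the same amount (497813341 to 497819485).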

optimizer.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0ef28bdf2e6ad48551faa7696cafc16e0c5e3e9c2f51cca544a0fc3b11523a1f
+size 649096325

pytorch_model.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7f01da44af4eef5e609983099507e6c2e6c92bb149afa3723d555cdf3a32c4c5
-size 497813341

 version https://git-lfs.github.com/spec/v1
+oid sha256:5de6e1e3302d8628784844b4f0b29c5fcc01dc4a80e2859bd87b1bc43eab7def
+size 497819485

rng_state.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f709d0640d343650b38ff79894793392ca953fe40a2d1c7ac3adbb741ed50143
+size 5809

scheduler.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a37b5bfcf423368b830ec97a879837159978abb929495bf71afb2e98b98cf588
+size 563

special_tokens_map.json CHANGED

@@ -1,13 +1,13 @@
 {
   "bos_token": {
-    "content": "<|startoftext|>",
     "lstrip": false,
     "normalized": true,
     "rstrip": false,
     "single_word": false
   },
   "eos_token": {
-    "content": "<|endoftext|>",
     "lstrip": false,
     "normalized": true,
     "rstrip": false,
@@ -20,5 +20,6 @@
     "rstrip": false,
     "single_word": false
   },
   "unk_token": "<|endoftext|>"
 }

 {
   "bos_token": {
+    "content": "<|startofinstruction|>",
     "lstrip": false,
     "normalized": true,
     "rstrip": false,
     "single_word": false
   },
   "eos_token": {
+    "content": "<|endofcompletion|>",
     "lstrip": false,
     "normalized": true,
     "rstrip": false,
     "rstrip": false,
     "single_word": false
   },
+  "sep_token": "<|endofinstruction|>",
   "unk_token": "<|endoftext|>"
 }

tokenizer_config.json CHANGED

@@ -3,7 +3,7 @@
   "add_prefix_space": false,
   "bos_token": {
     "__type": "AddedToken",
-    "content": "<|startoftext|>",
     "lstrip": false,
     "normalized": true,
     "rstrip": false,
@@ -12,7 +12,7 @@
   "clean_up_tokenization_spaces": true,
   "eos_token": {
     "__type": "AddedToken",
-    "content": "<|endoftext|>",
     "lstrip": false,
     "normalized": true,
     "rstrip": false,
@@ -29,6 +29,7 @@
     "rstrip": false,
     "single_word": false
   },
   "tokenizer_class": "GPT2Tokenizer",
   "unk_token": {
     "__type": "AddedToken",

   "add_prefix_space": false,
   "bos_token": {
     "__type": "AddedToken",
+    "content": "<|startofinstruction|>",
     "lstrip": false,
     "normalized": true,
     "rstrip": false,
   "clean_up_tokenization_spaces": true,
   "eos_token": {
     "__type": "AddedToken",
+    "content": "<|endofcompletion|>",
     "lstrip": false,
     "normalized": true,
     "rstrip": false,
     "rstrip": false,
     "single_word": false
   },
+  "sep_token": "<|endofinstruction|>",
   "tokenizer_class": "GPT2Tokenizer",
   "unk_token": {
     "__type": "AddedToken",

training_stats.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:63cec774a93f84808183ddf0dacaca250ff645a5e6883cdfd4ea3f96a0cce3fa
-size 3108

 version https://git-lfs.github.com/spec/v1
+oid sha256:1fd2182fce4c23b82a8ddbaf441ebcca1c89cae6bf92c1283572fe28a0e7a2d0
+size 2355