---
license: llama2
widget:
  - example_title: "ALMA-Cymraeg-13B"
    text: "Cyfieithwch y testun Saesneg canlynol i'r Gymraeg.\n### Saesneg:\nFor the first time, GPs no longer have to physically print, sign and hand a green paper prescription form to the patient or wait for it to be taken to the pharmacy. Instead, the prescription is sent electronically from the surgery via the IT system to the patient’s chosen pharmacy - even without the patient needing to visit the surgery to pick up a repeat prescription form.\n\n### Cymraeg:\n"
    output:
      text: "Am y tro cyntaf, nid oes rhaid i feddygon teulu bellach argraffu, llofnodi a throsglwyddo ffurflen bresgripsiwn werdd i'r claf neu aros iddi gael ei chludo i'r fferyllfa. Yn lle hynny, caiff y presgripsiwn ei anfon yn electronig gan y practis drwy'r system TG at fferyllfa ddewisedig y claf - heb fod angen i'r claf ymweld â'r practis er mwyn casglu ffurflen bresgripsiwn ailadrodd."
pipeline_tag: text-generation
---

# ALMA-Cymraeg-13B
A Welsh version of the [ALMA](https://github.com/fe1ixxu/ALMA) LLM-based translation model described in [https://arxiv.org/abs/2309.11674](https://arxiv.org/abs/2309.11674).

The model is based on Llama-2-13B, with continued pretraining on the Welsh data in [OSCAR-2301](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301) for 3 epochs,
followed by further fine-tuning on Cofnod y Cynulliad (Record of the Assembly) data provided by [TechIaith](https://huggingface.co/techiaith).

A faster version, compressed to 4.0bpw so that it loads into 10GB of GPU memory, is available [here](https://huggingface.co/BangorAI/ALMA-Cymraeg-13B-0.1-4.0bpw-exl2).

### Prompt Format

Fine-tuning used the following prompt format for translating from English to Welsh (and vice versa).
```
Cyfieithwch y testun Saesneg canlynol i'r Gymraeg.
### Saesneg:
{prompt}

### Cymraeg:

```
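As a minimal sketch, the template above can be filled in with a small helper. The names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical and not part of the model's code; the Welsh instruction text is kept verbatim on the assumption that the model expects exactly the wording it was fine-tuned on. Only the English-to-Welsh direction is covered, since the wording of the reverse template is not given here.

```python
# Hypothetical helper: fills the English-to-Welsh template shown above.
# The Welsh instruction and section markers are copied verbatim from the template.
PROMPT_TEMPLATE = (
    "Cyfieithwch y testun Saesneg canlynol i'r Gymraeg.\n"
    "### Saesneg:\n"
    "{prompt}\n"
    "\n"
    "### Cymraeg:\n"
)


def build_prompt(english_text: str) -> str:
    """Return the full prompt for one English source text."""
    # Strip surrounding whitespace so the blank line before "### Cymraeg:" stays exact.
    return PROMPT_TEMPLATE.format(prompt=english_text.strip())


if __name__ == "__main__":
    print(build_prompt("The pharmacy opens at nine."))
```

The resulting string can be passed to the tokenizer in place of a hand-written prompt.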



#### Example

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Load the weights in 8-bit (requires the bitsandbytes package).
model = AutoModelForCausalLM.from_pretrained("BangorAI/ALMA-Cymraeg-13B-0.1", torch_dtype=torch.float16, load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("BangorAI/ALMA-Cymraeg-13B-0.1")

prompt = """Cyfieithwch y testun Saesneg canlynol i'r Gymraeg.
### Saesneg:
For the first time, GPs no longer have to physically print, sign and hand a green paper prescription form to the patient or wait for it to be taken to the pharmacy. Instead, the prescription is sent electronically from the surgery via the IT system to the patient’s chosen pharmacy - even without the patient needing to visit the surgery to pick up a repeat prescription form.

### Cymraeg:
"""

model_inputs = tokenizer([prompt], return_tensors="pt").to(device)

generated_ids = model.generate(**model_inputs,
                               eos_token_id=tokenizer.eos_token_id,
                               top_k=90,
                               top_p=1.0,
                               temperature=0.3,
                               repetition_penalty=1.2,
                               max_new_tokens=500,
                               do_sample=True)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

```
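Because a causal language model returns the prompt followed by its continuation, the decoded string in the example contains the instruction and the English source as well as the Welsh translation. The sketch below shows one way to keep only the translation; `extract_translation` and `RESPONSE_MARKER` are hypothetical names, and the approach assumes the `### Cymraeg:` marker from the prompt format appears in the decoded text.

```python
# Hypothetical post-processing helper for the decoded model output.
RESPONSE_MARKER = "### Cymraeg:"


def extract_translation(decoded: str) -> str:
    """Return the text that follows the first '### Cymraeg:' marker."""
    _, marker, tail = decoded.partition(RESPONSE_MARKER)
    if not marker:
        # Marker not found: fall back to the whole decoded string.
        return decoded.strip()
    return tail.strip()


if __name__ == "__main__":
    sample = (
        "Cyfieithwch y testun Saesneg canlynol i'r Gymraeg.\n"
        "### Saesneg:\nGood morning.\n\n"
        "### Cymraeg:\nBore da.\n"
    )
    print(extract_translation(sample))
```

With the example above, this would be applied to the string returned by `tokenizer.batch_decode(...)[0]`.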

## Copyright

The model is based on Llama 2 and is therefore licensed by [Meta](https://ai.meta.com/llama/license/). \
The Cofnod y Cynulliad data is under the [Open Government Licence](https://www.nationalarchives.gov.uk/doc/open-government-licence-cymraeg/version/3/).