---
library_name: transformers
tags:
- indonesia
license: mit
language:
- id
inference: true
---
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document Title</title>
    <style>
        h1 {
            font-size: 36px;
            color: navy;
            font-family: 'Tahoma';
            text-align: center;
        }
    </style>
</head>
<body>
    <h1>How small can language models be?</h1>
</body>
</html>

<center>
    <img src="https://i.imgur.com/z9ey830.png" alt="Sasando" width="500" height="250">
    <p><em>Sasando-1 is a tiny, highly experimental text generator built using the Phi-3 architecture.</em></p>
    <p><strong><a href="https://huggingface.co/spaces/afrizalha/Sasando-1" style="color: blue; font-family: Tahoma;">❕Go straight to the gradio demo❕</a></strong></p>
    <p><em style="color: black; font-weight: bold;">This repo contains the 7M version.</em></p>
</center>

## 🎻 Welcome!
Sasando-1 is a tiny, highly experimental Indonesian text generator built using the Phi-3 architecture. It comes in two microscopic sizes: 7M and 25M parameters. It is trained on a tightly controlled subset of the Indo4B dataset, filtered to contain only 18,000 unique words. The method is inspired by Microsoft's TinyStories paper, which demonstrates that a tiny language model can produce fluent text when trained on a tightly controlled dataset.
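
The filtering idea can be illustrated with a small sketch. This is not the actual preprocessing pipeline, which is not described here. It assumes whitespace tokenization and a frequency-based vocabulary, and the names `build_vocab` and `filter_sentences` as well as the toy corpus are hypothetical.

```python
from collections import Counter


def build_vocab(sentences, max_words):
    """Return the set of the `max_words` most frequent lowercase words.

    Assumption: words are split on whitespace; a real pipeline would
    likely also normalize punctuation.
    """
    counts = Counter(word for s in sentences for word in s.lower().split())
    return {word for word, _ in counts.most_common(max_words)}


def filter_sentences(sentences, vocab):
    """Keep only sentences in which every word belongs to `vocab`."""
    return [s for s in sentences if all(w in vocab for w in s.lower().split())]


# Toy corpus (hypothetical); the model card's figure is 18,000 unique words.
corpus = [
    "saya suka makan nasi",
    "saya suka minum teh",
    "dia suka makan nasi",
    "kucing itu tidur nyenyak",
]

vocab = build_vocab(corpus, max_words=4)
kept = filter_sentences(corpus, vocab)
print(kept)
```

With a vocabulary cap of 4 words, only the sentence made up entirely of frequent words survives. The same principle, applied with a cap of 18,000 words, would yield a corpus with a small, closed vocabulary.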

## 🇮🇩 Context
Indonesia has more than 700 languages, and many of them are dying at an alarming rate. Language technologies like generative AI can play a massive role in language preservation. However, Indonesia has several contextual issues:

- Many languages, including those with millions of speakers, have few digital resources
- Running large models is costly, and Indonesia is a middle-income country with limited funding

Overcoming these challenges requires developers to work with what little data and money they have. Sasando-1 is a prototype demonstrating that scarce resources can potentially still be leveraged to develop generative models with cheap compute.

## ✨ Specs
- Comes with 7M and 25M parameters
- Based on Phi-3 architecture
- Embedding vocabulary size: 4096
- Trained on ~257M tokens × 4 epochs
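
As a rough back-of-the-envelope check on the training budget listed above, the snippet below computes the total tokens seen and the tokens-per-parameter ratio for each size. It assumes the nominal parameter counts (7M and 25M) and the approximate per-epoch token count are exact, which they are not.

```python
# Figures taken from the specs list; all are approximate.
TOKENS_PER_EPOCH = 257_000_000
EPOCHS = 4

# Nominal model sizes (assumed exact for illustration only).
MODEL_SIZES = {"7M": 7_000_000, "25M": 25_000_000}

total_tokens = TOKENS_PER_EPOCH * EPOCHS
print(f"Total tokens seen: ~{total_tokens / 1e9:.2f}B")

tokens_per_param = {}
for name, params in MODEL_SIZES.items():
    tokens_per_param[name] = total_tokens / params
    print(f"{name}: ~{tokens_per_param[name]:.0f} tokens seen per parameter")
```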

## 🔭 Out-of-Scope Use
This is a research preview base model. It is not instruction-tuned and has minimal safety curation. It is not intended for commercial or practical applications.

You are also not allowed to use this model without having fun.

## Acknowledgments

- **Developed by:** Afrizal Hasbi Azizy
- **License:** MIT

## Training log
<center>
    <img src="https://imgur.com/32NFAKm.png" alt="Training log" width="500" height="250">
</center>