aashish1904 committed
Commit 6769b6d
1 Parent(s): a705aba

Upload README.md with huggingface_hub

Files changed (1):
1. README.md +503 -0

README.md ADDED
@@ -0,0 +1,503 @@
---
license: gemma
library_name: transformers
pipeline_tag: text-generation
extra_gated_heading: Access Gemma on Hugging Face
extra_gated_prompt: >-
  To access Gemma on Hugging Face, you’re required to review and agree to
  Google’s usage license. To do this, please ensure you’re logged in to Hugging
  Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
tags:
- conversational
base_model: google/gemma-2-2b-it
language:
- ja
---

[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)

# QuantFactory/gemma-2-2b-jpn-it-GGUF
This is a quantized version of [google/gemma-2-2b-jpn-it](https://huggingface.co/google/gemma-2-2b-jpn-it), created with llama.cpp.
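
The GGUF files in this repository are meant to be run with llama.cpp or a compatible runtime rather than loaded directly with `transformers`. Below is a minimal, illustrative sketch using the `llama-cpp-python` bindings; the quantization filename shown is a placeholder, so check this repository's file list for the exact GGUF file you want.

```python
# Illustrative sketch only: download one of this repo's GGUF files and run it
# with llama-cpp-python (pip install llama-cpp-python huggingface_hub).
# The filename below is a placeholder; pick an actual file from the repo.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="QuantFactory/gemma-2-2b-jpn-it-GGUF",
    filename="gemma-2-2b-jpn-it.Q4_K_M.gguf",  # placeholder filename
)

llm = Llama(
    model_path=model_path,
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers if built with GPU support; 0 for CPU only
)

response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "マシーンラーニングについての詩を書いてください。"},
    ],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"].strip())
```

The original model card below shows how to run the full-precision `google/gemma-2-2b-jpn-it` checkpoint with `transformers`.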

# Original Model Card

# Gemma 2 JPN model card

### Resources and Technical Documentation:

- [Responsible Generative AI Toolkit](https://ai.google.dev/responsible)
- [Gemma 2 JPN on Kaggle](https://www.kaggle.com/models/google/gemma-2-2b-jpn-it)
- [Gemma 2 JPN on Hugging Face](https://huggingface.co/google/gemma-2-2b-jpn-it)

**Terms of Use**: [Terms](https://ai.google.dev/gemma/terms)\
**Authors**: Google

## Model Information

Summary description and brief definition of inputs and outputs.

### Description

Gemma is a series of best-in-class open models that draws inspiration and
technological lineage from the Gemini family of models. They are text-to-text,
decoder-only large language models with open weights. Gemma models are
well-suited for a variety of text generation tasks, including question
answering, summarization, and reasoning.

Gemma-2-JPN is a Gemma 2 2B model fine-tuned on Japanese text. It supports the
Japanese language at the same level of performance as English-only queries on
Gemma 2.

### Usage

Below we share some code snippets to help you get started quickly with running the model. First, install the Transformers library with:

```sh
pip install -U transformers
```

Then, copy the snippet from the section that is relevant to your use case.

#### Running with the `pipeline` API

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-2-2b-jpn-it",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",  # replace with "mps" to run on a Mac device
)

messages = [
    {"role": "user", "content": "マシーンラーニングについての詩を書いてください。"},
]

outputs = pipe(messages, return_full_text=False, max_new_tokens=256)
assistant_response = outputs[0]["generated_text"].strip()
print(assistant_response)
```

<details>
<summary>Example output</summary>

```
## マシーンラーニングの詩

**1.**
データの海、深淵の広がり、
複雑なパターン、隠された知識。
機械学習、その力強さ、
未来を予測、その道を開く。

**2.**
ニューラルネットワーク、複雑な枝、
学習の旅、その過程は静か。
データから学び、進化する姿、
予測の精度、その力強さ。

**3.**
教師あり学習、正解を導く、
教師なし学習、未知の世界へ。
機械学習、その進化は止まらない、
未来の扉を開く、新たな時代へ。

**4.**
画像認識、音声認識、
複雑なタスク、その答えを見つける。
機械学習、その力強さ、
未来の技術、その可能性を語る。
```

</details>

It can also be used for translation, as follows:

```python
translation_input_text = f"Translate the following poem from Japanese to English:\n\n{assistant_response}"
messages = [
    {"role": "user", "content": translation_input_text},
]

outputs = pipe(messages, return_full_text=False, max_new_tokens=1024)
translated_response = outputs[0]["generated_text"].strip()
print(translated_response)
```

<details>

<summary>Example output</summary>

```
## A Poem About Machine Learning

**1.**
A vast ocean of data, a deep expanse,
Complex patterns, hidden knowledge.
Machine learning, its strength so vast,
Predicting the future, opening the way.

**2.**
A neural network, with branches intricate,
A journey of learning, its process serene.
Learning from data, evolving in its form,
The precision of prediction, its strength.

**3.**
Supervised learning, guiding the correct answer,
Unsupervised learning, venturing into the unknown.
Machine learning, its evolution never ends,
Opening the doors to the future, a new era.

**4.**
Image recognition, speech recognition,
Complex tasks, finding the answer.
Machine learning, its strength so vast,
The possibilities of future technology, a story to be told.

**Explanation:**

The poem uses vivid imagery and metaphors to describe the power and potential of machine learning.

* **Data as an ocean:** Represents the vast amount of information available for learning.
* **Complex patterns:** Highlights the intricate nature of data and the challenges of extracting meaningful insights.
* **Future prediction:** Emphasizes the ability of machine learning to analyze data and make predictions about the future.
* **Neural network as a tree:** Represents the interconnectedness and complexity of the learning process.
* **Learning from data:** Focuses on the core principle of machine learning, where algorithms learn from data to improve their performance.

The poem concludes by highlighting the diverse applications of machine learning, such as image and speech recognition, and emphasizes its potential to shape the future of technology.
```

</details>

#### Running the model on a single / multi GPU

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-jpn-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-jpn-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "user", "content": "マシーンラーニングについての詩を書いてください。"},
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True, return_dict=True).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
generated_text = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0]
print(generated_text.strip())
```

<a name="precisions"></a>
#### Running the model on a GPU using different precisions

The native weights of this model were exported in `bfloat16` precision.

You can also use `float32` if you skip the dtype, but no precision increase will occur (the model weights will just be upcast to `float32`). See the example below.

* _Upcasting to `torch.float32`_

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-jpn-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-jpn-it",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "マシーンラーニングについての詩を書いてください。"},
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True, return_dict=True).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
generated_text = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0]
print(generated_text.strip())
```

### Inputs and outputs

- **Input:** Text string, such as a question, a prompt, or a document to
  be summarized.
- **Output:** Generated Japanese-language text in response to the input,
  such as an answer to a question, or a summary of a document.

## Model Data

Data used for model training and how the data was processed.

### Training Dataset

These models were trained on a dataset of text data that includes a wide
variety of sources, totaling 8 trillion tokens. Here are the key components:

- Web Documents: A diverse collection of web text ensures the model is
  exposed to a broad range of linguistic styles, topics, and vocabulary.
  Primarily English-language content.
- Code: Exposing the model to code helps it learn the syntax and
  patterns of programming languages, which improves its ability to generate
  code or understand code-related questions.
- Mathematics: Training on mathematical text helps the model learn logical
  reasoning and symbolic representation, and address mathematical queries.
- Instruction dataset: Large-scale, high-quality Japanese and
  multilingual instruction data.

The combination of these diverse data sources is crucial for training a
powerful language model that can handle a wide variety of tasks and
text formats.

### Data Preprocessing

Here are the key data cleaning and filtering methods applied to the training
data:

- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering
  was applied at multiple stages in the data preparation process to ensure
  the exclusion of harmful and illegal content.
- Sensitive Data Filtering: As part of making Gemma pre-trained models
  safe and reliable, we used automated techniques to filter out certain
  personal information and other sensitive data from training sets.
- Additional methods: Filtering based on content quality and
  safety in line with [our policies](https://storage.googleapis.com/gweb-uniblog-publish-prod/documents/2023_Google_AI_Principles_Progress_Update.pdf#page=11).

## Implementation Information

Details about the model internals.

### Hardware

Gemma was trained using the latest generation of [Tensor Processing Unit
(TPU)](https://cloud.google.com/tpu/docs/intro-to-tpu) hardware (TPUv5p).

Training large language models requires significant computational power. TPUs,
designed specifically for matrix operations common in machine learning, offer
several advantages in this domain:

- Performance: TPUs are specifically designed to handle the massive
  computations involved in training LLMs. They can speed up training
  considerably compared to CPUs.
- Memory: TPUs often come with large amounts of high-bandwidth memory,
  allowing for the handling of large models and batch sizes during training.
  This can lead to better model quality.
- Scalability: TPU Pods (large clusters of TPUs) provide a scalable
  solution for handling the growing complexity of large foundation models.
  You can distribute training across multiple TPU devices for faster and more
  efficient processing.
- Cost-effectiveness: In many scenarios, TPUs can provide a more
  cost-effective solution for training large models compared to CPU-based
  infrastructure, especially when considering the time and resources saved
  due to faster training.

These advantages are aligned with
[Google's commitments to operate sustainably](https://sustainability.google/operating-sustainably/).

### Software

Training was done using [JAX](https://github.com/google/jax) and
[ML Pathways](https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/).

JAX allows researchers to take advantage of the latest generation of hardware,
including TPUs, for faster and more efficient training of large models.

ML Pathways is Google's latest effort to build artificially intelligent systems
capable of generalizing across multiple tasks. This is especially suitable for
[foundation models](https://ai.google/discover/foundation-models/), including
large language models like these.

Together, JAX and ML Pathways are used as described in the [paper about the
Gemini family of models](https://goo.gle/gemma2report); "the 'single controller'
programming model of Jax and Pathways allows a single Python process to
orchestrate the entire training run, dramatically simplifying the development
workflow."

## Evaluation

To assess the quality of this model, we collected a diverse set of Japanese
prompts and evaluated performance using an LLM-as-a-judge approach against
GPT-3.5. The rating system is based on a 7-point scale: MuchBetterThan,
BetterThan, SlightlyBetterThan, AboutTheSame, SlightlyWorse, WorseThan, and
MuchWorseThan, associated with the numerical scores 1.5, 1.0, 0.5, 0, -0.5,
-1.0, and -1.5 respectively. We also tracked the ability of the model to
answer in the correct language: for a Japanese prompt, the model should
typically answer in Japanese rather than defaulting to English.
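For illustration, the label-to-score mapping can be expressed as in the sketch below. This is a hypothetical reconstruction of the scoring arithmetic, not the actual evaluation harness; the function and example verdicts are made up for demonstration.

```python
# Illustrative sketch (not the actual evaluation code): map the seven judge
# labels to their numerical scores and average them into a preference score.
from statistics import mean

LABEL_TO_SCORE = {
    "MuchBetterThan": 1.5,
    "BetterThan": 1.0,
    "SlightlyBetterThan": 0.5,
    "AboutTheSame": 0.0,
    "SlightlyWorse": -0.5,
    "WorseThan": -1.0,
    "MuchWorseThan": -1.5,
}

def preference_score(judge_labels):
    """Average judge verdicts into a single preference-vs-baseline score."""
    return mean(LABEL_TO_SCORE[label] for label in judge_labels)

# Example: three hypothetical verdicts give a slightly positive preference.
print(preference_score(["BetterThan", "AboutTheSame", "SlightlyWorse"]))  # ≈ 0.17
```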

| Benchmark | Gemma-2-IT | Gemma-2-IT-JPN |
| --- | --- | --- |
| Preference vs GPT-3.5 | -0.25 ± 0.05 | 0.03 ± 0.04 |
| Language correctness | 86.47% | 98.24% |

## Ethics and Safety

Ethics and safety evaluation approach and results.

### Evaluation Approach

Our evaluation methods include structured evaluations and internal red-teaming
testing of relevant content policies. Red-teaming was conducted by a number of
different teams, each with different goals and human evaluation metrics. These
models were evaluated against a number of different categories relevant to
ethics and safety, including:

- Text-to-Text Content Safety: Human evaluation on prompts covering
  safety policies including child sexual abuse and exploitation, harassment,
  violence and gore, and hate speech.
- Text-to-Text Representational Harms: Benchmark against relevant academic
  datasets.
- Memorization: Automated evaluation of memorization of training data,
  including the risk of personally identifiable information exposure.
- Large-scale harm: Tests for "dangerous capabilities," such as chemical,
  biological, radiological, and nuclear (CBRN) risks.

## Usage and Limitations

These models have certain limitations that users should be aware of.

### Intended Usage

Open Large Language Models (LLMs) have a wide range of applications across
various industries and domains. The following list of potential uses is not
comprehensive. The purpose of this list is to provide contextual information
about the possible use-cases that the model creators considered as part of model
training and development.

- Content Creation and Communication
    - Text Generation: These models can be used to generate creative
      text formats such as poems, scripts, code, marketing copy, and email drafts.
    - Chatbots and Conversational AI: Power conversational interfaces
      for customer service, virtual assistants, or interactive applications.
    - Text Summarization: Generate concise summaries of a text corpus,
      research papers, or reports.
- Research and Education
    - Natural Language Processing (NLP) Research: These models can
      serve as a foundation for researchers to experiment with NLP
      techniques, develop algorithms, and contribute to the advancement of the field.
    - Language Learning Tools: Support interactive language learning
      experiences, aiding in grammar correction or providing writing practice.
    - Knowledge Exploration: Assist researchers in exploring large
      bodies of text by generating summaries or answering questions about
      specific topics.

### Limitations

- Training Data
    - The quality and diversity of the training data significantly
      influence the model's capabilities. Biases or gaps in the training data
      can lead to limitations in the model's responses.
    - The scope of the training dataset determines the subject areas
      the model can handle effectively.
- Context and Task Complexity
    - LLMs are better at tasks that can be framed with clear prompts
      and instructions. Open-ended or highly complex tasks might be challenging.
    - A model's performance can be influenced by the amount of context
      provided (longer context generally leads to better outputs, up to a
      certain point).
- Language Ambiguity and Nuance
    - Natural language is inherently complex. LLMs might struggle to
      grasp subtle nuances, sarcasm, or figurative language.
- Factual Accuracy
    - LLMs generate responses based on information they learned from
      their training datasets, but they are not knowledge bases. They may
      generate incorrect or outdated factual statements.
- Common Sense
    - LLMs rely on statistical patterns in language. They might lack
      the ability to apply common sense reasoning in certain situations.

### Ethical Considerations and Risks

The development of large language models (LLMs) raises several ethical
concerns. In creating an open model, we have carefully considered the
following:

- Bias and Fairness
    - LLMs trained on large-scale, real-world text data can reflect
      socio-cultural biases embedded in the training material. These models
      underwent careful scrutiny; the input data pre-processing and posterior
      evaluations are described and reported in this card.
- Misinformation and Misuse
    - LLMs can be misused to generate text that is false, misleading,
      or harmful.
    - Guidelines for responsible use are provided with the model; see
      the [Responsible Generative AI Toolkit](https://ai.google.dev/responsible).
- Transparency and Accountability
    - This model card summarizes details on the models' architecture,
      capabilities, limitations, and evaluation processes.
    - A responsibly developed open model offers the opportunity to
      share innovation by making LLM technology accessible to developers and
      researchers across the AI ecosystem.

Risks identified and mitigations:

- Perpetuation of biases: Developers are encouraged to perform continuous
  monitoring (using evaluation metrics and human review) and to explore
  de-biasing techniques during model training, fine-tuning, and other use cases.
- Generation of harmful content: Mechanisms and guidelines for content
  safety are essential. Developers are encouraged to exercise caution and
  implement appropriate content safety safeguards based on their specific
  product policies and application use cases.
- Misuse for malicious purposes: Technical limitations and developer and
  end-user education can help mitigate malicious applications of
  LLMs. Educational resources and reporting mechanisms for users to flag
  misuse are provided. Prohibited uses of Gemma models are outlined in the
  [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).
- Privacy violations: Models were trained on data filtered to remove
  PII (Personally Identifiable Information). Developers are encouraged to
  adhere to privacy regulations with privacy-preserving techniques.

### Benefits

At the time of release, this family of models provides high-performance open
large language model implementations designed from the ground up for
Responsible AI development, compared to similarly sized models.