	---
	library_name: transformers
	widget:
	- messages:
	  - role: user
	    content: How does the brain work?
	inference:
	  parameters:
	    max_new_tokens: 200
	extra_gated_heading: Access Gemma on Hugging Face
	extra_gated_prompt: >-
	  To access Gemma on Hugging Face, you’re required to review and agree to
	  Google’s usage license. To do this, please ensure you’re logged-in to Hugging
	  Face and click below. Requests are processed immediately.
	extra_gated_button_content: Acknowledge license
	datasets:
	- yatharth97/10k_reports_gemma
	---

	# yatharth-gemma-7b-it-10k Model Card

	Reference Model Page: [Gemma](https://ai.google.dev/gemma/docs)

	This model card describes a version of the Gemma model that has been fine-tuned on a dataset of 10-K reports to improve performance on answering questions about those reports.


	Authors: Yatharth Mahesh Sant

	## Model Information

	Summary description and brief definition of inputs and outputs.

	### Description

	This model is a fine-tuned adaptation of Gemma 7B-IT, a member of the Gemma family of lightweight, state-of-the-art open models developed by Google and built from the same research and technology used to create the Gemini models. The fine-tuned version specializes in parsing and understanding financial text, particularly the text found in 10-K reports.

	Named "yatharth-gemma-7b-it-10k", the model retains the text-to-text, decoder-only architecture of its base model and works best in English. What sets it apart is its focus on question answering over 10-K reports, a key resource for financial analysts, investors, and regulatory professionals.

	In keeping with the open-weights approach of the original Gemma models, this variant has been instruction-tuned on a curated dataset of 10-K reports. It is tuned to generate accurate, context-aware responses to user queries about such reports while retaining the flexibility and efficiency that allow deployment in settings ranging from personal computers to cloud environments.

	Like other Gemma models, "yatharth-gemma-7b-it-10k" supports text generation tasks such as summarization and complex reasoning. Its optimization for financial reports makes it a specialized tool for dissecting and interpreting one of the business world's most information-dense documents.

	By combining the accessibility of the Gemma models with the domain expertise needed to navigate 10-K reports, the model aims to make AI-assisted financial analysis and decision-making more widely available.

	### Usage

	Below are some code snippets to help you get started quickly with running the model. First make sure to `pip install -U transformers`, then copy the snippet from the section that is relevant to your use case.

	#### Fine-tuning the model

	You can find fine-tuning scripts and a notebook under the [`examples/` directory](https://huggingface.co/google/gemma-7b/tree/main/examples) of the [`google/gemma-7b`](https://huggingface.co/google/gemma-7b) repository. To adapt them to this model, simply change the model ID to `yatharth97/yatharth-gemma-7b-it-10k`.
	That repository provides:

	* A script to perform Supervised Fine-Tuning (SFT) on the UltraChat dataset using QLoRA
	* A script to perform SFT using FSDP on TPU devices
	* A notebook that you can run on a free-tier Google Colab instance to perform SFT on the English quotes dataset


	#### Running the model on a CPU

	As explained below, we recommend `torch.bfloat16` as the default dtype. You can use [a different precision](#precisions) if necessary.

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
	model = AutoModelForCausalLM.from_pretrained(
	    "yatharth97/yatharth-gemma-7b-it-10k",
	    torch_dtype=torch.bfloat16
	)

	input_text = 'Can you tell me what the Total Debt was in 2023?'
	input_ids = tokenizer(input_text, return_tensors="pt")

	outputs = model.generate(**input_ids)
	print(tokenizer.decode(outputs[0]))
	```


	#### Running the model on a single GPU or multiple GPUs


	```python
	# pip install accelerate
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
	model = AutoModelForCausalLM.from_pretrained(
	    "yatharth97/yatharth-gemma-7b-it-10k",
	    device_map="auto",
	    torch_dtype=torch.bfloat16
	)

	input_text = 'Can you tell me what the Total Debt was in 2023?'
	input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

	outputs = model.generate(**input_ids)
	print(tokenizer.decode(outputs[0]))
	```

	<a name="precisions"></a>
	#### Running the model on a GPU using different precisions

	The native weights of this model were exported in `bfloat16` precision. You can use `float16`, which may be faster on certain hardware, by specifying the `torch_dtype` when loading the model. For convenience, the `float16` revision of the repo contains a copy of the weights already converted to that precision.

	You can also use `float32` by omitting the dtype, but this yields no gain in precision (the model weights are simply upcast to `float32`). See the examples below.

	* _Using `torch.float16`_

	```python
	# pip install accelerate
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
	model = AutoModelForCausalLM.from_pretrained(
	    "yatharth97/yatharth-gemma-7b-it-10k",
	    device_map="auto",
	    torch_dtype=torch.float16,
	    revision="float16",
	)

	input_text = 'Can you tell me what the Total Debt was in 2023?'
	input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

	outputs = model.generate(**input_ids)
	print(tokenizer.decode(outputs[0]))
	```

	* _Using `torch.bfloat16`_

	```python
	# pip install accelerate
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch  # needed for torch.bfloat16 below

	tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
	model = AutoModelForCausalLM.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k", device_map="auto", torch_dtype=torch.bfloat16)

	input_text = 'Can you tell me what the Total Debt was in 2023?'
	input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

	outputs = model.generate(**input_ids)
	print(tokenizer.decode(outputs[0]))
	```

	* _Upcasting to `torch.float32`_

	```python
	# pip install accelerate
	from transformers import AutoTokenizer, AutoModelForCausalLM

	tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
	model = AutoModelForCausalLM.from_pretrained(
	    "yatharth97/yatharth-gemma-7b-it-10k",
	    device_map="auto"
	)

	input_text = 'Can you tell me what the Total Debt was in 2023?'
	input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

	outputs = model.generate(**input_ids)
	print(tokenizer.decode(outputs[0]))
	```

	#### Quantized Versions through `bitsandbytes`

	* _Using 8-bit precision (int8)_

	```python
	# pip install bitsandbytes accelerate
	from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

	quantization_config = BitsAndBytesConfig(load_in_8bit=True)

	tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
	model = AutoModelForCausalLM.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k", quantization_config=quantization_config)

	input_text = 'Can you tell me what the Total Debt was in 2023?'
	input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

	outputs = model.generate(**input_ids)
	print(tokenizer.decode(outputs[0]))
	```

	* _Using 4-bit precision_

	```python
	# pip install bitsandbytes accelerate
	from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

	quantization_config = BitsAndBytesConfig(load_in_4bit=True)

	tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
	model = AutoModelForCausalLM.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k", quantization_config=quantization_config)

	input_text = 'Can you tell me what the Total Debt was in 2023?'
	input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

	outputs = model.generate(**input_ids)
	print(tokenizer.decode(outputs[0]))
	```
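
As a rough guide to choosing among the precisions and quantized variants above, the memory needed just to hold the weights scales with the number of bytes per parameter. The sketch below is a back-of-the-envelope estimate, not a measurement: `weight_memory_gib` is a hypothetical helper, the parameter count of roughly 8.5 billion is an assumption for illustration (check the model's actual count), and the figures ignore activations, the KV cache, and the small overhead that quantization adds for scaling factors.

```python
# Approximate storage cost per parameter for each loading option.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "float16": 2.0,
    "int8": 1.0,
    "4bit": 0.5,
}


def weight_memory_gib(num_params: int, precision: str) -> float:
    """Estimate the memory (in GiB) needed to store the weights alone."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


# ASSUMPTION: ~8.5 billion parameters, used purely for illustration.
ASSUMED_NUM_PARAMS = 8_500_000_000

if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        gib = weight_memory_gib(ASSUMED_NUM_PARAMS, precision)
        print(f"{precision:>8}: ~{gib:.1f} GiB")
```

Halving the bytes per parameter halves the estimate, which is why the 8-bit and 4-bit options can fit on hardware where `float32` or `bfloat16` weights would not.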


	#### Other optimizations

	* _Flash Attention 2_

	First, make sure to install `flash-attn` in your environment: `pip install flash-attn`

	```diff
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    torch_dtype=torch.float16,
	+   attn_implementation="flash_attention_2"
	).to(0)
	```

	### Chat Template

	The instruction-tuned models use a chat template that must be adhered to for conversational use.
	The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet.

	Let's load the model and apply the chat template to a conversation. In this example, we'll start with a single user interaction:

	```py
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	model_id = "yatharth97/yatharth-gemma-7b-it-10k"
	dtype = torch.bfloat16

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    device_map="cuda",
	    torch_dtype=dtype,
	)

	chat = [
	    { "role": "user", "content": "Can you tell me what the Total Debt was in 2023?" },
	]
	prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
	```

	At this point, the prompt contains the following text:

	```
	<bos><start_of_turn>user
	Can you tell me what the Total Debt was in 2023?<end_of_turn>
	<start_of_turn>model
	```

	As you can see, each turn is preceded by a `<start_of_turn>` delimiter and then the role of the entity
	(either `user`, for content supplied by the user, or `model` for LLM responses). Turns finish with
	the `<end_of_turn>` token.

	You can follow this format to build the prompt manually, if you need to do it without the tokenizer's
	chat template.
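
If you do need to build the prompt by hand, the format described above can be sketched as a small function. This is a minimal sketch, not the tokenizer's actual implementation: `build_gemma_prompt` is a hypothetical helper, mapping the common `assistant` role name to `model` is an assumption made for convenience, and the trailing newline after the final `<start_of_turn>model` is assumed as well. When in doubt, compare its output against `tokenizer.apply_chat_template`.

```python
def build_gemma_prompt(chat, add_generation_prompt=True):
    """Format a list of {"role", "content"} turns using the delimiters shown above."""
    prompt = "<bos>"
    for turn in chat:
        # ASSUMPTION: accept "assistant" as an alias for the "model" role.
        role = "model" if turn["role"] == "assistant" else turn["role"]
        prompt += f"<start_of_turn>{role}\n{turn['content']}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a model turn so that generation continues as the model's reply.
        prompt += "<start_of_turn>model\n"
    return prompt


chat = [
    {"role": "user", "content": "Can you tell me what the Total Debt was in 2023?"},
]
print(build_gemma_prompt(chat))
```

Because the resulting string already starts with `<bos>`, tokenize it with `add_special_tokens=False` so that the token is not added a second time.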

	After the prompt is ready, generation can be performed like this:

	```py
	inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
	outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
	print(tokenizer.decode(outputs[0]))
	```
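
The decoded output above contains the whole conversation, prompt included, along with the turn delimiters. If only the model's reply is needed, one option is to cut the text at the last `<start_of_turn>model` marker and drop the closing tokens. The helper below is a hypothetical sketch of that idea: `extract_reply` is not part of `transformers`, and treating `<end_of_turn>` and `<eos>` as the closing tokens is an assumption based on the format shown earlier.

```python
MODEL_TURN = "<start_of_turn>model\n"
CLOSING_TOKENS = ("<end_of_turn>", "<eos>")  # ASSUMPTION: tokens that end a reply


def extract_reply(decoded: str) -> str:
    """Return the text of the last model turn in a decoded generation."""
    # Keep everything after the last model-turn marker (the whole string if absent).
    reply = decoded.rsplit(MODEL_TURN, 1)[-1]
    # Cut at the first closing token, if one was generated.
    for token in CLOSING_TOKENS:
        reply = reply.split(token, 1)[0]
    return reply.strip()


# Placeholder text standing in for tokenizer.decode(outputs[0]); not real model output.
decoded = (
    "<bos><start_of_turn>user\n"
    "Can you tell me what the Total Debt was in 2023?<end_of_turn>\n"
    "<start_of_turn>model\n"
    "PLACEHOLDER ANSWER<end_of_turn><eos>"
)
print(extract_reply(decoded))
```

An alternative that avoids string handling is to slice off the prompt tokens before decoding, for example `outputs[0][inputs.shape[-1]:]`, and pass `skip_special_tokens=True` to `tokenizer.decode`.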

	### Inputs and outputs

	* Input: Text string, such as a question, a prompt, or a 10-K document to be
	  summarized.
	* Output: Generated English-language text in response to the input, such
	  as an answer to a question or a summary of an uploaded 10-K document. Summarization is currently handled by a separate model.

	## Model Data

	Data used for model training and how the data was processed.

	### Training Dataset

	This model is fine-tuned on the `yatharth97/10k_reports_gemma` dataset, which uses a conversational format that lets the user ask questions about an uploaded 10-K report.