Commit 678b3c0 by xzyao · 1 Parent(s): 8cf442c

update README.md

Files changed (1)
  1. README.md +74 -12
README.md CHANGED
@@ -20,49 +20,111 @@ RedPajama-Base-INCITE-2.8B-v1, is a large transformer-based language model devel
  ## GPU Inference

  This requires a GPU with 8GB memory.
  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
  # init
  tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-Base-INCITE-2.8B-v1")
  model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-Base-INCITE-2.8B-v1", torch_dtype=torch.float16)
  model = model.to('cuda:0')
  # infer
- inputs = tokenizer("Hello", return_tensors='pt').to(model.device)
- outputs = model.generate(**inputs, max_new_tokens=10, do_sample=True, temperature=0.8)
- output_str = tokenizer.decode(outputs[0])
  print(output_str)
  ```
 
  ## GPU Inference in Int8

- This requires a GPU with 6GB memory.

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
  # init
  tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-Base-INCITE-2.8B-v1")
- model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-Base-INCITE-2.8B-v1", device_map="auto", load_in_8bit=True)
  # infer
- inputs = tokenizer("Hello", return_tensors='pt').to(model.device)
- outputs = model.generate(**inputs, max_new_tokens=10, do_sample=True, temperature=0.8)
- output_str = tokenizer.decode(outputs[0])
  print(output_str)
  ```
 
  ## CPU Inference

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
  # init
  tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-Base-INCITE-2.8B-v1")
- model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-Base-INCITE-2.8B-v1", torch_dtype=torch.bfloat16)
  # infer
- inputs = tokenizer("<human>: Hello!\n<bot>:", return_tensors='pt').to(model.device)
- outputs = model.generate(**inputs, max_new_tokens=10, do_sample=True, temperature=0.8)
- output_str = tokenizer.decode(outputs[0])
  print(output_str)
  ```


  # Uses
 
 
  ## GPU Inference

  This requires a GPU with 8GB memory.
+
  ```python
+ import torch
+ import transformers
  from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ MIN_TRANSFORMERS_VERSION = '4.25.1'
+
+ # check transformers version
+ assert transformers.__version__ >= MIN_TRANSFORMERS_VERSION, f'Please upgrade transformers to version {MIN_TRANSFORMERS_VERSION} or higher.'
+
  # init
  tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-Base-INCITE-2.8B-v1")
  model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-Base-INCITE-2.8B-v1", torch_dtype=torch.float16)
  model = model.to('cuda:0')
+
  # infer
+ prompt = "Alan Turing is"
+ inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
+ input_length = inputs.input_ids.shape[1]
+ outputs = model.generate(
+     **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.7, top_k=50, return_dict_in_generate=True,
+ )
+ token = outputs.sequences[0, input_length:]
+ output_str = tokenizer.decode(token)
  print(output_str)
+ """
+ a name that has been synonymous with the computer age since the 1950s. The British mathematician, logician, and cryptanalyst is widely regarded as the father of modern computing. His contributions to the development of the modern computer and the theory of computation have had a profound impact on the world we live in today.
+ Turing’s contributions to the development of the modern computer were made in the 1940s and 1950s. He is most famous for his work on the Turing machine, a theoretical model of a computing machine that was able to perform all the mathematical operations of a computer. Turing’s work on the...
+ """
  ```
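Review note on the version check added in this hunk: `transformers.__version__ >= MIN_TRANSFORMERS_VERSION` compares two strings character by character, so a release such as `4.9.0` would pass the check and `4.100.0` would fail it. A minimal sketch of a numeric comparison using only the standard library follows; the helper names `version_tuple` and `meets_minimum` are illustrative assumptions and are not part of this commit or of transformers.

```python
import re

MIN_TRANSFORMERS_VERSION = '4.25.1'


def version_tuple(version: str) -> tuple:
    """Parse the leading numeric components of a version string into ints.

    Parsing stops at the first component that does not start with a digit,
    so '4.26.0.dev0' becomes (4, 26, 0).
    """
    parts = []
    for piece in version.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS_VERSION) -> bool:
    """Return True if `installed` is numerically at least `minimum`."""
    return version_tuple(installed) >= version_tuple(minimum)


# String comparison gets both of these cases wrong; tuple comparison does not.
print(meets_minimum('4.9.0'))    # False
print(meets_minimum('4.100.0'))  # True
```

With such a helper, the assertion in the README snippets could read `assert meets_minimum(transformers.__version__), ...`. Where the `packaging` package is installed, `packaging.version.parse` is a more complete alternative that also understands pre-release and development tags.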
 
  ## GPU Inference in Int8

+ To run inference with int8, please ensure you have installed accelerate and bitsandbytes. You can install them with the following commands:
+
+ ```bash
+ pip install accelerate
+ pip install bitsandbytes
+ ```
+
+ Then you can run inference with int8 as follows:

  ```python
+ import torch
+ import transformers
  from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ MIN_TRANSFORMERS_VERSION = '4.25.1'
+
+ # check transformers version
+ assert transformers.__version__ >= MIN_TRANSFORMERS_VERSION, f'Please upgrade transformers to version {MIN_TRANSFORMERS_VERSION} or higher.'
+
  # init
  tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-Base-INCITE-2.8B-v1")
+ model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-Base-INCITE-2.8B-v1", device_map='auto', torch_dtype=torch.float16, load_in_8bit=True)
+
  # infer
+ prompt = "Alan Turing is"
+ inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
+ input_length = inputs.input_ids.shape[1]
+ outputs = model.generate(
+     **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.7, top_k=50, return_dict_in_generate=True
+ )
+ token = outputs.sequences[0, input_length:]
+ output_str = tokenizer.decode(token)
  print(output_str)
+ """
+ the man who cracked the Enigma code during World War II, and who was later convicted of homosexual acts. He was a brilliant mathematician, and a visionary who foresaw the computer age....
+ """
  ```
 
  ## CPU Inference

+ You can run inference on CPU as follows:
+
  ```python
+ import torch
+ import transformers
  from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ MIN_TRANSFORMERS_VERSION = '4.25.1'
+
+ # check transformers version
+ assert transformers.__version__ >= MIN_TRANSFORMERS_VERSION, f'Please upgrade transformers to version {MIN_TRANSFORMERS_VERSION} or higher.'
+
  # init
  tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-Base-INCITE-2.8B-v1")
+ model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-Base-INCITE-2.8B-v1", torch_dtype=torch.float32)
  # infer
+ prompt = "Alan Turing is"
+ inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
+ input_length = inputs.input_ids.shape[1]
+ outputs = model.generate(
+     **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.7, top_k=50, return_dict_in_generate=True
+ )
+ token = outputs.sequences[0, input_length:]
+ output_str = tokenizer.decode(token)
  print(output_str)
+ """
+ one of the most famous people to have come out of Cambridge. He is also one of the most famous people to have been arrested for homosexuality.
+ """
  ```

+ Please note that since `LayerNormKernelImpl` is not implemented in fp16 for CPU, we use fp32 for CPU inference.
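Review note: the dtype chosen for each path also determines how much memory the weights alone occupy. As a rough sketch, assuming about 2.8 billion parameters (read off the model name, not a measured count) and ignoring activations, the KV cache, and framework overhead, the footprint is the parameter count times the bytes per parameter. The function name and the dtype table below are illustrative assumptions.

```python
# Approximate parameter count, assumed from the "2.8B" in the model name.
ASSUMED_NUM_PARAMS = 2_800_000_000

# Bytes used to store a single parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gb(dtype: str, num_params: int = ASSUMED_NUM_PARAMS) -> float:
    """Return a weights-only memory estimate in GB (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 10**9


for dtype in ("float32", "float16", "int8"):
    print(f"{dtype}: ~{weight_memory_gb(dtype):.1f} GB")
```

Under these assumptions the fp16 weights take roughly 5.6 GB, which fits within the 8GB GPU figure quoted for fp16 inference, while the fp32 weights used on CPU need about twice that in RAM. Real int8 loading may keep some tensors in higher precision, so the int8 number is only a lower-bound style estimate.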
 
  # Uses