Sreenington committed
Commit: deac860
Parent: 533b773

Update README.md

Files changed (1):
  README.md (+26, -1)
README.md CHANGED
@@ -73,8 +73,33 @@ Assistant:
 
 ## How to use
 
+### using vLLM
+```python
+from vllm import LLM, SamplingParams
+
+# Sample prompts.
+prompts = [
+    "Hello, how are you?"
+]
+# Create a sampling params object.
+sampling_params = SamplingParams(max_tokens=128)
+
+# Create an LLM.
+llm = LLM(model="Sreenington/Llama-3-8B-ChatQA-AWQ", quantization="AWQ")
+# Generate texts from the prompts. The output is a list of RequestOutput objects
+# that contain the prompt, generated text, and other information.
+outputs = llm.generate(prompts, sampling_params)
+
+# Print the outputs.
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
 ### take the whole document as context
-This can be applied to the scenario where the whole document can be fitted into the model, so that there is no need to run retrieval over the document.
+This can be applied when the whole document fits into the model's context window, so there is no need to run retrieval over the document.
+
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
 import torch
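
The trailing context of the hunk cuts off at the imports of the "take the whole document as context" example. A minimal sketch of how that flow might continue with this AWQ checkpoint (assuming the stock `transformers` generation API; the prompt template, document text, question, and generation settings below are illustrative placeholders, not the README's verbatim code):

```python
# Hypothetical continuation of the truncated example above, not the README's
# verbatim code. Loading an AWQ checkpoint through transformers requires the
# autoawq package to be installed.
model_id = "Sreenington/Llama-3-8B-ChatQA-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Placeholder document and question: the whole document goes into the prompt,
# so no retrieval step is needed.
document = "ChatQA is a family of conversational QA models built on Llama 3."
question = "What is ChatQA built on?"

prompt = (
    "System: Answer the question using the given context.\n\n"
    f"{document}\n\nUser: {question}\n\nAssistant:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```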