
- Original model: [yanolja/EEVE-Korean-Instruct-10.8B-v1.0](https://huggingface.co/yanolja/EEVE-Korean-Instruct-10.8B-v1.0)
- Quantized with [llama.cpp](https://github.com/ggerganov/llama.cpp)


### Usage
Requirements:
```
# GPU (CUDA) build
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose

# CPU-only build
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose

pip install huggingface_hub
```

```
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

import time
from pprint import pprint


# download model
model_name_or_path = "heegyu/EEVE-Korean-Instruct-10.8B-v1.0-GGUF" # repo id
# 4bit
model_basename = "ggml-model-Q4_K_M.gguf" # file name

model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)
print(model_path)


# CPU
# lcpp_llm = Llama(
#     model_path=model_path,
#     n_threads=2,
#     )

# GPUμ—μ„œ μ‚¬μš©ν•˜λ €λ©΄ μ•„λž˜ μ½”λ“œλ‘œ μ‹€ν–‰
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2, # CPU cores
    n_batch=512, # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    n_gpu_layers=43, # Change this value based on your model and your GPU VRAM pool.
    n_ctx=4096, # Context window
)


prompt_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\nHuman: {prompt}\nAssistant:\n"
text = 'ν•œκ΅­μ˜ μˆ˜λ„λŠ” μ–΄λ””μΈκ°€μš”? μ•„λž˜ 선택지 쀑 κ³¨λΌμ£Όμ„Έμš”.\n\n(A) κ²½μ„±\n(B) λΆ€μ‚°\n(C) 평양\n(D) μ„œμšΈ\n(E) μ „μ£Ό'

prompt = prompt_template.format(prompt=text)

start = time.time()
response = lcpp_llm(
    prompt=prompt,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    top_k=50,
    stop=['</s>'], # stop generation when this token is produced
    echo=True # include the prompt in the returned text
)
pprint(response)
print(time.time() - start)
```
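The `prompt_template` above is plain Python string formatting, so the same prompt can be assembled with a small helper. `build_prompt` below is a hypothetical name used for illustration, a minimal sketch rather than anything shipped with the model:

```python
# Hypothetical helper that reproduces the prompt_template formatting above.
SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)

def build_prompt(question, choices=None):
    """Format a question (optionally multiple-choice) into the EEVE chat template."""
    if choices:
        labels = "ABCDE"
        options = "\n".join(f"({labels[i]}) {c}" for i, c in enumerate(choices))
        question = f"{question}\n\n{options}"
    return f"{SYSTEM}\nHuman: {question}\nAssistant:\n"

prompt = build_prompt(
    "What is the capital of Korea? Choose one of the options below.",
    ["Gyeongseong", "Busan", "Pyongyang", "Seoul", "Jeonju"],
)
print(prompt)
```

The resulting string can be passed to `lcpp_llm(prompt=...)` exactly as in the script above.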

Output (Colab T4 GPU):
```
llama_print_timings:        load time =     942.53 ms
llama_print_timings:      sample time =      27.60 ms /    37 runs   (    0.75 ms per token,  1340.43 tokens per second)
llama_print_timings: prompt eval time =     942.29 ms /    83 tokens (   11.35 ms per token,    88.08 tokens per second)
llama_print_timings:        eval time =    4530.31 ms /    36 runs   (  125.84 ms per token,     7.95 tokens per second)
llama_print_timings:       total time =    5648.42 ms /   119 tokens
{'choices': [{'finish_reason': 'stop',
              'index': 0,
              'logprobs': None,
              'text': 'A chat between a curious user and an artificial '
                      'intelligence assistant. The assistant gives helpful, '
                      "detailed, and polite answers to the user's questions.\n"
                      'Human: ν•œκ΅­μ˜ μˆ˜λ„λŠ” μ–΄λ””μΈκ°€μš”? μ•„λž˜ 선택지 쀑 κ³¨λΌμ£Όμ„Έμš”.\n'
                      '\n'
                      '(A) κ²½μ„±\n'
                      '(B) λΆ€μ‚°\n'
                      '(C) 평양\n'
                      '(D) μ„œμšΈ\n'
                      '(E) μ „μ£Ό\n'
                      'Assistant:\n'
                      'ν•œκ΅­μ€ λ™μ•„μ‹œμ•„μ— μœ„μΉ˜ν•œ κ΅­κ°€λ‘œ κ³΅μ‹μ μœΌλ‘œ λŒ€ν•œλ―Όκ΅­μ΄λΌκ³  λΆˆλ¦½λ‹ˆλ‹€. μ„œμšΈμ€ λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„μž…λ‹ˆλ‹€. '
                      'λ”°λΌμ„œ 정닡은 (D) μ„œμšΈμž…λ‹ˆλ‹€.'}],
 'created': 1710404368,
 'id': 'cmpl-af889267-f64e-4516-b0a3-5c8b918d0e36',
 'model': '/root/.cache/huggingface/hub/models--heegyu--EEVE-Korean-Instruct-10.8B-v1.0-GGUF/snapshots/ff014aa6d73ffa8a2857085261cb7a4e6c630bfe/ggml-model-Q4_K_M.gguf',
 'object': 'text_completion',
 'usage': {'completion_tokens': 36, 'prompt_tokens': 83, 'total_tokens': 119}}
5.662428140640259
```
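As a sanity check, the tokens-per-second figures in the log follow directly from the reported times and token counts:

```python
# Reproduce the reported throughput from the llama_print_timings log above.
eval_ms, eval_runs = 4530.31, 36        # "eval time" line: ms and runs
prompt_ms, prompt_tokens = 942.29, 83   # "prompt eval time" line: ms and tokens

eval_tps = eval_runs / (eval_ms / 1000.0)          # generated tokens per second
prompt_tps = prompt_tokens / (prompt_ms / 1000.0)  # prompt tokens per second
print(round(eval_tps, 2), round(prompt_tps, 2))    # → 7.95 88.08
```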