Spaces:
Sleeping
Sleeping
A newer version of the Gradio SDK is available:
5.14.0
metadata
title: German Llm Outputs
emoji: 🦀
colorFrom: green
colorTo: pink
sdk: gradio
sdk_version: 4.36.1
app_file: app.py
pinned: false
license: mit
Dataset
The dataset usesd is https://huggingface.co/datasets/lmsys/chatbot_arena_conversations
Preprocessing:
- filtered german conversations
- took first user prompt
- deleted short prompts (less than 70 chars)
dataset = load_dataset('lmsys/chatbot_arena_conversations')
def get_message(x):
x['message'] = [x['conversation_a'][0]]
return x
dataset = dataset.filter(lambda x: x['language'] == 'German')
dataset = dataset['train'].map(get_message)
dataset = dataset.filter(lambda x: len(x['message'][0]['content']) > 70)
Generation
I rely on the huggingface conversational
pipeline to generate the outputs. There are some issues with the chat template (esp. for the non-instruction tuned models) i'll fix later.
messages = json.loads(Path('messages.json').read_text())
outputs = []
pipe = pipeline(
"conversational",
model=model_name,
torch_dtype="auto",
device_map=device,
max_new_tokens=1024,
trust_remote_code=True
)
for message in tqdm(messages):
output = pipe([message])
outputs.append(output)