How to chat with the Q-RWKV-6?

by ljy77777777 - opened 4 days ago

4 days ago

I'm really interested in your model! This is my first time working with an RWKV-architecture model, so could you please provide the environment requirements and the code to chat with it? Thanks!

wxgeorge

4 days ago • edited 4 days ago

Hi @ljy77777777 - this model is available for inference on Featherless.ai

Check it out here: https://featherless.ai/models/recursal/QRWKV6-32B-Instruct-Preview-v0.1

We provide inference via OpenAI-compatible API endpoints, so you can chat with it using just about any chat client (e.g. TypingMind or SillyTavern, to name a few).

We also provide a basic in-browser client, Phoenix, for fast experimentation.
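For anyone who prefers a script over a chat client, here is a minimal sketch of calling an OpenAI-compatible chat completions endpoint using only the Python standard library. The base URL `https://api.featherless.ai/v1`, the model ID string, and the `FEATHERLESS_API_KEY` environment variable name are assumptions not stated in this thread, so check them against the Featherless documentation before relying on them. The helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import os
import urllib.request

# Assumption: the OpenAI-compatible base URL; not confirmed in this thread.
API_BASE = "https://api.featherless.ai/v1"
# Assumption: the model ID mirrors the path in the model page URL.
MODEL_ID = "recursal/QRWKV6-32B-Instruct-Preview-v0.1"


def build_chat_request(user_message, api_key, max_tokens=256):
    """Build (but do not send) a chat completions request."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{API_BASE}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(user_message, api_key):
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(user_message, api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    # Standard OpenAI-style response shape: first choice, message content.
    return body["choices"][0]["message"]["content"]
```

Usage would look like `print(chat("Hello!", os.environ["FEATHERLESS_API_KEY"]))`. Building the request separately from sending it makes it easy to inspect the payload without making a network call.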

ljy77777777

4 days ago

Thanks for your answer! If I want to deploy Q-RWKV-6 on my own device, is it supported by vLLM now?
I also tried to use pipeline parallelism (with the transformers and accelerate libraries) to deploy the model on multiple GPUs, and I found that it does not work.
Could you please release official code for deploying the model? Thanks!

SmerkyG

recursal org 4 days ago • edited 4 days ago

We don't have vLLM support yet, but we used this HF model a lot internally to run evals with lm-eval-harness and accelerate, so it should definitely work for you. You'll need to install the latest version of the flash-linear-attention repo from https://github.com/sustcsonglin/flash-linear-attention and a recent version of Triton.
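As a rough sketch of what local inference with transformers and accelerate might look like (this is not official deployment code): the snippet below assumes that flash-linear-attention, Triton, transformers, accelerate, and torch are already installed, that the repo needs `trust_remote_code=True`, and that `device_map="auto"` is an acceptable way to spread the weights over the available GPUs. None of those settings are confirmed in this thread. The helper names `load_kwargs` and `chat_once` are hypothetical.

```python
REPO_ID = "recursal/QRWKV6-32B-Instruct-Preview-v0.1"


def load_kwargs():
    """Keyword arguments for from_pretrained. All three are assumptions."""
    return {
        "torch_dtype": "auto",      # use the dtype stored in the checkpoint
        "device_map": "auto",       # let accelerate place layers on devices
        "trust_remote_code": True,  # assumed: repo ships custom model code
    }


def chat_once(prompt, max_new_tokens=256):
    """Load the model and generate a single reply to one user prompt."""
    # Imported lazily so defining these helpers needs no heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(REPO_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(REPO_ID, **load_kwargs())

    # Format the conversation with the chat template shipped with the tokenizer.
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the newly generated tokens, dropping the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `chat_once("Hello!")` downloads and loads the full 32B checkpoint, so it needs enough combined GPU memory. Reloading the model on every call is wasteful; a real script would load it once and reuse it.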

ljy77777777

3 days ago

Thanks for your answer! I deployed the model on my device successfully. However, for many instructions and tasks, the model answers the question completely but then does not stop and keeps generating unrelated text. I think there might be something wrong with my prompt, because I found that RWKV-4-World expects the prompt to be written as follows:
"""Instruction: {instruction}

Input: {input}

Response:"""
Could you please share the prompt template you used when evaluating the model on benchmarks? Thank you!

ljy77777777

1 day ago

Hello, Q-RWKV-6 is an excellent linear-attention LLM. However, I find that the model's performance in Chinese chat is not very good. Is that because the continued training used only English data?

SmerkyG

recursal org 1 day ago

The chat template is built into the Hugging Face repo at https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1/blob/main/tokenizer_config.json
Unlike the World-tokenizer-based RWKV models, it follows the standard ChatML format (e.g. "<|im_start|>", "<|im_end|>"), like the base Qwen model it is adapted from.
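To illustrate what a ChatML prompt looks like, here is a hand-rolled sketch. It is an approximation for illustration only: the authoritative template is the one in tokenizer_config.json (normally applied via the tokenizer's `apply_chat_template`), and it may differ in details such as a default system prompt. The helper names `format_chatml` and `truncate_at_end_token` are hypothetical.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model writes the reply next.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def truncate_at_end_token(generated_text, end_token="<|im_end|>"):
    """Cut generated text at the first end-of-turn marker, if one is present."""
    return generated_text.split(end_token, 1)[0]


prompt = format_chatml([{"role": "user", "content": "Hello"}])
print(prompt)
```

If generation runs on past the answer, one possible cause is that the prompt was not in this format or that "<|im_end|>" was not treated as a stop token. Truncating at the marker, as in `truncate_at_end_token`, is a simple post-processing workaround under that assumption.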

SmerkyG

recursal org 1 day ago

Not too sure about Chinese chat. We did some minor checks and it seemed good, but it's definitely possible that the training data reduced its abilities there, since we used DCLM.
