How to chat with Q-RWKV-6?

by ljy77777777 - opened

I'm really interested in your model! But this is my first time learning about an RWKV-architecture model; could you please provide the environment requirements and the code to chat with it? Thanks!

hi @ljy77777777 - this model is available for inference on Featherless.ai

Check out here: https://featherless.ai/models/recursal/QRWKV6-32B-Instruct-Preview-v0.1

We provide inference via OpenAI-compatible API endpoints, so you can chat with it using just about any chat client (e.g. TypingMind or SillyTavern, to name a few).
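For example, here is a minimal sketch of calling it through the OpenAI Python client. The base URL and the API-key placeholder are assumptions, so check the Featherless.ai docs for the exact values:

```python
# Minimal sketch: chatting via an OpenAI-compatible endpoint.
# base_url is an assumption -- verify it against the Featherless.ai docs.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.featherless.ai/v1",  # assumed endpoint
    api_key="YOUR_FEATHERLESS_API_KEY",        # placeholder
)

resp = client.chat.completions.create(
    model="recursal/QRWKV6-32B-Instruct-Preview-v0.1",
    messages=[{"role": "user", "content": "Hello! What is RWKV?"}],
)
print(resp.choices[0].message.content)
```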

We also provide a basic in-browser client, Phoenix, for fast experimentation.

Thanks for your answer! If I want to deploy Q-RWKV-6 on my own device, is it supported by vLLM now?
I also tried using pipeline parallelism (the transformers & accelerate libraries) to deploy the model across multiple GPUs, and it did not work.
Could you please release official deployment code for the model? Thanks!

We don't have vllm support yet, but we used this HF model a lot internally to do the evals using lm-eval-harness and accelerate, so it definitely should work for you. You'll need to install the latest version of the flash-linear-attention repo at https://github.com/sustcsonglin/flash-linear-attention and a recent version of Triton.
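For reference, here is a minimal sketch of multi-GPU loading with transformers + accelerate; trust_remote_code and the bfloat16 dtype are assumptions on my part, so check the model card, and make sure flash-linear-attention and Triton are installed as noted above:

```python
# Minimal sketch: shard the model across available GPUs with accelerate.
# trust_remote_code=True and torch.bfloat16 are assumptions -- see model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "recursal/QRWKV6-32B-Instruct-Preview-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # accelerate places layers across available GPUs
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```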

Thanks for your answer! I deployed the model on my device successfully. However, on many instructions and tasks I find that the model answers the question completely but then does not stop, and keeps generating unrelated text. I think there might be something wrong with my prompt, because I found that RWKV-4-World expects prompts written as follows:
"""Instruction: {instruction}

Input: {input}

Response:"""
Therefore, could you please provide the prompt template you used when evaluating the model on benchmarks? Thank you!

Hello, Q-RWKV-6 is an excellent linear-attention LLM. However, I find the model's performance in Chinese chatting is not very good. Is that because the continued training used only English data?

recursal org

The chat template is built into the Hugging Face repo, in https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1/blob/main/tokenizer_config.json
Unlike the world-tokenizer-based RWKV models, it follows standard ChatML, e.g. "<|im_start|>", "<|im_end|>", like the base Qwen model it is adapted from.
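As a quick check, here is a minimal sketch that renders the built-in template; the exact output comes from tokenizer_config.json, so the commented rendering below is only illustrative. Stopping generation at <|im_end|> rather than at an RWKV-World-style "Response:" marker should fix the non-stopping behaviour described above:

```python
# Minimal sketch: render the built-in ChatML template without tokenizing.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "recursal/QRWKV6-32B-Instruct-Preview-v0.1", trust_remote_code=True
)
messages = [{"role": "user", "content": "What is RWKV?"}]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# Expected shape (illustrative):
# <|im_start|>user
# What is RWKV?<|im_end|>
# <|im_start|>assistant
```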

recursal org

Not too sure about Chinese chatting - we did some minor checks and it seemed good, but it's definitely possible that the training data reduced its abilities there, as we used DCLM.
