nvidia
/

NV-Llama2-13B-RLHF-RM

Text Generation

Model card Files Files and versions Community

jiaqiz commited on Feb 20

Commit

cc62a92

•

1 Parent(s): 69af399

update README.md

Files changed (1) hide show

README.md +2 -2

README.md CHANGED Viewed

@@ -20,11 +20,11 @@ Llama2-13B-RLHF-RM is a 13 billion parameter language model (with context of up
 Starting from [Llama2-13B base model](https://huggingface.co/meta-llama/Llama-2-13b), it is first instruction-tuned with a combination of public and proprietary data and then trained on [HH-RLHF dataset](https://huggingface.co/datasets/Anthropic/hh-rlhf) with reward modeling objective. Given a conversation with multiple turns between user and assistant, it assigns a score on overall helpfulness for the last assistant turn.
-Llama2-13B-RLHF-RM is trained with NVIDIA NeMo, an end-to-end, cloud-native framework to build, customize, and deploy generative AI models anywhere. It includes training and inferencing frameworks, guardrailing toolkits, data curation tools, and pretrained models, offering enterprises an easy, cost-effective, and fast way to adopt generative AI.
 ## Usage:
-Training a reward model is an essential component of Reinforcement Learning from Human Feedback (RLHF). By developing a strong reward model, we can mitigate the risks of reward hacking and ensure that the actor is incentivized to produce helpful responses. We are open-sourcing this reward model so that users can seamlessly integrate it with Proximal Policy Optimization (PPO) Training using [NeMo Aligner](https://github.com/NVIDIA/NeMo-Aligner). For detailed instructions on how to conduct the training, please refer to our [RLHF training user guide](https://github.com/NVIDIA/NeMo-Aligner/blob/main/docs/user-guide/RLHF.rst).

 Starting from [Llama2-13B base model](https://huggingface.co/meta-llama/Llama-2-13b), it is first instruction-tuned with a combination of public and proprietary data and then trained on [HH-RLHF dataset](https://huggingface.co/datasets/Anthropic/hh-rlhf) with reward modeling objective. Given a conversation with multiple turns between user and assistant, it assigns a score on overall helpfulness for the last assistant turn.
+Llama2-13B-RLHF-RM is trained with NVIDIA [NeMo Aligner](https://github.com/NVIDIA/NeMo-Aligner), a scalable toolkit for efficient model alignment. NeMo-Aligner is built using the [NeMo Toolkit](https://github.com/NVIDIA/NeMo) which allows for scaling training up to 1000s of GPUs using tensor, data and pipeline parallelism for all components of alignment. All of our checkpoints are cross compatible with the NeMo ecosystem, allowing for inference deployment and further customization.
 ## Usage:
+Training a reward model is an essential component of Reinforcement Learning from Human Feedback (RLHF). By developing a strong reward model, we can mitigate the risks of reward hacking and ensure that the actor is incentivized to produce helpful responses. We are open-sourcing this reward model so that users can seamlessly integrate it with Proximal Policy Optimization (PPO) training using [NeMo Aligner](https://github.com/NVIDIA/NeMo-Aligner). For detailed instructions on how to conduct the training, please refer to our [RLHF training user guide](https://github.com/NVIDIA/NeMo-Aligner/blob/main/docs/user-guide/RLHF.rst).