🤗 Can We Train Chat Models with Raw Data? #1
The idea of training a chat model with desired raw data is incredibly appealing.
However, there is a significant problem with this process. Directly training a chat model with raw data can disrupt its output format.
To solve this issue, the common approach is to create Q/A-formatted datasets. However, this method is time-consuming, costly, and may result in information loss or bias during dataset creation.
So, how can we train effectively on raw data? We can exploit the sequential structure of transformer models like Llama, which consist of a stack of layers.
I intentionally have the layers responsible for handling the output format take shape in the latter part of the model, and designate the middle-to-late layers as the starting point for raw training.
You may think that the method involves feeding chat data to the later layers and then training the middle to late layers with raw data, but that's not the case. Such an approach cannot properly address the problem and may even lead to increased model complexity.
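To make the layer split concrete, here is a minimal sketch of how one might pick, by index, which decoder layers a raw-data pass is allowed to touch. It is only an illustration of layer selection, not the method from the linked post. The helper names (`layer_index`, `select_trainable`), the 32-layer model, and the 16 to 27 range are all assumptions chosen for the example; the parameter names follow the usual Hugging Face Llama naming.

```python
import re

# Hugging Face Llama checkpoints name decoder-layer weights like
# "model.layers.12.self_attn.q_proj.weight".
LAYER_PATTERN = re.compile(r"\.layers\.(\d+)\.")


def layer_index(param_name):
    """Return the decoder layer index found in a parameter name, or None."""
    match = LAYER_PATTERN.search(param_name)
    return int(match.group(1)) if match else None


def select_trainable(param_names, first_layer, last_layer):
    """Keep parameters whose layer index lies in [first_layer, last_layer]."""
    selected = []
    for name in param_names:
        idx = layer_index(name)
        if idx is not None and first_layer <= idx <= last_layer:
            selected.append(name)
    return selected


# Hypothetical 32-layer model; one weight per layer keeps the example short.
NUM_LAYERS = 32
names = ["model.embed_tokens.weight"]
names += [f"model.layers.{i}.mlp.down_proj.weight" for i in range(NUM_LAYERS)]
names += ["model.norm.weight", "lm_head.weight"]

# Assumed split: layers 16..27 are open to raw training, while the embeddings,
# the early layers, and the last few layers (output format) stay frozen.
trainable = select_trainable(names, first_layer=16, last_layer=27)
print(len(trainable), trainable[0], trainable[-1])
```

With a real PyTorch model, the same selection could be applied by looping over `model.named_parameters()` and setting `param.requires_grad` to `True` only for the selected names.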
The idea presented above doesn't seem bad, so how can we make good use of it? Let's try using a base model.
Read more - https://huggingface.co/blog/maywell/layer-aware-1