Why choose qwen2-0.5B as backbone model?

#4
by alan - opened

Hi,

It is rare to convert an SLM into an embedding model; it is more common to convert an LLM into an embedding model.

So I have two small questions I'd like to ask:

  1. Why did you choose to use an SLM as the backbone instead of a more common multilingual transformer encoder such as XLM-R? Have you tried applying your entire implementation method to a transformer-encoder-based model?
  2. Since you chose an SLM as the embedding model, why did you choose masked mean pooling instead of the more common last-token pooling? Because an SLM uses unidirectional attention, the embeddings of earlier tokens don't carry global information, so averaging token embeddings effectively increases the weight of the earlier tokens' information, which seems a bit counterintuitive.

Looking forward to your response, thanks~

HITsz-Text Machine Group org

First, let us address your questions:

  1. Yes, we have tried it. Under the same training data and training parameter settings, the performance of the decoder-only model and the encoder-only model is comparable (qwen2 vs XLM).

  2. This is a very interesting point.
    In fact, we have experimented with both last-token pooling and mean pooling, and found that mean pooling performs slightly better for both our mini and max models.
    Of course, there are now some weighted mean pooling methods that might be more suitable for an SLM (such as pooling_mode_weightedmean_tokens supported by sentence-transformers), which could be worth trying; a short sketch of these pooling strategies follows at the end of this reply.
    As for why mean pooling works for an SLM, I speculate that one possible reason is that, after training, causal attention makes the hidden state at each token position focus mainly on the token itself while still inherently carrying the semantics of its context. This means preceding-token information is represented relatively weakly, so even if it accumulates across subsequent tokens, it does not pose a significant problem.
    Of course, this also raises many interesting questions for discussion, such as:
    (1) Whether the model trained in this way would be insensitive to the order of contextual words (refer to https://jina.ai/news/text-embeddings-fail-to-capture-word-order-and-how-to-fix-it);
    (2) Whether the model might focus more on the information at the beginning of the text.
    And so on.
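
A minimal PyTorch sketch of the three pooling strategies mentioned above (masked mean, last-token, and a position-weighted mean in the spirit of sentence-transformers' pooling_mode_weightedmean_tokens). This is illustrative only, not the model's released code; all function names here are made up:

```python
import torch

def masked_mean_pooling(hidden_states, attention_mask):
    # hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len), 1 = real token
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def last_token_pooling(hidden_states, attention_mask):
    # Hidden state of the last non-padding token (assumes right padding).
    last_idx = attention_mask.sum(dim=1) - 1
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_idx, last_idx]

def weighted_mean_pooling(hidden_states, attention_mask):
    # Weights grow linearly with position, so later tokens (which have seen more
    # context under causal attention) count more in the average.
    weights = torch.arange(1, hidden_states.size(1) + 1, device=hidden_states.device).float()
    weights = weights.view(1, -1, 1) * attention_mask.unsqueeze(-1).float()
    return (hidden_states * weights).sum(dim=1) / weights.sum(dim=1).clamp(min=1e-9)
```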

HITsz-Text Machine Group org

Returning to the question of why we adopted an SLM: we briefly discussed this in our technical report. Although our experiments indicate that decoder-only models and encoder-only models perform similarly under the same settings, there are several advantages to choosing an SLM like qwen2:

  1. As a more recent model, qwen2 exhibits better multilingual performance, particularly in tokenization (for instance, XLM cannot handle Tibetan, which is used in China).
  2. qwen2 supports flash attention, which facilitates faster inference (see the loading sketch after this list).
  3. The position embedding in qwen2 inherently supports longer contexts, making it easier to extend the length (although our current version of the model has not been optimized for long texts).
  4. Using an SLM allows us to more easily scale up to larger models.
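
A minimal sketch of points 2 and 3 in practice: loading a Qwen2-0.5B backbone with FlashAttention-2 via Hugging Face transformers and mean-pooling the outputs into sentence embeddings. The model id below is the base Qwen2 checkpoint, shown only for illustration (not the fine-tuned embedding model), and flash-attn requires a compatible GPU:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "Qwen/Qwen2-0.5B"  # base backbone, illustrative only
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # needs the flash-attn package and a supported GPU
).to("cuda").eval()

texts = ["an example sentence", "另一个例子"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to("cuda")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state        # (batch, seq_len, dim)

# Masked mean pooling, then L2-normalize to get sentence embeddings.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
```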

In summary, qwen2, as a more recent SLM, offers more convenient features that are easier for developers to use and improve upon. Of course, encoder-only models like XLM may still have unique characteristics in text encoding, and we look forward to other developers exploring and optimizing these aspects further.

@YanshekWoo Thanks for your quick and detailed response!

For the pooling method: maybe, as your technical report mentions, the pooling method isn't the key point; the data is.

For the reason you chose an SLM: although, at the same parameter scale, there is currently no metric evidence that decoder-only models outperform encoder-only models on embedding tasks, SLMs represent the latest trend and enjoy better ecosystem support. This allows the adoption of many new technological features, whereas BERT-based models like XLM-R may be somewhat outdated and lack maintenance or updates. Additionally, an SLM can be seen as a transitional stage in the development of embedding models toward LLMs. Therefore, an SLM was ultimately chosen as the backbone.

Finally, I'm curious whether you have tried embedding-model training methods unique to SLMs, for instance, leveraging the generative capabilities of an SLM to further enhance embedding performance.

HITsz-Text Machine Group org

@alan Thanks for your consideration and response!

> Finally, I'm curious whether you have tried embedding-model training methods unique to SLMs, for instance, leveraging the generative capabilities of an SLM to further enhance embedding performance.

For brainstorming purposes, we have indeed considered some approaches but have not implemented them.
For example, training the dual-tower embeddings of an SLM using next-token prediction (or next-phrase prediction); I also recall that there has been similar research work, although I do not remember the specific study.
Other potential directions might involve combining generative information retrieval with embeddings.
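
For what it's worth, here is a purely hypothetical sketch of that first idea: combining an in-batch InfoNCE contrastive loss over pooled dual-tower embeddings with an auxiliary next-token-prediction loss from the same causal backbone. The function name, temperature, and 0.1 weighting are all made-up placeholders; as noted above, none of this has been implemented:

```python
import torch
import torch.nn.functional as F

def contrastive_plus_ntp_loss(query_emb, doc_emb, lm_logits, labels,
                              temperature=0.05, ntp_weight=0.1):
    # In-batch InfoNCE: the i-th query should match the i-th document.
    q = F.normalize(query_emb, dim=-1)                  # (batch, dim)
    d = F.normalize(doc_emb, dim=-1)                    # (batch, dim)
    scores = q @ d.t() / temperature                    # (batch, batch)
    targets = torch.arange(scores.size(0), device=scores.device)
    contrastive = F.cross_entropy(scores, targets)

    # Auxiliary next-token prediction on the same backbone's LM-head outputs.
    # lm_logits: (batch, seq_len, vocab); labels: (batch, seq_len), -100 = ignore.
    ntp = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    return contrastive + ntp_weight * ntp
```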

However, we have not yet explored other unique and useful features of SLMs. If you have any interesting ideas (that can be shared publicly), we welcome further discussion and exchange.
