ncbi/MedCPT-Article-Encoder · Can I use MedCPT-Article-Encoder to encode chunks of medical articles and not Title-Abstract pairs?

Hello everyone, I am building a retrieval-augmented generation application using a knowledge base of 150+ scientific/medical articles, specifically from the biostatistics field. I am currently using an OpenAI embedding model, but I am aiming to improve my pipeline's performance, and I was wondering if it makes sense to use the MedCPT-Article-Encoder given that it was trained on domain-specific data. However, I am worried that, since the model was trained on title-abstract pairs, this change would be counterproductive, as I would use it to embed all of the chunked articles (with variable chunk sizes ranging from 200 to 500 words). Furthermore, I was wondering whether, in case MedCPT is not a valid option, I should look at other models like BioBERT, PubMedBERT, or something else. Thank you very much for your help.

Thank you for your interest in our work. Yes, you can use MedCPT to encode your chunks, and in our MedRAG evaluation (https://github.com/Teddy-XiongGZ/MedRAG) it outperformed other commonly used retrievers. Models like BioBERT and PubMedBERT are not contrastively trained, so they are not suitable for generating embeddings.
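
For anyone landing here later, below is a minimal sketch of what encoding chunks could look like. It follows the usual `transformers` loading pattern and takes the `[CLS]` vector as the embedding. The `chunk_words` and `embed_chunks` helpers are hypothetical names for illustration, and pairing each chunk with its article title (to mirror the title-abstract input format) is an assumption, not something confirmed in this thread.

```python
from typing import List


def chunk_words(text: str, max_words: int = 300) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed_chunks(title: str, chunks: List[str]):
    """Embed [title, chunk] pairs with MedCPT-Article-Encoder (downloads the model)."""
    # Imported lazily so chunk_words can be used without torch/transformers installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "ncbi/MedCPT-Article-Encoder"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    model.eval()

    # Assumption: put the article title where the model expects a title
    # and the chunk where it expects an abstract.
    pairs = [[title, chunk] for chunk in chunks]
    with torch.no_grad():
        encoded = tokenizer(
            pairs,
            truncation=True,   # inputs longer than max_length are cut off
            padding=True,
            return_tensors="pt",
            max_length=512,
        )
        # Use the [CLS] token's last hidden state as the chunk embedding.
        return model(**encoded).last_hidden_state[:, 0, :]
```

Note that with `max_length=512`, chunks near the 500-word end of the range may exceed the token limit and get truncated, so smaller chunks are probably safer. Queries would be embedded separately with the companion `ncbi/MedCPT-Query-Encoder` model.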