R1 fine-tuning
What if fine-tuning were done on R1 and then used for this model?
I think you'd have to be more specific. Do you mean using the same datasets as the DeepSeek-R1 → Qwen2.5 32B R1 distillation? Or some kind of offline logit distillation?
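(For reference, a minimal sketch of what the offline-logit route could look like, assuming PyTorch. The function name, tensor shapes, and temperature are illustrative, not a confirmed recipe; the teacher logits are assumed to be precomputed and loaded from disk.)

```python
import torch
import torch.nn.functional as F

def offline_logit_distillation_loss(student_logits: torch.Tensor,
                                    teacher_logits: torch.Tensor,
                                    temperature: float = 2.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, using stored teacher logits.

    Both tensors have shape (batch, seq_len, vocab_size); the teacher logits
    come from an offline pass rather than a live teacher model.
    """
    t = temperature
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Flatten (batch, seq) so 'batchmean' averages per token.
    vocab = student_log_probs.size(-1)
    loss = F.kl_div(student_log_probs.view(-1, vocab),
                    teacher_probs.view(-1, vocab),
                    reduction="batchmean")
    # Scale by t^2 so gradient magnitude stays comparable across temperatures.
    return loss * t * t
```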
I'm not sure training on the 32B R1 would be worth it -- it'd likely catastrophically forget too much to keep doing the fancy CoT, and the only benefit would be a different flavor of prose. Could be interesting, but I'm not sure it's worth spending the compute on.
@Kearm @Fizzarolli On the 671B model, probably either V3 or R1.
Would love to get the money for that one lol
@Fizzarolli Would love to do it from Feb 17th to the end of February, if it takes that much time to tune. Probably use it to distill a smaller 104B model, like Command R+ maybe.
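(If it helps, here's a rough sketch of the sequence-level route for that: sample completions from the big teacher offline, then plain SFT on the smaller student. The repo ID and the `sft_step` helper are assumptions for illustration, not a confirmed pipeline.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed student repo ID -- the 104B Command R+ mentioned above.
STUDENT = "CohereForAI/c4ai-command-r-plus"

tok = AutoTokenizer.from_pretrained(STUDENT)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
student = AutoModelForCausalLM.from_pretrained(STUDENT, torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

# Step 1 (offline, not shown): sample reasoning traces from the 671B teacher
# via an inference server and save the (prompt + completion) texts to disk.
# Step 2: ordinary SFT on those texts -- the student imitates the teacher's
# sampled outputs instead of matching its logits.
def sft_step(batch_texts: list[str]) -> float:
    enc = tok(batch_texts, return_tensors="pt", padding=True, truncation=True)
    # Mask out padding so it doesn't contribute to the loss.
    labels = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    out = student(**enc, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```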