Token Count Calculation in SFT Data Distribution Curation
Regarding the curation of SFT data, including the curation of the data distribution, I would like to understand how you calculate the token count for each data entry when designing the distribution. Is the token count based only on the user tokens, or does it also include the assistant tokens? (The reason I ask is that I understand the SFT loss is calculated only based on the assistant tokens.)
I would assume both, user + assistant. Although the loss is computed only on the generated tokens (the assistant response), the input is processed as a whole, including the user tokens.
I'm not sure I fully understand what you mean. What I want to clarify is how we calculate the token count for each data sample when determining the domain distribution in our SFT dataset. Do we count both user and assistant tokens, or only the assistant tokens? Personally, I feel that counting only assistant tokens makes more sense, since only assistant tokens contribute to the loss. From the perspective of task balancing and learning dynamics, wouldn't that be more appropriate?
I’d really appreciate a more detailed explanation from you. Thank you!
Yes, I meant we count both user and assistant tokens. You are absolutely right that the loss is computed only on the assistant tokens, but the forward pass runs over all tokens, not just the assistant ones. You also need the full count to determine the batch size, since that is based on user + assistant tokens.
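As a minimal sketch of that distinction (function name and token ids are illustrative, not from any particular tokenizer or framework): the total token count per sample feeds batch-size and sequence-length budgeting, while the loss mask restricts the cross-entropy loss to assistant positions.

```python
def token_counts(user_tokens, assistant_tokens):
    """Return (total_tokens, loss_tokens) for one SFT sample.

    total_tokens drives batch-size / sequence-length budgeting;
    loss_tokens is what the cross-entropy loss actually averages over.
    """
    total = len(user_tokens) + len(assistant_tokens)
    # Loss mask: 0 for user (prompt) positions, 1 for assistant positions.
    loss_mask = [0] * len(user_tokens) + [1] * len(assistant_tokens)
    loss = sum(loss_mask)
    return total, loss

# Example: 5 prompt tokens, 3 response tokens.
total, loss = token_counts([101, 7, 8, 9, 102], [42, 43, 102])
print(total, loss)  # 8 3
```

So a sample with a long prompt and a short response still costs many tokens of compute even though it contributes only a few token-level loss terms.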
Thanks for the detailed clarification — that makes sense and I really appreciate your explanation.
Just to confirm one more point: in the case where we only consider assistant tokens for the token count (instead of user + assistant), would that approach make more sense specifically from the perspective of task balancing and learning dynamics? In other words, could focusing on assistant tokens alone lead to a more balanced representation of the tasks actually being optimized during training?
If by "balanced" you mean that completion lengths become independent of input size, then yes. In that case we would want to see a smaller standard deviation in the distribution of assistant token lengths, which makes them less dependent on prompt size.
Thanks again for the follow-up!
I understand your point about making the distribution independent of input size — that aligns with my thinking.
However, I’m not quite sure I follow the part about "see a smaller std in distribution of assistant tokens." Could you please elaborate a bit more on this? Specifically, why is a smaller std desirable in this context, and how does it relate to the task balancing?
I want to make sure I fully grasp the statistical implication you are describing here. Thanks!
What I meant is that you could have a small standard deviation around the mean assistant response length, which would indicate that most responses cluster around the same length and are therefore roughly independent of input size.
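To make the statistic concrete, here is a small hypothetical example (the domain names and length values are made up for illustration): two domains with similar mean response lengths but very different spreads.

```python
import statistics

# Hypothetical assistant-response token lengths per domain.
lengths = {
    "math": [220, 240, 210, 230],   # tightly clustered -> small std
    "chat": [50, 400, 30, 800],     # widely spread -> large std
}

for domain, xs in lengths.items():
    mean = statistics.mean(xs)
    std = statistics.pstdev(xs)
    print(f"{domain}: mean={mean:.0f} std={std:.0f}")
```

The "math" domain's small std says responses cluster around one length, roughly independent of how long the prompts are; the "chat" domain's large std says response length varies a lot from sample to sample.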
Got it, thank you for clarifying. However, I’m actually not concerned about whether the standard deviation of the assistant responses is large or small. I just wanted to understand why you mentioned that point — were you trying to make a specific argument or illustrate something with it?
To confirm, my main question is: when we plan the domain distribution of our dataset, should we consider only the assistant responses and completely ignore the user inputs?
We should consider both, because the size and diversity of the inputs matter as well. Counting response length alone is only relevant when you want your model to be indifferent to the size of the user input.