Unable to reproduce training
Hi. Thanks for open-sourcing the DFlash code.
I trained a DFlash speculator following "examples/train/dflash_qwen3_8b_sharegpt_online_5k.sh" around May 22nd and was able to reproduce the results, although inference with the released models produced repetitions.
Since the PR mentioned was merged into vllm:main, inference works well, but I can no longer train the model: I run into shared memory issues. Is this due to the async? If I switch the vLLM codebase back to commits from May 22nd, training works again. I'm not sure how to resolve this.
Thanks!!
Could you share some more details on the issues you're facing?
During training, data preparation and the vLLM server launch both complete successfully, but I believe the calls to the vLLM server that fetch hidden states during training are what fail.
I get info messages like: INFO 05-27 17:26:35 [shm_broadcast.py:698] No available shared memory broadcast block found in 60 seconds
which cascade into: WARNING Request aborted (attempt 1/4): Request vllm_client.py:40 timed out.. Retrying in 2s...
with the request finally failing after 4 attempts.
The failed requests are interleaved with successful ones like: "POST /v1/completions HTTP/1.1" 200 OK
but the data loader eventually dies after several failed requests.
I also suspect that the hidden states are not being written for the failed requests, which leads to errors like: /app/speculators/src/speculators/train/data.py:299: UserWarning: Failed to load/cache hidden states for sample 1332: Request timed out.
But these errors go away when I switch to an older version of vLLM (May 22nd). I'm not sure what exactly is happening.
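In case it helps compare the two vLLM versions, here is a rough sketch of how the messages above could be tallied from a saved training log. The regexes are copied from the lines quoted in this thread and are only my assumption about the log format, not a documented vLLM or speculators schema; `tally_log` is a hypothetical helper, not part of either codebase.

```python
import re
from collections import Counter

# Patterns taken from the log lines quoted in this thread. They are
# assumptions about the format and may need adjusting for other versions.
PATTERNS = {
    "shm_broadcast_stall": re.compile(
        r"No available shared memory broadcast block found"
    ),
    "request_retry": re.compile(r"Request aborted \(attempt \d+/\d+\)"),
    "hidden_state_failure": re.compile(
        r"Failed to load/cache hidden states for sample (\d+)"
    ),
    "http_ok": re.compile(r'"POST /v1/completions HTTP/1\.1" 200 OK'),
}


def tally_log(lines):
    """Count each message type and collect the sample ids that failed."""
    counts = Counter()
    failed_samples = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match is None:
                continue
            counts[name] += 1
            if name == "hidden_state_failure":
                failed_samples.append(int(match.group(1)))
    return counts, failed_samples


# Small demo using the four lines quoted above.
demo_lines = [
    "INFO 05-27 17:26:35 [shm_broadcast.py:698] No available shared memory "
    "broadcast block found in 60 seconds",
    "WARNING Request aborted (attempt 1/4): Request vllm_client.py:40 "
    "timed out.. Retrying in 2s...",
    '"POST /v1/completions HTTP/1.1" 200 OK',
    "/app/speculators/src/speculators/train/data.py:299: UserWarning: "
    "Failed to load/cache hidden states for sample 1332: Request timed out.",
]
demo_counts, demo_failed = tally_log(demo_lines)
print(dict(demo_counts), demo_failed)
```

Running the same tally on a log from the May 22nd commits and on one from current vllm:main should show whether the shm_broadcast messages appear only on the newer version, and which samples end up without hidden states.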
Hope this clarifies. Would be happy to share any more details. Thanks!