vLLM Silently Truncated My Reasoning Traces and I Didn't Notice for a Week
vLLM has a parameter called max_model_len: the maximum combined length of the prompt plus generated tokens. If a generated sequence would exceed it, vLLM stops generating there. No warning in the logs, no error raised. Your outputs just lose their tails.
I ran DeepSeek-R1-Distill-Qwen-32B at max_model_len=8192 for a week of experiments. This model can generate reasoning traces well over 8K tokens, especially on harder problems like AIME. I was cutting traces short without knowing it.
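The arithmetic is worth spelling out, because the cap applies to prompt and output combined. A tiny illustrative helper (not part of vLLM's API):

```python
def generation_budget(prompt_tokens: int, max_model_len: int) -> int:
    """Tokens left for generation: max_model_len caps prompt + output combined."""
    return max(0, max_model_len - prompt_tokens)

# An AIME-style prompt of ~300 tokens leaves under 8K tokens of room at
# max_model_len=8192 -- not enough for a reasoning trace that runs past 8K.
print(generation_budget(300, 8192))   # 7892
print(generation_budget(300, 16384))  # 16084
```

Any trace longer than that budget gets cut mid-thought, regardless of how close the model was to an answer.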
The numbers
Baseline accuracy with different max_model_len:
| | max_model_len=8192 | max_model_len=16384 |
|---|---|---|
| MATH-500 | 66.0% | 66.8% |
| AIME24 | 46.7% | 66.7% |
AIME24 jumped 20 points. The published AIME24 score for this model is ~72%; at 8192 I was getting 46.7% and treating it as the real baseline.
Note
One thing to note: our MATH-500 baseline (66.8% at 16384) is still 27pp below DeepSeek's published 94.3%. Truncation only accounts for 0.8pp of that. The rest is likely sampling: DeepSeek's official eval uses temperature=0.6, top_p=0.95 with 64 samples per query for pass@1, while we used greedy decoding (temperature=0.0, single sample). We also used a system prompt, which the official docs recommend against for R1 models. The AIME24 numbers are closer to published (66.7% vs 72.6%), so the truncation story holds up there. But our MATH-500 baseline has problems beyond truncation that we didn't diagnose.
AIME problems produce long reasoning traces. At 8192, about half the AIME traces got cut off before the model finished thinking. The final answer derivation tends to be near the end of the trace, so truncation disproportionately destroys the answer. MATH-500 was less affected (8% truncation) because the traces are shorter, but even the 0.8pp shift matters when you're looking for small intervention effects.
How this created a fake result
I was testing an entropy pruning method: score each reasoning step by how uncertain the model is at that point, remove the bottom 50%, and let the model re-derive the answer from what remains. I ran an ablation varying the continuation budget:
| max_tokens | MATH gain (8192 baseline) | MATH gain (16384 baseline) |
|---|---|---|
| 256 | -22.8pp | -25.4pp |
| 1024 | +2.6pp | -1.2pp |
| 2048 | +3.4pp | -1.2pp |
At the truncated baseline: a positive result with a clear sign flip. At the proper baseline: negative across the board.
Truncation cuts off the end of a trace. Pruning removes steps from the middle. So the pruned scaffold is structurally complete even though it's shorter, and the model can sometimes re-derive the answer that truncation had destroyed. The gain isn't from pruning; it's from giving the model a second chance to finish what truncation prevented.
AIME24 made this obvious. At 8192, the method showed +10.0pp gain. At 16384, it showed exactly +0.0pp, zero repairs, zero damage across all 30 problems. The method changed nothing when the traces weren't truncated.
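The pruning step itself can be sketched in a few lines. This is a minimal reconstruction, not the actual experiment code: it assumes each reasoning step comes with the logprobs of its chosen tokens, and uses mean negative logprob as the uncertainty score (a common proxy when full-distribution entropy isn't logged).

```python
def prune_steps(steps, keep_frac=0.5):
    """steps: list of (text, token_logprobs) pairs. Scores each step by mean
    negative logprob (proxy for model uncertainty), drops the lowest-scoring
    (most confident) half, and keeps the survivors in original order."""
    scored = [(-sum(lps) / len(lps), i) for i, (_, lps) in enumerate(steps)]
    n_keep = max(1, int(len(steps) * keep_frac))
    keep = {i for _, i in sorted(scored, reverse=True)[:n_keep]}
    return [text for i, (text, _) in enumerate(steps) if i in keep]

steps = [
    ("So x must be even.", [-0.1, -0.2]),         # confident -> pruned
    ("Hmm, maybe try mod 7?", [-2.0, -1.5]),      # uncertain -> kept
    ("Therefore x = 4.", [-0.05, -0.1]),          # confident -> pruned
    ("Wait, check the case n=1.", [-1.8, -2.2]),  # uncertain -> kept
]
print(prune_steps(steps))
# ['Hmm, maybe try mod 7?', 'Wait, check the case n=1.']
```

The key property for the truncation story: pruning preserves the trace's beginning and end, so the scaffold the model continues from is structurally complete, unlike a truncated trace.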
How to check if this is happening to you
Compare your baseline against published numbers. If there's a gap of more than a few points, max_model_len is the first thing to check.
You can also look at the finish_reason field in vLLM's output. If it says "length" instead of "stop", the generation was cut short:
```python
# outputs is the list of RequestOutput objects returned by llm.generate(...)
for output in outputs:
    if output.outputs[0].finish_reason == "length":
        print("WARNING: generation hit max_model_len, likely truncated")
```
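It's worth aggregating this over a whole eval run rather than eyeballing warnings. A small helper, assuming you've collected the finish_reason strings into a list:

```python
from collections import Counter

def truncation_report(finish_reasons):
    """Summarize how many generations were cut off. finish_reasons is one
    vLLM finish_reason string per generation ('stop' or 'length')."""
    counts = Counter(finish_reasons)
    return {"truncated": counts["length"],
            "total": len(finish_reasons),
            "rate": counts["length"] / len(finish_reasons)}

# e.g. 15 of 30 AIME traces hitting the length cap:
print(truncation_report(["length"] * 15 + ["stop"] * 15))
# {'truncated': 15, 'total': 30, 'rate': 0.5}
```

A truncation rate anywhere above a few percent on a reasoning benchmark means your accuracy numbers are measuring the length cap, not the model.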
For DeepSeek-R1 distills, the official vllm serve command uses max_model_len=32768. I used 16384 as a hardware compromise on a 96GB GPU, and AIME24 was still 5.9pp below published at that setting. 32768 is safer but needs a lot of VRAM. On a 96GB GPU with the 32B model, I couldn't get 32768 to work with CUDA graphs enabled (they eat ~30GB on top of model weights). It did work with enforce_eager=True, which disables CUDA graphs, but throughput dropped to about 7.6 tok/s output per sequence. 16384 with CUDA graphs and FP8 KV cache was the sweet spot for this hardware.
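You can estimate the KV-cache cost of a given context length before launching anything. A rough back-of-the-envelope calculator; the default dimensions approximate a Qwen2.5-32B-style config (64 layers, 8 KV heads of dimension 128) and should be checked against your model's actual config.json:

```python
def kv_cache_bytes(seq_len, n_layers=64, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=1):
    """Rough per-sequence KV-cache size: one K and one V tensor per layer.
    bytes_per_elem=1 models an FP8 cache; use 2 for FP16/BF16."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

GIB = 1024 ** 3
print(kv_cache_bytes(16384) / GIB)                    # FP8, 16K ctx: 2.0 GiB
print(kv_cache_bytes(32768, bytes_per_elem=2) / GIB)  # FP16, 32K ctx: 8.0 GiB
```

This ignores vLLM's block allocation overhead and batching, so treat it as a lower bound per sequence, but it makes the 16384 + FP8 compromise concrete.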
Who else might have this problem
I'm not going to speculate about specific papers. But the setup that bit me is not unusual: vLLM is the default serving framework for open-weight models, and reasoning models are generating increasingly long traces. If you set max_model_len based on your available VRAM rather than the model's actual generation length, you can end up in the same situation. And any method that gives the model a "second pass" on shortened input will look like it's helping, when it's really just compensating for the truncation.
Also check your sampling settings. Published benchmarks often use non-greedy decoding with multiple samples, not temperature=0.0. The gap between greedy and published can be 20pp+ on math benchmarks.
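For reference, the multi-sample pass@1 that published numbers typically report is just per-query mean correctness averaged over queries, not a single greedy run. A minimal sketch (the 64-sample count is from DeepSeek's described setup; the data here is made up):

```python
def pass_at_1(per_query_correct):
    """Multi-sample pass@1: for each query, the fraction of its k samples
    that are correct, averaged over queries. per_query_correct is a list of
    lists of booleans, one inner list per query."""
    per_query = [sum(c) / len(c) for c in per_query_correct]
    return sum(per_query) / len(per_query)

# Two queries, 4 samples each (official evals use 64 per query):
print(pass_at_1([[True, True, False, True],
                 [False, False, True, False]]))  # 0.5
```

A single greedy sample is a different (and often pessimistic) estimator of the same quantity, which is part of why greedy baselines undershoot published scores.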
Code and experiment logs: GitHub