vLLM Silently Truncated My Reasoning Traces and I Didn't Notice for a Week

vLLM has a parameter called max_model_len that caps the combined length of the prompt plus generated tokens. If a generated sequence would exceed it, vLLM simply stops generating. No warning in the logs, no error raised. Your outputs just lose their tails.

I ran DeepSeek-R1-Distill-Qwen-32B at max_model_len=8192 for a week of experiments. This model can generate reasoning traces well over 8K tokens, especially on harder problems like AIME. I was cutting traces short without knowing it.

The numbers

Baseline accuracy with different max_model_len:

           max_model_len=8192   max_model_len=16384
MATH-500   66.0%                66.8%
AIME24     46.7%                66.7%

AIME24 jumped 20 percentage points. The published score for this model is ~72%. At 8192 I was getting 46.7% and treating it as the real baseline.

Note

One thing to note: our MATH-500 baseline (66.8% at 16384) is still 27pp below DeepSeek's published 94.3%. Truncation only accounts for 0.8pp of that. The rest is likely sampling: DeepSeek's official eval uses temperature=0.6, top_p=0.95 with 64 samples per query for pass@1, while we used greedy decoding (temperature=0.0, single sample). We also used a system prompt, which the official docs recommend against for R1 models. The AIME24 numbers are closer to published (66.7% vs 72.6%), so the truncation story holds up there. But our MATH-500 baseline has problems beyond truncation that we didn't diagnose.
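For context on why the decoding setup matters so much: pass@1 under sampling, as in DeepSeek's protocol, is the per-query fraction of correct samples averaged over queries. A minimal sketch (the function name and data layout are mine, not from any eval harness):

```python
def pass_at_1(per_query_correct: list[list[bool]]) -> float:
    """DeepSeek-style pass@1: for each query, take the fraction of
    sampled completions that are correct, then average over queries."""
    per_query = [sum(c) / len(c) for c in per_query_correct]
    return sum(per_query) / len(per_query)

# Greedy decoding is the degenerate case: one sample per query,
# so pass@1 reduces to plain accuracy.
greedy = pass_at_1([[True], [False], [True]])       # 2/3
sampled = pass_at_1([[True, False], [True, True]])  # (0.5 + 1.0) / 2 = 0.75
```

With 64 samples per query, the estimate is much less noisy than a single greedy run, which is one reason greedy numbers can sit well below published ones.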

AIME problems produce long reasoning traces. At 8192, about half the AIME traces got cut off before the model finished thinking. The final answer derivation tends to be near the end of the trace, so truncation disproportionately destroys the answer. MATH-500 was less affected (8% truncation) because the traces are shorter, but even the 0.8pp shift matters when you're looking for small intervention effects.
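The arithmetic behind those truncation rates is simple: a trace is cut whenever prompt tokens plus generated tokens exceed max_model_len. A sketch of the check, assuming you have already recorded token counts per example (the helper name is hypothetical):

```python
def truncation_rate(prompt_lens: list[int],
                    trace_lens: list[int],
                    max_model_len: int) -> float:
    """Fraction of examples whose full reasoning trace cannot fit:
    a trace is truncated when prompt + generated tokens would
    exceed the max_model_len cap."""
    cut = sum(1 for p, t in zip(prompt_lens, trace_lens)
              if p + t > max_model_len)
    return cut / len(trace_lens)

# e.g. a 100-token prompt with a 9000-token trace blows an 8192 budget
rate = truncation_rate([100, 100], [9000, 4000], 8192)  # 0.5
```

Running this per benchmark is a quick way to see whether a given max_model_len is even plausible before burning a week of experiments.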

How this created a fake result

I was testing an entropy pruning method: score each reasoning step by how uncertain the model is at that point, remove the bottom 50%, and let the model re-derive the answer from what remains. I ran an ablation varying the continuation budget:
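The pruning step itself looks roughly like this. This is a sketch under my own assumptions, not the exact implementation: I score each step by the mean Shannon entropy of the model's next-token distributions within that step, then keep the highest-scoring half in original order (both function names are hypothetical):

```python
import math

def step_entropy(token_dists: list[list[float]]) -> float:
    """Mean Shannon entropy of the next-token distributions across
    one reasoning step (higher = model was more uncertain there)."""
    ents = [-sum(p * math.log(p) for p in dist if p > 0)
            for dist in token_dists]
    return sum(ents) / len(ents)

def prune_low_entropy(steps: list[str], scores: list[float],
                      keep_frac: float = 0.5) -> list[str]:
    """Drop the bottom (1 - keep_frac) of steps by entropy score,
    preserving the original order of the survivors."""
    k = max(1, int(len(steps) * keep_frac))
    keep = sorted(range(len(steps)),
                  key=lambda i: scores[i], reverse=True)[:k]
    return [steps[i] for i in sorted(keep)]
```

The surviving scaffold is then fed back to the model with a continuation budget (the max_tokens column below) to re-derive the answer.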

max_tokens   MATH gain (8192 baseline)   MATH gain (16384 baseline)
256          -22.8pp                     -25.4pp
1024         +2.6pp                      -1.2pp
2048         +3.4pp                      -1.2pp

At the truncated baseline: a positive result with a clear sign flip. At the proper baseline: negative across the board.

Truncation cuts off the end of a trace. Pruning removes steps from the middle. So the pruned scaffold is structurally complete even though it's shorter, and the model can sometimes re-derive the answer that truncation had destroyed. The gain isn't from pruning; it's from giving the model a second chance to finish what truncation prevented.

AIME24 made this obvious. At 8192, the method showed +10.0pp gain. At 16384, it showed exactly +0.0pp, zero repairs, zero damage across all 30 problems. The method changed nothing when the traces weren't truncated.

How to check if this is happening to you

Compare your baseline against published numbers. If there's a gap of more than a few points, max_model_len is the first thing to check.

You can also look at the finish_reason field in vLLM's output. If it says "length" instead of "stop", the generation was cut short:

```python
for output in outputs:
    # finish_reason == "length" means generation stopped at the token
    # cap (max_model_len / max_tokens) instead of at a stop token
    if output.outputs[0].finish_reason == "length":
        print("WARNING: generation hit max length, likely truncated")
```

For DeepSeek-R1 distills, the official vllm serve command uses max_model_len=32768. I used 16384 as a hardware compromise on a 96GB GPU, and AIME24 was still 5.9pp below published at that setting. 32768 is safer but needs a lot of VRAM. On a 96GB GPU with the 32B model, I couldn't get 32768 to work with CUDA graphs enabled (they eat ~30GB on top of model weights). It did work with enforce_eager=True, which disables CUDA graphs, but throughput dropped to about 7.6 tok/s output per sequence. 16384 with CUDA graphs and FP8 KV cache was the sweet spot for this hardware.
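For reference, the two configurations described above look roughly like this. This is a sketch assuming vLLM's OpenAI-compatible server; the flag names (--max-model-len, --kv-cache-dtype, --enforce-eager) come from vLLM's CLI, but check your version's docs before copying:

```shell
# 96GB-GPU compromise: shorter context, FP8 KV cache, CUDA graphs on
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --max-model-len 16384 \
    --kv-cache-dtype fp8

# Full 32768 context on the same card: trade throughput instead
# (--enforce-eager disables CUDA graphs and their memory overhead)
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --max-model-len 32768 \
    --enforce-eager
```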

Who else might have this problem

I'm not going to speculate about specific papers. But the setup that bit me is not unusual: vLLM is the default serving framework for open-weight models, and reasoning models are generating increasingly long traces. If you set max_model_len based on your available VRAM rather than the model's actual generation length, you can end up in the same situation. And any method that gives the model a "second pass" on shortened input will look like it's helping, when it's really just compensating for the truncation.

Also check your sampling settings. Published benchmarks often use non-greedy decoding with multiple samples, not temperature=0.0. The gap between greedy and published can be 20pp+ on math benchmarks.


Code and experiment logs: GitHub
