My LoRA Memorized a Document and Still Couldn't Answer Questions About It

I trained a LoRA adapter on a document until the next-token loss dropped from 10.10 to 0.004. The adapter weights moved. At inference, it produced different logits from the base model.

Then I asked a question about the document. It answered: "I do not have access to the specific details."

The setup

The project was a test-time memory study. Given a long document, should we store information in a LoRA adapter (parametric write) or keep it as exact cached spans in the prompt? The whole thing rests on LoRA write working.

The test was small. Eight chunks of a synthetic research memo with scattered facts. Gemma 4 E2B-it, rank-16 LoRA on language_model q_proj and v_proj. Three conditions per query:

  • stable: 2 of 8 chunks as scaffold in the prompt, no training
  • write: same scaffold in the prompt, LoRA trained on all 8 chunks
  • full: all 8 chunks in the prompt, no training (upper bound)

Queries target facts in chunks 3 through 8. These facts are not in the scaffold. If the LoRA absorbed the document, write should beat stable.
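The three conditions differ only in what goes into the prompt. A minimal sketch of the prompt assembly (the exact prompt format here is an assumption; the real harness may join chunks and phrase queries differently):

```python
def build_prompt(chunks, query, condition):
    """Assemble the decode-time prompt for one condition.
    Hypothetical format; illustrates what each condition sees."""
    if condition == "full":
        context = "\n\n".join(chunks)        # all 8 chunks in the prompt
    elif condition in ("stable", "write"):
        context = "\n\n".join(chunks[:2])    # scaffold: 2 of 8 chunks
    else:
        raise ValueError(f"unknown condition: {condition}")
    return f"{context}\n\n{query}\nAnswer:"
```

Note that stable and write see byte-identical prompts; the only difference between them is whether the LoRA adapter was trained.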

The numbers

Three training configurations, 24 queries each. Metric is fraction of expected entities in the answer.

config            loss            stable   write   full
10 steps, 1e-3    10.10 → 0.12    0.000    0.000   0.875
20 steps, 1e-3    10.10 → 0.04    0.000    0.000   0.875
30 steps, 5e-4    10.10 → 0.004   0.000    0.000   0.875

Loss converges. Write score is zero across every config and every query. The model answers 87.5% when the document is in the prompt and 0% when it's only in the adapter.
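The metric can be sketched as a simple entity-recall score (the case-insensitive substring match is an assumption; the real scorer may tokenize or normalize differently):

```python
def entity_recall(answer, expected_entities):
    """Fraction of expected entities that appear in the answer.
    Assumption: case-insensitive substring match."""
    answer_lc = answer.lower()
    hits = sum(1 for e in expected_entities if e.lower() in answer_lc)
    return hits / len(expected_entities)
```

A refusal like "I do not have access to the specific details" contains no expected entities, so it scores exactly 0.0, which is what every write run produced.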

Why I didn't believe the zero at first

For four iterations I was sure something was mechanically broken. Each bug I found was enough to think "maybe that's why write isn't working."

Attachment. LoRA was being attached to every Linear with q_proj or v_proj in its name. Gemma 4 has those in its vision and audio towers. No text gradients reached those adapters, and the text-tower projections had no adapters attached. Fix: filter to language_model submodules only.
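The fix reduces to a name filter over the model's Linear modules (a minimal sketch over module names; module paths are illustrative, not copied from the real model):

```python
def filter_lora_targets(linear_names):
    """Keep only q_proj/v_proj projections inside the language_model
    tower. The vision and audio towers contain Linears with the same
    suffixes; adapters attached there never receive text gradients."""
    return [
        name for name in linear_names
        if name.startswith("language_model.")
        and name.endswith(("q_proj", "v_proj"))
    ]
```

A suffix-only match ("q_proj" anywhere in the name) silently selects the wrong towers; the prefix check is the whole fix.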

Precision. Parameters were bf16. Per-step updates were small enough that bf16 rounded them away when added to existing weights, so the optimizer step was effectively a no-op. Fix: cast LoRA parameters to fp32.
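The rounding failure is reproducible without a GPU. A minimal sketch that emulates bfloat16 by keeping only the top 16 bits of a float32 (round-to-nearest-even):

```python
import struct

def to_bf16(x):
    """Round a Python float to bfloat16 precision. bfloat16 is a
    float32 truncated to its top 16 bits; this emulates the
    round-to-nearest-even truncation."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

w = to_bf16(1.0)     # a weight stored in bf16
step = 1e-4          # a typical small per-step LoRA update
assert to_bf16(w + step) == w   # in bf16, the update rounds away entirely
assert (w + step) != w          # in fp32/fp64, the update survives
```

With an 8-bit mantissa, the spacing between bf16 values near 1.0 is about 0.004; any update smaller than half that vanishes on write-back, which is why casting the LoRA parameters to fp32 fixed it.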

Optimizer. I had written a manual SGD loop thinking it was cleaner than bringing in Adam. It had a reference bug in the weight update. Fix: use torch.optim.Adam.

Reset. Between conditions I reset LoRA parameters to zero, including lora_A. LoRA computes x @ A @ B. B is zero-initialized, A is Kaiming. Zero both and the output is zero, so the gradient is zero, so nothing updates. Loss didn't move. Fix: re-initialize A with Kaiming on reset, zero only B.
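The fixed point is visible even in a scalar stand-in for x @ A @ B (a sketch; the real gradients are the matrix versions of these products):

```python
def lora_grads(x, g, A, B):
    """Gradients of the loss w.r.t. A and B for out = x * A * B,
    given upstream gradient g = dL/dout (scalar stand-in):
        dL/dA = x * g * B      dL/dB = (x * A) * g
    """
    return x * g * B, (x * A) * g

# correct reset: B = 0, A = Kaiming (nonzero)
gA, gB = lora_grads(x=1.0, g=1.0, A=0.7, B=0.0)
assert gA == 0.0 and gB != 0.0   # B gets a gradient, so training can start

# buggy reset: both zero
gA, gB = lora_grads(x=1.0, g=1.0, A=0.0, B=0.0)
assert gA == 0.0 and gB == 0.0   # a fixed point: nothing ever updates
```

Each factor's gradient is scaled by the other factor, so zeroing both is self-locking: the standard zero-B, Kaiming-A initialization exists precisely to avoid this.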

With all four fixed, the loss curves above converged. Weights moved. The adapter was active at inference. Recall was still zero.

What NTP loss was teaching the adapter

Converging NTP loss on a document means the adapter learned

$$P(\text{next token} \mid \text{document prefix}).$$

The trained model will happily complete "Dr. Chen's team developed a novel computational pipeline called" with " MolDock-X".
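Concretely, the objective averages per-position cross-entropy on the document's own tokens; nothing in it mentions queries or answers. A toy sketch (vocab size and probabilities are made up for illustration):

```python
import math

def ntp_loss(next_token_probs):
    """Average next-token cross-entropy over a document. Each entry is
    the probability the model assigned to the token that actually came
    next; queries and answers appear nowhere in this objective."""
    return -sum(math.log(p) for p in next_token_probs) / len(next_token_probs)

# near-uniform guessing over a hypothetical ~25k vocab ...
assert round(ntp_loss([1 / 25_000] * 4), 2) == 10.13  # near the starting loss
# ... versus a memorized document, where each next token is near-certain
assert ntp_loss([0.999] * 4) < 0.01                   # near the final loss
```

Driving this loss to 0.004 only certifies that the model can continue the document from inside the document.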

At decode time the prompt looks like this:

[scaffold]
...
[query]
What is the name of the computational pipeline?
Answer:

This prompt never appeared in training. The adapter learned how the document continues from positions inside itself. At rank 16, q_proj and v_proj only, and at most 30 steps, that didn't produce answers to queries in a different format.

The LoRA learned what the loss measured. Answering questions wasn't in the loss. More rank, more target layers, more steps, or a different loss might recover it. The default configuration didn't.

How other test-time methods avoid this

Methods that extract something useful from test-time updates don't ask the adapter alone to do the work. qTTT keeps the full context in the KV cache and updates the query projection, so the facts live in cache rather than in weight deltas. PERK meta-learns the initial state of the LoRA adapter, training the initialization so later few-step memorization generalizes to retrieval. In-Place TTT repurposes the MLP down-projection as fast weights and updates it chunk-wise at inference time with an NTP-aligned objective, so the frozen-LM assumption no longer holds.

In contrast, Sakana AI's Doc-to-LoRA (March 2026) sidesteps the same failure mode by training with teacher-student distillation over query-style inputs instead of NTP on raw documents, an existence proof that the failure is about objective choice, not about LoRA-as-memory in general.

The claim that failed here is narrow: frozen LM plus NTP-trained LoRA on a document gives you usable query-conditioned memory.

Takeaway

If you're adapting a frozen LM at test time and you care about recall, evaluate with prompts whose format differs from what the adapter saw during training.

Before chasing a fifth bug, check the objective.


Code and full logs: tristore-bma-archived
