My LoRA Memorized a Document and Still Couldn't Answer Questions About It

I trained a LoRA adapter on a document until the next-token loss dropped from 10.10 to 0.004. The adapter weights moved. At inference, it produced different logits from the base model.

Then I asked a question about the document. It answered: "I do not have access to the specific details."

The setup

The project was a test-time memory study. Given a long document, should we store information in a LoRA adapter (parametric write) or keep it as exact cached spans in the prompt? The whole thing rests on LoRA write working.

The test was small. Eight chunks of a synthetic research memo with scattered facts. Gemma 4 E2B-it, rank-16 LoRA on language_model q_proj and v_proj. Three conditions per query:

  • stable: 2 of 8 chunks as scaffold in the prompt, no training
  • write: same scaffold in the prompt, LoRA trained on all 8 chunks
  • full: all 8 chunks in the prompt, no training (upper bound)

Queries target facts in chunks 3 through 8. These facts are not in the scaffold. If the LoRA absorbed the document, write should beat stable.
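The three conditions differ only in what goes into the prompt. A minimal sketch of the prompt assembly (the exact prompt format here is an assumption; the real harness may join chunks and phrase queries differently):

```python
def build_prompt(chunks, query, condition):
    """Assemble the decode-time prompt for one condition.
    Hypothetical format; illustrates what each condition sees."""
    if condition == "full":
        context = "\n\n".join(chunks)        # all 8 chunks in the prompt
    elif condition in ("stable", "write"):
        context = "\n\n".join(chunks[:2])    # scaffold: 2 of 8 chunks
    else:
        raise ValueError(f"unknown condition: {condition}")
    return f"{context}\n\n{query}\nAnswer:"
```

Note that stable and write see byte-identical prompts; the only difference between them is whether the LoRA adapter was trained.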

The numbers

Three training configurations, 24 queries each. Metric is fraction of expected entities in the answer.

config            loss            stable   write   full
10 steps, 1e-3    10.10 → 0.12    0.000    0.000   0.875
20 steps, 1e-3    10.10 → 0.04    0.000    0.000   0.875
30 steps, 5e-4    10.10 → 0.004   0.000    0.000   0.875

Loss converges. Write score is zero across every config and every query. The model answers 87.5% when the document is in the prompt and 0% when it's only in the adapter.
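The metric can be sketched as a simple entity-recall score (the case-insensitive substring match is an assumption; the real scorer may tokenize or normalize differently):

```python
def entity_recall(answer, expected_entities):
    """Fraction of expected entities that appear in the answer.
    Assumption: case-insensitive substring match."""
    answer_lc = answer.lower()
    hits = sum(1 for e in expected_entities if e.lower() in answer_lc)
    return hits / len(expected_entities)
```

A refusal like "I do not have access to the specific details" contains no expected entities, so it scores exactly 0.0, which is what every write run produced.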

Why I didn't believe the zero at first

For four iterations I was sure something was mechanically broken. Each bug I found was enough to think "maybe that's why write isn't working."

Attachment. LoRA was being attached to every Linear with q_proj or v_proj in its name. Gemma 4 has those in its vision and audio towers. No text gradients reached those adapters, and the text-tower projections had no adapters attached. Fix: filter to language_model submodules only.
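The fix reduces to a name filter over the model's Linear modules (a minimal sketch over module names; module paths are illustrative, not copied from the real model):

```python
def filter_lora_targets(linear_names):
    """Keep only q_proj/v_proj projections inside the language_model
    tower. The vision and audio towers contain Linears with the same
    suffixes; adapters attached there never receive text gradients."""
    return [
        name for name in linear_names
        if name.startswith("language_model.")
        and name.endswith(("q_proj", "v_proj"))
    ]
```

A suffix-only match ("q_proj" anywhere in the name) silently selects the wrong towers; the prefix check is the whole fix.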

Precision. Parameters were bf16. Per-step updates were small enough that bf16 rounded them away when added to existing weights, so the optimizer step was effectively a no-op. Fix: cast LoRA parameters to fp32.
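The rounding failure is reproducible without a GPU. A minimal sketch that emulates bfloat16 by keeping only the top 16 bits of a float32 (round-to-nearest-even):

```python
import struct

def to_bf16(x):
    """Round a Python float to bfloat16 precision. bfloat16 is a
    float32 truncated to its top 16 bits; this emulates the
    round-to-nearest-even truncation."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

w = to_bf16(1.0)     # a weight stored in bf16
step = 1e-4          # a typical small per-step LoRA update
assert to_bf16(w + step) == w   # in bf16, the update rounds away entirely
assert (w + step) != w          # in fp32/fp64, the update survives
```

With an 8-bit mantissa, the spacing between bf16 values near 1.0 is about 0.004; any update smaller than half that vanishes on write-back, which is why casting the LoRA parameters to fp32 fixed it.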

Optimizer. I had written a manual SGD loop thinking it was cleaner than bringing in Adam. It had a reference bug in the weight update. Fix: use torch.optim.Adam.

Reset. Between conditions I reset LoRA parameters to zero, including lora_A. LoRA computes x @ A @ B. B is zero-initialized, A is Kaiming. Zero both and the output is zero, so the gradient is zero, so nothing updates. Loss didn't move. Fix: re-initialize A with Kaiming on reset, zero only B.
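The fixed point is visible even in a scalar stand-in for x @ A @ B (a sketch; the real gradients are the matrix versions of these products):

```python
def lora_grads(x, g, A, B):
    """Gradients of the loss w.r.t. A and B for out = x * A * B,
    given upstream gradient g = dL/dout (scalar stand-in):
        dL/dA = x * g * B      dL/dB = (x * A) * g
    """
    return x * g * B, (x * A) * g

# correct reset: B = 0, A = Kaiming (nonzero)
gA, gB = lora_grads(x=1.0, g=1.0, A=0.7, B=0.0)
assert gA == 0.0 and gB != 0.0   # B gets a gradient, so training can start

# buggy reset: both zero
gA, gB = lora_grads(x=1.0, g=1.0, A=0.0, B=0.0)
assert gA == 0.0 and gB == 0.0   # a fixed point: nothing ever updates
```

Each factor's gradient is scaled by the other factor, so zeroing both is self-locking: the standard zero-B, Kaiming-A initialization exists precisely to avoid this.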

With all four fixed, the loss curves above converged. Weights moved. The adapter was active at inference. Recall was still zero.

What NTP loss was teaching the adapter

Converging NTP loss on a document means the adapter learned

$$P(\text{next token} \mid \text{document prefix}).$$

The trained model will happily complete "Dr. Chen's team developed a novel computational pipeline called" with " MolDock-X".
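Concretely, the objective averages per-position cross-entropy on the document's own tokens; nothing in it mentions queries or answers. A toy sketch (vocab size and probabilities are made up for illustration):

```python
import math

def ntp_loss(next_token_probs):
    """Average next-token cross-entropy over a document. Each entry is
    the probability the model assigned to the token that actually came
    next; queries and answers appear nowhere in this objective."""
    return -sum(math.log(p) for p in next_token_probs) / len(next_token_probs)

# near-uniform guessing over a hypothetical ~25k vocab ...
assert round(ntp_loss([1 / 25_000] * 4), 2) == 10.13  # near the starting loss
# ... versus a memorized document, where each next token is near-certain
assert ntp_loss([0.999] * 4) < 0.01                   # near the final loss
```

Driving this loss to 0.004 only certifies that the model can continue the document from inside the document.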

At decode time the prompt looks like this:

[scaffold]
...
[query]
What is the name of the computational pipeline?
Answer:

This prompt never appeared in training. The adapter learned how the document continues from positions inside itself. At rank 16, q_proj and v_proj only, and at most 30 steps, that didn't produce answers to queries in a different format.

The LoRA learned what the loss measured. Answering questions wasn't in the loss. More rank, more target layers, more steps, or a different loss might recover it. The default configuration didn't.

How other test-time methods avoid this

Methods that extract something useful from test-time updates don't ask the adapter alone to do the work. qTTT keeps the full context in the KV cache and updates the query projection, so the facts live in cache rather than in weight deltas. PERK meta-learns the initial state of the LoRA adapter, training the initialization so later few-step memorization generalizes to retrieval. In-Place TTT repurposes the MLP down-projection as fast weights and updates it chunk-wise at inference time with an NTP-aligned objective, so the frozen-LM assumption no longer holds.

In contrast, Sakana AI's Doc-to-LoRA (March 2026) sidesteps the same failure mode by training with teacher-student distillation over query-style inputs instead of NTP on raw documents, an existence proof that the failure is about objective choice, not about LoRA-as-memory in general.

The claim that failed here is narrow: frozen LM plus NTP-trained LoRA on a document gives you usable query-conditioned memory.

Takeaway

If you're adapting a frozen LM at test time and you care about recall, evaluate with prompts whose format differs from what the adapter saw during training.

Before chasing a fifth bug, check the objective.


Code and full logs: tristore-bma-archived
