<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Raghu Subramanian on Medium]]></title>
        <description><![CDATA[Stories by Raghu Subramanian on Medium]]></description>
        <link>https://medium.com/@ragoo?source=rss-8e4cf0775c8e------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*Cb5IAPgfz0CP-KudQx2-fg.jpeg</url>
            <title>Stories by Raghu Subramanian on Medium</title>
            <link>https://medium.com/@ragoo?source=rss-8e4cf0775c8e------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Tue, 23 Jun 2026 21:31:06 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@ragoo/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Copy-Pasting Your Prompt Twice Can 5× Your Accuracy (And There’s a Google Paper About It)]]></title>
            <link>https://medium.com/@ragoo/copy-pasting-your-prompt-twice-can-5-your-accuracy-and-theres-a-google-paper-about-it-b1f90c3699bd?source=rss-8e4cf0775c8e------2</link>
            <guid isPermaLink="false">https://medium.com/p/b1f90c3699bd</guid>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <category><![CDATA[prompt-engineering]]></category>
            <category><![CDATA[large-language-models]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[chatgpt]]></category>
            <dc:creator><![CDATA[Raghu Subramanian]]></dc:creator>
            <pubDate>Mon, 23 Feb 2026 07:55:37 GMT</pubDate>
            <atom:updated>2026-02-23T07:55:37.408Z</atom:updated>
            <content:encoded><![CDATA[<blockquote>Gemini 2.0 Flash Lite. <br>A list of 50 names. <br>Task: find the 25th one.<br>Without any changes: 21% accuracy. <br>With the prompt pasted twice: 97%! <br>Same model, same weights, same API call. Just the input repeated!</blockquote><p>The Google paper <em>“</em><a href="https://arxiv.org/abs/2512.14982"><em>Prompt Repetition Improves Non-Reasoning LLMs”</em></a> makes a simple core claim: transform your input from &lt;QUERY&gt; to &lt;QUERY&gt;&lt;QUERY&gt;. No change to how the model generates output, no increase in latency for typical prompt lengths, no new dependencies.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wEJecDWlR5WzT9IdKqIU4Q.png" /><figcaption>Figure: Accuracy on tasks with and without prompt repetition (Leviathan, Kalman &amp; Matias, 2025).</figcaption></figure><p>They tested GPT-4o, GPT-4o-mini, Claude 3 Haiku, Claude 3.7 Sonnet, Gemini 2.0 Flash, Gemini 2.0 Flash Lite, and DeepSeek V3 across ARC, OpenBookQA, GSM8K, MMLU-Pro, and MATH, plus two custom retrieval tasks. Gains are consistent across all models. Retrieval tasks show the biggest jumps. Multiple-choice benchmarks show meaningful but smaller improvements.</p><p>One number tells the whole story on reasoning models: when the same experiment is run with “think step by step” enabled, prompt repetition goes 5 wins, 1 loss, 22 neutral across 28 tests. Essentially flat. That’s not a failure of the technique. It’s a clue.</p><p><strong>Why it works</strong> comes down to causal attention. Every token can only attend to what came before it. Token 1 is blind to everything that follows. So when a model processes [long context][question], the context tokens get their representations locked in before the question is ever seen. By the time the model reads what it&#39;s supposed to be finding, it&#39;s too late to change how it read the context.</p><p>The second copy fixes this. 
On pass 2, every context token can now attend back through the entire first copy, including the question. It processes the same words with a completely different information state. It’s the difference between skimming a list cold and skimming it already knowing what you’re hunting for.</p><p>Reasoning models reinvent this on their own. When a model starts its output by restating the problem before solving it, it’s doing the same thing in the generation phase. Prompt repetition moves that work into the parallelizable prefill stage instead, which is why output length and latency stay flat.</p><p><strong>On cost:</strong> prefill is parallelizable, but attention still scales <strong>quadratically</strong> with sequence length. Double the prompt, and the attention portion of prefill compute goes up roughly 4x, while the rest of the forward pass roughly doubles. For a 200-token prompt that’s rounding error. At 10k tokens it’s real, and you may hit context window limits entirely. The paper also notes Claude Haiku and Sonnet showed latency increases on the longest inputs specifically, so “no latency impact” isn’t universal.</p><p>Short prompts, retrieval tasks: almost free win. Long context pipelines: benchmark before shipping. Already on a reasoning model: skip it, you’re paying for the same thing anyway.</p><p><strong>What this actually means for production</strong> is more interesting than the paper spells out.<br>If you’re running RAG, the retrieved chunk is usually prepended to the question. That’s exactly the [context][question] structure where prompt repetition helps most. For short retrieved chunks this could be a low-effort accuracy boost worth A/B testing before you reach for a bigger model or more complex reranking.</p><p>There’s also an implicit cost argument here. A lot of teams default to reasoning models for anything that feels “hard.” But if the task is fundamentally a retrieval or lookup problem, it’s not actually hard in the reasoning sense. It just suffers from the causal attention asymmetry. 
Prompt repetition on a cheap, fast model might get you to the same accuracy as a reasoning model like o1 on a class of tasks, at a fraction of the cost and latency. That’s worth knowing before you write the infrastructure for extended thinking.</p><p>It won’t help with summarization, open-ended generation, or anything where the model isn’t hunting for something specific. Gains were near-flat on MATH in the paper. But for extraction pipelines, structured lookups, document QA over short contexts: it’s a one-line change worth 30 minutes of your time.</p><pre>def repeat_prompt(prompt: str) -&gt; str:<br>    """Return the prompt concatenated with itself (&lt;QUERY&gt;&lt;QUERY&gt;)."""<br>    return prompt + prompt</pre><p>The paper also tested a verbose variant (&lt;QUERY&gt;\nLet me repeat that:\n&lt;QUERY&gt;) and triple repeat. Both perform similarly across most tasks, with triple repeat showing larger gains specifically on the retrieval benchmarks. Worth testing if you&#39;re already seeing gains from the basic version.</p><p>The broader point isn’t really about this one trick. It’s that inference-time prompt structure affects model behavior in ways that are systematic and explainable, not random. Understanding why helps you know when to reach for it and when not to. This one has a clean mechanism, real numbers behind it, and a one-line implementation. That’s a rare combination.</p><p>Anyway, I’m going to leave you with this. I asked Sonnet 4.5 to summarize this paper for me before writing this piece. First try, it missed two of the benchmarks. So I sent it the same prompt again, added “make no mistakes” at the end.</p><p>Second try: perfect summary, duh!<br>Coincidence? Probably. But I’m choosing to believe!</p><p>But honestly, what’s the weirdest place you’ve seen prompt structure matter in a real pipeline? 
Curious what breaks in the wild.</p><p><em>Raghu Subramanian is an MS student at UT Austin looking for forward‑deployed engineering / solutions roles in Summer 2026!</em></p>]]></content:encoded>
        </item>
    </channel>
</rss>