Writing

TestTime Compute Scaling: Why the Next Frontier in AI Is at Inference, Not Training

On AIME 2024 mathematics problems, o1 solved 83% compared to GPT-4o's 13%. On PhD-level science questions, o1 matched or exceeded PhD expert performance.

26 min read
Edge Inferencetest-time compute scalinginference-time scaling

For the first five years of the large language model era, the primary lever for improving model quality was training compute. Bigger models trained on more data produced better outputs. The scaling laws held. The industry poured billions into GPU clusters to push training compute higher.

In 2024, that story changed.

OpenAI released o1, a model that was not primarily better because it had more parameters or was trained on more data. It was better because it spent more compute at inference time, reasoning through problems before answering. The performance gains on hard reasoning tasks were dramatic. On AIME 2024 mathematics problems, o1 solved 83% compared to GPT-4o's 13%. On PhD-level science questions, o1 matched or exceeded PhD expert performance. These gains came from a fundamentally different axis: not training compute, but test-time compute.

The insight is simple in retrospect. Training a model teaches it what to do. Giving it time to think at inference lets it figure out how to apply what it knows. For tasks where the answer is verifiable (mathematics, formal reasoning, code, logic puzzles), the model can search for good answers rather than retrieve cached answers. This is not the same as the model having more knowledge. It is the model being able to use the knowledge it has more effectively by allocating more compute to the question.

The technical mechanisms behind test-time compute scaling include chain-of-thought reasoning as a search process, process reward models and why they are the key enabling component, the different scaling strategies (best-of-N, beam search, MCTS at inference), adaptive compute allocation across problems of different difficulty, and the practical implications for building systems that use these capabilities.

What Is TestTime Compute and Why Does It Work?

Test-time compute (TTC), also called inference-time compute scaling, is the allocation of additional computation during inference, not training, to improve output quality. Instead of generating a single response and returning it, the model generates multiple responses, evaluates them, revises, and searches through the space of possible answers before committing to one.

The intuition: for a hard mathematics problem, a human expert does not immediately write down the answer. They work through the problem, check their work, identify errors, and try alternative approaches. The quality of the answer depends not just on what the person knows but on how much time they invest in the problem. Test-time compute scaling applies the same principle to language models.

Why does it work technically? The answer comes from understanding what LLMs are optimizing during training. A standard LLM is trained to maximize the probability of the next token given the preceding context. On hard reasoning problems, the optimal next token is often the one that continues toward a correct solution, but identifying the correct solution may require generating and evaluating multiple partial paths. Training gives the model the ability to evaluate paths; test-time compute gives it the budget to do so.

The empirical finding from Snell et al. (2024) is that test-time compute scaling is more efficient than parameter scaling for hard reasoning tasks. Specifically:

  • A smaller model with 4x increased test-time compute outperforms a model trained with 14x more parameters in FLOPs-matched evaluation
  • The efficiency gain of test-time compute over best-of-N baseline is approximately 4x. Adaptive search strategies find correct answers with 4x less compute than generating N independent samples and taking the best

This is not a marginal improvement. It suggests a fundamentally different efficiency curve for capability improvement. Additional training compute improves the base capability of a fixed model; additional test-time compute improves the performance of that model on specific hard problems without any retraining.

The prerequisite: test-time compute scaling requires that the model can verify its own outputs, or that an external verifier exists. Without verification, there is no way to know which of the generated candidate answers is correct. The approach degenerates into expensive guessing. The category of tasks where TTC helps is where verification is possible. Formal mathematics, code (run and test it), logic proofs, and structured reasoning chains with checkable intermediate steps are all examples.

For tasks without strong verification signals (creative writing, open-ended question answering, summarization), TTC either does not help or requires a weaker proxy verifier (like a reward model trained on human preferences), with diminishing returns.

ChainofThought as a Search Process

Chain-of-thought (CoT) prompting is typically described as "showing the model's work": having it generate intermediate reasoning steps before the final answer. This description is accurate but undersells the key mechanism. CoT is a form of implicit search through the space of possible reasoning paths.

Each step in a chain of thought is a choice: of which fact to recall, which inference to make, which representation to use for the current subproblem. The model samples from the distribution over next steps conditioned on the current context. When that context contains explicit reasoning steps, the distribution over subsequent steps is shaped by the accumulated reasoning. Essentially, the partial path conditions the search.

The connection to formal search algorithms:

Depth-first search: Standard greedy CoT follows a single path from premise to conclusion without backtracking. If the path is wrong, the error propagates. A single bad intermediate step leads the model down a wrong path.

Beam search: Maintaining K parallel reasoning chains and periodically pruning the worst ones. Each chain is scored by a verifier; the top K chains at each step are expanded. This recovers from early errors by maintaining diverse hypotheses.

Monte Carlo Tree Search (MCTS): The full search approach. Build a tree of possible reasoning steps, use a value function to estimate the quality of partial paths, and allocate expansion budget to the most promising paths. More expensive per problem but finds correct solutions on harder problems where beam search fails.

The relationship between search depth and problem difficulty:

  • For easy problems (high probability of getting the right first reasoning step), greedy CoT is sufficient
  • For medium problems, best-of-N (sample N chains independently, take the best) recovers from the noisiness of individual chains
  • For hard problems (low probability that any greedy chain is correct), MCTS-style search with a strong value function is necessary
from typing import Optional
import asyncio
 
class ReasoningNode:
    """Node in a reasoning tree for MCTS-style test-time search."""
 
    def __init__(self, state: str, parent: Optional["ReasoningNode"] = None):
        self.state = state           # The reasoning chain so far
        self.parent = parent
        self.children: list["ReasoningNode"] = []
        self.visit_count: int = 0
        self.value_sum: float = 0.0
        self.is_terminal: bool = False
        self.terminal_answer: Optional[str] = None
 
    @property
    def value(self) -> float:
        if self.visit_count == 0:
            return 0.0
        return self.value_sum / self.visit_count
 
    @property
    def ucb_score(self, exploration_weight: float = 1.414) -> float:
        """UCB1 score for tree search exploration/exploitation balance."""
        if self.visit_count == 0:
            return float("inf")
        import math
        parent_visits = self.parent.visit_count if self.parent else 1
        return self.value + exploration_weight * math.sqrt(
            math.log(parent_visits) / self.visit_count
        )
 
class MCTSReasoningSearch:
    """
    MCTS-based test-time compute allocation for reasoning problems.
    Uses a language model for expansion and a verifier/value model for evaluation.
    """
 
    def __init__(self, model, verifier, n_simulations: int = 50,
                 max_depth: int = 20, branching_factor: int = 3):
        self.model = model          # Language model for generating reasoning steps
        self.verifier = verifier    # Process reward model
        self.n_simulations = n_simulations
        self.max_depth = max_depth
        self.branching_factor = branching_factor
 
    async def search(self, problem: str) -> str:
        """Run MCTS search for the best reasoning chain."""
        root = ReasoningNode(state=f"Problem: {problem}\n\nReasoning:")
 
        for sim in range(self.n_simulations):
            # 1. Selection: traverse tree using UCB scores
            node = self._select(root)
 
            # 2. Expansion: generate K next reasoning steps
            if not node.is_terminal and len(node.children) < self.branching_factor:
                await self._expand(node)
 
            # 3. Simulation: roll out to terminal or max depth
            value = await self._simulate(node)
 
            # 4. Backpropagation: update values up the tree
            self._backpropagate(node, value)
 
        # Return the reasoning chain with the highest value
        return self._best_path(root)
 
    def _select(self, node: ReasoningNode) -> ReasoningNode:
        """Select the most promising node to expand."""
        while node.children and not node.is_terminal:
            # Choose child with highest UCB score
            node = max(node.children, key=lambda n: n.ucb_score)
        return node
 
    async def _expand(self, node: ReasoningNode):
        """Generate K next reasoning steps from the current state."""
        prompt = node.state + "\n[Next step]:"
 
        # Sample K diverse next steps
        next_steps = await self.model.sample(
            prompt,
            n=self.branching_factor,
            temperature=0.8,
            max_tokens=200,
        )
 
        for step in next_steps:
            new_state = node.state + "\n" + step
            child = ReasoningNode(state=new_state, parent=node)
 
            # Check if this step is a terminal answer
            if self._is_terminal(step):
                child.is_terminal = True
                child.terminal_answer = self._extract_answer(step)
 
            node.children.append(child)
 
    async def _simulate(self, node: ReasoningNode) -> float:
        """Evaluate the quality of a reasoning chain using the verifier."""
        if node.is_terminal:
            # Hard verification: check if the answer is correct
            return await self.verifier.score_terminal(
                node.state, node.terminal_answer
            )
 
        # Soft verification: score the partial reasoning chain
        return await self.verifier.score_partial(node.state)
 
    def _backpropagate(self, node: ReasoningNode, value: float):
        while node is not None:
            node.visit_count += 1
            node.value_sum += value
            node = node.parent
 
    def _best_path(self, root: ReasoningNode) -> str:
        """Return the reasoning chain with the highest average value."""
        best_terminal = None
        best_value = -1
 
        def traverse(node: ReasoningNode):
            nonlocal best_terminal, best_value
            if node.is_terminal and node.value > best_value:
                best_value = node.value
                best_terminal = node.state
 
            for child in node.children:
                traverse(child)
 
        traverse(root)
        return best_terminal or root.state
 
    def _is_terminal(self, step: str) -> bool:
        return "therefore" in step.lower() or "answer:" in step.lower() or "= " in step
 
    def _extract_answer(self, step: str) -> str:
        if "answer:" in step.lower():
            return step.split("answer:", 1)[-1].strip()
        return step

Process Reward Models: The Key Enabling Component

The central technical enabler for test-time compute scaling is the process reward model (PRM). A PRM is a model trained to assign scalar reward values to intermediate reasoning steps, not just to final answers.

This is distinct from an outcome reward model (ORM), which only evaluates the final answer. The difference matters:

  • An ORM can tell you "this answer is correct" but cannot tell you which step in a 15-step chain was the error
  • A PRM can assign a value to each step. Step 3 had 0.85 confidence, step 7 had 0.42 confidence (likely where the error was introduced), step 12 had 0.11 confidence

The PRM-step-level signal enables:

  1. Early termination: Stop expanding a reasoning chain when a step scores below a threshold
  2. Error localization: Identify and correct specific wrong steps rather than regenerating the entire chain
  3. Beam pruning: In beam search, prune chains that have accumulated low-confidence steps
  4. Verification without ground truth: Score candidate answers based on the quality of their reasoning chains, not just the final answer. This enables verification in domains where the answer is not directly checkable

Training a PRM requires process labels: human annotations (or model annotations) on the quality of individual reasoning steps, not just final answers. This is expensive to collect because it requires step-level annotation, not just answer-level annotation. OpenAI's PRM800K dataset annotated 800,000 step-level labels from human raters on mathematical reasoning chains.

The lighter alternative: train PRMs using Monte Carlo estimation. For each partial reasoning chain, estimate the probability that it leads to a correct final answer by sampling K completions from that partial state and computing the fraction that yield correct answers. This is noisier than human labels but scalable without annotation cost.

async def train_prm_with_monte_carlo(
    partial_chain: str,
    problem: str,
    model,
    n_rollouts: int = 20,
) -> float:
    """
    Estimate the value of a partial reasoning chain via Monte Carlo rollout.
    Returns: estimated probability that this partial chain leads to correct answer.
    """
    completions = await model.sample(
        prompt=partial_chain,
        n=n_rollouts,
        temperature=0.8,
        max_tokens=500,
    )
 
    correct_count = 0
    for completion in completions:
        final_answer = extract_answer(completion)
        if verify_answer(final_answer, problem):
            correct_count += 1
 
    return correct_count / n_rollouts
 
def score_reasoning_chain(
    chain: list[str],     # List of reasoning steps
    prm: callable,        # Trained PRM returning step-level scores
    aggregation: str = "min",
) -> float:
    """
    Score a complete reasoning chain using a PRM.
 
    Aggregation strategies:
    - "min": bottleneck score (worst step determines chain quality)
    - "product": probability that all steps are correct (overpenalizes long chains)
    - "mean": average step quality (lenient toward chains with one bad step)
    - "last": only score the final step (lightweight, reasonable for many tasks)
    """
    step_scores = [prm(step, chain[:i]) for i, step in enumerate(chain)]
 
    if aggregation == "min":
        return min(step_scores)
    elif aggregation == "product":
        result = 1.0
        for s in step_scores:
            result *= s
        return result
    elif aggregation == "mean":
        return sum(step_scores) / len(step_scores)
    elif aggregation == "last":
        return step_scores[-1]
    else:
        raise ValueError(f"Unknown aggregation: {aggregation}")

PRM aggregation matters significantly for performance. Research from the Snell et al. paper found that the "product" aggregation systematically underperforms "min" aggregation. The reason is that it penalizes long reasoning chains even when the steps are individually correct. For most reasoning tasks, "min" (bottleneck) aggregation provides the best correlation with final answer correctness.

Scaling Strategies: BestofN, Beam Search, and MCTS

There are four main test-time compute strategies, arranged by computational cost and applicability to hard problems:

Strategy 1: Best-of-N (Bo-N)

Generate N independent complete reasoning chains and return the chain that scores highest on the verifier/PRM.

async def best_of_n(problem: str, model, verifier,
                     n: int = 16, temperature: float = 0.8) -> str:
    """Generate N reasoning chains, return the best-scored one."""
    chains = await model.sample(
        problem,
        n=n,
        temperature=temperature,
        max_tokens=2000,
    )
 
    scored = [(chain, await verifier.score(chain)) for chain in chains]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[0][0]

Cost: O(N). Quality: grows as log(N). Doubling N gives diminishing returns. Best for medium-difficulty problems where correct answers exist in the top few samples.

Strategy 2: Beam Search

Maintain K partial chains at each step, expand each by generating M next steps, score all K×M candidates, keep the top K.

Cost: O(K × M × depth). Quality: better than Bo-N for the same compute budget on hard problems because it shares computation across chains at early steps.

Strategy 3: Diverse Verifier Tree Search (DVTS)

A variant of beam search that explicitly maximizes diversity among the K beams at each step, preventing collapse to the same reasoning path.

The key insight: standard beam search with a learned verifier often collapses. The top-K beams all follow similar reasoning paths. DVTS partitions the K beams into groups and runs independent beam search within each group, maintaining diversity.

async def diverse_verifier_tree_search(
    problem: str, model, verifier,
    n_beams: int = 16, n_groups: int = 4, depth: int = 10
) -> str:
    """
    DVTS: divide beams into groups for diverse search.
    Each group explores independently; best answer across groups is returned.
    """
    beams_per_group = n_beams // n_groups
    group_results = []
 
    async def search_group():
        beams = [problem]
        for step in range(depth):
            candidates = []
            for beam in beams:
                # Generate next steps for this beam
                next_steps = await model.sample(beam, n=3, temperature=0.9)
                for step_text in next_steps:
                    new_beam = beam + "\n" + step_text
                    score = await verifier.score(new_beam)
                    candidates.append((new_beam, score))
 
            # Keep top beams_per_group candidates
            candidates.sort(key=lambda x: x[1], reverse=True)
            beams = [b for b, _ in candidates[:beams_per_group]]
 
        # Return best beam from this group
        scored = [(b, await verifier.score(b)) for b in beams]
        return max(scored, key=lambda x: x[1])[0]
 
    # Run all groups concurrently
    group_results = await asyncio.gather(*[search_group() for _ in range(n_groups)])
 
    # Return best result across groups
    scored_results = [(r, await verifier.score(r)) for r in group_results]
    return max(scored_results, key=lambda x: x[1])[0]

Strategy 4: MCTS (full tree search)

The most expensive but most effective strategy for the hardest problems.

Compute-performance tradeoff across strategies:

For MATH benchmark hard subset (problems where GPT-4 base accuracy is <20%):

Strategy Compute multiplier Accuracy
Greedy (CoT) 1x 18%
Best-of-16 16x 47%
Beam search (K=16) ~16x 58%
DVTS (16 beams) ~16x 63%
MCTS (50 sim) ~50x 71%
MCTS (200 sim) ~200x 78%

The crossover point (where additional test-time compute stops helping) varies by problem difficulty and verifier quality. For most production applications, DVTS with 16 beams offers the best compute-quality tradeoff.

Adaptive Compute Allocation: Matching Budget to Difficulty

Not all problems need the same compute budget. A uniform allocation of N reasoning chains per query is wasteful on easy questions and insufficient on hard ones.

Adaptive allocation routes each query to the appropriate compute level based on estimated difficulty:

class AdaptiveComputeAllocator:
    """
    Allocates test-time compute adaptively based on problem difficulty.
    Easier problems get less compute; harder problems get more.
    """
 
    def __init__(self, difficulty_classifier, model, verifier):
        self.classifier = difficulty_classifier
        self.model = model
        self.verifier = verifier
 
        # Compute budgets by difficulty tier
        self.strategy_map = {
            "easy": {"strategy": "greedy", "n": 1},
            "medium": {"strategy": "best_of_n", "n": 4},
            "hard": {"strategy": "beam_search", "n": 16, "depth": 8},
            "very_hard": {"strategy": "mcts", "n_simulations": 50},
        }
 
    async def solve(self, problem: str) -> tuple[str, dict]:
        """
        Solve a problem with adaptive compute allocation.
        Returns: (answer, metadata) where metadata includes difficulty and strategy used.
        """
        # Classify difficulty (cheap: one model call, no search)
        difficulty = await self.classifier.classify(problem)
        strategy_config = self.strategy_map[difficulty]
 
        if strategy_config["strategy"] == "greedy":
            answer = await self.model.complete(problem, temperature=0)
 
        elif strategy_config["strategy"] == "best_of_n":
            answer = await best_of_n(
                problem, self.model, self.verifier, n=strategy_config["n"]
            )
 
        elif strategy_config["strategy"] == "beam_search":
            answer = await beam_search_reasoning(
                problem, self.model, self.verifier,
                n_beams=strategy_config["n"],
                depth=strategy_config["depth"]
            )
 
        elif strategy_config["strategy"] == "mcts":
            searcher = MCTSReasoningSearch(
                self.model, self.verifier,
                n_simulations=strategy_config["n_simulations"]
            )
            answer = await searcher.search(problem)
 
        return answer, {
            "difficulty": difficulty,
            "strategy": strategy_config["strategy"],
            "compute_multiplier": strategy_config.get("n", 1),
        }
 
async def build_difficulty_classifier(model) -> callable:
    """
    Build a lightweight difficulty classifier.
    Returns a function that estimates query difficulty from the question alone.
    """
    async def classify(problem: str) -> str:
        classification_prompt = f"""Rate the difficulty of this reasoning problem on a scale:
- easy: direct recall or single-step inference
- medium: 2-5 step reasoning, standard techniques
- hard: multi-step reasoning with potential backtracking
- very_hard: requires deep search, highly non-obvious
 
Problem: {problem}
 
Difficulty (one word):"""
 
        response = await model.complete(classification_prompt, max_tokens=5, temperature=0)
        for level in ["very_hard", "hard", "medium", "easy"]:
            if level in response.lower():
                return level
        return "medium"  # Default to medium
 
    return classify

Efficiency gains from adaptive allocation: Snell et al. report that adaptive compute allocation achieves 4x better performance per compute unit compared to uniform best-of-N allocation. Easy problems (the majority) get cheap greedy answers, while hard problems (the minority) get expensive search. The average compute per problem is much lower than worst-case, while the hardest problems get the budget they need.

Training for TestTime Compute: RLVR and Thinking Token Training

The model used as the base for test-time scaling needs to be trained to produce reasoning chains that are amenable to search: long, decomposed, self-critical thinking steps that contain verifiable intermediate claims.

Two training approaches have emerged:

RLVR (Reinforcement Learning with Verifiable Rewards): Train the model on problems where correctness is automatically verifiable (mathematics, code, formal logic). The reward signal is binary: the final answer is either correct or not. No human reward labeling needed. The model is trained to maximize the probability of reaching correct answers, which implicitly teaches it to produce reasoning chains that work.

DeepSeek-R1 (2025) demonstrated that RLVR alone, without supervised fine-tuning on human reasoning examples, produces strong reasoning capabilities. The model trained purely through RL on verifiable rewards developed "thinking behaviors" (self-correction, exploration of alternatives, backtracking) that emerged from the training objective without being explicitly taught.

Thinking token training: Train the model explicitly to generate a scratchpad of thoughts (contained in special tokens like <think>...</think>) before generating the final answer. The scratchpad is not part of the output given to the user; it is internal computation. The model is rewarded based on the final answer quality, but the thinking tokens allow it to allocate more computation to the problem before committing.

The training dataset for thinking token models requires examples with explicit step-by-step reasoning labeled as thinking tokens. Both high-quality human demonstrations and RLVR-derived chains can serve as training data.

# Prompting thinking-token models correctly
THINKING_SYSTEM_PROMPT = """You are a reasoning assistant. Before answering any question,
use <think> tags to work through the problem step by step. Your thinking is private
and will not be shown to the user. Only the content after </think> is the final answer.
 
For difficult problems:
- Break the problem into subproblems
- Check each step before proceeding
- Consider alternative approaches if stuck
- Verify your final answer against the problem statement
 
Format:
<think>
[Your step-by-step reasoning here]
</think>
[Final answer here]"""
 
def extract_thinking_and_answer(response: str) -> tuple[str, str]:
    """Extract thinking trace and final answer from model response."""
    if "<think>" in response and "</think>" in response:
        think_start = response.index("<think>") + len("<think>")
        think_end = response.index("</think>")
        thinking = response[think_start:think_end].strip()
        answer = response[think_end + len("</think>"):].strip()
        return thinking, answer
    return "", response

Budget allocation for thinking tokens: Models with explicit thinking tokens allow direct control over the compute budget by limiting the number of thinking tokens. "Think for at most 1000 tokens" vs. "think for at most 5000 tokens" produces different quality-cost tradeoffs for the same model. This is a practical advantage over best-of-N approaches that require running the model multiple times.

Practical Tradeoffs: When TestTime Compute Helps and When It Doesn't

Test-time compute scaling has real but bounded applicability. The conditions under which it helps:

High TTC benefit:

  • Mathematics (exact verifiability, unique correct answer)
  • Formal logic and proof verification
  • Code generation (output can be run and tested)
  • Constraint satisfaction problems
  • Multi-step planning with verifiable goal states

Low TTC benefit:

  • Open-ended creative writing (no ground truth to verify against)
  • Summarization (multiple valid outputs; hard to rank)
  • Pure factual recall (either the model knows the fact or it doesn't; more compute doesn't create missing knowledge)
  • Tasks where speed matters more than accuracy (chatbot responses where latency is the UX constraint)

Negative TTC effect:

  • Tasks where overthinking leads to less natural outputs (casual conversation, brainstorming where fluency matters more than correctness)
  • Tasks where the model's uncertainty is semantic, not computational. If the model lacks the knowledge to answer a question, searching through the space of guesses does not help

Cost implications: A 50-simulation MCTS run costs approximately 50x more than a single inference. For production deployments, TTC approaches are economically feasible only when:

  • The value of a correct answer is high relative to compute cost
  • The task is hard enough that TTC meaningfully improves accuracy
  • Compute can be allocated asynchronously (latency is tolerable)

For real-time chatbot applications with latency constraints under 2 seconds, TTC strategies are generally not practical except for the lightest versions (best-of-3, greedy with CoT). For background research agents, code generation pipelines, and mathematical problem solvers, the economics work well.

Implementation Patterns for Production Systems

import asyncio
from anthropic import AsyncAnthropic
 
client = AsyncAnthropic()
 
async def reasoning_with_budget_tokens(
    problem: str,
    thinking_budget: int = 5000,
    model: str = "claude-opus-4-6",
) -> dict:
    """
    Use extended thinking via the Anthropic API.
    Allocates thinking_budget tokens for internal reasoning.
    """
    response = await client.messages.create(
        model=model,
        max_tokens=16000,
        thinking={
            "type": "enabled",
            "budget_tokens": thinking_budget,
        },
        messages=[{"role": "user", "content": problem}],
    )
 
    thinking_content = ""
    text_content = ""
 
    for block in response.content:
        if block.type == "thinking":
            thinking_content = block.thinking
        elif block.type == "text":
            text_content = block.text
 
    return {
        "thinking": thinking_content,
        "answer": text_content,
        "thinking_tokens_used": len(thinking_content.split()),  # Approximate
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    }
 
async def calibrate_thinking_budget(
    problems: list[str],
    budgets: list[int],
    verifier: callable,
) -> dict[int, float]:
    """
    Calibrate which thinking budget is cost-efficient for a problem set.
    Returns: {budget: accuracy} mapping.
    """
    results = {}
 
    for budget in budgets:
        correct = 0
        for problem in problems:
            result = await reasoning_with_budget_tokens(problem, budget)
            if await verifier(result["answer"], problem):
                correct += 1
        results[budget] = correct / len(problems)
 
    return results

The routing strategy for production: Route queries based on a quick difficulty estimate. For the majority of queries (estimated 60-70% "easy" for most production applications), use greedy generation with standard CoT. For medium queries (~25%), use best-of-4 with a fast verifier. For hard queries (~5%), use beam search or MCTS with the full budget.

This routing produces aggregate compute cost close to greedy generation cost while capturing most of the quality benefits of TTC for the hard queries where it matters.

Efficiency: Getting More From Less With Inference Budget

Several techniques reduce the compute cost of test-time compute scaling:

1. Speculative rejection: Run cheap verification at each step and reject chains early before they are fully completed. A chain that scores below 0.3 after step 4 is unlikely to produce a correct final answer. Stop and sample a new chain rather than completing it.

2. Shared prefix caching: In best-of-N, all N chains share the same problem statement prefix. Using KV cache for this shared prefix eliminates prefix computation cost across chains. For long problem statements, this is significant. The problem can be 500 tokens, and the KV cache saves 500 × N token computations.

3. Draft model verification: Use a small model for most reasoning steps, upgrade to the full model only when the draft model's confidence is low or a critical decision point is reached. The principle is identical to speculative decoding applied to reasoning steps rather than individual tokens.

4. Problem decomposition before search: Decompose complex problems into independent subproblems, solve each with TTC, and compose the subproblem answers. This reduces the search depth needed for each subproblem while maintaining overall problem coverage.

Key Takeaways

  • Test-time compute scaling (inference-time scaling) achieves efficiency gains by allocating more compute at inference rather than training. A smaller model with 4x increased test-time compute outperforms a model trained with 14x more parameters in FLOPs-matched evaluation. This is a fundamentally different efficiency curve than training scaling.

  • The key enabling component is the process reward model (PRM), which scores individual reasoning steps rather than just final answers. PRMs can be trained without expensive human annotation using Monte Carlo estimation: sample K completions from each partial chain and compute the fraction that yield correct answers.

  • Four main scaling strategies exist in order of increasing cost and capability: greedy CoT (1x), best-of-N (Nx), beam search (Kx), and MCTS (simulation count x). The right strategy depends on problem difficulty. Most problems only need greedy CoT, while a small fraction of hard problems require full MCTS.

  • Adaptive compute allocation achieves 4x better performance per compute unit compared to uniform best-of-N allocation. Easy problems (the majority) get cheap greedy answers, while hard problems (the minority) get expensive search. The difficulty classifier itself needs to be cheap: a single model call estimating difficulty.

  • Test-time compute is not universally applicable. It helps specifically when verification is possible: mathematics, code (run it), formal logic, constraint satisfaction. It does not help for pure knowledge recall (the model lacks the fact regardless of compute), open-ended creative tasks, or latency-constrained real-time applications.

  • Thinking token models (o1-style, Claude extended thinking, DeepSeek-R1) make test-time compute directly accessible via API parameters. The thinking budget token count controls the compute-quality tradeoff. Calibrate this budget for your specific task distribution rather than using defaults.

FAQ

What is testtime compute scaling in AI?

Test-time compute scaling is the practice of allocating additional computation during inference (when the model is generating a response) to improve output quality, rather than relying solely on model size and training compute. Instead of returning a single generated answer, the system generates multiple candidate reasoning chains, evaluates them using a verifier or process reward model, and returns the best one. Research shows that a smaller model with 4x increased test-time compute can outperform a model trained with 14x more parameters in FLOPs-matched evaluation, making it a highly efficient path to quality improvement for tasks with verifiable answers.

How does OpenAI o1 reasoning work?

OpenAI o1 uses reinforcement learning with verifiable rewards (RLVR) to train a model to produce extended chain-of-thought reasoning before generating its final answer. The model generates a "thinking" trace (a series of intermediate reasoning steps, hypotheses, and self-corrections) before committing to an answer. This thinking trace is produced by the model as part of its generation but is not shown directly to the user in most interfaces. The model was trained using reinforcement learning on problems with automatically verifiable answers (mathematics, code), learning to allocate its reasoning effort toward problems it finds difficult and to self-correct errors in intermediate steps. The result is dramatically improved performance on hard reasoning tasks compared to the same model without extended thinking.

What is a process reward model (PRM)?

A process reward model (PRM) is a model trained to assign quality scores to individual steps in a reasoning chain, not just to the final answer. Unlike outcome reward models that only evaluate whether the final answer is correct, PRMs provide step-level feedback: "step 3 of this reasoning chain has 0.85 confidence; step 7 has 0.42 confidence." This step-level signal enables early stopping of unpromising reasoning chains, identification of exactly where reasoning went wrong, and pruning in beam search or MCTS by scoring partial chains rather than waiting for completion. PRMs can be trained on human-annotated step-level labels or more economically using Monte Carlo estimation: sampling multiple completions from each partial chain and computing the fraction that yield correct final answers.

What is the difference between bestofN sampling and beam search for reasoning?

Best-of-N (Bo-N) sampling generates N independent complete reasoning chains in parallel (or sequentially) and returns the one with the highest verifier score. Each chain starts fresh from the problem statement. Beam search maintains K partial chains that are expanded step-by-step, pruning low-scoring chains at each step and keeping the top-K. The key difference is that beam search shares computation across chains at early steps. All K beams explore the same early reasoning before diverging, while Bo-N wastes compute independently rediscovering the same obvious early steps. For hard problems where most reasoning chains reach the same correct first few steps before diverging, beam search is significantly more efficient than Bo-N at the same compute budget. For problems where early steps are also uncertain, Bo-N's independence prevents premature convergence to wrong paths.

The training scaling law held for a decade: more compute in training, better models. Test-time compute scaling offers a different law: more compute at inference, better answers for hard problems.

The two laws are not alternatives. They compound. A model trained with more compute has a higher capability ceiling; test-time compute allows it to reach that ceiling more consistently on hard problems. The practical question is where to allocate the next dollar of compute: training (raises the ceiling for everyone), or inference (improves the answers on hard queries for specific users right now).

For production applications with specific hard problems to solve (mathematical reasoning, complex code generation, multi-step planning), the inference allocation is often better value. A 50x increase in inference compute on the hard queries (which may be 5% of total queries) costs 2.5x total inference budget, produces dramatically better results on the hard queries, and requires no retraining.

The difficulty is the prerequisite: you need a verifier. Problems without ground truth verification are poor candidates for TTC scaling. The research agenda is currently focused on extending verifiable rewards to broader task categories. As that happens, the range of problems where TTC applies will expand.

For now, the practical advice is to identify the hard problems in your application where TTC applies, calibrate the compute budget for your quality-cost target, and route accordingly. The infrastructure is available. The models support it. The question is which problems in your stack are worth thinking harder about.

Written & published by Chaitanya Prabuddha