Agentic RL (Part II): RL Systems for Real-World Tasks

In Part I, we discussed Agentic Models, Action-Observation Loops, Rewards, and why Reinforcement Learning (RL) becomes the core optimization tool when models go beyond text generation to take actions, observe, and correct themselves in an environment.

However, once we step into real-world scenarios like financial forecasting, scientific discovery, and chemical reasoning, we face a more fundamental problem: the bottleneck is not a lack of RL algorithms but a lack of sufficiently good environments. Real-world tasks rarely have immediate ground-truth answers. Feedback can be delayed by days, months, or even longer; validation costs range from simple rule checks to expensive experiments; and models might learn to exploit evaluator loopholes (reward hacking) instead of solving the actual problem.

Therefore, Agentic RL in the real world is primarily about environment engineering, and only secondarily about algorithm selection. Whether an open-ended task can be represented as an actionable, observable, verifiable environment that can accumulate trajectories often determines whether RL can truly solve high-value problems.

This article uses three systems as cross-sections: EchoZ, Simple-TES, and ether0. They target future event prediction, scientific and engineering search, and chemical reasoning, respectively. Yet, their commonality is striking: all are designed around environments, verifiers, trajectory libraries, and data flywheels.

1. Beyond Static Problem Solving: Confronting Open-Ended Real-World Tasks

In math, coding, and multiple-choice questions, verifiable rewards are relatively clean: does the final answer match, do the unit tests pass, or is the option correct? But real-world open-ended tasks resemble a different class of problems:

Type	Examples	Reward Characteristics	Core Difficulties
Static Verifiable Tasks	Math, code, multiple-choice	Clear answers, clean rewards	Algorithms and sampling efficiency
Real-World Open Tasks	Financial prediction, scientific discovery, molecular design	Delayed, noisy, expensive, hackable	Environments, verifiers, data flywheels

This is why many real-world Agentic RL systems don't start with "which RL algorithm should I use," but rather by answering four questions first:

Environment: What actions can the model take? How does the environment return observations?
Exploration: How does the model search? Parallel research, local iteration, or specialist domain optimization?
Reward / Verifier: How is quality judged? Real-world outcomes, rubrics, evaluators, simulations, or tool combinations?
Learning Loop: How do high-quality trajectories enter training? SFT, IRFT, GRPO, Distillation, or building a continuous learning system?

The following three case studies represent three different answers to these four questions.

2. EchoZ: Turning the Future into Training Data

Future event prediction doesn't look like traditional RLVR (Reinforcement Learning with Verifiable Rewards) because no standard answer exists at the moment the model makes its prediction. However, it has a unique advantage: every event is eventually resolved in the real world. In other words, questions have no labels when they are generated, but ground truth arrives after a waiting period.

EchoZ capitalizes on this by proposing Train-on-Future: instead of training predictive models on historical events that have already occurred, it continuously generates questions about the future. Agents make predictions under incomplete information in the present. Once the real world delivers the outcome, the system retroactively evaluates the quality of each trajectory. EchoZ builds a predictive Agent based on a ReAct-style Thought-Action-Observation loop, saving the entire interaction process as a trajectory. ^[1]^[2]

A simplified workflow is:

Real-time trends -> Future question generation -> Multi-Agent research and prediction
-> Wait for event resolution -> Ground truth 
-> Brier / Elo ranking -> Rubric process scoring -> High-quality trajectory filtering
-> SFT / RL / Distillation -> New models continue predicting the future

The crux of this is not merely "using the future as labels." On a deeper level, Train-on-Future turns the real world into an asynchronous, continuous, naturally OOD (Out-Of-Distribution) environment sampler.

2.1 Map-Reduce Prediction Agent

EchoZ doesn't have a single Agent search the web and spit out a probability; instead, it breaks the prediction task into Map-Reduce style information gathering and evidence synthesis.

In the Map phase, the system decomposes macro-questions into multiple relatively orthogonal sub-tasks. Different Agents respectively retrieve official documents, news reports, databases, prediction markets, and social signals. In the Reduce phase, aggregation nodes handle source conflicts, distinguishing between first-hand evidence, second-hand reporting, market prices, and noisy signals to output structured probabilistic predictions.

The significance of this step is transforming "predicting an answer" into "generating an auditable evidence trajectory." The trajectory contains not just the final probability, but also why the model checked those sources, what it saw, how it handled conflicting evidence, and how it ultimately calibrated the probability.
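The Reduce step described above can be sketched roughly as follows. This is a minimal illustration, not EchoZ's actual aggregation: the `Evidence` record, the `SOURCE_WEIGHTS` values, and the choice of a weighted mean in log-odds space are all assumptions made for the example.

```python
import math
from dataclasses import dataclass

# Hypothetical reliability weights per source tier (illustrative values only).
SOURCE_WEIGHTS = {"first_hand": 1.0, "market": 0.8, "second_hand": 0.6, "social": 0.2}

@dataclass
class Evidence:
    sub_question: str   # which Map sub-task produced this item
    source_type: str    # one of the SOURCE_WEIGHTS keys
    probability: float  # sub-agent's estimate that the event resolves YES
    note: str           # what the agent saw, kept for auditing

def logit(p: float) -> float:
    p = min(max(p, 1e-6), 1 - 1e-6)  # clamp to avoid infinite log-odds
    return math.log(p / (1 - p))

def reduce_evidence(items):
    """Combine sub-agent estimates with a source-weighted mean in log-odds space.

    Returns the aggregated probability and an audit trail of every input.
    """
    total_w = sum(SOURCE_WEIGHTS[e.source_type] for e in items)
    z = sum(SOURCE_WEIGHTS[e.source_type] * logit(e.probability) for e in items) / total_w
    prob = 1 / (1 + math.exp(-z))
    trail = [(e.sub_question, e.source_type, e.probability, e.note) for e in items]
    return prob, trail
```

The point of returning `trail` alongside the probability is that the trajectory, not just the number, is what later gets scored and filtered.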

2.2 Using the Future to Reduce Data Leakage

The trouble with the traditional Train-on-Past paradigm is that historical web pages, news, and outcomes have likely already entered the pre-training corpus. Even with strict time cutoffs, it is extremely difficult to restore the true state of the internet at a historical moment. In fact, attempting to prevent "peeking at the answers" by taking snapshots of historical datasets has proven to be incredibly challenging engineering-wise.

EchoZ's Train-on-Future defers entirely to real physical time: the system collects the Agent's complete prediction trajectories only before the event occurs, then post-trains on these time-locked trajectories after the event resolves. In such a predictive system, the irreversibility of real time is itself the most critical piece of environment engineering.
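As a rough sketch of what "time-locked" scoring could look like, the snippet below freezes each prediction with a timestamp and only scores it once a resolution exists and the lock time precedes it. The `LockedPrediction` record and `score_resolved` helper are hypothetical names for illustration; the Brier score itself is the standard squared error between a forecast probability and a binary outcome.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class LockedPrediction:
    question_id: str
    probability: float   # predicted P(event resolves YES)
    locked_at: datetime  # moment the trajectory was frozen

def brier_score(probability: float, outcome: int) -> float:
    """Squared error between forecast and binary outcome (0 or 1); lower is better."""
    return (probability - outcome) ** 2

def score_resolved(predictions, resolutions):
    """Score only predictions that were locked strictly before resolution.

    resolutions maps question_id -> (resolved_at, outcome).
    Unresolved questions and late predictions are skipped.
    """
    scores = {}
    for p in predictions:
        if p.question_id not in resolutions:
            continue  # the future has not answered yet
        resolved_at, outcome = resolutions[p.question_id]
        if p.locked_at >= resolved_at:
            continue  # locked after the outcome was known: possible leakage
        scores[p.question_id] = brier_score(p.probability, outcome)
    return scores
```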

2.3 Rubrics as the Distillation of Expert Process Knowledge

Future predictions cannot be judged solely by whether the final answer is right. A rigorous judgment might fail due to a black swan event, while a poor judgment might hit the mark through sheer luck. If trained directly with outcome rewards, the model would mistake real-world noise for reasoning signals.

Therefore, EchoZ uses multi-dimensional rubrics to evaluate the prediction process. The original rubrics can be roughly divided into four categories:

Category	Focus Area
Sourcing	Whether first-hand sources are used; filtering old news spun as new, misleading search snippets, and metadata errors
Logic	Understanding resolution criteria, proper entity disambiguation, distinguishing verbal claims from actual execution
Timeline	Calculating remaining time windows, considering process lags, trigger events, and exit paths
Calibration	Using base rates of similar historical events, treating a lack of evidence as negative evidence, matching probability with evidence strength

These rubrics essentially break down analysts' implicit methodologies into machine-executable process evaluation dimensions.

More interesting still, EchoZ does not rely entirely on human-written rubrics. It uses resolved events to compute the true Elo rankings of models or trajectories, then has candidate rubrics score the same batch of trajectories. It computes the Spearman correlation between each rubric's ranking and the true Elo ranking, and retains the rubrics that best predict true win rates. ^[2]

This step transforms rubrics from empirical rules into reward models iteratively calibrated by the real world. Ground truth is not just the final label; it can also be used to select more reliable standards for process evaluation.
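The rubric-selection idea can be sketched as below: rank candidate rubrics by how well their scores agree, in Spearman terms, with the Elo ratings obtained from resolved events. The function names and the `keep` parameter are assumptions for illustration; EchoZ's actual selection procedure may differ in its details.

```python
def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

def select_rubrics(elo, rubric_scores, keep=1):
    """rubric_scores maps rubric name -> scores aligned with the elo list.

    Returns the `keep` rubric names whose rankings best match the Elo ranking.
    """
    by_agreement = sorted(
        rubric_scores,
        key=lambda name: spearman(rubric_scores[name], elo),
        reverse=True,
    )
    return by_agreement[:keep]
```

For example, with Elo ratings `[1500, 1600, 1400, 1700]`, a rubric scoring the same four trajectories `[3, 4, 2, 5]` orders them identically and would be kept over one scoring them `[5, 1, 4, 2]`.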

3. Simple-TES: Distilling Test-Time Search into Scientific Discovery Capabilities

EchoZ utilizes the asynchronous feedback of "the future will provide the answer." Simple-TES, on the other hand, targets a different class of tasks: candidate solutions can be scored by an evaluator, but the optimal solution is hard to write out directly. Examples include GPU kernel optimization, quantum circuit compilation, algorithmic engineering, combinatorial constructions, circle packing, Hadamard matrices, scRNA-seq denoising, and scaling law discovery. ^[3]

What these tasks share is a highly non-convex solution space in which both local refinements and global jumps matter, and in which the many failed attempts themselves carry information. The core of Simple-TES is therefore not getting the right answer in one go, but conducting organized trial-and-error at test time.

3.1 C × L × K: How to Allocate Search Budgets

Simple-TES decomposes the total evaluation budget into three dimensions:

Dimension	Meaning	Role
$C$ : Global Width	Number of parallel trajectories	Global exploration, avoiding early path lock-in
$L$ : Refinement Depth	Iterations per trajectory	Step-by-step improvement using feedback
$K$ : Local Sample Size	Candidates per iteration	Reducing single-step sampling noise

In each iteration, every trajectory constructs a prompt based on historical feedback, generates $K$ candidate solutions, submits them to the evaluator for scoring, and appends only the highest-scoring candidate of the current round to the trajectory. The system runs $C$ trajectories in parallel for $L$ rounds, finally returning the global maximum score solution.

This illustrates that test-time scaling is not blind over-sampling, but budgeting across global exploration, local candidate sampling, and long-term feedback accumulation. Mathematical construction tasks often rely more on $C$, since they need to find rare structural starting points; GPU kernels and engineering optimizations rely more on $L$, as performance gains usually come from continuous fine-tuning; and a larger $K$ matters most in long trajectories, because the quality of each local selection step is amplified by subsequent iterations.
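The loop described above can be sketched in a few lines. This is a toy version under stated assumptions: `propose` stands in for the LLM call that builds a prompt from trajectory history, `evaluate` stands in for the task evaluator, and the one-dimensional objective is purely illustrative.

```python
import random

def tes_search(evaluate, propose, C, L, K, seed=0):
    """Run C parallel trajectories for L rounds, sampling K candidates per round.

    Each round appends only the best of the K candidates to its trajectory.
    Total evaluator calls: C * L * K.
    """
    rng = random.Random(seed)
    trajectories = [[] for _ in range(C)]  # each entry: (candidate, score)
    best = (None, float("-inf"))
    for _ in range(L):
        for traj in trajectories:
            scored = []
            for _ in range(K):
                cand = propose(traj, rng)
                scored.append((cand, evaluate(cand)))
            top = max(scored, key=lambda cs: cs[1])
            traj.append(top)
            if top[1] > best[1]:
                best = top
    return best, trajectories

# Toy stand-ins (assumptions): maximize -(x - 3)^2 over the reals.
def evaluate(x):
    return -(x - 3.0) ** 2

def propose(traj, rng):
    if not traj:
        return rng.uniform(-10, 10)  # global exploration at the start
    anchor = max(traj, key=lambda cs: cs[1])[0]
    return anchor + rng.gauss(0, 1.0)  # local refinement around the best so far

best, trajectories = tes_search(evaluate, propose, C=4, L=5, K=3)
```

Holding the budget `C * L * K` fixed and shifting it between the three factors is a cheap way to see which dimension a given task rewards.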

3.2 Context Selection is also Part of the Policy

The Context Builder of Simple-TES does not stuff all history into the prompt; instead, it decides which successes, failures, and intermediate explanations should enter the context.

Simple-TES describes several context-selection strategies. RPUCG resembles the PUCT (Predictor + Upper Confidence bounds applied to Trees) rule used by DeepMind in the AlphaGo / AlphaZero line of work ^[4]^[5]: it balances high-scoring nodes, nodes likely to yield high-scoring descendants, and under-explored nodes. Balance retains best, elite, explore, and random candidates at the same time to maintain diversity explicitly. LLM-Elite uses an auxiliary LLM to maintain an elite pool of methodologically diverse solutions.

The principle behind this is highly important: in Agentic RL, memory/context selection itself is part of the policy. What failures, successes, and intermediate reasoning the model sees will directly dictate its next search direction.
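For reference, the PUCT selection rule as stated in the AlphaGo Zero paper ^[5] picks, at a node $s$, the action

$$
a^{*} = \arg\max_{a} \left[ Q(s,a) + c_{\text{puct}} \, P(s,a) \, \frac{\sqrt{\sum_{b} N(s,b)}}{1 + N(s,a)} \right]
$$

where $Q(s,a)$ is the mean value observed so far, $P(s,a)$ is the prior from the policy network, $N(s,a)$ is the visit count, and $c_{\text{puct}}$ controls the exploration strength. The first term favors nodes that have already scored well, while the second favors nodes with a promising prior and few visits. RPUCG's exact scoring function is not reproduced in this article, so the formula should be read only as the reference point for the analogy, not as Simple-TES's definition.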

3.3 From Candidate-Level Rewards to Trajectory-Level Learning

Simple-TES does not simply train the model on the immediate score of each candidate, as this would induce short-sighted policies. In scientific discovery, early low-scoring attempts could be the scaffolding for later breakthroughs, and a failure at a certain step might expose a critical error pattern.

Thus, it employs trajectory-level post-training: it first samples trajectories heavily, then sorts them by the highest score reached within each trajectory, retaining only the top $R\%$ as elite trajectories. If a trajectory hits its peak score at step 4 and sees no improvement from step 5 onward, training retains only the first 4 steps and discards the subsequent redundant exploration. Finally, these retained trajectories are used for weighted maximum likelihood training.

This looks like SFT, but the data isn't static human demonstrations; rather, it consists of high-quality exploration paths filtered by the evaluator from massive rollouts. Simple-TES's analysis reveals a very telling statistic: of the millions of candidate trajectories sampled in the cold-start phase, only about 0.48% are ultimately kept and refined into high-quality training data.

In other words, Simple-TES uses expensive test-time compute to distill a low-density exploration space into high-quality training data. Large-scale search precedes learning, and learning in turn raises the efficiency of the next round of search.
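The elite-filtering and peak-truncation steps from this section can be sketched as follows. The function names and the `top_percent` parameter are assumptions, and the weighting used in the subsequent maximum likelihood training is left out because the article does not specify it.

```python
def truncate_at_peak(scores):
    """Number of steps to keep: up to and including the first step reaching the max."""
    return scores.index(max(scores)) + 1

def select_elite(trajectories, top_percent):
    """Keep the top `top_percent` of trajectories by best score, cut at their peak.

    Each trajectory is a list of (candidate, score) steps.
    At least one trajectory is always kept.
    """
    ranked = sorted(
        trajectories,
        key=lambda t: max(score for _, score in t),
        reverse=True,
    )
    n_keep = max(1, int(len(ranked) * top_percent / 100))
    elite = []
    for traj in ranked[:n_keep]:
        keep = truncate_at_peak([score for _, score in traj])
        elite.append(traj[:keep])  # drop exploration after the peak
    return elite
```

With a trajectory whose scores are `[1, 2, 5, 9, 9, 8]`, the peak is first reached at step 4, so only the first four steps survive, matching the example in the text.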

4. ether0: Multi-Tier Verifiers in Chemical Reasoning

ether0 targets chemical reasoning and molecular design. It's not an ordinary chemical Q&A model, but a 24B reasoning model based on Mistral-Small-24B, trained on 640,730 experimentally grounded chemistry problems covering 375 tasks. The task scope ranges from synthesizability, blood-brain barrier penetration, and human receptor activity to odor. ^[6]

These types of tasks are perfectly suited to test the "hard to generate, easy to verify" concept discussed in Part I: designing a molecule that satisfies constraints is difficult, but checking whether a candidate molecule satisfies SMILES validity, molecular formulas, functional groups, reaction feasibility, or a specific property prediction is relatively easier.

4.1 From SFT to Generalist GRPO

The training pipeline of ether0 can be compressed into four steps:

Long CoT SFT Cold Start: Using strong models to generate long reasoning chains, filtered by formatting, SMILES/SMIRKS validity, and LLM-as-Judge.
Specialist GRPO: Training specialists by task family, allowing the model to explore within narrower distributions first. The relative advantage estimation of GRPO comes from the critic-free RL approach in the DeepSeekMath work ^[7].
Specialist Distillation: Collecting correct trajectories during specialist training, filtering out low-quality reasoning, non-English outputs, and bad molecular structures, and distilling them back into a generalist model. This mirrors the trajectory distillation idea of the teacher model in DeepSeek-V4 ^[8].
Generalist GRPO: Joint training across all tasks, incorporating curriculums, molecule quality bonuses, and safety alignment.

The point of specialists and curriculums is to keep rollouts near the boundary of the model's capabilities. If every completion in a group sampled for the same prompt is correct, or every one is wrong, all rewards are identical, the group-relative advantage in GRPO is zero, and the training signal is wasted.
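A small sketch makes the vanishing-signal point concrete. It follows the group-normalized form of the advantage from DeepSeekMath ^[7] (reward minus group mean, divided by group standard deviation); the `eps` term is an assumption added here only to avoid division by zero.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages for completions sampled from one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# All completions correct: every advantage is zero, so no gradient signal.
all_correct = group_advantages([1.0, 1.0, 1.0, 1.0])

# Mixed group: the single success gets a positive advantage, failures negative.
mixed = group_advantages([1.0, 0.0, 0.0, 0.0])
```

This is why a curriculum that keeps the per-prompt success rate away from 0% and 100% makes each rollout batch more informative.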

4.2 Reward is Verifier Dispatch, Not a Single Scorer

The reward in ether0 is not a single model assigning a total score, but an ensemble of domain tools, rules, databases, and surrogate models:

Task	Verifier Example
IUPAC / MCQ	String or normalized matching
SMILES completion	RDKit validity
Molecular formula	Hill notation / Molecular formula constraints
Functional group	Molecular formula + functional group dual constraints
Solubility edit	Property predictors like KDESol
Retrosynthesis	Purchasability Bloom filter + Molecular Transformer
Reaction prediction	Exact or soft matching of products

This is the quintessential form of scientific and engineering RL: reward is not an abstract scalar, but a suite of verifier dispatch operations. Low-cost rules first filter out invalid samples, and high-fidelity models and databases then calibrate high-value areas. For real scientific tasks, a single oracle is often too expensive, and a single rule is too weak; fusing multi-tier verifiers is the only scalable way to solve complex real-world problems.
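The cheap-first, expensive-later dispatch pattern can be sketched as below. Everything here is a toy stand-in: `balanced_parens` is not real SMILES validation (ether0 uses tools such as RDKit for that), and `toy_property_ok` is a hypothetical placeholder for an expensive surrogate model.

```python
from typing import Callable, List, Tuple

Verifier = Tuple[str, Callable[[str], bool]]  # (name, check function)

def balanced_parens(s: str) -> bool:
    """Cheap structural rule: parentheses must be balanced (toy check only)."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def toy_property_ok(s: str) -> bool:
    """Hypothetical placeholder for an expensive property predictor."""
    return "O" in s

def dispatch(candidate: str, tiers: List[Verifier]):
    """Run verifiers in order (cheapest first) and stop at the first failure.

    Returns (reward, log), where log records which checks ran and their results.
    """
    log = []
    for name, check in tiers:
        ok = check(candidate)
        log.append((name, ok))
        if not ok:
            return 0.0, log  # later, costlier tiers are never invoked
    return 1.0, log

TIERS: List[Verifier] = [
    ("non_empty", lambda s: len(s) > 0),
    ("balanced_parens", balanced_parens),
    ("toy_property", toy_property_ok),
]
```

Because invalid candidates exit at the first failed tier, the expensive verifier only sees samples that have already passed the cheap rules.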

4.3 Anti-Cheating Must be Written into the Reward

Chemical tasks are highly susceptible to reward hacking. In retrosynthesis, for instance, the model might place the target product directly among the reactants, generate syntactically valid but physically unreasonable molecules, or exploit loopholes in surrogate models to optimize a single metric at the expense of synthesizability.

ether0's defense lines include reasonable molecule checks, bad substructure / negative SMARTS patterns, combinations of format rewards and accuracy rewards, molecule quality bonuses, and safety alignment. This demonstrates that real-world RL rewards must contain at least three types of signals: target rewards, constraint rewards, and anti-cheating penalties.

Simply defining "what you want" is not enough; you must also clearly define "which shortcuts don't count."
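As a minimal sketch of combining the three signal types for the retrosynthesis shortcut mentioned above: the function name, the bonus and penalty values, and the assumption that molecules arrive as already-canonicalized strings are all illustrative, not ether0's actual reward.

```python
def composite_reward(
    target: str,
    reactants: list,
    reaches_target: bool,
    all_purchasable: bool,
    constraint_bonus: float = 0.5,
    cheat_penalty: float = 1.0,
) -> float:
    """Target reward + constraint reward, with an explicit anti-cheating penalty.

    Assumes `target` and `reactants` are canonicalized strings, so plain
    equality is enough to detect the product being smuggled into the inputs.
    """
    if target in reactants:
        return -cheat_penalty  # shortcut: the "synthesis" starts from the product
    reward = 1.0 if reaches_target else 0.0  # target reward
    if reaches_target and all_purchasable:
        reward += constraint_bonus           # constraint reward
    return reward
```

Checking the shortcut before anything else means a hacked answer cannot collect the target reward and then merely lose a little of it.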

5. Environment and Task-Driven Data AI Engine

Dimension	EchoZ	Simple-TES	ether0
Task Type	Future event prediction	Scientific/Engineering search	Chemical reasoning and molecular design
Exploration Method	Map-Reduce multi-Agent research	$C \times L \times K$ test-time search	Specialist GRPO + distillation
Reward	Ground truth, Brier/Elo, rubrics	Evaluator score, trajectory best score	Format reward, domain verifiers, quality bonus
Data Flywheel	Train-on-Future	Heavy filtering post massive rollouts	Distilling correct specialist trajectories into generalist
Core Risks	Time leakage, real-world noise, probability distortion	Short-sighted rewards, evaluator overfitting, compute cost	Reward hacking, unreasonable molecules, trivial groups

Fusing the case studies above reveals several common principles for applying Agentic RL in real-world systems.

First, construct the environment before discussing the algorithm. Without an actionable, observable, and verifiable environment, even the best RL algorithm is just performing gradient descent on noise.

Second, the reward must simultaneously satisfy credibility, density, low cost, and anti-cheating measures. EchoZ solves the delayed ground-truth and process rubric calibration issue; Simple-TES addresses evaluator-driven massive trial-and-error; ether0 handles the hierarchical dispatch of domain verifiers and anti-cheating constraints.

Third, trajectories are more important than answers. The training object of Agentic RL is not the answer, but the trajectory. The answer is just the last line of the trajectory; what is truly transferable is how the model searches, verifies, corrects, and terminates.

Fourth, high-quality data comes from meticulous selection following massive exploration. EchoZ filters for high-rubric/high-Elo trajectories, Simple-TES retains only a tiny fraction of elite trajectories from millions of candidates, and ether0 keeps correct trajectories while filtering out low-quality reasoning and bad molecular structures.

Fifth, Rubrics or Multi-Tier Verifiers will become crucial middle layers for open-task RL. Confronting tasks that are difficult to evaluate directly, EchoZ distilled experts' prediction processes into multi-dimensional rubrics, while ether0 built a multi-tier verifier stack ranging from lightweight rules to high-fidelity models. Both provide exploration directions and process constraints to the model through explicit middle layers.

Works such as CL-bench have shown that current models perform poorly when faced with many entirely new rubric constraints ^[9]. This has in turn motivated techniques such as Rubrics to Tokens, which attempt to convert response-level rubrics into finer-grained token-level rewards ^[10] to address reward credit assignment in complex tasks.

These directions collectively suggest that the future moat for domain models and agent systems will not be parameter scale alone, but the data production system built around domain environments and tools, task rewards, verifiers, trajectory databases, and closed-loop training.

However, when these complex designs move towards deployment, the bottleneck shifts down to the infrastructure layer: massive concurrent rollout requests, asynchronous scheduling of multi-tier verifiers, weight synchronization between training and inference engines, and the secure isolation of Agent interaction environments. All of these heavily test the throughput, stability, and scalability of the underlying systems. To address these engineering challenges, the next article in this series turns to infrastructure purpose-built for Agentic RL, such as verl and SkyRL.

References

[1] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR, 2023.
[2] UniPat AI. "Echo: Towards General AI Prediction." 2026.
[3] Haotian Ye, Haowei Lin, Jingyi Tang, Yizhen Luo, Caiyin Yang, Chang Su, Rahul Thapa, Rui Yang, Ruihua Liu, Zeyu Li, Chong Gao, Dachao Ding, Guangrong He, Miaolei Zhang, Lina Sun, Wenyang Wang, Yuchen Zhong, Zhuohao Shen, Di He, Jianzhu Ma, Stefano Ermon, Tongyang Li, Xiaowen Chu, James Zou, and Yuzhi Xu. "Evaluation-driven Scaling for Scientific Discovery." arXiv:2604.19341, 2026.
[4] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. "Mastering the game of Go with deep neural networks and tree search." Nature, 529(7587): 484-489, 2016.
[5] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. "Mastering the game of Go without human knowledge." Nature, 550(7676): 354-359, 2017.
[6] Siddharth M. Narayanan, James D. Braza, Ryan-Rhys Griffiths, Albert Bou, Geemi P. Wellawatte, Mayk Caldas Ramos, Ludovico Mitchener, Samuel G. Rodriques, and Andrew D. White. "Training a Scientific Reasoning Model for Chemistry." arXiv:2506.17238, 2025.
[7] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, and Daya Guo. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv:2402.03300, 2024.
[8] DeepSeek-AI. "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence."
[9] Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang, Yujiong Shen, Junzhe Wang, Jiayi Chen, Yuchen Ni, Junjie Ye, Cheng Zhang, Huaibing Xie, Jianglu Hu, Shaolei Wang, Weichao Wang, Yanling Xiao, Yiting Liu, Zenan Xu, Zhen Guo, Pluto Zhou, Tao Gui, Zuxuan Wu, Xipeng Qiu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Di Wang, and Shunyu Yao. "CL-bench: A Benchmark for Context Learning." arXiv:2602.03587, 2026.
[10] Tianze Xu, Yanzhao Zheng, Pengrui Lu, Lyumanshan Ye, Yong Wu, Zhentao Zhang, Yuanqiang Yu, Chao Ma, Jihuai Zhu, Pengfei Liu, Baohua Dong, Hangcheng Zhu, Ruohui Huang, and Gang Yu. "Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks." arXiv:2604.02795, 2026.
