Agentic RL (Part II): RL Systems for Real-World Tasks

In Part I, we discussed Agentic Models, Action-Observation Loops, Rewards, and why Reinforcement Learning (RL) becomes the core optimization tool when models go beyond text generation to take actions, observe, and correct themselves in an environment.
However, once we step into real-world scenarios like financial forecasting, scientific discovery, and chemical reasoning, we face a more fundamental problem: it's not the lack of RL algorithms, but the lack of a sufficiently good environment. Real-world tasks rarely have immediate ground-truth answers. Feedback can be delayed by days, months, or even longer; validation costs range from simple rule checks to expensive experiments; and models might learn to exploit evaluator loopholes (reward hacking) instead of solving the actual problem.
Therefore, Agentic RL in the real world is primarily about environment engineering, and only secondarily about algorithm selection. Whether an open-ended task can be represented as an actionable, observable, verifiable environment that can accumulate trajectories often determines whether RL can truly solve high-value problems.
This article uses three systems as cross-sections: EchoZ, Simple-TES, and ether0. They target future event prediction, scientific and engineering search, and chemical reasoning, respectively. Yet, their commonality is striking: all are designed around environments, verifiers, trajectory libraries, and data flywheels.
1. Beyond Static Problem Solving: Confronting Open-Ended Real-World Tasks
In math, coding, and multiple-choice questions, verifiable rewards are relatively clean: does the final answer match, do the unit tests pass, or is the option correct? But real-world open-ended tasks resemble a different class of problems:
| Type | Examples | Reward Characteristics | Core Difficulties |
|---|---|---|---|
| Static Verifiable Tasks | Math, code, multiple-choice | Clear answers, clean rewards | Algorithms and sampling efficiency |
| Real-World Open Tasks | Financial prediction, scientific discovery, molecular design | Delayed, noisy, expensive, hackable | Environments, verifiers, data flywheels |
This is why many real-world Agentic RL systems don't start with "which RL algorithm should I use," but rather by answering four questions first:
- Environment: What actions can the model take? How does the environment return observations?
- Exploration: How does the model search? Parallel research, local iteration, or specialist domain optimization?
- Reward / Verifier: How is quality judged? Real-world outcomes, rubrics, evaluators, simulations, or tool combinations?
- Learning Loop: How do high-quality trajectories enter training? SFT, IRFT, GRPO, Distillation, or building a continuous learning system?
The following three case studies represent three different answers to these four questions.
2. EchoZ: Turning the Future into Training Data
Future event prediction doesn't look like traditional RLVR (Reinforcement Learning with Verifiable Rewards) because no standard answer exists at the moment the model makes its prediction. However, it has a unique advantage: every event is eventually resolved in the real world. In other words, questions have no labels when they are generated, but ground truth arrives after a waiting period.
EchoZ capitalizes on this by proposing Train-on-Future: instead of training predictive models on historical events that have already occurred, it continuously generates questions about the future. Agents make predictions under incomplete information in the present. Once the real world delivers the outcome, it retroactively evaluates the trajectory quality. EchoZ builds a predictive Agent based on a ReAct-style Thought-Action-Observation loop, saving the entire interaction process as a trajectory. [1][2]
A simplified workflow is:
Real-time trends -> Future question generation -> Multi-Agent research and prediction
-> Wait for event resolution -> Ground truth
-> Brier / Elo ranking -> Rubric process scoring -> High-quality trajectory filtering
-> SFT / RL / Distillation -> New models continue predicting the future

The crux of this is not merely "using the future as labels." On a deeper level, Train-on-Future turns the real world into an asynchronous, continuous, naturally OOD (Out-Of-Distribution) environment sampler.
2.1 Map-Reduce Prediction Agent
EchoZ doesn't have a single Agent search the web and spit out a probability; instead, it breaks the prediction task into Map-Reduce style information gathering and evidence synthesis.
In the Map phase, the system decomposes macro-questions into multiple relatively orthogonal sub-tasks. Different Agents respectively retrieve official documents, news reports, databases, prediction markets, and social signals. In the Reduce phase, aggregation nodes handle source conflicts, distinguishing between first-hand evidence, second-hand reporting, market prices, and noisy signals to output structured probabilistic predictions.
The significance of this step is transforming "predicting an answer" into "generating an auditable evidence trajectory." The trajectory contains not just the final probability, but also why the model checked those sources, what it saw, how it handled conflicting evidence, and how it ultimately calibrated the probability.
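To make the Map-Reduce idea concrete, here is a minimal sketch. It is not EchoZ's implementation: the `Evidence` record, the `TIER_WEIGHT` values, and the weighted-average aggregation are all assumptions chosen for illustration. The point it shows is that the output carries both a probability and an auditable trail of what each sub-agent saw.

```python
from dataclasses import dataclass

# Hypothetical reliability weights per source tier (illustrative values only).
TIER_WEIGHT = {"first_hand": 1.0, "market": 0.7, "second_hand": 0.5, "social": 0.2}

@dataclass
class Evidence:
    sub_question: str
    source_tier: str     # one of the TIER_WEIGHT keys
    probability: float   # probability of "yes" implied by this source
    note: str            # what was observed, kept for auditing

def map_phase(sub_questions, researchers):
    """Map: one researcher callable per sub-question, each returning Evidence."""
    return [research(q) for q, research in zip(sub_questions, researchers)]

def reduce_phase(evidence):
    """Reduce: tier-weighted average of probabilities plus the audit trail."""
    total = sum(TIER_WEIGHT[e.source_tier] for e in evidence)
    prob = sum(TIER_WEIGHT[e.source_tier] * e.probability for e in evidence) / total
    trail = [(e.sub_question, e.source_tier, e.note) for e in evidence]
    return {"probability": prob, "trajectory": trail}

# Stub researchers standing in for real retrieval agents.
researchers = [
    lambda q: Evidence(q, "first_hand", 0.8, "official filing found"),
    lambda q: Evidence(q, "social", 0.3, "unverified posts only"),
]
questions = ["Has the filing been made?", "What is public sentiment?"]
forecast = reduce_phase(map_phase(questions, researchers))
```

A real aggregation node would also resolve conflicts between sources rather than only averaging them; the weighted mean is just the simplest stand-in.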
2.2 Using the Future to Reduce Data Leakage
The trouble with the traditional Train-on-Past paradigm is that historical web pages, news, and outcomes have likely already entered the pre-training corpus. Even with strict time cutoffs, it is extremely difficult to restore the true state of the internet at a historical moment. In fact, attempting to prevent "peeking at the answers" by taking snapshots of historical datasets has proven to be incredibly challenging engineering-wise.
EchoZ's Train-on-Future means completely yielding to real physical time: the system only collects the Agent's complete prediction trajectories before the event occurs, and conducts post-training on these time-locked trajectories after the event resolves. In such a predictive system, the irreversibility of real time is itself the most critical piece of environment engineering.
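A minimal sketch of this time-locking rule follows, assuming a hypothetical `LockedPrediction` record and the Brier score as the outcome metric. The field names and the `score_if_valid` helper are illustrative, not taken from EchoZ.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)  # frozen: the record cannot be edited after creation
class LockedPrediction:
    question_id: str
    probability: float   # forecast probability that the event happens
    made_at: datetime    # when the trajectory was frozen

def brier_score(probability: float, outcome: int) -> float:
    """Squared error between forecast and the 0/1 outcome (lower is better)."""
    return (probability - outcome) ** 2

def score_if_valid(pred: LockedPrediction, resolved_at: datetime,
                   outcome: int) -> Optional[float]:
    """Score only predictions locked strictly before the event resolved."""
    if pred.made_at >= resolved_at:
        return None  # made at or after resolution: possible leakage, discard
    return brier_score(pred.probability, outcome)

early = LockedPrediction("q1", 0.7, datetime(2025, 1, 1, tzinfo=timezone.utc))
late = LockedPrediction("q1", 0.99, datetime(2025, 3, 1, tzinfo=timezone.utc))
resolved = datetime(2025, 2, 1, tzinfo=timezone.utc)

early_score = score_if_valid(early, resolved, outcome=1)
late_score = score_if_valid(late, resolved, outcome=1)
```

Here `early_score` is a Brier score of about 0.09, while `late_score` is `None` because the second prediction was made after the event resolved.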
2.3 Rubrics as the Distillation of Expert Process Knowledge
Future predictions cannot be judged solely by whether the final answer is right. A rigorous judgment might fail due to a black swan event, while a poor judgment might hit the mark through sheer luck. If trained directly with outcome rewards, the model would mistake real-world noise for reasoning signals.
Therefore, EchoZ uses multi-dimensional rubrics to evaluate the prediction process. The original rubrics can be roughly divided into four categories:
| Category | Focus Area |
|---|---|
| Sourcing | Whether first-hand sources are used; filtering old news spun as new, misleading search snippets, and metadata errors |
| Logic | Understanding resolution criteria, proper entity disambiguation, distinguishing verbal claims from actual execution |
| Timeline | Calculating remaining time windows, considering process lags, trigger events, and exit paths |
| Calibration | Using base rates of similar historical events, treating a lack of evidence as negative evidence, matching probability with evidence strength |
These rubrics essentially break down analysts' implicit methodologies into machine-executable process evaluation dimensions.
Even more interesting is that EchoZ does not entirely rely on human-written rubrics. It uses resolved events to calculate the true Elo rankings of models or trajectories, then has candidate rubrics score the same batch of trajectories. It compares the Spearman correlation between the rubric ranking and the true Elo ranking, ultimately retaining the rubrics that best predict true win rates. [2]
This step transforms rubrics from empirical rules into reward models iteratively calibrated by the real world. Ground truth is not just the final label; it can also be used to select more reliable standards for process evaluation.
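The rubric selection step can be sketched as follows. This is an illustrative sketch only: the rubric names, scores, and Elo values are made up, and the `select_rubrics` helper is hypothetical. Spearman correlation is computed as the Pearson correlation of average ranks, so ties are handled; the sketch assumes neither input is constant.

```python
def average_ranks(values):
    """Rank values 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def select_rubrics(elo, rubric_scores, keep=1):
    """Keep the rubrics whose scores best track the ground-truth Elo ranking."""
    ranked = sorted(rubric_scores,
                    key=lambda name: spearman(rubric_scores[name], elo),
                    reverse=True)
    return ranked[:keep]

# Made-up data: four trajectories, their Elo ratings, and two candidate rubrics.
elo = [1500, 1620, 1480, 1700]
rubric_scores = {"sourcing": [3, 4, 2, 5], "verbosity": [5, 2, 4, 1]}
kept = select_rubrics(elo, rubric_scores)
```

In this toy data the "sourcing" rubric orders the trajectories exactly as Elo does, while "verbosity" is anti-correlated, so only "sourcing" is retained.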
3. Simple-TES: Distilling Test-Time Search into Scientific Discovery Capabilities
EchoZ utilizes the asynchronous feedback of "the future will provide the answer." Simple-TES, on the other hand, targets a different class of tasks: candidate solutions can be scored by an evaluator, but the optimal solution is hard to write out directly. Examples include GPU kernel optimization, quantum circuit compilation, algorithmic engineering, combinatorial constructions, circle packing, Hadamard matrices, scRNA-seq denoising, and scaling law discovery. [3]
The commonality across these tasks is that the solution space is highly non-convex, both local refinements and global jumps are crucial, and a massive amount of failed attempts intrinsically contain information. Therefore, the core of Simple-TES is not having the model get the right answer in one go, but conducting organized trial-and-error during test time.
3.1 C × L × K: How to Allocate Search Budgets
Simple-TES decomposes the total evaluation budget into three dimensions:
| Dimension | Meaning | Role |
|---|---|---|
| C | Number of parallel trajectories | Global exploration, avoiding early path lock-in |
| L | Iterations per trajectory | Step-by-step improvement using feedback |
| K | Candidates per iteration | Reducing single-step sampling noise |
In each iteration, every trajectory constructs a prompt based on historical feedback, generates
This illustrates that test-time scaling is not blind over-sampling, but budgeting across global exploration, local enumeration attempts, and long-term feedback accumulation. Mathematical construction tasks often rely more on
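A minimal sketch of how such a budget could be spent is shown below. It assumes C is the number of parallel trajectories, L the iterations per trajectory, and K the candidates per iteration, matching the order of the table rows. The `propose` and `evaluate` functions are toy stand-ins for the LLM and the task evaluator, not part of Simple-TES.

```python
import random

def tes_search(propose, evaluate, C, L, K, seed=0):
    """Run C independent trajectories, each with L iterations of K candidates."""
    rng = random.Random(seed)
    trajectories = []
    for _ in range(C):
        history = []  # (candidate, score) pairs visible to this trajectory
        for _ in range(L):
            # All K candidates in an iteration see the same prior feedback.
            candidates = [propose(history, rng) for _ in range(K)]
            history.extend((c, evaluate(c)) for c in candidates)
        trajectories.append(history)
    return trajectories

def evaluate(x):
    """Toy evaluator: higher is better, optimum at x = 3."""
    return -(x - 3.0) ** 2

def propose(history, rng):
    """Toy proposer: random start, then perturb the best candidate so far."""
    if not history:
        return rng.uniform(-10.0, 10.0)
    best_x, _ = max(history, key=lambda pair: pair[1])
    return best_x + rng.gauss(0.0, 1.0)

runs = tes_search(propose, evaluate, C=4, L=5, K=3)
total_evaluations = sum(len(h) for h in runs)  # C * L * K evaluator calls
best_score = max(score for h in runs for _, score in h)
```

The total evaluator cost is fixed at C * L * K, so raising one dimension under a fixed budget means lowering another, which is the trade-off the section describes.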
3.2 Context Selection is also Part of the Policy
The Context Builder of Simple-TES does not stuff all history into the prompt; instead, it decides which successes, failures, and intermediate explanations should enter the context.
Three selection strategies are named: RPUCG, Balance, and LLM-Elite. RPUCG is similar to the PUCT (Predictor Upper Confidence Bound applied to Trees) algorithm used by DeepMind in the AlphaGo / AlphaZero series of works [4][5], balancing high-scoring nodes, nodes likely to yield high-scoring descendants, and under-explored nodes. Balance simultaneously retains best, elite, explore, and random candidates to explicitly maintain diversity. LLM-Elite uses an auxiliary LLM to maintain an elite pool of methodologically diverse solutions.
The principle behind this is highly important: in Agentic RL, memory/context selection itself is part of the policy. What failures, successes, and intermediate reasoning the model sees will directly dictate its next search direction.
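For reference, the sketch below implements the standard PUCT selection score from the AlphaGo Zero line of work, not RPUCG itself, whose exact form is not given here. The node dictionaries, the example values, and the `select_context_node` helper are hypothetical.

```python
import math

def puct_score(q, prior, parent_visits, visits, c_puct=1.5):
    """Value estimate plus an exploration bonus that shrinks with visits."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + visits)

def select_context_node(nodes, c_puct=1.5):
    """Pick the node with the highest PUCT score among siblings."""
    parent_visits = sum(n["visits"] for n in nodes)
    return max(
        nodes,
        key=lambda n: puct_score(n["q"], n["prior"], parent_visits,
                                 n["visits"], c_puct),
    )

# Hypothetical candidate solutions: q is the average score of descendants,
# prior is a model-assigned promise, visits counts prior expansions.
nodes = [
    {"name": "A", "q": 0.8, "prior": 0.3, "visits": 20},
    {"name": "B", "q": 0.5, "prior": 0.5, "visits": 0},
    {"name": "C", "q": 0.6, "prior": 0.2, "visits": 5},
]
chosen = select_context_node(nodes)
```

Node B wins here despite its lower value estimate, because it has never been expanded and its exploration bonus dominates. That is the behavior the text attributes to this family of selectors: under-explored nodes compete with high scorers for a place in the context.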
3.3 From Candidate-Level Rewards to Trajectory-Level Learning
Simple-TES does not simply train the model on the immediate score of each candidate, as this would induce short-sighted policies. In scientific discovery, early low-scoring attempts could be the scaffolding for later breakthroughs, and a failure at a certain step might expose a critical error pattern.
Thus, it employs trajectory-level post-training: first heavily sampling trajectories, then sorting them by the historical highest score reached in each trajectory, retaining only the Top
This looks like SFT, but the data isn't static human demonstrations; rather, it consists of high-quality exploration paths filtered by the evaluator from massive rollouts. Simple-TES's analysis reveals a very telling statistic: out of millions of candidate trajectories in the cold-start phase, only about 0.48% are ultimately kept as optimal trajectories and refined into high-quality training data.
In other words, Simple-TES uses expensive test-time compute to distill a low-density exploration space into high-quality training data. Large-scale search precedes learning, and learning conversely elevates the efficiency of the next round of search and research.
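The filtering step can be sketched as follows, assuming each trajectory is a list of `(candidate, score)` pairs. The `keep_fraction` parameter and the helper names are hypothetical, and the rollout data is made up.

```python
def best_score(trajectory):
    """Highest score reached anywhere in the trajectory."""
    return max(score for _, score in trajectory)

def select_top_trajectories(trajectories, keep_fraction):
    """Rank whole trajectories by their best score and keep the top slice."""
    ranked = sorted(trajectories, key=best_score, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]

# Made-up rollouts: each trajectory is a list of (candidate, score) steps.
rollouts = [
    [("a0", 0.10), ("a1", 0.20)],
    [("b0", 0.05), ("b1", 0.90)],  # weak start, strong finish
    [("c0", 0.40), ("c1", 0.30)],
    [("d0", 0.00), ("d1", 0.10)],
]
kept = select_top_trajectories(rollouts, keep_fraction=0.25)
```

Note that the kept trajectory includes its low-scoring first step `b0`. Ranking by the trajectory's best score, rather than per candidate, is what lets early scaffolding survive into the training data.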
4. ether0: Multi-Tier Verifiers in Chemical Reasoning
ether0 targets chemical reasoning and molecular design. It's not an ordinary chemical Q&A model, but a 24B reasoning model based on Mistral-Small-24B, trained on 640,730 experimentally grounded chemistry problems covering 375 tasks. The task scope ranges from synthesizability, blood-brain barrier penetration, and human receptor activity to odor. [6]
These types of tasks are perfectly suited to test the "hard to generate, easy to verify" concept discussed in Part I: designing a molecule that satisfies constraints is difficult, but checking whether a candidate molecule satisfies SMILES validity, molecular formulas, functional groups, reaction feasibility, or a specific property prediction is relatively easier.
4.1 From SFT to Generalist GRPO
The training pipeline of ether0 can be compressed into four steps:
- Long CoT SFT Cold Start: Using strong models to generate long reasoning chains, filtered by formatting, SMILES/SMIRKS validity, and LLM-as-Judge.
- Specialist GRPO: Training specialists by task family, allowing the model to explore within narrower distributions first. The relative advantage estimation of GRPO comes from the critic-free RL approach in the DeepSeekMath work [7].
- Specialist Distillation: Collecting correct trajectories during specialist training, filtering out low-quality reasoning, non-English outputs, and bad molecular structures, and distilling them back into a generalist model. This mirrors the trajectory distillation idea of the teacher model in DeepSeek-V4 [8].
- Generalist GRPO: Joint training across all tasks, incorporating curriculums, molecule quality bonuses, and safety alignment.
The significance of specialists and curriculums is keeping the rollouts near the boundaries of the model's capabilities. If a batch of completions is entirely correct or entirely wrong, the GRPO advantage approaches zero, and the training signal is wasted.
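The vanishing-signal effect is easy to see in a small sketch of group-relative advantages, where each reward is normalized by the mean and standard deviation of its group. The use of population standard deviation and the `eps` term are assumptions of this sketch.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

all_correct = grpo_advantages([1.0, 1.0, 1.0, 1.0])  # task too easy
all_wrong = grpo_advantages([0.0, 0.0, 0.0, 0.0])    # task too hard
mixed = grpo_advantages([1.0, 0.0, 0.0, 1.0])        # near capability boundary
```

Both uniform groups yield advantages of exactly zero, so they contribute no gradient. Only the mixed group, where some completions succeed and others fail, produces a usable signal (roughly +1 for successes and -1 for failures).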
4.2 Reward is Verifier Dispatch, Not a Single Scorer
The reward in ether0 is not a single model assigning a total score, but an ensemble of domain tools, rules, databases, and surrogate models:
| Task | Verifier Example |
|---|---|
| IUPAC / MCQ | String or normalized matching |
| SMILES completion | RDKit validity |
| Molecular formula | Hill notation / Molecular formula constraints |
| Functional group | Molecular formula + functional group dual constraints |
| Solubility edit | Property predictors like KDESol |
| Retrosynthesis | Purchasability Bloom filter + Molecular Transformer |
| Reaction prediction | Exact or soft matching of products |
This is the quintessential form of scientific and engineering RL: reward is not an abstract scalar, but a suite of verifier dispatch operations. Low-cost rules first filter out invalid samples, and high-fidelity models and databases then calibrate high-value areas. For real scientific tasks, a single oracle is often too expensive, and a single rule is too weak; fusing multi-tier verifiers is what makes reward computation scale to complex real-world problems.
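A minimal sketch of tiered verifier dispatch is shown below. Everything in it is a stand-in: `balanced_parentheses` replaces a real validity check such as RDKit parsing, `toy_property_score` replaces a surrogate property model, and the `REGISTRY` layout is hypothetical rather than ether0's actual code.

```python
def balanced_parentheses(smiles: str) -> bool:
    """Cheap rule: non-empty and parentheses balance (stand-in for real parsing)."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and bool(smiles)

def toy_property_score(smiles: str) -> float:
    """Placeholder for an expensive surrogate model; here a length heuristic."""
    return min(len(smiles) / 10.0, 1.0)

# Task type -> (cheap checks run first, expensive scorer run last).
REGISTRY = {
    "smiles_completion": ([balanced_parentheses], lambda s: 1.0),
    "solubility_edit": ([balanced_parentheses], toy_property_score),
}

def dispatch_reward(task: str, answer: str) -> float:
    """Route the answer to its task's verifiers; fail fast on cheap checks."""
    checks, scorer = REGISTRY[task]
    if not all(check(answer) for check in checks):
        return 0.0  # invalid sample never reaches the expensive tier
    return scorer(answer)

valid = dispatch_reward("smiles_completion", "CC(=O)O")
broken = dispatch_reward("smiles_completion", "CC(=O")
edited = dispatch_reward("solubility_edit", "CCO")
```

The ordering is the design choice that matters: cheap rules gate access to the expensive scorer, so most invalid rollouts cost almost nothing to reject.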
4.3 Anti-Cheating Must be Written into the Reward
Chemical tasks are highly susceptible to reward hacking. For instance, in retrosynthesis, the target product might be directly placed into the reactants, generating syntactically valid but physically unreasonable molecules, or exploiting loopholes in surrogate models to optimize a single metric at the expense of synthesizability.
ether0's defense lines include reasonable molecule checks, bad substructure / negative SMARTS patterns, combinations of format rewards and accuracy rewards, molecule quality bonuses, and safety alignment. This demonstrates that real-world RL rewards must contain at least three types of signals: target rewards, constraint rewards, and anti-cheating penalties.
Simply defining "what you want" is not enough; you must also clearly define "which shortcuts don't count."
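The three signal types can be combined in a small sketch for the retrosynthesis case. The function name, the penalty value, and the literal string comparison are assumptions of this sketch; a real system would canonicalize molecules before comparing them.

```python
def retrosynthesis_reward(product, reactants, target_score, purchasable,
                          cheat_penalty=-1.0):
    """Combine a target reward, a constraint reward, and an anti-cheating penalty."""
    # Anti-cheating: proposing the product as its own reactant is a shortcut.
    if product in reactants:
        return cheat_penalty
    # Constraint: every proposed reactant must be purchasable.
    if not all(r in purchasable for r in reactants):
        return 0.0
    # Target: score from a hypothetical reaction-feasibility model.
    return target_score

catalog = {"CCO", "CC(=O)O"}  # made-up purchasable set
honest = retrosynthesis_reward("CC(=O)OCC", ["CCO", "CC(=O)O"], 0.9, catalog)
cheat = retrosynthesis_reward("CC(=O)OCC", ["CC(=O)OCC"], 0.9, catalog)
unbuyable = retrosynthesis_reward("CC(=O)OCC", ["CCO", "CC(=O)Cl"], 0.9, catalog)
```

The cheating case is scored below a merely failed attempt (a negative penalty versus zero), which encodes the rule that a shortcut is worse than an honest miss.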
5. Environment and Task-Driven Data AI Engine
| Dimension | EchoZ | Simple-TES | ether0 |
|---|---|---|---|
| Task Type | Future event prediction | Scientific/Engineering search | Chemical reasoning and molecular design |
| Exploration Method | Map-Reduce multi-Agent research | C × L × K test-time search with context selection | Specialist GRPO + distillation |
| Reward | Ground truth, Brier/Elo, rubrics | Evaluator score, trajectory best score | Format reward, domain verifiers, quality bonus |
| Data Flywheel | Train-on-Future | Heavy filtering post massive rollouts | Distilling correct specialist trajectories into generalist |
| Core Risks | Time leakage, real-world noise, probability distortion | Short-sighted rewards, evaluator overfitting, compute cost | Reward hacking, unreasonable molecules, trivial groups |
Fusing the case studies above reveals several common principles for applying Agentic RL in real-world systems.
First, construct the environment before discussing the algorithm. Without an actionable, observable, and verifiable environment, even the best RL algorithm is just performing gradient descent on noise.
Second, the reward must simultaneously satisfy credibility, density, low cost, and anti-cheating measures. EchoZ solves the delayed ground-truth and process rubric calibration issue; Simple-TES addresses evaluator-driven massive trial-and-error; ether0 handles the hierarchical dispatch of domain verifiers and anti-cheating constraints.
Third, trajectories are more important than answers. The training object of Agentic RL is not the answer, but the trajectory. The answer is just the last line of the trajectory; what is truly transferable is how the model searches, verifies, corrects, and terminates.
Fourth, high-quality data comes from meticulous selection following massive exploration. EchoZ filters for high-rubric/high-Elo trajectories, Simple-TES retains only a tiny fraction of elite trajectories from millions of candidates, and ether0 keeps correct trajectories while filtering out low-quality reasoning and bad molecular structures.
Fifth, Rubrics or Multi-Tier Verifiers will become crucial middle layers for open-task RL. Confronting tasks that are difficult to evaluate directly, EchoZ distilled experts' prediction processes into multi-dimensional rubrics, while ether0 built a multi-tier verifier stack ranging from lightweight rules to high-fidelity models. Both provide exploration directions and process constraints to the model through explicit middle layers.
Although works like CL-Bench have shown that current models perform poorly when faced with a large number of entirely new rubric constraints [9], this has simultaneously spurred technologies like Rubrics-to-Tokens to attempt transforming macro rubrics into finer-grained token-level rewards [10] to solve the reward credit assignment problem in complex tasks.
These directions collectively suggest: the future moat for domain models and agent systems will not merely be parameter scale, but the data AI production system formed around domain environment tools, task rewards, verifiers, trajectory databases, and closed-loop training.
However, when these complex designs move towards deployment, the bottleneck inevitably shifts down to the infrastructure layer: massive concurrent rollout requests, asynchronous scheduling of multi-tier verifiers, weight flows between training and inference engines, and the secure isolation of Agent interaction environments. All of these heavily test the throughput, stability, and scalability of the underlying systems. To address precisely these engineering challenges, the next article in this series turns to infrastructure purpose-built for Agentic RL, such as Verl and SkyRL.
References
[1] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR, 2023.
[2] UniPat AI. "Echo: Towards General AI Prediction." 2026.
[3] Haotian Ye, Haowei Lin, Jingyi Tang, Yizhen Luo, Caiyin Yang, Chang Su, Rahul Thapa, Rui Yang, Ruihua Liu, Zeyu Li, Chong Gao, Dachao Ding, Guangrong He, Miaolei Zhang, Lina Sun, Wenyang Wang, Yuchen Zhong, Zhuohao Shen, Di He, Jianzhu Ma, Stefano Ermon, Tongyang Li, Xiaowen Chu, James Zou, and Yuzhi Xu. "Evaluation-driven Scaling for Scientific Discovery." arXiv:2604.19341, 2026.
[4] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. "Mastering the game of Go with deep neural networks and tree search." Nature, 529(7587): 484-489, 2016.
[5] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. "Mastering the game of Go without human knowledge." Nature, 550(7676): 354-359, 2017.
[6] Siddharth M. Narayanan, James D. Braza, Ryan-Rhys Griffiths, Albert Bou, Geemi P. Wellawatte, Mayk Caldas Ramos, Ludovico Mitchener, Samuel G. Rodriques, and Andrew D. White. "Training a Scientific Reasoning Model for Chemistry." arXiv:2506.17238, 2025.
[7] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, and Daya Guo. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv:2402.03300, 2024.
[8] DeepSeek-AI. "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence."
[9] Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang, Yujiong Shen, Junzhe Wang, Jiayi Chen, Yuchen Ni, Junjie Ye, Cheng Zhang, Huaibing Xie, Jianglu Hu, Shaolei Wang, Weichao Wang, Yanling Xiao, Yiting Liu, Zenan Xu, Zhen Guo, Pluto Zhou, Tao Gui, Zuxuan Wu, Xipeng Qiu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Di Wang, and Shunyu Yao. "CL-bench: A Benchmark for Context Learning." arXiv:2602.03587, 2026.
[10] Tianze Xu, Yanzhao Zheng, Pengrui Lu, Lyumanshan Ye, Yong Wu, Zhentao Zhang, Yuanqiang Yu, Chao Ma, Jihuai Zhu, Pengfei Liu, Baohua Dong, Hangcheng Zhu, Ruohui Huang, and Gang Yu. "Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks." arXiv:2604.02795, 2026.