Agentic RL: A New Paradigm for Self-Evolving Large Models (Part I)
The development of large models is undergoing a profound structural shift. From the initial question-answering chat assistants, they are gradually evolving into Agentic models capable of actively retrieving information, invoking tools, and autonomously completing multi-step tasks in complex environments. In this evolution, Agentic RL (Reinforcement Learning for Agents) is emerging as the core engine for underlying model optimization. It transforms a large model from a generator that can only output text into a dynamic learning system capable of taking actions, observing, correcting errors, accumulating trajectories, and continuously self-improving in an environment.
1. What are Agentic Models
To understand Agentic RL, we first need to understand what Agentic Models are. Their fundamental difference from ordinary chat models lies in that they are not just generating text, but executing actions in an environment, observing feedback, and completing tasks through multiple iterations. Agentic Models can call a browser to search, write code and execute it in a sandbox, query SQL databases, operate GUI interfaces, or even submit experimental parameters to real scientific simulators, and then continue to the next action based on the feedback from the environment.
We can describe this process using several core concepts (a minimal data-structure sketch follows the list):
- Problem: The ultimate goal the model needs to solve, such as fixing a bug, predicting a stock, or optimizing a molecular structure.
- Action: The specific operation executed by the model, such as calling an API or running code.
- Observation: The true feedback from the environment on the action, such as error logs, web page content, or simulation scores.
- Turn: A complete Action and Observation loop.
- Trajectory: The complete multi-step exploration path from the start of the task to ultimate success (or failure).
- Reward: The evaluation of this trajectory or a specific step within it by the environment, validator, or rule system.
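As a rough sketch, these concepts map naturally onto simple data structures; the field names below are illustrative rather than taken from any particular framework:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Turn:
    """One complete Action-Observation loop."""
    action: str        # e.g. a tool call, a shell command, or code the model emits
    observation: str   # the environment's feedback: error logs, web content, scores

@dataclass
class Trajectory:
    """The full multi-step exploration path for a single problem."""
    problem: str                       # the ultimate goal the model must solve
    turns: list[Turn] = field(default_factory=list)
    reward: Optional[float] = None     # assigned at the end by a verifier or rule system
```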
Traditional LLMs are more like static knowledge compressors: the user provides a prompt, and the model generates a single-shot answer. Agentic Models, however, are policy models with external knowledge memory and action capabilities: they can break down complex tasks, explore via trial and error externally, incorporate feedback into context, and dynamically adjust their policies.
Core Differences Between Ordinary LLMs and Agentic Models
| Dimension | Ordinary LLM | Agentic Model |
|---|---|---|
| Interaction Mode | Single-turn or short conversations | Multi-step Action-Observation Loop [1] |
| Primary Capabilities | Generation, summarization, Q&A | Planning, tool calling, error correction, environment exploration |
| Feedback Source | User subjective scoring or preference (RLHF) | Tool results, code errors, simulator scores, real-world validation |
| Training Challenges | Quality of answers and alignment with human values | Long-horizon Credit Assignment and acquiring dense environment signals |
When a model starts executing actions, the core problem of training shifts from evaluating the quality of generated text to measuring whether the system can ultimately complete the task, and whether each action brings the system closer to the goal. This is exactly why Reinforcement Learning (RL) must step in.
2. Why Do Agentic Models Need Reinforcement Learning?
Historically, we have been accustomed to using Supervised Fine-Tuning (SFT) to tune models after pre-training. However, applying SFT to Agents quickly hits a ceiling.
The essence of SFT is imitation. It relies on high-quality human expert demonstration trajectories. Yet, the task path space that Agents face in the real world is incredibly vast:
- Tool calling is highly dependent on formats and states; a small number of successful demonstrations simply cannot cover the massive volume of edge cases and error scenarios.
- For complex software engineering, mathematical derivation, or scientific discovery (like predicting protein or crystal structures), true expert trajectories are scarce and expensive.
- Once a model is deployed, environmental rules might change, and static SFT datasets can quickly become obsolete.
The entry point for Reinforcement Learning (RL) is precisely here: it does not require a given static optimal trajectory; it only demands that the environment provide an evaluation of the result. For scenarios with objective verification mechanisms, such as code generation, SQL, and theorem proving, the model can achieve self-evolution entirely through massive trial-and-error exploration in the environment.
But implementing RL in the Agent scenario faces three extremely tricky challenges:
- Sparse Reward: In many tasks, the system only learns whether it succeeded or failed at the very last step, say the 50th; the preceding 49 steps are pure exploration with no explicit right-or-wrong signal.
- Credit Assignment: If the ultimate task fails, was it the data lookup in step 3 that was wrong, or the code written in step 40 that had issues?
- Environment Cost: Feedback from the real environment can be very slow and costly (e.g., executing a Density Functional Theory (DFT) calculation or running a real biological wet-lab experiment).
The algorithmic evolution of Agentic RL is essentially the continuous attempt to solve these three major difficulties: transforming sparse, expensive, and noisy environmental feedback into dense, stable, and scalable training signals.
3. Reward: The Iterative Core of Agentic RL
Since the goal is to better acquire environmental feedback, the design of the Reward is the central element of the entire system.
Common Rewards fall into several types: Human Feedback (accurate but expensive) and LLM/AI Judges (relatively low cost but prone to hallucination). In the Agent domain, the more reliable, easily scalable, and currently fastest-iterating approach is the Verifiable Reward.
3.1 Verifiable Reward
In many tasks, as long as the task result can be objectively verified, the model can break the bottleneck of human annotation and kickstart the data flywheel (a minimal verifier sketch follows the list):
- Mathematics: Whether the final answer matches the Ground Truth (e.g., DeepSeekMath [2]'s practice).
- Code: Whether it compiles, passes unit tests, and completes the designated task with high performance.
- Games: Whether a level is cleared or the score increases.
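A minimal sketch of such an outcome verifier for the math case, assuming final answers are marked with `\boxed{...}`; the extraction heuristic is illustrative, and real systems use far more robust parsing and normalization:

```python
import re

def extract_final_answer(model_output: str) -> str | None:
    """Pull the last \\boxed{...} answer out of the model's output (illustrative heuristic)."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", model_output)
    return matches[-1].strip() if matches else None

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Binary outcome reward: +1 if the final answer matches the ground truth, else -1."""
    answer = extract_final_answer(model_output)
    return 1.0 if answer is not None and answer == ground_truth.strip() else -1.0
```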
Verifiable Reward is the most direct Outcome Reward. But this is not enough: if a task has dozens of steps and a +1 or -1 is only given at the very end, the learning signal is extremely sparse, the model cannot pinpoint where things went wrong, and learning efficiency plummets.
3.2 From Outcome to Process Reward
To solve the sparse signal problem of long-horizon tasks, the PRM (Process Reward Model) emerged (refer to OpenAI's classic paper Let's Verify Step by Step [3]).
PRM no longer just looks at the final outcome; instead, it delves into the trajectory itself. It takes the Action and Observation of each step as context and evaluates whether that step substantially advanced the task. Intuitively, the final reward combines the outcome reward with per-step process scores; one common form (the exact weighting is a modeling choice) is:
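$$
R(\tau) = R_{\text{outcome}}(\tau) + \lambda \sum_{t=1}^{T} r_{\text{PRM}}(s_t, a_t, o_t)
$$

where $\tau$ is the trajectory, $s_t$ is the accumulated context before step $t$, $a_t$ and $o_t$ are the Action and Observation at step $t$, and $\lambda$ balances the dense process term against the sparse outcome term.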
In tool-calling scenarios, command-line return results and compiler errors naturally constitute highly valuable Next-State Signals. The advantage of PRM is not that it is especially smart, but that it breaks the black-box long trajectory down into locally learnable, dense feedback.
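In code, producing this dense per-step feedback might look like the following sketch; `prm_score` stands in for whatever process reward model or heuristic the system actually uses, and `Trajectory`/`Turn` refer to the illustrative structures above:

```python
def process_rewards(trajectory, prm_score) -> list[float]:
    """Score every step against the context accumulated so far (sketch)."""
    context = trajectory.problem
    rewards = []
    for turn in trajectory.turns:
        # The PRM sees the running context plus this step's action and the raw
        # environment feedback (e.g. a compiler error or CLI output) and scores
        # whether the step substantially advanced the task.
        rewards.append(prm_score(context, turn.action, turn.observation))
        context += f"\n{turn.action}\n{turn.observation}"
    return rewards
```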
You might ask: if the model itself can't do the task well, why expect a Judge/PRM to score it accurately? The answer lies in computational complexity: verifying is generally easier than generating. Checking that a filled-in Sudoku grid is valid takes polynomial time, whereas solving one from scratch is much harder (the generalized problem is NP-complete). The Policy Model executing the task must rack its brains to plan paths and call tools, whereas the Judge only needs to look at the current context and tool feedback to provide an objective score.
3.3 Multi-Tier Verifier in Complex Scenarios
In complex industrial, scientific, and financial scenarios, verification costs inherently have massive disparities: an RDKit molecular validity check takes milliseconds, an AlphaFold folding or time-series signal analysis takes minutes, while a Wet Lab experiment or waiting for real market feedback might take days.
To achieve an optimal balance between "verification fidelity" and "data production cost," this article proposes introducing the Multi-Tier Verification ladder:
| Verification Tier | Example Tools | Cost | Core Function |
|---|---|---|---|
| Fast-tier | Knowledge graph alignment, data Schema validation, RDKit molecular syntax | Extremely Low | Millisecond/second-level filtering of low-level errors and factual hallucinations |
| Simulation-tier | Machine Learning Force Field (MLFF), AlphaFold [4] structure prediction, time-series signal prediction | Medium | Provides high-fidelity approximate physical surrogate feedback |
| High-fidelity tier | Density Functional Theory (DFT) calculations, Molecular Dynamics (MD) simulations | Relatively High | Provides rigorous, strongly constrained high-quality physics computation verification |
| World-tier | Wet Lab, expert human feedback, real trading markets | High | Provides the ultimate real-world ground truth adjudication |
During system exploration, initial hypotheses generated by the model first go through the low-cost Fast-tier; only when verification passes, or when encountering high-value and highly uncertain candidate solutions, will the system schedule them to the expensive Simulation-tier or World-tier. This turns the "verifier strength" and "data production cost" into an orchestratable resource scheduling problem.
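A hedged sketch of that scheduling logic is below; the tier callables, thresholds, and the `budget` object are all illustrative stand-ins for RDKit/Schema checks, MLFF or AlphaFold surrogates, DFT/MD jobs, and wet-lab or market queues:

```python
def tiered_verify(candidate, fast_check, simulate, high_fidelity, uncertainty, budget):
    """Escalate a candidate up the verification ladder only when cheaper tiers pass
    and the candidate looks valuable or uncertain enough to justify the cost (sketch)."""
    # Fast-tier: millisecond-level filtering of low-level errors and hallucinations.
    if not fast_check(candidate):
        return {"tier": "fast", "passed": False}

    # Simulation-tier: approximate physical surrogate feedback (minutes).
    sim_score = simulate(candidate)
    if sim_score < 0.5 and uncertainty(candidate) < 0.2:
        return {"tier": "simulation", "passed": False, "score": sim_score}

    # High-fidelity / World-tier: reserved for promising or highly uncertain
    # candidates, and only if the cost budget allows it.
    if budget.allow("high_fidelity"):
        return {"tier": "high_fidelity", "passed": high_fidelity(candidate), "score": sim_score}
    return {"tier": "simulation", "passed": True, "score": sim_score}
```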
4. The Evolutionary History of RL Algorithms: From Sparse Scalars to Dense Vector Signals
With the continuous refinement of Reward signals, the underlying RL algorithms are also evolving. PPO, GRPO, PRM-guided training, and OPD are all products of the ongoing search for better trade-offs among engineering complexity, computational resources, and signal quality.
4.1 PPO and GRPO: The Classic and Innovation of Actor-Critic
PPO (Proximal Policy Optimization) has been the mainstream algorithm of the RLHF era for large models (refer to OpenAI's classic PPO paper [5]). Its core characteristic is that it needs only a single rollout per prompt: the Critic (value model) provides a baseline estimate for each step (Action or Token) in the trajectory, and the system computes the Advantage of the actual Reward obtained by the action relative to the Critic's estimate to guide the update. Training then continuously increases the probability of these advantageous actions. This mechanism provides very fine-grained Token-level feedback and stays stable under KL-divergence constraints. However, it is also costly: it requires maintaining an extra Critic model of the same scale as the main model, which is a heavy burden on training VRAM and cluster communication.
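For concreteness, the token-level clipped surrogate at the heart of PPO looks roughly like this (a minimal sketch, not a full RLHF trainer; how the advantages are estimated, e.g. via GAE against the Critic's values, is left out):

```python
import torch

def ppo_clipped_loss(logprobs_new, logprobs_old, advantages, clip_eps: float = 0.2):
    """PPO clipped surrogate loss over tokens/actions (sketch).

    advantages: per-token advantage estimates, i.e. how much better the action's
    return was than the Critic's baseline."""
    ratio = torch.exp(logprobs_new - logprobs_old)                    # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                      # maximize the surrogate
```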
To reduce the system overhead of training LLMs, teams like DeepSeek developed the GRPO (Group Relative Policy Optimization) [2] algorithm. GRPO's key innovation is to generate multiple trajectories (e.g., 16) for the same prompt in one batch and compute local relative advantages by comparing the Rewards within this group. It no longer asks for the absolute score of every single step, but instead evaluates which path is relatively better among the current batch of attempts, thereby avoiding the need for an expensive Critic model. In math and coding scenarios with strong Verifiers, GRPO has demonstrated extremely high cost-effectiveness.
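The group-relative advantage can be sketched in a few lines; this assumes one scalar verifier reward per sampled trajectory, and normalization details vary across implementations:

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Relative advantages within a group of rollouts for the same prompt (sketch).

    group_rewards: shape (G,), one verifier reward per sampled trajectory. Each
    trajectory's advantage is its reward standardized against the group, so no
    separate Critic model is needed."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)
```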
4.2 OPD: The Organic Fusion of RL and SFT
Whether PPO or GRPO, what they essentially provide is a Scalar Reward: telling the model if the current action is good (+1) or bad (-1). But when the model makes a mistake, it often doesn't know exactly which Token was written wrong.
OPD (On-Policy Distillation) further improves the quality of training signals. In OPD, the Student model first generates exploration trajectories based on its own policy (On-Policy). Subsequently, the system introduces a powerful Teacher model to calculate Token-level target probability distributions on the same trajectory to guide learning.
Taking the recently released DeepSeek-V4 [6] as an example, in order to integrate the capabilities of multiple experts and solve the problem of excessive variance in traditional scalar valuation, it adopted Multi-Teacher Full-Vocabulary OPD: The system first trains multiple domain-expert models (Teachers). Then, on the trajectories generated by the Student's trial-and-error, it directly calculates and merges the complete Logit distributions over the Full-Vocabulary from these experts, using this as the optimization target for Reverse KL loss. Compared to traditional single-Token advantage estimation, this dense probability distribution over the full vocabulary effectively reduces gradient variance and ensures extremely high training stability.
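A minimal sketch of the per-token reverse-KL objective on student-generated trajectories; merging multiple teachers is shown here as a simple probability average, whereas the actual mixing scheme is a design choice:

```python
import torch
import torch.nn.functional as F

def opd_reverse_kl_loss(student_logits: torch.Tensor,
                        teacher_logits_list: list[torch.Tensor]) -> torch.Tensor:
    """Reverse KL D_KL(student || merged teacher) per token, averaged over the sequence (sketch).

    student_logits:      (seq_len, vocab) logits on a trajectory the student generated itself.
    teacher_logits_list: list of (seq_len, vocab) logits from expert teachers on that
                         same trajectory."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_probs = torch.stack(
        [F.softmax(t, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    teacher_logp = torch.log(teacher_probs + 1e-12)
    # Reverse KL: the expectation is taken under the student's own distribution.
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    return kl.mean()
```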
The prominent advantage of OPD is that it simultaneously possesses the On-Policy characteristic of RL (staying close to the current model's actual weaknesses and environment distribution) and enjoys the dense Token-level directional supervision akin to SFT. The model is no longer rolling the dice in blind trial-and-error, but instead has a strict teacher pointing out every specific mistake step-by-step and providing the correction direction for the right Tokens.
4.3 Hindsight Supervision: Learning from Hindsight
In Agent interactions, we frequently encounter this situation: the model finishes running a piece of code, and the terminal pops up a long traceback error (e.g., TypeError: expected string or bytes-like object). Traditional RL would directly issue a -1 penalty. But Hindsight Supervision views this as a huge waste (refer to recent research on OpenClaw-RL [7] and Agent-environment interaction, as well as self-reflection mechanisms like Reflexion [8]).
If we explain this through the lens of the Theory of Slow Thinking [9], the essence of this post-hoc learning is approximating an optimal Posterior Sampler. When facing complex open-ended problems, if a model blindly explores based solely on the problem itself, it is hard to luckily stumble upon the correct reasoning path; but once it knows the post-hoc result (like the final answer or a specific environment error), the system can adopt a global perspective and reversely construct a Bridge Thought that effectively connects the problem and the answer.
In engineering practice, we feed error messages and other post-hoc feedback as Hints to the Teacher model. Aided by this posterior information, the Teacher does not even need overwhelmingly strong capabilities to generate genuinely effective corrective code or reasoning paths, which are then used to supervise the Student model, which only sees the a priori view.
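In data-construction terms, the loop might look like the following sketch; the prompt wording and the `teacher.generate` call are illustrative, and the key point is that the Teacher sees the post-hoc error while the Student does not:

```python
def build_hindsight_example(problem: str, failed_code: str, traceback: str, teacher):
    """Turn a failed rollout plus its environment feedback into a supervised target (sketch)."""
    hint_prompt = (
        f"Task:\n{problem}\n\n"
        f"A previous attempt:\n{failed_code}\n\n"
        f"It failed with:\n{traceback}\n\n"
        "Using this error as a hint, write a corrected solution."
    )
    corrected = teacher.generate(hint_prompt)   # the Teacher sees the posterior information
    # The Student is trained on (problem -> corrected) only, without the hint,
    # so it learns to avoid the error from the a-priori view alone.
    return {"prompt": problem, "target": corrected}
```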
This method of directly transforming Directive Signals from environmental feedback into training data allows models to quickly learn to avoid errors without prompting.
Summary
Looking back on the evolutionary journey of Agentic RL: Rewards have moved from looking only at the final Outcome to step-by-step Process Rewards; algorithms have shifted from the scalar feedback of PPO to the Token-level supervision of OPD; and failed trajectories are no longer discarded, but rather recycled into training data via Hindsight mechanisms. These improvements point in the same direction—transforming sparse, expensive environment signals into dense gradients that models can learn from as efficiently as possible.
In the upcoming content, we will shift from algorithms to engineering and implementation: How does this paradigm actually operate in scenarios like scientific discovery and financial prediction? And how can we leverage frameworks like Verl and SkyRL to build sustainable and iterative Agent training systems?
References
1. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR (2023).
2. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv preprint arXiv:2402.03300 (2024).
3. Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., & Cobbe, K. "Let's Verify Step by Step." arXiv preprint arXiv:2305.20050 (2023).
4. Abramson, J., Adler, J., Dunger, J., Evans, R., Green, T., et al. "Accurate structure prediction of biomolecular interactions with AlphaFold 3." Nature (2024).
5. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. "Proximal Policy Optimization Algorithms." arXiv preprint arXiv:1707.06347 (2017).
6. DeepSeek-AI. "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence."
7. Wang, Y., Chen, X., Jin, X., Wang, M., & Yang, L. "OpenClaw-RL: Train Any Agent Simply by Talking." arXiv preprint arXiv:2603.10165 (2026).
8. Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. "Reflexion: Language Agents with Verbal Reinforcement Learning." NeurIPS (2023).
9. Yang, H., Xu, Z.-Q. J., Xiong, F., & E, W. "A First-Principles Theory of Slow Thinking and Active Perception." ResearchGate (2024).