Agentic RL (Part III):
Architecture Analysis of Verl and SkyRL to Retool-RL Case Practice

In the previous two articles, we discussed the basic concepts and algorithmic evolution of Agentic RL, as well as how real-world tasks can be transformed into trainable environments. However, the truly challenging part is to make the entire closed-loop system operate stably and efficiently.
Take a typical AIME math tool-calling Agent as an example. The user provides a math problem, and the model first generates a problem-solving plan; in the process, it might call a Python Code Tool to calculate results or simplify expressions; Python returns the execution result, an exception, or a timeout, and the model then revises its next plan based on the observation; after a few rounds, it submits the final answer. What SFT usually sees is a static sample: problem, reasoning, and answer. In contrast, what Agentic RL truly hopes to train is the policy within an interactive trajectory: when to call a tool, which tool to call, how to recover after code failure, what kind of intermediate error the environment feedback exposed, and whether the model can translate this feedback into a better next action.
Therefore, the training object of Agentic RL is not a single response, but an interactive trajectory with state transitions. The trajectory is not pre-existing data, but experience sampled in real-time by the current policy in the environment. The reward is not a static label, but may come from the final answer, process reward models, tool observations, and KL penalties. Trajectory generation relies on high-throughput inference engines like vLLM[1] and SGLang[2]; training relies on distributed training backends like FSDP[3], Megatron[4], or DeepSpeed[5]; after each policy update, the new weights must be synchronized back to the inference side to execute the next round of rollouts. If SFT is like learning on a fixed corpus, Agentic RL is more like simultaneously sampling, scoring, updating, and synchronizing in a constantly changing experimental field. This article will provide a detailed analysis of this entire system and process.
1. Analysis of Agentic LLM RL Loop
A minimal LLM RL loop can be written as:
    Policy model
      -> rollout / tool interaction
      -> outcome reward / process feedback
      -> logprobs / value / KL
      -> advantage / return
      -> policy update
      -> sync updated weights back to inference engine
      -> next rollout

Rollout means letting the current model enter the environment to generate its own experience. For standard text RLHF, it might simply be generating a response from a prompt; for Agentic RL, it is often multi-turn tool interaction, including actions, observations, environment states, termination judgments, etc.
The forward pass involves recalculating the probabilities and KL penalties of these tokens or actions under the current policy. The inference engine during generation aims for throughput, while training requires precise per-token logprobs. Much of the system complexity arises because the generation phase and the training phase do not use the same computational topology.
Reward shaping synthesizes signals such as outcome, PRM, KL penalty, and format reward into training signals. The outcome reward in math and coding tasks is relatively clean but too sparse; the process reward is denser, but the judge might misjudge; the KL penalty can prevent policy drift, but it might also suppress exploration.
Advantage answers the question: how much better is this trajectory compared to similar attempts? PPO[6] usually relies on a critic and GAE for calculation; GRPO samples multiple completions for the same prompt and normalizes them within the group using the reward mean and standard deviation, yielding a critic-free relative advantage.
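The group normalization described above can be sketched in a few lines. This is a minimal illustration, not any framework's implementation: the function name `grpo_advantages` and the `eps` stabilizer are assumptions, and real implementations differ on details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for completions sampled from ONE prompt.

    Each reward is centered on the group mean and scaled by the group
    standard deviation, so no critic is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for the same prompt, scored by a binary outcome verifier.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# If every completion gets the same reward, the group carries no signal.
print(grpo_advantages([1.0, 1.0, 1.0]))
```

Correct completions receive a positive advantage and incorrect ones a negative advantage; a group where every attempt succeeds or every attempt fails contributes zero gradient, which is one reason sparse outcome rewards can stall learning.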
Update increases the probability of good actions and suppresses the probability of bad ones. Weight sync hands the new policy back to the inference engine, starting the next round of rollouts with the latest model.
This is also why Agentic LLM RL is much more complex than SFT—SFT's data is basically fixed before training, and training only requires a forward pass, loss calculation, backward pass, and optimizer updates. In contrast, Agentic RL's samples come from the current policy; generation relies on vLLM / SGLang, and training relies on FSDP / Megatron; each batch might need to execute Python, browsers, databases, code repositories, or remote sandboxes; rewards might come from multiple services; the same training round might also involve many modules like actor, critic, reference, reward model, PRM service, tool sandbox, and inference engine. In summary, behind the implementation of Agentic LLM RL is a continuously operating collaborative system of models, data, and environments.
2. Verl and SkyRL: Two System Abstractions
Verl and SkyRL both solve the problem of closing the training loop in LLM RL, but their abstractions emphasize different things. Verl is more like a high-performance LLM RL training execution kernel, with its core being Controller, WorkerGroup, ResourcePool, and DataProto; SkyRL is more like a full-stack platform aimed at multi-turn Agentic environments, centered around Trainer, Generator, Inference Engine, Environment, and Controller.
Both approach the same difficult problem from different angles. Verl starts from distributed computing orchestration, considering how large model computation nodes like actor, critic, reference, and reward model can be orchestrated in a high-throughput, low-redundancy, and scalable manner. SkyRL starts from task interaction and modular interfaces, designing components like generator, environment, inference engine, and trainer to be independently replaceable and flexibly serve multi-turn agentic tasks.
2.1 Verl: Decoupling Control Flow and Computation Flow
The HybridFlow[7] paper behind Verl abstracts RLHF into complex data flows. Nodes in traditional RL might just be small neural network computations; in HybridFlow, each node becomes a distributed LLM program, and many-to-many data resharding is also required between nodes. The key design of HybridFlow is a hybrid single-controller and multi-controller paradigm: high-level RL algorithms are organized by a single-process controller, while low-level model computations are executed by distributed workers.
This design brings about two system levels:
The first level is control flow. The high-level sequence of PPO or GRPO can still be written like a regular Python program: rollout, reward, advantage, update. Researchers focus on algorithm logic and do not need to manually write cross-GPU communication at each step.
The second level is computation flow. Computations such as actor generation, actor training, critic forward, reference forward, and reward model forward are executed by Ray WorkerGroup, FSDP, Megatron, vLLM, or SGLang. DataProto is the core data carrier for passing, slicing, and merging across workers; ResourcePool determines whether roles like actor, rollout, critic, and reward are colocated or disaggregated.
From the user's perspective, the driver program executes sequentially; from the system's perspective, behind every step lies remote dispatch, collect, reshard, and parallel execution. The value of Verl lies in integrating these two perspectives into one system: the algorithm code remains intuitive, while the distributed execution remains efficient.
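The two levels can be illustrated with a toy sketch. It is not Verl's API: `ToyWorkerGroup` and its `run` method are hypothetical names, and threads stand in for distributed GPU workers. The point is only that the driver reads like a sequential program while dispatch and collect happen inside each call.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyWorkerGroup:
    """Hypothetical stand-in for a distributed worker group (threads, not GPUs)."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.pool = ThreadPoolExecutor(max_workers=num_workers)

    def run(self, fn, batch):
        # Dispatch: split the batch into one contiguous chunk per worker.
        size = max(1, -(-len(batch) // self.num_workers))  # ceiling division
        chunks = [batch[i:i + size] for i in range(0, len(batch), size)]
        # Execute in parallel, then collect the results in the original order.
        parts = self.pool.map(lambda chunk: [fn(x) for x in chunk], chunks)
        return [y for part in parts for y in part]


# Control flow: the driver is plain sequential Python.
actor = ToyWorkerGroup(num_workers=2)
rewarder = ToyWorkerGroup(num_workers=3)

prompts = ["p0", "p1", "p2", "p3", "p4"]
responses = actor.run(lambda p: p + ":answer", prompts)               # "rollout"
rewards = rewarder.run(lambda r: float(r.endswith("answer")), responses)  # "reward"
print(responses)
print(rewards)
```

Note that the two groups use different worker counts, so the batch is re-split between steps; in a real system that re-split is the resharding work handled by the data carrier.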
2.2 Key Optimizations of 3D-HybridEngine
The 3D-HybridEngine of HybridFlow / Verl solves one of the most difficult resource scheduling problems in LLM RL: the same actor model must both be trained and be used for generation. The training state requires parameters, gradients, optimizer states, and activations; the generation state requires inference weights and KV cache. If the training engine and the inference engine each hold a long-lived copy of the model on GPU, a large share of GPU memory is spent on redundant weights.
The first level of optimization in 3D-HybridEngine is VRAM time-sharing multiplexing. It tries to release or compress inference-side occupancy during the training phase, and releases training-side temporary states during the generation phase, allowing the same set of GPUs to carry different roles at different stages.
The second level of optimization is efficient weight resharding. The parallel layout of FSDP / Megatron used for training differs from the parallel layout of vLLM / SGLang used for generation; the updated weights must be reassembled from training shards to generation shards. HybridFlow[7] specifically designed model parameter resharding between the training and generation phases, aiming to reduce model saving/loading and cross-device communication IO. The paper reports a throughput improvement of 1.53x to 20.57x over the baseline across different RL algorithms, model scales, and cluster setups.
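To make the layout mismatch concrete, here is a deliberately naive sketch: a flat parameter list is sharded four ways for training and must be re-split two ways for generation. The helper names `shard` and `reshard` are hypothetical, and the gather-then-split strategy is the simplest possible one; the engineering effort in a real engine goes into avoiding exactly this full gather.

```python
def shard(flat, num_shards):
    """Split a flat parameter list into equal contiguous shards."""
    assert len(flat) % num_shards == 0, "toy example assumes even divisibility"
    size = len(flat) // num_shards
    return [flat[i * size:(i + 1) * size] for i in range(num_shards)]


def reshard(shards, new_num_shards):
    """Naive reshard: gather every shard, then re-split for the new layout."""
    flat = [w for s in shards for w in s]
    return shard(flat, new_num_shards)


weights = list(range(8))               # stand-in for updated model weights
train_shards = shard(weights, 4)       # training layout: 4 ranks
gen_shards = reshard(train_shards, 2)  # generation layout: 2 ranks

print(train_shards)
print(gen_shards)
```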
2.3 SkyRL: Generator and Environment Become First-Class Citizens
The entry point of SkyRL[8] is closer to Agentic tasks. It splits the RL stack into modules such as Trainer, Generator, Inference Engine, Environment, and Controller. The Trainer executes optimization steps; the Generator generates a complete trajectory and calculates the reward; the Inference Engine is responsible for model inference rollouts; the Environment executes actions, returning observations and rewards; the Controller manages component placement, initialization, and execution plans.
The SkyRL-v0.1 article[8] emphasizes that many RL frameworks couple core components too tightly, causing any modification to the environment, algorithm, or execution plan to affect the whole system. SkyRL attempts to decouple these modules with clear APIs: one can independently replace the environment, inference engine, or training backend; one can colocate or disaggregate the trainer/generator; and one can run synchronous RL or switch to async rollout or pipelining, which the article reports can improve performance severalfold.
2.4 Comparison of the Two Abstractions
| Dimension | Verl | SkyRL |
|---|---|---|
| Primary Abstraction | Controller / WorkerGroup / ResourcePool / DataProto | Trainer / Generator / Environment / Backend |
| Core Focus | Distributed RLHF Computation Orchestration | Multi-turn Agent Tasks and Training System Integration |
| Environment Integration | Extensible, but task semantics are not the top-level entry point | Natively revolves around environment / gym / agent |
| Suitable Scenarios | Low-level topology, extreme throughput, complex model placement | tool-use, multi-turn environments, rapid experimental closed-loops |
This article chooses SkyRL to implement the Retool-RL math experiment because its overall design is more suitable for rapid algorithm iteration and provides sufficiently good performance and system flexibility for small-to-medium-scale experiments.
3. Retool-RL Experiment
The Retool-RL experiment in this article is based on the tool-call RL configuration of OpenClaw-RL[9], combined with SkyRL for implementation and optimization. The core algorithmic innovation is to fuse On-Policy Distillation[10] with the Bridge Thought idea from the theory of slow thinking[11], and adapt it to the Agentic RL scenario.
Traditional Bridge Thought is mostly used for pre-training or SFT, while conventional On-Policy Distillation (OPD) often relies on stronger external teachers to provide token-level supervision. However, in Agentic RL, the actual feedback from tool calls (e.g., code errors, execution outputs, human interactions) naturally contains a posteriori information. By utilizing this feedback, we can automatically generate a brief bridge thought via a PRM after the interaction ends, pointing out the model's deviations and suggestions for correction.
By appending this bridge thought to the reference model's prompt, we allow it to recalculate logprobs under the condition of a "known correction direction." At this point, the reference model is no longer just a standard KL anchor, but transforms into an "a posteriori teacher" providing dense supervision signals.
This design bridges the reward gap: the outcome reward is responsible for judging whether the "result is right or wrong," while the a posteriori teacher guides "how to correct the process" via OPD. The entire flow is closed-loop and self-consistent, and it removes the need for a stronger external teacher model.
The starting point of the experiment is qwen3-4b-retool-sft. Using Qwen3-4B-Instruct as its foundation, this model underwent 3 epochs of SFT on the JoeYing/ReTool-SFT dataset open-sourced by the ReTool project, and has already acquired basic capabilities for instruction following, mathematical reasoning, and Python tool-calling formats. The target task for optimization includes math problems like AIME 2024, and the Agentic interaction format is a multi-turn dialogue with access to a Python Code Tool. Thanks to the GPU VRAM time-sharing multiplexing optimization mentioned above, I completed both sets of RL experiments on a single RTX PRO 6000 (96 GB VRAM). A single Baseline RL run took about 5 hours, while a Hybrid RL run took approximately 10 hours in total because of the extra PRM scoring, Bridge Thought generation, and OPD steps.
As summarized in the table below, the core difference among the three methods lies in the signal sources: SFT signals are dense but come from static data; RL signals come from real attempts of the current policy, but rewards are sparse; Hybrid aims for both, combining on-policy sampling with denser local direction.
| Method | Sampling | Reward / Supervision | Purpose |
|---|---|---|---|
| SFT Baseline | off-policy | dense imitation | Establish tool-calling formats and initial capabilities |
| Baseline RL | on-policy | sparse outcome | Enable the model to learn from its own problem-solving attempts |
| Hybrid RL | on-policy | outcome + process reward + token-level teacher signal | Simultaneously obtain result ranking and local direction |
3.1 Experimental Results
Both RL experiments start from the same qwen3-4b-retool-sft model, comparing pure outcome GRPO with a Hybrid RL version using PRM + Hindsight / OPD on AIME 2024. The evaluation pipelines at step 0 for both had slight differences, so we take the average as a unified SFT baseline reference point. Looking at the core metric pass@8:
| Metric | SFT baseline | Hybrid RL Best | RL Best | Hybrid RL Improvement | RL Improvement |
|---|---|---|---|---|---|
| pass@8 | 0.583 | 0.767 (step 6) | 0.667 (step 4) | +31.4% Relative | +14.3% Relative |
The peak pass@8 for Hybrid reached 76.7%, a 31.4% relative improvement over the SFT baseline's 58.3%, and about a 15.0% relative improvement over the best pure Baseline RL result of 66.7%. It is worth noting that these improvements were achieved on a single GPU after just a few hours of training, which suggests that denser learning signals can pay off even at small compute budgets.
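The relative improvements quoted above follow directly from the reported pass@8 values. As a quick check, the arithmetic can be reproduced as follows; the helper name `relative_gain` is only for illustration, and the baseline is the average of the two step-0 evaluations from the per-step table.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old


sft_baseline = (0.567 + 0.600) / 2  # average of the two step-0 evaluations
hybrid_best = 0.767                 # Hybrid RL, step 6
rl_best = 0.667                     # Baseline RL, step 4

print(f"Hybrid vs SFT: {relative_gain(hybrid_best, sft_baseline):.1%}")  # 31.4%
print(f"RL vs SFT:     {relative_gain(rl_best, sft_baseline):.1%}")      # 14.3%
print(f"Hybrid vs RL:  {relative_gain(hybrid_best, rl_best):.1%}")       # 15.0%
```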
Looking at the learning curves for both experiments, Hybrid reached 70.0% at step 4 and its peak of 76.7% at step 6. Baseline RL peaked at 66.7% at step 4 and mostly lingered between 60.0% and 63.3% thereafter. Hybrid maintained a distinct advantage from steps 4 to 12, but degradation appeared later, so the optimal checkpoint is global_step_6; continuing training would require early stopping, learning rate decay, or stronger stabilization mechanisms.
| Step | Hybrid RL | Baseline RL | Δ |
|---|---|---|---|
| 0 | 0.567 | 0.600 | -0.033 |
| 2 | 0.600 | 0.600 | 0.000 |
| 4 | 0.700 | 0.667 | +0.033 |
| 6 | 0.767 | 0.633 | +0.134 |
| 8 | 0.700 | 0.633 | +0.067 |
3.2 Bridge Thought Case Analysis
In Hybrid RL training, bridge thought is not about writing another long, elegant CoT for the model, but rather providing a sufficiently short, local, and actionable inner reminder after the model has reached a crossroads. It's like compressing a whole block of a posteriori information into a single next action: what went wrong, what should be fixed first, and what should be re-verified after fixing. About 35%-40% of the trajectories in the training logs trigger this kind of hindsight intervention; the remaining 60%-65% are deemed not to require intervention, directly returning \boxed{-1}.
The most intuitive kind of example comes from code debugging. A typical bridge thought from the training log is written like this:
"I should define the variable 'x' as a symbol using symbols('x') before using it in the summation."
The value of this sentence lies not in demonstrating complex mathematical reasoning, but in reducing a tool failure down to a minimally reparable action: first put x into the sympy symbolic system, then handle the summation. A standard outcome reward can only tell the model at the very end that "this trajectory was wrong," but bridge thought directly translates the tool call feedback's traceback into how the next line of code should be written. For agentic RL, this kind of signal is closer to truly learnable experience.
The second type of example reflects tool usage strategy rather than simple error correction. One hindsight log entry reads very plainly:
"Instead of calculating the entire product and then taking modulo, I'll apply modulo 1000 after each multiplication."
What this sentence teaches is not the specific result of a certain problem, but a reliable calculation habit: when the goal is just the last three digits, do not let intermediate products inflate meaninglessly.
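The habit this hint describes is easy to show in code. The sketch below is an illustration rather than anything from the training logs: the function name and the sample factors are assumptions. It reduces modulo 1000 after every multiplication and checks the result against the full product.

```python
import math


def product_mod(factors, modulus=1000):
    """Multiply the factors, reducing modulo `modulus` after each step."""
    result = 1
    for f in factors:
        result = (result * f) % modulus  # intermediate values stay below modulus
    return result


factors = list(range(1, 20, 2))  # 1 * 3 * 5 * ... * 19
step_wise = product_mod(factors)
full_then_mod = math.prod(factors) % 1000

print(step_wise, full_then_mod)
```

Both routes give the same last three digits because reduction modulo 1000 commutes with multiplication; the step-wise version simply never materializes the large intermediate product.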
The third type of example is geometric modeling. In the original trajectory, the model used
"First, set Q at (0, 0) and B at (1, 0). Since QR is at a 60° angle, calculate R's coordinates as (2·cos(60°), 2·sin(60°)) = (1, √3)."
This hint did not ask the model to re-solve the problem entirely; instead, it pinned the fix to the earliest testable anchor point, establishing a stable and reliable coordinate system.
Combining these cases, bridge thought is just like the exact reminder a human needs when solving a problem. Outcome reward tells the model whether this path eventually reached the finish line; GRPO tells it which of multiple trajectories for the same problem is better; PRM helps it identify whether intermediate steps are making progress; and bridge thought translates a specific failure into an actionable corrective move for the next step. Together, the training signals transform from simple 0/1 binary states into experiences that the model can digest highly efficiently.
Of course, Hindsight also has its limits: problems like Python syntax errors, clear tracebacks, gaps in coordinate modeling, and modulo strategies are suitable for generating short, accurate local hints. However, if the error stems from a deeper misunderstanding and the tool feedback is unclear, forcing the generation of a hint might instead pollute the token-level supervision. In such cases, we need to rely more heavily on other rewards like outcome/PRM, or let experts/systems provide more fundamental feedback signals.
4. Key Implementation Points of Hybrid RL Based on SkyRL
When implementing Hybrid RL based on the SkyRL framework, the most noteworthy aspect is how it modularizes the realization of multi-turn tool interactions, process rewards, a posteriori teacher signals, and policy updates.
4.1 Training Main Loop of SkyRL
Looking at RayPPOTrainer.train(), SkyRL's main loop can be compressed into the following pipeline:
    Prompt batch
      -> GeneratorOutput
      -> postprocess reward / metrics / masks
      -> TrainingInputBatch
      -> policy / reference / critic forward
      -> KL / advantage / return
      -> actor update
      -> weight sync

Each step in this pipeline has a clear data object. For example, GeneratorOutput is the product on the sampling side, saving multiple trajectories, as well as environment-side statistics (like tool execution errors and termination reasons) and sampling metadata. convert_to_training_input() then packages fields like prompt, response, reward, mask, and rollout logprobs into a TrainingInputBatch, which is padded and sliced according to data parallel size, policy mini-batch, and critic mini-batch.
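The packaging step can be pictured with a small sketch. It does not reproduce SkyRL's actual data classes: `ToyTrajectory`, `to_training_batch`, and `PAD_ID` are hypothetical, and right-padding is an assumption of this example. It shows only how variable-length trajectories become rectangular, token-aligned fields.

```python
from dataclasses import dataclass

PAD_ID = 0  # assumed padding token id


@dataclass
class ToyTrajectory:
    prompt_ids: list
    response_ids: list
    loss_mask: list  # 1 = policy-generated token, 0 = tool observation
    reward: float


def to_training_batch(trajs):
    """Right-pad every trajectory to the longest one in the batch."""
    max_len = max(len(t.prompt_ids) + len(t.response_ids) for t in trajs)
    batch = {"input_ids": [], "attention_mask": [], "loss_mask": [], "rewards": []}
    for t in trajs:
        ids = t.prompt_ids + t.response_ids
        pad = max_len - len(ids)
        batch["input_ids"].append(ids + [PAD_ID] * pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * pad)
        # Prompt tokens and padding never contribute to the policy loss.
        batch["loss_mask"].append([0] * len(t.prompt_ids) + t.loss_mask + [0] * pad)
        batch["rewards"].append(t.reward)
    return batch


trajs = [
    ToyTrajectory([5, 6], [7, 8, 9], [1, 0, 1], 1.0),
    ToyTrajectory([5], [7], [1], 0.0),
]
batch = to_training_batch(trajs)
print(batch["input_ids"])
print(batch["loss_mask"])
```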
The significance of this design is that the trainer sees structured data batches instead of an environment's private log format; the environment also doesn't need to know whether PPO, GRPO, or a Hybrid RL algorithm is running downstream. For Agentic RL, this is crucial, because the rollout phase might include multi-tool execution, abnormal results, PRM scoring, and multi-turn observations, while the training phase ultimately requires rewards, logprobs, and masks that can be aligned at the token level.
4.2 Agentic Rollout
SkyRLGymGenerator.agent_loop() is the core of multi-turn Agent rollout. It organizes model generation and environment execution into an explicit state transition process:
    while not done:
        # the policy proposes the next action given the full context so far
        output_ids, stop_reason = inference_engine.generate(input_ids)
        # output_text is the decoded form of output_ids
        observation, reward, done = env.step(output_text)
        # obs_ids is the re-encoded observation, appended to the context
        input_ids = input_ids + output_ids + obs_ids
        if step_wise:
            emit_turn(trajectory_output)

Each round, the inference engine first generates an assistant action, which is then handed over to the environment to execute the tool and return an observation. The observation is re-encoded and concatenated back into the context, becoming the condition for the next generation. This way, a math problem is no longer a static completion, but an interactive trajectory where actions and observations alternately appear.
The field design of TrajectoryOutput determines whether this trajectory can be stably trained. response_ids saves the tokens truly generated by the policy; loss_mask marks which tokens participate in the policy loss, avoiding optimizing prompt or tool observations as model actions; reward can carry the final answer score, or it can carry step-level or token-level signals.
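A runnable toy version of such a loop makes the role of `loss_mask` visible. Everything here is a stand-in: `ToyEngine`, `ToyEnv`, and this `agent_loop` are hypothetical, tokens are plain strings instead of ids, and the "tool" only adds integers. The structure mirrors the description above: generated tokens get mask 1, observation tokens get mask 0.

```python
class ToyEngine:
    """Hypothetical inference engine: calls the tool once, then answers."""

    def generate(self, input_ids):
        if "<obs>" not in input_ids:
            return ["<tool>", "2+2", "</tool>"], "stop"
        return ["answer", input_ids[input_ids.index("<obs>") + 1]], "stop"


class ToyEnv:
    """Hypothetical environment: evaluates sums, ends when an answer arrives."""

    def step(self, output_ids):
        if output_ids[0] == "<tool>":
            result = str(sum(int(x) for x in output_ids[1].split("+")))
            return ["<obs>", result, "</obs>"], 0.0, False
        reward = 1.0 if output_ids[-1] == "4" else 0.0
        return [], reward, True


def agent_loop(engine, env, prompt_ids, max_turns=4):
    input_ids = list(prompt_ids)
    response_ids, loss_mask, reward = [], [], 0.0
    for _ in range(max_turns):
        output_ids, _stop_reason = engine.generate(input_ids)
        obs_ids, reward, done = env.step(output_ids)
        response_ids += output_ids + obs_ids
        # Only policy-generated tokens participate in the policy loss.
        loss_mask += [1] * len(output_ids) + [0] * len(obs_ids)
        input_ids += output_ids + obs_ids
        if done:
            break
    return response_ids, loss_mask, reward


response_ids, loss_mask, reward = agent_loop(ToyEngine(), ToyEnv(), ["what", "is", "2+2"])
print(response_ids)
print(loss_mask, reward)
```

Without the mask, the update would push up the probability of tool output tokens that the policy never chose, which is why the mask is built in the same place the tokens are concatenated.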
4.3 Environment and Reward Layer
Taking the ReTool math environment as an example, the execution path of ReToolEnv.step() can be abstracted as:
    action text
      -> action parser
      -> Python executor
      -> observation wrapper
      -> outcome verifier
      -> PRM scorer
      -> reward composer

The model's output is first parsed into a tool call action, and then executed via the Python executor. Execution results, exceptions, empty outputs, and timeouts are all wrapped as observations and returned to the model. Upon parsing failure, the environment should not directly abort the whole trajectory, but rather return a formatting correction prompt, giving the model a chance to fix it in the next round. This kind of "recoverable failure" is precisely one of the differences between Agentic RL and standard outcome RL: the model must learn not only to get the answer right, but also to read environment feedback, correct tool calls, and determine when to keep exploring and when to stop.
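The first half of this path (parse, execute, wrap) can be sketched with the standard library. This is not the ReToolEnv implementation: the `<code>...</code>` tag format, the bracketed observation prefixes, and the function names are assumptions for illustration, and a plain subprocess is not a security sandbox.

```python
import re
import subprocess
import sys

CODE_PATTERN = re.compile(r"<code>(.*?)</code>", re.DOTALL)  # assumed tool-call format


def run_python(code, timeout_s=5.0):
    """Run code in a fresh interpreter and wrap every outcome as an observation."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return f"[timeout] execution exceeded {timeout_s:.1f}s"
    if proc.returncode != 0:
        lines = proc.stderr.strip().splitlines()
        return "[error] " + (lines[-1] if lines else f"exit code {proc.returncode}")
    return proc.stdout.strip() or "[empty output]"


def env_step(action_text):
    """Return (observation, done) for one assistant action."""
    match = CODE_PATTERN.search(action_text)
    if match is not None:
        return run_python(match.group(1)), False
    if "\\boxed{" in action_text:
        return "", True  # final answer submitted; the trajectory ends
    # Recoverable failure: explain the format instead of aborting.
    return "[format] wrap code in <code>...</code> or give \\boxed{answer}", False


print(env_step("<code>print(1 + 1)</code>"))
print(env_step("<code>1/0</code>"))
print(env_step("let me just guess"))
```

Returning the last line of the traceback keeps the observation short while preserving the exception type, which is usually the part the model needs to repair the call.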
The composition of the reward is not just simply attaching a final score. The outcome verifier in math tasks is responsible for judging whether the final answer is correct; the PRM is tasked with scoring intermediate reasoning and tool usage processes; tool-call shaping can encourage reasonable tool invocation and prevent the model from reverting to pure text guessing.
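One simple way to combine these three sources is a weighted sum. The sketch below is an assumption-laden illustration, not the configuration used in the experiment: the function name, the weights `w_prm` and `tool_bonus`, and the choice to average PRM scores are all hypothetical.

```python
def compose_reward(outcome_correct, prm_scores, num_tool_calls,
                   w_prm=0.5, tool_bonus=0.1):
    """Combine outcome, process, and tool-call shaping into one scalar.

    All weights are illustrative assumptions, not tuned values.
    """
    outcome = 1.0 if outcome_correct else 0.0
    # Average the per-step PRM scores; no steps means no process signal.
    process = sum(prm_scores) / len(prm_scores) if prm_scores else 0.0
    # Small bonus for actually using the tool instead of guessing in text.
    shaping = tool_bonus if num_tool_calls > 0 else 0.0
    return outcome + w_prm * process + shaping


print(compose_reward(True, [1.0, 0.5], num_tool_calls=2))
print(compose_reward(False, [], num_tool_calls=0))
```

Keeping the shaping terms small relative to the outcome term is a common precaution, so that a trajectory cannot score well through tool calls alone while getting the answer wrong.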
4.4 Forward Layer
During the Rollout phase, generated tokens are already obtained, but a re-forward is still needed during training. The reason is that inference engines and training engines bear different goals: the rollout engine pursues high-throughput sampling, typically using engines like vLLM / SGLang; training forward requires precise per-token logprob, an autograd graph, and loss mask alignment, relying on training backends like FSDP / Megatron at the lower level.
fwd_logprobs_values_reward() splits several modules on the training side:
| Forward Module | Output | Role in Hybrid RL |
|---|---|---|
| policy | action logprobs | Optimization target of the current policy |
| reference / teacher | base logprobs | KL anchor or hindsight teacher signal |
| critic | value | Needed for PPO / GAE paths; can be weakened in GRPO path |
In standard PPO, the reference model is primarily a KL anchor, using reference logprobs under a normal prefix to prevent the policy from deviating too far from the initial distribution. In Hybrid RL training, this reference path can also serve as a teacher signal: first extract the delayed bridge thought from the trajectory, then construct an enhanced context with the a posteriori hint, letting the teacher calculate token logprobs under the condition of [prompt + bridge-thought + response]. What is obtained this way is not a normal KL baseline, but a posteriori supervision asking, "If we already knew where this trajectory needed fixing, how would the teacher evaluate these tokens?"
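The practical difficulty in this step is alignment: the teacher sees a longer context than the student, yet its logprobs must line up with the same response tokens. A minimal sketch of that bookkeeping follows, with hypothetical names and placeholder logprobs, assuming entry i of the logprob list is the logprob of token i given everything before it.

```python
def build_teacher_input(prompt_ids, bridge_ids, response_ids):
    """Return the teacher's input ids and the slice covering the response tokens."""
    input_ids = prompt_ids + bridge_ids + response_ids
    start = len(prompt_ids) + len(bridge_ids)
    return input_ids, slice(start, start + len(response_ids))


prompt_ids = [11, 12, 13]
bridge_ids = [91, 92]            # tokens of the hindsight hint (bridge thought)
response_ids = [21, 22, 23, 24]  # tokens sampled by the policy during rollout

teacher_ids, resp_slice = build_teacher_input(prompt_ids, bridge_ids, response_ids)

# Placeholder values standing in for a real teacher forward pass.
teacher_all_logprobs = [-0.1 * (i + 1) for i in range(len(teacher_ids))]
teacher_resp_logprobs = teacher_all_logprobs[resp_slice]

print(teacher_ids[resp_slice])     # same tokens the student generated
print(len(teacher_resp_logprobs))  # one teacher logprob per response token
```

Because only the response slice is kept, the teacher signal has the same length as the student's response logprobs and can share the same loss mask.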
4.5 Advantage Layer
What truly determines the algorithm's style is compute_advantages_and_returns(). A strength of SkyRL's design is that the differences among PPO, GRPO, OPD, and Hybrid RL can converge into this localized interface, rather than being scattered across the generator, environment, forward, and update modules.
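The signal for GRPO comes from multiple sampling results under the same prompt. Assuming the rewards for a group of $G$ trajectories are $r_1, \dots, r_G$, the group-relative advantage of trajectory $i$ is

$$A_i^{\text{GRPO}} = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$$

It answers the question: among multiple attempts at the same problem, which trajectory is more worth learning? This signal does not need a critic and is particularly suitable for tasks like math and coding where the outcome verifier is relatively clear, but the value function might not be stable. PRM further alters the composition of $r_i$, since process scores can be folded into the reward alongside the outcome score.

The signal for OPD comes from token-level teacher logprobs. Following the perspective of On-Policy Distillation[10], the difference between teacher and student can be written as a per-token reverse KL, estimated on the sampled token $y_{i,t}$ of trajectory $i$:

$$\mathrm{KL}_{i,t} = \log \pi_\theta(y_{i,t} \mid x, y_{i,<t}) - \log \pi_{\text{teacher}}(y_{i,t} \mid x, b_i, y_{i,<t})$$

where $x$ is the prompt and $b_i$ is the bridge thought added to the teacher's context. Thus, the token-level OPD advantage can be written as:

$$A_{i,t}^{\text{OPD}} = -\mathrm{KL}_{i,t} = \log \pi_{\text{teacher}}(y_{i,t} \mid x, b_i, y_{i,<t}) - \log \pi_\theta(y_{i,t} \mid x, y_{i,<t})$$

If the teacher under hindsight conditions leans more toward a certain token, that token will receive a positive push; if the teacher thinks it is unreasonable, its probability will be suppressed. This complements the limitations of GRPO: GRPO provides trajectory-level ranking, while OPD provides token-level local improvement direction.

Hybrid advantage can be viewed as a linear synthesis of both types of signals:

$$A_{i,t}^{\text{Hybrid}} = \alpha \, A_i^{\text{GRPO}} + \beta \, A_{i,t}^{\text{OPD}}$$

where $\alpha$ and $\beta$ are weighting coefficients.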
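A minimal sketch of such a linear synthesis is shown below. It assumes the trajectory-level advantage has already been computed, that the OPD term for each token is the teacher logprob minus the student logprob (the negative of a per-token reverse KL estimate), and that a single coefficient `opd_coef` weights the token-level term; the function name and coefficient are hypothetical, not the experiment's actual settings.

```python
def hybrid_token_advantages(traj_advantage, student_logprobs, teacher_logprobs,
                            loss_mask, opd_coef=1.0):
    """Broadcast a trajectory-level advantage and add a token-level OPD term."""
    advantages = []
    for s, t, m in zip(student_logprobs, teacher_logprobs, loss_mask):
        opd = t - s  # > 0 when the hindsight teacher prefers this token more
        advantages.append(m * (traj_advantage + opd_coef * opd))
    return advantages


adv = hybrid_token_advantages(
    traj_advantage=0.5,
    student_logprobs=[-1.0, -2.0, -0.5],
    teacher_logprobs=[-0.5, -2.5, -0.5],
    loss_mask=[1, 1, 0],  # the last token is a tool observation
)
print(adv)  # [1.0, 0.0, 0.0]
```

In the example, the first token is boosted beyond the trajectory-level advantage because the teacher prefers it, the second is cancelled out because the teacher disagrees, and the observation token is masked to zero.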
4.6 Modular Value of SkyRL
Returning to system design, Hybrid RL can be implemented smoothly precisely because SkyRL decomposes Agentic RL into several relatively stable module interfaces. The Generator and Environment are responsible for producing real interactive experiences with tool observations; the Trainer organizes these experiences into token-aligned training batches; the forward layer provides probability signals from the policy, reference, critic, and teacher; the advantage layer decides exactly which RL algorithm to adopt. Researchers only need to inject algorithmic logic at points like reward, forward, and advantage, while the underlying high-throughput rollout, distributed training, and weight sync mechanisms can still reuse SkyRL's existing implementations.
5. Summary
From the perspective of system and algorithm synergy, this article explored how to build an efficient and stable training closed-loop for Agentic RL.
At the system infrastructure level, we comparatively analyzed two mainstream framework abstractions, Verl and SkyRL. Verl focuses on low-level computational topology, decoupling control flow and computation flow to achieve efficient orchestration and time-sharing multiplexing of distributed resources; SkyRL focuses on higher-level task abstraction, breaking down modules like Trainer, Generator, and Environment, making the fusion of multi-turn tool calls and reinforcement learning algorithms extremely flexible. Together, both confirm the core system design principle of Agentic RL: tiered decoupling of generation, evaluation, and training, with efficient collaboration among heterogeneous components.
Based on the flexible architecture provided by SkyRL, we completely built the practical case of Retool-RL, streamlining the experimental closed-loop encompassing tool calling, process evaluation, and model updating. Building upon this, we verified the significant benefits brought by the new algorithm: by introducing process rewards (PRM) and Hindsight OPD based on Bridge Thought, we transformed the sparse execution results given by the environment into dense gradient signals with local directionality (Outcome Reward → Process Reward → Token-level OPD).
This combination of system and algorithm showed significant improvements in experiments: on the AIME 2024 benchmark, the Hybrid approach used only about 10 hours of single-GPU training to boost the 4B model's pass@8 from 58.3% to 76.7%, also achieving about a 15% relative gain compared to pure outcome GRPO. This suggests that for complex reasoning tasks, efficient, flexible system infrastructure paired with denser, more reliable learning signals is key to improving RL model capabilities.
References
[1] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Hao Zheng, et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP, 2023.
[2] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, et al. "SGLang: Efficient Execution of Structured Language Model Programs." arXiv:2312.07104, 2023.
[3] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, et al. "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel." VLDB, 2023.
[4] Mohammad Shoeybi, Mostofa Ali Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism." arXiv:1909.08053, 2019.
[5] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. "DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters." KDD, 2020.
[6] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. "Proximal Policy Optimization Algorithms." arXiv:1707.06347, 2017.
[7] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, et al. "HybridFlow: A Flexible and Efficient RLHF Framework." arXiv:2409.19256, 2024.
[8] Tyler Griggs, Sumanth Hegde, Eric Tang, Shu Liu, Shiyi Cao, et al. "Evolving SkyRL into a Highly-Modular RL Framework." Notion Blog, 2025.
[9] Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. "OpenClaw-RL: Train Any Agent Simply by Talking." GitHub, 2025.
[10] Kevin Lu and Thinking Machines Lab. "On-Policy Distillation." Connectionism, 2025.
[11] H. Yang, Z.-Q. J. Xu, F. Xiong, and W. E. "A First-Principles Theory of Slow Thinking and Active Perception." ResearchGate, 2026.