Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning

Abstract

Recent advances in reinforcement learning (RL) have significantly improved the complex reasoning capabilities of large language models (LLMs). Despite these successes, existing methods mainly focus on single-domain RL (e.g., mathematics) with verifiable rewards (RLVR), and their reliance on purely online RL frameworks restricts the exploration space, thereby limiting reasoning performance. In this paper, we address these limitations by leveraging rubrics to provide both fine-grained reward signals and offline guidance. We propose RGR-GRPO (Reward and Guidance through Rubrics), a rubric-driven RL framework for multi-domain reasoning. Extensive experiments across 14 benchmarks spanning multiple domains demonstrate that RGR-GRPO consistently outperforms baseline RL methods, achieving average improvements of +7.0%, +5.4%, +8.4%, and +6.6% on mathematics, physics, chemistry, and general reasoning tasks, respectively.

Publication
Proc. of the International Conference on Machine Learning, ICML (Spotlight), 2026