Chinese AI startup DeepSeek is teasing its next major breakthrough with the DeepSeek-GRM models, developed in collaboration with Tsinghua University. The new reasoning method combines generative reward modeling (GRM) with self-principled critique tuning (SPCT), aiming to improve the accuracy, efficiency, and human alignment of large language model outputs. The accompanying paper reports that DeepSeek-GRM outperformed existing public reward models on several benchmarks.
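For readers curious how a generative reward model differs from a conventional scalar one, here is a minimal Python sketch of the inference-time scaling idea the paper describes: sample several principle-plus-critique generations for the same response, extract a score from each, and aggregate across samples. The `generate_critique` stub and the "Score:" output format are hypothetical placeholders for illustration, not DeepSeek's actual API or prompt format.

```python
import re
import statistics

def generate_critique(query: str, response: str) -> str:
    """Hypothetical stand-in for one sampled call to a generative reward model.

    A real GRM would be prompted to first write its own evaluation
    principles, then a critique of the response, and finally a numeric
    score. Here we return a canned string purely for illustration.
    """
    return ("Principle: answers should be factually grounded and complete.\n"
            "Critique: the response addresses the question directly.\n"
            "Score: 7")

def extract_score(critique: str) -> float:
    """Pull the numeric score out of the model's generated critique text."""
    match = re.search(r"Score:\s*(-?\d+(?:\.\d+)?)", critique)
    if match is None:
        raise ValueError("no score found in critique")
    return float(match.group(1))

def grm_reward(query: str, response: str, k: int = 8) -> float:
    """Inference-time scaling: sample k critiques and aggregate their scores.

    Drawing several independent principle+critique generations and
    averaging (or voting over) the extracted scores is the core idea
    behind scaling a generative reward model at inference time.
    """
    scores = [extract_score(generate_critique(query, response))
              for _ in range(k)]
    return statistics.mean(scores)

if __name__ == "__main__":
    print(grm_reward("What is 2 + 2?", "4"))
```

The appeal of this design is that accuracy can be traded for compute after training: running more samples per judgment tends to sharpen the reward signal without retraining the model.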

Following the success of DeepSeek's R1 model, known for delivering competitive reasoning performance on a modest budget, industry watchers expect a successor, R2, to debut soon, though the company has not confirmed a release. DeepSeek has also introduced DeepSeek-V3-0324, an update that enhances reasoning, Chinese writing, and web development capabilities.

Backed by hedge fund High-Flyer Quant and quietly gaining support from top Chinese leadership, DeepSeek is making waves globally. With plans to open-source its GRM models, the company is continuing its research-first, transparency-driven approach, potentially setting a new standard for reinforcement learning efficiency on top of its Mixture of Experts (MoE) architecture.
