Kimi K2: Open Agentic Intelligence
papers
summary
research
llm
agents
Introduction
- MoE with 32B activated parameters and 1T total parameters
- Ultra-sparse MoE employed together with Multi-head Latent Attention (MLA)
- Proposes the MuonClip optimizer, built on top of Muon and leveraging QK-Clip
- Pretrained with 15.5T tokens with zero loss spikes
- Multi-stage post-training
- Introduces a large-scale agentic data synthesis pipeline
- Proposes a general reinforcement learning framework that combines RLVR with a self-critique rubric reward mechanism.
Pre-training
- Focuses on maximizing per-token efficiency
- The authors find that training instability due to exploding attention logits is more frequent with Muon than with AdamW
- Since they leverage MLA, QK-Norm cannot be applied
- To address this, they propose Muon with QK-Clip which works by rescaling the query and key projection weights post-update to bound the growth of attention logits.
- The authors define a per-head scalar \(S^{\max}_h\), the max logit, as the maximum input to the softmax over a batch. The core idea of QK-Clip is to rescale \(W_q, W_k\) whenever \(S^{\max}_h\) exceeds a target threshold \(\tau\) (=100 in their experiments); see the sketch after this list
- Then they integrate Muon with weight decay, consistent RMS matching, and QK-Clip into a single optimizer, MuonClip
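A minimal PyTorch-style sketch of the clipping step. It assumes a per-head weight layout of [num_heads, head_dim, d_model] and a per-head max logit tracked during the forward pass, and it ignores the MLA-specific handling of shared RoPE components described in the paper; treat it as an illustration, not the authors' implementation.

```python
import torch

def qk_clip_(W_q: torch.Tensor, W_k: torch.Tensor,
             max_logit_per_head: torch.Tensor, tau: float = 100.0) -> None:
    """Simplified QK-Clip: rescale query/key projections after an optimizer update.

    Assumed (illustrative) layout:
      W_q, W_k           -- [num_heads, head_dim, d_model] projection weights
      max_logit_per_head -- S_h^max observed in the current batch, shape [num_heads]
    """
    for h, s_max in enumerate(max_logit_per_head.tolist()):
        if s_max > tau:
            # Scale both projections by sqrt(tau / s_max); the attention logit is
            # bilinear in (W_q, W_k), so the head's max logit shrinks to roughly tau.
            gamma = (tau / s_max) ** 0.5
            W_q[h].mul_(gamma)
            W_k[h].mul_(gamma)
```

Because the rescaling is applied per head and only when \(S^{\max}_h > \tau\), heads whose logits stay bounded are left untouched.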
Pre-training Data: Improving Token Utility with Rephrasing
- Acknowledges the limited supply of high-quality tokens
- Introduces a synthetic data pipeline to increase token utility: a rephrasing pipeline is employed to amplify the volume of high-quality tokens without inducing significant overfitting
- The rephrasing pipeline consists of three components (a rough sketch follows this list):
- Style- and perspective-diverse prompting
- Chunk-wise autoregressive generation
- Fidelity verification
- The data is rephrased multiple times, and a single training epoch is performed over the rephrased corpus
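A rough sketch of how the three rephrasing components could fit together. The helpers `llm_rephrase` and `llm_judge_fidelity`, the prompt wording, and the chunking parameters are all hypothetical stand-ins, not the paper's actual pipeline.

```python
from typing import Callable, List

def rephrase_document(
    doc: str,
    styles: List[str],
    llm_rephrase: Callable[[str], str],              # hypothetical rephrasing-model call
    llm_judge_fidelity: Callable[[str, str], bool],  # hypothetical fidelity judge
    chunk_size: int = 2000,
) -> List[str]:
    """Generate style-diverse rephrasings of `doc` chunk by chunk, keeping only
    rewrites that the judge considers faithful to the original content."""
    chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]
    accepted = []
    for style in styles:                        # style-/perspective-diverse prompting
        rewritten_chunks, context = [], ""
        for chunk in chunks:                    # chunk-wise autoregressive generation:
            prompt = (                          # condition on previously rewritten text
                f"Rewrite the following text in a {style} style, staying faithful "
                f"to its facts. Previously rewritten context:\n{context}\n\n"
                f"Text to rewrite:\n{chunk}"
            )
            rewritten = llm_rephrase(prompt)
            rewritten_chunks.append(rewritten)
            context = rewritten                 # keeps coherence across chunk boundaries
        candidate = "".join(rewritten_chunks)
        if llm_judge_fidelity(doc, candidate):  # fidelity verification
            accepted.append(candidate)
    return accepted
```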
Model Architecture
- Similar to DeepSeek-V3
- MoE with 32B activated, and 1.04T total parameters
- They develop a sparsity scaling law tailored for the MoE model family using Muon
- The authors find that increasing sparsity consistently improves model performance
- To balance model performance with cost, they adopt a sparsity of 48, activating 8 out of 384 experts per forward pass (the ratio is spelled out after this list)
- They also find that doubling the number of attention heads provides only a modest improvement, so they keep 64 attention heads
- 16-way pipeline parallelism (PP), 16-way expert parallelism (EP), and ZeRO-1 data parallelism were used
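For concreteness, the sparsity figure is simply the ratio of total experts to experts activated per token:

\[
\text{sparsity} \;=\; \frac{\text{total experts}}{\text{activated experts per token}} \;=\; \frac{384}{8} \;=\; 48
\]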
Training
- 4096-token context window at the start of training
- The training on the first 10T tokens was done with a constant learning rate of 2e-4 after a 500-step warm-up, followed by 5.5T tokens with a cosine decay from 2e-4 to 2e-5 (see the schedule sketch after this list)
- Weight decay of 0.1, and a global batch size of 67M tokens
- Long-context activation towards the end of the training. YaRN was used for context extension, and the model was trained on 400 billion tokens with a \(4K\) sequence length, followed by an additional 60 billion tokens with a \(32K\) sequence length.
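A small sketch of the stated learning-rate schedule, expressed per optimizer step under the reported 67M-token global batch; the linear warm-up shape is an assumption (the paper summary above only states a 500-step warm-up).

```python
import math

WARMUP_STEPS = 500
BATCH_TOKENS = 67e6      # reported global batch size in tokens
CONST_TOKENS = 10e12     # constant-LR phase: first 10T tokens
DECAY_TOKENS = 5.5e12    # cosine-decay phase: next 5.5T tokens
LR_MAX, LR_MIN = 2e-4, 2e-5

def learning_rate(step: int) -> float:
    """Learning rate at a given optimizer step (sketch of the described schedule)."""
    if step < WARMUP_STEPS:                     # warm-up (shape assumed linear)
        return LR_MAX * (step + 1) / WARMUP_STEPS
    tokens = step * BATCH_TOKENS
    if tokens <= CONST_TOKENS:                  # constant at 2e-4 over the first 10T tokens
        return LR_MAX
    # cosine decay from 2e-4 to 2e-5 over the following 5.5T tokens
    progress = min((tokens - CONST_TOKENS) / DECAY_TOKENS, 1.0)
    return LR_MIN + 0.5 * (LR_MAX - LR_MIN) * (1.0 + math.cos(math.pi * progress))
```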
SFT
- Uses the plain Muon optimizer for SFT, and recommends doing the same for any fine-tuning of K2
- Agentic capabilities, especially tool use at scale, are the focus
- Developed a pipeline that simulates real-world tool-use scenarios at scale.
- Three stages in the data synthesis pipeline (a rough sketch follows this list):
- Tool spec generation: uses both real-world and synthetic tools, with 3,000 MCP-based tools and 20,000 synthetic tools
- Agent and task generation: for each sampled toolset, generate an agent to use the toolset and some corresponding tasks
- Trajectory generation: for each agent and task, generate trajectories where the agent finishes the task by invoking these tools
- Every task completion is rewarded with a rubric that specifies success criteria, expected tool-use patterns, and evaluation checkpoints
- Hybrid pipeline (LLM based judges + real-execution sandboxes) for filtering and judging completions.
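A condensed sketch of how the three synthesis stages plus the hybrid judging could compose; every helper here (`gen_agent_and_tasks`, `rollout`, `judge`) is a hypothetical placeholder for the LLM- and sandbox-based components the paper describes.

```python
from typing import Callable, Dict, List

def synthesize_agentic_data(
    toolsets: List[List[Dict]],                         # real MCP + synthetic tool specs
    gen_agent_and_tasks: Callable[[List[Dict]], Dict],  # hypothetical: agent, tasks, rubrics
    rollout: Callable[[Dict, Dict], Dict],              # hypothetical: simulated tool-use run
    judge: Callable[[Dict, Dict], bool],                # hypothetical: LLM judge + sandbox checks
) -> List[Dict]:
    """Tool specs -> agents/tasks -> trajectories, keeping only rubric-passing runs."""
    kept = []
    for tools in toolsets:
        spec = gen_agent_and_tasks(tools)              # stage 2: agent persona + tasks (+ rubrics)
        for task in spec["tasks"]:
            trajectory = rollout(spec["agent"], task)  # stage 3: agent invokes tools on the task
            if judge(task, trajectory):                # rubric: success criteria, expected
                kept.append(trajectory)                # tool-use patterns, evaluation checkpoints
    return kept
```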
RL
- Scales RL in both task diversity and training FLOPs for K2
- The authors develop a Gym-like extensible framework that facilitates RL across a wide range of scenarios (one possible shape of such an interface is sketched below)
- Combines tasks with verifiable rewards and tasks, like creative writing, that rely on self-critique rewards
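To make that split concrete, here is one possible shape of a Gym-like task interface covering the two reward types; this is an assumed API sketch, not the framework's actual code.

```python
from abc import ABC, abstractmethod
from typing import Callable


class RLTask(ABC):
    """Minimal Gym-like task interface: every task exposes a prompt and scores a response."""

    @abstractmethod
    def prompt(self) -> str: ...

    @abstractmethod
    def reward(self, response: str) -> float: ...


class VerifiableTask(RLTask):
    """Task with an objectively checkable answer (math, code, logic)."""

    def __init__(self, question: str, checker: Callable[[str], bool]):
        self.question, self.checker = question, checker

    def prompt(self) -> str:
        return self.question

    def reward(self, response: str) -> float:
        return 1.0 if self.checker(response) else 0.0  # verifiable (rule/execution) reward


class SelfCritiqueTask(RLTask):
    """Open-ended task (e.g. creative writing) scored against rubrics by a critic."""

    def __init__(self, question: str, critic: Callable[[str, str], float]):
        self.question, self.critic = question, critic

    def prompt(self) -> str:
        return self.question

    def reward(self, response: str) -> float:
        return self.critic(self.question, response)  # rubric-based self-critique score
```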
Verifiable Rewards Gym
- Math, STEM and Logical Tasks
- Diverse Coverage: for math and STEM tasks, they collect high-quality QA pairs using a combination of expert annotations, internal QA extraction pipelines, and open datasets. For logical tasks, the coverage spans structured-data tasks such as tabular reasoning and puzzles such as Sudoku
- Moderate difficulty: RL prompts should be neither too easy nor too difficult, so the authors assess the difficulty of each problem using the SFT model's \(pass@k\) accuracy and select only problems of moderate difficulty (see the filtering sketch at the end of this section)
- Hybrid Rule Verification:
- Deterministic evaluation via code interpreters for instructions with verifiable outputs
- LLM-as-judge evaluation for instructions requiring nuanced understanding of constraints.
- Multi-Source Instruction Generation: Three distinct generation strategies
- expert-crafted complex conditional prompts and rubrics developed by the in-house data team
- agentic instruction augmentation inspired by AutoIF
- a fine-tuned model specialized for generating additional instructions that probe specific failure modes or edge cases
- For faithfulness evaluation, they train a sentence-level faithfulness judge model to perform automated verification.
- For coding problems, they collect pull requests and issues from GitHub to build a software development environment that consists of user prompts/issues and executable unit tests
- For safety, they employ an automated prompt-evolution pipeline with three key components: an attack model, a target model, and a judge model. Each interaction is assessed using a task-specific rubric, enabling the judge model to provide a binary success/failure label.
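A sketch of the moderate-difficulty filter mentioned above. The sampler, verifier, \(k\), and thresholds are illustrative assumptions; the per-problem score is simply the empirical pass rate of \(k\) SFT-model samples, a simple proxy for \(pass@k\).

```python
from typing import Callable, Dict, List

def filter_by_difficulty(
    problems: List[Dict],
    sample_answers: Callable[[str, int], List[str]],  # hypothetical SFT-model sampler
    is_correct: Callable[[Dict, str], bool],          # verifier for a single answer
    k: int = 8,
    low: float = 0.1,
    high: float = 0.9,
) -> List[Dict]:
    """Keep only problems of moderate difficulty, judged by the SFT model's
    empirical pass rate over k samples."""
    kept = []
    for problem in problems:
        answers = sample_answers(problem["prompt"], k)
        pass_rate = sum(is_correct(problem, a) for a in answers) / k
        if low < pass_rate < high:   # drop problems that are trivial or hopeless
            kept.append(problem)
    return kept
```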
Self-Critique Rubric Reward
- Model evaluates its own outputs to generate preference signals.
- The authors curated a mixture of open-source and in-house preference datasets and initialized K2's critic capability via SFT
- First, the K2 actor generates responses to general prompts that cover a wide range of use cases. The K2 critic then ranks the results by performing pairwise evaluations against a combination of rubrics, including core rubrics, perspective rubrics, and human-annotated rubrics, with the flexibility of letting K2 weight them as it sees fit
- During RL training, the critic model is refined using verifiable signals: on-policy rollouts generated from verifiable-reward prompts are used to continuously update the critic
- Same policy optimization as in Kimi k1.5: for each problem \(x\), sample \(K\) responses from the previous policy \(\pi_{old}\), and optimize the model \(\pi_\theta\) with respect to a squared-error surrogate objective (written out after this list)
- To encourage the model to properly distribute inference budget, they enforce a per-sample maximum token budget throughout RL training, where the budget is determined based on the type of task
- To prevent the potential forgetting of valuable, high-quality data during joint RL training, they curate a dataset comprising hand-selected, high-quality samples and integrate it into the RL objective through an auxiliary PTX loss
- During the exploration phase, the generation temperature is kept high, and a temperature decay schedule shifts the model from exploration to exploitation over the course of training.
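For reference, the surrogate objective referred to above follows the Kimi k1.5 formulation; it is reconstructed here from that setup (notation and details may differ slightly from the paper), with \(\bar r(x)\) the mean reward over the \(K\) sampled responses and \(\tau > 0\) a regularization strength:

\[
L(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D}}\!\left[\, \frac{1}{K} \sum_{i=1}^{K} \left( r(x, y_i) - \bar r(x) - \tau \log \frac{\pi_\theta(y_i \mid x)}{\pi_{old}(y_i \mid x)} \right)^{2} \right],
\qquad
\bar r(x) \;=\; \frac{1}{K} \sum_{i=1}^{K} r(x, y_i)
\]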