Fresh off the press — M2.7 was released just two days ago on March 18, 2026. Here's a full breakdown:
MiniMax M2.7 Overview
The headline story with M2.7 is its "self-evolving" architecture. Earlier versions of the model were used to build a research-agent harness that manages data pipelines, training environments, and evaluation infrastructure. By autonomously triggering log reading, debugging, and metric analysis, M2.7 handled roughly 30–50% of its own development workflow. That's a meaningful shift in how AI models are being developed.
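To make the "self-evolving" idea concrete, here's a minimal sketch of what an agent loop like that could look like. Everything in it (the HarnessState fields, the model.decide call, the action format) is a hypothetical illustration, not MiniMax's actual harness:

```python
# Entirely hypothetical sketch of a self-development loop: observe logs and
# eval metrics, let the model choose an action, repeat. Not MiniMax's harness.
from dataclasses import dataclass, field


@dataclass
class HarnessState:
    logs: list[str] = field(default_factory=list)             # pipeline/training logs
    metrics: dict[str, float] = field(default_factory=dict)   # eval results
    resolved: bool = False


def agent_step(model, state: HarnessState) -> HarnessState:
    """One iteration: observe, decide, act. `model.decide` is hypothetical."""
    observation = {"logs": state.logs[-50:], "metrics": state.metrics}
    action = model.decide(observation)  # e.g. {"kind": "patch", ...} or {"kind": "done"}
    if action["kind"] == "patch":
        state.logs.append(f"applied patch: {action['payload']}")
    elif action["kind"] == "done":
        state.resolved = True
    return state
```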
Coding — How Does It Stack Up Against Codex 5.3 and 5.4?
This is where the story gets interesting for you. M2.7 is essentially at parity with Codex 5.3 on the headline SWE benchmark, at a fraction of the cost.
On SWE-Pro, which covers multiple programming languages, M2.7 achieved 56.22% accuracy, matching GPT-5.3-Codex. Its advantage widens on benchmarks closer to real-world engineering scenarios: SWE Multilingual (76.5) and Multi-SWE-Bench (52.7).
It handles end-to-end project delivery, log analysis, bug troubleshooting, code security, and machine learning tasks. On VIBE-Pro (repo-level code generation), M2.7 scored 55.6%, nearly on par with Opus 4.6, meaning Web, Android, iOS, and simulation requirements can be handed directly to M2.7.
Where M2.7 genuinely stands out over Codex is in production-system intelligence. When faced with production alerts, M2.7 can correlate monitoring metrics with deployment timelines to perform causal reasoning, conduct statistical analysis on trace sampling, proactively connect to databases to verify root causes, and even use non-blocking index creation to stop the bleeding before submitting a merge request. On multiple occasions, M2.7 reduced recovery time for live production incidents to under three minutes.
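As an illustration of that mitigation pattern, here's a sketch of the two steps: correlate an alert timestamp with recent deploys, then stop the bleeding with Postgres's non-blocking CREATE INDEX CONCURRENTLY. The deploy records, table, and column names are made up; only the SQL feature is real, and this is my sketch, not M2.7's actual tooling:

```python
# Hypothetical incident-mitigation sketch: correlate an alert with the most
# recent deploy, then apply a non-blocking mitigation before the real fix.
from datetime import datetime, timedelta


def likely_culprit(alert_time: datetime, deploys: list[dict]) -> dict | None:
    """Return the deploy closest before the alert, within a 30-minute window."""
    candidates = [d for d in deploys
                  if timedelta(0) <= alert_time - d["time"] <= timedelta(minutes=30)]
    return max(candidates, key=lambda d: d["time"], default=None)


# Postgres can build an index without blocking writes, which is the
# "stop the bleeding" step described above (real SQL, hypothetical names):
MITIGATION_SQL = "CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);"

deploys = [
    {"service": "orders-api", "time": datetime(2026, 3, 20, 14, 2)},
    {"service": "billing",    "time": datetime(2026, 3, 20, 9, 40)},
]
alert = datetime(2026, 3, 20, 14, 9)

if (culprit := likely_culprit(alert, deploys)) is not None:
    print(f"suspect deploy: {culprit['service']}")  # -> orders-api
    print(f"mitigation: {MITIGATION_SQL}")
```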
The cost gap is also stark. M2.7 keeps the same price point as MiniMax's previous M2.1 generation: $0.30 per million input tokens and $1.20 per million output tokens. Among frontier models, only xAI's Grok 4.1 Fast is cheaper. GPT-5.3-Codex, by comparison, runs roughly 6.5x more on input tokens and roughly 14.7x more on output tokens.
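To put that gap in dollar terms, here's the arithmetic on a hypothetical monthly workload. Note the Codex prices below aren't published figures; they're simply implied by the 6.5x/14.7x multipliers:

```python
# Cost comparison for a hypothetical monthly agent workload:
# 200M input tokens and 50M output tokens.
M27_IN, M27_OUT = 0.30, 1.20                         # $/1M tokens (MiniMax pricing)
CODEX_IN, CODEX_OUT = M27_IN * 6.5, M27_OUT * 14.7   # implied by the multipliers

in_tok, out_tok = 200, 50                            # in millions of tokens

m27_bill = in_tok * M27_IN + out_tok * M27_OUT        # 60 + 60   = $120
codex_bill = in_tok * CODEX_IN + out_tok * CODEX_OUT  # 390 + 882 = $1,272

print(f"M2.7:  ${m27_bill:,.0f}/mo")    # M2.7:  $120/mo
print(f"Codex: ${codex_bill:,.0f}/mo")  # Codex: $1,272/mo (~10.6x more)
```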
Where Codex 5.3/5.4 likely still edge ahead is raw reasoning depth on complex, novel architectural challenges and pure senior-engineer-level code generation quality. But for real-world agentic engineering workflows at scale, M2.7 closes most of that gap.
Creative Writing
This is more of a secondary strength. M2.7 demonstrates excellent identity preservation and emotional intelligence, opening up more room for product innovation in interactive entertainment scenarios. MiniMax also unveiled an "OpenRoom" framework built on this emotional intelligence layer.
M2.7 shows significantly improved ability at complex editing across the Office suite (Excel, PPT, and Word) and handles multi-round revisions and high-fidelity edits better. That's more "professional document writing" than pure creative fiction, but the underlying model quality for long-form structured writing is solid.
Honest assessment: MiniMax's public benchmarks don't prominently feature creative writing evals, and the model's positioning is clearly engineering/productivity-first. It's probably competent but not a standout in this area compared to models like Opus or GPT-5 series that have been more heavily RLHF'd for prose quality.
Browser / Agent Tool Use vs Codex
This is where M2.7 is genuinely competitive. In the MM-Claw evaluation, which tests an AI's ability to use complex tools and execute long-horizon tasks, M2.7 achieved a 62.7% accuracy rate, nearing the performance of Sonnet 4.6. The model can maintain instruction adherence when juggling over 40 complex skills simultaneously.
The model supports the Model Context Protocol (MCP), allowing it to natively use tools like Web Search and Understand Image for multimodal reasoning.
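For context on what that looks like on the wire: MCP is built on JSON-RPC 2.0, and a tool invocation is a tools/call request. The sketch below constructs one for a web_search tool; the tool name and argument schema are illustrative, not MiniMax's actual tool definitions:

```python
import json

# MCP is JSON-RPC 2.0; a client invokes a tool with the "tools/call" method.
# The tool name and argument schema here are illustrative only.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "web_search",
        "arguments": {"query": "MiniMax M2.7 release notes"},
    },
}

print(json.dumps(request, indent=2))  # what the client sends to an MCP server
```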
Codex 5.3/5.4 are primarily coding-specialized models — they're not designed around broad agentic/browser toolchain orchestration the way M2.7 is. M2.7 was purpose-built for multi-agent orchestration and long-horizon tool use, so in that dimension M2.7 likely has an edge.
The Bottom Line for Your OpenClaw Context
Artificial Analysis reports M2.7 scores 50 on its Intelligence Index, matching GLM-5 (Reasoning) at less than one-third of GLM-5's price. The speed caveat: at 49.9 tokens per second, M2.7 is notably slow compared to other reasoning models in a similar price tier, whose median output speed is 104.8 t/s.
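To translate that throughput gap into wall-clock time, a quick back-of-envelope (the 2,000-token response length is an assumption):

```python
# Back-of-envelope latency for a 2,000-token response at each output speed.
RESPONSE_TOKENS = 2_000  # assumed response length
for name, tps in [("M2.7", 49.9), ("price-tier median", 104.8)]:
    print(f"{name}: {RESPONSE_TOKENS / tps:.0f}s")
# M2.7: 40s
# price-tier median: 19s
```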
For your use case — agentic workflows, steward-per-offer architecture, long-horizon tasks — M2.7 looks like a genuinely compelling option to evaluate as an OpenClaw model provider, especially given the explicit MiniMax docs showing native OpenClaw integration. The price-to-performance ratio is exceptional for coding and agentic tasks. The main watch-outs are the lower output speed and the fact that its creative/prose quality hasn't been independently validated to the same degree as its engineering chops.
