MiniMax M3 Just Beat GPT-5.5 on Coding. It's Also Free to Run.

On June 1, 2026, Shanghai-based AI company MiniMax released M3 — an open-weight model that scores 59% on SWE-bench Pro, the hardest and most contamination-resistant software engineering benchmark currently in use. That puts it ahead of both GPT-5.5 and Gemini 3.1 Pro on the same benchmark.

It has a one-million-token context window. It reads text, images, and video. It can operate a desktop computer autonomously. And its API is live at $0.60 per million input tokens — roughly 5–10% of the cost of equivalent closed frontier models.

This is worth paying attention to.

What SWE-bench Pro Actually Tests

SWE-bench Pro matters specifically because it was built to be hard to game. The standard SWE-bench Verified benchmark — the one most AI coding announcements cite — has become saturated, with top models hitting 95%+ scores that reflect benchmark familiarity as much as real capability.

SWE-bench Pro uses 1,865 real pull requests from 41 actively maintained open-source repositories, drawing from problems that weren't public at training time. It's designed to measure what a model can do on genuinely novel engineering tasks. The top closed models score around 23% on the public dataset; the leading scores come from restricted-access systems like Claude Mythos 5.

M3's 59% is a self-reported figure: independent third-party verification from Artificial Analysis and LMArena hadn't been published at launch, so treat it as provisional. But the number is specific enough to be falsifiable, and MiniMax's prior models have tracked reasonably well against independent evaluation.

Where it sits in the rankings

M3 scores 59% on SWE-bench Pro, ahead of GPT-5.5 and Gemini 3.1 Pro. Claude Opus 4.8 still leads on pure code modification at 69.2%. M3 also scores 66% on Terminal-Bench 2.1 and 83.5 on BrowseComp, which puts it ahead of Claude Opus 4.7 on the browsing benchmark.

The Architecture Behind the Context Window

A one-million-token context window is not trivial to implement at speed. Most models with long context windows pay a significant latency penalty as the context grows — the attention mechanism has to process every token against every other token, which scales quadratically.

MiniMax built M3 around a new attention variant called MiniMax Sparse Attention (MSA). Rather than processing the full context on every pass, MSA pre-filters relevant key-value blocks and processes only those. The result is a model that maintains the full one-million-token window while running at around 100 tokens per second output — fast enough for real agentic workflows, not just demos.
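To make the idea of pre-filtering key-value blocks concrete, here is a minimal sketch of generic block-sparse attention for a single query. It is not MiniMax's MSA implementation, whose details are not described here. The mean-key block score, the top_k selection rule, and every function name are assumptions chosen for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Attend over only the top_k key-value blocks (illustrative, not MSA)."""
    # Split the key/value sequence into contiguous blocks.
    starts = range(0, len(keys), block_size)
    blocks = [(keys[s:s + block_size], values[s:s + block_size]) for s in starts]

    # Cheap pre-filter (assumption): score each block by the query's
    # dot product with the mean of that block's keys.
    def block_score(index):
        block_keys = blocks[index][0]
        mean_key = [
            sum(k[d] for k in block_keys) / len(block_keys)
            for d in range(len(query))
        ]
        return dot(query, mean_key)

    ranked = sorted(range(len(blocks)), key=block_score, reverse=True)
    selected = sorted(ranked[:top_k])

    # Ordinary softmax attention, restricted to tokens in the selected blocks.
    sel_keys = [k for i in selected for k in blocks[i][0]]
    sel_values = [v for i in selected for v in blocks[i][1]]
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in sel_keys])
    output = [
        sum(w * v[d] for w, v in zip(weights, sel_values))
        for d in range(len(sel_values[0]))
    ]
    return output, selected
```

With n keys, full attention scores all n of them for every query. This sketch scores n / block_size block summaries plus top_k * block_size selected keys, which is where the savings at long context would come from.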

For developers: a one-million-token context means you can feed entire large codebases into the model without chunking. For complex refactors, architectural reviews, or long-running autonomous coding tasks, that matters significantly.
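Before relying on "no chunking", it helps to check whether a given repository is likely to fit. The sketch below estimates a codebase's token count with a rough four-characters-per-token heuristic. That ratio, the file-extension list, and the reserved output budget are all assumptions. M3's actual tokenizer may count quite differently, so treat the result as a first-pass estimate only.

```python
import math
import os

CONTEXT_WINDOW_TOKENS = 1_000_000  # context window size quoted in the article
CHARS_PER_TOKEN = 4.0              # assumption: rough heuristic, not M3's tokenizer


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Approximate the token count of a string from its character length."""
    return math.ceil(len(text) / chars_per_token)


def estimate_repo_tokens(root, extensions=(".py", ".ts", ".go", ".rs", ".java")):
    """Sum estimated tokens over source files under root, skipping hidden dirs."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories such as .git in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total += estimate_tokens(handle.read())
    return total


def fits_in_context(token_count, window=CONTEXT_WINDOW_TOKENS, reserved_for_output=50_000):
    """True if the input plus a reserved output budget fits in the window."""
    return token_count + reserved_for_output <= window
```

A call such as fits_in_context(estimate_repo_tokens("path/to/repo")) gives a quick yes or no. If the answer is close to the limit, measure with the provider's real tokenizer before committing to a no-chunking design.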

The Open-Weight Question

MiniMax committed to releasing the weights within ten days of launch, targeting Hugging Face and GitHub. The licensing situation is worth reading carefully before you build on it. MiniMax's M2 model shipped under a modified MIT licence. M2.7 restricted commercial use without prior written authorisation. M3's licence follows a similar pattern — downloadable weights with a non-commercial default and enterprise licensing available through direct sales.

For personal use, research, and non-commercial projects: the weights are freely available. For production commercial deployment: check the licence terms and reach out to MiniMax if you're building a product. The API at $0.60/M input tokens is available to anyone immediately with no licence friction.

This is a common pattern with Chinese open-weight labs — weights available for download and experimentation, commercial use gated behind a licence conversation. It's more restrictive than pure MIT, but still dramatically more open than anything from Anthropic, OpenAI, or Google.

The Bigger Story: The Gap Is Closing

M3 is the most striking example of a broader trend that's been accelerating through 2026. Epoch AI's analysis found that open-weight models now trail the frontier by roughly three months on average — down from nearly a year in late 2024. The gap is closing faster than most people in the industry expected.

DeepSeek V3.2: 94.2% MMLU, 685B parameters, MIT licence, $0.35/M input via API
Qwen 3 235B-A22B: Leading open-weight model for overall reasoning and coding
GLM-5.2: Released June 13 under MIT, 1M token context, early results competitive with leading closed models on math reasoning
MiniMax M3: 59% SWE-bench Pro, 1M context, multimodal, $0.60/M input

The competitive dynamic has shifted. Eighteen months ago, the question for developers was 'closed API or open model?' with a significant capability gap favouring closed. Today the question is more nuanced: do you need the absolute state of the art on a specific task, or do you need something frontier-class that you control, can fine-tune, and pay a fraction of the cost for?

What This Means If You're Building

For most coding use cases — code review, refactoring, test generation, documentation, autonomous PR workflows — M3's capability level is sufficient. More importantly, running it means you're not dependent on a single vendor's uptime, pricing decisions, or policy changes. The Fable 5 suspension showed what happens when a model you depend on disappears without notice. Open weights don't disappear.
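One way to reduce single-vendor dependence in practice is to put every model behind the same small interface and fall back when a backend fails. The sketch below is a minimal illustration under stated assumptions: Provider, ProviderError, and complete_with_fallback are hypothetical names, and each complete callable stands in for whatever client code actually calls a hosted API or a self-hosted model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


class ProviderError(Exception):
    """Raised by a provider callable when its backend is unavailable."""


@dataclass
class Provider:
    name: str
    # Hypothetical interface: takes a prompt, returns a completion string.
    complete: Callable[[str], str]


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try each provider in order; return (provider name, completion)."""
    errors = []
    for provider in providers:
        try:
            return provider.name, provider.complete(prompt)
        except ProviderError as exc:
            # Record the failure and move on to the next backend.
            errors.append(f"{provider.name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

Ordering the list is the design decision: put the preferred backend first and a self-hosted open-weight model later, so a hosted outage degrades the service rather than stopping it.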

The practical path for builders right now: use closed APIs for the tasks where the absolute frontier matters (complex multi-step reasoning, cutting-edge research), and evaluate whether open-weight models can cover the high-volume, cost-sensitive parts of your stack. At $0.30–0.60 per million input tokens versus $10–15 for Anthropic's top tier, the economics can change the viability of features that weren't practical before.
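The price gap is easier to judge with a concrete workload. The sketch below uses the per-million input prices quoted above. The two-billion-token monthly volume is a hypothetical figure chosen only for illustration, and output-token pricing is ignored because it is not given here.

```python
def monthly_input_cost(tokens_per_month, price_per_million):
    """Input-token cost in dollars for a month's volume at a given price."""
    return tokens_per_month / 1_000_000 * price_per_million


M3_PRICE = 0.60                        # $/M input tokens, from the article
CLOSED_LOW, CLOSED_HIGH = 10.0, 15.0   # $/M input tokens, from the article
TOKENS_PER_MONTH = 2_000_000_000       # hypothetical workload

m3_cost = monthly_input_cost(TOKENS_PER_MONTH, M3_PRICE)
closed_low_cost = monthly_input_cost(TOKENS_PER_MONTH, CLOSED_LOW)
closed_high_cost = monthly_input_cost(TOKENS_PER_MONTH, CLOSED_HIGH)

# M3's price as a fraction of the closed-model price range.
ratio_vs_high = M3_PRICE / CLOSED_HIGH
ratio_vs_low = M3_PRICE / CLOSED_LOW

print(f"M3: ${m3_cost:,.0f} per month")
print(f"Closed top tier: ${closed_low_cost:,.0f} to ${closed_high_cost:,.0f} per month")
print(f"M3 costs {ratio_vs_high:.0%} to {ratio_vs_low:.0%} of the closed price")
```

At that assumed volume, the input bill is about $1,200 a month on M3 against $20,000 to $30,000 at the closed-model prices, or roughly 4% to 6% of the cost.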

“The frontier used to be a walled garden. It's becoming a starting point that open models catch up to within months.”

M3 won't be the last model to cross this threshold. The pace of open-weight releases has been accelerating all year. Whatever the frontier looks like in six months, there will probably be an open-weight model within reach of it. The question for developers isn't whether to take open models seriously — it's whether your architecture is set up to use them when they're the right tool.
