Cursor, the AI-powered code editor from Anysphere, has released Composer 2, the latest in its line of in-house “agentic” coding models. The Composer models are proprietary and optimized for low-latency agentic coding, meaning the AI can autonomously plan, edit multiple files, test, and iterate on code within your codebase rather than just suggesting snippets.
The focus of the Composer line has always been on combining strong coding intelligence with exceptional speed (earlier versions were often described as 4x faster than comparable frontier models) and, now, very competitive pricing. Cursor is positioning Composer 2 as achieving frontier-level coding performance at a dramatically lower cost.
The reported benchmark results are: CursorBench (Cursor’s internal benchmark of real-world coding tasks), 61.3%; Terminal-Bench 2.0, 61.7%, which beats Anthropic’s Claude Opus 4.6 at 58.0% but trails OpenAI’s GPT-5.4 at 75.1%; and SWE-bench Multilingual, 73.7%. This represents a significant jump from Composer 1.5, roughly 17 points on some metrics, in a short release cycle.
Composer 2 is described as on par with or better than models like Claude Opus 4.6 in practical coding scenarios, while being much more affordable. Composer 2 Standard costs $0.50 per million input tokens and $2.50 per million output tokens. Composer 2 Fast, the higher-speed variant and now the default, costs $1.50 per million input tokens and $7.50 per million output tokens.
For comparison, Claude Opus 4.6 is around $5/$25 per million tokens and GPT-5.4 is higher, making Composer 2 roughly 3-10x cheaper depending on the competitor and the variant. Technical edges include scaled reinforcement learning (RL) on real Cursor usage data, self-summarization for better long-context handling in multi-step tasks, and optimization for interactive agentic workflows in the IDE.
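To make the "3-10x cheaper" claim concrete, the quoted per-million-token prices can be turned into a rough session cost. This is a minimal sketch: the prices are the ones quoted in this article, while the workload (2 million input tokens, 200,000 output tokens) and the helper name `session_cost` are assumptions chosen purely for illustration.

```python
# Per-million-token prices in USD as quoted in this article: (input, output).
PRICES = {
    "Composer 2 Standard": (0.50, 2.50),
    "Composer 2 Fast": (1.50, 7.50),
    "Claude Opus 4.6": (5.00, 25.00),  # "around $5/$25" per the article
}


def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one session at the quoted per-million-token rates."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 2M input tokens and 200k output tokens.
costs = {name: session_cost(name, 2_000_000, 200_000) for name in PRICES}
baseline = costs["Claude Opus 4.6"]

for name, cost in costs.items():
    if name == "Claude Opus 4.6":
        continue  # the baseline itself is not compared against itself
    print(f"{name}: ${cost:.2f} vs ${baseline:.2f} ({baseline / cost:.1f}x cheaper)")
```

Because the input and output prices differ from Opus 4.6 by the same factor within each variant, the ratio does not depend on the input/output mix: about 10x for Standard and about 3.3x for Fast.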
Users and early testers on X are calling it “legit,” with reports of it handling complex refactors efficiently and catching subtle bugs that other models, including Claude and various GPT variants, missed. There are also unconfirmed leaks and claims that the base model might build on Moonshot AI’s Kimi K2.5, with heavy continued pretraining and RL on Cursor’s proprietary coding data, which would explain the speed and cost advantages over fully proprietary frontier models from OpenAI or Anthropic.
Whether Composer 2 actually outperforms the frontier models depends on the metric. Yes on speed, cost, and practical agentic coding in Cursor’s environment, especially against similarly priced or even higher-priced options like recent Claude versions.
Partially on raw intelligence: it is frontier-level and beats some models (Opus 4.6 on certain benchmarks), but top models like GPT-5.4 still lead on the hardest tasks. The real win is the value: high performance at a fraction of the cost, which makes it feel like it outperforms in real developer workflows.
SWE-bench is one of the most widely used and respected benchmarks for evaluating how well large language models (LLMs) and AI coding agents can handle real-world software engineering tasks. Introduced in late 2023, it stands out because it uses actual problems from GitHub rather than synthetic or toy coding exercises.
SWE-bench tasks an AI with resolving real GitHub issues from popular open-source repositories. For each task, the model receives the full codebase at the state before the issue was fixed, the issue description (title and body from GitHub), and sometimes additional context such as comments. The goal is to generate a code patch (diff) that fixes the problem. Success is measured automatically: the patch is applied in a clean Docker environment, and the model’s change must make the relevant unit tests pass. The original full SWE-bench contains 2,294 tasks, all from 12 popular Python repositories.
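The pass/fail decision described above can be sketched in a few lines. This is an illustrative sketch, not the official harness: it skips the Docker and patch-application steps and only models the final check. The split into tests that must start passing and tests that must keep passing mirrors the FAIL_TO_PASS and PASS_TO_PASS fields of the public SWE-bench dataset; the class, function, and test names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """One benchmark task. Names are illustrative, not the harness's own."""
    instance_id: str
    fail_to_pass: tuple[str, ...]  # tests that fail before the fix and must pass after
    pass_to_pass: tuple[str, ...]  # tests that already pass and must keep passing


def is_resolved(task: TaskInstance, results: dict[str, bool]) -> bool:
    """Return True if every required test passed after applying the patch.

    `results` maps test ID -> passed. A test missing from the results
    (for example because the patch failed to apply) counts as a failure.
    """
    required = task.fail_to_pass + task.pass_to_pass
    return all(results.get(test_id, False) for test_id in required)


# Hypothetical task and two hypothetical post-patch test runs.
task = TaskInstance(
    instance_id="example__repo-123",
    fail_to_pass=("test_issue_is_fixed",),
    pass_to_pass=("test_existing_a", "test_existing_b"),
)
good_run = {"test_issue_is_fixed": True, "test_existing_a": True, "test_existing_b": True}
regressing_run = {"test_issue_is_fixed": True, "test_existing_a": False, "test_existing_b": True}

print(is_resolved(task, good_run))        # True
print(is_resolved(task, regressing_run))  # False: the fix broke an existing test
```

The second run shows why a patch that fixes the reported issue can still count as a failure: breaking previously passing behavior disqualifies it.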
Tasks include bug fixes, small features, refactors, and more, reflecting genuine developer work. This makes it much harder and more realistic than older benchmarks like HumanEval, which tests isolated function completion, because it requires understanding large, complex codebases (often tens of thousands of lines); navigating dependencies and repo structure; interpreting sometimes ambiguous or poorly written issue reports; and generating multi-file edits that don’t break existing functionality.
SWE-bench Verified: A cleaned, human-validated subset of 500 tasks. Annotators checked that issues are clear, tests are correct, tasks are solvable from the given info, and no data leaks/memorization artifacts exist.
This version is more reliable for comparing models because there is less noise from bad tasks. Top models in early 2026 reach ~75-82% on Verified.
SWE-bench Lite: A smaller, easier subset (about 300 tasks) used for faster evaluation or when full runs are too expensive.
SWE-bench Multilingual: Extends the idea beyond Python. It includes 300+ curated tasks from repositories in 9 languages, including Java, TypeScript/JavaScript, Go, Rust, and C/C++. This tests cross-language understanding and generalization. Performance is noticeably lower than on the Python-only versions because most frontier models are still heavily Python-biased in their training data.
There are also community forks and extensions such as SWE-bench Pro, SWE-bench Live, and others that add multi-language depth, harder tasks, or anti-contamination measures. Scores are usually reported as % Resolved, the share of tasks whose patch passes the tests. On SWE-bench Verified, scores are often higher, e.g., 76-82% for top models like Claude Opus 4.6 or newer Sonnet/Opus variants.
On SWE-bench Multilingual, scores are lower overall, highlighting gaps in non-Python performance. In the context of Cursor’s Composer 2, the reported 73.7% on SWE-bench Multilingual is a very strong result, especially at its price point, showing it is competitive even on the harder cross-language version.
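As a rough illustration of how a % Resolved figure is produced, and how an overall score can hide per-language gaps, the sketch below aggregates per-task outcomes. The per-language outcomes are made-up placeholders, not real benchmark data, and the function name `percent_resolved` is an assumption for this example.

```python
def percent_resolved(outcomes: list[bool]) -> float:
    """Share of tasks resolved, as a percentage rounded to one decimal place."""
    if not outcomes:
        raise ValueError("no task outcomes to score")
    return round(100 * sum(outcomes) / len(outcomes), 1)


# Hypothetical per-language outcomes (True = resolved); not real benchmark data.
outcomes_by_language = {
    "Go": [True, True, False, True],
    "Rust": [True, False, False, True],
    "Java": [True, True, True, False],
}

for language, outcomes in outcomes_by_language.items():
    print(f"{language}: {percent_resolved(outcomes)}%")

# The headline number pools every task, regardless of language.
all_outcomes = [o for outcomes in outcomes_by_language.values() for o in outcomes]
print(f"Overall: {percent_resolved(all_outcomes)}%")
```

A single headline figure such as 73.7% is the pooled number; per-language breakdowns, where published, show where a model is weaker.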
The benchmark has clear strengths and limitations. It tests agentic capabilities: planning, exploration, multi-file editing, and debugging loops. However, many tasks are relatively “simple” bug fixes that represent hours of human work, not days or weeks. There are also potential data contamination and memorization risks: some papers argue that top scores partly come from models “remembering” popular repos.
Overall, SWE-bench, and especially the Verified and Multilingual variants, remains the de facto standard for agentic coding evaluation in 2026. It is far more indicative of real usefulness in tools like Cursor, Devin-style agents, or GitHub Copilot Workspace than function-level benchmarks.



