Blitzy Scores a Record 84.95% on SWE-Bench Pro

Jun 26, 2026 • Dr. Neeraj Deshmukh • 4 min read

Blitzy set a new best-in-class score on SWE-Bench Pro Public, outperforming Mythos with older-generation models intelligently orchestrated together.

[Figure: SWE-Bench Pro results graphic, updated 6/26]

While the AI landscape has changed considerably in three months, one fact has not.

Blitzy still holds the record on SWE-Bench Pro Public, but now with a score of 84.95%.

The closest result behind us belongs to Anthropic's latest model, Mythos, at 80.3%.

What makes the result more noteworthy is that we did not use Mythos to achieve this record. Blitzy fused a set of models a generation behind the current frontier: Opus 4.8, GPT 5.5, Sonnet 4.6, and GPT 5.4 Mini. On its own, Opus 4.8 scores 69.2% on this benchmark. The difference in performance comes from the system surrounding the models, not the models themselves.

The Record & Validation

Blitzy's record is a gain of 18.45 percentage points over the 66.5% score we posted in March.
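
For readers who want to check the margins, here is a minimal sketch of the arithmetic using only the scores quoted in this post. The dictionary labels and the gap helper are illustrative names, not part of any benchmark tooling.

```python
# Scores (percent) as quoted in this post.
SCORES = {
    "Blitzy (June)": 84.95,
    "Blitzy (March)": 66.5,
    "Mythos": 80.3,
    "Opus 4.8 alone": 69.2,
}


def gap(a: str, b: str) -> float:
    """Difference a - b in percentage points, rounded to two decimals."""
    return round(SCORES[a] - SCORES[b], 2)


gain_since_march = gap("Blitzy (June)", "Blitzy (March)")  # 18.45
lead_over_mythos = gap("Blitzy (June)", "Mythos")          # 4.65
lift_over_opus = gap("Blitzy (June)", "Opus 4.8 alone")    # 15.75
```

These are differences in percentage points, not relative improvements.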

For validation, we again brought in Quesma, who audited the 66.5% run and already knew our setup. We gave them access to our trajectories, methodology, and research team. They analyzed the hardest tasks and checked for anomalous behaviors and internet leakage.

Quesma has independently verified that the results are clean.

Why This Is Important

SWE-Bench Pro remains the gold standard for measuring AI software engineering capability. The benchmark measures a system's precision and power on the kinds of real problems enterprise developers solve regularly.

In our last published SWE-Bench Pro audit, we argued that the system around the model matters more than the model itself. The 84.95% score further strengthens that position.

Blitzy is the system that understands your codebase and orchestrates the work.

This is the value of an abstraction layer. An enterprise should not have to track which model leads this quarter or hand-craft a harness to extract performance out of it. That work pulls engineers off their core product, and it resets every time a new model ships. Blitzy is the abstraction layer that removes this problem. Steered by an enterprise's unique context and organizational intent, we deliver production-grade results at enterprise speed, scale, and quality, while the models underneath change as often as they need to.

Why Multi-Model Systems Outperform Single Frontier Models

Our guiding principle since day one in 2023 has been that drawing on all the frontier models beats relying on any single model in isolation. For the enterprise, using every frontier model removes single-provider dependencies. Build on one model and you inherit its price and its availability. If that price climbs or a model like Fable is taken offline, you are left exposed. A system that draws on every major model spreads risk, so no single provider, price change, or deprecation can stall the work.

Our March audit showed how we achieved our original record of 66.5%. Quesma ran the state-of-the-art base model on the same tasks at maximum reasoning effort, expecting spectacular failures. What they found was more subtle. Almost every wrong answer was close. For instance, the model found the right part of a codebase and wrote a reasonable patch, but the patch failed on edge and corner cases.

A single model works from one pass through the code with a limited context window. Blitzy leverages an enterprise's entire codebase as context to build a unique, dynamic knowledge graph. Every task our platform performs draws on this knowledge graph, so when an agent hits a boundary condition that would trip a single model, Blitzy has the context to handle it. Fusing frontier models together takes that further: several models in one build, each covering the others' blind spots, grounded in a shared understanding of the system. That combination of institutional context and fusion is how a stack of older models outperformed a model so powerful that it cannot be released to the public for fear of widespread panic.

What Is Blitzy?

Blitzy is an autonomous enterprise software development platform built for large, legacy codebases. Before writing code, Blitzy spends days of uninterrupted compute reverse-engineering your global estate into a dynamic knowledge graph: every dependency, pattern, and architectural decision across codebases upwards of 100 million lines, mapped and queryable.

Grounded in that understanding, the orchestration engine decomposes a project from your team's spec and recruits thousands of specialized agents in parallel, often executing more than 100,000 agent-to-agent calls on a single job. Those agents fuse the strengths of models from OpenAI, Anthropic, and Google, pulling just-in-time context from the knowledge graph as they build.

Autonomy at scale only matters if the output is correct, so Blitzy spends additional compute verifying its work rather than only writing it. Execution pauses at hard checkpoints where review agents inspect the work, classify risk, and fix problems before a build continues. The result is project-level, production-grade code that merges cleanly, up to five times faster.

Every Blitzy run strengthens this knowledge graph, so the enterprise keeps institutional memory that no engineer can carry out the door, and each project is less expensive than the last.

Quarters of work ship in days.

Intelligent Model Fusion From Inception, Not Bolted On

Multi-model fusion has been part of Blitzy since day one. We built the platform to combine models and ground them in a dynamic knowledge graph. It was the founding design decision, not something we bolted on later.

Many of the strongest agent products started as single-model harnesses. Factory and Devin are the closest examples, and both are now adding the ability to route between models. Routing is useful, but it is not fusion. It picks one model for a task.

Blitzy uses one model family to check the work of another and anchors them to a shared model of the codebase. Bolting model selection onto a single-model harness is not the same as designing for intelligent fusion from inception.
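
To make the distinction concrete, here is a toy sketch contrasting routing with cross-checking. It is not Blitzy's implementation: every name in it (route, fuse, toy_writer, toy_reviewer) is hypothetical, the "models" are plain functions so the example runs without any external service, and the shared context is reduced to a string.

```python
from typing import Callable, Optional

# Hypothetical stand-ins: a "model" is any function from a prompt to text.
Model = Callable[[str], str]
# A reviewer returns None to approve, or a critique string to request changes.
Reviewer = Callable[[str, str], Optional[str]]


def route(task: str, models: dict[str, Model], pick: Callable[[str], str]) -> str:
    """Routing: choose exactly one model for the task and return its answer."""
    return models[pick(task)](task)


def fuse(task: str, context: str, writer: Model, reviewer: Reviewer,
         max_rounds: int = 3) -> str:
    """Cross-checking: one model drafts, another reviews against shared context."""
    prompt = f"{context}\n\nTask: {task}"
    draft = writer(prompt)
    for _ in range(max_rounds):
        critique = reviewer(prompt, draft)
        if critique is None:  # reviewer approves the draft
            return draft
        # Feed the critique back to the writer and try again.
        draft = writer(f"{prompt}\n\nPrevious draft: {draft}\nFix: {critique}")
    return draft  # best effort after max_rounds


# Toy models (assumptions for illustration only).
def toy_writer(prompt: str) -> str:
    return "patch with edge case handled" if "Fix:" in prompt else "patch"


def toy_reviewer(prompt: str, draft: str) -> Optional[str]:
    return None if "edge case" in draft else "handle the empty-input edge case"


routed = route("fix bug", {"a": toy_writer}, pick=lambda task: "a")
fused = fuse("fix bug", "shared codebase notes", toy_writer, toy_reviewer)
```

In this toy run, routing returns the first draft unchanged, while the cross-checked loop returns the revised draft after one round of review.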

The Blitzy platform's SWE-Bench Pro benchmark results demonstrate that gap.

Conclusion

Foundation models are getting better quickly, and Blitzy is best positioned to reap the benefits.

Our SWE-Bench Pro score is another strong data point.

What it confirms is the thesis we have held since day one: serious software engineering requires a deep understanding of the codebase and a dynamic, multi-model architecture to match.

Blitzy is changing the unit of work and delivering the true promise of AI in software development.

Frequently asked questions

What is Blitzy?

Blitzy enables development teams to transform six-month software projects into six-day turnarounds using Blitzy OS, an agentic platform that enables thousands of AI Agents to 'think' and cooperate for hours to bulk build software with precision. The platform builds everything AI can deliver in a precise manner, around 80% of any roadmap or new product, supplemented with a human engineering guide to complete the remaining 20% needed for production. With over 27 patents and counting, Blitzy is actively hiring PhDs and senior developers in Cambridge, MA who have a passion for building AI that leverages 'System 2 Thinking' to solve problems at inference.

Who is Blitzy for?

Enterprises that aim to dramatically accelerate their software development velocity, development agencies with enterprise clients, development teams with complex existing products, and individuals looking to accelerate their own velocity on complex builds.

How does Blitzy's technology work?

Our patent-pending code ingestion framework maps a curated selection of robust, reliable, and secure open source software libraries that we track by version and update frequently. Combined with our proprietary code generation technology that specializes in enforcing enterprise-class software policies, Blitzy far exceeds the utility of typical chatbots and co-pilots in creating production-ready software at scale.

Is Blitzy a coding co-pilot?

Nope. Blitzy surpasses traditional co-pilots with its ability to autonomously generate nearly-complete code repositories, not just snippets. It features a daily-refreshed knowledge base, avoiding the pitfalls of outdated information. Blitzy's proprietary codebase representation system enables deep understanding of generated code, offering highly contextual and relevant suggestions for your entire repository.

What's my role in Blitzy's development process?

Your team is responsible for bringing the requirements and for acting as approver during the technical specification stage. We ask you to review and approve the Technical Specification. The document is editable, so you can revise it until it reflects exactly what you had in mind.

How does Blitzy decide which tasks to delegate to human developers?

Blitzy's multi-agent system is meticulously and rigorously trained to know what it can accomplish, and what needs to be left for the human engineers. This ensures you only receive quality code and have a clear picture of remaining tasks.

Does Blitzy do more than just autonomous code generation?

Yes. Blitzy is a comprehensive platform that provides end-to-end development assistance. We support the entire development lifecycle by taking descriptive inputs and generating software requirements documents, technical design, code structure, and generated code within repos for your product.

Is this high quality and secure?

Quality and security matter deeply to us — and they were our biggest frustration with the copilots already on the market. That frustration is what led us to build something different: a system designed to meet enterprise standards from the start. Every piece of work passes through multiple QA agents that review each other's output before any code reaches you, so what you receive is held to a consistent quality bar rather than the variable output typical of single-pass code generation. We deliver production-grade code repositories. As with any code entering your environment — written by humans or AI — your team should still run its own QA, QC, and security testing before deployment. We build to a high standard and give your reviewers a strong starting point; final validation stays with the team that owns the production environment.

What is the typical cost of your solution?

Blitzy uses a two-phase pricing model: evaluation followed by deployment. This structure lets enterprises validate ROI at their preferred scale before committing to organization-wide implementation.

The evaluation phase provides three options:

Reverse Engineer ($0): an initial assessment with complete codebase reverse engineering and understanding, up to 100K lines of code.
Proof of Concept ($50K for a 2-month term): Blitzy delivers a guided POC to demonstrate value.
Structured Pilot ($250K for a 6-month term): Blitzy is fully deployed in your environment, with 5M lines of onboarding and 1.25M lines of generation, to prove production readiness.

Following a successful evaluation, organizations choose among three deployment paths:

Commercial ($500K typical investment per year): adopts Blitzy on one team to accelerate a defined initiative. The first 20M lines onboarded are included, with additional onboarding at $0.10 per line and generation at $0.20 per line starting at 2.5M lines, plus dedicated infrastructure and SAML-SSO.
Enterprise ($5M typical investment per year): rolls Blitzy out across your engineering organization, with onboarding billed at $0.10 per line across the full codebase (a typical engagement onboards 50M lines) and generation at $0.20 per line as needed. It adds a Dedicated AI Solutions Consultant, 2 Forward Deployed Engineers, org-wide onboarding and certification, and priority support.
Transformation ($50M typical investment per year): supports your largest codebases, with a typical engagement onboarding 500M lines at the same per-line rates, custom deployment, and embedded teams including a Field CTO, a Dedicated AI Solutions Consultant, 6 Forward Deployed Engineers, and 2 Forward Deployed Designers for complete digital transformation.

All tiers maintain SOC 2 Type II compliance and ISO 27001 certification, and guarantee no training on your code.

Pricing follows a transparent two-rate model: $0.10 per line onboarded for reverse engineering and $0.20 per line generated for forward engineering. Because reverse engineering also produces complete technical documentation of your codebase, onboarding-only engagements are fully supported, and in every tier costs align directly with the value delivered.
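
As a rough illustration of the two-rate model, the sketch below applies only the quoted per-line rates. It is a simplification and not a quote: the function name estimate_usd is hypothetical, and the sketch ignores tier inclusions, thresholds, and anything else in an actual contract.

```python
# Rates quoted in the FAQ, held in integer cents to avoid float rounding.
ONBOARD_CENTS_PER_LINE = 10   # $0.10 per line onboarded
GENERATE_CENTS_PER_LINE = 20  # $0.20 per line generated


def estimate_usd(lines_onboarded: int, lines_generated: int = 0) -> float:
    """Simplified estimate: per-line rates only, no tier inclusions or minimums."""
    cents = (lines_onboarded * ONBOARD_CENTS_PER_LINE
             + lines_generated * GENERATE_CENTS_PER_LINE)
    return cents / 100


enterprise = estimate_usd(50_000_000)       # 5_000_000.0
transformation = estimate_usd(500_000_000)  # 50_000_000.0
```

Onboarding alone at the typical Enterprise and Transformation codebase sizes reproduces the $5M and $50M figures quoted for those tiers.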

After submitting my prompt, Blitzy added functionality in my tech spec that I did not expect. What do I do?

The system defaults to taking advantage of all technology upgrades when modernizing or upgrading to the latest technology stack. For example, if you specify an upgrade to Java 21, the system will by default implement virtual threads, as they are generally seen as a superior technical approach. If you do not want this, tell the system to 'make as few changes as possible to achieve the desired request'. Being as specific as possible about what functionality is (and is not) desired helps yield results that align with expectations.

What do Blitzy agents rely on as a source of truth to represent my existing codebase?

Blitzy agents rely on the actual source code of your existing codebase—not the Tech Spec documentation—when performing refactors or extending functionality. However, an accurate Tech Spec significantly aids the system's efficiency in querying the underlying representation of the code. Therefore, investing time to ensure the Tech Spec reflects the core features of the application will yield expectation-aligned results and will save time with last-mile development.

Can Blitzy work with existing products and code bases?

Yes! Blitzy excels at working with existing codebases, using them as a foundation to ensure consistent, high-quality development. The platform enables you to add new features to existing products, generate comprehensive documentation, and tackle technical debt by upgrading legacy systems to state-of-the-art technologies or refactoring complex codebases. Our platform deploys dedicated AI agents that map and understand your codebase before generation, ensuring intelligent, contextualized development that aligns with your existing patterns and standards.

What programming languages does Blitzy support?

Blitzy's AI platform works with all programming languages.

How should I structure my prompts for Blitzy?

Structure and organization are crucial when prompting Blitzy. The most effective prompts follow our prompting template with clear sections for WHY (vision & purpose), WHAT (core requirements), and HOW (technical details, user experience & implementation priorities). Each section should be detailed but concise, focusing on essential information while providing relevant context. Including structured frameworks and concrete examples - like data models, user stories, or feature templates - helps Blitzy deliver more precise and purposeful solutions.

What information does Blitzy need to compile and run my code?

During code generation, Blitzy compiles your codebase and performs runtime validation to ensure the generated code works correctly. To enable this, we require: (1) Internal dependencies - any private packages, libraries, or binaries not publicly available that your code needs to build and run, (2) Environment variables and secrets - API keys, credentials, and configuration values required for compilation and runtime (shared securely through our encrypted UI, never exposed to AI agents), and (3) Build instructions - the specific steps or scripts needed to compile your code, typically found in your README or setup documentation. This information allows Blitzy to replicate your development environment and verify that all generated code functions properly before delivery.

How can I exclude certain files or folders from Blitzy's code generation?

Create a .blitzyignore file in your repository's root directory to specify which files or paths Blitzy should exclude during tech-spec generation and code generation. This works similarly to .gitignore - simply list the file patterns, directories, or specific files you want Blitzy to skip, using standard gitignore syntax like *.log, /build/, or config/secrets.json. To ensure Blitzy respects these exclusions, mention in both your codebase context prompt and target state prompt that Blitzy should reference the .blitzyignore file and exclude those paths from processing.
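
A minimal sketch of creating such a file, using only the three example patterns from the answer above. The helper name write_blitzyignore is hypothetical, and the demo writes to a throwaway directory rather than a real repository.

```python
import tempfile
from pathlib import Path

# Patterns taken from the answer above; swap in your own.
PATTERNS = ["*.log", "/build/", "config/secrets.json"]


def write_blitzyignore(repo_root: Path, patterns: list[str]) -> Path:
    """Write one pattern per line to <repo_root>/.blitzyignore and return its path."""
    target = repo_root / ".blitzyignore"
    target.write_text("\n".join(patterns) + "\n", encoding="utf-8")
    return target


# Demo against a temporary directory standing in for the repository root.
with tempfile.TemporaryDirectory() as tmp:
    ignore_file = write_blitzyignore(Path(tmp), PATTERNS)
    written = ignore_file.read_text(encoding="utf-8")
```

The resulting file holds one pattern per line, the same layout a .gitignore uses.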

Can I cancel my project/job (code gen) once in progress?

At this time, jobs are not cancelable. Once submitted, a job consumes its assigned quota.
