
New AI coding benchmark reveals shocking flaws in top models today


A coding test nobody expected to fail

An AI coding competition has just named a winner, but the winning score was shockingly low. Out of all the test-takers, the top performer got only 7.5 percent correct. That's not a typo; that's the actual winning score.

This wasn’t a broken challenge or a rushed trial. It was a carefully crafted test meant to expose how AI handles real-world programming bugs. And for now, even the best tools are falling surprisingly short.


Meet the unlikely top scorer

The top performer wasn’t a big name from Silicon Valley. It was Eduardo Rocha de Andrade, a Brazilian prompt engineer, who took home $50,000 in prize money. He earned it by solving just a tiny slice of the test.

It's not that he failed; it's that the test was brutally tough. Solving even a small share of the problems showed real skill under tight rules and limited computing resources.


Why 7.5 percent matters now

Scoring under 10 percent may sound awful, but in this case, it proves how difficult the challenge really was. The test pushed AI to deal with new, unsolved coding problems.

In an age where people assume machines can do anything, the low score acts like a hard stop. It says we still have miles to go before AI can write reliable software without human help.


The money behind the mission

Andy Konwinski helped launch the challenge and is offering $1 million to the first open source model that can score above 90 percent. The cash isn’t just for fun; it’s a signal to take this seriously.

He wants developers around the world to push boundaries without needing huge servers or closed systems. The goal is to build smarter AI tools that are open, fair, and ready to work in real-world situations.


No big labs in this first round

Some major AI labs didn’t take part in the opening round. That wasn’t a fluke; the rules made things harder for heavyweight models that rely on massive computing power.

The test runs offline using limited resources, leveling the field for smaller models. It’s not about size or fame. It’s about who can actually solve difficult code problems with efficient thinking and cleaner tools.


Built to be a clean playing field

This test avoids a problem called benchmark contamination, which happens when models have already trained on the exact problems they are later tested on. The K Prize sidesteps it by using only fresh issues.

It relied on GitHub bugs posted after the model entry deadline. This keeps the test fair and honest. Models couldn’t memorize anything ahead of time, making each answer a genuine attempt at solving something new.
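The idea can be sketched in a few lines: only issues filed after the model entry deadline are eligible for the test set. This is a minimal illustration, not the K Prize's actual pipeline (which hasn't been published); the issue records, dates, and function name below are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical issue records standing in for real GitHub issues.
issues = [
    {"id": 101, "title": "Crash on empty input", "created_at": "2025-02-01T10:00:00+00:00"},
    {"id": 102, "title": "Off-by-one in pagination", "created_at": "2025-04-10T08:30:00+00:00"},
    {"id": 103, "title": "Race condition in cache", "created_at": "2025-05-02T16:45:00+00:00"},
]

# Models are frozen at the entry deadline (date below is made up);
# only issues filed AFTER it can appear in the test, so nothing
# in the test set could have been memorized during training.
ENTRY_DEADLINE = datetime(2025, 3, 12, tzinfo=timezone.utc)

def fresh_issues(issues, deadline):
    """Keep only issues created after the model entry deadline."""
    return [
        i for i in issues
        if datetime.fromisoformat(i["created_at"]) > deadline
    ]

test_set = fresh_issues(issues, ENTRY_DEADLINE)
print([i["id"] for i in test_set])  # issues 102 and 103 survive the filter
```

Because the cutoff is a simple timestamp comparison, anyone can verify after the fact that no test problem existed before the models were locked in.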


A new kind of coding benchmark

The K Prize is being compared to an older system called SWE-Bench. While SWE-Bench reuses a fixed set of problems, this new test refreshes its problem set every round.

That means no training beforehand, no shortcuts, and no pattern recognition. It’s about real understanding. Can a model look at a brand-new issue and come up with working code under pressure? That’s the big question.


Real code, not textbook examples

The coding problems weren’t made-up examples or school-style puzzles. They were pulled from actual GitHub issues posted by real developers working on live projects.

That means the problems were unpredictable, sometimes messy, and very hard to fix. AI had to read through unclear code and broken functions and figure out how to help, just like a human programmer.


Testing more than just output

This wasn’t about copying and pasting pretty-looking code. It was about writing solutions that actually worked. The code had to fix the issue and do it without breaking anything else.

AI models had to think through the logic of the problem, understand the context, and deliver functional results. That’s a far cry from just generating something that looks correct at first glance.
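That "fix it without breaking anything else" standard is how benchmarks like this are typically graded: run the project's tests before and after the patch, and accept the fix only if the failing tests now pass and no previously passing test regresses. The grader below is a simplified sketch of that logic; the function and test names are hypothetical, not the K Prize's real harness.

```python
# Hypothetical grading logic: a patch counts only if it fixes the
# reported failure AND introduces no regressions elsewhere.

def grade_patch(before, after):
    """before/after map test names to True (pass) or False (fail)."""
    newly_fixed = [t for t, ok in after.items() if ok and not before.get(t, False)]
    regressions = [t for t, ok in after.items() if not ok and before.get(t, False)]
    return len(newly_fixed) > 0 and len(regressions) == 0

before = {"test_bug": False, "test_existing_a": True, "test_existing_b": True}
good   = {"test_bug": True,  "test_existing_a": True, "test_existing_b": True}
bad    = {"test_bug": True,  "test_existing_a": False, "test_existing_b": True}

print(grade_patch(before, good))  # True: bug fixed, nothing broken
print(grade_patch(before, bad))   # False: the fix broke another test
```

This is why code that merely "looks correct" scores zero: it has to survive the full test suite, not just the one failing case.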


The test that resets expectations

Most people think AI is nearly superhuman by now. It can pass bar exams, write books, and build apps. So this low score was a surprising wake-up call.

It showed that AI struggles when faced with unfamiliar bugs. Even basic programming tasks, when taken from real life, can throw off today’s best tools. This test pulled back the curtain in a big way.


A reminder the hype is ahead of reality

There’s been a lot of buzz around AI taking over professional roles like doctors, lawyers, and coders. But results like this show we’re still far from that world.

It's not about killing the hype; it's about keeping things honest. This challenge showed that today's AI is still learning and far from mastering tasks we take for granted as simple.


Making room for smaller players

Because the test didn’t require huge processing power, it allowed smaller models and independent teams to join. This was a key design choice to break tech industry patterns.

It opens the door for more innovation from unexpected places. You don’t need to work at a billion-dollar lab to compete; you just need smart ideas and the skill to build models that can think clearly.


Why the 90 percent goal matters

Scoring 90 percent on this test isn’t just about bragging rights. It’s a way to show that an AI model can handle tough, unscripted coding tasks under real conditions.

The prize money makes it tempting, but the real reward is bigger. It would mean that someone built a tool that can truly help software teams, bug trackers, and even open source projects around the world.


A test that grows over time

This wasn’t a one-time contest. New rounds will keep happening, each with new issues and harder challenges. It’s a rolling experiment with more chances to learn.

The goal is to watch how AI coding tools improve. Can they adapt? Can developers learn from failure and build stronger models? That ongoing progress is what gives the K Prize lasting value.


A puzzle with real stakes

This isn’t just tech trivia. These tests help shape the tools that might fix your favorite app or run your smart home. If an AI can’t pass, it shouldn’t code for you.

Building trust in AI starts by testing it honestly. The K Prize makes sure that the models helping us with code can actually do what they promise, no shortcuts or smoke and mirrors.

As AI continues to reshape how we work, learn, and stay competitive in the job market, it’s clear that mastering AI today can protect your career for years to come.


What comes next is in our hands

The low scores might look disappointing, but they give us a clear view of what needs to improve. This challenge could guide the future of AI development.

Every new round brings better models and deeper understanding. And that means better tools, stronger code, and maybe someday, AI that earns its place in your favorite software. The race is just beginning.

As the role of engineers evolves alongside the rise of AI-powered tools, adaptability will be key to staying relevant. For an insider’s perspective, see how GitHub CEO reveals the key to thriving as an engineer in the AI coding era.

Think AI can crack 90 percent soon? Drop your prediction in the comments and let us know what you’re rooting for.


This slideshow was made with AI assistance and human editing.

