
xAI used gig workers to boost Grok past Claude in coding tests


Grok’s surprising edge in coding benchmarks

Grok recently showed strong performance in internal coding benchmarks, reportedly surpassing Claude in some evaluations.

The model showed particular strength in solving competitive coding problems and debugging code snippets. While Claude remains a top-tier model in many areas, Grok’s improved coding performance marks a notable capability shift.

xAI’s push to close the gap with rivals appears to be gaining traction in the developer space.


Gig workers played a quiet but vital role

xAI quietly recruited gig workers to help refine Grok’s performance. These workers were tasked with evaluating code completions, ranking responses, and flagging errors during fine-tuning. Many were hired through platforms like Upwork and other freelance marketplaces.

Their work helped identify weaknesses in Grok’s outputs and provided critical feedback that improved model behavior. This practice, while not uncommon in AI training, was strategically used to rapidly enhance Grok’s coding ability before key evaluations. The role of these workers was largely kept out of public view.


How reinforcement tuning shaped Grok’s coding edge

xAI used reinforcement learning from human feedback (RLHF) to optimize Grok for coding tasks. Gig workers supplied much of this feedback by comparing outputs and selecting the better answer; those selections served as reward signals during training.

Over time, Grok learned to prioritize correct logic, efficient syntax, and clean formatting. This kind of tuning is resource-intensive but can significantly improve a model’s performance on structured tasks like programming. Grok’s recent gains suggest the method paid off under xAI’s direction.
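The feedback loop described above can be sketched in miniature. The snippet below is a simplified illustration, not xAI's actual pipeline: it reduces a "reward model" to a plain win-rate tally over hypothetical human A-versus-B judgments.

```python
# Toy illustration of preference-based feedback in RLHF-style tuning.
# Assumption: each human comparison is recorded as (winner_id, loser_id).
from collections import defaultdict

def tally_preferences(comparisons):
    """Turn pairwise human judgments into a per-response score.

    comparisons: list of (winner_id, loser_id) pairs from raters.
    Returns a dict mapping each response id to its win rate.
    """
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return {rid: wins[rid] / appearances[rid] for rid in appearances}

# Raters preferred response "a" over "b" twice, and "b" over "c" once.
scores = tally_preferences([("a", "b"), ("a", "b"), ("b", "c")])
# "a" wins 2 of 2 appearances, "b" wins 1 of 3, "c" wins 0 of 1.
```

In production systems the tally would be replaced by a learned reward model, but the core idea is the same: human rankings become a scalar signal the model is trained to maximize.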


Grok’s test environment mimicked real-world coding

The evaluations in which Grok beat Claude were designed to reflect real-world developer scenarios, including challenges like writing Python functions, debugging existing scripts, and optimizing slow code.

The tests emphasized both correctness and efficiency. xAI ensured Grok was exposed to similar exercises during training, helping it adapt to the testing format. This close alignment between training data and test conditions gave Grok a meaningful advantage in benchmark competitions designed to simulate developer needs.
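As an illustration of the kind of exercise such benchmarks favor (the specific task here is hypothetical, not from xAI's test set), consider "optimize a slow function while preserving correctness":

```python
def two_sum_slow(nums, target):
    # O(n^2) baseline: check every pair of indices.
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return (i, j)
    return None

def two_sum_fast(nums, target):
    # O(n) rewrite: remember each value's index as we scan once.
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return (seen[target - x], i)
        seen[x] = i
    return None

# Both return the same answer; only the runtime differs.
assert two_sum_slow([2, 7, 11, 15], 9) == two_sum_fast([2, 7, 11, 15], 9)
```

A grader for this style of task checks both that the outputs match a reference solution (correctness) and that the rewrite runs within a time budget (efficiency).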


Claude’s strength lies in nuanced reasoning

Although Grok performed better in coding tasks, Claude remains stronger in complex reasoning and philosophical analysis. Claude’s training emphasizes deep understanding, logical coherence, and long-form thought.

In contrast, Grok has been pushed toward snappier, task-based performance. This explains why Claude still leads in academic-style questions, but Grok has started catching up in more practical fields like coding. These different strengths are shaped by how each model was trained and which kinds of data they prioritize.


xAI’s sprint strategy after Grok 1’s debut

After releasing Grok 1, xAI quickly moved to iterate and improve its model. This internal sprint included recruiting extra help through gig platforms and deploying targeted tests to benchmark coding results.

The goal was to outpace competitors like Anthropic by focusing on specific domains where Grok could shine. Coding became a top priority because it’s a concrete, measurable area of AI performance. This tactical shift allowed xAI to produce Grok versions that could perform competitively in a short period.


Gig labor raises fresh ethical questions

xAI’s use of gig workers has reignited ethical concerns about how AI companies scale quickly. Many of these contributors weren’t fully briefed on how their work would be used or how central it was to Grok’s success.

While this practice isn’t new in tech, critics argue it shows how dependent modern AI systems are on underpaid, often invisible labor. The push to win benchmarks may overshadow transparency and fairness for those doing the behind-the-scenes work that powers these tools.


Anthropic responded by doubling down on Claude 3

Following Grok’s rise in coding benchmarks, Anthropic reportedly accelerated development on Claude 3. Updates included improvements to code generation and a refined training pipeline. While Claude remains a popular choice among enterprise users, xAI’s advances pressure Anthropic to innovate faster.

Claude 3 now shows noticeable improvements in speed and reliability, especially in structured domains. Both companies are in an arms race to dominate key use cases, and coding ability has emerged as a significant competitive factor.


Internal testing leaked to developers online

Some coding test comparisons between Grok and Claude were shared in developer forums before official statements. These posts highlighted Grok’s success in generating cleaner code with fewer bugs.

Users reported side-by-side comparisons of JavaScript and Python solutions that favored Grok’s outputs. While not all of these leaks were verified, they sparked conversations across Reddit and GitHub about whether Grok was finally becoming a serious rival to Anthropic’s best models in coding.


xAI targeted performance in Python and JavaScript

Grok’s improvements were especially noticeable in Python and JavaScript, the two most widely used languages for AI development and web applications. xAI trained Grok heavily on codebases from GitHub and supplemented them with curated datasets designed for accuracy and coverage.

Gig workers helped correct syntax errors and re-rank completions in both languages. By targeting Python and JavaScript specifically, xAI positioned Grok to perform well in real-world projects and technical interviews, which often center on these programming languages.


Elon Musk’s hands-on role in Grok’s progress

Elon Musk reportedly pushed for Grok’s improvements, especially after its underwhelming early reception.

He is said to have approved budgets for expanded gig labor and faster training cycles, and his competitive mindset shaped Grok's roadmap, turning it into a model optimized for practical, testable tasks. This management approach mirrors how Musk drives development across his other companies.


Grok’s code explanations improved significantly

One area where Grok showed marked improvement was in explaining code logic. Earlier versions often gave vague or overly complex answers when asked to describe what a block of code did. With more fine-tuning and human ranking, Grok generates clearer, step-by-step explanations.

This makes it especially useful for beginners learning programming concepts. By pairing code with clear explanations, Grok positions itself as both a teaching tool and a coding assistant.


The training process focused on common coding mistakes

To improve Grok’s reliability, xAI focused training on common developer mistakes like off-by-one errors, infinite loops, and incorrect variable naming. Gig workers were given tasks to spot and label these issues in Grok’s outputs.

By learning from thousands of these corrections, Grok began avoiding the bugs that commonly trip up developers. This hands-on approach helped Grok produce usable, error-free code that passed more automated unit tests.
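A hypothetical example of the simplest error class mentioned above, an off-by-one index:

```python
def last_element_buggy(items):
    # Off-by-one: valid indices run 0..len(items)-1, so this
    # reads one slot past the end and raises IndexError.
    return items[len(items)]

def last_element_fixed(items):
    # Corrected index (items[-1] is the more idiomatic Python form).
    return items[len(items) - 1]
```

Labeling tasks of this shape ask a reviewer to flag the buggy version and confirm the fix, producing exactly the kind of correction data the article describes.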


Industry insiders see Grok’s gains as credible

Experts in AI development acknowledged Grok’s recent performance gains as legitimate. Several AI researchers noted that Grok had started returning cleaner, more logically consistent code than previous versions.

While it’s unclear how Grok will perform at enterprise scale, its improvements in structured tasks like coding are being taken seriously. Developers who tested it firsthand confirmed the quality bump, especially when working with short scripts and functions. This marks a turning point for xAI’s presence in the space.



The coding battle is reshaping AI priorities

The race between Grok and Claude highlights a broader shift in AI development. Instead of general-purpose intelligence, companies are now chasing domain-specific mastery, especially in programming.

Tools that can generate high-quality code have enormous value for businesses and developers. Grok’s recent success shows that focused investments in areas like software development can yield quick, measurable wins. As benchmarks evolve, we may see even more models tuned aggressively for specific tasks rather than broad conversational ability.



This slideshow was made with AI assistance and human editing.
