A look at (the demo of) Devin, the AI-powered software engineer
15 April edit: a few hours after publishing this article in March, I talked on Twitter with Aaron Meurer (a SymPy maintainer) about potential data contamination, which made me suspicious, so I added the prefix "(the demo of)" to the title and a disclaimer about SWE-Bench's recency. Today, Devin has been "canceled" by a significant portion of the developer community for overpromising with a demo and underdelivering. I haven't gotten access to the product, so I don't have an opinion of my own, but I still believe it might be a valuable product, even if it is "just" a ChatGPT wrapper.
On March 12, 2024, Cognition Labs, a name new to the public (but with the IOI '14 #1 gold medalist behind it), showed a demo of Devin, the "AI-powered software engineer":
The only way to access Devin is to sign up on a waitlist form, so we cannot be sure how representative the video is of its actual capabilities. But let's analyze their work, the investors, the team, and the claims on Twitter.
They claim they found a real job on Upwork.com (a computer-vision task in Python) and had Devin complete it:
… Devin runs into a PyTorch version issue, fixes it, asks a few questions about the expected output, and fully completes the task! This is impressive, but it doesn't end there. One of the extra videos shows a harder case where the user gives Devin a link to a GitHub repo, an old fork of the math library SymPy, with a genuinely difficult bug in logarithm calculations:
Devin clones the repo, analyzes the whole SymPy codebase (!), finds a division error in the logarithm code, fixes it, and even runs the logarithm-related tests to make sure the fix doesn't break other parts of the project:
And sure enough, if you look at SymPy, this is the real-world issue "#17148: Incorrect extraction of base powers in log class".
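For context, the "extraction of base powers" the issue title refers to is the simplification log_b(b^k · r) = k + log_b(r). Here is a minimal illustration of what a correct extraction looks like, in plain Python with fractions; this is my own sketch of the idea, not SymPy's actual implementation:

```python
from fractions import Fraction

def extract_base_power(arg: Fraction, base: int) -> tuple[int, Fraction]:
    """Split a positive rational into base**k * rest, so that
    log(arg, base) == k + log(rest, base). Illustrative only."""
    assert arg > 0 and base > 1
    k = 0
    # Pull factors of `base` out of the numerator...
    while arg.numerator % base == 0:
        arg /= base
        k += 1
    # ...and out of the denominator (these contribute negative powers).
    while arg.denominator % base == 0:
        arg *= base
        k -= 1
    return k, arg

# log(8/499, 2) == 3 + log(1/499, 2)
print(extract_base_power(Fraction(8, 499), 2))  # (3, Fraction(1, 499))
```

Getting this kind of bookkeeping wrong for rational (as opposed to integer) arguments is exactly the class of subtle bug the issue describes.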
Considering it's an issue from 2019, is there a chance it's in the training dataset? I don't know, but honestly it's not even the important question. What's truly important is how Devin handles large contexts and navigates such a huge repository; SymPy is a very complex project! My guess is that the secret sauce lies in iteratively searching and analyzing the repository while feeding only the most relevant details to a transformer/LLM.
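To make that speculation concrete, here is a minimal sketch of the kind of loop I am imagining. This is pure guesswork on my part: `llm()` is a placeholder for any chat-completion call, and nothing here reflects Cognition's actual architecture.

```python
from pathlib import Path

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion call; not a real API."""
    raise NotImplementedError

def solve_issue(repo: Path, issue: str, max_steps: int = 10) -> str:
    # List candidate files once; a real system would rank and filter them.
    files = [str(p.relative_to(repo)) for p in repo.rglob("*.py")]
    context = ""
    for _ in range(max_steps):
        # Ask the model which file to inspect next, given what it has seen.
        choice = llm(
            f"Issue: {issue}\nFiles: {files}\nContext so far:\n{context}\n"
            "Reply with ONE file path to inspect next, or DONE."
        ).strip()
        if choice == "DONE":
            break
        # Feed only the chosen file into the growing context.
        context += f"\n--- {choice} ---\n" + (repo / choice).read_text()
    # Finally, ask for a patch against the accumulated context.
    return llm(f"Issue: {issue}\n{context}\nWrite a unified diff fixing the issue.")
```

The point of a loop like this is that the model never has to hold the whole codebase in its context window, only the slices it decided were relevant.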
The supporters and investors include Peter Thiel's Founders Fund, DoorDash co-founder Tony Xu, Stripe co-founders Patrick and John Collison, the founder of Modal Labs (the same Modal I mentioned in a previous post), and many others.
As for the founders: Scott Wu is a Harvard graduate, Steven Hao is an MIT graduate, and both already have professional experience in software engineering and machine learning. They also have serious competitive programming credentials, with 10 gold medals at the IOI (International Olympiad in Informatics) across the team, which I admire (timed contests on difficult problems are not for everybody 😄).
People on social media quickly noticed that Wu is a Legendary Grandmaster on Codeforces and, as a kid, was a really, really fast MATHCOUNTS champion, to the point where he seems like a math-expression autocomplete.
So, will Devin replace software engineers? My take: 70% no, 30% yes. I'm sure that setting up or rebuilding many projects requires painful package-manager commands (and all the dependency-hell experiences engineers are used to), external and shared libraries, and obtaining tokens from third-party services, all of which is outside the scope of what state-of-the-art machine-learning assistants can do.
That being said, if a human does the painful parts, ML assistants will dominate many tasks. Boilerplate code is definitely going to be a solved problem. A huge share of debugging and compiler-error tasks can be automated, and some are even better suited to computers than to humans: especially those revolving around generics, floating-point math, nested loops with many index variables, and macros (see the sketch below).
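As a toy illustration of the "nested loops with many index variables" class, here is the kind of swapped-index mistake that is tedious for a human to spot but mechanical for a tool; the example is mine, not from the demo:

```python
# A classic swapped-index bug site: a matrix transpose. Writing
# out[i][j] instead of out[j][i] crashes on non-square inputs and
# silently returns the original matrix on square ones.
def transpose(m):
    n_rows, n_cols = len(m), len(m[0])
    out = [[0] * n_rows for _ in range(n_cols)]
    for i in range(n_rows):
        for j in range(n_cols):
            out[j][i] = m[i][j]  # easy to mistype as out[i][j]
    return out

assert transpose([[1, 2, 3], [4, 5, 6]]) == [[1, 4], [2, 5], [3, 6]]
```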
A key selling point they claim is a 13.86% unassisted score on SWE-Bench. So what's SWE-Bench?
SWE-Bench is a paper and benchmark from Princeton University and the University of Chicago built from real GitHub issues. The input is the whole codebase plus a human-readable issue, and the expected output is a patch to the codebase; fundamentally, the task is to resolve the issue without breaking the tests. The benchmark is really hard: approaches built on Claude and GPT-4 solve only 0-2% of the tasks:
… This is why SWE-Bench also has an "assisted" setting, where the model is told exactly which files need to be edited, which gets Claude and GPT-4 to about 4%. But Devin claims 13.86% unassisted. I find this impressive, but the benchmark is also very recent (it comes from ICLR 2024), and I expect many projects to reach the 4-6% range once it becomes more mainstream and well recognized.
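For concreteness, each SWE-Bench instance pairs a repository snapshot with an issue and the tests that the reference fix addressed. A simplified sketch of the shape of an instance follows; the field names are approximate and the test names hypothetical, so check the dataset for the real schema:

```python
# Simplified sketch of a SWE-Bench instance (field names approximate,
# test names hypothetical).
instance = {
    "repo": "sympy/sympy",                      # source repository
    "base_commit": "<sha>",                     # codebase snapshot to patch
    "problem_statement": "Incorrect extraction of base powers in log class",
    "FAIL_TO_PASS": ["test_log_base"],          # tests the patch must make pass
    "PASS_TO_PASS": ["test_exp_log"],           # tests the patch must not break
}
# The model gets the repo at base_commit plus the problem statement and
# must emit a patch; scoring applies the patch and runs both test sets.
```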
I hope we get access to Devin soon so that we can evaluate it without guessing, but I'm optimistic about its performance. It's way, way too early to claim that software engineering has been automated or made obsolete; I think Devin and similar tools will simply be tools that accelerate talented engineers.
On top of that, if you are scared, consider that software engineering is not just about making pull requests for a small subset of bug types; it's also about planning, negotiating, meeting clients' needs, and talking to non-engineers.
In any case, I wish Cognition Labs a successful launch!