Happy new year! The shutdown period is often when the noise drops and the big questions surface. This year, that quiet coincided with genuinely striking progress in AI coding agents, especially in the six to eight weeks following the release of Claude Opus 4.5, and its integration into Claude Code.
We have seen world-class software engineers report a clear step change in how these agents feel. Some describe a shift from being a 10x engineer to something closer to 100x. Others report a human-effort year’s worth of usable production code generated in under an hour. My own experiments over the break gave me the first real tingling sense that, if AGI arrives gradually, a coding agent may be how we first notice it.
Against that backdrop, Qinghua Lu and I here at CSIRO’s Data61 have been thinking hard about the bigger picture. We are no longer primarily doers. 𝗠𝗼𝗿𝗲 𝗽𝗿𝗼𝘃𝗼𝗰𝗮𝘁𝗶𝘃𝗲𝗹𝘆, 𝘄𝗲 𝗺𝗮𝘆 𝗻𝗼 𝗹𝗼𝗻𝗴𝗲𝗿 𝗯𝗲 𝘁𝗵𝗲 𝗽𝗿𝗶𝗺𝗮𝗿𝘆 𝘃𝗲𝗿𝗶𝗳𝗶𝗲𝗿𝘀 𝗼𝗿 𝗼𝘃𝗲𝗿𝘀𝗲𝗲𝗿𝘀 𝗲𝗶𝘁𝗵𝗲𝗿.
The real leverage of coding agents comes from 𝘀𝗰𝗮𝗹𝗮𝗯𝗹𝗲 𝘃𝗲𝗿𝗶𝗳𝗶𝗰𝗮𝘁𝗶𝗼𝗻. When behaviour can be cheaply accepted or rejected by automated checks, “hallucinations” become exploratory branches rather than failures. The human role, for now, shifts to working with AI to decompose problems and plan work so that automated, scalable verification is possible.
We capture this argument in our new paper, 𝗩𝗲𝗿𝗶𝗳𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆-𝗙𝗶𝗿𝘀𝘁 𝗔𝗜 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝗶𝗻 𝘁𝗵𝗲 𝗘𝗿𝗮 𝗼𝗳 𝗔𝗜𝘄𝗮𝗿𝗲. The core claim is this: the breakthrough is not pushing AI automation into existing workflows while humans catch AI mistakes, but redesigning problems around 𝘀𝗰𝗮𝗹𝗮𝗯𝗹𝗲/𝗮𝘂𝘁𝗼𝗺𝗮𝘁𝗲𝗱 𝘃𝗲𝗿𝗶𝗳𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆. Identify tasks with intrinsic scalable verification, or decompose them in radically different ways. Otherwise, you are simply moderately accelerating workflows that were never designed to scale verification, where verification becomes the bottleneck and hallucination feels costly rather than a creative asset.
It feels like an exciting, slightly unsettling year ahead. How are you rethinking verification and oversight as something that must be automated and scalable by design?

