ICMI 2025 Keynote – Future of Human Oversight

It was a great pleasure to deliver a keynote yesterday at the 27th ACM International Conference on Multimodal Interaction, where I explored the evolving nature and future of human oversight.

Recent reports such as OpenAI's GDPval and METR's evaluations show that AI systems can now autonomously perform complex tasks and often surpass human experts on some of them, with productivity gains on the order of hundreds of times faster or cheaper. Yet this promise collapses if the way experts oversee AI outputs is not carefully designed. GDPval's results show that oversight effort alone can consume one-third of the time a human would take to do the task from scratch, reducing the net productivity gain (fixing included) to only around 20%.

The next frontier for productivity gains is not just more capable AI, but more efficient human oversight.

In my talk, I argued for:

– ๐—˜๐˜…๐—ฝ๐—น๐—ผ๐—ถ๐˜ ๐˜๐—ต๐—ฒ โ€œ๐˜€๐—ผ๐—น๐˜ƒ๐—ฒ-๐˜ƒ๐—ฒ๐—ฟ๐—ถ๐—ณ๐˜† ๐—ฎ๐˜€๐˜†๐—บ๐—บ๐—ฒ๐˜๐—ฟ๐˜†.โ€ Many problems inherently exhibit this asymmetry (for example, factoring large numbers versus verifying the result through simple multiplication); in other cases, the asymmetry can be intentionally designed. The secret sauce behind much of AIโ€™s productivity lies in ensuring that verification is far cheaper than solving, without sacrificing rigour.

– ๐—ฆ๐—ต๐—ถ๐—ณ๐˜ ๐—ณ๐—ผ๐—ฐ๐˜‚๐˜€ ๐—ฏ๐—ฒ๐˜†๐—ผ๐—ป๐—ฑ ๐—”๐—œ ๐—ผ๐˜‚๐˜๐—ฝ๐˜‚๐˜๐˜€. Oversight should not centre on the task output itself but on the additional artefacts generated around it. Effective oversight depends on the tools, rationale/evidence artefact bundles, and multimodal signals that make verification easier and more meaningfulโ€”not on simply showing humans the raw output and asking, โ€œDoes this look right?โ€

– ๐—จ๐˜€๐—ฒ ๐—ฟ๐—ถ๐—ด๐—ผ๐—ฟ๐—ผ๐˜‚๐˜€ ๐—ฒ๐˜ƒ๐—ฎ๐—น๐˜‚๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐˜๐—ผ ๐˜๐—ฟ๐—ถ๐—ฎ๐—ด๐—ฒ ๐—ผ๐˜ƒ๐—ฒ๐—ฟ๐˜€๐—ถ๐—ด๐—ต๐˜ ๐—ป๐—ฒ๐—ฒ๐—ฑ๐˜€. Identify subtasks or task types where AI performance is reliably above human level and low risk if wrong; these may not require active human oversight and can instead rely on LLM-as-judge or post-hoc sampling and monitoring.

– ๐—ฆ๐—ฒ๐—ฒ ๐—ผ๐˜ƒ๐—ฒ๐—ฟ๐˜€๐—ถ๐—ด๐—ต๐˜ ๐—ฎ๐˜€ ๐—น๐—ฒ๐—ฎ๐—ฟ๐—ป๐—ถ๐—ป๐—ด, ๐—ป๐—ผ๐˜ ๐—ด๐˜‚๐—ฎ๐—ฟ๐—ฑ๐—ถ๐—ป๐—ด. Oversight should move beyond catching AI mistakes toward understanding, steering, and upskilling human capabilities. Properly designed oversight can enhance human expertise more effectively than repetitive task execution, turning oversight into a mechanism for upskilling and knowledge retention.

Our research at CSIRO's Data61 is working deeply on these issues: quantifying oversight efficiency, designing oversight-enabling tools, and developing insights to identify and engineer "solve-verify asymmetry."

We are seeking partners across industry and government who want to rethink human oversight, not merely for compliance or safety, but as a strategic capability that unlocks both dramatic productivity gains and accelerated workforce learning.

Slides: https://www.linkedin.com/feed/update/urn:li:activity:7384708499882160128/


About Me

Director/Head of CSIRO’s Data61
Conjoint Professor, CSE UNSW

For other roles, see LinkedIn & Professional activities.

If you’d like to invite me to give a talk, please see here & email liming.zhu@data61.csiro.au