As part of the international expert advisory group representing Australia, I’m pleased to see the release of the second Key Update of the AI Safety Science Report, led by Yoshua Bengio (link in the comments). The official update captures broad progress: more robust adversarial training, better tracking of AI-generated content, wider adoption of frontier safety frameworks, and early use of technical safeguards to support governance measures. It also makes clear that attackers can still bypass leading systems with modest effort, and that small data injections can introduce meaningful vulnerabilities.
Alongside these findings, CSIRO’s Data61 is actively monitoring several deeper technical signals that do not yet feature prominently in international reporting but matter for long-term AI safety.
𝗙𝗶𝗿𝘀𝘁, 𝘁𝗵𝗲 𝗰𝗼𝘀𝘁 𝗮𝘀𝘆𝗺𝗺𝗲𝘁𝗿𝘆 𝗮𝘁 𝗺𝗼𝗱𝗲𝗹 𝗹𝗲𝘃𝗲𝗹 𝗶𝘀 𝗲𝘅𝗽𝗮𝗻𝗱𝗶𝗻𝗴. It now takes far fewer samples to subvert or reverse safety mechanisms than to build and maintain them. This asymmetry challenges the durability of any defence that relies solely on model-level tuning. 𝗦𝘆𝘀𝘁𝗲𝗺-𝗹𝗲𝘃𝗲𝗹 𝗱𝗲𝗳𝗲𝗻𝗰𝗲-𝗶𝗻-𝗱𝗲𝗽𝘁𝗵 𝗰𝗮𝗻 𝘀𝘁𝗶𝗹𝗹 𝗯𝗲 𝗿𝗼𝗯𝘂𝘀𝘁.
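To make that last point concrete, here is a minimal, purely illustrative sketch (in Python, with hypothetical names and placeholder checks, not a description of any deployed system): independent input and output gates wrap the model call, so subverting the model's own safety tuning does not by itself defeat the overall system.

```python
# Illustrative sketch only: layered, system-level checks around a model call,
# independent of any model-level safety tuning. All names are hypothetical.
import re

BLOCKED_PATTERNS = [
    re.compile(r"ignore (all|previous) instructions", re.IGNORECASE),
]

def input_gate(prompt: str) -> bool:
    """First layer: reject prompts matching known injection patterns."""
    return not any(p.search(prompt) for p in BLOCKED_PATTERNS)

def call_model(prompt: str) -> str:
    """Placeholder for the underlying (possibly safety-tuned) model."""
    return f"[model response to: {prompt}]"

def output_gate(response: str) -> bool:
    """Second layer: screen outputs with an independent filter or classifier."""
    return "BEGIN_EXPLOIT" not in response  # stand-in for a real output policy

def guarded_query(prompt: str) -> str:
    """Defence-in-depth: one layer can fail without the whole system failing."""
    if not input_gate(prompt):
        return "Request declined by input policy."
    response = call_model(prompt)
    if not output_gate(response):
        return "Response withheld by output policy."
    return response

print(guarded_query("Summarise the latest safety report."))
```

The design point is that the gates sit outside the model, so an attacker who cheaply undoes model-level tuning still faces controls they cannot fine-tune away.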
𝗦𝗲𝗰𝗼𝗻𝗱, 𝘀𝗼𝗺𝗲 𝗺𝗼𝗱𝗲𝗹-𝗹𝗲𝘃𝗲𝗹 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 𝗶𝗻𝘁𝗲𝗿𝘃𝗲𝗻𝘁𝗶𝗼𝗻𝘀 𝗰𝗿𝗲𝗮𝘁𝗲 𝗻𝗲𝘄 𝗯𝗲𝗵𝗮𝘃𝗶𝗼𝘂𝗿𝘀. Stress-testing and alignment optimisation have triggered sabotage-like behaviours and specification gaming in reasoning models. Attempts to monitor model reasoning can lead models to conceal reasoning traces or encode responses using subliminal signalling channels. These second-order effects are early signals that model-level interventions may reduce, rather than increase, controllability. 𝗦𝘆𝘀𝘁𝗲𝗺-𝗹𝗲𝘃𝗲𝗹 𝗱𝗲𝗳𝗲𝗻𝗰𝗲𝘀 𝗱𝗼 𝗻𝗼𝘁 𝗽𝗹𝗮𝗰𝗲 𝘁𝗵𝗲 𝘀𝗮𝗺𝗲 𝗽𝗿𝗲𝘀𝘀𝘂𝗿𝗲 𝗼𝗻 𝗺𝗼𝗱𝗲𝗹𝘀.
𝗧𝗵𝗶𝗿𝗱, 𝘀𝗰𝗮𝗹𝗲 𝗲𝗳𝗳𝗲𝗰𝘁𝘀 𝗶𝗻 𝗺𝗼𝗻𝗶𝘁𝗼𝗿𝗶𝗻𝗴 𝗮𝗿𝗲 𝗻𝗼𝘄 𝘂𝗻𝗮𝘃𝗼𝗶𝗱𝗮𝗯𝗹𝗲. Many post-deployment systems still depend on human reporting or manual review, but volume overwhelms these channels even when AI reduces per-case errors. 𝗘𝗳𝗳𝗲𝗰𝘁𝗶𝘃𝗲 𝗼𝘃𝗲𝗿𝘀𝗶𝗴𝗵𝘁 𝗺𝘂𝘀𝘁 𝗮𝗰𝗰𝗼𝘂𝗻𝘁 𝗳𝗼𝗿 𝗿𝗲𝗮𝗹 𝗵𝘂𝗺𝗮𝗻 𝗰𝗼𝗴𝗻𝗶𝘁𝗶𝘃𝗲 𝗰𝗼𝗻𝘀𝘁𝗿𝗮𝗶𝗻𝘁𝘀 𝗮𝗻𝗱 𝘀𝗵𝗶𝗳𝘁 𝗳𝗿𝗼𝗺 𝘀𝗶𝗺𝗽𝗹𝗲 𝗲𝗿𝗿𝗼𝗿 𝗱𝗲𝘁𝗲𝗰𝘁𝗶𝗼𝗻 𝘁𝗼𝘄𝗮𝗿𝗱 𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲𝗱 𝗵𝘂𝗺𝗮𝗻–𝗔𝗜 𝗰𝗼-𝗹𝗲𝗮𝗿𝗻𝗶𝗻𝗴.
𝗙𝗼𝘂𝗿𝘁𝗵, 𝘀𝘆𝘀𝘁𝗲𝗺-𝗹𝗲𝘃𝗲𝗹 𝗽𝗿𝗼𝗽𝗮𝗴𝗮𝘁𝗶𝗼𝗻 𝗽𝗮𝘁𝗵𝘄𝗮𝘆𝘀 𝗮𝗿𝗲 𝗯𝗲𝗰𝗼𝗺𝗶𝗻𝗴 𝗰𝗹𝗲𝗮𝗿𝗲𝗿. Tool-using agents can transmit misalignment through the tool ecosystem rather than through the model weights. Retrieval-augmented pipelines can drift through personalisation dynamics. Multi-agent settings sometimes show unintended coordination driven by shared pretrained priors, not incentives. 𝗦𝘆𝘀𝘁𝗲𝗺-𝗹𝗲𝘃𝗲𝗹 𝗱𝗲𝗳𝗲𝗻𝗰𝗲𝘀 𝘁𝗵𝗮𝘁 𝗰𝗼𝗻𝘀𝗶𝗱𝗲𝗿 𝗺𝘂𝗹𝘁𝗶-𝘀𝘆𝘀𝘁𝗲𝗺 𝗲𝗳𝗳𝗲𝗰𝘁𝘀 𝗻𝗲𝗲𝗱 𝘁𝗼 𝗯𝗲 𝗱𝗲𝘃𝗲𝗹𝗼𝗽𝗲𝗱.
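As one hedged illustration of what such a defence could look like at the tool layer (again a sketch with hypothetical names, not any particular implementation): route every agent tool call through a single mediation point with an allow-list and an append-only audit trail, so misalignment cannot propagate through the tool ecosystem unobserved.

```python
# Illustrative sketch only: a system-level mediation layer for agent tool calls.
# All names are hypothetical; real deployments would use richer policies.
import json
from datetime import datetime, timezone

ALLOWED_TOOLS = {"search", "calculator"}   # explicit allow-list
AUDIT_LOG: list[dict] = []                 # append-only audit trail

def mediate_tool_call(agent_id: str, tool: str, args: dict) -> dict:
    """Every tool call from every agent passes through this single chokepoint."""
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "agent": agent_id,
        "tool": tool,
        "args": args,
        "allowed": tool in ALLOWED_TOOLS,
    }
    AUDIT_LOG.append(record)
    if not record["allowed"]:
        return {"error": f"tool '{tool}' is not on the allow-list"}
    return {"result": f"[{tool} executed with {json.dumps(args)}]"}

# Agents can request any tool, but only mediated calls reach the ecosystem.
print(mediate_tool_call("agent-7", "search", {"query": "frontier safety frameworks"}))
print(mediate_tool_call("agent-7", "shell", {"cmd": "delete everything"}))
print(f"{len(AUDIT_LOG)} calls audited")
```

The point of the chokepoint is that oversight and containment live in the system around the agents, which is exactly where the multi-system propagation pathways above would otherwise go unmonitored.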
We will continue sharing evidence and solutions.
https://www.linkedin.com/feed/update/urn:li:activity:7399552460664709120

