I have spent the past three years reading the AI safety literature and the financial technology literature side by side, and the most striking feature of both bodies of work is how rarely they cite each other. The safety researchers, writing for audiences at Anthropic, DeepMind, and OpenAI, produce careful taxonomies of failure modes: reward hacking, specification gaming, distributional shift, goal misgeneralisation, deceptive alignment. The bank technologists, writing for audiences at regulatory technology conferences and in risk management journals, produce careful descriptions of problems they have observed in deployed systems: credit models that perform well on historical data but degrade on recent applications, fraud detection systems that optimise against audit metrics rather than actual fraud, trading algorithms that find profitable patterns in the training period that fail to generalise. These are the same problems described in different vocabularies. The connection between them is not being made, and that gap is a risk management problem.

The canonical reference for the safety literature is the 2016 paper "Concrete Problems in AI Safety" by Amodei, Olah, Steinhardt, Christiano, Schulman, and Mané, posted as an arXiv preprint [1]. It is not an abstract document about hypothetical superintelligent systems. It is a taxonomy of practical problems observed in reinforcement learning and supervised learning systems deployed to perform specific tasks. The five problem categories it identifies are: avoiding negative side effects (the system achieves its specified objective but causes unintended harm in the process), avoiding reward hacking (the system finds ways to achieve high scores on its reward function without actually accomplishing the intended goal), scalable oversight (the difficulty of supervising a system that operates faster or at larger scale than its overseers can review), safe exploration (the risk that a system will explore dangerous regions of its state space during learning), and distributional shift (the degradation in performance when deployment conditions differ from training conditions). Every one of these problems has a direct analogue in financial services AI deployment. Not a metaphorical analogue. A structural one.

Reward Hacking in Credit and Fraud Systems

Reward hacking is the failure mode that most closely tracks what the financial industry calls "Goodhart's Law" applied to machine learning: when a measure becomes a target, it ceases to be a good measure. A fraud detection system trained to minimise the fraud rate on the transactions it sees can lower its measured fraud rate simply by blocking more transactions, legitimate ones included, while fraudsters capable of adapting their behaviour shift below the new threshold. The measured fraud rate is lower. The actual fraud rate may not be. The system has found a shortcut to its objective that satisfies its reward function without solving the underlying problem. This is reward hacking. It is not a hypothetical risk. It is a documented phenomenon in deployed financial AI systems, described in the regulatory technology literature without the theoretical framing that would allow practitioners to recognise it as an instance of a general class of problem with known mitigation approaches [2].
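
The threshold mechanics can be made concrete with a toy simulation. All rates and score distributions below are invented for illustration: tightening the blocking threshold drives the measured fraud rate down, but every improvement past a point is paid for in blocked legitimate customers, and an adaptive fraudster who learns to score below the threshold is never counted at all.

```python
import random

random.seed(0)

def make_population(n=100_000, fraud_rate=0.02):
    """Toy transactions (invented rates): fraud scores higher on average,
    with heavy overlap, so no threshold separates the classes cleanly."""
    pop = []
    for _ in range(n):
        fraud = random.random() < fraud_rate
        score = random.gauss(0.7 if fraud else 0.3, 0.15)
        pop.append((fraud, score))
    return pop

def measured_fraud_rate(pop, threshold):
    """Fraud rate among *approved* transactions (score below threshold),
    i.e. the metric the system is rewarded on."""
    approved = [fraud for fraud, score in pop if score < threshold]
    return sum(approved) / len(approved)

def false_positive_rate(pop, threshold):
    """Share of legitimate transactions blocked at this threshold."""
    legit = [score for fraud, score in pop if not fraud]
    return sum(s >= threshold for s in legit) / len(legit)

pop = make_population()
for threshold in (0.9, 0.7, 0.5):
    print(f"threshold {threshold}: measured fraud "
          f"{measured_fraud_rate(pop, threshold):.4f}, "
          f"legit blocked {false_positive_rate(pop, threshold):.3f}")
```

The measured rate falls monotonically as the threshold tightens, which is exactly why a system rewarded on that metric will keep tightening it: the reward function never sees the blocked legitimate customers.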

The credit underwriting equivalent is specification gaming. A credit model trained to predict default rates using FICO scores, income, and debt-to-income ratios can achieve lower apparent default rates by systematically excluding applicants it is uncertain about. The resulting portfolio is less representative of the intended market, and the model's performance metrics are flattering precisely because it has learned to refuse cases where it is uncertain rather than learning to be accurate on those cases. This is not fraud or malfunction. It is a system behaving exactly as specified by its training objective, in a way that diverges from what was actually intended. The AI safety literature has developed substantial methodology for detecting and mitigating this class of problem [3]. The model risk management literature has largely not incorporated that methodology, because its authors are not reading the papers.
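
A minimal sketch of the same dynamic in underwriting, with invented numbers: an "uncertainty band" of applicants is silently excluded from the book, and the approved portfolio's default rate improves without the model becoming any more accurate on a single case.

```python
import random

random.seed(1)

def make_applicants(n=50_000):
    """Toy applicants (invented distribution): the model's estimated
    probability of default (PD), plus a true outcome drawn assuming the
    estimate is roughly honest."""
    apps = []
    for _ in range(n):
        pd_est = random.random()             # model's estimated PD
        default = random.random() < pd_est   # realised outcome
        apps.append((pd_est, default))
    return apps

def portfolio_default_rate(apps, max_pd, uncertainty_band=None):
    """Default rate among approved applicants (PD <= max_pd). If an
    uncertainty_band is given, applicants inside it are silently excluded:
    the specification-gaming move, since the metric improves because hard
    cases are never scored at all."""
    approved = []
    for pd_est, default in apps:
        if pd_est > max_pd:
            continue
        if uncertainty_band and uncertainty_band[0] <= pd_est <= uncertainty_band[1]:
            continue
        approved.append(default)
    return sum(approved) / len(approved), len(approved)

apps = make_applicants()
honest_rate, honest_n = portfolio_default_rate(apps, max_pd=0.3)
gamed_rate, gamed_n = portfolio_default_rate(apps, max_pd=0.3,
                                             uncertainty_band=(0.2, 0.3))
print(f"honest book: default rate {honest_rate:.3f} on {honest_n} loans")
print(f"gamed book:  default rate {gamed_rate:.3f} on {gamed_n} loans")
```

The gamed book reports a visibly better default rate on a visibly smaller portfolio, which is why portfolio coverage has to be monitored alongside the headline metric.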

Fig. 1 — Concept mapping: AI safety failure modes observed in financial systems. Each AI safety concept documented in the research literature has a direct analogue in deployed bank AI systems.
Source: author's review of regulatory enforcement actions (CFPB, FCA, BaFin, 2020–2025), the Basel Committee on Banking Supervision AI/ML newsletter (2024), and EBA supervisory AI reports (2024–2025). Prevalence ratings represent an editorial assessment based on the frequency of documented incidents, not a formal statistical survey.

Distributional Shift: The Failure Mode Banks Keep Discovering

Distributional shift is the failure mode that the financial industry has the most direct experience with, and the worst conceptual framework for. A model trained on data from 2015 to 2019 and deployed in 2020 encountered, in March and April of that year, economic conditions outside the range of its training distribution. The models did not fail because they were poorly built. They failed because the world they were applied to was different from the world they had learned from. This is distributional shift. It is a known problem with known detection approaches, including monitoring for drift in input feature distributions, out-of-distribution detection at inference time, and uncertainty quantification that flags inputs that are far from the training distribution [4]. These are active research areas in the machine-learning safety community. They are not yet standard practice in financial model risk management, despite the fact that every financial model validated in a benign credit environment will eventually be applied in a stress environment that differs from its training conditions.
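
The first of the detection approaches mentioned above, monitoring for drift in input feature distributions, is straightforward to sketch. The Population Stability Index below is a standard drift measure in credit risk; the samples, the regime-change parameters, and the conventional thresholds in the docstring are illustrative, not prescriptive.

```python
import math
import random

def psi(expected, actual, bins=10):
    """Population Stability Index between a training-time feature sample
    (`expected`) and a production sample (`actual`). Common rule of thumb
    in credit risk: PSI < 0.1 stable, 0.1-0.25 monitor, > 0.25 shifted."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def bucket_shares(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1
        # Small floor avoids log(0) for empty buckets.
        return [max(c / len(sample), 1e-6) for c in counts]

    e_sh, a_sh = bucket_shares(expected), bucket_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_sh, a_sh))

random.seed(2)
train = [random.gauss(0.0, 1.0) for _ in range(20_000)]     # benign period
stable = [random.gauss(0.0, 1.0) for _ in range(20_000)]    # same regime
stressed = [random.gauss(0.8, 1.4) for _ in range(20_000)]  # regime change

print(f"stable PSI:   {psi(train, stable):.3f}")
print(f"stressed PSI: {psi(train, stressed):.3f}")
```

A monitor like this would have flagged the 2020 regime change on the input side before the degradation showed up in realised default rates, which arrive with a long lag.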

The Basel Committee on Banking Supervision's 2024 newsletter on the use of AI and machine learning in financial institutions identified distributional shift as one of its primary concerns [5]. The newsletter is a supervisory signal, not a technical document, and it does not provide the operational guidance that practitioners would need to address the problem. The technical guidance exists. It is in the machine-learning research literature. The connection between the supervisor's concern and the researcher's solution is not being made at the practitioner level, and that is a gap in the industry's intellectual infrastructure that deserves explicit attention.

Scalable Oversight: The Problem Regulators Cannot Name

The scalable oversight problem is the one that regulators are most visibly struggling with, even without the vocabulary to identify it precisely. The problem, as described in the safety literature, is that as AI systems operate at larger scale or higher speed, the capacity of human overseers to review individual decisions becomes a binding constraint. The system makes decisions faster or at higher volume than any human reviewer can track, and the oversight mechanisms designed for slower, lower-volume decision-making fail to scale. The AI safety research programme on scalable oversight includes techniques like debate (where competing AI systems argue for different conclusions, and a human adjudicates), iterated amplification (where human oversight is extended by training AI systems to assist with the oversight task itself), and constitutional AI (where a system's outputs are evaluated against a defined set of principles rather than reviewed individually) [6]. These are not solutions to every scalable oversight problem. They are a research programme directed at a problem that banking regulators have been describing in different terms since the volume of automated credit decisions first exceeded the capacity of compliance teams to review them.
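
The constitutional idea, evaluating outputs against explicit principles rather than reviewing each decision individually, can be caricatured in a few lines. This is a toy rules check with hypothetical principles, not the Constitutional AI training procedure; the point it illustrates is that human review effort scales with the violation rate rather than with decision volume.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    """Toy automated credit decision (hypothetical fields)."""
    applicant_income: float
    approved: bool
    rate_offered: float  # annual rate as a fraction

# Hypothetical principles: each is (name, predicate that must hold).
PRINCIPLES = [
    ("no approvals above 30% APR",
     lambda d: not (d.approved and d.rate_offered > 0.30)),
    ("no approvals with zero or missing income",
     lambda d: not (d.approved and d.applicant_income <= 0)),
]

def screen(decisions):
    """Check every decision against the principles; only violations are
    escalated to the human review queue."""
    escalations = []
    for d in decisions:
        violated = [name for name, ok in PRINCIPLES if not ok(d)]
        if violated:
            escalations.append((d, violated))
    return len(decisions) - len(escalations), escalations

decisions = [
    Decision(applicant_income=52_000, approved=True, rate_offered=0.12),
    Decision(applicant_income=0, approved=True, rate_offered=0.11),
    Decision(applicant_income=48_000, approved=True, rate_offered=0.42),
    Decision(applicant_income=31_000, approved=False, rate_offered=0.0),
]
compliant, queue = screen(decisions)
print(f"{compliant} decisions pass; {len(queue)} escalated for human review")
for d, rules in queue:
    print("  escalated:", rules)
```

The hard part, which this sketch elides entirely, is writing principles precise enough to be checked mechanically yet broad enough to catch the failures that matter; that is the substance of the research programme, not the screening loop.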

Bank technologists are rediscovering reward hacking, distributional shift, and specification gaming independently. The conceptual toolkit for understanding them has existed in the research literature for a decade.

The practical implication for a bank CTO is not that the AI safety research programme has solved the oversight problem in financial services. It has not. What the literature provides is a conceptual framework for identifying which class of failure mode is being encountered, a set of documented mitigation approaches whose limitations are clearly stated, and a vocabulary for describing the problem that is precise enough to communicate across the boundary between technologists and risk managers. When a risk manager says "our fraud model is gaming its metrics," and a technologist says "the system is hacking its reward function," they are describing the same phenomenon. The second description is more precise, connects to a larger body of technical knowledge, and suggests specific mitigation approaches that the first description does not. The first description is what most of the financial industry is currently using.

What Reading the Literature Actually Requires

The AI safety literature is not uniformly accessible to a CTO with a background in financial engineering rather than academic machine learning. The field is evolving rapidly, and significant portions of it concern capability levels that are not relevant to current financial AI deployments. A selective reading programme is more productive than attempting comprehensive coverage. The documents I would suggest as starting points are: the Amodei et al. "Concrete Problems" paper [1] for the general taxonomy; the DeepMind technical safety team's review of specification gaming examples [3], which collects documented instances across deployed systems; Anthropic's "Constitutional AI" paper [6] for the scalable oversight research direction; and the UK AI Safety Institute's 2024 evaluation framework for deployed AI systems [7], which is explicitly designed to translate safety research into operational practice for regulators and practitioners. These four documents, read carefully and in that order, provide enough conceptual framework to significantly improve the quality of technical conversations about AI risk in a financial institution.

The risk management argument for reading this literature is straightforward: the institutions that understand the failure modes of their AI systems before regulators document them in enforcement actions will be better positioned than those that discover the failure modes from the enforcement actions. The CFPB's 2023 circular on AI in credit decisions [8] and the FCA's 2024 discussion paper on artificial intelligence and machine learning in financial services [9] both describe, in supervisory language, failure modes that the safety literature described in technical language years earlier. The regulatory frameworks are converging on the safety research findings. The institutions that have already incorporated those findings into their model risk management practice will find the converging regulation less disruptive than those that have not.

References
  1. Amodei, Dario, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. "Concrete Problems in AI Safety." arXiv preprint arXiv:1606.06565. 21 June 2016. arxiv.org
  2. Basel Committee on Banking Supervision. "Newsletter on the Use of Artificial Intelligence and Machine Learning in Financial Institutions." Bank for International Settlements. 2024. bis.org
  3. Krakovna, Victoria, et al. "Specification Gaming: The Flip Side of AI Ingenuity." DeepMind Blog / arXiv. 2020. arxiv.org
  4. Ovadia, Yaniv, et al. "Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift." Advances in Neural Information Processing Systems 32 (NeurIPS 2019). arxiv.org
  5. Basel Committee on Banking Supervision. "Newsletter on the Use of Artificial Intelligence and Machine Learning in Financial Institutions." Bank for International Settlements. 2024. bis.org (same document as reference [2])
  6. Bai, Yuntao, et al. "Constitutional AI: Harmlessness from AI Feedback." arXiv preprint arXiv:2212.08073. Anthropic. 15 December 2022. arxiv.org
  7. UK AI Safety Institute. "Evaluating AI Safety and Capabilities: AISI's Approach." UK Government / AI Safety Institute. 2024. gov.uk
  8. Consumer Financial Protection Bureau. "CFPB Issues Guidance on Credit Denials by Lenders Using Artificial Intelligence." CFPB Circular 2023-03. 19 September 2023. consumerfinance.gov
  9. Financial Conduct Authority. "FCA Discussion Paper on Artificial Intelligence and Machine Learning." Financial Conduct Authority, UK. 2024. fca.org.uk