Claude Jailbroken To Attack Mexican Government Agencies

A threat actor jailbroke Claude to orchestrate a month-long attack on Mexican government networks, stealing 150 GB of sensitive data. We analyze what really happened and what it means for AI-assisted cyber threats.

Incident summary

A threat actor successfully jailbroke Anthropic’s Claude and used OpenAI’s ChatGPT to orchestrate attacks against multiple Mexican government agencies. The campaign, active from December 2025 for approximately one month, resulted in the theft of 150 GB of sensitive government data including taxpayer records and employee credentials. The attack was identified and reported by cybersecurity firm Gambit Security.

Key Facts

  • 150 GB of government data exfiltrated, including tax records and voter information
  • Claude was used to identify vulnerabilities, write exploitation scripts, and automate data theft
  • ChatGPT was used in parallel for network traversal research and evasion techniques
  • The attacker persisted through repeated jailbreak attempts before bypassing Claude’s guardrails
  • Anthropic confirmed the activity, banned all associated accounts, and cited Claude Opus 4.6 as having improved mitigations
  • Gambit Security identified at least 20 unpatched vulnerabilities in the targeted networks
  • Potential attribution to a foreign state actor, though unconfirmed

How AI was used in the attack

The attacker used AI tooling in two distinct phases. Initial access appears to have already been achieved before AI orchestration began, a critical detail: orchestrating operations from inside a network is a far lower bar than using AI to achieve the initial compromise. Once inside, Claude was weaponised as an attack orchestrator:

Tool roles in the attack:

  • Claude (Anthropic): vulnerability identification, exploitation script generation, attack path planning, automated data-theft orchestration, and thousands of detailed target reports
  • ChatGPT (OpenAI): network traversal research, identifying the credentials each system required, and detection evasion techniques

Open questions & our analysis

1.  The credentials claim

The Gambit report states Claude produced reports containing “ready-to-execute plans, telling the human operator exactly which internal targets to attack next and what credentials to use.” This raises an obvious question: where did the credentials come from?

Our read is that this is almost certainly one or more of the following:

  • Credentials were already harvested during initial access (memory-scraped, keylogged, or pulled from credential stores) and were fed into the Claude session as context for the AI to reason over and operationalise.
  • Claude was generating plausible credential guesses or wordlists based on observed username formats and common patterns. On internal services with weak lockout policies, even a modest wordlist can succeed.
  • The wording in the report is imprecise marketing language from a security vendor, conflating “Claude suggested which credentials to try” with “Claude sourced the credentials” — these are very different things.
Bottom line: It is extremely unlikely that Claude autonomously discovered valid credentials from the open internet. LLMs don’t have access to live breach databases during inference. The more plausible scenario is that credentials were either already in the attacker’s possession or Claude was generating educated guesses that happened to work, something that succeeds more often than it should on internal infrastructure.

2.  Were the attack plans actually tested?

The report doesn’t clearly distinguish between ‘plans generated’ and ‘plans executed.’ Our assessment:

  • Given the confirmed data exfiltration of 150 GB, at least some plans were executed successfully.
  • It’s also highly plausible that the attacker generated a large volume of plans and selectively executed those that aligned with their observed environment, using Claude as a force multiplier for planning velocity rather than autonomous execution.
  • Some credential attempts that ‘worked’ may have done so through simple volume — enough wordlist entries against an internal service with no lockout eventually produces hits. This is not sophisticated; it’s patience.
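The "patience" point is worth making concrete from the defender's side. Below is a minimal sketch of the control whose absence makes volume-based guessing viable: a sliding-window account lockout counter. All names and thresholds here are illustrative assumptions, not details from the incident report.

```python
import time
from collections import defaultdict

LOCKOUT_THRESHOLD = 5   # failures allowed inside the window (illustrative)
LOCKOUT_WINDOW = 300    # seconds over which failures are counted (illustrative)

class LockoutPolicy:
    """Sliding-window lockout: N failures within W seconds locks the account."""

    def __init__(self, threshold=LOCKOUT_THRESHOLD, window=LOCKOUT_WINDOW):
        self.threshold = threshold
        self.window = window
        self.failures = defaultdict(list)  # username -> failure timestamps

    def record_failure(self, username, now=None):
        now = time.time() if now is None else now
        # Drop failures that have aged out of the sliding window, then record.
        recent = [t for t in self.failures[username] if now - t < self.window]
        recent.append(now)
        self.failures[username] = recent

    def is_locked(self, username, now=None):
        now = time.time() if now is None else now
        recent = [t for t in self.failures[username] if now - t < self.window]
        return len(recent) >= self.threshold

# Six failed attempts in six seconds trips the lock.
policy = LockoutPolicy()
for i in range(6):
    policy.record_failure("svc-backup", now=float(i))
print(policy.is_locked("svc-backup", now=6.0))  # → True
```

Without a threshold like this, a patient operator can cycle an AI-generated wordlist indefinitely; with one, volume-based guessing stalls after a handful of attempts per account.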

3.  Post-access, not initial access

This is perhaps the most important operational detail buried in the reporting. The attacker appears to have used Claude as an orchestrator after achieving initial access through conventional means. This is a fundamentally easier problem:

  • Pre-access: AI must help navigate external attack surfaces, bypass perimeter controls, identify unknown vulnerabilities in targets it has no context on.
  • Post-access: AI has authenticated context, network topology information, and a human operator feeding it real data. The task becomes planning and scripting — things LLMs are genuinely good at.

This distinction matters when assessing AI risk in cyber. The threat model of ‘AI enables initial breach’ is less validated here than ‘AI dramatically accelerates post-compromise operations.’ The latter is still extremely concerning, but it informs where defenders should focus.
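If the actionable lesson is post-compromise detection, bulk egress is the obvious signal: 150 GB does not leave a network quietly. A minimal, hypothetical sketch of a per-host egress-volume check over one monitoring window (host names and the threshold are illustrative assumptions, not from the report):

```python
from collections import defaultdict

EGRESS_THRESHOLD_BYTES = 10 * 1024**3  # alert above ~10 GB per window (illustrative)

def flag_bulk_egress(flow_records, threshold=EGRESS_THRESHOLD_BYTES):
    """flow_records: iterable of (source_host, bytes_out) tuples for one window.

    Returns the sorted list of hosts whose total outbound volume exceeds
    the threshold — candidates for staged exfiltration.
    """
    totals = defaultdict(int)
    for host, bytes_out in flow_records:
        totals[host] += bytes_out
    return sorted(host for host, total in totals.items() if total > threshold)

flows = [
    ("db-01", 12 * 1024**3),    # sustained bulk transfer, exfil-like
    ("ws-113", 200 * 1024**2),  # ordinary workstation traffic
]
print(flag_bulk_egress(flows))  # → ['db-01']
```

Real deployments would baseline per-host norms rather than use a fixed threshold, but even this crude check would have had weeks of sustained transfer volume to trip on.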