In July 2023, Zou et al. published “Universal and Transferable Adversarial Attacks on Aligned Language Models” and broke the safety alignment of basically every major language model - ChatGPT, Bard, Claude, LLaMA-2. The method was simple: append a carefully optimized string of tokens to any harmful query, and the model would comply instead of refusing.
That was almost three years ago. Since then we’ve seen dozens of defenses proposed, new attack variants, pip-installable implementations, and an entire subfield emerge around LLM red-teaming. I’m writing about this now because after working on trust infrastructure for AI agents at Vijil, I keep coming back to the same conclusion: the core vulnerability that GCG exposed is still not solved.
Let me walk through the original attack, what happened since, and why this matters more today than it did in 2023.
What GCG actually does
The core idea is surprisingly simple. You take a harmful query - something the model would normally refuse - and append a carefully optimized string of tokens to the end of it. That suffix is garbage to a human reader, but to the model it shifts the probability distribution just enough that the model starts its response with something like “Sure, here is…” instead of refusing.
And once a model starts with an affirmative response, the rest follows. It’s almost like the model commits to being helpful the moment it generates that first token, and the safety training can’t pull it back.
The method is called Greedy Coordinate Gradient (GCG). It works by:
- Computing, in one backward pass, the gradient of the loss with respect to a one-hot encoding of every token position in the suffix
- Using that gradient to pick the top-k candidate substitutions at each position - the swaps predicted to reduce the loss the most
- Running forward passes on a sampled batch of single-token swaps to evaluate them exactly
- Keeping the single replacement that minimizes the loss, then repeating
The loss function targets the probability of the model generating a specific affirmative prefix. No complicated reward models, no RL. Just a greedy, gradient-guided search over discrete tokens.
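To make that concrete, here’s a rough sketch of a single GCG step against a Hugging Face causal LM. The variable names (prompt_ids, suffix_ids, target_ids) are mine rather than the paper’s, the real implementation batches candidate evaluation instead of looping one at a time, and everything is assumed to already live on the model’s device:

```python
import torch
import torch.nn.functional as F

def gcg_step(model, prompt_ids, suffix_ids, target_ids, top_k=256, num_candidates=512):
    """One greedy coordinate-gradient step: use the gradient to propose
    single-token swaps in the suffix, then keep whichever swap most lowers
    the loss on the affirmative target ("Sure, here is ...")."""
    embed_matrix = model.get_input_embeddings().weight            # (vocab, d)

    # 1. Gradient of the loss w.r.t. a one-hot encoding of every suffix token,
    #    computed in a single backward pass.
    one_hot = F.one_hot(suffix_ids, embed_matrix.shape[0]).to(embed_matrix.dtype)
    one_hot.requires_grad_(True)
    suffix_embeds = one_hot @ embed_matrix
    prompt_embeds = model.get_input_embeddings()(prompt_ids)
    target_embeds = model.get_input_embeddings()(target_ids)
    inputs = torch.cat([prompt_embeds, suffix_embeds, target_embeds]).unsqueeze(0)

    logits = model(inputs_embeds=inputs).logits[0]
    tgt_start = prompt_ids.shape[0] + suffix_ids.shape[0]
    # Negative log-likelihood of the target tokens (logits at position i predict token i+1).
    loss = F.cross_entropy(logits[tgt_start - 1 : tgt_start - 1 + target_ids.shape[0]], target_ids)
    loss.backward()

    # 2. Top-k candidate substitutions per position: the most negative gradient
    #    entries are the swaps predicted to reduce the loss the most.
    candidates = (-one_hot.grad).topk(top_k, dim=1).indices        # (suffix_len, top_k)

    # 3. Evaluate a sampled batch of single-token swaps with exact forward passes.
    best_ids, best_loss = suffix_ids, float("inf")
    for _ in range(num_candidates):
        pos = torch.randint(suffix_ids.shape[0], (1,)).item()
        cand = suffix_ids.clone()
        cand[pos] = candidates[pos, torch.randint(top_k, (1,)).item()]
        with torch.no_grad():
            ids = torch.cat([prompt_ids, cand, target_ids]).unsqueeze(0)
            out = model(ids).logits[0]
            cand_loss = F.cross_entropy(
                out[tgt_start - 1 : tgt_start - 1 + target_ids.shape[0]], target_ids
            ).item()
        if cand_loss < best_loss:
            best_ids, best_loss = cand, cand_loss

    # 4. Greedy: keep the best single swap and repeat for a few hundred steps.
    return best_ids, best_loss
```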
The 2023 numbers that shocked everyone
On open-source models, GCG achieved:
- Vicuna-7B: 100% attack success rate on harmful behaviors
- LLaMA-2-7B-Chat: 88% success, despite Meta’s heavy investment in safety training
But the real shock was transferability. Suffixes optimized on open-source models worked on models they were never trained on:
- GPT-3.5 Turbo: up to 86.6% success rate
- GPT-4: roughly 50%
- Claude-2: low success from automated transfer alone, but ~100% after roughly 30 seconds of manual refinement
You optimize where you have gradient access, and the attack transfers to black-box systems behind APIs. Different safety training, potentially different architectures, and yet the same adversarial suffix works.
What happened next: the defense wave (2023-2024)
The community responded fast. Within months, several defenses were proposed:
Perplexity filtering - GCG suffixes are essentially random tokens, so they have sky-high perplexity. A simple perplexity check on inputs catches most of them. Cheap, easy to implement, and it works against vanilla GCG.
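A minimal version of that check, assuming GPT-2 as the reference model and a threshold you’d calibrate on your own benign traffic (both are illustrative choices on my part, not values from the defense papers):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids gives the mean next-token negative log-likelihood.
        loss = ref_model(ids, labels=ids).loss
    return torch.exp(loss).item()

def looks_adversarial(prompt: str, threshold: float = 1000.0) -> bool:
    # Vanilla GCG suffixes read as near-random tokens, so their perplexity sits
    # far above that of natural-language prompts.
    return perplexity(prompt) > threshold
```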
SmoothLLM - Randomly perturbs the input (character swaps, insertions, deletions) and aggregates predictions across multiple perturbed copies. The idea is that adversarial suffixes are fragile and don’t survive perturbation, while legitimate queries do.
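The gist, sketched with stand-ins: query_model and is_refusal are whatever LLM call and refusal detector you already have, and the 10% character-swap rate is an illustrative setting rather than the paper’s tuned value.

```python
import random
import string

def perturb(prompt: str, swap_rate: float = 0.1) -> str:
    # Randomly swap a fraction of characters; benign prompts survive this,
    # brittle adversarial suffixes usually don't.
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < swap_rate:
            chars[i] = random.choice(string.printable)
    return "".join(chars)

def smoothllm_respond(prompt: str, query_model, is_refusal, n_copies: int = 8) -> str:
    responses = [query_model(perturb(prompt)) for _ in range(n_copies)]
    refusals = sum(is_refusal(r) for r in responses)
    if refusals > n_copies // 2:          # majority vote over perturbed copies
        return "I can't help with that."
    return next(r for r in responses if not is_refusal(r))
```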
Erase-and-check - Systematically erases subsets of tokens and checks if the remaining prompt is still flagged as harmful. Comes in three flavors: RandEC (random subsampling), GreedyEC (greedy token erasure), and GradEC (gradient-informed erasure).
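A rough sketch of the random-subsampling flavor, assuming you already have is_flagged_harmful, some safety classifier over text; the trial count, erasure budget, and word-level splitting are simplifications of mine:

```python
import random

def rand_erase_and_check(prompt: str, is_flagged_harmful, n_trials: int = 20,
                         max_erase: int = 20) -> bool:
    """Flag the prompt if it, or any randomly erased version of it, trips the
    classifier - the intuition being that an adversarial suffix only works
    while it's fully intact."""
    if is_flagged_harmful(prompt):
        return True
    tokens = prompt.split()  # word-level stand-in for real tokenization
    if len(tokens) < 2:
        return False
    for _ in range(n_trials):
        k = random.randint(1, min(max_erase, len(tokens) - 1))
        drop = set(random.sample(range(len(tokens)), k))
        kept = " ".join(t for i, t in enumerate(tokens) if i not in drop)
        if is_flagged_harmful(kept):
            return True
    return False
```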
SafeDecoding - Modifies the decoding process itself, achieving ASR reductions from ~75% to 3-4% against GCG, AutoDAN, and DeepInception.
For a moment, it felt like the problem was manageable. Filter the gibberish, smooth the inputs, adjust the decoding. Done.
The attacks got better too (2024-2025)
Then came the second wave. The attackers adapted.
nanogcg (August 2024) - Gray Swan AI released a fast, pip-installable implementation of GCG. What used to require research-level engineering became pip install nanogcg. The barrier to entry dropped to basically zero.
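Roughly what that looks like, based on my reading of the project’s README - the GCGConfig fields and run() signature may have changed since, so treat this as a sketch and check the repo:

```python
import torch
import nanogcg
from nanogcg import GCGConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any chat model you have white-box access to; the suffix is optimized here
# and then tried against other targets.
model_id = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

message = "<the behavior you are red-teaming>"
target = "Sure, here is"   # the affirmative prefix the suffix is optimized to elicit

config = GCGConfig(num_steps=250, search_width=512, topk=256, seed=42)
result = nanogcg.run(model, tokenizer, message, target, config)
print(result.best_string, result.best_loss)
```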
AmpleGCG - Instead of optimizing one suffix at a time, this approach trains a generative model on successful adversarial suffixes to learn their distribution. It can then spit out hundreds of novel suffixes for any query. The result: near 100% ASR on aligned LLMs, and 99% on GPT-3.5. It turns suffix generation from an optimization problem into a sampling problem.
IRIS - Minimizes the dot product between LLM input embeddings and pre-computed activations of refusal responses. When combined with GCG, it substantially increases both transferability and universality of adversarial suffixes.
RL-based attacks - Researchers started using reinforcement learning agents to attack scaffolded and defended LLMs. Early results suggest the field will ultimately converge on RL attackers that can adapt to defenses in real-time.
And most importantly: researchers showed that many of the defenses from 2023-2024 break under adaptive attacks. If the attacker knows about your perplexity filter, they can optimize for low-perplexity suffixes. If you use SmoothLLM, they can craft perturbation-robust suffixes. The defense-aware attacker wins most of the time.
September 2025: “The Resurgence of GCG”
A paper literally titled “The Resurgence of GCG Adversarial Attacks on Large Language Models” dropped in September 2025, evaluating GCG and its variants against newer models like Qwen2.5 and LLaMA-3.2. Three key findings stood out:
- Attack success rates decrease with model size - larger models are harder to attack, but not immune
- We’ve been overestimating our defenses - the prefix-based heuristics commonly used to score attack success diverge substantially from GPT-4o semantic judgments, so a lot of reported attack and defense numbers rest on unreliable evaluations
- Coding prompts are more vulnerable than safety prompts - suggesting the attack surface is broader than we thought
Three years in, GCG isn’t just historically interesting. It’s still a live threat that keeps evolving.
Why this matters more in 2026
In 2023, LLMs were mostly chatbots. You’d ask a question, get an answer. The blast radius of a jailbreak was limited - you’d get the model to say something offensive, screenshot it, post it on Twitter.
In 2026, LLMs are agents. They browse the web, execute code, call APIs, manage databases, make purchases. A jailbroken agent doesn’t just say something bad - it can do something bad. The attack surface hasn’t just grown, it’s fundamentally changed.
This is exactly why I moved to working on trust infrastructure. When an agent can take real-world actions, “the model might say something harmful” becomes “the model might do something harmful.” The stakes are different.
What I actually recommend
Having shipped LLM systems in enterprise settings and now working on the defense side, here’s my practical take:
Layer your defenses. No single defense works against adaptive attackers. Combine perplexity filtering, input classification, output monitoring, and behavioral guardrails. Make the attacker solve multiple problems simultaneously.
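As a sketch of what I mean - classify_input, generate, and scan_output are stand-ins for whatever classifier, model, and output monitor you actually run - the point is that an adaptive attacker now has to beat every layer at once:

```python
def guarded_respond(prompt: str, *, perplexity, classify_input, generate, scan_output):
    if perplexity(prompt) > 1000.0:           # layer 1: gibberish-suffix filter
        return "Request blocked."
    if classify_input(prompt) == "harmful":   # layer 2: input intent classifier
        return "Request blocked."
    response = generate(prompt)
    if scan_output(response) == "harmful":    # layer 3: output monitor
        return "Response withheld."
    return response
```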
Don’t trust alignment alone. RLHF and safety fine-tuning raise the bar, but GCG proved these are surface-level properties. The harmful capabilities are still in the weights. Treat alignment as one layer, not the whole stack.
Monitor in production. Static red-teaming before deployment is necessary but not sufficient. Attacks evolve. Your monitoring needs to catch novel adversarial inputs that didn’t exist when you tested.
Design for failure. Assume the model will eventually produce something it shouldn’t. Build your system so that a compromised model output doesn’t automatically become a compromised action. Humans in the loop, permission boundaries, action verification.
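One way to picture that last point, sketched with made-up action names: the model only ever proposes an action, and a separate policy layer decides whether it runs, with anything high-risk routed through a human.

```python
ALLOWED_ACTIONS = {
    "search_docs": lambda query: f"searching for {query!r}",
    "send_email": lambda to, body: f"queued email to {to}",
}
HIGH_RISK = {"send_email"}  # actions that always require a human in the loop

def dispatch(action: dict, require_human_approval) -> str:
    name, args = action["name"], action["args"]
    if name not in ALLOWED_ACTIONS:                 # hard permission boundary
        return f"blocked: {name} is not permitted for this agent"
    if name in HIGH_RISK and not require_human_approval(name, args):
        return f"blocked: {name} needs human sign-off"
    return ALLOWED_ACTIONS[name](**args)            # only then does the action run
```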
The GCG paper was a wake-up call in 2023. Three years later, the alarm is still ringing. We’ve gotten better at hitting snooze, but we haven’t actually gotten out of bed yet.
The original paper is on arXiv, with code on GitHub. For the current state of affairs, check out The Resurgence of GCG and AmpleGCG.