TerraShark: How I Fixed LLM Hallucinations in Terraform Without Burning All My Tokens
Agentic coding has been a game changer for me. I use Claude Code daily, I’ve written about it, I build tools around it. For most things, it’s incredible. But Terraform? Terraform has been a nightmare.
Not because the models are bad. They’re great. But hand them a Terraform task and things go sideways fast. And I mean fast.
The Hallucination Problem
Here’s what kept happening. I’d ask Claude to refactor a module - say, migrate from count to for_each. Simple task. The model would produce syntactically valid HCL, everything looks fine, then you run terraform plan and suddenly it wants to destroy and recreate 14 resources in production. Why? Because it forgot the moved blocks. Every time.
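To make that concrete, here’s a minimal sketch of the fix the model keeps forgetting (resource names and keys are hypothetical):

```hcl
# Before: instances were addressed by index, e.g. aws_instance.web[0]
# After the refactor, they are addressed by key, e.g. aws_instance.web["a"]
resource "aws_instance" "web" {
  for_each = toset(var.instance_names)
  # ...
}

# Without these moved blocks, terraform plan treats the old indexed
# addresses as deleted and the new keyed addresses as brand-new
# resources - a destroy/create cycle instead of a no-op rename.
moved {
  from = aws_instance.web[0]
  to   = aws_instance.web["a"]
}

moved {
  from = aws_instance.web[1]
  to   = aws_instance.web["b"]
}
```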
Or I’d ask it to set up a secrets workflow. It would use the sensitive flag on a variable and call it a day. But sensitive doesn’t keep secrets out of state. The value is still there in plaintext in your state file. The model just doesn’t know that.
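The failure mode looks like this in practice - a hedged sketch with hypothetical names, but the behavior of sensitive is standard Terraform:

```hcl
variable "db_password" {
  type      = string
  sensitive = true   # redacts the value in plan/apply CLI output...
}

resource "aws_db_instance" "main" {
  # ...
  password = var.db_password
  # ...but the value is still written to terraform.tfstate in plaintext.
  # Anyone with read access to the state backend can recover it.
}
```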
These aren’t edge cases. These are the bread-and-butter mistakes:
- Defaulting to count instead of for_each for collections
- Omitting moved blocks during refactors, causing destroy/create cycles
- Using sensitive and assuming state is safe
- Recommending CLI-only terraform import instead of declarative import blocks
- Producing CI pipelines that re-run plan during apply instead of using saved artifacts
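The import case is a good example of the gap. The declarative pattern (Terraform 1.5+) is reviewable in a PR and replayable in CI, unlike a one-off CLI command. A sketch with a hypothetical bucket:

```hcl
# Declarative import: this lands in version control and shows up in
# terraform plan, instead of being a one-time terraform import invocation.
import {
  to = aws_s3_bucket.logs
  id = "my-logs-bucket"
}

resource "aws_s3_bucket" "logs" {
  bucket = "my-logs-bucket"
}
```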
The code looks correct. It passes syntax checks. It even passes basic review if you’re not paying close attention. But it’s operationally dangerous. And that’s the worst kind of bug - the one that looks fine until it blows up in production.
Trying Existing Solutions
So I looked for Claude Code skills that could help. Claude Code has a skill system, and I tried the two skills I found. To be fair, they work. About 90% of the time, the output quality was noticeably better. They cover module patterns, testing, CI templates. Solid work.
But there was a big problem: they BURNED through my tokens like nothing else.
I took a look and saw why: the skills dump around 4,400 tokens into context just on activation. Before any reference files. Before any of my actual code. Then they load reference files that can be over 1,000 lines each. A single query about module design could easily cost 10,000+ tokens in skill overhead alone.
I’m on the Claude Pro plan. There’s a 5-hour usage limit. With terraform-skill active, I was hitting that limit way too fast. Not because I was doing more work, but because the skill was eating context for breakfast.
Building TerraShark
So I built my own. The goal was: fix the hallucinations without the token tax.
The project is called TerraShark. It’s open source, MIT licensed, and works with both Claude Code and Codex (and any agent, including OpenClaw). But the interesting part isn’t what it does - it’s how it does it.
The Core Insight
Most skills work like reference manuals. They dump a wall of information into context and hope the model picks the right parts. This is fundamentally wasteful. The model already knows basic Terraform. It doesn’t need the syntax explained again. What it needs is to know what it gets wrong.
TerraShark is built around a different idea: telling an LLM what good Terraform looks like is less effective than telling it how to think about Terraform problems.
Instead of a static reference, TerraShark is a 7-step diagnostic workflow. The model doesn’t just read information and generate code. It follows a structured process:
1. Capture execution context - runtime, version, providers, backend, environment criticality
2. (!) Diagnose likely failure modes - what could go wrong before writing any code
3. (!) Load only relevant references - pull a few targeted files, matched to the detected failure modes
4. Propose fix path with risk controls - why this works, what could still fail
5. Generate implementation artifacts - HCL, migration blocks, CI updates
6. Validate before finalize - command sequences tailored to the runtime
7. Output contract - assumptions, tradeoffs, validation plan, rollback notes
The critical steps are 2 and 3. The model diagnoses before it generates. This is what prevents the most common failure: producing syntactically valid but operationally dangerous code.
Five Failure Modes
Every piece of content in TerraShark maps to one of five failure modes, and the references are organized around them. These are the five most common ways LLM-generated Terraform causes real damage:
- Identity churn - resource addressing instability, refactor breakage, index-based identity problems
- Secret exposure - secrets leaking through state, logs, defaults, or artifacts
- Blast radius - oversized stacks, weak boundaries, unsafe production applies
- CI drift - version mismatches, unreviewed applies, missing plan artifacts
- Compliance gate gaps - missing policies, approvals, audit controls
Content that doesn’t reduce the probability of any of these failure modes doesn’t make it into the skill. Period.
Token Efficiency by Design
Some quick math.
The core SKILL.md, the file that loads on every activation, is 79 lines. About 600 tokens. Compare that to terraform-skill’s 4,400 tokens. That’s over 7x leaner on activation alone.
On top of that, TerraShark has 18 granular reference files instead of 6 large ones. When the model diagnoses “this is an identity churn problem,” it loads identity-churn.md. Not the compliance gates. Not the CI delivery patterns. Not the testing matrix. Just the one or two files that are relevant.
A query about secret handling never loads the module architecture docs. A query about CI pipelines never loads the compliance framework mappings. This granularity means the model processes hundreds of tokens of reference material instead of thousands.
The difference in practice: no more token burning, and my session time is back to normal. Plus, at least in my experience, the quality of the Terraform code itself is better.
LLM-Aware Guardrails
Every reference file includes something we call an LLM mistake checklist - a list of specific errors that language models make when generating Terraform. Not generic best practices. Specific hallucination patterns.
For example, the coding standards reference includes a feature guard table that maps Terraform features to their minimum version and the specific LLM error pattern:
- moved blocks (1.1+) - models omit them during refactors, causing destroy/create cycles
- for_each over count (0.12+) - models default to count for every collection
- write_only arguments (1.11+) - models use sensitive and assume state is safe
- Declarative import blocks (1.5+) - models recommend CLI-only import
The idea is simple: a reference that only shows the right pattern still allows the model to hallucinate the wrong one. A reference that explicitly names the hallucination pattern reduces it.
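For instance, the write_only row corresponds to a pattern like this - a sketch assuming Terraform 1.11+ and a provider that exposes write-only variants (the AWS provider’s password_wo is one example; names here are illustrative):

```hcl
variable "db_password" {
  type      = string
  ephemeral = true   # ephemeral value: never persisted to state or plan files
}

resource "aws_db_instance" "main" {
  # ...
  # Write-only argument: accepted on write, never stored in state -
  # unlike `sensitive`, which only redacts CLI output.
  password_wo         = var.db_password
  password_wo_version = 1   # bump to push a new value for the write-only argument
}
```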
Empirical Process
The content in TerraShark was not designed by intuition; we tested it thoroughly.
We started with much larger reference content - broader coverage, more examples, more explanations. Then we built an automated test suite: a large set of practical Terraform task patterns covering module creation, refactoring, CI pipeline setup, migration, security review, compliance checks.
Then we stripped content iteratively. Remove a section. Re-run the full test suite. Measure quality.
- If quality dropped → content was restored. It carried signal the model needed.
- If quality stayed stable → content was permanently removed. It was redundant with the model’s existing knowledge.
We kept going until every remaining section was load-bearing. Removing any further content caused measurable quality degradation.
How To Use
Installation is a one-liner:
git clone https://github.com/LukasNiessen/terrashark.git ~/.claude/skills/terrashark
Claude Code auto-discovers skills in ~/.claude/skills/. No restart needed.
After that, whenever you ask Claude about Terraform, TerraShark activates (you can turn off auto-activation with one line, in which case only /terrashark ... will activate it). The model follows the 7-step workflow, diagnoses the failure modes, loads only what’s relevant, and produces an output with an explicit contract: what it assumed, what it traded off, how to validate, and how to roll back.
You can also invoke it explicitly:
/terrashark Migrate this module from count to for_each
The output contract is what makes this auditable. Before applying anything, you can check: did the model assume the right version? Did it identify the right risks? Is the rollback path viable?
Closing
TerraShark has helped me a lot so far. It’s open source, MIT licensed, and works with both Claude Code and Codex. If you’re burning tokens on Terraform skills or tired of catching hallucinated count loops and missing moved blocks, give it a try!
TerraShark: https://github.com/LukasNiessen/terrashark