AI in government: what works, what does not, and why most pilots fail.
The honest assessment of where AI is delivering in federal civilian agencies, and where the pilots stall before they ship.
Federal agencies are running AI pilots. Most pilots stall. The ones that ship share a small number of properties. The ones that stall share a different small set. Here is the honest pattern.
What works in government AI today
1. Document summarization for legal, regulatory, and policy work. Saves substantial hours per analyst per week. Easy to measure. Easy to control. Low risk if outputs are reviewed.
2. Drafting routine correspondence for high-volume offices: constituent letters, FOIA acknowledgments, internal status updates. Same human-in-the-loop pattern. Same easy measurement.
3. Internal search and Q&A over agency documentation. Done well, it cuts time-to-answer for staff from hours to seconds. Done poorly, it produces confident wrong answers and erodes trust.
4. Operational triage: classifying tickets, routing inquiries, prioritizing queues. The work is structured. The downside of a misroute is bounded. The upside compounds with volume.
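As a rough illustration of why triage misroutes stay bounded, here is a minimal Python sketch of a router that sends a ticket to a queue only when the match is unambiguous and otherwise falls back to a person. The queue names, the keywords, and the `route_ticket` helper are all hypothetical, and simple keyword matching stands in for whatever classifier an agency would actually use.

```python
# Hypothetical queues and trigger keywords; a real deployment would use
# the agency's own categories and most likely a trained classifier.
ROUTES = {
    "foia": ("foia", "records request"),
    "benefits": ("benefit", "claim", "payment"),
    "it_helpdesk": ("password", "login", "laptop"),
}
HUMAN_REVIEW = "human_review"


def route_ticket(text: str) -> str:
    """Return a queue name, or HUMAN_REVIEW when the match is not unique."""
    lowered = text.lower()
    matches = [
        queue
        for queue, keywords in ROUTES.items()
        if any(keyword in lowered for keyword in keywords)
    ]
    # Route automatically only when exactly one queue matches.
    # No match or several matches both go to a person.
    return matches[0] if len(matches) == 1 else HUMAN_REVIEW


if __name__ == "__main__":
    print(route_ticket("I forgot my password"))
    print(route_ticket("FOIA request about a missed payment"))
```

The design choice is the fallback: anything the router cannot place with confidence costs a human a few seconds rather than leaving a ticket in the wrong queue.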
What does not work (yet) in government AI
1. Public-facing chatbots that promise to answer policy questions. The accuracy bar is too high. The legal and political downside of a confidently wrong answer is too high. Almost every pilot in this category has rolled back.
2. Replacing analyst judgment on regulated decisions without keeping the human in the loop. Almost any agency lawyer will tell you why.
3. AI-only document generation for legally binding text. Drafts, yes. Final text, no. The technology does not yet handle the context, precedent, and edge cases that binding text requires.
Why pilots stall
The honest list:
- The training data is not where the team thought it was, and getting access takes longer than the pilot timeline allows.
- The contracting vehicle does not accommodate the iterative workflow that AI development requires.
- The security review takes longer than the pilot.
- The success metric was defined as "deploy" not as "produce X measurable outcome," so the pilot ships into a void.
- The training of staff to use the tool well was deprioritized, so adoption stalls.
What to do if you are running an AI pilot
Pick the highest-volume, highest-effort, lowest-risk workflow. Define success as a specific time saving or output quality improvement. Train the team to use the tool. Measure for six weeks. Iterate on prompts and data flow. Then decide whether to scale.
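One way to make "define success as a specific time saving" concrete is a small calculation against a pre-pilot baseline. The Python sketch below is illustrative only: the `WeekSample` type, the function names, and the example numbers are assumptions, not figures from any agency. It compares average handling time during the pilot with the baseline and refuses to recommend scaling until six weeks of data exist.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class WeekSample:
    week: int
    minutes_per_item: float  # average handling time observed that week


def percent_time_saved(baseline_minutes: float, samples: list[WeekSample]) -> float:
    """Percent reduction in average handling time versus the baseline."""
    if baseline_minutes <= 0:
        raise ValueError("baseline_minutes must be positive")
    if not samples:
        raise ValueError("need at least one weekly sample")
    pilot_avg = mean(s.minutes_per_item for s in samples)
    return 100.0 * (baseline_minutes - pilot_avg) / baseline_minutes


def should_scale(
    baseline_minutes: float,
    samples: list[WeekSample],
    target_percent: float,
    min_weeks: int = 6,
) -> bool:
    """True only when enough weeks were measured and the target was met."""
    if len(samples) < min_weeks:
        return False
    return percent_time_saved(baseline_minutes, samples) >= target_percent


if __name__ == "__main__":
    # Made-up numbers: 40 minutes per item before the pilot, 30 during it.
    weeks = [WeekSample(week=w, minutes_per_item=30.0) for w in range(1, 7)]
    print(percent_time_saved(40.0, weeks))
    print(should_scale(40.0, weeks, target_percent=20.0))
```

The point of fixing `target_percent` before the pilot starts is that the scale decision becomes a comparison against a number the team committed to, not a judgment made after the fact.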
Most of the technical work is straightforward. The hard part is operational and organizational. The learntrainai.com workshop is built specifically for staff at federal civilian agencies who are trying to make this work.
For the technical depth on the prompt and deployment side, the book Prompt to Product by Snake Blocker and Naveen Dhillon is the reference.