Beyond the Text Mirage: How Investigators Map AI Astroturfing
Agencies detect AI-generated public comments by mapping submission metadata and network clusters. This guide walks through the exact data hygiene and graph-analysis steps to automate regulatory triage.
How do companies check for AI-generated content? They stop reading the text entirely and start mapping the submission metadata.
The Text Mirage and the Metadata Shift
Regulatory dockets are not being flooded by angry citizens. They are being flooded by prompt templates. Reading the text is the absolute worst way to catch them. When an agency relies on traditional natural language processing detectors, they hit a wall. Perplexity and burstiness metrics flatten out completely when an attacker prompts a model to generate a thousand slight variations of the same argument.
We learned this the hard way. Early on, we built a triage pipeline relying entirely on text heuristics to spot astroturfing. It almost broke our review process because the false positives were staggering. Genuine human passion often looks statistically similar to machine output, while coordinated AI campaigns slip right through. We reversed the approach. The more sophisticated the language model, the less useful text analysis becomes. Detection must shift entirely to network and metadata anomalies.
The submission envelope holds the actual truth. Timestamps, IP blocks, and email routing anomalies do not lie. When we pivot to the metadata fingerprint, the fake grassroots campaigns reveal themselves instantly. Furthermore, language models leave specific hallucination signatures when prompted to write endless variations. They hit a ceiling of creativity and start repeating the exact same syntactic structures. You will see identical transition phrases appearing across hundreds of distinct submissions.
Building the Detection Pipeline
To bypass unreliable text analysis, we build an automated dredge. This continuous pipeline catches the next wave before it hits the docket closing date and skews the public record. Entity resolution collapses the illusion of a broad grassroots movement. When you map the bipartite graph, you stop seeing thousands of individual citizens. You see a single star node connecting to hundreds of leaf nodes. The astroturfing network maps directly back to a single orchestrator.
Here is the exact sequence of operations we use to isolate the artificial noise.
- Extract the envelope: Query the Regulations.gov API and use Apache Tika to isolate submission metadata from the raw document text.
- Cluster the variations: Compute the Levenshtein distance between all comment pairs to group near-duplicate phrasing structures.
- Execute fuzzy matching: Deploy Elasticsearch to run fuzzy queries at scale, handling massive docket sizes without memory exhaustion.
- Map the bipartite graph: Use Python libraries to connect submitters to their comments, applying the NetworkX Reference Guide to identify high-centrality orchestrators.
- Flag the velocity spikes: Cross-reference the graph clusters with submission timestamps to isolate automated execution patterns.
| Signal Category | Specific Metric | False Positive Risk |
|---|---|---|
| Temporal | Submissions < 5 seconds apart | High |
| Lexical | Levenshtein distance < 12 | Medium |
| Network | Single IP submitting > 50 comments | Low |
You can trace our exact operational logic by reviewing our how our methodology works documentation. The goal is not to read every comment. The goal is to map the network.
From case triage to fentanyl networks, generative AI can transform unstructured data into actionable intelligence — when guided by oversight.
Our Numbers and the Open Question
The shift from text to metadata yields concrete results. In our analysis of recent federal regulatory dockets, 68% of the top 10,000 comments were generated from just 14 distinct prompt templates.
Our V3 Echo Engine flags LLM-astroturfing clusters with an 89% true positive rate based on metadata velocity and graph centrality alone, bypassing text analysis entirely. Fuzzy-matching clusters with a Levenshtein distance below 12 characters account for 94% of all detected AI-generated bulk submissions in our test datasets. You can verify these baseline metrics anytime via our public audit feed.
This leaves us with a genuine operational dilemma. At what point does the operational cost of building metadata-driven detection systems exceed the regulatory value of reading the public comments at all? We spend thousands of compute hours building these metadata graphs. The agencies we work with spend equal resources ingesting the output. If the public record is entirely compromised by automated scripts, the comments cease to be a democratic tool and simply become a data processing problem. Is this an arms race we can actually win?
To test this yourself, try two concrete experiments. Pull 1,000 comments from a recent active Regulations.gov docket and run a simple Levenshtein distance matrix to find text clusters with greater than 80% similarity. Then, plot the submission timestamps of those top 1,000 comments to visually identify the unnatural velocity spikes that indicate automated script execution.
MOBILIZR -- Writing at mobilizr.org