Some AI Detectors Are Wrong 100% of the Time

The Authors Guild tested five AI detectors on pre-AI human writing. Pangram scored 0% false positives. Sidekicker flagged every article as AI — two at 100%.

Evgenii Arsentev · PhD

The Authors Guild ran five AI detection tools against ten articles written by professional authors between 2020 and 2022, before ChatGPT became widely available, so the texts are unambiguously human-written. The results split sharply: two tools performed near-perfectly, two failed badly, and one landed in between. The test exposed something that matters for any writer, student, or professional whose work could be reviewed by an AI detector: the tools are not equally reliable, and the worst one got it wrong on every single text.

Pangram identified all ten articles correctly as human-written, posting a 0% false positive rate. Grammarly also performed well, with a false positive range of 0–9% across the tested articles. Originality.ai was mostly reliable, with most false positive rates in the 0–1% band. At the other end of the spectrum, Sidekicker flagged every single article as mostly AI-generated, and two of them scored a full 100% — meaning the tool was maximally confident that two pieces of unambiguously human writing were machine-made. ZeroGPT was highly inconsistent, with false AI detection rates ranging from 5.3% to 76.3% depending on the article.

The paradox at the center of AI detection

The Authors Guild highlighted a problem that makes accurate AI detection structurally difficult: the better a human writer is, the more their work resembles AI output. Language models were trained overwhelmingly on professionally written text, the kind of polished, coherent, well-structured prose that professional authors produce. As a result, the statistical patterns that make AI text detectable and the patterns that mark expert human writing have converged. Pangram CEO Max Spero, whose tool passed the test, noted that AI models tend to give themselves away through "uniformity, especially in argument construction", though that tell shows up mainly in casual or generic writing and not necessarily in carefully crafted text.

This creates a troubling asymmetry. A weak AI-generated text may be easier to catch, because its argument structure is narrower and more repetitive. But a skilled AI-assisted or AI-polished piece and a skilled human piece can be nearly indistinguishable to a detection tool. The Guild's warning follows directly from this: no detector should be used as the sole basis for a judgment, because a false positive is not a trivial mistake. For an author, being wrongly flagged can mean losing a contract, having work rejected, or having a reputation damaged.

What writers and builders should take from this

If your work might be reviewed by an AI detector (as a student submitting an essay, a journalist filing a piece, or a contractor whose deliverable gets scanned), this test gives you something concrete to act on. The gap between Pangram (no false positives across ten texts) and Sidekicker (every text flagged as mostly AI-generated, two of them at 100%) is not a minor calibration difference. It is a fundamental quality difference between tools. Not all AI detectors are created equal, and relying on the wrong one is not a neutral choice.

For builders integrating AI detection into platforms — plagiarism checks, submission screening, content moderation — the lesson is the same: validate your tool against a known human-written dataset before deploying it in any context where a false positive has real consequences. The Authors Guild's test was not technically complex. It was ten articles and five tools. That such a basic check surfaces such dramatic variation should give pause to anyone treating AI detection as a solved problem.
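The validation step described above can be sketched in a few lines. This is a minimal illustration, not any vendor's API: `detector` is a hypothetical callable that returns an AI-likelihood score between 0 and 1, the 0.5 threshold is an assumed cutoff, and the two toy detectors exist only to show the extremes.

```python
from __future__ import annotations

from typing import Callable, Iterable


def false_positive_rate(
    detector: Callable[[str], float],
    human_texts: Iterable[str],
    threshold: float = 0.5,
) -> tuple[float, list[int]]:
    """Return (false positive rate, indices of wrongly flagged texts).

    `detector` is a hypothetical stand-in for whatever tool you use: any
    function mapping a text to an AI-likelihood score in [0, 1]. Every
    text in `human_texts` is assumed to be verified human-written, so
    each flag counts as a false positive.
    """
    texts = list(human_texts)
    if not texts:
        raise ValueError("need at least one known human-written text")
    flagged = [i for i, text in enumerate(texts) if detector(text) >= threshold]
    return len(flagged) / len(texts), flagged


# Toy stand-in detectors, for illustration only (not real tools).
def always_ai(text: str) -> float:
    return 1.0


def never_ai(text: str) -> float:
    return 0.0


samples = ["first known-human article", "second known-human article"]
rate, flagged = false_positive_rate(always_ai, samples)
print(f"{rate:.0%} flagged, indices {flagged}")
```

Returning the indices alongside the rate makes it easy to pull up the specific texts that were misclassified and look for a pattern before deciding whether the tool is usable.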

What I'd actually do

If you write professionally and are worried about being wrongly flagged: run your text through Pangram or Grammarly before submission — both performed well in this test. If you are building a platform that screens content with an AI detector: first run it on a sample of known human writing and measure the false positive rate yourself. Do not trust the vendor's marketing numbers. And if the tool produces any false positives on unambiguously human text, treat it as unsuitable for high-stakes decisions until you understand why.

Author: Evgenii Arsentev, PhD · Chief Product Officer at a tech company

Source: the-decoder.com