MentorMe

Profanity Filters in 2026 — When Automation Is and Isn't Moderation

Word lists, pattern matching, and three-strike systems — the engineering of a kind community.


A profanity filter is not a moderator. A moderator has judgment. A filter has a word list. Confusing the two is how communities die slowly.

But filters still matter. Used right, they keep the worst noise out without becoming the thought police. Used wrong, they create a culture of aggressive false positives where members can't say "class" because it contains a banned substring.

Here's how we built the filter for MentorMe's community, and the three rules that kept it from going feral.

First rule. A filter is a floor, not a ceiling. It catches the ten things nobody should ever say. Slurs, targeted harassment, the obvious stuff. Everything else — the gray zone, the context-dependent, the "is this a joke or is this mean" question — belongs to human moderators. The filter exists so moderators don't waste their day removing the easy calls.

Second rule. Pattern match, don't substring match. "ass" as a substring matches "class", "passage", "grass", "embassy", and hundreds of other innocent words. A filter that matches on substrings is a filter that gets turned off within a week because members revolt. Use word-boundary regex. Match "\bass\b", not "ass". It's a few extra characters of regex syntax, and it's the difference between a tool and a menace.
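
To make the difference concrete, here's a minimal TypeScript sketch (the function names are illustrative, not code from our codebase):

```ts
// Naive substring check: flags "class", "passage", "grass", "embassy"...
function substringMatch(text: string, banned: string[]): boolean {
  const lower = text.toLowerCase();
  return banned.some((word) => lower.includes(word));
}

// Word-boundary check: only flags the word standing on its own.
// Assumes list entries are plain words with no regex metacharacters.
function boundaryMatch(text: string, banned: string[]): boolean {
  return banned.some((word) => new RegExp(`\\b${word}\\b`, "i").test(text));
}

const banned = ["ass"];
console.log(substringMatch("Great class today!", banned)); // true (false positive)
console.log(boundaryMatch("Great class today!", banned));  // false (correct)
```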

"We don't filter off-topic posts (that's a moderator call, not a word-list call)."

Third rule. Transparency beats stealth. When the filter blocks a post, tell the user which word triggered it and why. Don't shadow-ban. Don't silently drop the message and pretend it posted. A user who thinks their post went through, then discovers it didn't, will never trust the platform again. A clear "this was flagged because of X, please rephrase" is respected. A silent drop is not.

The engineering is the easy part. We keep the banned word list in a single Postgres table, scoped by severity (block, warn, allow-with-strike). The filter runs server-side on post submission. Client-side filtering is security theater because anyone can disable it in devtools.
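
A rough sketch of that setup in TypeScript, with illustrative table and column names (the real schema may differ):

```ts
// Illustrative schema, kept here as a comment for reference:
//   CREATE TABLE banned_words (
//     word     text PRIMARY KEY,
//     severity text NOT NULL
//       CHECK (severity IN ('block', 'warn', 'allow_with_strike'))
//   );

type Severity = "block" | "warn" | "allow_with_strike";

interface Flag {
  word: string;
  severity: Severity;
}

// Runs server-side on post submission. Returns the first match so the
// user can be told exactly which word triggered the flag (rule three).
function checkPost(text: string, wordList: Map<string, Severity>): Flag | null {
  for (const [word, severity] of wordList) {
    // Word-boundary match, per rule two.
    if (new RegExp(`\\b${word}\\b`, "i").test(text)) {
      return { word, severity };
    }
  }
  return null;
}
```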

The three-strike system is where this gets interesting. One flagged post is a mistake. Two is a pattern. Three is a choice. We track strikes per user in the users table, with a 90-day rolling window. First strike is a warning in-line, no action taken. Second strike is a temporary post cooldown — 24 hours of read-only. Third strike is an escalation to a human moderator who makes the call on whether to suspend.
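
In code, the escalation is a simple decision over strikes inside the rolling window. A sketch, assuming strikes are stored as timestamps (the names here are hypothetical):

```ts
type Action = "warn_inline" | "cooldown_24h" | "escalate_to_moderator";

const WINDOW_MS = 90 * 24 * 60 * 60 * 1000; // 90-day rolling window

// priorStrikes: timestamps of the user's past strikes.
// Returns the action for the strike being issued right now.
function nextAction(priorStrikes: Date[], now: Date = new Date()): Action {
  const recent = priorStrikes.filter(
    (t) => now.getTime() - t.getTime() <= WINDOW_MS
  );
  const count = recent.length + 1; // include the strike being issued now

  if (count === 1) return "warn_inline";   // a mistake
  if (count === 2) return "cooldown_24h";  // a pattern
  return "escalate_to_moderator";          // a choice; a human makes the call
}
```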

The warnings themselves are worded carefully. "Your post was flagged for [word]. Communities work when we treat each other well. You can edit the post or appeal the flag." Not "You violated our rules." Not "Your message was removed." Just a statement of what happened and a path forward. The tone is the difference between a member who adjusts and a member who leaves.
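
One way to keep that tone consistent is to generate every warning from a single template. A sketch (hypothetical function, using the wording above):

```ts
function flagWarning(word: string): string {
  return (
    `Your post was flagged for "${word}". ` +
    "Communities work when we treat each other well. " +
    "You can edit the post or appeal the flag."
  );
}
```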

There's a whole category of things we chose NOT to filter. We don't filter profanity for emphasis — "this is fucking brilliant" stays. We don't filter political opinions, religious opinions, or disagreement. We don't filter negativity about our product (we want to hear it). We don't filter off-topic posts (that's a moderator call, not a word-list call).

We also chose not to run an LLM-based toxicity classifier. We tested three of them. They all had a bias toward flagging posts from non-native English speakers as hostile, which is the opposite of what we want for a global community. The word list is biased too — toward English — but at least the bias is auditable. We can read every word on the list. We can explain every flag. An LLM that returns "toxicity score 0.73" cannot be explained to the member whose post it flagged.

The larger point. Moderation is a community responsibility, not a software responsibility. The filter is scaffolding. The community is the culture. Kind communities come from kind norms, kind founders, kind moderators, and clear consequences applied consistently. Software supports that. Software does not replace it.

Action step: open your current platform's profanity filter and read the word list — if you can't audit it, you can't trust it.

Start free at mentorme.com — community, 2 courses, AI Operator Stack, and the skill library.
