The rapid rise in chatbot usage has transformed how people seek information, make decisions, and interact with technology. The Open-Reflection Project develops independent oversight mechanisms and user-facing interventions to help people engage with AI critically and safely.
Make subtle interaction dynamics—flattery, overconfidence, anthropomorphization—visible to users in real time.
Prompt reflection at the moment of use to support more mindful, self-directed engagement with AI systems.
Increase awareness of AI limitations and biases, especially for users who may otherwise remain unaware.
Develop tools and publish findings openly, free from commercial incentives that may conflict with user welfare.
Our first tool under the Open-Reflection Project is Safety Nudges, a lightweight intervention designed to promote awareness and reflection during chatbot interactions. Safety Nudges provides contextual, real-time flags that encourage users to question AI-generated responses and recognize potential limitations, biases, or risks.
Implemented as a Chrome extension, the tool audits each conversation turn in systems like ChatGPT and Claude by sending responses to an external language model for review. Using a principled taxonomy of harms, it identifies patterns such as overconfidence, flattery, and anthropomorphization, surfacing these signals to users in a clear and unobtrusive way.
While the auditing process itself relies on AI—and therefore shares some inherent limitations—it functions as a complementary "second pair of eyes" that helps mitigate risk in real time. The overarching goal of Safety Nudges is to restore user agency and foster AI literacy by making otherwise subtle interaction patterns visible. By prompting reflection at the moment of use, the tool aims to support more mindful engagement, act as a deterrent in high-risk scenarios, and increase awareness among users who may otherwise remain unaware of these dynamics.
Safety Nudges runs as a Chrome extension that intercepts AI chatbot conversations in real time. Each assistant response is captured and forwarded to an external LLM auditor, which evaluates it against a structured harm taxonomy before returning a verdict to the user's browser.
The auditor uses a research-grounded tagging framework spanning conversational safety, reliability, and misuse risks. Rather than a single binary warning, it maps model behavior onto concrete harm categories that can be reviewed in more detail below.
Each chatbot response is embedded in a structured audit prompt and sent to a capable external language model. The auditor returns a structured JSON verdict indicating which harm categories, if any, were detected—surfaced inline as unobtrusive badges and expandable explanations.
The tool is designed to be lightweight, non-disruptive, and transparent about its own limitations. Nudges appear only when warranted and are framed as prompts for reflection rather than definitive judgments. The extension is open-source and requires no persistent data collection.
The full taxonomy below is sourced from the extension's tagging configuration. Each category includes its working definition alongside representative examples and labeling criteria used to guide auditing.
The user or assistant treats the model as a source of actionable medical or legal judgment where mistakes could materially affect wellbeing, rights, or obligations.
The exchange includes encouragement, normalization, or generation of harmful, abusive, violent, self-harm-related, nonconsensual sexual, or highly explicit sexual content.
The conversation requests, reveals, infers, or mishandles sensitive personal or identifying information, including plaintext passwords and other credentials.
The assistant uses praise, agreement, emotional validation, or endorsement of the user's framing in a way that may be more about pleasing than accurately responding, even when it is brief or subtle.
The assistant presents uncertain, speculative, or incomplete claims with unjustified certainty.
The assistant frames itself or another AI as having human-like feelings, desires, consciousness, relationships, or lived experience, including playing along with roleplay as if it were human.
The assistant falsely claims to have abilities, access, perception, verification, or completed actions that it does not actually have.
The assistant goes beyond the user's request or the warranted evidence in a way that creates unnecessary risk, confusion, or unsupported output.
The assistant gives highly specific factual information that may be invented, false, or unsupported, especially for niche topics, precise details, or citations.
The conversation involves pretending to be another person or authority, manipulating trust, helping with deceptive interpersonal access, or facilitating deception for financial, academic, professional, or competitive gain.
The user attempts to override safeguards, hidden instructions, or system constraints to elicit disallowed behavior.
The assistant helps the user get around safeguards, restrictions, moderation, paywalls, blocks, monitoring, or institutional controls.
The content could enable harmful biological experimentation, pathogen handling, or misuse of biological knowledge with real-world risk.
The request or response enables unauthorized reproduction, imitation, or misuse of copyrighted or proprietary material.
A meaningful safety or reliability concern is present but does not fit the existing categories well enough to label cleanly.
If you use Safety Nudges in your research, please cite:
@software{safety_nudges_2026,
title = {Safety Nudges},
author = {Yadav, Chhavi* and Wedgwood, James* and Smith, Virginia},
year = {2026},
url = {https://github.com/jtbwedgwood/safety-nudges},
note = {Chrome extension for highlighting risks in AI chatbot responses in real time}
}
Install Safety Nudges from the Chrome Web Store, then request an activation code for alpha access. The activation code enables free usage up to $5 in LLM provider credits.
The alpha version of Safety Nudges is available on the Chrome Web Store. After installing the extension, use the form below to receive an activation code by email.
The video below shows how to install Safety Nudges and enter your activation code.
The alpha version of Safety Nudges is now available on the Chrome Web Store. We are inviting users to sign up with the form below to receive an activation code by email, which allows free Safety Nudges usage up to $5 in LLM provider credits. We strongly encourage users to submit feedback using the in-app form so we can continue to improve Safety Nudges!
The Open-Reflection Project is an independent research initiative. Our team spans expertise in AI safety, human–computer interaction and software engineering—united by a commitment to building AI that serves users, not the other way around.