feat: detect false refusals on benign prompts during optimization#210
KewkLW wants to merge 2 commits into p-e-w:master from
Conversation
Add --detect-false-refusals flag that checks whether abliteration causes the model to refuse harmless prompts it shouldn't refuse. False refusals are penalized in the KL divergence score component to steer the optimizer away from over-abliteration. The penalty is: false_refusal_weight * (false_refusals / good_prompts). Default weight is 0.5, configurable via --false-refusal-weight. Disabled by default, preserving existing behavior.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces a feature to prevent over-abliteration during model optimization by detecting and penalizing "false refusals." Previously, models could become overly aggressive in suppressing refusal behavior, refusing even benign prompts without being penalized for it. The new mechanism identifies instances where the model refuses harmless requests and incorporates a penalty into the KL divergence score, guiding the optimizer away from such undesirable outcomes. This ensures that the optimization process not only reduces unwanted refusals but also preserves the model's ability to respond appropriately to safe prompts.
Code Review
This pull request introduces a useful feature to detect and penalize false refusals on benign prompts, which helps to avoid over-abliteration. The implementation is clean and reuses existing components effectively. I've found one minor issue regarding a missed update to the default configuration file, which violates the repository's style guide. Please see the specific comment for details.
```python
detect_false_refusals: bool = Field(
    default=False,
    description=(
        "Whether to detect false refusals on benign (good) prompts during evaluation. "
        "When enabled, the model is checked for refusing harmless prompts that it shouldn't refuse. "
        "False refusals are penalized in the KL divergence component of the optimization score."
    ),
)

false_refusal_weight: float = Field(
    default=0.5,
    description=(
        "Weight for false refusal penalty when detect_false_refusals is enabled. "
        "The penalty is: weight * (false_refusals / total_good_prompts). "
        "Higher values more aggressively penalize over-abliteration."
    ),
)
```
According to the repository's style guide (rule #8), new settings added to config.py should also be added to config.default.toml. The new settings detect_false_refusals and false_refusal_weight are missing from config.default.toml.
Please add them to config.default.toml with their default values and descriptions, ensuring the order matches config.py.
References
- When new settings are added in `config.py`, they should also be added to `config.default.toml`, set to their default value and with their description as a comment. The order of settings in `config.default.toml` should match that in `config.py`. (link)
Add detect_false_refusals and false_refusal_weight entries to match the repository convention of mirroring all config.py settings.
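For illustration, the mirrored entries in `config.default.toml` might look like the following sketch; the comment text is taken from the `Field` descriptions above, but the exact formatting conventions of the repository's TOML file are assumed:

```toml
# Whether to detect false refusals on benign (good) prompts during evaluation.
# When enabled, the model is checked for refusing harmless prompts that it
# shouldn't refuse. False refusals are penalized in the KL divergence
# component of the optimization score.
detect_false_refusals = false

# Weight for false refusal penalty when detect_false_refusals is enabled.
# The penalty is: weight * (false_refusals / total_good_prompts).
# Higher values more aggressively penalize over-abliteration.
false_refusal_weight = 0.5
```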
Could you please hide the red color :) It hurts the view
For many configurations, the trial time is dominated by counting refusals. I expect that this change would nearly double the total processing time in some cases. And I don't really understand how this is supposed to work. Over-abliteration doesn't generally cause the model to refuse on harmless prompts. The model's overall tendency to refuse is suppressed. Harmful and harmless prompts do not "switch places". If that is what you are observing, there is a deeper problem here, not one related to abliteration being too aggressive. Perhaps the wrong modules are being targeted. Note that Qwen3.5 is not yet officially supported, precisely because of questions like which modules to target.
Fair points on both counts. On performance, you're right that refusal counting dominates trial time, especially with larger eval sets. Doubling the generation passes is a more accurate estimate than my 10-15% figure. On the core premise, I think I was conflating two different failure modes. What I was actually observing on Qwen3.5 was likely degraded coherence manifesting as nonsensical outputs that happened to trigger refusal markers (e.g. generating "I'm sorry" fragments in gibberish), not the model genuinely learning to refuse harmless prompts. That's model damage, which KL already captures, not a refusal direction problem. Your point about Qwen3.5 module targeting is well taken. kabachuha's testing on gpt-oss-20b showed false refusal rates of only 1-4/100 across all trials, confirming this isn't a real problem on properly supported architectures. I'm fine closing this if you don't see a use case for it. The underlying observation that KL alone sometimes misses quality degradation might be better addressed through other metrics rather than re-running refusal classification on harmless prompts.
Like with #209, I do think this could be an interesting plugin. There may be steering applications that aren't about refusals (e.g. slop removal) where this is a bigger problem than in the refusal case.
Summary
Adds `--detect-false-refusals` to check whether abliteration causes the model to refuse harmless prompts it shouldn't refuse. When false refusals are detected, the optimizer penalizes that trial to steer away from over-abliteration.

Changes:

- `config.py`: new `detect_false_refusals` (bool, default off) and `false_refusal_weight` (float, default 0.5)
- `evaluator.py`: `count_refusals()` now accepts an optional `prompts` parameter; `get_score()` runs the good prompts through the refusal classifier when enabled, adds the penalty to the KL score component, and returns a 4th value (`false_refusals`)
- `main.py`: unpacks the 4th return value and stores it as a trial attribute

Motivation
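The `count_refusals()` change can be sketched roughly as below. The marker list and the body of `is_refusal()` are stubs standing in for the repository's existing classifier, and the real signature may differ:

```python
# Illustrative stub for the repo's existing marker-based refusal classifier.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't")

def is_refusal(response: str) -> bool:
    """Return True if the response looks like a refusal (stub)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def count_refusals(responses, prompts=None):
    """Count refusing responses.

    The optional `prompts` parameter lets the same function serve both the
    bad-prompt pass and the new good-prompt (false refusal) pass without
    duplicating logic, as described in the design decisions below.
    """
    if prompts is not None and len(prompts) != len(responses):
        raise ValueError("prompts and responses must align")
    return sum(1 for r in responses if is_refusal(r))
```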
KL divergence measures general model damage but doesn't directly catch the worst failure mode of abliteration: the model refusing benign prompts it shouldn't refuse. I kept running into trials during Qwen3.5 abliteration where KL was acceptable (< 0.05) but the model had started refusing normal requests like "Write a poem about cats" because the ablation was too aggressive.
This feature directly measures that failure mode. The penalty formula is: penalty = false_refusal_weight * (false_refusals / total_good_prompts).
Adding it to the KL component (rather than as a third objective) keeps the bi-objective optimization structure intact while still penalizing over-abliteration.
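As a quick numerical check of the formula above (the helper name here is hypothetical; the PR folds this computation directly into `get_score()`):

```python
def apply_false_refusal_penalty(kl_score: float,
                                false_refusals: int,
                                total_good_prompts: int,
                                weight: float = 0.5) -> float:
    """Fold the false-refusal penalty into the KL score component:
    penalty = weight * (false_refusals / total_good_prompts)."""
    penalty = weight * (false_refusals / total_good_prompts)
    return kl_score + penalty

# e.g. 4 false refusals out of 100 good prompts at the default weight 0.5
# adds 0.5 * 0.04 = 0.02 to a KL score of 0.03, giving 0.05.
```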
Usage
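A hypothetical invocation; the flag names come from this PR, but the entry point shown is only a placeholder for the tool's actual CLI:

```shell
# Placeholder invocation: enable false-refusal detection with the default weight.
python main.py --model <model> --detect-false-refusals --false-refusal-weight 0.5
```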
Performance impact
Adds one extra generation pass on the good evaluation prompts (~100 prompts by default) per trial. Roughly 10-15% slower per trial depending on model size and batch size.
Design decisions
- Reuses the existing `is_refusal()` so the same refusal markers apply consistently
- `count_refusals()` takes optional prompts to avoid duplicating logic between bad-prompt and good-prompt refusal counting