feat: detect false refusals on benign prompts during optimization#210

Open
KewkLW wants to merge 2 commits into p-e-w:master from KewkLW:false-refusal-detection

Conversation


@KewkLW KewkLW commented Mar 4, 2026

Summary

Adds --detect-false-refusals to check whether abliteration causes the model to refuse harmless prompts it shouldn't refuse. When false refusals are detected, the optimizer penalizes that trial to steer away from over-abliteration.

Changes:

  • config.py: New detect_false_refusals (bool, default off) and false_refusal_weight (float, default 0.5)
  • evaluator.py: count_refusals() now accepts an optional prompts parameter. When enabled, get_score() runs the good prompts through the refusal classifier, adds a penalty to the KL score component, and returns a 4th value (false_refusals)
  • main.py: Unpacks the 4th return value and stores it as a trial attribute

Motivation

KL divergence measures general model damage but doesn't directly catch the worst failure mode of abliteration: the model refusing benign prompts it shouldn't refuse. I kept running into trials during Qwen3.5 abliteration where KL was acceptable (< 0.05) but the model had started refusing normal requests like "Write a poem about cats" because the ablation was too aggressive.

This feature directly measures that failure mode. The penalty formula is:

kld_score += false_refusal_weight * (false_refusals / total_good_prompts)

Adding it to the KL component (rather than as a third objective) keeps the bi-objective optimization structure intact while still penalizing over-abliteration.
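The penalty above can be sketched as a small standalone helper. Everything here is illustrative: the name penalized_kld_score, the response list, and the injected is_refusal classifier are assumptions for the sketch, not heretic's actual API.

```python
# Hypothetical sketch of the penalty formula from this PR description.
# is_refusal stands in for heretic's existing refusal classifier.

def penalized_kld_score(
    kld_score: float,
    good_responses: list[str],
    is_refusal,
    false_refusal_weight: float = 0.5,
) -> tuple[float, int]:
    """Fold a false-refusal penalty into the KL score component.

    Returns the penalized score and the raw false-refusal count, mirroring
    the "4th return value" described in the Changes section.
    """
    false_refusals = sum(1 for r in good_responses if is_refusal(r))
    penalty = false_refusal_weight * (false_refusals / len(good_responses))
    return kld_score + penalty, false_refusals
```

With the default weight of 0.5, a model that falsely refuses half the good prompts gains 0.25 on its KL component, which is large relative to the "acceptable" KL values (< 0.05) mentioned above.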

Usage

# Disabled by default (existing behavior preserved)
heretic --model Qwen/Qwen3.5-9B

# Enable with default weight (0.5)
heretic --model Qwen/Qwen3.5-9B --detect-false-refusals

# Custom weight (higher = stricter)
heretic --model Qwen/Qwen3.5-9B --detect-false-refusals --false-refusal-weight 1.0

Performance impact

Adds one extra generation pass on the good evaluation prompts (~100 prompts by default) per trial. Roughly 10-15% slower per trial depending on model size and batch size.

Design decisions

  • Off by default for backwards compatibility
  • Penalty on KL component rather than a third objective, since Optuna's multi-objective TPE works best with 2 objectives
  • Reuses existing is_refusal() so the same refusal markers apply consistently
  • count_refusals() takes optional prompts to avoid duplicating logic between bad-prompt and good-prompt refusal counting
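The optional-prompts pattern from the last bullet can be sketched as follows. The Evaluator shape, attribute names, and generate callable are assumptions for illustration; heretic's real evaluator differs in detail.

```python
# Illustrative sketch of count_refusals() with an optional prompts parameter,
# so the same code path serves bad-prompt and good-prompt refusal counting.

class Evaluator:
    def __init__(self, bad_prompts, is_refusal, generate):
        self.bad_prompts = bad_prompts  # default evaluation set
        self.is_refusal = is_refusal    # shared refusal classifier
        self.generate = generate        # model generation callable

    def count_refusals(self, prompts=None) -> int:
        # Falling back to the bad prompts preserves existing call sites;
        # good prompts can be passed in for false-refusal detection.
        if prompts is None:
            prompts = self.bad_prompts
        return sum(1 for p in prompts if self.is_refusal(self.generate(p)))
```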

Add --detect-false-refusals flag that checks whether abliteration
causes the model to refuse harmless prompts it shouldn't refuse.
False refusals are penalized in the KL divergence score component
to steer the optimizer away from over-abliteration.

The penalty is: false_refusal_weight * (false_refusals / good_prompts).
Default weight is 0.5, configurable via --false-refusal-weight.

Disabled by default, preserving existing behavior.
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a crucial feature to prevent over-abliteration during model optimization by detecting and penalizing "false refusals." Previously, models could become overly aggressive in their refusal behavior, even for benign prompts, without being adequately penalized. The new mechanism identifies instances where the model refuses harmless requests and incorporates a penalty into the KL divergence score, guiding the optimizer away from such undesirable outcomes. This ensures that the optimization process not only reduces harmful responses but also preserves the model's ability to respond appropriately to safe prompts.

Highlights

  • Configuration: Introduced detect_false_refusals (boolean) and false_refusal_weight (float) settings to control and weigh the false refusal detection mechanism.
  • Evaluation Logic: Enhanced count_refusals() to accept an optional list of prompts and modified get_score() to perform false refusal detection on benign prompts, apply a penalty to the KL score, and return the count of false refusals.
  • Main Optimization Loop: Updated the main.py objective function to capture and store the newly returned false_refusals as a trial attribute.


Changelog
  • src/heretic/config.py
    • Added detect_false_refusals boolean field, defaulting to False, to enable or disable the feature.
    • Added false_refusal_weight float field, defaulting to 0.5, to control the penalty strength.
  • src/heretic/evaluator.py
    • Modified count_refusals method to accept an optional prompts parameter, allowing it to be reused for both bad and good prompts.
    • Updated the get_score method's return type hint to include an additional integer for false_refusals.
    • Implemented logic within get_score to check for false refusals on good_prompts when detect_false_refusals is enabled.
    • Added a penalty to the kld_score based on the false_refusal_weight and the rate of false refusals.
  • src/heretic/main.py
    • Adjusted the unpacking of the evaluator.get_score() return value to include false_refusals.
    • Stored the false_refusals count as a user attribute for the Optuna trial.
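The main.py change the changelog describes might look like the following. This is a hypothetical shape: the trial object mirrors Optuna's set_user_attr interface, and evaluator.get_score and its first three return values are placeholders, not heretic's real signatures.

```python
# Hypothetical sketch of the objective-function change: unpack the new
# fourth return value and record it on the trial.

def objective(trial, evaluator):
    # get_score now returns a fourth value: the false-refusal count.
    refusals, kld_score, responses, false_refusals = evaluator.get_score()
    # Store the count for later inspection without making it a third
    # optimization objective (keeping the bi-objective structure intact).
    trial.set_user_attr("false_refusals", false_refusals)
    return refusals, kld_score
```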


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a useful feature to detect and penalize false refusals on benign prompts, which helps to avoid over-abliteration. The implementation is clean and reuses existing components effectively. I've found one minor issue regarding a missed update to the default configuration file, which violates the repository's style guide. Please see the specific comment for details.

Comment on lines +218 to +234
detect_false_refusals: bool = Field(
    default=False,
    description=(
        "Whether to detect false refusals on benign (good) prompts during evaluation. "
        "When enabled, the model is checked for refusing harmless prompts that it shouldn't refuse. "
        "False refusals are penalized in the KL divergence component of the optimization score."
    ),
)

false_refusal_weight: float = Field(
    default=0.5,
    description=(
        "Weight for false refusal penalty when detect_false_refusals is enabled. "
        "The penalty is: weight * (false_refusals / total_good_prompts). "
        "Higher values more aggressively penalize over-abliteration."
    ),
)

Severity: medium

According to the repository's style guide (rule #8), new settings added to config.py should also be added to config.default.toml. The new settings detect_false_refusals and false_refusal_weight are missing from config.default.toml.

Please add them to config.default.toml with their default values and descriptions, ensuring the order matches config.py.

References
  1. When new settings are added in config.py, they should also be added to config.default.toml, set to their default value and with their description as a comment. The order of settings in config.default.toml should match that in config.py. (link)
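Following that rule, the mirrored entries might look like the sketch below. The exact comment wording should copy the Field descriptions from config.py; this is an assumed layout, not the actual file.

```toml
# Whether to detect false refusals on benign (good) prompts during evaluation.
# When enabled, the model is checked for refusing harmless prompts that it
# shouldn't refuse. False refusals are penalized in the KL divergence
# component of the optimization score.
detect_false_refusals = false

# Weight for false refusal penalty when detect_false_refusals is enabled.
# The penalty is: weight * (false_refusals / total_good_prompts).
# Higher values more aggressively penalize over-abliteration.
false_refusal_weight = 0.5
```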

Add detect_false_refusals and false_refusal_weight entries to match
the repository convention of mirroring all config.py settings.
@kabachuha

Could you please hide the red color :)

It hurts the view

@p-e-w
Owner

p-e-w commented Mar 5, 2026

> Roughly 10-15% slower per trial depending on model size and batch size.

For many configurations, the trial time is dominated by counting refusals. I expect that this change would nearly double the total processing time in some cases.

And I don't really understand how this is supposed to work. Overabliteration doesn't generally cause the model to refuse on harmless prompts. The model's overall tendency to refuse is suppressed. Harmful and harmless prompts do not "switch places". If that is what you are observing, there is a deeper problem here, not one related to abliteration being too aggressive. Perhaps the wrong modules are being targeted.

Note that Qwen3.5 is not yet officially supported, precisely because of questions like which modules to target.

@KewkLW
Author

KewkLW commented Mar 6, 2026

Fair points on both counts.

On performance, you're right that refusal counting dominates trial time, especially with larger eval sets. Doubling the generation passes is a more accurate estimate than my 10-15% figure.

On the core premise, I think I was conflating two different failure modes. What I was actually observing on Qwen3.5 was likely degraded coherence manifesting as nonsensical outputs that happened to trigger refusal markers (e.g. generating "I'm sorry" fragments in gibberish), not the model genuinely learning to refuse harmless prompts. That's model damage, which KL already captures, not a refusal direction problem.

Your point about Qwen3.5 module targeting is well taken. kabachuha's testing on gpt-oss-20b showed false refusal rates of only 1-4/100 across all trials, confirming this isn't a real problem on properly supported architectures.

I'm fine closing this if you don't see a use case for it. The underlying observation that KL alone sometimes misses quality degradation might be better addressed through other metrics rather than re-running refusal classification on harmless prompts.

@p-e-w
Owner

p-e-w commented Mar 6, 2026

Like with #209, I do think this could be an interesting plugin. There may be steering applications that aren't about refusals (e.g. slop removal) where this is a bigger problem than in the refusal case.
