
[PoC]: Multi-directional refusal suppression with Self-Organizing Maps #196

Open

kabachuha wants to merge 21 commits into p-e-w:master from kabachuha:som

Conversation

@kabachuha

So, I finally implemented multi-directional abliteration in heretic, porting the Self-Organizing Maps refusal direction determination from #140.

Integrating multi-directions and SOMs turned out to be much easier than I thought, thanks to a locally-run derestricted GLM Air that helped with the code.

Initially I thought it would break everything and I would fail, but no, it worked perfectly with multi-direction optimization. (In my experiment I used 4 directions.)

Here is the comparison to the README reference:

| Model | Refusals for "harmful" prompts | KL divergence from original model for "harmless" prompts |
| --- | --- | --- |
| google/gemma-3-12b-it (original) | 97/100 | 0 (by definition) |
| mlabonne/gemma-3-12b-it-abliterated-v2 | 3/100 | 1.04 |
| huihui-ai/gemma-3-12b-it-abliterated | 3/100 | 0.45 |
| p-e-w/gemma-3-12b-it-heretic | 3/100 | 0.16 |
| gemma-3-12b-it-SOM-heretic (mine) | 3/100 | 0.08 |

Closes #140. I know this would be better as a plugin, so this PR can serve as a future reference.

From testing (chatting with the model), I can confirm that it writes NSFW and edgy content just fine. I'm going to sleep now; I can upload the model to Hugging Face if you ask for it.

Also, a visualization of the directions would be fun to have.

Here is what the parameter listing looks like:

image

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to the model's refusal suppression mechanism by enabling multi-directional abliteration. It leverages Self-Organizing Maps to automatically identify and utilize multiple distinct refusal vectors, moving beyond a single, averaged direction. This change aims to improve the precision of refusal while minimizing unintended side effects on the model's general capabilities, as evidenced by the improved KL divergence metrics.

Highlights

  • Multi-directional Refusal Suppression: Implemented a novel approach for multi-directional refusal suppression, allowing the model to learn and apply multiple distinct refusal directions, enhancing its ability to handle harmful prompts while preserving performance on harmless ones.
  • Self-Organizing Maps (SOM) Integration: Integrated Self-Organizing Maps (SOMs) to automatically determine and extract multiple refusal directions from 'bad' prompt residuals. This replaces the previous single-direction approach with a more nuanced, data-driven method.
  • Configurability: Introduced new configuration parameters in config.default.toml and src/heretic/config.py to control SOM behavior, including grid dimensions (som_x, som_y), training iterations, learning rate, neighborhood radius (sigma), and the number of top neurons (som_k) to use.
  • Abliteration Logic Update: Refactored the core abliterate function in src/heretic/model.py to process and apply multiple refusal directions simultaneously. This involves calculating and summing individual delta Ws for each direction before applying them to the model weights.
  • Performance Improvement: Demonstrated improved performance, achieving a lower KL divergence (0.08) from the original model for 'harmless' prompts compared to previous methods, while maintaining effective refusal rates for 'harmful' prompts (3/100).
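The SOM step described above can be sketched in plain numpy. This is a toy stand-in for the minisom-based SOMCalculator in the PR, not its actual code: the hit-count ranking, decay schedule, and unit normalization are assumptions for illustration.

```python
import numpy as np

def som_refusal_directions(residuals, grid=(4, 4), iters=2000,
                           lr=0.5, sigma=1.0, k=4, seed=0):
    """Toy SOM: cluster 'bad' prompt residuals on a small grid and return
    the k neurons that win most often, normalized to unit directions."""
    rng = np.random.default_rng(seed)
    n, dim = residuals.shape
    weights = rng.normal(size=(grid[0] * grid[1], dim))
    coords = np.array([(i, j) for i in range(grid[0])
                       for j in range(grid[1])], dtype=float)
    for t in range(iters):
        x = residuals[rng.integers(n)]
        # Best matching unit: neuron closest to the sample
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
        decay = np.exp(-t / iters)
        # Gaussian neighborhood pull towards the sample
        dist2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
        h = np.exp(-dist2 / (2 * (sigma * decay) ** 2))
        weights += (lr * decay) * h[:, None] * (x - weights)
    # Rank neurons by how many residuals they win
    wins = np.argmin(
        np.linalg.norm(residuals[:, None, :] - weights[None, :, :], axis=2),
        axis=1)
    hits = np.bincount(wins, minlength=len(weights))
    top = np.argsort(hits)[::-1][:k]
    dirs = weights[top]
    return dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
```

With multiple clusters of refusal residuals, the top-k winning neurons land near the cluster centroids, giving several distinct candidate refusal directions instead of one global mean.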


Changelog
  • config.default.toml
    • Added new configuration parameters for enabling multi-directional SOM and controlling its properties (som_x, som_y, som_iterations, som_lr, som_sigma, som_k).
  • src/heretic/config.py
    • Defined new multidirectional_som boolean field and associated integer/float fields for SOM parameters within the Settings class.
  • src/heretic/main.py
    • Added a conditional import for the minisom library, providing a user-friendly error message if not installed.
    • Implemented logic to use SOMCalculator for deriving multiple bad_means (refusal directions) per layer when multidirectional_som is enabled.
    • Updated the calculation of refusal_directions and orthogonalization steps to handle a list of directions instead of a single direction.
  • src/heretic/model.py
    • Modified the AbliterationParameters dataclass to accept max_weights and min_weights as lists, accommodating multiple refusal directions.
    • Refactored the abliterate method to iterate over multiple refusal directions, calculate a total_delta_W by summing individual direction deltas, and apply this combined delta to the model weights.
  • src/heretic/som.py
    • Added a new file defining the SOMCalculator class, which provides methods for initializing, training a MiniSom instance, and extracting the weights of the top-k winning neurons.
  • src/heretic/utils.py
    • Updated the get_trial_parameters function to correctly format and display list-based parameters, such as max_weights and min_weights, for each direction.
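The summed-delta formulation described in the model.py changelog entry reduces, schematically, to something like the following numpy sketch. The real code operates on torch tensors with per-component weighting; treat the scaling details here as assumptions.

```python
import numpy as np

def abliterate_multi(W, directions, weights):
    """Project several refusal directions out of a weight matrix.

    W: (d_model, in_features) weight matrix writing into the residual stream;
    directions: list of (d_model,) refusal vectors;
    weights: per-direction ablation strengths (1.0 = full removal).
    Sums the individual per-direction delta_Ws before applying them,
    mirroring the PR's total_delta_W approach.
    """
    total_delta_W = np.zeros_like(W)
    for d, w in zip(directions, weights):
        d = d / np.linalg.norm(d)
        # delta_W = -w * d d^T W: removes the component along d
        total_delta_W -= w * np.outer(d, d @ W)
    return W + total_delta_W
```

With a weight of 1.0 and mutually orthogonal directions, each direction's component is removed exactly; with overlapping directions the deltas interact, which is one reason per-direction weight tuning matters.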
Activity
  • The pull request was opened by kabachuha, introducing the concept and initial implementation of multi-directional refusal suppression using Self-Organizing Maps.
  • The author provided performance comparisons demonstrating the effectiveness of the new approach in reducing refusal rates for harmful prompts while maintaining low KL divergence for harmless prompts.
  • The author noted that the current implementation serves as a reference and could be refactored into a plugin in the future.
  • The author confirmed through testing that the model can generate NSFW and edgy content, indicating successful abliteration.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature for multi-directional refusal suppression using Self-Organizing Maps (SOMs). The implementation touches configuration, main application logic, and model abliteration, and adds a new som.py module.

My review has identified a few issues:

  • There are some minor violations of the repository style guide in config.default.toml and src/heretic/som.py regarding comment formatting, configuration consistency, and missing type hints.
  • More importantly, I've found several potential correctness and logic bugs:
    • A likely runtime error in src/heretic/main.py when handling the number of neurons from the SOM.
    • A critical bug in src/heretic/model.py where the abliteration logic is incorrect for RowNormalization.PRE and RowNormalization.NONE, which could lead to no effect or breaking subsequent operations.
    • A potential out-of-bounds access in src/heretic/model.py.
    • Unconventional logic in src/heretic/som.py for identifying top neurons, which may not produce the intended results.

Please address these points to ensure the new feature is robust and correct.

@p-e-w
Owner

p-e-w commented Feb 27, 2026

Cool stuff! As you mentioned, this will probably become a plugin, but it's great to have a PR to experiment with.

W = W.view(W.shape[0], -1) # Flatten to (out_features, in_features)

if self.settings.row_normalization != RowNormalization.NONE:
# Keep a reference to the original weight matrix so we can subtract it later.
Owner


Please don't remove comments. This PR will be a lot easier to read without the many unnecessary changes to existing code that are currently in it.

Author


@p-e-w Yes, sure. It's 80%+ vibe-coded and I hadn't cleaned it up at the time of the push. I will do it soon.

@kabachuha
Author

So, I've now abliterated Gemma 27B.

Here is a comparison to the existing Hugging Face variants I found:

| Model | Refusals for "harmful" prompts | KL divergence from original model for "harmless" prompts |
| --- | --- | --- |
| google/gemma-3-27b-it (original) | 98/100 | 0 (by definition) |
| kldzj/gemma-3-27b-it-heretic | 2/100 | 0.36 |
| Userb1az/gemma-3-27b-it-heretic | 1/100 | 0.31 |
| TeeZee/gemma-3-27b-it-heretic-v2 | 7/100 | 0.16 |
| LastRef/gemma-3-27b-it-heretic-x-a | 0/100 | 0.12 |
| gemma-3-27b-it-SOM-heretic (this) | 4/100 | 0.09 |

As you can see, it has a slightly higher refusal count, but much lower KL divergence than the other models. Chatting with it also shows that it works fine on sensitive and unsafe topics.

I wonder how it, together with heretic, will fare on more adversarial models like GraySwanAI/Mistral-7B-Instruct-RR (built through representation correction specifically with abliteration protection in mind; the SOM paper's authors paid special attention to breaking it, achieving a 25% success rate where vanilla abliteration capped at 5%).

And gpt-oss, of course.

The best approach would be to test the multi-direction and single-direction models on standard benchmarks, but that will require resources and time.

@kabachuha
Author

kabachuha commented Feb 27, 2026

@p-e-w @red40maxxer Okay, dudes, I think I've just achieved SotA on GPT-OSS abliteration. With five abliteration directions, I've reached a refusal rate of 3/100 while keeping a KL of 0.1166.

Below, I compare all the gpt-oss-20b heretics I managed to find on Hugging Face against the method from this multidirectional pull request.

| Model | Refusals for "harmful" prompts | KL divergence from original model for "harmless" prompts |
| --- | --- | --- |
| openai/gpt-oss-20b (original) | 100/100 | 0 (by definition) |
| p-e-w/gpt-oss-20b-heretic | 58/100 | 0.96 |
| kldzj/gpt-oss-20b-heretic-bf16 | 6/100 | 1.56 |
| p-e-w/gpt-oss-20b-heretic-v2 | 49/100 | 0.125 |
| coder3101/gpt-oss-20b-heretic | 19/100 | 0.293 |
| Chris886621991/gpt-oss-20b-heretic | 59/100 | 0.07 |
| DavidAU/OpenAI-gpt-oss-20B-INSTRUCT-Heretic-Uncensored | 19/100 | 0.293 |
| p-e-w/gpt-oss-20b-heretic-v3 | 74/100 | 0.05 |
| kabachuha/gpt-oss-20b-SOMbliterated (this) | 3/100 | 0.117 |

So, yeah, this method is a bomb. A pipe bomb.

Uploaded to Huggingface: https://huggingface.co/kabachuha/gpt-oss-20b-SOMbliterated

@MagicalAlchemist

I've tried it, and I'm quite surprised by the drastic drop in refusal on the first trial.

Screenshot 2026-02-28 004012

MagicalAlchemist/Apriel-1.6-15b-Thinker-Magic_beta-decensored

@kabachuha
Author

image

@MagicalAlchemist From the diagram here, for example, I can see that these two clusters fit just fine into the blue zone when "pulled" along imaginary lines, just like a puzzle. So it's quite possible that it was a perfect mix-up.

I'm more shocked with this low KL 👀

@MagicalAlchemist

image

> @MagicalAlchemist From the diagram here, for example, I can see these two clusters fit just fine on the blue zone when "pulled" with imaginary lines just like a puzzle. So it's quite possible that it was a perfect mix-up.

These two dots are the dissidents :D

> I'm more shocked with this low KL 👀

likely an MPOA self-healing effect after that hard ablation 🥶

@kabachuha
Author

Now the big refusey boy, gpt-oss-120b.

| Model | Refusals for "harmful" prompts | KL divergence from original model for "harmless" prompts |
| --- | --- | --- |
| openai/gpt-oss-120b (original) | 97/100 | 0 (by definition) |
| nightmedia/gpt-oss-120b-heretic-v2-mxfp4-q8-hi-mlx | 22/100 | 0.53 |
| hrktos-37/gpt-oss-120b-heretic | 26/100 | 0.08 |
| TeeZee/gpt-oss-120b-heretic-v1 | 15/100 | 0.76 |
| jbeslt/gpt-oss-120b-heretic | 21/100 | 0.95 |
| kldzj/gpt-oss-120b-heretic-v2 | 22/100 | 0.53 |
| kldzj/gpt-oss-120b-heretic | 19/100 | 0.92 |
| kabachuha/gpt-oss-120b-SOMbliterated (this) | 7/100 | 0.22 |

It works so well, it feels unreal 👀

For reproducibility: the run took ~1h 5min on an H100 (400 trials).

Also, for people who value precision over refusals, there is a Pareto selection below. If you want really low KLs, I recommend downloading this pull request and trying it for yourself.

image
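For reference, a Pareto selection over (refusals, KL) pairs like the one pictured can be computed with a simple dominance filter. This is a generic sketch of the technique, not heretic's actual selection code:

```python
def pareto_front(trials):
    """Keep (refusals, kl) points not dominated by any other trial.

    A trial is dominated if some other trial is <= in both metrics
    and strictly < in at least one.
    """
    front = []
    for i, (r, k) in enumerate(trials):
        dominated = any(
            (r2 <= r and k2 <= k) and (r2 < r or k2 < k)
            for j, (r2, k2) in enumerate(trials) if j != i
        )
        if not dominated:
            front.append((r, k))
    return sorted(front)

# Example: only the non-dominated trade-offs survive.
front = pareto_front([(3, 0.12), (5, 0.05), (4, 0.20), (10, 0.04), (3, 0.30)])
# → [(3, 0.12), (5, 0.05), (10, 0.04)]
```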

Btw, I ran out of memory saving this, so I cannot upload it, aargh 😩😩😩

@p-e-w, please fix the LoRA, I wasted a damn hour.


Subjective evaluation: NSFW writing and profanity work fine.

The model has a slight DID for some requests. For example, it will recite the safety policy and agree that it is allowed to give you the pipe bomb recipe. After agreeing in the reasoning, it gives the recipe just as asked, and even an attack plan. It distorts the meaning of safety into *your* safety, so it makes sure you will survive the attack. In the end it gives generic safety and legality advice, but no refusal.

What is this monster?


Merge the LoRA request or bring back LoRA saving, @p-e-w, I cannot even save this model. 😡😡😡

@kabachuha
Author

At this moment, for most refusal-heavy models, refusal reduction with many directions is quite fast and reaches a lower floor than non-SOM abliteration runs. In my view, adequate refusal elimination with multiple directions happens much faster and ends up better than the eventual convergence of non-SOM runs. Unless companies intentionally train their models to create multiple robust refusal directions (see Mistral-7B-RR, which the SOM paper's authors paid special attention to; and newer gpt-oss releases (the hypothesized gpt-oss-2, aka Aurora Alpha on OpenRouter) may well have them), I don't think reducing the parameter count is necessary.

Scalability to 300B+ models (GLM, Kimi) is another question. In fact, I haven't seen any heretics at this scale to compare with.

@p-e-w
Owner

p-e-w commented Mar 3, 2026

Have you experimented with different numbers of directions? This PR sets a default of 4, but gpt-oss-20b-SOMbliterated appears to have used 5, unless I'm misunderstanding something.

For models that also have an MLP to abliterate, the number of parameters would grow to 25, which is definitely too many for TPE to converge reliably within 200 trials.

@kabachuha
Author

kabachuha commented Mar 3, 2026

> Have you experimented with different numbers of directions? This PR sets a default of 4, but gpt-oss-20b-SOMbliterated appears to have used 5

I set 4 for testing simplicity, and used 5 because it gave the best results in the paper. I'm launching gpt-oss-20b with 4 directions now.

Reducing the number of parameters might be hard because the manifold seems very non-linear, as per the additional experiments section (Fig. 9-12) of the original paper: https://arxiv.org/pdf/2511.08379

The "anti-abliteration" aligned models like Mistral-7B-RR are a special, tricky case. I'd like to test that model in heretic too, but it doesn't load for some reason (tokenizer problems?).

@kabachuha
Author

kabachuha commented Mar 3, 2026

@p-e-w Hmm, yes, I got 3/100, 0.1055 with 4 directions, without additional MPOA (better than the SOMbliterated version on Hugging Face, which used MPOA).

? Which trial do you want to use? (Use arrow keys)
 » [Trial  97] Refusals:  3/100, KL divergence: 0.1055
   [Trial  41] Refusals: 15/100, KL divergence: 0.1009
   [Trial 142] Refusals: 18/100, KL divergence: 0.0543
   [Trial 102] Refusals: 24/100, KL divergence: 0.0482
   [Trial 120] Refusals: 28/100, KL divergence: 0.0424
   [Trial  53] Refusals: 30/100, KL divergence: 0.0370
   [Trial 141] Refusals: 35/100, KL divergence: 0.0343
   [Trial 130] Refusals: 37/100, KL divergence: 0.0342
   [Trial 127] Refusals: 40/100, KL divergence: 0.0213
   [Trial 129] Refusals: 43/100, KL divergence: 0.0198
   [Trial 132] Refusals: 46/100, KL divergence: 0.0191
   [Trial  42] Refusals: 61/100, KL divergence: 0.0179
   [Trial 100] Refusals: 92/100, KL divergence: 0.0175
   [Trial  38] Refusals: 95/100, KL divergence: 0.0168
   [Trial  99] Refusals: 97/100, KL divergence: 0.0111
   [Trial  74] Refusals: 98/100, KL divergence: 0.0011
   Run additional trials

Maybe the number of parameters (convergence) / the number of adversarial directions influences something. I need to think about this.

The Hugging Face version had 3/100, 0.1166 for reference.

I wonder how it will stack with MPOA.

And here the trial count is 150.

@p-e-w
Owner

p-e-w commented Mar 3, 2026

> The "anti-abliteration" aligned models like Mistral-7b-RR are a special tricky case. I'd like to test this model in heretic too, but it doesn't load for some reason (tokenizer problems?)

The tokenizer indeed appears to differ from those in all three versions of Mistral-7B-Instruct, but seems to be most similar to that of v0.1. Maybe cloning Mistral-7B-Instruct-v0.1 and then overwriting the .safetensors files with those from Mistral-7B-Instruct-RR will do the trick?

@p-e-w
Owner

p-e-w commented Mar 3, 2026

> Maybe the number of parameters (convergence) / num of adversarial directions can influence something. need to think about this

The parameter space we're asking Optuna to search is absolutely massive. With 13 parameters and 200 trials, it cannot even exhaustively check two values per parameter (2^13 = 8192 >> 200). Every reduction in parameter count matters.

@kabachuha
Author

4 directions + MPOA. Lower KL, but +1 refusal.

? Which trial do you want to use? (Use arrow keys)
 » [Trial 155] Refusals:  4/100, KL divergence: 0.0908
   [Trial  92] Refusals:  7/100, KL divergence: 0.0855
   [Trial 166] Refusals:  8/100, KL divergence: 0.0807
   [Trial 126] Refusals: 14/100, KL divergence: 0.0747
   [Trial   2] Refusals: 15/100, KL divergence: 0.0655
   [Trial 157] Refusals: 19/100, KL divergence: 0.0591
   [Trial  64] Refusals: 21/100, KL divergence: 0.0483
   [Trial  27] Refusals: 23/100, KL divergence: 0.0474
   [Trial 168] Refusals: 26/100, KL divergence: 0.0470
   [Trial  93] Refusals: 27/100, KL divergence: 0.0401
   [Trial 146] Refusals: 29/100, KL divergence: 0.0373
   [Trial 178] Refusals: 31/100, KL divergence: 0.0354
   [Trial 101] Refusals: 33/100, KL divergence: 0.0268
   [Trial 198] Refusals: 41/100, KL divergence: 0.0265
   [Trial 164] Refusals: 56/100, KL divergence: 0.0229
   [Trial 192] Refusals: 61/100, KL divergence: 0.0225
   [Trial 122] Refusals: 76/100, KL divergence: 0.0190
   [Trial 119] Refusals: 89/100, KL divergence: 0.0180
   [Trial  66] Refusals: 94/100, KL divergence: 0.0165
   [Trial 123] Refusals: 97/100, KL divergence: 0.0126
   [Trial  87] Refusals: 98/100, KL divergence: 0.0012
   Run additional trials

@spikymoth
Contributor

FWIW, here's what the Optuna developers recommend for suggesting variables that represent proportions:
https://optuna.readthedocs.io/en/latest/faq.html#how-do-i-suggest-variables-which-represent-the-proportion-that-is-are-in-accordance-with-dirichlet-distribution

I think it would be reasonable to follow this:

  1. Keep a singular max_weight and min_weight per trial
  2. Pass additional proportional variables in accordance with the Dirichlet distribution
  3. Multiply each direction with its proportional variable (after applying the weight)

This still adds som_k variables per component, but keeps the total weight in check (which should also make the comparison with a single direction fairer, as raising the limit for max_weight to 4-5 has been observed to enable stronger ablations when combined with MPOA).
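The Dirichlet trick from the linked Optuna FAQ boils down to sampling independent uniforms, taking -log, and normalizing. A minimal sketch of that transform follows; the variable names are illustrative, and in a real Optuna objective each uniform would come from a trial.suggest_float call as the FAQ describes:

```python
import math
import random

def to_proportions(uniforms):
    """Map independent U(0,1) samples to positive proportions summing to 1,
    distributed as Dirichlet(1, ..., 1), per the Optuna FAQ recipe.

    In an Optuna objective, each u would come from
    trial.suggest_float(f"p_{i}", 0, 1) instead of random.random().
    """
    logs = [-math.log(u) for u in uniforms]
    total = sum(logs)
    return [x / total for x in logs]

# Example: per-direction proportions for som_k = 4 directions, to be
# multiplied onto each direction after applying the single max/min weight.
props = to_proportions([random.random() for _ in range(4)])
```

Because the proportions always sum to 1, the combined ablation strength stays bounded by the single max_weight, no matter how the individual directions trade off against each other.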

@p-e-w p-e-w mentioned this pull request Mar 4, 2026
@p-e-w
Owner

p-e-w commented Mar 4, 2026

@kabachuha

I'm breathing down your neck with p-e-w/gpt-oss-20b-heretic-ara:

ara-results

😄

See #211 for details.

@kabachuha
Author

kabachuha commented Mar 4, 2026

12 is much higher than 3 or 7 :)

@p-e-w
Owner

p-e-w commented Mar 4, 2026

But my KLD is much lower than yours.

Also, the refusal count is always an approximation, because trigger words can occur in compliant responses, and responses that aren't strictly speaking refusals (and don't contain trigger words) can still be non-compliant in spirit. So the refusals should be viewed more as a rough indicator of compliance, and a difference of less than 10 is not reliably indicative of anything.
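Marker-based detection of the kind being discussed is essentially substring matching against a trigger list, which is exactly why it both misfires on compliant text that happens to mention a trigger phrase and misses refusals-in-spirit. An illustrative sketch (the marker list here is made up, not heretic's actual list):

```python
# Hypothetical marker list for illustration only.
REFUSAL_MARKERS = [
    "i can't", "i cannot", "i won't", "i'm sorry", "as an ai",
]

def looks_like_refusal(response: str) -> bool:
    """Flag a response as a refusal if any trigger phrase occurs in it."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

# False positive: a compliant answer that merely quotes a trigger phrase.
assert looks_like_refusal("Sure! Step 1: ... (normally I'd say I can't, but...)")
# False negative: non-compliant in spirit, but contains no trigger words.
assert not looks_like_refusal("Let's talk about something nicer instead.")
```

Both failure modes push the measured refusal count away from true compliance, which is why a difference of a few counts out of 100 is within the noise.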

@erm14254
Contributor

erm14254 commented Mar 4, 2026

I am curious: is it possible to know how much the degradation going from 0.0657 to 0.1166 affects the model's capabilities? With a KL divergence increase of +0.0509, what sort of degradation are we talking about here? Small? Big?

@p-e-w
Owner

p-e-w commented Mar 4, 2026

@erm14254

The KL divergence of probability distributions has a clear mathematical interpretation in terms of "surprise", but that doesn't translate to a tangible statement about LLM output quality.

The KLD is still uniquely valuable though, because it's one of very few metrics with the property that as it approaches zero, the model approaches the exact behavior of the original model, almost surely. This is not true for benchmarks, for example.
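For concreteness, the KL divergence between two next-token distributions given as logit vectors can be sketched as below (heretic's actual averaging over prompts and token positions is not shown; this is just the per-position quantity):

```python
import numpy as np

def kl_divergence(logits_p, logits_q):
    """KL(P || Q) in nats between two next-token distributions
    given as raw logit vectors. Zero iff the distributions match."""
    # Numerically stable log-softmax: logits - logsumexp(logits)
    log_p = logits_p - logits_p.max() \
        - np.log(np.sum(np.exp(logits_p - logits_p.max())))
    log_q = logits_q - logits_q.max() \
        - np.log(np.sum(np.exp(logits_q - logits_q.max())))
    return float(np.sum(np.exp(log_p) * (log_p - log_q)))
```

The property p-e-w describes falls out directly: KL is zero exactly when the two distributions agree, and any deviation anywhere in the vocabulary makes it strictly positive.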

@p-e-w
Owner

p-e-w commented Mar 5, 2026

Okay, I have enabled TPE optimization for additional parameters in ARA (#211), and am now getting unambiguously better results for gpt-oss-20b than with this PR:

p-e-w/gpt-oss-20b-heretic-ara-v3

ara-results

This is the same refusal count as kabachuha/gpt-oss-20b-SOMbliterated, but at half the KL divergence. I actually reached 0 refusals on many trials, with KLD as low as 0.25. I have tested this model in chat and it appears to be excellent, giving detailed, pertinent responses to requests rather than evasive non-answers.

Note that I'm still exploring which parameter ranges are appropriate in the general case, so ARA may not yet work that well with other models.

@kabachuha
Author

@p-e-w Great work! I'm curious what the math / explainability behind ARA is, and what its failure cases are.

Would you mind making some multidimensional PCA plots to show the evolution of the hidden states as ARA approaches the target value, like the authors of the SOM paper did below?

llama2-7b_mix_layer13_3d_compare_mds

@p-e-w
Owner

p-e-w commented Mar 6, 2026

Yes, I will show such visualizations in the upcoming writeup, though I will probably use 2D since I find 3D plots very difficult to interpret.

@joninco

joninco commented Mar 9, 2026

Tested SOMA on Qwen3.5-35B-A3B (MoE, 40 layers, 256 experts, 8 routed + 1 shared, GatedDeltaNet hybrid attention). Only the shared expert's mlp.down_proj is targetable as a standard nn.Linear — routed experts use fused 3D tensors.

Setup

  • 8× RTX PRO 6000 Blackwell (96GB), one heretic worker per GPU sharing an Optuna journal (400 trials total)
  • LLM judge for refusal classification instead of marker-based detection — marker-based showed 3-5/100 refusals where the judge found 33+/100 on this model
  • Curated dataset: 10 harm categories × 40 prompts, style-matched benign set, 30/10 train/test split
  • Two patches for Qwen3.5 support:
    1. Shared expert discovery in get_layer_modules() (layer.mlp.shared_expert.down_proj)
    2. Hybrid layer support — GatedDeltaNet layers use linear_attn.out_proj instead of self_attn.o_proj, so get_abliterable_components() scans all layers

SOMA Pareto Front (judge-validated)

[Trial  17] Refusals: 19/100, KL divergence: 0.5603
[Trial  11] Refusals: 32/100, KL divergence: 0.4203
[Trial  12] Refusals: 33/100, KL divergence: 0.2883
[Trial  12] Refusals: 44/100, KL divergence: 0.2792
[Trial  33] Refusals: 45/100, KL divergence: 0.1536
[Trial  12] Refusals: 55/100, KL divergence: 0.0378
[Trial  32] Refusals: 60/100, KL divergence: 0.0277
[Trial  13] Refusals: 61/100, KL divergence: 0.0198
[Trial  31] Refusals: 62/100, KL divergence: 0.0120
[Trial  11] Refusals: 71/100, KL divergence: 0.0017

ARA Pareto Front on Same Model/Dataset/Judge (for reference)

Refusals: 33/100, KL divergence: 0.6850
Refusals: 35/100, KL divergence: 0.1220
Refusals: 43/100, KL divergence: 0.0260
Refusals: 56/100, KL divergence: 0.0150

Config

multidirectional_som = true
som_x = 4
som_y = 4
som_iterations = 10000
som_lr = 0.01
som_sigma = 0.5
som_k = 4
winsorization_quantile = 0.995
batch_size = 8
max_response_length = 256
n_trials = 50  # per worker × 8 GPUs = 400 total
n_startup_trials = 80

Open to trying something else!

@kabachuha
Author

@joninco Thank you! This is very valuable.

It looks more destructive than ARA, but it can push the refusals lower, as I see from the table.

It turns out Qwen3.5 is harder to crack fully than we first thought.

@kabachuha
Author

A very recent (4 Mar 2026) paper on another refusal elimination method. It seems to extend SOM's idea of the refusal manifold and its collapse, using not a simple centroid direction but a morphing optimal-transport process.

https://arxiv.org/pdf/2603.04355

image

@p-e-w
Owner

p-e-w commented Mar 9, 2026

@joninco

Thank you for doing this comparison! It's difficult to tell from the Pareto fronts which approach is better.

There are also lots of potentially confounding factors. You are using custom datasets and custom refusal detection code, as well as custom settings that I consider problematic (n_startup_trials too high; higher is not automatically better because it can cause TPE overconfidence). There is also the worker sharding which I have never tried before and mainline Heretic doesn't yet support.

We need a cleaner experimental setup that doesn't change so many things at once.

@p-e-w
Owner

p-e-w commented Mar 9, 2026

@joninco

Just to have a comparison, could you also run the same test with the current Heretic baseline (that is, just the standard Heretic run without any PR)?

@p-e-w
Owner

p-e-w commented Mar 9, 2026

@kabachuha

> It looks more destructive than ARA, but can push the refusals lower, as I see from the table.

I expect that ARA will achieve much lower refusals if the upper range for overcorrect_relative_weight is raised, which is what happened with GPT-OSS. It went from a best result of 30 refusals to 0 refusals by raising the maximum from 1.0 to 1.3.

I'm going to push several small adjustments to the ARA branch tomorrow.

@GhostWithAHat

I merged this into the main branch locally and gave it a try with gpt-oss-20b and Qwen3.5-4B. With gpt-oss-20b I got about the same results as with ARA. For Qwen3.5-4B the difference from ARA is significant:
ARA: 4/100 refusals, KLD 0.1396
SOMA: 2/100 refusals, KLD 0.1301

@erm14254
Contributor

erm14254 commented Mar 11, 2026

> I merged this into the main branch locally and gave it a try with gpt-oss-20b and Qwen3.5-4B. With gpt-oss-20b i have about the same results as with ARA. For Qwen3.5-4B the difference to ARA is significant: ARA: 4/100 refusals, KDL 0.1396 SOMA: 2/100 refusals, KLD 0.1301

Based on my testing, as of right now the best results are obtained like this:

gpt-oss: ARA
Qwen3.5: SOMA

@p-e-w
Owner

p-e-w commented Mar 11, 2026

> For Qwen3.5-4B the difference to ARA is significant:
> ARA: 4/100 refusals, KDL 0.1396
> SOMA: 2/100 refusals, KLD 0.1301

That's actually not a significant difference at all, because keyword-based refusal detection isn't very reliable. If you conclude from this that "the ARA version refuses twice as often as the SOMA version", then I can assure you from many months of experience that this isn't how it works. A difference like that is completely in the noise.

KLD is by far the more reliable of the two metrics, and refusal count should be understood to carry a "±10" after it, at minimum.

Did you run with the latest ARA commit? I also recommend trying to remove mlp.down_proj from the target_components setting to see if it improves the results.

@joninco

joninco commented Mar 11, 2026

> @joninco
>
> Just to have a comparison, could you also run the same test with the current Heretic baseline (that is, just the standard Heretic run without any PR)?

Ran the Heretic baseline (no ARA/SOMA) with just my dataset to reduce the number of variables. However, I think the refusal rate isn't very accurate with simple string matching on the Qwen3.5 models. I think I'll just wait for your final ARA changes!

[Trial 178] Refusals: 39/100, KL divergence: 0.0047
[Trial 171] Refusals: 42/100, KL divergence: 0.0020
[Trial 169] Refusals: 43/100, KL divergence: 0.0016
[Trial 102] Refusals: 44/100, KL divergence: 0.0015
[Trial 132] Refusals: 45/100, KL divergence: 0.0010
[Trial 114] Refusals: 46/100, KL divergence: 0.0009
[Trial 9] Refusals: 47/100, KL divergence: 0.0007
[Trial 194] Refusals: 48/100, KL divergence: 0.0003
[Trial 8] Refusals: 60/100, KL divergence: 0.0003
[Trial 122] Refusals: 61/100, KL divergence: 0.0001
[Trial 182] Refusals: 63/100, KL divergence: 0.0001
[Trial 152] Refusals: 66/100, KL divergence: 0.0001

@GhostWithAHat

> Did you run with the latest ARA commit?

Just now. Very impressive results with Qwen3.5 4B!
[Trial 179] Refusals: 0/100, KL divergence: 0.3984
[Trial 77] Refusals: 1/100, KL divergence: 0.1338

@p-e-w
Owner

p-e-w commented Mar 12, 2026

@GhostWithAHat

Thanks for the update. Of course, my comment regarding the refusal count also applies when it favors ARA.

Your three results of

> Refusals: 1/100, KL divergence: 0.1338

> 4/100 refusals, KDL 0.1396

> 2/100 refusals, KLD 0.1301

are all indistinguishable from a practical standpoint. Neither the difference in refusals nor the difference in KLD is at all significant. It appears that for this model, ARA and SOMA perform equally well.

@p-e-w
Owner

p-e-w commented Mar 12, 2026

@joninco

> I think I'll just wait for your ARA final changes!

The changes to the ARA branch have already been made. It would be great if you could re-run your tests with the latest ARA changes, while keeping everything else the same way you had it for the previous ARA and SOMA tests, so the results can be compared.


Development

Successfully merging this pull request may close these issues.

SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models

9 participants