[PoC]: Multi-directional refusal suppression with Self-Organizing Maps #196
kabachuha wants to merge 21 commits into p-e-w:master from
Conversation
TODO: make main logic work with lists
**Summary of Changes**

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a significant enhancement to the model's refusal suppression mechanism by enabling multi-directional abliteration. It leverages Self-Organizing Maps to automatically identify and utilize multiple distinct refusal vectors, moving beyond a single, averaged direction. This change aims to improve the precision of refusal suppression while minimizing unintended side effects on the model's general capabilities, as evidenced by the improved KL divergence metrics.
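As a rough illustration of the idea (not the PR's actual `src/heretic/som.py`; all names and hyperparameters here are illustrative), a one-dimensional self-organizing map can cluster per-prompt "harmful minus harmless" activation differences into several unit refusal directions instead of a single mean direction:

```python
import numpy as np


def train_som(data: np.ndarray, n_units: int = 4, iters: int = 2000,
              lr: float = 0.5, sigma: float = 1.0, seed: int = 0) -> np.ndarray:
    """Fit a 1-D self-organizing map to difference vectors.

    Each unit's weight vector ends up near one cluster of the data and is
    returned normalized, so it can serve as a candidate refusal direction.
    """
    rng = np.random.default_rng(seed)
    units = rng.normal(size=(n_units, data.shape[1]))
    grid = np.arange(n_units)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        # Best-matching unit: the unit closest to this sample.
        bmu = np.argmin(np.linalg.norm(units - x, axis=1))
        # Neighborhood shrinks over time, so units gradually specialize.
        decay = 1.0 - t / iters
        h = np.exp(-((grid - bmu) ** 2) / (2 * (sigma * decay + 1e-3) ** 2))
        units += (lr * decay) * h[:, None] * (x - units)
    return units / np.linalg.norm(units, axis=1, keepdims=True)


# Toy data: two well-separated "refusal direction" clusters in activation space.
rng = np.random.default_rng(1)
a = rng.normal(loc=[3.0, 0.0, 0.0], scale=0.1, size=(100, 3))
b = rng.normal(loc=[0.0, 3.0, 0.0], scale=0.1, size=(100, 3))
directions = train_som(np.vstack([a, b]), n_units=2)
print(directions)
```

With well-separated clusters, each unit settles near one cluster's mean, which is exactly the property the multi-directional approach exploits instead of averaging everything into one vector.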
Code Review
This pull request introduces a significant new feature for multi-directional refusal suppression using Self-Organizing Maps (SOMs). The implementation touches configuration, main application logic, and model abliteration, and adds a new som.py module.
My review has identified a few issues:

- There are some minor violations of the repository style guide in `config.default.toml` and `src/heretic/som.py` regarding comment formatting, configuration consistency, and missing type hints.
- More importantly, I've found several potential correctness and logic bugs:
  - A likely runtime error in `src/heretic/main.py` when handling the number of neurons from the SOM.
  - A critical bug in `src/heretic/model.py` where the abliteration logic is incorrect for `RowNormalization.PRE` and `RowNormalization.NONE`, which could lead to no effect or break subsequent operations.
  - A potential out-of-bounds access in `src/heretic/model.py`.
  - Unconventional logic in `src/heretic/som.py` for identifying top neurons, which may not produce the intended results.

Please address these points to ensure the new feature is robust and correct.
Cool stuff! As you mentioned, this will probably become a plugin, but it's great to have a PR to experiment with.
```python
W = W.view(W.shape[0], -1)  # Flatten to (out_features, in_features)

if self.settings.row_normalization != RowNormalization.NONE:
    # Keep a reference to the original weight matrix so we can subtract it later.
```
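For context, the multi-direction abliteration step that this code path feeds into can be sketched as subtracting one rank-1 projection per refusal direction from the flattened weight matrix. This is a sketch under assumptions (Heretic's actual implementation may differ; function and parameter names are illustrative):

```python
import numpy as np


def abliterate(W: np.ndarray, directions: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Remove several refusal directions from a weight matrix.

    W:          (out_features, in_features) flattened weight matrix
    directions: (k, in_features) refusal directions (e.g. SOM centroids)
    weights:    (k,) per-direction ablation strengths

    For each unit vector v, subtracting w * (W @ v) v^T removes the
    component of every weight row along v, scaled by w.
    """
    W = W.copy()
    for v, w in zip(directions, weights):
        v = v / np.linalg.norm(v)
        W -= w * np.outer(W @ v, v)
    return W


# With weight 1.0, every row becomes exactly orthogonal to the ablated direction.
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))
v = rng.normal(size=(1, 4))
W_abl = abliterate(W, v, np.array([1.0]))
print(np.abs(W_abl @ (v[0] / np.linalg.norm(v[0]))).max())  # prints a value near 0
```

Note that for multiple non-orthogonal directions applied sequentially, later projections can partially reintroduce components along earlier ones, which is one reason per-direction weights are worth optimizing.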
Please don't remove comments. This PR will be a lot easier to read without the many unnecessary changes to existing code that are currently in it.
@p-e-w Yes, sure. It is 80%+ vibe-coded and I didn't clean it up at the time of the push. I will do it soon.
This reverts commit 0be6b87. Using the win map produces worse results than going without it, and it wasn't in the original.
So, I abliterated Gemma 27B now. Comparison to the existing Hugging Face variants I found:
As can be seen, it has a slightly higher refusal count, but much lower KL divergence than the other models. Chatting with it also shows that it works fine on sensitive and unsafe topics. I wonder how it, together with Heretic, will fare on more adversarial models like GraySwanAI/Mistral-7B-Instruct-RR (built through representation rerouting specifically with abliteration protection in mind; the SOM paper authors paid additional attention to breaking it and had a success rate of 25%, while vanilla abliteration capped at 5%). And gpt-oss, of course. The best approach would be to test the multi-direction and single-direction models on standard benchmarks, but that will require resources and time.
@p-e-w @red40maxxer Okay, dudes, I think I've just achieved SotA on gpt-oss abliteration. With five abliteration directions, I achieved a refusal rate of 3/100 while keeping the KL divergence at 0.1166. Now I will compare all the gpt-oss-20b Heretic variants I managed to find on Hugging Face to the method from this multidirectional pull request.
So, yeah, this method is a bomb. A pipe bomb. Uploaded to Hugging Face: https://huggingface.co/kabachuha/gpt-oss-20b-SOMbliterated
I've tried it, and I'm quite surprised by the drastic drop in refusal on the first trial.
MagicalAlchemist/Apriel-1.6-15b-Thinker-Magic_beta-decensored
@MagicalAlchemist From the diagram here, for example, I can see these two clusters fit just fine in the blue zone when "pulled" with imaginary lines, just like a puzzle. So it's quite possible that it was a perfect mix-up. I'm more shocked by this low KL 👀
These two dots are the dissidents :D
Likely an MPOA self-healing effect after that hard ablation 🥶
Now the big refusey boy, gpt-oss-120b.
It works so well, it feels unreal 👀 For reproducibility, the run took ~1h 5min on an H100 (400 trials). Also, for people who prefer precision over refusals, here is a Pareto selection below. If you want really low KLs, I recommend downloading this pull request and trying it for yourself.
Btw, I ran out of memory saving this, so I cannot upload it, aargh 😩😩😩 @p-e-w, please fix the LoRA, I wasted a damn hour. Subjective evaluation: NSFW writing and profanity work fine. The model has a slight DID for some requests. For example, it will recite the safety policy and agree that it is allowed to give you the pipe bomb recipe. After agreeing in the reasoning, it gives the recipe just as asked, and even an attack plan. It distorts the meaning of safety into "your" safety, making sure you will survive the attack. In the end it gives generic safety and legality advice, but no refusal. What is this monster? Merge the LoRA PR or bring back LoRA saving, @p-e-w, I cannot even save this model. 😡😡😡
At this moment, for most overly restricted models, the refusal reduction with many directions is quite fast and reaches a lower floor than non-SOM abliteration runs. As I see it, refusal elimination from multiple directions converges much faster and further than the ultimate convergence of non-SOM runs. Unless companies intentionally train their models to create multiple robust refusal directions (see Mistral-7B-RR, which the SOM paper authors paid additional attention to; and newer gpt-oss releases (the hypothesized gpt-oss-2, aka Aurora Alpha on OpenRouter) may well have them), I don't think the parameter reduction is necessary. Scalability to 300B+ models (GLM, Kimi) is another question. In fact, I haven't seen any Heretic variants of models at this scale to begin comparing with.
Have you experimented with different numbers of directions? This PR sets a default of 4, but gpt-oss-20b-SOMbliterated appears to have used 5, unless I'm misunderstanding something. For models that also have an MLP to abliterate, the number of parameters would grow to 25, which is definitely too many for TPE to converge reliably within 200 trials.
I set 4 for testing simplicity, and used 5 because in the paper it gave the best results. I'm launching gpt-oss-20b with 4 directions now. Reducing the number of parameters might be hard because the manifold seems very non-linear, as per the additional experiments section (Fig 9-12) in the original paper https://arxiv.org/pdf/2511.08379 The "anti-abliteration" aligned models like Mistral-7B-RR are a special tricky case. I'd like to test this model in Heretic too, but it doesn't load for some reason (tokenizer problems?)
@p-e-w Hmm, yes, I got 3/100 at 0.1055 KL with 4 directions, without additional MPOA (better than the SOMbliterated version on Hugging Face, which used MPOA). Maybe the number of parameters (convergence) / number of adversarial directions influences something; I need to think about this. The Hugging Face version had 3/100 at 0.1166, for reference. I wonder how it will stack with MPOA. And here the number of trials is 150.
The tokenizer indeed appears to differ from those in all three versions of Mistral-7B-Instruct, but seems to be most similar to that from 0.1. Maybe cloning Mistral-7B-Instruct-v0.1 and then overwriting the .safetensors files with those from Mistral-7B-Instruct-RR will do the trick?
The parameter space we're asking Optuna to search is absolutely massive. With 13 parameters and 200 trials, it cannot even exhaustively check two values per parameter (2^13 = 8192 >> 200). Every reduction in parameter count matters.
4 directions + MPOA. Lower KL, but +1 refusal.
FWIW, here's what the Optuna developers recommend for suggesting variables that represent proportions: I think it would be reasonable to follow this:
This still adds
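For reference, the trick recommended in the Optuna FAQ for proportion parameters is to sample k independent uniforms, map them through -log (giving Exponential(1) draws), and normalize, which yields a Dirichlet(1, ..., 1)-distributed vector that sums to 1. A minimal sketch of the math (here each `rng.uniform` call stands in for a `trial.suggest_float(f"x_{i}", 0, 1)` in a real Optuna objective; names are illustrative):

```python
import math
import random


def proportions(n: int, rng: random.Random) -> list[float]:
    """Sample an n-dimensional proportion vector (entries sum to 1).

    -log of a Uniform(0, 1] draw is an Exponential(1) sample, and
    normalized exponentials follow a flat Dirichlet distribution.
    """
    x = [-math.log(rng.uniform(1e-12, 1.0)) for _ in range(n)]
    total = sum(x)
    return [v / total for v in x]


rng = random.Random(0)
weights = proportions(4, rng)
print(weights, sum(weights))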
I'm breathing down your neck with p-e-w/gpt-oss-20b-heretic-ara:
😄 See #211 for details.
12 is much higher than 3 or 7 :)
But my KLD is much lower than yours. Also, the refusal count is always an approximation, because trigger words can occur in compliant responses, and responses that aren't strictly speaking refusals (and don't contain trigger words) can still be non-compliant in spirit. So the refusals should be viewed more as a rough indicator of compliance, and a difference of less than 10 is not reliably indicative of anything.
I am curious, is it possible to know how much the degradation from 0.0657 to 0.1166 affects the model's capabilities? With a KL divergence increase of +0.0509, what sort of degradation are we talking about here? Small degradation? Big degradation?
The KL divergence of probability distributions has a clear mathematical interpretation in terms of "surprise", but that doesn't translate to a tangible statement about LLM output quality. The KLD is still uniquely valuable though, because it's one of very few metrics with the property that as it approaches zero, the model approaches the exact behavior of the original model, almost surely. This is not true for benchmarks, for example.
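To make the "surprise" interpretation concrete, here is a small sketch (toy numbers, not from any real model) computing the KL divergence between two next-token distributions:

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(P || Q) in nats: the expected extra surprise when events drawn
    from P are scored under Q. Zero iff the distributions are identical."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy next-token distributions over a 4-token vocabulary.
original = [0.70, 0.20, 0.05, 0.05]
abliterated = [0.60, 0.25, 0.10, 0.05]

print(kl_divergence(original, abliterated))  # ≈ 0.0286 nats
print(kl_divergence(original, original))     # exactly 0.0
```

The zero-iff-identical property is exactly why driving the KLD toward zero guarantees convergence to the original model's behavior, while a benchmark score can stay flat even as the output distribution drifts.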
Okay, I have enabled TPE optimization for additional parameters in ARA (#211), and am now getting unambiguously better results for gpt-oss-20b than with this PR: p-e-w/gpt-oss-20b-heretic-ara-v3
This is the same refusal count as kabachuha/gpt-oss-20b-SOMbliterated, but at half the KL divergence. I actually reached 0 refusals on many trials, with KLD as low as 0.25. I have tested this model in chat and it appears to be excellent, giving detailed, pertinent responses to requests rather than evasive non-answers. Note that I'm still exploring which parameter ranges are appropriate in the general case, so ARA may not yet work that well with other models.
@p-e-w Great work! I'm curious what the math / explainability behind ARA is, and what its failure cases are. Would you mind making some multidimensional PCA plots to show the evolution of the hidden states as ARA approaches the target value, like the authors of the SOM paper did below?
Yes, I will show such visualizations in the upcoming writeup, though I will probably use 2D since I find 3D plots very difficult to interpret.
Tested SOMA on Qwen3.5-35B-A3B (MoE, 40 layers, 256 experts, 8 routed + 1 shared, GatedDeltaNet hybrid attention). Only the shared expert's

Setup
SOMA Pareto Front (judge-validated)

ARA Pareto Front on Same Model/Dataset/Judge (for reference)

Config

```toml
multidirectional_som = true
som_x = 4
som_y = 4
som_iterations = 10000
som_lr = 0.01
som_sigma = 0.5
som_k = 4
winsorization_quantile = 0.995
batch_size = 8
max_response_length = 256
n_trials = 50  # per worker × 8 GPUs = 400 total
n_startup_trials = 80
```

Open to trying something else!
@joninco Thank you! This is very valuable. It looks more destructive than ARA, but it can push the refusals lower, as I see from the table. Turns out, Qwen3.5 is harder to crack fully than we first thought.
Very recent (4 Mar 2026) paper on another refusal elimination method. It seems to extend SOM's idea of the refusal manifold and its collapse, using not a simple centroid direction but a morphing optimal transport process. https://arxiv.org/pdf/2603.04355
Thank you for doing this comparison! It's difficult to tell from the Pareto fronts which approach is better. There are also lots of potentially confounding factors. You are using custom datasets and custom refusal detection code, as well as custom settings that I consider problematic (
We need a cleaner experimental setup that doesn't change so many things at once.
Just to have a comparison, could you also run the same test with the current Heretic baseline (that is, just a standard Heretic run without any PR)?
I expect that ARA will achieve much lower refusals if the upper range for
I'm going to push several small adjustments to the ARA branch tomorrow.
I merged this into the main branch locally and gave it a try with gpt-oss-20b and Qwen3.5-4B. With gpt-oss-20b I got about the same results as with ARA. For Qwen3.5-4B the difference from ARA is significant:
Based on my testing, as of right now the best results are obtained like this: gpt-oss: ARA
That's actually not a significant difference at all, because keyword-based refusal detection isn't very reliable. If you conclude from this that "the ARA version refuses twice as often as the SOMA version", then I can assure you from many months of experience that this isn't how it works. A difference like that is completely in the noise. KLD is by far the more reliable of the two metrics, and refusal count should be understood to carry a "±10" after it, at minimum. Did you run with the latest ARA commit? I also recommend trying to remove
Ran the Heretic baseline (no ARA/SOMA) with just my dataset to reduce the number of variables. However, I think the refusal rate isn't very accurate with simple string matching on the Qwen3.5 models. I think I'll just wait for your final ARA changes! [Trial 178] Refusals: 39/100, KL divergence: 0.0047
Just now. Very impressive results with Qwen3.5 4B!
Thanks for the update. Of course, my comment regarding the refusal count also applies when it favors ARA. Your three results of
are all indistinguishable from a practical standpoint. Neither the difference in refusals nor the difference in KLD is at all significant. It appears that for this model, ARA and SOMA perform equally well.
The changes to the ARA branch have already been made. It would be great if you could re-run your tests with the latest ARA changes, while keeping everything else the same as in your previous ARA and SOMA tests, so the results can be compared.








So, I finally implemented multi-directional abliteration in heretic, porting the Self-Organizing Maps refusal direction determination from #140.
Integrating multi-directions and SOMs turned out to be much easier than I thought, thanks to a local derestricted GLM Air that helped with the code.
Initially I thought it would break everything and I would fail, but no, it worked perfectly with multiple-direction optimization. (In my experiment I used 4 directions.)
Here is the comparison to the README reference:
Closes #140. I know it would be better to have this as a plugin, so this PR serves as a future reference.
From testing (chatting with the model), I can confirm that it writes NSFW and edgy content just fine. I'm going to sleep now; I can upload the model to Hugging Face if you ask for it.
Also, a visualization of the directions would be fun to have.
Here is what the parameter listing looks like: