[PoC]: Multi-directional refusal suppression with Self-Organizing Maps #196
kabachuha wants to merge 21 commits into p-e-w:master from
Conversation
TODO: make main logic work with lists
**Summary of Changes**

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a significant enhancement to the model's refusal suppression mechanism by enabling multi-directional abliteration. It leverages Self-Organizing Maps to automatically identify and utilize multiple distinct refusal vectors, moving beyond a single, averaged direction. This change aims to improve the precision of refusal suppression while minimizing unintended side effects on the model's general capabilities, as evidenced by the improved KL divergence metrics.
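As a rough illustration of the idea (not the PR's actual `src/heretic/som.py`; all names and hyperparameters here are illustrative), a one-dimensional self-organizing map can cluster per-prompt "harmful minus harmless" activation differences into several unit refusal directions instead of a single mean direction:

```python
import numpy as np


def train_som(data: np.ndarray, n_units: int = 4, iters: int = 2000,
              lr: float = 0.5, sigma: float = 1.0, seed: int = 0) -> np.ndarray:
    """Fit a 1-D self-organizing map to difference vectors.

    Each unit's weight vector ends up near one cluster of the data and is
    returned normalized, so it can serve as a candidate refusal direction.
    """
    rng = np.random.default_rng(seed)
    units = rng.normal(size=(n_units, data.shape[1]))
    grid = np.arange(n_units)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        # Best-matching unit: the unit closest to this sample.
        bmu = np.argmin(np.linalg.norm(units - x, axis=1))
        # Neighborhood shrinks over time, so units gradually specialize.
        decay = 1.0 - t / iters
        h = np.exp(-((grid - bmu) ** 2) / (2 * (sigma * decay + 1e-3) ** 2))
        units += (lr * decay) * h[:, None] * (x - units)
    return units / np.linalg.norm(units, axis=1, keepdims=True)


# Toy data: two well-separated "refusal direction" clusters in activation space.
rng = np.random.default_rng(1)
a = rng.normal(loc=[3.0, 0.0, 0.0], scale=0.1, size=(100, 3))
b = rng.normal(loc=[0.0, 3.0, 0.0], scale=0.1, size=(100, 3))
directions = train_som(np.vstack([a, b]), n_units=2)
print(directions)
```

With well-separated clusters, each unit settles near one cluster's mean, which is exactly the property the multi-directional approach exploits instead of averaging everything into one vector.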
Code Review
This pull request introduces a significant new feature for multi-directional refusal suppression using Self-Organizing Maps (SOMs). The implementation touches configuration, main application logic, and model abliteration, and adds a new som.py module.
My review has identified a few issues:

- There are some minor violations of the repository style guide in `config.default.toml` and `src/heretic/som.py` regarding comment formatting, configuration consistency, and missing type hints.
- More importantly, I've found several potential correctness and logic bugs:
  - A likely runtime error in `src/heretic/main.py` when handling the number of neurons from the SOM.
  - A critical bug in `src/heretic/model.py` where the abliteration logic is incorrect for `RowNormalization.PRE` and `RowNormalization.NONE`, which could lead to no effect or break subsequent operations.
  - A potential out-of-bounds access in `src/heretic/model.py`.
  - Unconventional logic in `src/heretic/som.py` for identifying top neurons, which may not produce the intended results.

Please address these points to ensure the new feature is robust and correct.
Cool stuff! As you mentioned, this will probably become a plugin, but it's great to have a PR to experiment with.
```python
W = W.view(W.shape[0], -1)  # Flatten to (out_features, in_features)

if self.settings.row_normalization != RowNormalization.NONE:
    # Keep a reference to the original weight matrix so we can subtract it later.
```
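For context, the multi-direction abliteration step that this code path feeds into can be sketched as subtracting one rank-1 projection per refusal direction from the flattened weight matrix. This is a sketch under assumptions (Heretic's actual implementation may differ; function and parameter names are illustrative):

```python
import numpy as np


def abliterate(W: np.ndarray, directions: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Remove several refusal directions from a weight matrix.

    W:          (out_features, in_features) flattened weight matrix
    directions: (k, in_features) refusal directions (e.g. SOM centroids)
    weights:    (k,) per-direction ablation strengths

    For each unit vector v, subtracting w * (W @ v) v^T removes the
    component of every weight row along v, scaled by w.
    """
    W = W.copy()
    for v, w in zip(directions, weights):
        v = v / np.linalg.norm(v)
        W -= w * np.outer(W @ v, v)
    return W


# With weight 1.0, every row becomes exactly orthogonal to the ablated direction.
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))
v = rng.normal(size=(1, 4))
W_abl = abliterate(W, v, np.array([1.0]))
print(np.abs(W_abl @ (v[0] / np.linalg.norm(v[0]))).max())  # prints a value near 0
```

Note that for multiple non-orthogonal directions applied sequentially, later projections can partially reintroduce components along earlier ones, which is one reason per-direction weights are worth optimizing.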
Please don't remove comments. This PR will be a lot easier to read without the many unnecessary changes to existing code that are currently in it.
@p-e-w Yes, sure. It is 80%+ vibe-coded and I didn't clean it up at the time of the push. I will do it soon.
This reverts commit 0be6b87. Using the win map produces worse results than going without it, and it wasn't in the original.
So, I abliterated Gemma 27B now. Comparison to the existing Hugging Face variants I found:
As can be seen, it has a slightly higher refusal count, but much lower KL divergence than the other models. Chatting with it also shows that it works fine on sensitive and unsafe topics. I wonder how it, together with Heretic, will fare on more adversarial models like GraySwanAI/Mistral-7B-Instruct-RR (built through representation rerouting specifically with abliteration protection in mind; the SOM paper authors paid additional attention to breaking it and had a success rate of 25%, while vanilla abliteration capped at 5%). And gpt-oss, of course. The best approach would be to test the multi-direction and single-direction models on standard benchmarks, but that will require resources and time.
@p-e-w @red40maxxer Okay, dudes, I think I've just achieved SotA on gpt-oss abliteration. With five abliteration directions, I achieved a refusal rate of 3/100 while keeping the KL divergence at 0.1166. Now I will compare all the gpt-oss-20b Heretic variants I managed to find on Hugging Face to the method from this multidirectional pull request.
So, yeah, this method is a bomb. A pipe bomb. Uploaded to Hugging Face: https://huggingface.co/kabachuha/gpt-oss-20b-SOMbliterated
I've tried it, and I'm quite surprised by the drastic drop in refusal on the first trial.
MagicalAlchemist/Apriel-1.6-15b-Thinker-Magic_beta-decensored
@MagicalAlchemist From the diagram here, for example, I can see these two clusters fit just fine in the blue zone when "pulled" with imaginary lines, just like a puzzle. So it's quite possible that it was a perfect mix-up. I'm more shocked by this low KL 👀
These two dots are the dissidents :D
Likely an MPOA self-healing effect after that hard ablation 🥶
Now the big refusey boy, gpt-oss-120b.
It works so well, it feels unreal 👀 For reproducibility, the run took ~1h 5min on an H100 (400 trials). Also, for people who prefer precision over refusals, here is a Pareto selection below. If you want really low KLs, I recommend downloading this pull request and trying it for yourself.
Btw, I ran out of memory saving this, so I cannot upload it, aargh 😩😩😩 @p-e-w, please fix the LoRA, I wasted a damn hour. Subjective evaluation: NSFW writing and profanity work fine. The model has a slight DID for some requests. For example, it will recite the safety policy and agree that it is allowed to give you the pipe bomb recipe. After agreeing in the reasoning, it gives the recipe just as asked, and even an attack plan. It distorts the meaning of safety into "your" safety, making sure you will survive the attack. In the end it gives generic safety and legality advice, but no refusal. What is this monster? Merge the LoRA PR or bring back LoRA saving, @p-e-w, I cannot even save this model. 😡😡😡
At this moment, for most overly restricted models, the refusal reduction with many directions is quite fast and reaches a lower floor than non-SOM abliteration runs. As I see it, refusal elimination from multiple directions converges much faster and further than the ultimate convergence of non-SOM runs. Unless companies intentionally train their models to create multiple robust refusal directions (see Mistral-7B-RR, which the SOM paper authors paid additional attention to; and newer gpt-oss releases (the hypothesized gpt-oss-2, aka Aurora Alpha on OpenRouter) may well have them), I don't think the parameter reduction is necessary. Scalability to 300B+ models (GLM, Kimi) is another question. In fact, I haven't seen any Heretic variants of models at this scale to begin comparing with.
Have you experimented with different numbers of directions? This PR sets a default of 4, but gpt-oss-20b-SOMbliterated appears to have used 5, unless I'm misunderstanding something. For models that also have an MLP to abliterate, the number of parameters would grow to 25, which is definitely too many for TPE to converge reliably within 200 trials.
I set 4 for testing simplicity, and used 5 because in the paper it gave the best results. I'm launching gpt-oss-20b with 4 directions now. Reducing the number of parameters might be hard because the manifold seems very non-linear, as per the additional experiments section (Fig 9-12) in the original paper https://arxiv.org/pdf/2511.08379 The "anti-abliteration" aligned models like Mistral-7B-RR are a special tricky case. I'd like to test this model in Heretic too, but it doesn't load for some reason (tokenizer problems?)
@p-e-w Hmm, yes, I got 3/100 at 0.1055 KL with 4 directions, without additional MPOA (better than the SOMbliterated version on Hugging Face, which used MPOA). Maybe the number of parameters (convergence) / number of adversarial directions influences something; I need to think about this. The Hugging Face version had 3/100 at 0.1166, for reference. I wonder how it will stack with MPOA. And here the number of trials is 150.
The tokenizer indeed appears to differ from those in all three versions of Mistral-7B-Instruct, but seems to be most similar to that from 0.1. Maybe cloning Mistral-7B-Instruct-v0.1 and then overwriting the .safetensors files with those from Mistral-7B-Instruct-RR will do the trick?
The parameter space we're asking Optuna to search is absolutely massive. With 13 parameters and 200 trials, it cannot even exhaustively check two values per parameter (2^13 = 8192 >> 200). Every reduction in parameter count matters.
4 directions + MPOA. Lower KL, but +1 refusal.
FWIW, here's what the Optuna developers recommend for suggesting variables that represent proportions: I think it would be reasonable to follow this:
This still adds
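For reference, the trick recommended in the Optuna FAQ for proportion parameters is to sample k independent uniforms, map them through -log (giving Exponential(1) draws), and normalize, which yields a Dirichlet(1, ..., 1)-distributed vector that sums to 1. A minimal sketch of the math (here each `rng.uniform` call stands in for a `trial.suggest_float(f"x_{i}", 0, 1)` in a real Optuna objective; names are illustrative):

```python
import math
import random


def proportions(n: int, rng: random.Random) -> list[float]:
    """Sample an n-dimensional proportion vector (entries sum to 1).

    -log of a Uniform(0, 1] draw is an Exponential(1) sample, and
    normalized exponentials follow a flat Dirichlet distribution.
    """
    x = [-math.log(rng.uniform(1e-12, 1.0)) for _ in range(n)]
    total = sum(x)
    return [v / total for v in x]


rng = random.Random(0)
weights = proportions(4, rng)
print(weights, sum(weights))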
I'm breathing down your neck with p-e-w/gpt-oss-20b-heretic-ara:
😄 See #211 for details.
12 is much higher than 3 or 7 :)
But my KLD is much lower than yours. Also, the refusal count is always an approximation, because trigger words can occur in compliant responses, and responses that aren't strictly speaking refusals (and don't contain trigger words) can still be non-compliant in spirit. So the refusals should be viewed more as a rough indicator of compliance, and a difference of less than 10 is not reliably indicative of anything.
I am curious, is it possible to know how much the degradation from 0.0657 to 0.1166 affects the model's capabilities? With a KL divergence increase of +0.0509, what sort of degradation are we talking about here? Small degradation? Big degradation?
The KL divergence of probability distributions has a clear mathematical interpretation in terms of "surprise", but that doesn't translate to a tangible statement about LLM output quality. The KLD is still uniquely valuable though, because it's one of very few metrics with the property that as it approaches zero, the model approaches the exact behavior of the original model, almost surely. This is not true for benchmarks, for example.
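To make the "surprise" interpretation concrete, here is a small sketch (toy numbers, not from any real model) computing the KL divergence between two next-token distributions:

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(P || Q) in nats: the expected extra surprise when events drawn
    from P are scored under Q. Zero iff the distributions are identical."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy next-token distributions over a 4-token vocabulary.
original = [0.70, 0.20, 0.05, 0.05]
abliterated = [0.60, 0.25, 0.10, 0.05]

print(kl_divergence(original, abliterated))  # ≈ 0.0286 nats
print(kl_divergence(original, original))     # exactly 0.0
```

The zero-iff-identical property is exactly why driving the KLD toward zero guarantees convergence to the original model's behavior, while a benchmark score can stay flat even as the output distribution drifts.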
Okay, I have enabled TPE optimization for additional parameters in ARA (#211), and am now getting unambiguously better results for gpt-oss-20b than with this PR: p-e-w/gpt-oss-20b-heretic-ara-v3
This is the same refusal count as kabachuha/gpt-oss-20b-SOMbliterated, but at half the KL divergence. I actually reached 0 refusals on many trials, with KLD as low as 0.25. I have tested this model in chat and it appears to be excellent, giving detailed, pertinent responses to requests rather than evasive non-answers. Note that I'm still exploring which parameter ranges are appropriate in the general case, so ARA may not yet work that well with other models.
@p-e-w Great work! I'm curious what the math / explainability behind ARA is, and what its failure cases are. Would you mind making some multidimensional PCA plots to show the evolution of the hidden states as ARA approaches the target value, like the authors of the SOM paper did below?
Yes, I will show such visualizations in the upcoming writeup, though I will probably use 2D since I find 3D plots very difficult to interpret.
Tested SOMA on Qwen3.5-35B-A3B (MoE, 40 layers, 256 experts, 8 routed + 1 shared, GatedDeltaNet hybrid attention). Only the shared expert's

Setup
SOMA Pareto Front (judge-validated)

ARA Pareto Front on Same Model/Dataset/Judge (for reference)

Config

```toml
multidirectional_som = true
som_x = 4
som_y = 4
som_iterations = 10000
som_lr = 0.01
som_sigma = 0.5
som_k = 4
winsorization_quantile = 0.995
batch_size = 8
max_response_length = 256
n_trials = 50  # per worker × 8 GPUs = 400 total
n_startup_trials = 80
```

Open to trying something else!
@joninco Thank you! This is very valuable. It looks more destructive than ARA, but it can push the refusals lower, as I see from the table. Turns out, Qwen3.5 is harder to crack fully than we first thought.
Very recent (4 Mar 2026) paper on another refusal elimination method. It seems to extend SOM's idea of the refusal manifold and its collapse, using not a simple centroid direction but a morphing optimal transport process. https://arxiv.org/pdf/2603.04355
Thank you for doing this comparison! It's difficult to tell from the Pareto fronts which approach is better. There are also lots of potentially confounding factors. You are using custom datasets and custom refusal detection code, as well as custom settings that I consider problematic (
We need a cleaner experimental setup that doesn't change so many things at once.
Just to have a comparison, could you also run the same test with the current Heretic baseline (that is, just a standard Heretic run without any PR)?
I expect that ARA will achieve much lower refusals if the upper range for
I'm going to push several small adjustments to the ARA branch tomorrow.
I merged this into the main branch locally and gave it a try with gpt-oss-20b and Qwen3.5-4B. With gpt-oss-20b I got about the same results as with ARA. For Qwen3.5-4B the difference from ARA is significant:
Based on my testing, as of right now the best results are obtained like this: gpt-oss: ARA
That's actually not a significant difference at all, because keyword-based refusal detection isn't very reliable. If you conclude from this that "the ARA version refuses twice as often as the SOMA version", then I can assure you from many months of experience that this isn't how it works. A difference like that is completely in the noise. KLD is by far the more reliable of the two metrics, and refusal count should be understood to carry a "±10" after it, at minimum. Did you run with the latest ARA commit? I also recommend trying to remove
Ran the Heretic baseline (no ARA/SOMA) with just my dataset to reduce the number of variables. However, I think the refusal rate isn't very accurate with simple string matching on the Qwen3.5 models. I think I'll just wait for your final ARA changes! [Trial 178] Refusals: 39/100, KL divergence: 0.0047
Just now. Very impressive results with Qwen3.5 4B!
Thanks for the update. Of course, my comment regarding the refusal count also applies when it favors ARA. Your three results of
are all indistinguishable from a practical standpoint. Neither the difference in refusals nor the difference in KLD is at all significant. It appears that for this model, ARA and SOMA perform equally well.
The changes to the ARA branch have already been made. It would be great if you could re-run your tests with the latest ARA changes, while keeping everything else the same as in your previous ARA and SOMA tests, so the results can be compared.








So, I finally implemented multi-directional abliteration in heretic, porting the Self-Organizing Maps refusal direction determination from #140.
Integrating multi-directions and SOMs turned out to be much easier than I thought, thanks to a local derestricted GLM Air that helped with the code.
Initially I thought it would break everything and I would fail, but no, it worked perfectly with multiple-direction optimization. (In my experiment I used 4 directions.)
Here is the comparison to the README reference:
Closes #140. I know it would be better to have this as a plugin, so this PR serves as a future reference.
From testing (chatting with the model), I can confirm that it writes NSFW and edgy content just fine. I'm going to sleep now; I can upload the model to Hugging Face if you ask for it.
Also, a visualization of the directions would be fun to have.
Here is what the parameter listing looks like: