
feat: Add 4-bit loading + LoRA support for low VRAM optimization#60

Merged
p-e-w merged 57 commits into p-e-w:master from accemlcc:master
Dec 14, 2025

Conversation

@accemlcc
Contributor

@accemlcc accemlcc commented Dec 1, 2025

Hi! As discussed on Reddit, this PR implements a workflow to enable abliteration on consumer hardware with limited VRAM.

Changes:

  • Implements 4-bit loading via bitsandbytes.
  • Uses LoRA adapters instead of modifying full weights in-place.
  • Optimization is calculated on the frozen 4-bit model.
  • Allows saving the result as a small LoRA adapter.

Impact:

  • Drastically reduces VRAM usage.
  • As discussed, the refusal vector is dominant enough that finding it in 4-bit precision works perfectly fine when applied to the full model later.
  • Tested on Llama 3 70B with negligible perplexity impact (4.19 -> 4.20).

Let me know if you need any changes!


p-e-w's technical implementation note:

Just for posterity, the "LoRA" approach in this implementation is a bit of a different beast from how LoRAs are normally used.

The idea behind a low-rank adaptation is to factor the adapter matrix (which is simply added to the module matrix) into a product of two low-rank matrices. This drastically decreases the number of trainable parameters, allowing for more efficient training.

But we don't do training here. In fact, we already pre-decide what the matrix product BA should be, from the abliteration parameters alone.

In principle, that makes PEFT overkill for this task, as we could simply use a basic module wrapper that applies the adapter in the forward pass. I've thought about this because I dislike unnecessary dependencies, but the reason PEFT still makes sense is that it is designed to work with Transformers, and saves us a lot of manual messing around with weights. Therefore, I believe that this PR is correct in using PEFT, despite the approach being somewhat unconventional for LoRA adapters.
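To make the "pre-decided BA" point concrete, here is a minimal sketch (illustrative names, not the PR's actual code) of how the abliteration update W' = (I - vv^T)W factors into LoRA-style matrices without any training:

```python
import torch

def abliteration_lora(W: torch.Tensor, v: torch.Tensor):
    """Factor the rank-1 update W' = (I - v v^T) W into LoRA-style
    factors B (d x 1) and A (1 x k) such that W + B @ A == W'.
    Nothing is trained; both factors follow from W and v alone."""
    v = v / v.norm()              # unit refusal direction
    A = (v @ W).unsqueeze(0)      # 1 x k, i.e. v^T W
    B = (-v).unsqueeze(1)         # d x 1, i.e. -v
    return B, A

W = torch.randn(8, 5)
v = torch.randn(8)
B, A = abliteration_lora(W, v)
v_hat = v / v.norm()
# W + B @ A matches the projected weights (I - v v^T) W
assert torch.allclose(W + B @ A, W - torch.outer(v_hat, v_hat) @ W, atol=1e-5)
```

Since B @ A is fixed in advance, PEFT is only doing bookkeeping here, which is exactly the point of the note above.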

@accemlcc

This comment was marked as off-topic.

* perf: optimize abliteration matrix op

* refactor: comments and var names correspond with arditi

* refactor: fix comments and improve var notation

* fix: accidental line change and improve comments

---------

Co-authored-by: mad-cat-lon <113548315+mad-cat-lon@users.noreply.github.com>
@p-e-w
Owner

p-e-w commented Dec 2, 2025

Thanks for the PR!

It appears you changed the line endings for every file in the project, which makes the actual changes very difficult to review. Please fix this so I can proceed.

@David-AU-github
Excellent work. Tried your fork on 2 models - both came out excellent.

- Check for LoRA adapters before attempting LoRA abliteration
- Fall back to direct weight modification for nn.Parameter (GPT-OSS)
- Ensures compatibility across all model architectures
@Vinay-Umrethe
Contributor

@accemlcc hey, I recommended your fork to a user and it ended up working for him, but he had to use an older commit because your latest one had a `projector is not defined` issue:

projector = torch.outer(
    layer_refusal_direction,
    layer_refusal_direction,
).to(self.model.dtype)

This was removed in a commit; I gave him the commit before it and that worked. Just letting you know in case the removal was accidental.
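For context, here is how a rank-1 refusal projector like the snippet above is typically applied in directional ablation (a sketch with illustrative names, not necessarily the PR's exact code):

```python
import torch

# Build the rank-1 projector onto the refusal direction and
# subtract that component from a weight matrix (illustrative names).
v = torch.randn(16)
v = v / v.norm()                    # refusal direction, normalized
projector = torch.outer(v, v)       # rank-1 projection onto v
W = torch.randn(16, 8)
W_abliterated = W - projector @ W   # strip the v-component from each column
# The result has no component left along v:
assert torch.allclose(v @ W_abliterated, torch.zeros(8), atol=1e-5)
```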

@p-e-w p-e-w left a comment (Owner)

Thanks for this amazing PR! I'm super excited about this. Being able to abliterate with 1/4th the VRAM is a quantum leap.

How will exporting models work with this? When calling model.save_pretrained/model.push_to_hub, are the LoRA adapters applied to the tensors before saving?

@accemlcc
Contributor Author

accemlcc commented Dec 3, 2025

Thanks @p-e-w, @red40maxxer, and @Vinayyyy7 for the detailed review! :)

I've pushed a commit that addresses the critical issues:

  1. Fixed projector bug: Restored the missing logic in the fallback path using the efficient rank-1 update method (thanks @red40maxxer!).
  2. Reverted README: Removed the "fork" language.
  3. Updated Print Statement: Changed "matrices" to "modules".

@p-e-w
Owner

p-e-w commented Dec 3, 2025

Reverted README: Removed the "fork" language.

The README changes are still there in this PR.

@p-e-w p-e-w left a comment (Owner)

So here's the big question: Why exactly do we need the use_lora option? My understanding is that the LoRA approach is mathematically equivalent to the existing code (please confirm this!), it simply uses a different mechanism in Transformers, and it has the major advantage of avoiding model reloads. Under which circumstances would we not want that? Quantization should of course be optional and disabled by default.

There's still the open point regarding model export with LoRAs.

@accemlcc
Contributor Author

accemlcc commented Dec 3, 2025

So here's the big question: Why exactly do we need the use_lora option?

Mathematically, the operation is equivalent (a rank-1 update). However, I'd argue we should keep both options, with your original approach as the default:

Why direct modification should remain the default:

  • Simplicity & Predictability: Your original workflow is cleaner and more transparent. No PEFT dependency, no adapter complexity.
  • Quantization is complex: Not all models react the same way to quantization. For small models (e.g., 1B), quantization doesn't make sense anyway.
  • Export workflow: With LoRA, users get only the adapter. They need to know how to merge it correctly (merge_and_unload()). Direct modification works out-of-the-box with save_pretrained().

When LoRA is essential:

  • Large models (70B+) where VRAM is the bottleneck.
  • Users explicitly want 4-bit quantization.

Proposal:

  • Keep quantization = "none" and use_lora = True as defaults (your workflow).
  • Users who need 4-bit can set quantization = "bnb_4bit".
  • Optionally, we could add an auto-merge function for LoRA export in the future.

@p-e-w
Owner

p-e-w commented Dec 3, 2025

Quantization is complex: Not all models react the same way to quantization. For small models (e.g., 1B), quantization doesn't make sense anyway.

Sure, but using PEFT doesn't require quantization, right? My question is: Under which circumstances would the user set use_lora to False?

Export workflow: With LoRA, users get only the adapter. They need to know how to merge it correctly (merge_and_unload()). Direct modification works out-of-the-box with save_pretrained().

That needs to be changed. We always want to export the full model. LoRAs are very niche with LLMs, and virtually nobody uses them. When Heretic is run with use_lora=True, the exported model should still be the full transformer with the adapters merged in.
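In tensor terms, "merging the adapters in" before export just means folding B @ A into the base weights; a sketch of the arithmetic (not PEFT's merge_and_unload implementation):

```python
import torch

# Adapter forward pass vs. merged forward pass: identical outputs,
# but the merged version is one dense matmul with plain weights
# that can be exported as a full model (illustrative shapes).
d, k, r = 8, 5, 1
W = torch.randn(d, k)     # frozen base weight
B = torch.randn(d, r)     # LoRA down factor
A = torch.randn(r, k)     # LoRA up factor
x = torch.randn(3, k)

y_adapter = x @ W.T + x @ (B @ A).T   # base path + low-rank correction
W_merged = W + B @ A                  # fold the adapter into the weights
y_merged = x @ W_merged.T             # same output, no adapter needed
assert torch.allclose(y_adapter, y_merged, atol=1e-5)
```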

@accemlcc
Contributor Author

Status Update: BNB 8-bit Quantization Removed

After extensive testing and debugging, I've removed support for bnb_8bit quantization. Here's why:

The Problem
BitsAndBytes 8-bit quantization is fundamentally incompatible with the LoRA-based abliteration approach. The issue lies in how bitsandbytes handles quantization state:

  1. When loading a model with load_in_8bit=True, weights are stored as Int8Params with associated scale factors (SCB/CB)
  2. These scale factors are required to dequantize weights back to float32 for LoRA abliteration calculations (lora_A = v^T W)
  3. However, bitsandbytes clears SCB/CB after each forward pass to save memory
  4. By the time abliterate() is called, multiple forward passes have already occurred (batch size detection, prefix check, residual calculation), so the quantization state is gone
  5. Without SCB/CB, the int8 values are interpreted directly as floats (-128 to 127), producing completely wrong weight matrices and causing abnormally high KL divergence values (~15-19 instead of ~0.03-0.1)
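The failure in point 5 is easy to reproduce in miniature (plain torch, not bitsandbytes internals): without its scale factors, an int8 weight tensor is just raw integers.

```python
import torch

# Miniature reproduction of the failure mode: int8 weights are
# meaningless without their per-row absmax scale factors.
W = torch.randn(4, 4)
scale = W.abs().amax(dim=1, keepdim=True) / 127   # per-row absmax scale
W_int8 = torch.round(W / scale).to(torch.int8)

W_dequant = W_int8.float() * scale   # scale present: close to W
W_lost = W_int8.float()              # scale lost: raw values up to +/-127

assert torch.allclose(W_dequant, W, atol=scale.max().item())
assert W_lost.abs().max() > W.abs().max()   # wildly out of range
```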

Attempted Solutions
I explored several approaches:

  • Using llm_int8_has_fp16_weight=True to preserve FP16 weights → Breaks model loading
  • Caching dequantized weights before first forward pass → Too invasive and adds significant complexity
  • Re-quantizing on demand → Would require modifying bitsandbytes internals

Resolution
Given that 4-bit quantization works flawlessly and provides even better memory savings than 8-bit, I've removed bnb_8bit entirely. This keeps the codebase clean and avoids user frustration with a broken feature.

Files changed:

config.py: Removed BNB_8BIT from enum
model.py: Removed 8-bit config and related checks
main.py: Removed 8-bit from merge strategy check
config.default.toml: Updated documentation

@p-e-w
Owner

p-e-w commented Dec 14, 2025

Ok, no problem. Good you noticed this before we merge.

@p-e-w
Owner

p-e-w commented Dec 14, 2025

Can you do the full run and post the outcomes (Pareto front)? (Please don't paste the full output here though.)

@accemlcc
Contributor Author

accemlcc commented Dec 14, 2025

Can you do the full run and post the outcomes (Pareto front)? (Please don't paste the full output here though.)

heretic --batch-size=128 openai/gpt-oss-20b

[Trial 140] Refusals: 54/100, KL divergence: 0.0781
[Trial 138] Refusals: 55/100, KL divergence: 0.0724
[Trial 141] Refusals: 59/100, KL divergence: 0.0656
[Trial 71] Refusals: 61/100, KL divergence: 0.0646
[Trial 70] Refusals: 63/100, KL divergence: 0.0567
[Trial 156] Refusals: 64/100, KL divergence: 0.0481
[Trial 152] Refusals: 67/100, KL divergence: 0.0447
[Trial 164] Refusals: 70/100, KL divergence: 0.0385
[Trial 83] Refusals: 72/100, KL divergence: 0.0365
[Trial 79] Refusals: 75/100, KL divergence: 0.0348
[Trial 165] Refusals: 77/100, KL divergence: 0.0322
[Trial 55] Refusals: 89/100, KL divergence: 0.0258
[Trial 112] Refusals: 90/100, KL divergence: 0.0242
[Trial 120] Refusals: 91/100, KL divergence: 0.0217
[Trial 103] Refusals: 92/100, KL divergence: 0.0212
[Trial 51] Refusals: 96/100, KL divergence: 0.0183
[Trial 190] Refusals: 97/100, KL divergence: 0.0115
[Trial 198] Refusals: 98/100, KL divergence: 0.0005

@p-e-w p-e-w left a comment (Owner)

Okay. After changing the two logging messages as indicated, please go over the diff carefully one more time to make sure you think it's good to go, then let me know. I will then merge this mammoth pull request.

@@ -328,7 +396,7 @@ def objective(trial: Trial) -> tuple[float, float]:
for name, value in get_trial_parameters(trial).items():
print(f" * {name} = [bold]{value}[/]")
print("* Reloading model...")

Suggested change
print("* Reloading model...")
print("* Resetting model...")

@@ -427,7 +495,7 @@ def objective_wrapper(trial: Trial) -> tuple[float, float]:
print()
print(f"Restoring model from trial [bold]{trial.user_attrs['index']}[/]...")
print("* Reloading model...")

Suggested change
print("* Reloading model...")
print("* Resetting model...")

modules[component] = []
modules[component].append(module)
else:
# Assert for unexpected types (catches architecture changes)

👍 Good stuff!

@accemlcc
Contributor Author

mmm... I checked everything again, it looks good, but you're obviously much better at code review than I am.

I also uploaded some new models to HF today, all with very good results. In my opinion, the PR is fully functional.

@p-e-w p-e-w merged commit 243f821 into p-e-w:master Dec 14, 2025
4 checks passed
@p-e-w
Owner

p-e-w commented Dec 14, 2025

Merged! Thank you for this giant leap forward!

@accemlcc
Contributor Author

Phew... that was an intense but also instructive experience for me as my first PR on GitHub ever. Next time, I might start with something like “hello world” ;-)

Thanks again for your reviews!

@p-e-w
Owner

p-e-w commented Dec 14, 2025

That's pretty big for a first PR! Hope you won't stop there :)
