
feat: Add 4-bit loading + LoRA support for low VRAM optimization#60

Merged
p-e-w merged 57 commits into p-e-w:master from accemlcc:master
Dec 14, 2025

Conversation

@accemlcc
Contributor

@accemlcc accemlcc commented Dec 1, 2025

Hi! As discussed on Reddit, this PR implements a workflow to enable abliteration on consumer hardware with limited VRAM.

Changes:

  • Implements 4-bit loading via bitsandbytes.
  • Uses LoRA adapters instead of modifying full weights in-place.
  • Optimization is calculated on the frozen 4-bit model.
  • Allows saving the result as a small LoRA adapter.

Impact:

  • Drastically reduces VRAM usage.
  • As discussed, the refusal vector is dominant enough that finding it in 4-bit precision works perfectly fine when applied to the full model later.
  • Tested on Llama 3 70B with negligible perplexity impact (4.19 -> 4.20).

Let me know if you need any changes!


p-e-w's technical implementation note:

Just for posterity, the "LoRA" approach in this implementation is a bit of a different beast from how LoRAs are normally used.

The idea behind a low-rank adaptation is to factor the adapter matrix (which is simply added to the module matrix) into a product of two low-rank matrices. This drastically decreases the number of trainable parameters, allowing for more efficient training.

But we don't do training here. In fact, we already pre-decide what the matrix product BA should be, from the abliteration parameters alone.

In principle, that makes PEFT overkill for this task, as we could simply use a basic module wrapper that applies the adapter in the forward pass. I've thought about this because I dislike unnecessary dependencies, but the reason PEFT still makes sense is that it is designed to work with Transformers, and saves us a lot of manual messing around with weights. Therefore, I believe that this PR is correct in using PEFT, despite the approach being somewhat unconventional for LoRA adapters.
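To make the "pre-decided BA" point concrete, here is a minimal sketch (illustrative names, not the PR's actual code) of how the abliteration update W' = (I - vv^T)W factors into LoRA-style matrices without any training:

```python
import torch

def abliteration_lora(W: torch.Tensor, v: torch.Tensor):
    """Factor the rank-1 update W' = (I - v v^T) W into LoRA-style
    factors B (d x 1) and A (1 x k) such that W + B @ A == W'.
    Nothing is trained; both factors follow from W and v alone."""
    v = v / v.norm()              # unit refusal direction
    A = (v @ W).unsqueeze(0)      # 1 x k, i.e. v^T W
    B = (-v).unsqueeze(1)         # d x 1, i.e. -v
    return B, A

W = torch.randn(8, 5)
v = torch.randn(8)
B, A = abliteration_lora(W, v)
v_hat = v / v.norm()
# W + B @ A matches the projected weights (I - v v^T) W
assert torch.allclose(W + B @ A, W - torch.outer(v_hat, v_hat) @ W, atol=1e-5)
```

Since B @ A is fixed in advance, PEFT is only doing bookkeeping here, which is exactly the point of the note above.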

@accemlcc

This comment was marked as off-topic.

* perf: optimize abliteration matrix op

* refactor: comments and var names correspond with arditi

* refactor: fix comments and improve var notation

* fix: accidental line change and improve comments

---------

Co-authored-by: mad-cat-lon <113548315+mad-cat-lon@users.noreply.github.com>
@p-e-w
Owner

p-e-w commented Dec 2, 2025

Thanks for the PR!

It appears you changed the line endings for every file in the project, which makes the actual changes very difficult to review. Please fix this so I can proceed.

@David-AU-github
Excellent work. Tried your fork on 2 models - both came out excellent.

- Check for LoRA adapters before attempting LoRA abliteration
- Fall back to direct weight modification for nn.Parameter (GPT-OSS)
- Ensures compatibility across all model architectures
@Vinay-Umrethe
Contributor

@accemlcc hey, I recommended your fork to a user and it ended up working for him, but he had to use an older commit because your latest one had a `projector is not defined` issue:

projector = torch.outer(
    layer_refusal_direction,
    layer_refusal_direction,
).to(self.model.dtype)

This was removed in a commit; I gave him the commit before it and that worked. Just letting you know in case the removal was accidental.
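For context, here is how a rank-1 refusal projector like the snippet above is typically applied in directional ablation (a sketch with illustrative names, not necessarily the PR's exact code):

```python
import torch

# Build the rank-1 projector onto the refusal direction and
# subtract that component from a weight matrix (illustrative names).
v = torch.randn(16)
v = v / v.norm()                    # refusal direction, normalized
projector = torch.outer(v, v)       # rank-1 projection onto v
W = torch.randn(16, 8)
W_abliterated = W - projector @ W   # strip the v-component from each column
# The result has no component left along v:
assert torch.allclose(v @ W_abliterated, torch.zeros(8), atol=1e-5)
```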

@p-e-w p-e-w left a comment (Owner)

Thanks for this amazing PR! I'm super excited about this. Being able to abliterate with 1/4th the VRAM is a quantum leap.

How will exporting models work with this? When calling model.save_pretrained/model.push_to_hub, are the LoRA adapters applied to the tensors before saving?

@accemlcc
Contributor Author

accemlcc commented Dec 3, 2025

Thanks @p-e-w, @red40maxxer, and @Vinayyyy7 for the detailed review! :)

I've pushed a commit that addresses the critical issues:

  1. Fixed projector bug: Restored the missing logic in the fallback path using the efficient rank-1 update method (thanks @red40maxxer!).
  2. Reverted README: Removed the "fork" language.
  3. Updated Print Statement: Changed "matrices" to "modules".

@p-e-w
Owner

p-e-w commented Dec 3, 2025

Reverted README: Removed the "fork" language.

The README changes are still there in this PR.

@p-e-w p-e-w left a comment (Owner)

So here's the big question: Why exactly do we need the use_lora option? My understanding is that the LoRA approach is mathematically equivalent to the existing code (please confirm this!), it simply uses a different mechanism in Transformers, and it has the major advantage of avoiding model reloads. Under which circumstances would we not want that? Quantization should of course be optional and disabled by default.

There's still the open point regarding model export with LoRAs.

@accemlcc
Contributor Author

accemlcc commented Dec 3, 2025

So here's the big question: Why exactly do we need the use_lora option?

Mathematically, the operation is equivalent (a rank-1 update). However, I'd argue we should keep both options, with your original approach as the default:

Why direct modification should remain the default:

  • Simplicity & Predictability: Your original workflow is cleaner and more transparent. No PEFT dependency, no adapter complexity.
  • Quantization is complex: Not all models react the same way to quantization. For small models (e.g., 1B), quantization doesn't make sense anyway.
  • Export workflow: With LoRA, users get only the adapter. They need to know how to merge it correctly (merge_and_unload()). Direct modification works out-of-the-box with save_pretrained().

When LoRA is essential:

  • Large models (70B+) where VRAM is the bottleneck.
  • Users explicitly want 4-bit quantization.

Proposal:

  • Keep quantization = "none" and use_lora = True as defaults (your workflow).
  • Users who need 4-bit can set quantization = "bnb_4bit".
  • Optionally, we could add an auto-merge function for LoRA export in the future.

@p-e-w
Owner

p-e-w commented Dec 3, 2025

Quantization is complex: Not all models react the same way to quantization. For small models (e.g., 1B), quantization doesn't make sense anyway.

Sure, but using PEFT doesn't require quantization, right? My question is: Under which circumstances would the user set use_lora to False?

Export workflow: With LoRA, users get only the adapter. They need to know how to merge it correctly (merge_and_unload()). Direct modification works out-of-the-box with save_pretrained().

That needs to be changed. We always want to export the full model. LoRAs are very niche with LLMs, and virtually nobody uses them. When Heretic is run with use_lora=True, the exported model should still be the full transformer with the adapters merged in.
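In tensor terms, "merging the adapters in" before export just means folding B @ A into the base weights; a sketch of the arithmetic (not PEFT's merge_and_unload implementation):

```python
import torch

# Adapter forward pass vs. merged forward pass: identical outputs,
# but the merged version is one dense matmul with plain weights
# that can be exported as a full model (illustrative shapes).
d, k, r = 8, 5, 1
W = torch.randn(d, k)     # frozen base weight
B = torch.randn(d, r)     # LoRA down factor
A = torch.randn(r, k)     # LoRA up factor
x = torch.randn(3, k)

y_adapter = x @ W.T + x @ (B @ A).T   # base path + low-rank correction
W_merged = W + B @ A                  # fold the adapter into the weights
y_merged = x @ W_merged.T             # same output, no adapter needed
assert torch.allclose(y_adapter, y_merged, atol=1e-5)
```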

@accemlcc
Contributor Author

Status Update: BNB 8-bit Quantization Removed

After extensive testing and debugging, I've removed support for bnb_8bit quantization. Here's why:

The Problem
BitsAndBytes 8-bit quantization is fundamentally incompatible with the LoRA-based abliteration approach. The issue lies in how bitsandbytes handles quantization state:

  1. When loading a model with load_in_8bit=True, weights are stored as Int8Params with associated scale factors (SCB/CB)
  2. These scale factors are required to dequantize weights back to float32 for LoRA abliteration calculations (lora_A = v^T W)
  3. However, bitsandbytes clears SCB/CB after each forward pass to save memory
  4. By the time abliterate() is called, multiple forward passes have already occurred (batch size detection, prefix check, residual calculation), so the quantization state is gone
  5. Without SCB/CB, the int8 values are interpreted directly as floats (-128 to 127), producing completely wrong weight matrices and causing abnormally high KL divergence values (~15-19 instead of ~0.03-0.1)
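The failure in point 5 is easy to reproduce in miniature (plain torch, not bitsandbytes internals): without its scale factors, an int8 weight tensor is just raw integers.

```python
import torch

# Miniature reproduction of the failure mode: int8 weights are
# meaningless without their per-row absmax scale factors.
W = torch.randn(4, 4)
scale = W.abs().amax(dim=1, keepdim=True) / 127   # per-row absmax scale
W_int8 = torch.round(W / scale).to(torch.int8)

W_dequant = W_int8.float() * scale   # scale present: close to W
W_lost = W_int8.float()              # scale lost: raw values up to +/-127

assert torch.allclose(W_dequant, W, atol=scale.max().item())
assert W_lost.abs().max() > W.abs().max()   # wildly out of range
```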

Attempted Solutions
I explored several approaches:

  • Using llm_int8_has_fp16_weight=True to preserve FP16 weights → Breaks model loading
  • Caching dequantized weights before first forward pass → Too invasive and adds significant complexity
  • Re-quantizing on demand → Would require modifying bitsandbytes internals

Resolution
Given that 4-bit quantization works flawlessly and provides even better memory savings than 8-bit, I've removed bnb_8bit entirely. This keeps the codebase clean and avoids user frustration with a broken feature.

Files changed:

config.py: Removed BNB_8BIT from enum
model.py: Removed 8-bit config and related checks
main.py: Removed 8-bit from merge strategy check
config.default.toml: Updated documentation

@p-e-w
Owner

p-e-w commented Dec 14, 2025

Ok, no problem. Good you noticed this before we merge.

@p-e-w
Owner

p-e-w commented Dec 14, 2025

Can you do the full run and post the outcomes (Pareto front)? (Please don't paste the full output here though.)

@accemlcc
Contributor Author

accemlcc commented Dec 14, 2025

Can you do the full run and post the outcomes (Pareto front)? (Please don't paste the full output here though.)

heretic --batch-size=128 openai/gpt-oss-20b

[Trial 140] Refusals: 54/100, KL divergence: 0.0781
[Trial 138] Refusals: 55/100, KL divergence: 0.0724
[Trial 141] Refusals: 59/100, KL divergence: 0.0656
[Trial 71] Refusals: 61/100, KL divergence: 0.0646
[Trial 70] Refusals: 63/100, KL divergence: 0.0567
[Trial 156] Refusals: 64/100, KL divergence: 0.0481
[Trial 152] Refusals: 67/100, KL divergence: 0.0447
[Trial 164] Refusals: 70/100, KL divergence: 0.0385
[Trial 83] Refusals: 72/100, KL divergence: 0.0365
[Trial 79] Refusals: 75/100, KL divergence: 0.0348
[Trial 165] Refusals: 77/100, KL divergence: 0.0322
[Trial 55] Refusals: 89/100, KL divergence: 0.0258
[Trial 112] Refusals: 90/100, KL divergence: 0.0242
[Trial 120] Refusals: 91/100, KL divergence: 0.0217
[Trial 103] Refusals: 92/100, KL divergence: 0.0212
[Trial 51] Refusals: 96/100, KL divergence: 0.0183
[Trial 190] Refusals: 97/100, KL divergence: 0.0115
[Trial 198] Refusals: 98/100, KL divergence: 0.0005

@p-e-w p-e-w left a comment (Owner)

Okay. After changing the two logging messages as indicated, please go over the diff carefully one more time to make sure you think it's good to go, then let me know. I will then merge this mammoth pull request.

@@ -328,7 +396,7 @@ def objective(trial: Trial) -> tuple[float, float]:
for name, value in get_trial_parameters(trial).items():
print(f" * {name} = [bold]{value}[/]")
print("* Reloading model...")

Suggested change
print("* Reloading model...")
print("* Resetting model...")

@@ -427,7 +495,7 @@ def objective_wrapper(trial: Trial) -> tuple[float, float]:
print()
print(f"Restoring model from trial [bold]{trial.user_attrs['index']}[/]...")
print("* Reloading model...")

Suggested change
print("* Reloading model...")
print("* Resetting model...")

modules[component] = []
modules[component].append(module)
else:
# Assert for unexpected types (catches architecture changes)

👍 Good stuff!

@accemlcc
Contributor Author

mmm... I checked everything again, it looks good, but you're obviously much better at code review than I am.

I also uploaded some new models to HF today, all with very good results. In my opinion, the PR is fully functional.

@p-e-w p-e-w merged commit 243f821 into p-e-w:master Dec 14, 2025
4 checks passed
@p-e-w
Owner

p-e-w commented Dec 14, 2025

Merged! Thank you for this giant leap forward!

@accemlcc
Contributor Author

Phew... that was an intense but also instructive experience for me as my first PR on GitHub ever. Next time, I might start with something like “hello world” ;-)

Thanks again for your reviews!

@p-e-w
Owner

p-e-w commented Dec 14, 2025

That's pretty big for a first PR! Hope you won't stop there :)
