feat: Add 4-bit loading + LoRA support for low VRAM optimization (#60)

p-e-w merged 57 commits into p-e-w:master
Conversation
* perf: optimize abliteration matrix op
* refactor: comments and var names correspond with arditi
* refactor: fix comments and improve var notation
* fix: accidental line change and improve comments

Co-authored-by: mad-cat-lon <113548315+mad-cat-lon@users.noreply.github.com>
Thanks for the PR! It appears you changed the line endings for every file in the project, which makes the actual changes very difficult to review. Please fix this so I can proceed.

Excellent work. Tried your fork on 2 models; both came out excellent.

- Check for LoRA adapters before attempting LoRA abliteration
- Fall back to direct weight modification for nn.Parameter (GPT-OSS)
- Ensures compatibility across all model architectures
@accemlcc hey, I suggested your fork to a user, and in the end it was useful to him, but he had to use an older commit since your latest one had a `projector not defined` issue:

```python
projector = torch.outer(
    layer_refusal_direction,
    layer_refusal_direction,
).to(self.model.dtype)
```

This was removed in a commit; I gave him the one before it and it worked. Informing you just in case the removal was accidental.
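For context, that projector implements the core abliteration step: removing the refusal direction from a weight matrix. A minimal sketch with random data (the function and variable names here are illustrative, not Heretic's actual code):

```python
import torch

def abliterate_weight(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of a weight matrix: W' = (I - d d^T) W."""
    d = direction / direction.norm()  # unit-norm refusal direction
    projector = torch.outer(d, d).to(weight.dtype)
    return weight - projector @ weight

# Toy example: after abliteration, the weight has no component along d
w = torch.randn(4, 3)
d = torch.randn(4)
w2 = abliterate_weight(w, d)
print(torch.allclose((d / d.norm()) @ w2, torch.zeros(3), atol=1e-5))  # True
```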
p-e-w left a comment:
Thanks for this amazing PR! I'm super excited about this. Being able to abliterate with 1/4th the VRAM is a quantum leap.
How will exporting models work with this? When calling `model.save_pretrained`/`model.push_to_hub`, are the LoRA adapters applied to the tensors before saving?
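On the export question: merging a LoRA adapter means materializing W + BA into the base weight, after which the model can be saved as a plain checkpoint (in PEFT this is what `merge_and_unload()` does before `save_pretrained`). A toy check of the merge identity:

```python
import torch

# Rank-1 LoRA adapter: B is (out, 1), A is (1, in).
# At inference the adapter computes W x + B (A x);
# "merging" bakes B @ A into the weight so no adapter is needed after saving.
W = torch.randn(4, 3)
B = torch.randn(4, 1)
A = torch.randn(1, 3)
x = torch.randn(3)

adapter_out = W @ x + B @ (A @ x)  # adapter applied on the fly
merged_out = (W + B @ A) @ x       # adapter folded into the weight
print(torch.allclose(adapter_out, merged_out, atol=1e-5))  # True
```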
Thanks @p-e-w, @red40maxxer, and @Vinayyyy7 for the detailed review! :) I've pushed a commit that addresses the critical issues:
The README changes are still there in this PR.
p-e-w left a comment:
So here's the big question: Why exactly do we need the use_lora option? My understanding is that the LoRA approach is mathematically equivalent to the existing code (please confirm this!), it simply uses a different mechanism in Transformers, and it has the major advantage of avoiding model reloads. Under which circumstances would we not want that? Quantization should of course be optional and disabled by default.
There's still the open point regarding model export with LoRAs.
Mathematically, the operation is equivalent (a rank-1 update). However, I'd argue we should keep both options, with your original approach as the default:

Why direct modification should remain the default:

When LoRA is essential:

Proposal:
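As a quick numerical check of that equivalence claim (a sketch with random data; the rank-1 LoRA factors are B = -d and A = dᵀW):

```python
import torch

W = torch.randn(4, 3)
d = torch.randn(4)
d = d / d.norm()  # unit-norm refusal direction

# Direct modification: W' = (I - d d^T) W
direct = W - torch.outer(d, d) @ W

# LoRA form: W' = W + B @ A with B = -d (column), A = d^T W (row)
B = -d.unsqueeze(1)        # (4, 1)
A = (d @ W).unsqueeze(0)   # (1, 3)
lora = W + B @ A

print(torch.allclose(direct, lora, atol=1e-5))  # True
```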
Sure, but using PEFT doesn't require quantization, right? My question is: Under which circumstances would the user set
That needs to be changed. We always want to export the full model. LoRAs are very niche with LLMs, and virtually nobody uses them. When Heretic is run with
Status Update: BNB 8-bit Quantization Removed

After extensive testing and debugging, I've removed support for bnb_8bit quantization. Here's why:

The Problem

Attempted Solutions

Resolution

Files changed: config.py
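For posterity, the 4-bit loading path that this PR keeps is typically configured through transformers' `BitsAndBytesConfig`. A sketch under the standard transformers + bitsandbytes API (the parameter values shown are illustrative, not necessarily Heretic's defaults):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    quantization_config=quant_config,
    device_map="auto",
)
```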
Ok, no problem. Good you noticed this before we merge.
…l_name, fix type hints, remove GPT-OSS MoE, update assertion
Can you do the full run and post the outcomes (Pareto front)? (Please don't paste the full output here though.)

```
heretic --batch-size=128 openai/gpt-oss-20b

[Trial 140] Refusals: 54/100, KL divergence: 0.0781
```
p-e-w left a comment:
Okay. After changing the two logging messages as indicated, please go over the diff carefully one more time to make sure you think it's good to go, then let me know. I will then merge this mammoth pull request.
src/heretic/main.py (Outdated)

```diff
@@ -328,7 +396,7 @@ def objective(trial: Trial) -> tuple[float, float]:
     for name, value in get_trial_parameters(trial).items():
         print(f" * {name} = [bold]{value}[/]")
-    print("* Reloading model...")
+    print("* Resetting model...")
```
src/heretic/main.py (Outdated)

```diff
@@ -427,7 +495,7 @@ def objective_wrapper(trial: Trial) -> tuple[float, float]:
     print()
     print(f"Restoring model from trial [bold]{trial.user_attrs['index']}[/]...")
-    print("* Reloading model...")
+    print("* Resetting model...")
```
```python
            modules[component] = []
        modules[component].append(module)
    else:
        # Assert for unexpected types (catches architecture changes)
```
mmm... I checked everything again, and it looks good, but you're obviously much better at code review than I am. I also uploaded some new models to HF today, all with very good results. In my opinion, the PR is fully functional.

Merged! Thank you for this giant leap forward!

Phew... that was an intense but also instructive experience for me, as my first PR on GitHub ever. Next time, I might start with something like “hello world” ;-) Thanks again for your reviews!

That's pretty big for a first PR! Hope you won't stop there :)
Hi! As discussed on Reddit, this PR implements a workflow to enable abliteration on consumer hardware with limited VRAM.
Changes:

bitsandbytes.

Impact:
Let me know if you need any changes!
PLEASE NOTE p-e-w's Technical Implementation Note:
Just for posterity, the "LoRA" approach in this implementation is a bit of a different beast from how LoRAs are normally used.
The idea behind a low-rank adaptation is to factor the adapter matrix (which is simply added to the module matrix) into a product of two low-rank matrices. This drastically decreases the number of trainable parameters, allowing for more efficient training.
But we don't do training here. In fact, we already pre-decide what the matrix product BA should be, from the abliteration parameters alone.
In principle, that makes PEFT overkill for this task, as we could simply use a basic module wrapper that applies the adapter in the forward pass. I've thought about this because I dislike unnecessary dependencies, but the reason PEFT still makes sense is that it is designed to work with Transformers, and saves us a lot of manual messing around with weights. Therefore, I believe that this PR is correct in using PEFT, despite the approach being somewhat unconventional for LoRA adapters.
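For illustration, such a module wrapper might look like this (hypothetical code, not part of the PR; it precomputes the rank-1 factors once and applies them in the forward pass):

```python
import torch
import torch.nn as nn

class AbliteratedLinear(nn.Module):
    """Wraps a Linear layer and applies a fixed rank-1 adapter in forward."""

    def __init__(self, base: nn.Linear, direction: torch.Tensor):
        super().__init__()
        self.base = base
        d = direction / direction.norm()
        # Precompute B and A so that W_eff = W + B @ A = (I - d d^T) W
        self.register_buffer("b", -d.unsqueeze(1))                           # (out, 1)
        self.register_buffer("a", (d @ base.weight).detach().unsqueeze(0))   # (1, in)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to x @ W_eff.T, without materializing W_eff
        return self.base(x) + (x @ self.a.T) @ self.b.T

base = nn.Linear(3, 4, bias=False)
d = torch.randn(4)
wrapped = AbliteratedLinear(base, d)

# The wrapper matches direct weight modification
x = torch.randn(2, 3)
d_hat = d / d.norm()
direct = x @ (base.weight - torch.outer(d_hat, d_hat) @ base.weight).T
print(torch.allclose(wrapped(x), direct, atol=1e-5))  # True
```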