
Sharded LoRA merge #162

Open
kabachuha wants to merge 10 commits into p-e-w:master from kabachuha:sharded

Conversation

@kabachuha

A vanilla LoRA PEFT merge requires the full dequantized model to reside in RAM, which is disastrous for very big models that weigh far more than available RAM (they can be abliterated in 4-bit thanks to BNB, but saving them is problematic).

This PR finds the actual location of the HF model on disk and walks through its shards one .safetensors file at a time. For each tensor, if it is targeted by the LoRA, the corresponding LoRA key is mapped and the weight is modified by adding the matrix product.

Finally, the script verifies that all LoRA keys have been applied, as an integrity check.
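The per-tensor merge described above can be sketched as follows — a minimal NumPy sketch in which plain dicts stand in for the .safetensors shards and adapter file; the key naming, `merge_shard` helper, and `scale` handling are illustrative, not the PR's actual code:

```python
import numpy as np

def merge_shard(shard, lora_weights, scale):
    """Merge LoRA deltas into one shard's tensors in place.

    shard        : dict of base weights, e.g. {"model.layers.0.q_proj.weight": W}
    lora_weights : dict of LoRA tensors keyed by "<base stem>.lora_A" / ".lora_B"
    scale        : LoRA scaling factor (alpha / r)
    Returns the list of base keys that received a LoRA update.
    """
    applied = []
    for key, W in shard.items():
        stem = key.removesuffix(".weight")
        a_key, b_key = stem + ".lora_A", stem + ".lora_B"
        if a_key not in lora_weights:
            continue  # tensor not targeted by the LoRA
        A = lora_weights[a_key]  # shape (r, in_features)
        B = lora_weights[b_key]  # shape (out_features, r)
        shard[key] = W + scale * (B @ A)  # W' = W + (alpha / r) * B A
        applied.append(key)
    return applied

# Tiny demo: one 4x3 weight with a rank-2 LoRA.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
A = rng.standard_normal((2, 3))
B = rng.standard_normal((4, 2))
shard = {"model.layers.0.q_proj.weight": W.copy()}
lora = {"model.layers.0.q_proj.lora_A": A, "model.layers.0.q_proj.lora_B": B}

applied = merge_shard(shard, lora, scale=2.0 / 2)  # alpha=2, r=2
assert applied == ["model.layers.0.q_proj.weight"]
assert np.allclose(shard["model.layers.0.q_proj.weight"], W + B @ A)
```

In the real code the outer loop would open each shard with safetensors' `safe_open`, apply this per-tensor logic, and write the result back with `save_file`, so only one shard is resident in memory at a time.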

Closes #156.

P.S. I also hope you will bring back the LoRA export functionality eventually 😬

@gemini-code-assist
Contributor

Summary of Changes

Hello @kabachuha, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the capability to merge LoRA adapters into large language models by implementing a sharded merging approach. This new method circumvents the high memory demands of traditional merging, which often requires dequantizing the entire model into RAM, making it feasible to merge adapters into models that exceed system memory capacity. The change introduces a new merge strategy that processes model weights in smaller, manageable chunks, thereby drastically reducing the memory footprint and preventing system instability.

Highlights

  • New Sharded LoRA Merge Strategy: Introduced a new merge_sharded option for merging LoRA adapters into a full model, specifically designed for large models to significantly reduce RAM usage during the merge process.
  • Reduced RAM Footprint: The sharded merge process allows for merging large models with very low RAM usage (typically <10 GB), addressing the issue of dequantizing the entire model to system RAM which can lead to freezes for very big models.
  • Iterative Shard Processing: The new strategy processes model shards one by one, identifying LoRA-targeted tensors within each shard, applying the LoRA weight addition, and saving the merged shard, ensuring efficient memory management.
  • Improved User Feedback: Updated warning messages and RAM estimation outputs to clearly differentiate between non-sharded and sharded merging modes, providing better guidance to users regarding memory requirements.


Changelog
  • src/heretic/main.py
    • Added a new print statement to inform users about the low RAM usage benefit of sharded merging.
    • Updated existing print statements to clarify RAM estimates for non-sharded merging.
    • Introduced a new merge_sharded option in the merge strategy selection menu.
    • Modified the saving logic to call the new save_sharded method when the merge_sharded strategy is chosen.
  • src/heretic/model.py
    • Imported necessary modules: os, glob, json, safe_open, and save_file for file system and safetensors operations.
    • Implemented a new save_sharded method to perform the sharded LoRA merge. This method handles resolving model paths, parsing LoRA keys, iterating through model shards, merging LoRA weights into base tensors, and saving the merged shards along with other model files.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a sharded LoRA merge capability to reduce memory usage. The implementation is mostly correct, but there are several critical issues in src/heretic/model.py that need to be addressed. These include an incorrect function signature violating the style guide, a potential crash due to a None value, and a faulty shape assertion that will cause merging to fail. I've also pointed out areas for improvement in code clarity, such as refactoring confusing logic and moving an import out of a loop. Addressing these points will make the new feature more robust and maintainable.

Comment on lines +336 to +337
assert lora_A.shape[0] == tensor.shape[0], \
f"Shape mismatch for {key}: tensor {tensor.shape}, lora_A {lora_A.shape}"
Contributor

critical

The shape verification for the LoRA weights is incorrect. The assertion lora_A.shape[0] == tensor.shape[0] compares the LoRA rank (r) with the base weight's output features, which will likely fail.

The correct checks should verify that the output dimensions of lora_B and input dimensions of lora_A match the base tensor's dimensions.

Suggested change
assert lora_A.shape[0] == tensor.shape[0], \
f"Shape mismatch for {key}: tensor {tensor.shape}, lora_A {lora_A.shape}"
# Verify shapes. W is (out, in), A is (r, in), B is (out, r).
assert lora_B.shape[0] == tensor.shape[0] and lora_A.shape[1] == tensor.shape[1], \
f"Shape mismatch for {key}: W={tensor.shape}, A={lora_A.shape}, B={lora_B.shape}"
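The dimension bookkeeping behind this suggestion can be verified with a tiny NumPy example (shapes chosen purely for illustration):

```python
import numpy as np

out_features, in_features, r = 8, 16, 4

W = np.zeros((out_features, in_features))  # base weight: (out, in)
lora_A = np.zeros((r, in_features))        # A: (r, in)
lora_B = np.zeros((out_features, r))       # B: (out, r)

# The original assertion compares the LoRA rank r against out_features,
# which fails whenever r != out_features (the usual case):
assert lora_A.shape[0] != W.shape[0]  # 4 != 8

# The corrected checks match B's rows and A's columns against W:
assert lora_B.shape[0] == W.shape[0] and lora_A.shape[1] == W.shape[1]

# Consequently the delta B @ A has exactly the base weight's shape:
assert (lora_B @ lora_A).shape == W.shape
```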

Author

If the LoRA is computed correctly, this should be fine as well.

kabachuha and others added 7 commits February 15, 2026 18:05
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
print("[yellow]This can lead to system freezes if you run out of memory.[/]")
print(
"[yellow] However, you can choose to save the model shard-by-shard. It is slightly slower, but requires <10 GB of RAM for most models.[/]"
)
Owner

I don't think we need this extra complexity. Always using sharded merging should be fine.

# 1. resolve the local path
if not os.path.exists(model_path):
print("Model path not found locally, attempting to locate in HF cache.")
hf_hub_location = os.environ.get("HF_HUB_CACHE", os.path.join(os.environ.get("HOME", ""), ".cache/huggingface/hub"))
Owner

This could be dicey. I think that if Xet is used (which HF is aggressively migrating towards), caches work quite differently, and aren't simply folders containing the model repository. It's an object storage similar to Git.

Author

@p-e-w In this case Heretic can call "from_pretrained" on the model and then "save_pretrained" into a temp folder, on which the sharded ablation is then performed. Sharded modification is just too good not to have.

Author

Of course, a warning similar to the current RAM one should be displayed if the model wasn't saved to a local folder beforehand (and we could add a hint in the documentation that local saving is preferred, and maybe enable it by default).

@p-e-w
Owner

p-e-w commented Feb 15, 2026

Xet is my main concern. I'm not on my workstation right now to verify this, but IIRC, with Xet, downloaded models are not stored in a plain folder that simply contains the safetensor files.

@kabachuha
Author

Yes, I just looked into the .cache save folders and their structure is really complicated.

So, I suggest this variant: from now on, by default, Heretic saves the model into a local dir via a huggingface_hub download — say, ./models/... with a .gitignore. The location can be set in the .toml config if the user wants to put it somewhere else, so it will not change the "fully automated" philosophy.

For backwards compatibility, the model is copied or moved via from_pretrained and save_pretrained with a warning.

@p-e-w
Owner

p-e-w commented Feb 16, 2026

So, I suggest this variant: from now on, by default, heretic saves the model into a local dir via huggingface_hub download.

No, that doesn't fly. The user may already have the model cached locally (the whole purpose of the cache is for multiple applications to share the same model data). We can't just download the whole thing again, wasting I don't know how many Gigabytes of disk space.

There is also the maintainability perspective. Managing the download ourselves is a hack because it bypasses Transformers' built-in download system, and the sharded merge itself is also a hack because it bypasses Transformers' built-in export system. You can't build a reliable piece of software by layering hacks on top of hacks.

This is obviously unfortunate. Sharded merge is cool. But it's not cool enough to turn the core of our system into a brittle mess. If this cannot be made to work directly with Xet, I'm afraid the approach will have to be abandoned. 😞

@kabachuha
Author

Maybe we can still have it, but with clear warnings and documentation? For example, many prebuilt server configurations have no more than 128 GB of RAM, while abliterating and saving even 70B models requires 141+ GB of RAM.

And there is currently no warning about RAM usage before the abliteration process starts; this can take low-RAM users by surprise after they have already run the trials.

@kabachuha
Author

kabachuha commented Feb 16, 2026

Take this configuration: 1x H100, 128 GB RAM. This way you can abliterate even a 100B+ model in 4-bit precision, but there is absolutely no way of saving it unless you rent a ton of additional RAM.

Or a more amateur 2x4090 / 5090 for numerous LLaMA 70b fine-tunes

@kabachuha
Author

The perfect approach would be, of course, to fix the LoRA system: with a LoRA, users can decide for themselves what to do with it. They can save it with PEFT, sharded or not, merge it manually, or simply convert it with llama.cpp with no hassle and use it with compact Unsloth quants, without quantizing the model again themselves (and losing the advantage of dynamic quants).

Is it possible for you to look into the LoRA export system again?

@p-e-w
Owner

p-e-w commented Feb 16, 2026

Take this configuration: 1xH100, 128 GB RAM.

That's an extremely uncommon configuration. Pairing a $20,000 GPU with $300 worth of RAM (maybe a little more expensive currently) makes very little sense. A typical research workstation with an H100 would have 512 GB of RAM or even more.

Maintaining the quality and usability of an already complex program like Heretic requires keeping complexity in check. As far as I can tell, having the VRAM to abliterate a model in Q4 but not the RAM to merge it in BF16 (that is, RAM < ~4 * VRAM), and no alternative setup, should be a very unusual situation. We cannot introduce that degree of complexity at such a fundamental level for such a corner case.
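The RAM < ~4 * VRAM rule of thumb follows directly from per-parameter storage costs — a back-of-the-envelope sketch (ignoring quantization block scales and activation overhead):

```python
# Rough per-parameter storage cost in bytes.
BYTES_Q4 = 0.5    # 4-bit quantized weight
BYTES_BF16 = 2.0  # bfloat16 weight

params = 70e9  # a 70B-parameter model, as discussed in this thread

vram_gb = params * BYTES_Q4 / 1e9   # footprint of the Q4 abliteration pass
ram_gb = params * BYTES_BF16 / 1e9  # footprint of a dequantized BF16 merge

assert ram_gb / vram_gb == 4.0  # BF16 weights are 4x the size of Q4 weights
print(f"Q4 footprint: ~{vram_gb:.0f} GB, BF16 merge footprint: ~{ram_gb:.0f} GB")
```

The ~140 GB BF16 figure is consistent with the "141+ GB RAM" number mentioned earlier in the thread for 70B models.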

Is it possible for you look into the lora export system again?

As far as I can tell, we are doing everything correctly, and this is likely a bug in PEFT, or a problem with the interaction of Transformers and PEFT (perhaps a version mismatch). This is especially likely considering that saving the LoRA locally appears to work while the HF upload fails.

@kabachuha
Author

Take this configuration: 1xH100, 128 GB RAM.

This is a rentable configuration on one of the clouds I use. Mind you, where I live (Russia), H100 / H200 availability is extremely low and even 3x H100 servers are all rented out for an unknown amount of time, so hobbyists don't have much choice when renting.

This is especially likely considering that saving the LoRA locally appears to work while the HF upload fails.

I think it is fixable and can be reported to the Hugging Face repo in an issue / pull request; someone simply needs to look into it.

We cannot introduce that degree of complexity at such a fundamental level for such a corner case.

The VRAM / RAM availability fraction is getting skewed recently because of the RAM market collapse and it is quite important for people quantizing ~70b models on home setups.

@dariox1337

I'm sorry, but can this function be merged at least as an option for manually downloaded files? There are people who prefer to run apps in isolated environments and use pre-downloaded shards, so the HF cache isn't even used.

@p-e-w
Owner

p-e-w commented Feb 16, 2026

The VRAM / RAM availability fraction is getting skewed recently because of the RAM market collapse and it is quite important for people quantizing ~70b models on home setups.

Please understand that I would love to have this feature. But a custom download system with major caveats regarding disk space, plus a custom export system, both of which replace systems that are already built into Transformers, is simply too high a price to pay IMO. Doing things like that is a superhighway towards a program that has weird bugs that nobody can understand, and that breaks in unforeseeable ways when Transformers changes something.

This is ultimately a feature that belongs upstream. Upstream doesn't have it yet, but that doesn't mean Heretic is the right place to put it.

@kabachuha
Author

This is ultimately a feature that belongs upstream.

PEFT is probably not the place too. It's all about playing with LoRAs in runtime, maybe with save and load once, not with handling the saved model's .safetensors files, because it's not related to the PEFT's runtime LoRAs architecture, and because of it a code to handle the shards is needed here. On contrary, because it doesn't really below to the PEFT or the transformers codebase, it has to be implemented here.

that breaks in unforeseeable ways when Transformers changes something.

The safetensors format is fixed, and it simply contains the weights corresponding to the model's layers. The PEFT LoRA adapter produced by Heretic targets these modules, with lora_A and lora_B linked to the model's layers. The iteration logic in this pull request goes over the model's parameters shard by shard, and the linked A and B matrices are looked up in the dictionary. The chance of PEFT or Transformers changing the safetensors storage format is essentially zero; doing so would break all tools built around Transformers, including llama.cpp, and the Hugging Face maintainers definitely don't want that.
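The base-key ↔ adapter-key correspondence described above can be sketched like this. Note this assumes PEFT's usual "base_model.model." prefixing for saved LoRA adapters; the exact prefix depends on the model wrapper, so treat the helper and key strings as illustrative:

```python
def lora_keys_for(base_key: str, prefix: str = "base_model.model.") -> tuple[str, str]:
    """Map a base-model weight key to candidate PEFT lora_A / lora_B keys.

    A common PEFT convention stores adapter weights under a
    'base_model.model.' prefix, with '.lora_A.weight' / '.lora_B.weight'
    appended to the targeted module's name.
    """
    stem = prefix + base_key.removesuffix(".weight")
    return stem + ".lora_A.weight", stem + ".lora_B.weight"

a_key, b_key = lora_keys_for("model.layers.0.self_attn.q_proj.weight")
assert a_key == "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight"
assert b_key == "base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight"
```

With such a mapping, the shard iterator only needs a dict lookup per tensor to decide whether a LoRA delta applies.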

The carryover of cached Xet transformer files can be circumvented completely if we use from_pretrained(..., cache_dir="...") or hf_hub_download, enabled by default with a created ./models subdirectory and HF_HUB_CACHE downloads disabled entirely.

@p-e-w
Owner

p-e-w commented Feb 16, 2026

PEFT is probably not the place too.

Where else? This is a standard use case and not at all specific to Heretic. Training a LoRA and merging it into the base model to create a merged model is completely generic functionality. It's not even specific to abliteration. It either belongs in Transformers, or in PEFT, or in a dedicated library.

@p-e-w
Owner

p-e-w commented Feb 16, 2026

The carryover of cached Xet transformer files can be circumvented completely if we use from_pretrained(..., cache_dir="...") or hf_hub_download, enabled by default with a created ./models subdirectory and HF_HUB_CACHE downloads disabled entirely.

Yes, but this re-downloads/duplicates models that are already in the user's HF cache, breaking a key assumption about how the HF ecosystem is supposed to work.

It also fails in the other direction: Models downloaded with Heretic aren't available to other programs using Transformers.

Again: I wish it wasn't so. But the ecosystem just isn't ready yet to implement this feature cleanly.

@kabachuha
Author

It either belongs in Transformers, or in PEFT, or in a dedicated library.

This is literally a single 84-line function with solidified logic, depending on arguably unbreakable PEFT<->Transformers naming correspondences.

Models downloaded with Heretic aren't available to other programs using Transformers.

You can perfectly pass local paths to any transformers from_pretrained-based scripts.

As for disk space, if you really need to use HF's cache system: with Heretic you are supposed to abliterate the model once and then forget about it for good. It is not much of a sacrifice to download the model twice.

Disk space is MUCH more available than RAM, with its prices gone insane.

@spikymoth
Contributor

FWIW, a Strix Halo-based system (~$2000 before recent price hikes) can also have up to 128GB of (semi-unified) RAM, and can't be upgraded. While such a system is not as powerful (and would probably take a long time on models that require 4-bit quantization to be loaded into RAM), I think this feature would still be quite nice to have.

I do downloads with hf download --local-dir which still works fine with XET as far as I can tell, but doesn't create the opaque cache structure (it creates a directory with the raw files and a .cache directory for metadata).

@p-e-w
Owner

p-e-w commented Feb 17, 2026

I think the right way forward here is just to make LoRA export work correctly. There are too many externalities to handle, and hacks like maintaining our own cache not only add tremendous complexity, they also risk surprising the user in unpleasant ways when we eat up their disk space.

If we offer to export a LoRA as an alternative to loading the full model into system RAM, the user can do what they want with it, including performing the merge process of their choice using other tools.

@kabachuha
Author

@p-e-w Any updates on the LoRA situation?

@p-e-w
Owner

p-e-w commented Feb 22, 2026

@kabachuha Unfortunately I don't have time to work on that at the moment. It will take at least a few weeks for me to get to this task.

@kabachuha
Author

@p-e-w FYI, Unsloth can do local Xet-HF / local-dir / remote Hub resolution, copying, and sharded LoRA merging out of the box. We could include it as an optional dependency for low-RAM systems.

@p-e-w
Owner

p-e-w commented Mar 3, 2026

Can it directly upload to HF, without making a full copy of the model locally on disk?

@kabachuha
Author

I don't think so. It would require a lot of hacking: copy a file, apply the LoRA, push the file to the Hub, in a loop.

Why do you care about disk space so much? It is much cheaper than RAM / renting RAM. Heretic could add a disclaimer before the start, just like the existing disclaimer about the RAM spike when saving (btw, that disclaimer comes too late: the user can abliterate the model and only then be shocked that they cannot save it).



Development

Successfully merging this pull request may close these issues.

[Feature request] Sharded LoRA merge

4 participants