Conversation
Summary of Changes

Hello @kabachuha, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the capability to merge LoRA adapters into large language models by implementing a sharded merging approach. This new method circumvents the high memory demands of traditional merging, which often requires dequantizing the entire model into RAM, making it feasible to merge adapters into models that exceed system memory capacity. The change introduces a new merge strategy that processes model weights in smaller, manageable chunks, drastically reducing the memory footprint and preventing system instability.
Code Review
This pull request introduces a sharded LoRA merge capability to reduce memory usage. The implementation is mostly correct, but there are several critical issues in src/heretic/model.py that need to be addressed. These include an incorrect function signature violating the style guide, a potential crash due to a None value, and a faulty shape assertion that will cause merging to fail. I've also pointed out areas for improvement in code clarity, such as refactoring confusing logic and moving an import out of a loop. Addressing these points will make the new feature more robust and maintainable.
assert lora_A.shape[0] == tensor.shape[0], \
    f"Shape mismatch for {key}: tensor {tensor.shape}, lora_A {lora_A.shape}"
The shape verification for the LoRA weights is incorrect. The assertion lora_A.shape[0] == tensor.shape[0] compares the LoRA rank (r) with the base weight's output features, which will likely fail.
The correct checks should verify that the output dimensions of lora_B and input dimensions of lora_A match the base tensor's dimensions.
-assert lora_A.shape[0] == tensor.shape[0], \
-    f"Shape mismatch for {key}: tensor {tensor.shape}, lora_A {lora_A.shape}"
+# Verify shapes. W is (out, in), A is (r, in), B is (out, r).
+assert lora_B.shape[0] == tensor.shape[0] and lora_A.shape[1] == tensor.shape[1], \
+    f"Shape mismatch for {key}: W={tensor.shape}, A={lora_A.shape}, B={lora_B.shape}"
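The suggested check follows directly from the merge arithmetic: with W of shape (out, in), A of shape (r, in), and B of shape (out, r), the merged weight is W + scale * (B @ A). A minimal sketch of why the original assertion fails while the corrected one holds (NumPy for illustration; dimensions are made up):

```python
import numpy as np

out_features, in_features, r = 8, 4, 2
W = np.zeros((out_features, in_features))    # base weight, shape (out, in)
lora_A = np.random.randn(r, in_features)     # shape (r, in)
lora_B = np.random.randn(out_features, r)    # shape (out, r)

# Corrected check: match B's output dim and A's input dim against W.
assert lora_B.shape[0] == W.shape[0] and lora_A.shape[1] == W.shape[1]

scale = 2.0 / r                              # lora_alpha / r
merged = W + scale * (lora_B @ lora_A)       # delta has the same shape as W
assert merged.shape == W.shape

# The original check compares r (= 2) against out_features (= 8), so it fails:
assert lora_A.shape[0] != W.shape[0]
```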
If the LoRA is computed correctly, this should be fine as well.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
print("[yellow]This can lead to system freezes if you run out of memory.[/]")
print(
    "[yellow] However, you can choose to save the model shard-by-shard. It is slightly slower, but requires <10 GB of RAM for most models.[/]"
)
I don't think we need this extra complexity. Always using sharded merging should be fine.
# 1. resolve the local path
if not os.path.exists(model_path):
    print("Model path not found locally, attempting to locate in HF cache.")
    hf_hub_location = os.environ.get(
        "HF_HUB_CACHE",
        os.path.join(os.environ.get("HOME", ""), ".cache/huggingface/hub"),
    )
This could be dicey. I think that if Xet is used (which HF is aggressively migrating towards), caches work quite differently and aren't simply folders containing the model repository. It's object storage, similar to Git.
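For context, the hub cache doesn't store a repo as a plain model folder: it uses a `models--{org}--{name}` directory with `refs/` and `snapshots/<commit>` underneath, and with Xet the blob storage below that differs again. A sketch of just the top-level naming convention (path layout only; real resolution should go through `huggingface_hub`):

```python
import os

def hub_cache_repo_dir(cache_root: str, repo_id: str) -> str:
    """Build the top-level cache folder name the HF hub uses for a model repo.

    Only the naming convention is modeled here; actual snapshot resolution
    (refs/ and snapshots/<commit>) is more involved and backend-dependent.
    """
    return os.path.join(cache_root, "models--" + repo_id.replace("/", "--"))

# On POSIX this prints: ~/.cache/huggingface/hub/models--meta-llama--Llama-3-70B
print(hub_cache_repo_dir("~/.cache/huggingface/hub", "meta-llama/Llama-3-70B"))
```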
@p-e-w In this case, heretic can do "from_pretrained" on the model and then "save_pretrained" into a temp folder, on which the sharded ablation is performed. Sharded modification is just too good not to have.
Of course, a warning similar to the current RAM warning should be displayed if the model wasn't saved to a local folder beforehand (and a hint should be added to the documentation that local saving is preferred, and maybe it should be enabled by default).
Xet is my main concern. I'm not on my workstation right now to verify this, but IIRC, with Xet, downloaded models are not stored in a plain folder that simply contains the safetensor files.
Yes, I just looked into the .cache save folders and their structure is really complicated. So, I suggest this variant: from now on, by default, heretic saves the model into a local dir via … For backwards compatibility, the model is copied or moved via …
No, that doesn't fly. The user may already have the model cached locally (the whole purpose of the cache is for multiple applications to share the same model data). We can't just download the whole thing again, wasting I don't know how many gigabytes of disk space.

There is also the maintainability perspective. Managing the download ourselves is a hack because it bypasses Transformers' built-in download system, and the sharded merge itself is also a hack because it bypasses Transformers' built-in export system. You can't build a reliable piece of software by layering hacks on top of hacks.

This is obviously unfortunate. Sharded merge is cool. But it's not cool enough to turn the core of our system into a brittle mess. If this cannot be made to work directly with Xet, I'm afraid the approach will have to be abandoned. 😞
Maybe we can still have it, but with obvious warnings and documentation? For example, many prebuilt server configurations have no more than 128 GB of RAM for abliterating and saving even 70B models (which require 141+ GB of RAM). And there is currently no warning about RAM usage before the start of the abliteration process, which can take low-RAM users by surprise after they have already run trials.
Take this configuration: 1x H100, 128 GB RAM. This way you can abliterate even a 100B+ model in 4-bit precision, but there is absolutely no way of saving it unless you rent a ton of additional RAM.

Or a more amateur setup: 2x 4090 / 5090 for the numerous LLaMA 70B fine-tunes.
The perfect approach would be, of course, to fix the LoRA system, since with a LoRA the user can decide what to do with it themselves: save it with peft, sharded or not, manually, or simply convert it with llama.cpp with no hassle and use it with compact Unsloth quants, without quantizing the model again themselves (and losing the advantage of dynamic quants).

Is it possible for you to look into the LoRA export system again?
That's an extremely uncommon configuration. Pairing a $20,000 GPU with $300 worth of RAM (maybe a little more expensive currently) makes very little sense. A typical research workstation with an H100 would have 512 GB of RAM or even more.

Maintaining the quality and usability of an already complex program like Heretic requires keeping complexity in check. As far as I can tell, having the VRAM to abliterate a model in Q4 but not the RAM to merge it in BF16 (that is, RAM < ~4 * VRAM), and no alternative setup, should be a very unusual situation. We cannot introduce that degree of complexity at such a fundamental level for such a corner case.
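The ~4x factor comes from bytes per parameter: roughly 0.5 B/param at 4-bit versus 2 B/param in BF16. A back-of-the-envelope sketch (weights only; quantization overhead and activations ignored):

```python
def model_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GB: 1e9 params cancels 1e9 bytes per GB."""
    return params_billions * bytes_per_param

params = 70  # a 70B model
print(f"Q4 weights (VRAM):  ~{model_gb(params, 0.5):.0f} GB")
print(f"BF16 weights (RAM): ~{model_gb(params, 2.0):.0f} GB")
# BF16 / Q4 = 4, hence the "RAM < ~4 * VRAM" tight spot.
```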
As far as I can tell, we are doing everything correctly, and this is likely a bug in PEFT, or a problem with the interaction of Transformers and PEFT (perhaps a version mismatch). This is especially likely considering that saving the LoRA locally appears to work while the HF upload fails.
This is a rentable configuration on one of the clouds I use. Mind you, where I live (Russia), H100 / H200 availability is extremely low and even 3x H100 servers are all rented out for an unknown amount of time, so hobbyists don't have much choice when renting.

I think it is fixable and can be reported to the huggingface repo in an issue, or fixed via a pull request; one simply needs to look into it.

The VRAM / RAM availability ratio is getting skewed recently because of the RAM market collapse, and this is quite important for people quantizing ~70B models on home setups.
I'm sorry, but can this function be merged at least as an option for manually downloaded files? There are people who prefer to run apps in isolated environments and use pre-downloaded shards, so the HF cache isn't even used.
Please understand that I would love to have this feature. But a custom download system with major caveats regarding disk space, plus a custom export system, both of which replace systems that are already built into Transformers, is simply too high a price to pay IMO. Doing things like that is a superhighway towards a program that has weird bugs that nobody can understand, and that breaks in unforeseeable ways when Transformers changes something.

This is ultimately a feature that belongs upstream. Upstream doesn't have it yet, but that doesn't mean Heretic is the right place to put it.
PEFT is probably not the place either. It's all about playing with LoRAs at runtime, maybe with a single save and load, not about handling the saved model's .safetensors files. Shard handling isn't related to PEFT's runtime LoRA architecture, which is why code to handle the shards is needed here. On the contrary, precisely because it doesn't really belong in the PEFT or transformers codebase, it has to be implemented here.
The safetensors format is fixed and simply contains the weights corresponding to the model's layers. The PEFT LoRA adapter formed in heretic targets these modules, with lora A and lora B linked to the model's layers. The iteration logic outlined in this pull request goes over the model's parameters shard by shard, and the linked A and B matrices are found in the dictionary. The chance of PEFT or transformers changing the safetensors storage is zero; otherwise it would break all tools built around transformers, including llama.cpp, and the Huggingface admins definitely don't want that.

The carryover of cached Xet transformer files can be circumvented completely if we use from_pretrained(..., cache_dir="...") or hf_hub_download, enabled by default into a created …
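The peft↔transformers naming correspondence in question is mechanical: adapter keys prefix the base key with `base_model.model.` and replace the final `.weight` with `.lora_A.weight` / `.lora_B.weight`. A sketch of the reverse mapping, assuming PEFT's default naming scheme (some PEFT versions insert an adapter-name segment, so verify against your adapter's actual state dict):

```python
def base_key_from_lora_key(lora_key: str):
    """Map a PEFT LoRA adapter key back to the base model weight key.

    Assumes the default PEFT prefix "base_model.model."; returns None
    for keys that are not LoRA A/B matrices.
    """
    prefix = "base_model.model."
    if lora_key.startswith(prefix):
        lora_key = lora_key[len(prefix):]
    for marker in (".lora_A.weight", ".lora_B.weight"):
        if lora_key.endswith(marker):
            return lora_key[: -len(marker)] + ".weight"
    return None

print(base_key_from_lora_key(
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight"
))  # model.layers.0.self_attn.q_proj.weight
```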
Where else? This is a standard use case and not at all specific to Heretic. Training a LoRA and merging it into the base model to create a merged model is completely generic functionality. It's not even specific to abliteration. It either belongs in Transformers, or in PEFT, or in a dedicated library.
Yes, but this re-downloads/duplicates models that are already in the user's HF cache, breaking a key assumption about how the HF ecosystem is supposed to work. It also fails in the other direction: models downloaded with Heretic aren't available to other programs using Transformers.

Again: I wish it wasn't so. But the ecosystem just isn't ready yet to implement this feature cleanly.
This is literally a single 84-line function, with solidified logic, depending on arguably unbreakable peft↔transformers naming correspondences.
You can perfectly well pass local paths to any transformers from_pretrained-based script. As for the disk space, if you really need to use HF's cache system: with heretic you are supposed to abliterate the model once and forget about it for good. It is not much of a sacrifice to download the model twice. Disk space is MUCH more available than RAM, whose prices have gone insane.
FWIW, a Strix Halo-based system (~$2000 before recent price hikes) can also have up to 128 GB of (semi-unified) RAM, and can't be upgraded. While such a system is not as powerful (and would probably take a long time on models that require 4-bit quantization to be loaded into RAM), I think this feature would still be quite nice to have. I do downloads with …
I think the right way forward here is just to make LoRA export work correctly. There are too many externalities to handle, and hacks like maintaining our own cache not only add tremendous complexity, they also risk surprising the user in unpleasant ways when we eat up their disk space. If we offer to export a LoRA as an alternative to loading the full model into system RAM, the user can do what they want with it, including performing the merge process of their choice using other tools.
@p-e-w Any updates on the LoRA situation?
@kabachuha Unfortunately I don't have time to work on that at the moment. It will take at least a few weeks for me to get to this task.
@p-e-w FYI, unsloth can do local Xet-HF / local-dir / remote hub resolution, copying, and sharded LoRA merging out of the box. We can include it as an optional dependency for low-RAM systems.
Can it directly upload to HF, without making a full copy of the model locally on disk?
I don't think so. It would require a lot of hacking: copy a file, apply the LoRA, push the file to the hub, in a loop.

Why do you care about disk space so much? It is much cheaper than RAM / renting RAM. Heretic can add a disclaimer before the start, just like there is a disclaimer about the RAM spike when saving (btw, that disclaimer comes too late: the user can abliterate the model and only then be shocked that they cannot save it).
A vanilla LoRA peft merge requires the full dequantized model to reside in RAM, which is disastrous for very big models that weigh much more than the available RAM (they can be abliterated in 4-bit thanks to BNB, but saving them is problematic).

The PR finds the actual location of the HF model and goes through its shards, .safetensors by .safetensors. For each tensor, if it is targeted by the LoRA, the corresponding LoRA key is mapped and the weight is modified by adding the matrix product.

In the end, the script checks whether all the LoRA keys have been applied, for integrity.
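The shard-by-shard iteration and the final integrity check described above can be sketched as follows. Shards are modeled as plain in-memory dicts here (the real code reads and writes `.safetensors` files); key names and the `scale` value are illustrative:

```python
import numpy as np

def merge_lora_sharded(shards, lora_weights, scale):
    """Merge LoRA deltas into base weights one shard at a time.

    shards: list of {key: ndarray} dicts, as loaded from .safetensors shards.
    lora_weights: {base_key: (A, B)}, A of shape (r, in), B of shape (out, r).
    Only one shard's tensors are touched at a time, bounding peak memory.
    """
    consumed = set()
    for shard in shards:
        for key, tensor in shard.items():
            if key in lora_weights:
                A, B = lora_weights[key]
                # W is (out, in): match B's output dim and A's input dim.
                assert B.shape[0] == tensor.shape[0] and A.shape[1] == tensor.shape[1]
                shard[key] = tensor + scale * (B @ A)
                consumed.add(key)
        # in the real implementation, the modified shard is written back to disk here
    # integrity check: every LoRA key must have found its base tensor
    leftover = set(lora_weights) - consumed
    assert not leftover, f"LoRA keys without a matching base weight: {leftover}"
    return shards

# tiny demo: one 4x3 weight targeted by a rank-2 adapter
W = np.zeros((4, 3))
A = np.ones((2, 3))
B = np.ones((4, 2))
merged = merge_lora_sharded([{"layer.weight": W}], {"layer.weight": (A, B)}, scale=0.5)
print(merged[0]["layer.weight"][0, 0])  # 0.5 * (B @ A)[0, 0] = 0.5 * 2 = 1.0
```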
Closes #156.
P.S. Also hope you will return the LoRA export functionality eventually 😬