fix: extend LoRA merge path to pre-quantized models and fix evaluator div-by-zero #151
cpagac wants to merge 1 commit into p-e-w:master from
Conversation
Summary of Changes

Hello @cpagac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request extends the application to support FP8/NVFP4 pre-quantized models, introducing the configuration, model-loading adjustments, and merge-strategy handling needed to work with them. It also fixes a division-by-zero bug in the evaluation logic, improving the robustness of the system.
Activity
Code Review
This pull request effectively introduces support for FP8/NVFP4 quantized models and adds an optional dependency for it. The changes are logical and well-integrated. The PR also includes a small but important fix for a potential division-by-zero error in the evaluator.
While the repository's style guide suggests that pull requests should contain only a single semantic change (rule #9), the included bug fix is minor and improves the codebase. For future contributions, it would be ideal to separate features and bug fixes into distinct pull requests to adhere to the guidelines.
I've added a couple of suggestions in src/heretic/model.py to refactor some duplicated code, which will improve the maintainability of the model loading logic.
Force-pushed from 5803c07 to 957a31c
What about mxfp4 for older hardware?
The implementation here is actually format-agnostic: the "fp8" dtype path simply uses torch_dtype=torch.bfloat16 and lets HuggingFace auto-detect the model's built-in quantization_config. So if an MXFP4 model is published on HF with the appropriate config, this same loading path should handle it without changes. The naming is admittedly NVFP4-centric since that's what was tested against, but the mechanism itself isn't tied to any specific sub-format. Open to renaming the token to something more general to better convey this, if that makes sense.
Force-pushed from da75297 to 3925d5e
Please explain in a bit more detail what exactly is going on here. My understanding is that this loads models with their built-in quantization, just as is already supported for MXFP4. If so, what do we need the extra
Fair point. Looking into this more and testing it, dtype="auto" already handles FP8 pre-quantized models correctly: HuggingFace auto-detects the quantization_config from the model's config.json the same way it does for MXFP4. (I also confirmed this by loading nvidia/Llama-3.1-8B-Instruct-FP8 with just dtype="auto" and no special handling; it loads and generates fine.)

My original thinking was that FP8 needed an explicit opt-in, as BNB_4BIT does, where the user tells heretic to quantize and heretic applies it at load time, so I created an "fp8" dtype token following the same pattern. I was originally coding this fork for Nemotron support and ran into an issue on nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8.

I can rework the PR to remove the "fp8" token, the torch_dtype workaround, and the fp-quant dependency. The one thing I think is still relevant is get_merged_model(). Right now, it only has a special CPU reload path for BNB_4BIT models, since quantized weights can't have LoRA adapters merged into them directly. Pre-quantized models (FP8, MXFP4, etc.) share the same limitation: you would need to reload the base model in full precision on the CPU, apply the LoRA weights, and then merge. Without that, I think the merge step would either fail or produce a corrupt model.
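To see why the CPU reload path exists, here is a toy sketch (illustrative names, not heretic's actual code, and plain 1-D lists instead of real weight matrices): a LoRA merge computes W' = W + alpha * BA on full-precision weights, so a quantized base must first be "reloaded" (dequantized) before the adapter can be folded in.

```python
def dequantize(w_q, scale):
    # Stand-in for reloading the base model in full precision on the CPU;
    # here, trivially scaling toy int weights back to floats.
    return [q * scale for q in w_q]

def merge_lora(base_w, lora_a, lora_b, alpha=1.0):
    # W' = W + alpha * (B * A), shown element-wise instead of a matmul
    # for brevity. Merging into still-quantized weights would lose the
    # low-rank update (or fail outright), which is the limitation above.
    return [w + alpha * a * b for w, a, b in zip(base_w, lora_a, lora_b)]

base = dequantize([10, 20], scale=0.1)            # toy full-precision weights
merged = merge_lora(base, [0.1, 0.1], [1.0, 1.0])
```

This is only meant to show the order of operations (dequantize first, then merge); the real path uses transformers and peft.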
Ok. Please trim this PR down to the necessary stuff, then ping me for another review.
Done. Removed the "fp8" dtype token, QuantizationMethod.FP8, the fp-quant dependency, the torch_dtype workaround, and the kwargs dict pattern. Two things are left: the evaluator div-by-zero fix, and an extension to get_merged_model() and obtain_merge_strategy() that detects pre-quantized models (FP8, MXFP4, GPTQ, etc.) via model.config.quantization_config rather than just checking for BNB_4BIT. Without that, the CPU reload path for LoRA merging wouldn't trigger for pre-quantized models, which have the same limitation. Ready for re-review @p-e-w
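The detection itself is a one-line attribute check. A minimal sketch, using dummy objects in place of real transformers model configs (the `quantization_config` attribute name matches what HuggingFace sets; everything else here is illustrative):

```python
from types import SimpleNamespace

def is_pre_quantized(config) -> bool:
    # HuggingFace stores the quantization config (BitsAndBytesConfig,
    # FP8, MXFP4, GPTQ, AWQ, ...) on the model config when it loads a
    # quantized model, so a single attribute check covers all formats.
    return getattr(config, "quantization_config", None) is not None

# Dummy configs standing in for transformers model configs:
quantized = SimpleNamespace(quantization_config={"quant_method": "fp8"})
full_precision = SimpleNamespace(model_type="llama")
```

The upside over checking `settings.quantization == QuantizationMethod.BNB_4BIT` is that it keys off what the loaded model actually is, not what the user asked for.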
| """ | ||
|
|
||
| if settings.quantization == QuantizationMethod.BNB_4BIT: | ||
| is_quantized = getattr(model.model.config, "quantization_config", None) is not None |
Does this work for models quantized on-the-fly by Heretic, i.e., with bitsandbytes?
Yes — verified with both models below. The model.config.quantization_config check correctly detects BNB models since HuggingFace stores the BitsAndBytesConfig there on load.
Which models have you tested this with? Can you link to Hugging Face uploads made with this PR?
Any update?
Yes, sorry. I'm currently a student, so time can be tight; I've been stretched thin with homework, but I'm working on testing with a few more models and should be done by this weekend.
…d models

- Fix division-by-zero in evaluator when base_refusals is 0: return refusals rather than 0, so ablation that introduces new refusals is penalized correctly
- Extend get_merged_model() and obtain_merge_strategy() to detect any pre-quantized model via model.config.quantization_config, not just BNB_4BIT; covers FP8, MXFP4, GPTQ, AWQ, and any format HuggingFace auto-detects from the model's config.json
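The evaluator fix can be sketched as follows (hypothetical function name and signature, not heretic's exact code):

```python
def refusal_ratio(refusals: int, base_refusals: int) -> float:
    # When the base model never refused, dividing would raise
    # ZeroDivisionError. Returning the raw count (rather than 0)
    # still penalizes ablations that introduce new refusals.
    if base_refusals == 0:
        return float(refusals)
    return refusals / base_refusals
```

Returning 0 in the zero-base case would have silently rewarded ablations that add refusals, which is the behavioral bug this guards against.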
Force-pushed from ecc391e to 45053b5
First off, thank you for your patience, and apologies for the delay. Tested with Qwen/Qwen2-0.5B-Instruct --quantization bnb_4bit and mistralai/Mistral-7B-Instruct-v0.3 --quantization bnb_4bit; the CPU reload warning showed on both, and the merges completed successfully.
Worth noting: AWQ and GPTQ models fail earlier, in heretic's abliteration step (the .weight attribute is missing on quantized layers), so we can't reach the merge step to test those end-to-end. That's a pre-existing limitation, not something this PR introduced or claims to fix. To summarize: this PR fixes the evaluator div-by-zero (verified with a unit test), and extends the merge path detection to cover any model HuggingFace auto-detects as quantized via model.config.quantization_config, not just BNB_4BIT.
So which models can now be processed that couldn't be before?
Honestly, none that I can demonstrate end-to-end today. My thought was that since BNB_4BIT is still a quantized format going through the same merge path, it would serve as a valid proxy: the CPU reload logic is the same regardless of what triggered it. But in hindsight that misses the point; BNB was already handled before this PR, so that test doesn't prove anything new.

In theory, any pre-quantized model where HuggingFace auto-sets quantization_config (FP8, MXFP4) would now correctly hit the CPU reload path instead of skipping it. The logic is sound, but I can't prove it end-to-end, because those models fail before reaching the merge step: either missing kernel support at load time (FP8) or the .weight attribute issue in the abliteration step (AWQ/GPTQ).

That said, I don't think it's permanently theoretical. Once the abliteration step supports quantized layers (right now it assumes .weight on every linear layer, which breaks AWQ/GPTQ/FP8) and kernel availability for FP8/MXFP4 improves, both of which feel like natural next steps for this project, my guess is that these models would flow all the way through and this detection would actually matter. For now, though, it remains theoretical.

From what I see there are three options going forward (correct me if I'm wrong):
Given the extreme complexity of the ecosystem (hardware, model architectures), I prefer to only merge changes that actually allow us to do something we couldn't do before. Comprehensively testing such changes is very time-consuming for me and so far, I have always managed to discover unforeseen problems afterwards. I think (1) is the best way forward. The div-by-zero fix is sound and simple, let's just merge that from a clean PR and keep this one for when it becomes relevant.
Summary

- Without the extended detection, the CPU reload path for LoRA merging wouldn't trigger for pre-quantized models, which have the same limitation

Test plan