fix: extend LoRA merge path to pre-quantized models and fix evaluator div-by-zero #151
cpagac wants to merge 1 commit into p-e-w:master from
Conversation
Summary of Changes

Hello @cpagac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request extends the application to support FP8/NVFP4 pre-quantized models, introducing the configuration, model-loading adjustments, and merge-strategy handling needed to work with them. It also fixes a division-by-zero bug in the evaluation logic, improving the robustness of the system.
Activity
Code Review
This pull request effectively introduces support for FP8/NVFP4 quantized models and adds an optional dependency for it. The changes are logical and well-integrated. The PR also includes a small but important fix for a potential division-by-zero error in the evaluator.
While the repository's style guide suggests that pull requests should contain only a single semantic change (rule #9), the included bug fix is minor and improves the codebase. For future contributions, it would be ideal to separate features and bug fixes into distinct pull requests to adhere to the guidelines.
I've added a couple of suggestions in src/heretic/model.py to refactor some duplicated code, which will improve the maintainability of the model loading logic.
Force-pushed from 5803c07 to 957a31c
What about mxfp4 for older hardware?
The implementation here is actually format-agnostic: the "fp8" dtype path simply uses torch_dtype=torch.bfloat16 and lets HuggingFace auto-detect the model's built-in quantization_config. So if an MXFP4 model is published on HF with the appropriate config, this same loading path should handle it without changes. The naming is admittedly NVFP4-centric since that's what was tested against, but the mechanism itself isn't tied to any specific sub-format. Open to renaming the token to something more general to better convey this, if that makes sense.
Force-pushed from da75297 to 3925d5e
Please explain in a bit more detail what exactly is going on here. My understanding is that this loads models with their built-in quantization, just as is already supported for MXFP4. If so, what do we need the extra
Fair point. Looking into this more and testing it, dtype="auto" already handles FP8 pre-quantized models correctly: HuggingFace auto-detects the quantization_config from the model's config.json the same way it does for MXFP4. (I also confirmed this by loading nvidia/Llama-3.1-8B-Instruct-FP8 with just dtype="auto" and no special handling; it loads and generates fine.)

My original thinking was that FP8 needed an explicit opt-in, as BNB_4BIT does, where the user tells heretic to quantize and heretic applies it at load time, so I created an "fp8" dtype token following the same pattern. I was originally coding this fork for Nemotron support and ran into an issue on nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8.

I can rework the PR to remove the "fp8" token, the torch_dtype workaround, and the fp-quant dependency. The one thing I think is still relevant is get_merged_model(). Right now, it only has a special CPU reload path for BNB_4BIT models, since quantized weights can't have LoRA adapters merged into them directly. Pre-quantized models (FP8, MXFP4, etc.) share the same limitation: you would need to reload the base model in full precision on the CPU, apply the LoRA weights, and then merge. Without that, I think the merge step would either fail or produce a corrupt model.
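To see why the CPU reload path exists, here is a toy sketch (illustrative names, not heretic's actual code, and plain 1-D lists instead of real weight matrices): a LoRA merge computes W' = W + alpha * BA on full-precision weights, so a quantized base must first be "reloaded" (dequantized) before the adapter can be folded in.

```python
def dequantize(w_q, scale):
    # Stand-in for reloading the base model in full precision on the CPU;
    # here, trivially scaling toy int weights back to floats.
    return [q * scale for q in w_q]

def merge_lora(base_w, lora_a, lora_b, alpha=1.0):
    # W' = W + alpha * (B * A), shown element-wise instead of a matmul
    # for brevity. Merging into still-quantized weights would lose the
    # low-rank update (or fail outright), which is the limitation above.
    return [w + alpha * a * b for w, a, b in zip(base_w, lora_a, lora_b)]

base = dequantize([10, 20], scale=0.1)            # toy full-precision weights
merged = merge_lora(base, [0.1, 0.1], [1.0, 1.0])
```

This is only meant to show the order of operations (dequantize first, then merge); the real path uses transformers and peft.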
Ok. Please trim this PR down to the necessary stuff, then ping me for another review.
Done. Removed the "fp8" dtype token, QuantizationMethod.FP8, the fp-quant dependency, the torch_dtype workaround, and the kwargs dict pattern. Two things are left: the evaluator div-by-zero fix, and an extension to get_merged_model() and obtain_merge_strategy() that detects pre-quantized models (FP8, MXFP4, GPTQ, etc.) via model.config.quantization_config rather than just checking for BNB_4BIT. Without that, the CPU reload path for LoRA merging wouldn't trigger for pre-quantized models, which have the same limitation. Ready for re-review @p-e-w
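The detection itself is a one-line attribute check. A minimal sketch, using dummy objects in place of real transformers model configs (the `quantization_config` attribute name matches what HuggingFace sets; everything else here is illustrative):

```python
from types import SimpleNamespace

def is_pre_quantized(config) -> bool:
    # HuggingFace stores the quantization config (BitsAndBytesConfig,
    # FP8, MXFP4, GPTQ, AWQ, ...) on the model config when it loads a
    # quantized model, so a single attribute check covers all formats.
    return getattr(config, "quantization_config", None) is not None

# Dummy configs standing in for transformers model configs:
quantized = SimpleNamespace(quantization_config={"quant_method": "fp8"})
full_precision = SimpleNamespace(model_type="llama")
```

The upside over checking `settings.quantization == QuantizationMethod.BNB_4BIT` is that it keys off what the loaded model actually is, not what the user asked for.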
| """ | ||
|
|
||
| if settings.quantization == QuantizationMethod.BNB_4BIT: | ||
| is_quantized = getattr(model.model.config, "quantization_config", None) is not None |
Does this work for models quantized on-the-fly by Heretic, i.e., with bitsandbytes?
Yes — verified with both models below. The model.config.quantization_config check correctly detects BNB models since HuggingFace stores the BitsAndBytesConfig there on load.
Which models have you tested this with? Can you link to Hugging Face uploads made with this PR?
Any update?
Yes, sorry. I'm currently a student, so time can be tight; I've been stretched thin with homework, but I'm working on testing with a few more models and should be done by this weekend.
…d models

- Fix division-by-zero in evaluator when base_refusals is 0: return refusals rather than 0, so ablation that introduces new refusals is penalized correctly
- Extend get_merged_model() and obtain_merge_strategy() to detect any pre-quantized model via model.config.quantization_config, not just BNB_4BIT; covers FP8, MXFP4, GPTQ, AWQ, and any format HuggingFace auto-detects from the model's config.json
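The evaluator fix can be sketched as follows (hypothetical function name and signature, not heretic's exact code):

```python
def refusal_ratio(refusals: int, base_refusals: int) -> float:
    # When the base model never refused, dividing would raise
    # ZeroDivisionError. Returning the raw count (rather than 0)
    # still penalizes ablations that introduce new refusals.
    if base_refusals == 0:
        return float(refusals)
    return refusals / base_refusals
```

Returning 0 in the zero-base case would have silently rewarded ablations that add refusals, which is the behavioral bug this guards against.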
Force-pushed from ecc391e to 45053b5
First off, thank you for your patience, and apologies for the delay. Tested with Qwen/Qwen2-0.5B-Instruct --quantization bnb_4bit and mistralai/Mistral-7B-Instruct-v0.3 --quantization bnb_4bit; the CPU reload warning showed on both, and the merges completed successfully.
Worth noting: AWQ and GPTQ models fail earlier, in heretic's abliteration step (the .weight attribute is missing on quantized layers), so we can't reach the merge step to test those end-to-end. That's a pre-existing limitation, not something this PR introduced or claims to fix. To summarize: this PR fixes the evaluator div-by-zero (verified with a unit test), and extends the merge path detection to cover any model HuggingFace auto-detects as quantized via model.config.quantization_config, not just BNB_4BIT.
So which models can now be processed that couldn't be before?
Honestly, none that I can demonstrate end-to-end today. My thought was that since BNB_4BIT is still a quantized format going through the same merge path, it would serve as a valid proxy: the CPU reload logic is the same regardless of what triggered it. But in hindsight that misses the point; BNB was already handled before this PR, so that test doesn't prove anything new.

In theory, any pre-quantized model where HuggingFace auto-sets quantization_config (FP8, MXFP4) would now correctly hit the CPU reload path instead of skipping it. The logic is sound, but I can't prove it end-to-end, because those models fail before reaching the merge step: either missing kernel support at load time (FP8) or the .weight attribute issue in the abliteration step (AWQ/GPTQ).

That said, I don't think it's permanently theoretical. Once the abliteration step supports quantized layers (right now it assumes .weight on every linear layer, which breaks AWQ/GPTQ/FP8) and kernel availability for FP8/MXFP4 improves, both of which feel like natural next steps for this project, my guess is that these models would flow all the way through and this detection would actually matter. For now, though, it remains theoretical.

From what I see there are three options going forward (correct me if I'm wrong):
Given the extreme complexity of the ecosystem (hardware, model architectures), I prefer to only merge changes that actually allow us to do something we couldn't do before. Comprehensively testing such changes is very time-consuming for me and so far, I have always managed to discover unforeseen problems afterwards. I think (1) is the best way forward. The div-by-zero fix is sound and simple, let's just merge that from a clean PR and keep this one for when it becomes relevant.
Summary

- Without the extended detection, the CPU reload path for LoRA merging wouldn't trigger for pre-quantized models, which have the same limitation

Test plan