The current kernel is still fairly slow compared to the theoretical optimum, considering the small memory footprint of the weight deltas. For now it serves more as a proof of concept (e.g., it already outperforms naive simultaneous inference). With further optimization, an additional 4-8x latency improvement should be achievable.
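To illustrate why delta-aware inference can beat naive simultaneous inference, here is a minimal NumPy sketch. It assumes low-rank per-tenant weight deltas (all shapes, sizes, and variable names here are hypothetical, not taken from the actual kernel): the expensive base matmul is shared across tenants, and only the tiny delta terms are computed per tenant.

```python
import numpy as np

# Hypothetical shapes: one shared base weight plus a small per-tenant delta.
d_out, d_in, n_tenants, r = 64, 32, 4, 2
rng = np.random.default_rng(0)

W_base = rng.standard_normal((d_out, d_in))
# Low-rank per-tenant deltas (rank r) -- the part with the small memory footprint.
A = rng.standard_normal((n_tenants, d_out, r))
B = rng.standard_normal((n_tenants, r, d_in))
x = rng.standard_normal((n_tenants, d_in))

# Naive simultaneous inference: materialize each tenant's full weight matrix.
y_naive = np.stack([(W_base + A[i] @ B[i]) @ x[i] for i in range(n_tenants)])

# Delta-aware computation: one shared base matmul plus cheap per-tenant terms.
y_base = x @ W_base.T                            # (n_tenants, d_out), shared work
y_delta = np.einsum('nor,nri,ni->no', A, B, x)   # per-tenant low-rank correction
y_fast = y_base + y_delta

assert np.allclose(y_naive, y_fast)
```

The two paths produce identical outputs, but the second avoids materializing a full weight matrix per tenant, which is where the latency headroom comes from.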
I don't have much kernel optimization experience yet, though; if anyone in the OSS community is interested, I'd love some help!
Afterwards, it'd be super interesting to run some benchmarks against LoRA-based multi-tenant serving systems like Punica and S-LoRA.