The current kernel is still fairly slow compared to the theoretical optimum, considering the small memory footprint of the weight deltas. For now it serves more as a proof of concept (e.g., it already outperforms naive simultaneous inference). With further optimization, an additional 4-8x latency improvement should be achievable.
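To illustrate why delta-aware inference can beat naive simultaneous inference, here is a minimal NumPy sketch. It assumes low-rank per-tenant weight deltas (all shapes, sizes, and variable names here are hypothetical, not taken from the actual kernel): the expensive base matmul is shared across tenants, and only the tiny delta terms are computed per tenant.

```python
import numpy as np

# Hypothetical shapes: one shared base weight plus a small per-tenant delta.
d_out, d_in, n_tenants, r = 64, 32, 4, 2
rng = np.random.default_rng(0)

W_base = rng.standard_normal((d_out, d_in))
# Low-rank per-tenant deltas (rank r) -- the part with the small memory footprint.
A = rng.standard_normal((n_tenants, d_out, r))
B = rng.standard_normal((n_tenants, r, d_in))
x = rng.standard_normal((n_tenants, d_in))

# Naive simultaneous inference: materialize each tenant's full weight matrix.
y_naive = np.stack([(W_base + A[i] @ B[i]) @ x[i] for i in range(n_tenants)])

# Delta-aware computation: one shared base matmul plus cheap per-tenant terms.
y_base = x @ W_base.T                            # (n_tenants, d_out), shared work
y_delta = np.einsum('nor,nri,ni->no', A, B, x)   # per-tenant low-rank correction
y_fast = y_base + y_delta

assert np.allclose(y_naive, y_fast)
```

The two paths produce identical outputs, but the second avoids materializing a full weight matrix per tenant, which is where the latency headroom comes from.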
I don't have much kernel optimization experience yet, though; if anyone in the OSS community is interested, I'd love some help!
Afterwards, it'd be super interesting to run some benchmarks against LoRA-based multi-tenant serving systems like Punica and S-LoRA.