Does Zero-Inference support TP? #892
ZeRO-Inference is composable with Megatron-style TP; that is, the TP is implemented in the client.
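A minimal sketch of what that composition looks like, assuming a Hugging Face causal LM and the standard ZeRO stage-3 parameter-offload config (the model name is only a placeholder):

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM

# ZeRO stage 3 with parameters offloaded to CPU -- the core of ZeRO-Inference.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,  # required by the engine; unused for inference
}

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", torch_dtype=torch.float16
)
# If TP is wanted, the model above would already be sliced per rank by the
# client (e.g. Megatron) before this point; ZeRO-Inference just wraps it.
engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
engine.module.eval()
```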
Hello, is that composable with KV cache offloading? I cannot find its API... @tjruwase Thanks!
I mean offloading only the KV cache while keeping the whole model weights on the GPUs. All the example code looks like it is for a single GPU.
I assume you are referring to the KV cache offloading in the latest ZeRO-Inference. We did not evaluate it with TP, but I expect it should work.
Thanks! But how can I make it work? Do you have an example command?
I tried setting num_gpus to 2, but it seems to create two identical copies of the model, one on each GPU.
This is because your model has not been pre-processed by a TP framework like Megatron. ZeRO-Inference will not perform TP slicing on any model.
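For illustration, the kind of per-rank slicing a Megatron-style client performs ahead of time might look like the toy sketch below; `shard_linear_column_parallel` is a hypothetical helper, not a DeepSpeed or Megatron API:

```python
import torch
import torch.distributed as dist

def shard_linear_column_parallel(linear: torch.nn.Linear) -> torch.nn.Linear:
    """Keep only this rank's slice of the output dimension of a Linear layer.

    This mimics what a TP framework does before ZeRO-Inference ever sees the
    model; ZeRO-Inference itself never performs this slicing. Assumes
    torch.distributed is already initialized and out_features divides evenly.
    """
    rank, world = dist.get_rank(), dist.get_world_size()
    out_per_rank = linear.out_features // world
    lo, hi = rank * out_per_rank, (rank + 1) * out_per_rank
    sharded = torch.nn.Linear(
        linear.in_features, out_per_rank, bias=linear.bias is not None
    )
    with torch.no_grad():
        sharded.weight.copy_(linear.weight[lo:hi])
        if linear.bias is not None:
            sharded.bias.copy_(linear.bias[lo:hi])
    return sharded
```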
Below are commands for single-GPU inference with KV cache offload.
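A representative invocation might look like the following; the flag names are assumptions based on the zero-inference example script in DeepSpeedExamples, so check that repo's README for the exact interface:

```bash
# Single-GPU ZeRO-Inference with weights and KV cache offloaded to CPU.
# Flag names are assumptions from the zero-inference example script.
deepspeed --num_gpus 1 run_model.py \
  --model facebook/opt-30b \
  --batch-size 8 \
  --prompt-len 512 \
  --gen-len 32 \
  --cpu-offload \
  --kv-offload
```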
Yes, you are right, thanks! And the performance of single-GPU inference with KV cache offload is really nice! But I have a question: I found that the fork of transformers actually allocates the KV cache buffer using the full self.num_heads on each GPU, which does not seem compatible with TP. So basically there is no official implementation of TP + ZeRO-Inference + KV offload that I can run directly. Please correct me if I am wrong. Are you planning to add this feature in the future? By the way, would TP help under this setting, given that the attention computation is all on the CPU anyway? Thanks!
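To illustrate the mismatch: the offload buffer is sized with the full head count on every GPU, whereas a TP-aware version would size it per rank. A hypothetical sketch (not code from the transformers fork):

```python
import torch

def alloc_offload_kv_cache(batch, max_seq, num_heads, head_dim, tp_world_size=1):
    # Current behavior is effectively tp_world_size == 1: every GPU's pinned
    # CPU buffer is sized for all num_heads. A TP-aware version would divide
    # the head count across ranks as below.
    heads_per_rank = num_heads // tp_world_size
    shape = (batch, heads_per_rank, max_seq, head_dim)
    k = torch.empty(shape, dtype=torch.float16, pin_memory=True)
    v = torch.empty(shape, dtype=torch.float16, pin_memory=True)
    return k, v
```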
Glad that the KV cache offload performance might be good for your scenario. Yes, you are correct: there is no official implementation of TP + ZeRO-Inference + KV offload. Unfortunately, we don't have the bandwidth for this right now, but we welcome community contributions. And yes, I agree that TP won't add much benefit on top of KV offload, since (1) the memory pressure is already mostly relieved, and (2) the attention computation is on the CPU.
Thank you very much! Nice work!