
Does Zero-Inference support TP? #892

Open
preminstrel opened this issue Apr 16, 2024 · 11 comments

Comments

@preminstrel

No description provided.

@tjruwase
Contributor

ZeRO-Inference is composable with Megatron-style TP. That is, the TP is implemented in the client.
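
For context, the base ZeRO-Inference recipe (without TP) is ZeRO stage 3 with parameter offload wrapped around a client-provided model; a TP client such as Megatron would hand in an already-sharded model instead. Below is a minimal sketch assuming the HuggingFace integration (`HfDeepSpeedConfig`) and `facebook/opt-1.3b` as a placeholder checkpoint; the official example script may structure this differently.

```python
# Minimal ZeRO-Inference sketch: ZeRO-3 with parameter offload to CPU.
# Assumes launch via the `deepspeed` launcher so torch.distributed is set up.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                             # partition parameters (ZeRO-3)
        "offload_param": {"device": "cpu",      # keep weights in CPU memory,
                          "pin_memory": True},  # stream them to GPU layer by layer
    },
    "train_micro_batch_size_per_gpu": 1,        # required key, unused for inference
}

dschf = HfDeepSpeedConfig(ds_config)            # must exist before from_pretrained
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b",
                                             torch_dtype=torch.float16)
engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
inputs = tokenizer("DeepSpeed ZeRO-Inference", return_tensors="pt").to("cuda")
with torch.no_grad():
    output = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0]))
```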

@preminstrel
Author

Hello, is that composable with KV cache offloading? I cannot find its API... @tjruwase Thanks!

@preminstrel
Author

I mean offloading only the KV cache while keeping all of the model weights on the GPUs. All the example code looks like it is for a single GPU.

@tjruwase
Contributor

I assume you are referring to KV cache offloading in the latest ZeRO-Inference. We did not evaluate it with TP, but I expect it should work.

@preminstrel
Author

Thanks! But how can I make it work? Do you have an example command?

@preminstrel
Author

preminstrel commented Apr 16, 2024

I tried setting num_gpus to 2, but it seems to create two identical copies of the model, one on each GPU.

@tjruwase
Contributor

This is because your model has not been pre-processed by a TP framework like Megatron. ZeRO-Inference will not perform TP slicing on any model.
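
To make "pre-processed by a TP framework" concrete: launching with `--num_gpus 2` but no client-side sharding just gives each rank a full replica. A Megatron-style client slices the layers itself before handing the model to ZeRO-Inference, conceptually like the toy column-parallel linear below (illustrative only, not DeepSpeed or Megatron API; assumes torch.distributed is already initialized and that out_features divides evenly across ranks).

```python
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """Toy Megatron-style slicing: each rank keeps only its column slice."""
    def __init__(self, full_linear: torch.nn.Linear):
        super().__init__()
        rank, world = dist.get_rank(), dist.get_world_size()
        out_per_rank = full_linear.out_features // world
        sl = slice(rank * out_per_rank, (rank + 1) * out_per_rank)
        self.weight = torch.nn.Parameter(full_linear.weight[sl].clone())
        self.bias = torch.nn.Parameter(full_linear.bias[sl].clone())

    def forward(self, x):
        local_out = torch.nn.functional.linear(x, self.weight, self.bias)
        # Gather every rank's slice to rebuild the full output for the next layer.
        gathered = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, local_out)
        return torch.cat(gathered, dim=-1)
```

Without a transformation like this applied to every attention/MLP block, adding GPUs only replicates the model, which matches what you observed.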

@tjruwase
Contributor

Thanks! But how can I make it work? Do you have an example command?

Below are commands for single-GPU inference with KV cache offload.
https://github.com/microsoft/DeepSpeedExamples/tree/master/inference/huggingface/zero_inference#token-generation-with-zero-inference

@preminstrel
Author

Yes, you are right! Thanks! And the performance of single-GPU inference with KV cache offload is really nice! But I have a question:

I found that the fork of transformers actually allocates the buffer for the KV cache, which seems incompatible with TP. It will still allocate the KV cache for all self.num_heads heads on each GPU.
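
(Roughly, the issue is the per-rank buffer shape. A hypothetical sizing helper, just to illustrate what a TP-aware allocation would look like:)

```python
# Hypothetical illustration: under TP the per-rank KV cache buffer should cover
# only this rank's share of the heads; sizing it with self.num_heads allocates
# the full cache on every GPU.
def kv_cache_shape(batch, max_seq_len, num_heads, head_dim, tp_size=1):
    heads_per_rank = num_heads // tp_size
    return (batch, heads_per_rank, max_seq_len, head_dim)

print(kv_cache_shape(8, 2048, 64, 128))              # single GPU: all 64 heads
print(kv_cache_shape(8, 2048, 64, 128, tp_size=2))   # TP=2: 32 heads per rank
```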

So basically there is no official implementation of TP + ZeRO-Inference + KV offload that I can run directly. Please correct me if I am wrong.

Are you planning to add this feature in the future? By the way, will TP help in this setting, since the attention computation is all on the CPU anyway?

Thanks!

preminstrel changed the title from "Does Zero-Inference supports TP?" to "Does Zero-Inference support TP?" Apr 16, 2024
@tjruwase
Contributor

Glad that the KV cache offload performance might be good for your scenario.

Yes, you are correct that there is no official implementation of TP + ZeRO-Inference + KV offload. Unfortunately, we don't have the bandwidth for this right now, but we welcome community contributions.

Yes, I agree that TP won't add much benefit on top of KV offload since (1) the memory pressure is already mostly relieved, and (2) the attention computation is on the CPU.
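
As a rough back-of-envelope for point (1): for long prompts and large batches the KV cache is the dominant activation-side allocation, so moving it to CPU already removes most of what TP would otherwise split across GPUs. The numbers below are assumed (roughly OPT-30B-like), not measured:

```python
# fp16 KV cache size: keys + values across all layers/heads/positions.
def kv_cache_bytes(layers, heads, head_dim, seq_len, batch, bytes_per_elem=2):
    return 2 * layers * heads * head_dim * seq_len * batch * bytes_per_elem

gib = kv_cache_bytes(layers=48, heads=56, head_dim=128, seq_len=2048, batch=32) / 2**30
print(f"{gib:.1f} GiB")  # ~84 GiB held in CPU RAM instead of on the GPU(s)
```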

@preminstrel
Author

Thank you very much! Nice work!
