
Does Zero-Inference support TP? #892

Open
preminstrel opened this issue Apr 16, 2024 · 11 comments

Comments

@preminstrel

No description provided.

@tjruwase
Contributor

ZeRO-Inference is composable with Megatron-style TP. That is, the TP is implemented in the client.
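
For context, the base ZeRO-Inference recipe (without TP) is ZeRO stage 3 with parameter offload wrapped around a client-provided model; a TP client such as Megatron would hand in an already-sharded model instead. Below is a minimal sketch assuming the HuggingFace integration (`HfDeepSpeedConfig`) and `facebook/opt-1.3b` as a placeholder checkpoint; the official example script may structure this differently.

```python
# Minimal ZeRO-Inference sketch: ZeRO-3 with parameter offload to CPU.
# Assumes launch via the `deepspeed` launcher so torch.distributed is set up.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                             # partition parameters (ZeRO-3)
        "offload_param": {"device": "cpu",      # keep weights in CPU memory,
                          "pin_memory": True},  # stream them to GPU layer by layer
    },
    "train_micro_batch_size_per_gpu": 1,        # required key, unused for inference
}

dschf = HfDeepSpeedConfig(ds_config)            # must exist before from_pretrained
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b",
                                             torch_dtype=torch.float16)
engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
inputs = tokenizer("DeepSpeed ZeRO-Inference", return_tensors="pt").to("cuda")
with torch.no_grad():
    output = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0]))
```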

@preminstrel
Author

Hello, is that composable with KV cache offloading? I cannot find its API... @tjruwase Thanks!

@preminstrel
Author

I mean offloading only the KV cache while keeping all of the model weights on the GPUs. All the example code looks like it is for a single GPU.

@tjruwase
Contributor

I assume you are referring to KV cache offloading in the latest ZeRO-Inference. We did not evaluate it with TP, but I expect it should work.

@preminstrel
Author

Thanks! But how can I make it work? Do you have an example command?

@preminstrel
Author

preminstrel commented Apr 16, 2024

I tried setting num_gpus to 2, but it seems to create two identical copies of the model, one on each GPU.

@tjruwase
Contributor

This is because your model has not been pre-processed by a TP framework like Megatron. ZeRO-Inference will not perform TP slicing on any model.
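
To make "pre-processed by a TP framework" concrete: launching with `--num_gpus 2` but no client-side sharding just gives each rank a full replica. A Megatron-style client slices the layers itself before handing the model to ZeRO-Inference, conceptually like the toy column-parallel linear below (illustrative only, not DeepSpeed or Megatron API; assumes torch.distributed is already initialized and that out_features divides evenly across ranks).

```python
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """Toy Megatron-style slicing: each rank keeps only its column slice."""
    def __init__(self, full_linear: torch.nn.Linear):
        super().__init__()
        rank, world = dist.get_rank(), dist.get_world_size()
        out_per_rank = full_linear.out_features // world
        sl = slice(rank * out_per_rank, (rank + 1) * out_per_rank)
        self.weight = torch.nn.Parameter(full_linear.weight[sl].clone())
        self.bias = torch.nn.Parameter(full_linear.bias[sl].clone())

    def forward(self, x):
        local_out = torch.nn.functional.linear(x, self.weight, self.bias)
        # Gather every rank's slice to rebuild the full output for the next layer.
        gathered = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, local_out)
        return torch.cat(gathered, dim=-1)
```

Without a transformation like this applied to every attention/MLP block, adding GPUs only replicates the model, which matches what you observed.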

@tjruwase
Contributor

Thanks! But how can I make it work? Do you have an example command?

Below are commands for single-GPU inference with KV cache offload.
https://github.com/microsoft/DeepSpeedExamples/tree/master/inference/huggingface/zero_inference#token-generation-with-zero-inference

@preminstrel
Author

Yes, you are right! Thanks! And the performance of single-GPU inference with KV cache offload is really nice! But I have a question:

I found that the fork of transformers actually allocates the buffer for the KV cache, which seems incompatible with TP. It will still allocate the KV cache for all self.num_heads heads on each GPU.
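
(Roughly, the issue is the per-rank buffer shape. A hypothetical sizing helper, just to illustrate what a TP-aware allocation would look like:)

```python
# Hypothetical illustration: under TP the per-rank KV cache buffer should cover
# only this rank's share of the heads; sizing it with self.num_heads allocates
# the full cache on every GPU.
def kv_cache_shape(batch, max_seq_len, num_heads, head_dim, tp_size=1):
    heads_per_rank = num_heads // tp_size
    return (batch, heads_per_rank, max_seq_len, head_dim)

print(kv_cache_shape(8, 2048, 64, 128))              # single GPU: all 64 heads
print(kv_cache_shape(8, 2048, 64, 128, tp_size=2))   # TP=2: 32 heads per rank
```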

So basically there is no official implementation of TP + ZeRO-Inference + KV offload that I can run directly. Please correct me if I am wrong.

Are you planning to add this feature in the future? By the way, will TP help in this setting, since the attention computation is all on the CPU anyway?

Thanks!

preminstrel changed the title from "Does Zero-Inference supports TP?" to "Does Zero-Inference support TP?" Apr 16, 2024
@tjruwase
Contributor

Glad that the KV cache offload performance might be good for your scenario.

Yes, you are correct that there is no official implementation of TP + ZeRO-Inference + KV offload. Unfortunately, we don't have the bandwidth for this right now, but we welcome community contributions.

Yes, I agree that TP won't add much benefit on top of KV offload since (1) the memory pressure is already mostly relieved, and (2) the attention computation is on the CPU.
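
As a rough back-of-envelope for point (1): for long prompts and large batches the KV cache is the dominant activation-side allocation, so moving it to CPU already removes most of what TP would otherwise split across GPUs. The numbers below are assumed (roughly OPT-30B-like), not measured:

```python
# fp16 KV cache size: keys + values across all layers/heads/positions.
def kv_cache_bytes(layers, heads, head_dim, seq_len, batch, bytes_per_elem=2):
    return 2 * layers * heads * head_dim * seq_len * batch * bytes_per_elem

gib = kv_cache_bytes(layers=48, heads=56, head_dim=128, seq_len=2048, batch=32) / 2**30
print(f"{gib:.1f} GiB")  # ~84 GiB held in CPU RAM instead of on the GPU(s)
```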

@preminstrel
Author

Thank you very much! Nice work!
