Add low_cpu_mem_usage flag in inference test #221

lokoppakmsft · 2022-11-15T23:07:59Z

No description provided.

lokoppakmsft · 2022-11-15T23:11:19Z

poedator · 2023-08-31T15:42:20Z

Can confirm: when the model is loaded from safetensors files, this can reduce memory usage by a factor of 5 or more.

When experimenting with llama2-70b, we found that before this fix each process used over 260GB of memory until the run OOMed. After the fix, the whole run took <250GB in total. This is likely because safetensors can memory-map the weight files into the shared OS-wide page cache, so that different ranks point to the same memory.
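The memory-mapping behaviour described above can be illustrated with a minimal sketch that uses only Python's standard `mmap` module. It is not how safetensors is implemented internally. The temporary file stands in for a weight shard, and the two mappings in one process stand in for two ranks; `map_readonly` and `weights_path` are hypothetical names chosen for the illustration. The point is that a read-only mapping lets the OS serve pages of the file from its page cache instead of giving every reader a private copy.

```python
import mmap
import os
import tempfile


def map_readonly(path):
    """Return a read-only memory map covering the whole file at `path`."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file. The mapping remains valid after
        # the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


# Stand-in for a weight shard: 1 MiB of dummy bytes (assumption for the demo).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 4096)
    weights_path = tmp.name

# Two independent mappings of the same file, as two ranks might create.
view_a = map_readonly(weights_path)
view_b = map_readonly(weights_path)

# Both views expose the same file contents without an explicit read()
# of the whole file into a private buffer.
first_bytes_match = view_a[:16] == view_b[:16]
mapped_size = len(view_a)

view_a.close()
view_b.close()
os.remove(weights_path)
```

Whether separate processes actually end up sharing physical memory depends on the operating system and on how the loader opens the files, so this should be read as an illustration of the idea rather than a measurement.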

To reproduce:

deepspeed --num_gpus 4 inference-test.py --model meta-llama/Llama-2-70b-hf --batch_size 2 --dtype float16 --max_new_tokens 32 --test_performance

lokoppakmsft added 2 commits November 15, 2022 23:06

Add low_cpu_mem_usage flag in inference test

50814b2

fix hardcoded False

e1aa27c
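The diff itself is not shown in this conversation, so the following is only a sketch of what the two commits plausibly amount to: a command-line switch whose parsed value is forwarded to the model loader instead of a hardcoded `False`. The flag name `--low_cpu_mem_usage` and the helpers `parse_args`, `build_load_kwargs`, and `load_model` are assumptions made for the example. `low_cpu_mem_usage` is an existing keyword argument of the Hugging Face `from_pretrained` method.

```python
import argparse


def parse_args(argv=None):
    """Parse a reduced, hypothetical subset of the inference test's options."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True, help="model name or path")
    parser.add_argument(
        "--low_cpu_mem_usage",
        action="store_true",
        help="ask the loader to keep peak CPU memory low",
    )
    return parser.parse_args(argv)


def build_load_kwargs(args):
    """Forward the parsed flag rather than a hardcoded False."""
    return {"low_cpu_mem_usage": args.low_cpu_mem_usage}


def load_model(args):
    """Load the model; requires the `transformers` package to be installed."""
    # Imported lazily so the argument handling can be tried without it.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        args.model, **build_load_kwargs(args)
    )


# Example: the flag is off by default and on when passed explicitly.
default_kwargs = build_load_kwargs(parse_args(["--model", "some/model"]))
enabled_kwargs = build_load_kwargs(
    parse_args(["--model", "some/model", "--low_cpu_mem_usage"])
)
```

Using `action="store_true"` keeps the previous behaviour as the default, so existing invocations of the script would be unaffected unless the flag is passed.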
