Spurious all gather performance drop. #384

etiennemlb · 2024-04-29T17:16:15Z

I am benchmarking a machine with the 175B GPT-3 case. I run on 32 nodes, each equipped with 4 MI250X. The interconnect is Slingshot (the same one as on Frontier, ORNL's machine).

I get spurious all-gather slowdowns that always occur during the optimizer step, never during the forward or backward pass. A given iteration may or may not be slow, even when the same nodes are used across runs.

The slowdown is about 50x: the all-gather takes 30 s instead of ~0.6 s. It happens in
DeepSpeedZeroOptimizer_Stage3::step() -> _post_step -> persistent_parameters[0].all_gather.
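
To make a drop like this easy to spot in logs, the slow iterations can be flagged against the median step time. This is a minimal sketch, assuming per-iteration optimizer-step durations (in seconds) have already been collected. The helper name flag_slow_steps and the factor threshold are hypothetical and not part of DeepSpeed, and the timings are illustrative numbers shaped like the report, not measurements.

```python
from statistics import median


def flag_slow_steps(durations_s, factor=10.0):
    """Return (index, duration, ratio) for steps slower than factor x the median.

    Hypothetical helper for illustration; not a DeepSpeed API.
    """
    if not durations_s:
        return []
    baseline = median(durations_s)
    if baseline <= 0:
        return []
    return [
        (i, d, d / baseline)
        for i, d in enumerate(durations_s)
        if d > factor * baseline
    ]


# Illustrative numbers only: ~0.6 s per step, with one 30 s outlier.
timings = [0.61, 0.59, 0.60, 30.0, 0.62]
for i, d, ratio in flag_slow_steps(timings):
    print(f"iteration {i}: {d:.1f} s ({ratio:.0f}x the median)")
```

The median is used as the baseline rather than the mean so that a single 30 s step does not pull up the reference value it is compared against.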

I use ZeRO stage 3 and activation checkpointing, RCCL (through the NCCL backend), Adam, and no pipeline or tensor parallelism.

Did I overlook a setting? Has anyone experienced something similar on Cray Slingshot/AMD hardware?
