I am benchmarking a machine with the 175B GPT-3 case. I run on 32 nodes, each equipped with 4 MI250X. The interconnect is Slingshot (the one on Frontier, ORNL's machine).
I get spurious all-gather performance drops that always occur during the optimizer step, never during the forward or backward pass. They are not reproducible: a given iteration may or may not be slow, even when the same nodes are used across runs.
The slowdown is on the order of 50x: the step takes about 30s instead of ~0.6s. It happens in
DeepSpeedZeroOptimizer_Stage3::step() -> _post_step -> persistent_parameters[0].all_gather.
I use ZeRO stage 3 with activation checkpointing, RCCL (the ROCm counterpart of NCCL), Adam, and no pipeline or tensor parallelism.
Did I overlook a setting? Has anyone experienced something similar on Cray Slingshot/AMD hardware?
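For reference, the setup described above corresponds roughly to a DeepSpeed config like the sketch below. This is a minimal illustration, not the exact config used in the runs: the batch size, learning rate and `stage3_param_persistence_threshold` value are placeholder assumptions, and only the ZeRO stage, the optimizer type and the presence of activation checkpointing come from the description.

```python
import json

# Minimal sketch of a DeepSpeed config matching the described setup.
# Values marked "assumed" are placeholders, not taken from the actual runs.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # assumed
    "zero_optimization": {
        "stage": 3,  # ZeRO stage 3, as described
        # Parameters smaller than this stay unpartitioned ("persistent");
        # the value here is illustrative only.
        "stage3_param_persistence_threshold": 100000,
    },
    "activation_checkpointing": {
        "partition_activations": False,  # assumed
    },
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 1e-4},  # assumed
    },
}

# Serialize so it can be written to a file and passed to the launcher.
ds_config_json = json.dumps(ds_config, indent=2)
print(ds_config_json)
```

No pipeline or tensor parallelism is configured, so nothing beyond data parallelism with ZeRO appears in the sketch.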