sequence parallel with communication overlap #5691
Conversation
This reverts commit cb15ffa.
Overlapping only happens when the computation doesn't depend on the communication?
@Edenzzzz Yes, manual synchronization of some dependencies is required.
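For illustration, here is a minimal sketch of what such a manual dependency sync can look like: the all-to-all is issued asynchronously, independent compute overlaps it, and the consumer waits explicitly. This is my own sketch rather than the PR's code; it assumes an initialized process group and CUDA tensors, and `overlapped_step` / `independent_compute` are hypothetical names.

```python
import torch
import torch.distributed as dist

# Hypothetical sketch, not the PR's actual code. Assumes dist.init_process_group()
# has been called and all tensors live on the current CUDA device.
def overlapped_step(comm_in, comm_out, independent_compute):
    # Issue the all-to-all asynchronously so it can run while compute proceeds.
    handle = dist.all_to_all_single(comm_out, comm_in, async_op=True)

    # This work has no data dependency on comm_out, so it overlaps the collective.
    partial = independent_compute()

    # Manual dependency sync: anything that reads comm_out must wait here.
    handle.wait()
    return partial, comm_out
```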
@inkcherry, many thanks for this excellent contribution to the DeepSpeed codebase. To help with our review, could you please add (1) unit test(s) and (2) numbers on parallel performance improvements (throughput and latency) to the pull request? Your continuous and remarkable contributions to DeepSpeed are appreciated.
@inkcherry Thanks for your insight! Can I ask why we need sp_stream here, as it seems never to be used, e.g. via torch.cuda.stream(sp_stream)?
Hi @Edenzzzz,
SP is a fantastic piece of work; it is very elegant and concise. At the current stage, a transformer layer's forward and backward passes involve 8 all-to-all operations, with 5 opportunities for overlapping communication (a stream-based sketch follows the list below):
Forward pass: The QKV matrix operations can be pipelined alongside some of the all-to-all communications.
Backward pass: DQ, DK, DV all-to-all communications can be pipelined alongside matrix operations.
Backward pass: the DO_w computation can run in parallel with DO_input, overlapping matrix operations with all-to-all communications. Similar comm-overlap strategies are used in Megatron for TP/TP-sp parallelism.
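To make the stream question above concrete, here is a minimal sketch, under my own assumptions rather than the PR's actual code, of how a dedicated side stream (analogous to sp_stream) could carry the Q all-to-all while the K/V projections run on the default stream. The function and tensor names (`qkv_with_overlap`, `wq`/`wk`/`wv`, `q_out`) are hypothetical, and an initialized process group is assumed.

```python
import torch
import torch.distributed as dist

# Hypothetical sketch of the stream-based overlap discussed above.
sp_stream = torch.cuda.Stream()

def qkv_with_overlap(hidden, wq, wk, wv, q_out):
    q_local = hidden @ wq                       # Q projection on the default stream

    # Hand the Q all-to-all to the side stream so it runs concurrently with K/V.
    sp_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(sp_stream):
        q_local.record_stream(sp_stream)        # tell the allocator q_local is used here
        dist.all_to_all_single(q_out, q_local)

    k_local = hidden @ wk                       # overlaps with the Q all-to-all
    v_local = hidden @ wv

    # Re-join: consumers of q_out on the default stream must wait for sp_stream.
    torch.cuda.current_stream().wait_stream(sp_stream)
    return q_out, k_local, v_local
```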
I tested with 1N8C, ZeRO-1, activation checkpointing disabled, ds-sp=8, and gbs=16:
1B model, 64K sequence length
7B model, 16K sequence length
Both showed over 10% improvement, even though the baseline TFLOPS were already at a relatively good level. (I also found that for mega-ds, using split QKV by itself can improve performance, since it removes the slice + cat operations in fwd/bwd.)
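As an aside on the split-QKV point, here is an illustrative contrast (my own sketch, not mega-ds code): a fused QKV projection needs an extra chunk/slice in forward and a corresponding concat of gradients in backward, while three separate projections avoid those kernels.

```python
import torch.nn as nn

class FusedQKV(nn.Module):
    # One large projection, then a split: extra slice in fwd, grad concat in bwd.
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)

    def forward(self, x):
        return self.qkv(x).chunk(3, dim=-1)

class SplitQKV(nn.Module):
    # Three separate projections: no slice/cat kernels around the matmuls.
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):
        return self.q(x), self.k(x), self.v(x)
```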
This works together with microsoft/Megatron-DeepSpeed#415.