
sequence parallel with communication overlap #5691

Merged · 20 commits · Aug 1, 2024

Conversation

inkcherry
Contributor

@inkcherry inkcherry commented Jun 21, 2024

SP is a fantastic piece of work; it is very elegant and concise. At the current stage, a transformer layer's forward and backward passes involve 8 all-to-all operations, with 5 opportunities for overlapping communication:

Forward pass: The QKV matrix operations can be pipelined alongside some of the all-to-all communications.
Backward pass: DQ, DK, DV all-to-all communications can be pipelined alongside matrix operations.
Backward pass: the DO_w computation can run in parallel with DO_input, overlapping matrix operations with all-to-all communication. Similar overlap-comm strategies are used in Megatron for TP/TP-sp parallelism. (A minimal sketch of the first kind of overlap is shown after this list.)
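For illustration only (not the PR's actual code; the function and tensor names are hypothetical), a minimal sketch of the first kind of overlap: launch the all-to-all for one projection with async_op=True and run an independent matmul while the communication is in flight.

```python
import torch
import torch.distributed as dist

def overlapped_q_alltoall_and_k_proj(q_local, hidden_states, k_weight, sp_group):
    """Sketch: overlap the all-to-all for Q with the (independent) K projection."""
    q_out = torch.empty_like(q_local)
    # Kick off the Q all-to-all asynchronously across the sequence-parallel group.
    handle = dist.all_to_all_single(q_out, q_local, group=sp_group, async_op=True)

    # Independent compute: the K projection does not consume q_out,
    # so it can run while the all-to-all is in flight.
    k_local = torch.matmul(hidden_states, k_weight)

    # Block only when the communicated tensor is actually needed.
    handle.wait()
    return q_out, k_local
```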
I tested with 1 node × 8 GPUs (1N8C), ZeRO stage 1, activation checkpointing disabled, ds-sp=8, and gbs=16:
1B model, 64K sequence length
7B model, 16K sequence length
Both showed over 10% improvement (I also found that for mega-ds, using split QKV by itself can improve performance by reducing slice + cat operations in fwd/bwd), even though TFLOPs were already at a relatively good level.
Co-works with microsoft/Megatron-DeepSpeed#415.

@inkcherry inkcherry requested a review from mrwyattii as a code owner June 21, 2024 15:34
@tjruwase tjruwase requested review from samadejacobs and tohtana and removed request for mrwyattii June 21, 2024 22:17
@Edenzzzz

Overlapping only happens when the computation doesn't depend on the communication?

@inkcherry
Contributor Author

inkcherry commented Jul 5, 2024

Overlapping only happens when the computation doesn't depend on the communication?

@Edenzzzz Yes, manual synchronization of some dependencies is required.

@inkcherry
Contributor Author

inkcherry commented Jul 10, 2024

We set gbs=2, sp=4, seq_len=16K, model size=1B, zero_stage=1, activation checkpointing disabled, with flash-attn-v2, and compared three runs:

  • without this patch
  • with this patch, splitqkv + sp-overlap-comm enabled
  • with this patch, splitqkv + sp-overlap-comm disabled

We plotted the loss curve and grad norm curve for each run, and they are consistent.

@samadejacobs
Contributor

@inkcherry, many thanks for this excellent contribution to the DeepSpeed codebase. To help with our review, could you please add (1) unit test(s) and (2) numbers on parallel performance improvements (throughput and latency) to the pull request? Your continuous and remarkable contributions to DeepSpeed are appreciated.

@loadams loadams merged commit 17ed7c7 into microsoft:master Aug 1, 2024
11 checks passed
@Edenzzzz

@inkcherry Thanks for your insight! Can I ask why we need sp_stream here, as it never seems to be used, e.g. via torch.cuda.stream(sp_stream)?

@inkcherry
Contributor Author

inkcherry commented Aug 30, 2024

@inkcherry Thanks for your insight! Can I ask why we need sp_stream here, as it never seems to be used, e.g. via torch.cuda.stream(sp_stream)?

Hi @Edenzzzz,
Apologies for missing your comments. DeepSpeed's sequence parallel is designed in a modular way, which means we can't freely insert communication calls wherever we want in order to use async_op. When a PyTorch computation kernel is launched before a communication kernel, the communication will implicitly synchronize with the default stream, so we need a custom stream (and sometimes an event) to achieve parallelism between computation and communication, while still maintaining the dependencies between them properly. The stream setup for this is in the Megatron-DeepSpeed implementation.
Two of the three cases mentioned in this PR fall into this category and use an additional stream; the other case uses async_op=True with all-to-all.
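As a rough sketch of that pattern (assumed names, not the actual Megatron-DeepSpeed code): the all-to-all is enqueued on a dedicated sp_stream, with events expressing only the real dependencies, so compute already queued on the default stream can proceed in parallel.

```python
import torch
import torch.distributed as dist

sp_stream = torch.cuda.Stream()

def alltoall_on_sp_stream(inp, sp_group):
    out = torch.empty_like(inp)

    # Mark the point on the default stream where `inp` becomes ready.
    input_ready = torch.cuda.Event()
    input_ready.record(torch.cuda.current_stream())

    with torch.cuda.stream(sp_stream):
        # The comm stream waits only for `inp`, not for unrelated compute.
        input_ready.wait()
        # Prevent the caching allocator from reusing `inp` while sp_stream uses it.
        inp.record_stream(sp_stream)
        dist.all_to_all_single(out, inp, group=sp_group)
        comm_done = torch.cuda.Event()
        comm_done.record(sp_stream)

    # Caller: run independent compute on the default stream here, then make the
    # default stream wait before consuming `out`:
    #   comm_done.wait(torch.cuda.current_stream())
    return out, comm_done
```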
