deepspeed-chat: print mean stage1/2 loss periodically #780

Conversation

mosheisland (Contributor)

Print the mean loss periodically, based on the DeepSpeed 'steps_per_print' configuration, so the mean loss is printed on an optimizer step boundary. To reduce log clutter, only the rank 0 loss is printed.

This commit modifies the current print_loss functionality of stage 1:

  • Print the mean loss at the optimizer step boundary instead of at every micro-step
  • Print periodically based on ds_config['steps_per_print']
  • Print only at global rank 0

The commit adds print_loss functionality for stage 2.
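A minimal sketch of the printing logic described above (not the actual patch; `model` is assumed to be the DeepSpeed engine and `ds_config` its config dict):

```python
import torch.distributed as dist

def train_epoch(model, train_dataloader, ds_config):
    grad_accum = ds_config.get("gradient_accumulation_steps", 1)
    steps_per_print = ds_config.get("steps_per_print", 10)
    loss_sum, loss_count = 0.0, 0

    for micro_step, batch in enumerate(train_dataloader):
        loss = model(**batch, use_cache=False).loss
        model.backward(loss)
        model.step()
        loss_sum, loss_count = loss_sum + loss.item(), loss_count + 1

        if (micro_step + 1) % grad_accum == 0:  # optimizer step boundary
            global_step = (micro_step + 1) // grad_accum
            if global_step % steps_per_print == 0 and dist.get_rank() == 0:
                print(f"step {global_step}: mean loss {loss_sum / loss_count:.4f}")
            loss_sum, loss_count = 0.0, 0
```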

Change-Id: I430d88cbbbbb2dd2fe7784dbadac69e522d5a192

Currently, the chatbot assumes an OPTForCausalLM model.
Modify it to load the required model class from the checkpoint.
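A hedged sketch of what loading the required model from the checkpoint could look like with the transformers auto classes (the helper name is illustrative):

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

def load_chatbot_model(checkpoint_path):
    # Let the checkpoint's config pick the architecture instead of
    # hard-coding OPTForCausalLM.
    config = AutoConfig.from_pretrained(checkpoint_path)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
    model = AutoModelForCausalLM.from_pretrained(checkpoint_path, config=config)
    return model, tokenizer
```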

Change-Id: I04cbc28f87c7be4fc89a3fac39a3e5634b151b32
Signed-off-by: Moshe Island <[email protected]>
DeepSpeed's bf16_optimizer does not have an overflow attribute.
This is fine, since the bf16 dtype has the same range as fp32 and is not
expected to overflow.
Therefore, for bf16, always return no overflow.
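One possible way to express this (a sketch, not necessarily the patch; `model_engine` is assumed to be the DeepSpeed engine):

```python
def get_overflow(model_engine):
    # fp16 optimizers expose an 'overflow' flag; DeepSpeed's bf16_optimizer
    # does not, and bf16 shares fp32's exponent range, so report no overflow.
    return getattr(model_engine.optimizer, "overflow", False)
```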

Change-Id: I66a2204f3af81e52e7fa8d024afafdbbc7494327
Signed-off-by: Moshe Island <[email protected]>
Currently, only the disable_dropout configuration is supported.
However, some models (e.g. Bloom) default to dropout=0 in their model config.
Therefore, modify the code to support an explicit dropout configuration.
Also, update the existing training scripts accordingly.
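A minimal sketch of an explicit dropout configuration helper (the field names vary by model family; this is an assumption, not the exact patch):

```python
def configure_dropout(model_config, dropout):
    # Apply the requested dropout to every dropout field the architecture
    # exposes; e.g. OPT uses 'dropout', Bloom uses 'hidden_dropout'.
    if dropout is not None:
        for key in ("dropout", "attention_dropout", "hidden_dropout",
                    "activation_dropout"):
            if hasattr(model_config, key):
                setattr(model_config, key, dropout)
```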

Change-Id: I5ee96a77ca2b58d9787573a48009e2af36a270b0
Signed-off-by: Moshe Island <[email protected]>
Add support for periodic evaluation during reward model (RM) training.
Configurable via the added arguments --eval_interval and --eval_iters.
The default configuration is backward compatible.

In addition, also display the score of the rejected predictions.
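The added arguments could look roughly like this (the defaults shown are assumptions chosen to keep the old behaviour):

```python
import argparse

parser = argparse.ArgumentParser()
# Run evaluation every N optimizer steps; 0 keeps the old behaviour.
parser.add_argument("--eval_interval", type=int, default=0)
# Number of evaluation batches per evaluation run; 0 means the full set.
parser.add_argument("--eval_iters", type=int, default=0)
```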

Change-Id: Ib377fd731fe676c01114c087581a30777a3f3f49
Signed-off-by: Moshe Island <[email protected]>
Computing the loss in fp32 improved accuracy for bf16 training in all 3 stages.
By default, all 3 stages calculate the loss in fp32 when using bf16.
This can be disabled with --no_bf16_to_fp32_loss.
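A sketch of the fp32-loss idea for the causal-LM stages (an assumed helper, not the literal patch):

```python
import torch.nn.functional as F

def causal_lm_loss(logits, labels, compute_fp32_loss=True):
    # Up-cast the logits before cross-entropy so the softmax and loss are
    # computed in fp32 even when the model itself runs in bf16.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    if compute_fp32_loss:
        shift_logits = shift_logits.float()
    return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1))
```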

While at it, fix stage2 reward model creation: pass zero_stage to
create_critic_model.

Change-Id: I9c8e95d4886cdb44aaa6c14c4aee738e133ae405
Signed-off-by: Moshe Island <[email protected]>
The current default name used to detect LN layers is "LayerNorm.weight".
This does not work for the following models:
- opt: uses "layer_norm"
- llama: uses "norm" and "layernorm"
- bloom: uses "layernorm" and "ln_f"

Therefore, modify the default names to accommodate the above.
Also, compare names in lower case to capture models with different capitalization.
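A sketch of the grouping with the extended, lower-cased name list (the exact default list in the patch may differ):

```python
def get_optimizer_grouped_parameters(
        model, weight_decay,
        no_decay_name_list=("bias", "layer_norm", "layernorm", "ln_f", "norm")):
    # Compare lower-cased parameter names so "LayerNorm", "layer_norm",
    # "layernorm", etc. are all matched.
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if any(nd in name.lower() for nd in no_decay_name_list):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```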

Change-Id: I5b805df2663c62daf3d9c8a31a973742e344e76b
Signed-off-by: Moshe Island <[email protected]>
When using LoRA only, get_optimizer_grouped_parameters() returns a list of 3
parameter groups, where only the second is non-empty.
DeepSpeed then removes the empty parameter groups
[ref: DeepSpeedEngine._configure_optimizer(), deepspeed v0.10.3].
However, the lr_scheduler still contains 3 groups.
This causes the LR scheduler to update the LoRA params with the wrong LR.

Fix it by removing all empty groups in get_optimizer_grouped_parameters().
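The fix amounts to filtering out the empty groups before handing them to DeepSpeed, e.g.:

```python
def drop_empty_groups(optimizer_grouped_parameters):
    # Keep only groups that actually contain parameters, so the optimizer and
    # the lr_scheduler end up with the same number of groups.
    return [g for g in optimizer_grouped_parameters if len(g["params"]) > 0]
```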

Change-Id: I520841312bdedd6a572cf4c827e0bbf06f983575
Signed-off-by: Moshe Island <[email protected]>
When using only_optimize_lora, we still need to train the v_head parameters.
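A sketch of forcing the value head to stay trainable (the helper name is illustrative; `v_head` is the reward model's value-head parameter):

```python
def force_v_head_training(rm_model):
    # With only_optimize_lora, non-LoRA parameters are frozen; the reward
    # model's value head is new and must still be trained.
    for name, param in rm_model.named_parameters():
        if "v_head" in name:
            param.requires_grad = True
```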

Change-Id: I252c3ee69819997bf336482c6779b070f2e76df8
Signed-off-by: Moshe Island <[email protected]>
The Bloom-560m model has high variance in its last LN layer weights.
This causes accuracy issues in bf16 stage 2 training.
Therefore, reset the parameters of the last LN layer before training.
This is good practice whenever we replace the classifier that follows the LN.

In addition, when using only_optimize_lora, we need to force the training of
the LN parameters that were reset.
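A hedged sketch of resetting the final LN and keeping it trainable under only_optimize_lora (locating the LN by scanning modules is an assumption):

```python
import torch.nn as nn

def reset_final_layer_norm(rm_model, only_optimize_lora=False):
    # Locate the last LayerNorm in the base model (e.g. Bloom's ln_f),
    # re-initialize it, and force it to remain trainable if needed.
    last_ln = None
    for module in rm_model.modules():
        if isinstance(module, nn.LayerNorm):
            last_ln = module
    if last_ln is not None:
        last_ln.reset_parameters()
        if only_optimize_lora:
            for param in last_ln.parameters():
                param.requires_grad = True
```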

Change-Id: I323d8947907eb4a1cc0fa6354bdaf0cbbf33a68d
Signed-off-by: Moshe Island <[email protected]>
Currently, ppl is calculated on the local worker and then averaged over
data-parallel workers. Fix it by first averaging the loss over data-parallel
workers and then calculating the ppl of the averaged loss.

While at it, print the loss in evaluate.
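The key point is that exp(mean(loss)) != mean(exp(loss)); a sketch of the corrected order of operations:

```python
import math
import torch.distributed as dist

def loss_and_ppl(mean_local_loss):
    # Average the loss across data-parallel ranks first, then exponentiate.
    loss = mean_local_loss.clone().detach()
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)
    loss = loss / dist.get_world_size()
    try:
        ppl = math.exp(loss.item())
    except OverflowError:
        ppl = float("inf")
    return loss.item(), ppl
```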

Change-Id: Ic4108ca48a18b326677d80c1eee81c535b3a27a9
Signed-off-by: Moshe Island <[email protected]>
Fix args when calling create_critic_model().

Change-Id: I845a4f024ca50915076184692f44ee8a1b7016a2
Signed-off-by: Moshe Island <[email protected]>
Stages 1 & 2 append '<|endoftext|>' marker to all samples.
However, some tokenizers (e.g. OPT, Bloom), encode this marker as a sequence
of subword tokens.

This commit adds an optional support to add the EOT marker as a special token
to force the tokenizer to encode it as a single token.
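A sketch of the optional special-token path (the flag and helper names are assumptions):

```python
from transformers import AutoTokenizer

def load_tokenizer(model_name_or_path, add_eot_token=False):
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    if add_eot_token:
        # Register the marker as a special token so it is encoded as a single
        # id instead of several subword pieces.
        tokenizer.add_special_tokens(
            {"additional_special_tokens": ["<|endoftext|>"]})
    return tokenizer
```

If the marker is new to the vocabulary, the model's embedding matrix would also need resizing, e.g. model.resize_token_embeddings(len(tokenizer)).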

Change-Id: If98d348fcaa7d6685e755aabe305e23e7649c367
Signed-off-by: Moshe Island <[email protected]>
At stages 1 and 2, print the average loss periodically based on the DeepSpeed
configuration option steps_per_print.
The commit modifies the current print_loss functionality of stage 1:
  - Print the average loss instead of the local iteration loss
  - Print only at global rank 0

Change-Id: I430d88cbbbbb2dd2fe7784dbadac69e522d5a192
Signed-off-by: Moshe Island <[email protected]>
Due to the high variance of the reward, also display a reward EMA.
While at it, print the total number of iterations.
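A tiny EMA tracker like the following (the smoothing factor is an assumption) is enough for this display:

```python
class ExponentialMovingAverage:
    # Smooths the noisy per-batch reward for logging purposes.
    def __init__(self, alpha=0.9):
        self.alpha = alpha
        self.ema = None

    def update(self, value):
        self.ema = value if self.ema is None else (
            self.alpha * self.ema + (1.0 - self.alpha) * value)
        return self.ema
```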

Change-Id: I3a6b287af8087cbc075ba12764035d77070ae93d
Signed-off-by: Moshe Island <[email protected]>
In stage 3, if all the generated answers to the given prompts are too short,
use the last valid micro-batch of prompts and answers of this worker.

Change-Id: I7878e3b10cc6fa81ce8364ca3e4a3569cfb350a8
Signed-off-by: Moshe Island <[email protected]>
Enable configuring the print-answers interval in stage 3.

Change-Id: I6440f401f602e7e7f763b3ec8e45029a74dd72b7
Signed-off-by: Moshe Island <[email protected]>
When prompts are too long, they are still used, but they are arbitrarily
sliced at the start to fit the configured max prompt length.
This arbitrary slicing sometimes makes prompts less meaningful, which in turn
causes the generator to produce garbage.
This phenomenon was observed to destabilize RLHF stage 3.
To fix it, we filter out prompts that are too long.

In addition, the dataset rebuild flag is propagated to the other consumers
that require it.
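The filtering itself is simple; a sketch over tokenized prompts (names are illustrative):

```python
def filter_long_prompts(prompt_token_ids, max_prompt_len):
    # Drop prompts that would otherwise be truncated, instead of feeding the
    # generator a possibly meaningless sliced prefix.
    return [ids for ids in prompt_token_ids if len(ids) <= max_prompt_len]
```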

Change-Id: I440f09decf0784e4c2c8167a893006dff312281b
Signed-off-by: Moshe Island <[email protected]>
Change-Id: I1fd19529c94c89cc62d3b4a2b20b17fc4f4773bf
Signed-off-by: Moshe Island <[email protected]>
Change-Id: I40012d374121accbeb2c45729ac5532cf6cfedbb
Signed-off-by: Moshe Island <[email protected]>
Change-Id: I205e41f889af0cf0162fc33b8f0c4e40dde4c7a3
Signed-off-by: Moshe Island <[email protected]>
Add support for the Habana Gaudi acceleration device.
Main changes include:
- Use the accelerator abstraction layer (see the sketch below)
- Do not use CUDA kernels (e.g. FusedAdam)
- HPU utilizes graph mode and requires additional APIs (e.g. hpu_mark_step)
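For example, device selection goes through DeepSpeed's accelerator abstraction instead of hard-coded CUDA calls (a sketch):

```python
import torch
from deepspeed.accelerator import get_accelerator

def get_device(local_rank):
    # Resolves to "cuda", "hpu", etc. depending on the installed accelerator.
    return torch.device(get_accelerator().device_name(), local_rank)
```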

Change-Id: Ifb9fc25bfd62a5299859f1203376494b87ca87e0
Signed-off-by: Moshe Island <[email protected]>
Implement reward model loss calculation in a way that prevents dynamic shapes.
This speeds up HPU execution.
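A hedged sketch of one static-shape formulation: compute the pairwise loss over the full, fixed-length sequence and mask it, rather than slicing per-sample divergence spans (tensor names and the masking scheme are assumptions, not the actual patch):

```python
import torch.nn.functional as F

def pairwise_rm_loss_static(chosen_rewards, rejected_rewards, divergence_mask):
    # chosen_rewards / rejected_rewards: [batch, seq_len] per-token rewards.
    # divergence_mask: 1.0 where chosen and rejected differ (and are not
    # padding), 0.0 elsewhere. Masking over the full length avoids the
    # data-dependent slicing that creates dynamic shapes on HPU.
    per_token = -F.logsigmoid(chosen_rewards - rejected_rewards) * divergence_mask
    return per_token.sum() / divergence_mask.sum().clamp(min=1.0)
```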

Change-Id: I50c6ebdadca5cf6d3548c31c614730e7dead825c
Signed-off-by: Moshe Island <[email protected]>
@tjruwase (Contributor)

@mosheisland, apologies for the delay in merging this PR. Can you please help resolve the conflicts? Thanks!

@loadams (Contributor) commented Jul 18, 2024

> @mosheisland, apologies for the delay in merging this PR. Can you please help resolve the conflicts? Thanks!

Hi @mosheisland - could you review the merge conflicts and we can get this merged?

@loadams (Contributor) commented Nov 4, 2024

Closing as stale, and because the repo was refactored.

@loadams closed this Nov 4, 2024