deepspeed-chat: print mean stage1/2 loss periodically #780
Closed
mosheisland wants to merge 29 commits into microsoft:master from mosheisland:12_print_step_1_2_loss_periodically
Conversation
Currently, the chatbot assumes an OPTForCausalLM model. Modify it to use the required model class from the checkpoint. Change-Id: I04cbc28f87c7be4fc89a3fac39a3e5634b151b32 Signed-off-by: Moshe Island <[email protected]>
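A minimal sketch of the idea: resolve the concrete model class from the checkpoint's config via the Auto classes instead of hard-coding OPTForCausalLM. The checkpoint path below is illustrative, not taken from the PR.

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/actor_checkpoint"  # illustrative; any HF-format checkpoint
config = AutoConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
# AutoModelForCausalLM picks the concrete class (OPT, Bloom, Llama, ...)
# from the checkpoint config, so OPT is no longer assumed.
model = AutoModelForCausalLM.from_pretrained(model_path, config=config)
```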
DeepSpeed's bf16_optimizer does not have an overflow attribute. This is fine, since the bf16 dtype has the same range as fp32 and is not expected to overflow. Therefore, for bf16, always return no overflow. Change-Id: I66a2204f3af81e52e7fa8d024afafdbbc7494327 Signed-off-by: Moshe Island <[email protected]>
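A hedged sketch of the described behavior; `optimizer_overflowed` is a hypothetical helper name, not the PR's actual function.

```python
def optimizer_overflowed(optimizer) -> bool:
    # DeepSpeed's bf16_optimizer has no `overflow` attribute; since bf16 shares
    # fp32's exponent range, treat a missing attribute as "no overflow".
    return bool(getattr(optimizer, "overflow", False))
```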
Currently, only a disable_dropout configuration is supported. However, some models (e.g. Bloom) have a default of dropout=0 in the model config. Therefore, modify the code to support an explicit dropout configuration, and update the existing training scripts accordingly. Change-Id: I5ee96a77ca2b58d9787573a48009e2af36a270b0 Signed-off-by: Moshe Island <[email protected]>
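An illustrative sketch of such an override; the helper name and the list of config attributes are assumptions covering common model families, not necessarily the PR's exact code.

```python
def configure_dropout(model_config, dropout):
    """Override dropout in a HF model config; None keeps the model's own defaults."""
    if dropout is None:
        return
    for key in ("dropout", "hidden_dropout", "attention_dropout",
                "hidden_dropout_prob", "attention_probs_dropout_prob"):
        if hasattr(model_config, key):
            setattr(model_config, key, dropout)
```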
Add support for periodic evaluation during reward model (RM) training, configurable via the added arguments --eval_interval and --eval_iters. The default configuration is backward compatible. In addition, also display the score of the rejected predictions. Change-Id: Ib377fd731fe676c01114c087581a30777a3f3f49 Signed-off-by: Moshe Island <[email protected]>
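A rough sketch of the periodic-evaluation idea, called every --eval_interval training steps. The reward model's output keys ("chosen_mean_scores", "rejected_mean_scores") and the helper name are assumptions, not the PR's exact code.

```python
import itertools
import torch


@torch.no_grad()
def evaluate_reward_model(model, eval_dataloader, eval_iters=100):
    """Run at most `eval_iters` batches; report mean chosen and rejected scores."""
    model.eval()
    chosen, rejected = [], []
    for batch in itertools.islice(iter(eval_dataloader), eval_iters):
        out = model(**batch)  # assumed to return chosen/rejected mean scores
        chosen.append(out["chosen_mean_scores"].float().mean())
        rejected.append(out["rejected_mean_scores"].float().mean())
    model.train()
    return torch.stack(chosen).mean().item(), torch.stack(rejected).mean().item()
```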
Calculating the loss in fp32 improved accuracy for bf16 training in all 3 stages. By default, all 3 stages now calculate the loss in fp32 when using bf16; this can be disabled with --no_bf16_to_fp32_loss. While at it, fix stage 2 reward model creation: pass zero_stage to create_critic_model. Change-Id: I9c8e95d4886cdb44aaa6c14c4aee738e133ae405 Signed-off-by: Moshe Island <[email protected]>
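A minimal sketch of computing the causal LM loss in fp32 for bf16 runs; the flag name mirrors the commit's intent, but the function itself is illustrative.

```python
import torch
import torch.nn.functional as F


def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor,
                   bf16_to_fp32_loss: bool = True) -> torch.Tensor:
    # Shift so that tokens < n predict token n.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    if bf16_to_fp32_loss and shift_logits.dtype == torch.bfloat16:
        # bf16 keeps fp32's range but has few mantissa bits; the softmax/NLL
        # reduction is noticeably more accurate in fp32.
        shift_logits = shift_logits.float()
    return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1), ignore_index=-100)
```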
The current default name used to detect LN layers is "LayerNorm.weight". This does not work for the following models:
- opt: uses "layer_norm"
- llama: uses "norm" and "layernorm"
- bloom: uses "layernorm" and "ln_f"
Therefore, modify the default names to accommodate the above. Also, compare names in lowercase to capture models with different capitalization.
Change-Id: I5b805df2663c62daf3d9c8a31a973742e344e76b Signed-off-by: Moshe Island <[email protected]>
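A sketch of case-insensitive matching when splitting parameters into decay / no-decay groups; the default name list follows the commit message, while the function shape is illustrative.

```python
def get_optimizer_grouped_parameters(
        model, weight_decay,
        no_decay_name_list=("bias", "layer_norm", "layernorm", "norm", "ln_f")):
    decay_params, no_decay_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Compare in lowercase so e.g. "LayerNorm" and "layernorm" both match.
        if any(nd in name.lower() for nd in no_decay_name_list):
            no_decay_params.append(param)
        else:
            decay_params.append(param)
    return [
        {"params": decay_params, "weight_decay": weight_decay},
        {"params": no_decay_params, "weight_decay": 0.0},
    ]
```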
When using LoRA only, get_optimizer_grouped_parameters() returns a list of 3 parameter groups, of which only the second is non-empty. DeepSpeed then removes the empty parameter groups [ref: DeepSpeedEngine._configure_optimizer(), deepspeed v0.10.3]. However, the lr_scheduler still contains 3 groups, which causes the LR scheduler to update the LoRA params with the wrong LR. Fix it by removing all empty groups in get_optimizer_grouped_parameters(). Change-Id: I520841312bdedd6a572cf4c827e0bbf06f983575 Signed-off-by: Moshe Island <[email protected]>
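A small illustrative sketch of the fix, applied to the groups returned above (not the exact diff):

```python
def drop_empty_param_groups(param_groups):
    # Keep optimizer and LR-scheduler group counts consistent by removing
    # groups that hold no parameters (e.g. when only LoRA params are trained).
    return [g for g in param_groups if len(g["params"]) > 0]
```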
When optimizing LoRA parameters only, we still need to train the v_head parameter. Change-Id: I252c3ee69819997bf336482c6779b070f2e76df8 Signed-off-by: Moshe Island <[email protected]>
The Bloom-560m model has high variance in its last LN layer weight, which causes accuracy issues in bf16 stage 2 training. Therefore, reset the parameters of the last LN layer before training. This is good practice in any case where we replace the classifier that follows the LN. In addition, when optimizing LoRA parameters only, we need to force training of the LN parameters that were reset. Change-Id: I323d8947907eb4a1cc0fa6354bdaf0cbbf33a68d Signed-off-by: Moshe Island <[email protected]>
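An illustrative sketch of resetting the final LayerNorm before training the new reward head; the helper operates on a generic nn.LayerNorm and is an assumption, not the PR's code.

```python
import torch.nn as nn


def reset_final_layer_norm(ln: nn.LayerNorm, force_trainable: bool = True) -> None:
    # Replace the high-variance pretrained affine parameters with the identity.
    nn.init.ones_(ln.weight)
    nn.init.zeros_(ln.bias)
    if force_trainable:
        # When only LoRA params are optimized, these reset params must still train.
        ln.weight.requires_grad_(True)
        ln.bias.requires_grad_(True)
```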
Currently, ppl is calculated per local worker and then averaged over data parallel workers. Fix it by first averaging the loss over data parallel workers and then calculating the ppl of the averaged loss. While at it, print the loss in evaluate. Change-Id: Ic4108ca48a18b326677d80c1eee81c535b3a27a9 Signed-off-by: Moshe Island <[email protected]>
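A sketch of the corrected order of operations (average the loss across data parallel ranks, then exponentiate); the helper name is illustrative.

```python
import torch
import torch.distributed as dist


def perplexity_from_mean_loss(local_loss: torch.Tensor) -> torch.Tensor:
    loss = local_loss.detach().clone()
    if dist.is_available() and dist.is_initialized():
        # Average the loss over data parallel workers first...
        dist.all_reduce(loss, op=dist.ReduceOp.SUM)
        loss = loss / dist.get_world_size()
    # ...then compute perplexity of the averaged loss (exp is not linear, so
    # averaging per-worker perplexities would give a different, biased number).
    return torch.exp(loss)
```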
Fix args when calling create_critic_model(). Change-Id: I845a4f024ca50915076184692f44ee8a1b7016a2 Signed-off-by: Moshe Island <[email protected]>
Stages 1 & 2 append an '<|endoftext|>' marker to all samples. However, some tokenizers (e.g. OPT, Bloom) encode this marker as a sequence of sub-word tokens. This commit adds optional support for adding the EOT marker as a special token, forcing the tokenizer to encode it as a single token. Change-Id: If98d348fcaa7d6685e755aabe305e23e7649c367 Signed-off-by: Moshe Island <[email protected]>
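A sketch of the optional behavior: register the end-of-conversation marker as an additional special token so the tokenizer emits a single id for it. The model name below is only an example.

```python
from transformers import AutoTokenizer

end_of_conversation = "<|endoftext|>"
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b", use_fast=True)
# Without this, OPT/Bloom tokenizers split the marker into several sub-word tokens.
tokenizer.add_special_tokens({"additional_special_tokens": [end_of_conversation]})
# The embedding table must then be resized on the model side, e.g.:
# model.resize_token_embeddings(len(tokenizer))
```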
At stages 1 and 2, print the average loss periodically based on the DeepSpeed steps_per_print configuration. The commit modifies the current print_loss functionality of stage 1:
- Print the average loss instead of the local iteration loss
- Print only at global rank 0
Change-Id: I430d88cbbbbb2dd2fe7784dbadac69e522d5a192 Signed-off-by: Moshe Island <[email protected]>
Due to the high variance of the reward, also display the reward EMA. While at it, print the total number of iterations. Change-Id: I3a6b287af8087cbc075ba12764035d77070ae93d Signed-off-by: Moshe Island <[email protected]>
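An illustrative EMA tracker for the logged reward; the smoothing factor is an arbitrary choice, not taken from the PR.

```python
class ExpMovingAverage:
    """Track an exponential moving average of a noisy scalar (e.g. the reward)."""

    def __init__(self, alpha: float = 0.9):
        self.alpha = alpha
        self.value = None

    def update(self, x: float) -> float:
        if self.value is None:
            self.value = x
        else:
            self.value = self.alpha * self.value + (1 - self.alpha) * x
        return self.value
```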
In stage 3, if all the generated answers to the given prompts are too short, reuse this worker's last valid micro-batch of prompts and answers. Change-Id: I7878e3b10cc6fa81ce8364ca3e4a3569cfb350a8 Signed-off-by: Moshe Island <[email protected]>
Allow configuring the print-answers interval in stage 3. Change-Id: I6440f401f602e7e7f763b3ec8e45029a74dd72b7 Signed-off-by: Moshe Island <[email protected]>
If prompts are too long, they are still used but are arbitrarily sliced at the start to fit into the configured max prompt length. This arbitrary slicing sometimes makes prompts less meaningful, which in turn causes the generator to produce garbage. This phenomenon was observed to destabilize RLHF stage 3. To fix it, filter out prompts that are too long. In addition, the dataset rebuild flag is propagated to the other required consumers. Change-Id: I440f09decf0784e4c2c8167a893006dff312281b Signed-off-by: Moshe Island <[email protected]>
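A sketch of filtering over-long prompts at dataset-build time instead of slicing them; all names are illustrative.

```python
def filter_long_prompts(tokenized_prompts, max_prompt_len):
    """Drop prompts whose token count exceeds max_prompt_len."""
    kept = [p for p in tokenized_prompts if len(p) <= max_prompt_len]
    print(f"filtered {len(tokenized_prompts) - len(kept)} over-long prompts, "
          f"kept {len(kept)}")
    return kept
```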
Change-Id: I1fd19529c94c89cc62d3b4a2b20b17fc4f4773bf Signed-off-by: Moshe Island <[email protected]>
Change-Id: I40012d374121accbeb2c45729ac5532cf6cfedbb Signed-off-by: Moshe Island <[email protected]>
Change-Id: I205e41f889af0cf0162fc33b8f0c4e40dde4c7a3 Signed-off-by: Moshe Island <[email protected]>
Add support for the Habana Gaudi acceleration device. Main changes include:
- Use the accelerator abstraction layer
- Do not use CUDA kernels (e.g. FusedAdam)
- HPU utilizes graph mode and requires additional APIs (e.g. hpu_mark_step)
Change-Id: Ifb9fc25bfd62a5299859f1203376494b87ca87e0 Signed-off-by: Moshe Island <[email protected]>
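A sketch of going through DeepSpeed's accelerator abstraction instead of hard-coded CUDA calls; the HPU mark-step helper is an assumption about how Habana's lazy/graph mode is driven, not the PR's exact code.

```python
import torch
from deepspeed.accelerator import get_accelerator

# Resolve the device via the abstraction layer, e.g. "cuda" or "hpu".
device = torch.device(get_accelerator().device_name())


def mark_step_if_hpu():
    # HPU lazy/graph mode needs explicit step markers; on CUDA this is a no-op.
    if get_accelerator().device_name() == "hpu":
        import habana_frameworks.torch.core as htcore  # Habana-only dependency
        htcore.mark_step()
```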
Implement reward model loss calculation in a way that prevents dynamic shapes. This speeds up HPU execution. Change-Id: I50c6ebdadca5cf6d3548c31c614730e7dead825c Signed-off-by: Moshe Island <[email protected]>
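A hedged sketch of one way to keep the pairwise reward loss shape-static: score every position and mask, rather than slicing each chosen/rejected pair at its divergence index (which yields varying-length tensors). This illustrates the technique, not the PR's exact implementation.

```python
import torch
import torch.nn.functional as F


def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor,
                         pair_mask: torch.Tensor) -> torch.Tensor:
    # chosen_rewards / rejected_rewards: [batch, seq_len] per-token rewards.
    # pair_mask: 1.0 at positions that should contribute (after divergence,
    # before padding); all tensors keep a fixed [batch, seq_len] shape.
    per_token = -F.logsigmoid(chosen_rewards - rejected_rewards) * pair_mask
    return per_token.sum() / pair_mask.sum().clamp(min=1.0)
```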
Print the mean loss periodically based on the DeepSpeed 'steps_per_print' configuration, so the mean loss is printed on an optimizer step boundary. To reduce log clutter, only rank 0's loss is printed.
This commit modifies the current print_loss functionality of stage 1:
- Print the mean loss at an optimizer step boundary instead of at every micro-step
- Print periodically based on ds_config['steps_per_print']
- Print only at global rank 0
The commit adds print_loss functionality for stage 2.
Change-Id: I430d88cbbbbb2dd2fe7784dbadac69e522d5a192 Signed-off-by: Moshe Island <[email protected]>
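A small illustrative helper capturing the behavior this PR adds (not the actual diff): accumulate micro-step losses and print their mean every steps_per_print optimizer steps, from global rank 0 only.

```python
import torch
import torch.distributed as dist


class MeanLossPrinter:
    """Accumulate micro-step losses; print their mean every `steps_per_print`
    optimizer steps, on global rank 0 only (to reduce log clutter)."""

    def __init__(self, steps_per_print: int):
        self.steps_per_print = steps_per_print  # e.g. ds_config["steps_per_print"]
        self.losses = []

    def record(self, loss: torch.Tensor) -> None:
        self.losses.append(loss.detach().float())

    def maybe_print(self, epoch: int, optimizer_step: int) -> None:
        if optimizer_step % self.steps_per_print != 0 or not self.losses:
            return
        mean_loss = torch.stack(self.losses).mean().item()
        self.losses.clear()
        if not dist.is_initialized() or dist.get_rank() == 0:
            print(f"epoch {epoch}, step {optimizer_step}: mean loss = {mean_loss:.4f}")
```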
mosheisland requested review from jeffra, tjruwase, ShadenSmith, conglongli, awan-10, eltonzheng and minjiaz as code owners on October 19, 2023 08:33
mosheisland requested review from RezaYazdaniAminabadi, duli2012, mrwyattii, yaozhewei, arashb and xiaoxiawu-microsoft as code owners on October 19, 2023 08:33
tjruwase approved these changes on Nov 8, 2023
@mosheisland, apologies for the delay in merging this PR. Can you please help resolve the conflicts? Thanks!
lekurile approved these changes on Dec 8, 2023
Hi @mosheisland - could you review the merge conflicts and we can get this merged?
loadams removed request for arashb, ShadenSmith, jeffra, duli2012, conglongli, awan-10, mrwyattii, yaozhewei, eltonzheng, minjiaz, RezaYazdaniAminabadi and xiaoxiawu-microsoft on November 4, 2024 17:06
Closing as stale and the repo was refactored.