deepspeed-chat: print mean stage1/2 loss periodically #780

Conversation

mosheisland (Contributor)

Print the mean loss periodically, based on the DeepSpeed 'steps_per_print' configuration, so the mean loss is printed on an optimizer step boundary. To reduce log clutter, only the rank 0 loss is printed.

This commit modifies the current print_loss functionality of stage 1:

  • Print the mean loss at the optimizer step boundary instead of at every micro-step
  • Print periodically based on ds_config['steps_per_print']
  • Print only at global rank 0

The commit adds print_loss functionality for stage 2.
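A minimal sketch of the printing logic described above (not the actual patch; `model` is assumed to be the DeepSpeed engine and `ds_config` its config dict):

```python
import torch.distributed as dist

def train_epoch(model, train_dataloader, ds_config):
    grad_accum = ds_config.get("gradient_accumulation_steps", 1)
    steps_per_print = ds_config.get("steps_per_print", 10)
    loss_sum, loss_count = 0.0, 0

    for micro_step, batch in enumerate(train_dataloader):
        loss = model(**batch, use_cache=False).loss
        model.backward(loss)
        model.step()
        loss_sum, loss_count = loss_sum + loss.item(), loss_count + 1

        if (micro_step + 1) % grad_accum == 0:  # optimizer step boundary
            global_step = (micro_step + 1) // grad_accum
            if global_step % steps_per_print == 0 and dist.get_rank() == 0:
                print(f"step {global_step}: mean loss {loss_sum / loss_count:.4f}")
            loss_sum, loss_count = 0.0, 0
```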

Change-Id: I430d88cbbbbb2dd2fe7784dbadac69e522d5a192

Currently, the chatbot assumes an OPTForCausalLM model.
Modify it to load the required model class from the checkpoint.
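A hedged sketch of what loading the required model from the checkpoint could look like with the transformers auto classes (the helper name is illustrative):

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

def load_chatbot_model(checkpoint_path):
    # Let the checkpoint's config pick the architecture instead of
    # hard-coding OPTForCausalLM.
    config = AutoConfig.from_pretrained(checkpoint_path)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
    model = AutoModelForCausalLM.from_pretrained(checkpoint_path, config=config)
    return model, tokenizer
```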

Change-Id: I04cbc28f87c7be4fc89a3fac39a3e5634b151b32
Signed-off-by: Moshe Island <[email protected]>
DeepSpeed's bf16_optimizer does not have an overflow attribute.
This is fine, since the bf16 dtype has the same range as fp32 and is not
expected to overflow.
Therefore, for bf16, always return no overflow.
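One possible way to express this (a sketch, not necessarily the patch; `model_engine` is assumed to be the DeepSpeed engine):

```python
def get_overflow(model_engine):
    # fp16 optimizers expose an 'overflow' flag; DeepSpeed's bf16_optimizer
    # does not, and bf16 shares fp32's exponent range, so report no overflow.
    return getattr(model_engine.optimizer, "overflow", False)
```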

Change-Id: I66a2204f3af81e52e7fa8d024afafdbbc7494327
Signed-off-by: Moshe Island <[email protected]>
Currently, only the disable_dropout configuration is supported.
However, some models (e.g. Bloom) default to dropout=0 in their model config.
Therefore, modify the code to support an explicit dropout configuration.
Also, update the existing training scripts accordingly.
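A minimal sketch of an explicit dropout configuration helper (the field names vary by model family; this is an assumption, not the exact patch):

```python
def configure_dropout(model_config, dropout):
    # Apply the requested dropout to every dropout field the architecture
    # exposes; e.g. OPT uses 'dropout', Bloom uses 'hidden_dropout'.
    if dropout is not None:
        for key in ("dropout", "attention_dropout", "hidden_dropout",
                    "activation_dropout"):
            if hasattr(model_config, key):
                setattr(model_config, key, dropout)
```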

Change-Id: I5ee96a77ca2b58d9787573a48009e2af36a270b0
Signed-off-by: Moshe Island <[email protected]>
Add support for periodic evaluation during reward model (RM) training.
Configurable via the added arguments --eval_interval and --eval_iters.
The default configuration is backward compatible.

In addition, also display the score of the rejected predictions.
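The added arguments could look roughly like this (the defaults shown are assumptions chosen to keep the old behaviour):

```python
import argparse

parser = argparse.ArgumentParser()
# Run evaluation every N optimizer steps; 0 keeps the old behaviour.
parser.add_argument("--eval_interval", type=int, default=0)
# Number of evaluation batches per evaluation run; 0 means the full set.
parser.add_argument("--eval_iters", type=int, default=0)
```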

Change-Id: Ib377fd731fe676c01114c087581a30777a3f3f49
Signed-off-by: Moshe Island <[email protected]>
Computing the loss in fp32 improved accuracy for bf16 training in all 3 stages.
By default, all 3 stages calculate the loss in fp32 when using bf16.
This can be disabled with --no_bf16_to_fp32_loss.
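A sketch of the fp32-loss idea for the causal-LM stages (an assumed helper, not the literal patch):

```python
import torch.nn.functional as F

def causal_lm_loss(logits, labels, compute_fp32_loss=True):
    # Up-cast the logits before cross-entropy so the softmax and loss are
    # computed in fp32 even when the model itself runs in bf16.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    if compute_fp32_loss:
        shift_logits = shift_logits.float()
    return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1))
```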

While at it, fix stage2 reward model creation: pass zero_stage to
create_critic_model.

Change-Id: I9c8e95d4886cdb44aaa6c14c4aee738e133ae405
Signed-off-by: Moshe Island <[email protected]>
The current default name used to detect LN layers is "LayerNorm.weight".
This does not work for the following models:
- opt: uses "layer_norm"
- llama: uses "norm" and "layernorm"
- bloom: uses "layernorm" and "ln_f"

Therefore, modify the default names to accommodate the above.
Also, compare names in lower case to capture models with different capitalization.
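A sketch of the grouping with the extended, lower-cased name list (the exact default list in the patch may differ):

```python
def get_optimizer_grouped_parameters(
        model, weight_decay,
        no_decay_name_list=("bias", "layer_norm", "layernorm", "ln_f", "norm")):
    # Compare lower-cased parameter names so "LayerNorm", "layer_norm",
    # "layernorm", etc. are all matched.
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if any(nd in name.lower() for nd in no_decay_name_list):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```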

Change-Id: I5b805df2663c62daf3d9c8a31a973742e344e76b
Signed-off-by: Moshe Island <[email protected]>
When using LoRA only, get_optimizer_grouped_parameters() returns a list of 3
parameter groups, where only the second is non-empty.
DeepSpeed then removes the empty parameter groups
[ref: DeepSpeedEngine._configure_optimizer(), deepspeed v0.10.3].
However, the lr_scheduler still contains 3 groups.
This causes the LR scheduler to update the LoRA params with the wrong LR.

Fix it by removing all empty groups in get_optimizer_grouped_parameters().
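The fix amounts to filtering out the empty groups before handing them to DeepSpeed, e.g.:

```python
def drop_empty_groups(optimizer_grouped_parameters):
    # Keep only groups that actually contain parameters, so the optimizer and
    # the lr_scheduler end up with the same number of groups.
    return [g for g in optimizer_grouped_parameters if len(g["params"]) > 0]
```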

Change-Id: I520841312bdedd6a572cf4c827e0bbf06f983575
Signed-off-by: Moshe Island <[email protected]>
When using only_optimize_lora, we still need to train the v_head parameters.
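A sketch of forcing the value head to stay trainable (the helper name is illustrative; `v_head` is the reward model's value-head parameter):

```python
def force_v_head_training(rm_model):
    # With only_optimize_lora, non-LoRA parameters are frozen; the reward
    # model's value head is new and must still be trained.
    for name, param in rm_model.named_parameters():
        if "v_head" in name:
            param.requires_grad = True
```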

Change-Id: I252c3ee69819997bf336482c6779b070f2e76df8
Signed-off-by: Moshe Island <[email protected]>
The Bloom-560m model has high variance in its last LN layer weights.
This causes accuracy issues in bf16 stage 2 training.
Therefore, reset the parameters of the last LN layer before training.
This is good practice whenever we replace the classifier that follows the LN.

In addition, when using only_optimize_lora, we need to force the training of
the LN parameters that were reset.
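A hedged sketch of resetting the final LN and keeping it trainable under only_optimize_lora (locating the LN by scanning modules is an assumption):

```python
import torch.nn as nn

def reset_final_layer_norm(rm_model, only_optimize_lora=False):
    # Locate the last LayerNorm in the base model (e.g. Bloom's ln_f),
    # re-initialize it, and force it to remain trainable if needed.
    last_ln = None
    for module in rm_model.modules():
        if isinstance(module, nn.LayerNorm):
            last_ln = module
    if last_ln is not None:
        last_ln.reset_parameters()
        if only_optimize_lora:
            for param in last_ln.parameters():
                param.requires_grad = True
```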

Change-Id: I323d8947907eb4a1cc0fa6354bdaf0cbbf33a68d
Signed-off-by: Moshe Island <[email protected]>
Currently, ppl is calculated on the local worker and then averaged over
data-parallel workers. Fix it by first averaging the loss over data-parallel
workers and then calculating the ppl of the averaged loss.

While at it, print the loss in evaluate.
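The key point is that exp(mean(loss)) != mean(exp(loss)); a sketch of the corrected order of operations:

```python
import math
import torch.distributed as dist

def loss_and_ppl(mean_local_loss):
    # Average the loss across data-parallel ranks first, then exponentiate.
    loss = mean_local_loss.clone().detach()
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)
    loss = loss / dist.get_world_size()
    try:
        ppl = math.exp(loss.item())
    except OverflowError:
        ppl = float("inf")
    return loss.item(), ppl
```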

Change-Id: Ic4108ca48a18b326677d80c1eee81c535b3a27a9
Signed-off-by: Moshe Island <[email protected]>
Fix args when calling create_critic_model().

Change-Id: I845a4f024ca50915076184692f44ee8a1b7016a2
Signed-off-by: Moshe Island <[email protected]>
Stages 1 & 2 append '<|endoftext|>' marker to all samples.
However, some tokenizers (e.g. OPT, Bloom), encode this marker as a sequence
of subword tokens.

This commit adds an optional support to add the EOT marker as a special token
to force the tokenizer to encode it as a single token.
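A sketch of the optional special-token path (the flag and helper names are assumptions):

```python
from transformers import AutoTokenizer

def load_tokenizer(model_name_or_path, add_eot_token=False):
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    if add_eot_token:
        # Register the marker as a special token so it is encoded as a single
        # id instead of several subword pieces.
        tokenizer.add_special_tokens(
            {"additional_special_tokens": ["<|endoftext|>"]})
    return tokenizer
```

If the marker is new to the vocabulary, the model's embedding matrix would also need resizing, e.g. model.resize_token_embeddings(len(tokenizer)).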

Change-Id: If98d348fcaa7d6685e755aabe305e23e7649c367
Signed-off-by: Moshe Island <[email protected]>
At stages 1 and 2, print the average loss periodically based on the DeepSpeed
configuration option steps_per_print.
The commit modifies the current print_loss functionality of stage 1:
  - Print the average loss instead of the local iteration loss
  - Print only at global rank 0

Change-Id: I430d88cbbbbb2dd2fe7784dbadac69e522d5a192
Signed-off-by: Moshe Island <[email protected]>
Due to the high variance of the reward, also display a reward EMA.
While at it, print the total number of iterations.
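A tiny EMA tracker like the following (the smoothing factor is an assumption) is enough for this display:

```python
class ExponentialMovingAverage:
    # Smooths the noisy per-batch reward for logging purposes.
    def __init__(self, alpha=0.9):
        self.alpha = alpha
        self.ema = None

    def update(self, value):
        self.ema = value if self.ema is None else (
            self.alpha * self.ema + (1.0 - self.alpha) * value)
        return self.ema
```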

Change-Id: I3a6b287af8087cbc075ba12764035d77070ae93d
Signed-off-by: Moshe Island <[email protected]>
In stage 3, if all the generated answers to the given prompts are too short,
use the last valid micro-batch of prompts and answers of this worker.

Change-Id: I7878e3b10cc6fa81ce8364ca3e4a3569cfb350a8
Signed-off-by: Moshe Island <[email protected]>
Enable configuring the print-answers interval in stage 3.

Change-Id: I6440f401f602e7e7f763b3ec8e45029a74dd72b7
Signed-off-by: Moshe Island <[email protected]>
When prompts are too long, they are still used, but they are arbitrarily
sliced at the start to fit the configured max prompt length.
This arbitrary slicing sometimes makes prompts less meaningful, which in turn
causes the generator to produce garbage.
This phenomenon was observed to destabilize RLHF stage 3.
To fix it, we filter out prompts that are too long.

In addition, the dataset rebuild flag is propagated to the other consumers
that require it.
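The filtering itself is simple; a sketch over tokenized prompts (names are illustrative):

```python
def filter_long_prompts(prompt_token_ids, max_prompt_len):
    # Drop prompts that would otherwise be truncated, instead of feeding the
    # generator a possibly meaningless sliced prefix.
    return [ids for ids in prompt_token_ids if len(ids) <= max_prompt_len]
```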

Change-Id: I440f09decf0784e4c2c8167a893006dff312281b
Signed-off-by: Moshe Island <[email protected]>
Change-Id: I1fd19529c94c89cc62d3b4a2b20b17fc4f4773bf
Signed-off-by: Moshe Island <[email protected]>
Change-Id: I40012d374121accbeb2c45729ac5532cf6cfedbb
Signed-off-by: Moshe Island <[email protected]>
Change-Id: I205e41f889af0cf0162fc33b8f0c4e40dde4c7a3
Signed-off-by: Moshe Island <[email protected]>
Add support for the Habana Gaudi acceleration device.
Main changes include:
- Use the accelerator abstraction layer (see the sketch below)
- Do not use CUDA kernels (e.g. FusedAdam)
- HPU utilizes graph mode and requires additional APIs (e.g. hpu_mark_step)
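For example, device selection goes through DeepSpeed's accelerator abstraction instead of hard-coded CUDA calls (a sketch):

```python
import torch
from deepspeed.accelerator import get_accelerator

def get_device(local_rank):
    # Resolves to "cuda", "hpu", etc. depending on the installed accelerator.
    return torch.device(get_accelerator().device_name(), local_rank)
```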

Change-Id: Ifb9fc25bfd62a5299859f1203376494b87ca87e0
Signed-off-by: Moshe Island <[email protected]>
Implement reward model loss calculation in a way that prevents dynamic shapes.
This speeds up HPU execution.
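A hedged sketch of one static-shape formulation: compute the pairwise loss over the full, fixed-length sequence and mask it, rather than slicing per-sample divergence spans (tensor names and the masking scheme are assumptions, not the actual patch):

```python
import torch.nn.functional as F

def pairwise_rm_loss_static(chosen_rewards, rejected_rewards, divergence_mask):
    # chosen_rewards / rejected_rewards: [batch, seq_len] per-token rewards.
    # divergence_mask: 1.0 where chosen and rejected differ (and are not
    # padding), 0.0 elsewhere. Masking over the full length avoids the
    # data-dependent slicing that creates dynamic shapes on HPU.
    per_token = -F.logsigmoid(chosen_rewards - rejected_rewards) * divergence_mask
    return per_token.sum() / divergence_mask.sum().clamp(min=1.0)
```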

Change-Id: I50c6ebdadca5cf6d3548c31c614730e7dead825c
Signed-off-by: Moshe Island <[email protected]>
@tjruwase (Contributor)

@mosheisland, apologies for the delay in merging this PR. Can you please help resolve the conflicts? Thanks!

@loadams (Contributor) commented Jul 18, 2024

> @mosheisland, apologies for the delay in merging this PR. Can you please help resolve the conflicts? Thanks!

Hi @mosheisland - could you review the merge conflicts and we can get this merged?

@loadams (Contributor) commented Nov 4, 2024

Closing as stale, and because the repo was refactored.

@loadams closed this Nov 4, 2024