When I use the default command, it seems to use 29500 as the master_port.
However, the master_port seems unchangeable, even when I pass "--master_port 29501" or set it with "deepspeed.init_distributed(dist_backend='nccl', distributed_port=config.master_port)".
Error message:
[W1120 21:36:50.764587163 TCPStore.cpp:358] [c10d] TCP client failed to connect/validate to host 127.0.0.1:29500 - retrying (try=3, timeout=1800000ms, delay=1496ms): Connection reset by peer
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:667 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc06bba0446 in /data/wujiahao/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/lib/libc10.so)
...
@lovedoubledan, can you share your full command line and the example code?
My command line is:
deepspeed --include localhost:4,7 train_stage2.py \
    --config_file config/gptir3_notokenloss_plus.yaml \
    --deepspeed --deepspeed_config config/deepspeed_config/gptir.json --master_port 20815
And my code looks like this:
import argparse

import deepspeed

parser = argparse.ArgumentParser()
# Input parameters
parser.add_argument('--config_file', type=str, default="config/gptir3_notokenloss_plus.yaml")
parser.add_argument("--local_rank",
                    type=int,
                    default=-1,
                    help="local_rank for distributed training on gpus")
parser.add_argument("--master_port",
                    type=int,
                    default=20815)
# Adds the --deepspeed and --deepspeed_config arguments
parser = deepspeed.add_config_arguments(parser)
# parser.add_argument('--deepspeed_config', type=str, default="config/deepspeed_config/gptir.json")
config = parser.parse_args()
...
model_engine, optimizer, _, _ = deepspeed.initialize(args=config,
                                                     model=net,
                                                     model_parameters=net.configure_parameters(),
                                                     distributed_port=config.master_port)
Hi @lovedoubledan - can you share a repro code snippet with us? Also do you see any warnings printed about the port? And could you try setting the master port in the ds_config as well to see if that works?
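To help answer the question about the port, here is a minimal diagnostic sketch that could be dropped into the training script before deepspeed.initialize is called. It assumes the worker processes receive the rendezvous port through the MASTER_PORT environment variable, which is the convention for torch.distributed's env:// initialization; whether that value overrides distributed_port in this setup is an assumption, not something confirmed in this thread. The helper names port_is_free and effective_master_port are hypothetical and are not part of DeepSpeed.

```python
import os
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding to (host, port) succeeds, i.e. nothing holds the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": another process owns the port.
            return False
        return True


def effective_master_port(default: int = 29500) -> int:
    """Port found in the MASTER_PORT environment variable, else the given default."""
    return int(os.environ.get("MASTER_PORT", default))


if __name__ == "__main__":
    port = effective_master_port()
    print(f"MASTER_PORT seen by this process: {port}")
    print(f"Port {port} free on 127.0.0.1: {port_is_free(port)}")
```

If the printed port is still 29500 while the script's own --master_port argument is 20815, the value passed after the script name is probably reaching only the script's parser. In that case it may be worth placing --master_port before train_stage2.py so that the deepspeed launcher itself consumes it.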