🐛 Bug
I'm trying to use all available computational resources to speed up training. My code works fine in single-node, multi-GPU mode (so most of my DDP training setup appears to be correct).
But as soon as I go multi-node, the run always hangs during DDP initialization (it stops at 2/4; I guess one node initializes correctly while the other gets stuck).
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/8
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/8
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/8
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/8
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/8
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
Traceback (most recent call last):
  File "/home/tw2112/codes/s2s/aux_with_neg_wiki/cool_test/test.py", line 74, in <module>
    trainer.fit(model)
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 863, in _run
    self.accelerator.setup_environment()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu.py", line 30, in setup_environment
    super().setup_environment()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 76, in setup_environment
    self.training_type_plugin.setup_environment()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 168, in setup_environment
    self.setup_distributed()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 253, in setup_distributed
    self.init_ddp_connection()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 323, in init_ddp_connection
    torch.distributed.init_process_group(
  File "/ext3/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 525, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/ext3/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 212, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=8, worker_count=3, timeout=0:30:00)
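Note that GLOBAL_RANK 0-3 each appear twice in the log even though world_size is 8: it looks like both nodes resolve the same global ranks (i.e. each node believes it is node 0), so the store-based barrier only ever counts the workers from one node (worker_count=3 of 8 when rank 2 timed out). A quick way to compare what each process sees is to print the rank-related environment variables at the top of the script. This is a diagnostic sketch added here for illustration, not part of the original run; SLURM_PROCID and friends are the variables Lightning's SLURM integration reads, and MASTER_ADDR / MASTER_PORT are the torch.distributed rendezvous variables:

import os
import socket

# Diagnostic: print the variables that determine rank / world-size resolution
# on every process, so the output of the two nodes can be compared.
vars_to_check = (
    "SLURM_JOB_ID", "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NODEID",
    "SLURM_NTASKS", "NODE_RANK", "MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE",
)
print(socket.gethostname(), {v: os.environ.get(v) for v in vars_to_check})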
To Reproduce
I followed #8707 and borrowed the simple model from it; the problem still occurs:
test.py:
import logging
import os
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
import torchvision.transforms as transforms
import pytorch_lightning as ptl

def get_logger(name=__name__, level=logging.INFO):
    """Initializes python logger."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # this ensures all logging levels get marked with the rank zero decorator
    # otherwise logs would get multiplied for each GPU process in multi-GPU setup
    for level in ("debug", "info", "warning", "error", "exception", "fatal", "critical"):
        setattr(logger, level, ptl.utilities.rank_zero_only(getattr(logger, level)))
    return logger

log = get_logger(__name__)

class CoolModel(ptl.LightningModule):
    def __init__(self):
        super(CoolModel, self).__init__()
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def my_loss(self, y_hat, y):
        return F.cross_entropy(y_hat, y)

    def training_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        return {'loss': self.my_loss(y_hat, y)}

    def validation_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        return {'val_loss': self.my_loss(y_hat, y)}

    def validation_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        return {'avg_val_loss': avg_loss}

    def configure_optimizers(self):
        return [torch.optim.Adam(self.parameters(), lr=0.02)]

    def train_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    def val_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    def test_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

from pytorch_lightning import Trainer

model = CoolModel()
trainer = Trainer(max_epochs=1, gpus=4, num_nodes=2, accelerator='ddp')
trainer.fit(model)
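For context on the timeout above: with gpus=4 and num_nodes=2, Lightning computes world_size = 8, and init_ddp_connection boils down to a standard torch.distributed call along these lines. This is a simplified sketch of the usual env:// rendezvous, not the exact library code, and reading the rank from SLURM_PROCID is my assumption about the SLURM case:

import os
import torch.distributed as dist

# All 8 processes must check in at MASTER_ADDR:MASTER_PORT; the store-based
# barrier inside init_process_group raises the RuntimeError above when, after
# the 30-minute timeout, only some of them have arrived (worker_count=3 of 8).
dist.init_process_group(
    backend="nccl",
    init_method="env://",  # rendezvous via MASTER_ADDR / MASTER_PORT
    world_size=int(os.environ.get("WORLD_SIZE", 8)),  # num_nodes * gpus = 2 * 4
    rank=int(os.environ.get("SLURM_PROCID", 0)),      # assumed rank source under SLURM
)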
Due to my HPC cluster, I have to use Singularity to load the conda environment instead of the module command used in other posts.
run.slurm:
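(The content of run.slurm did not survive into this report. The following is only a minimal sketch of what a matching submission script could look like; the job name, time limit, image path, and script path are placeholder assumptions. The key constraint, per the Lightning docs, is that --ntasks-per-node must equal the number of GPUs requested per node.)

#!/bin/bash
#SBATCH --job-name=ddp-test      # placeholder
#SBATCH --nodes=2                # matches num_nodes=2
#SBATCH --ntasks-per-node=4      # must equal the GPUs per node (gpus=4)
#SBATCH --gres=gpu:4
#SBATCH --time=04:00:00          # placeholder

# srun launches one task per GPU on each node; Lightning reads the SLURM_*
# variables of these tasks to assign global ranks across both nodes.
srun singularity exec --nv /path/to/env.sif python test.py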
Expected behavior
All processes start and DDP initialization completes across both nodes.
Environment
- CUDA:
  - GPU:
    - NVIDIA Quadro RTX 8000
    - NVIDIA Quadro RTX 8000
    - NVIDIA Quadro RTX 8000
    - NVIDIA Quadro RTX 8000
  - available: True
  - version: 11.1
- Packages:
  - numpy: 1.21.2
  - pyTorch_debug: False
  - pyTorch_version: 1.8.1
  - pytorch-lightning: 1.4.9
  - tqdm: 4.62.2
- System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.8.5
  - version: #1 SMP Fri Oct 16 13:38:49 EDT 2020
This is the environment output of one node.
Additional context
I'm new to SLURM... maybe I made some simple mistake.
cc @awaelchli @rohitgr7