
Multi-node DDP training hangs during initialization #10098

@ElderWanng

Description


🐛 Bug

I'm trying to use all available compute to speed up training. My code works fine in single-node, multi-GPU mode (so most of the DDP setup appears to be correct).

But with multiple nodes, the run always hangs during DDP initialization (it stops at 2/4; I guess one node initializes correctly while the other gets stuck).

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/8
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/8
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/8
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/8
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/8
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
Traceback (most recent call last):
  File "/home/tw2112/codes/s2s/aux_with_neg_wiki/cool_test/test.py", line 74, in <module>
    trainer.fit(model)
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 863, in _run
    self.accelerator.setup_environment()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu.py", line 30, in setup_environment
    super().setup_environment()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 76, in setup_environment
    self.training_type_plugin.setup_environment()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 168, in setup_environment
    self.setup_distributed()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 253, in setup_distributed
    self.init_ddp_connection()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 323, in init_ddp_connection
    torch.distributed.init_process_group(
  File "/ext3/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 525, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/ext3/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 212, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=8, worker_count=3, timeout=0:30:00)
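Notably, the log above shows each GLOBAL_RANK (0 through 3) printed twice, as if both nodes derive the same ranks. A small diagnostic I can run on each task to dump the rendezvous-related environment variables (a sketch for debugging only; these are the standard SLURM/torch.distributed variables, not part of the failing run):

```python
import os

# Variables that PyTorch Lightning / torch.distributed consult when setting
# up DDP under SLURM; a mismatch across nodes would explain a rendezvous hang.
VARS = [
    "MASTER_ADDR", "MASTER_PORT",
    "SLURM_JOB_ID", "SLURM_NTASKS", "SLURM_NNODES",
    "SLURM_NODEID", "SLURM_PROCID", "SLURM_LOCALID",
]

def dump_dist_env(environ=os.environ):
    """Return the rendezvous-related variables as a dict (unset -> None)."""
    return {name: environ.get(name) for name in VARS}

if __name__ == "__main__":
    for name, value in dump_dist_env().items():
        print(f"{name}={value}")
```

Running this via the same `srun` line as the real job would show whether SLURM_PROCID/SLURM_NODEID are actually distinct on the two nodes.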



To Reproduce

I followed #8707 and borrowed the simple model from it; the problem still occurs:

test.py:

import logging
import os
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
import torchvision.transforms as transforms

import pytorch_lightning as ptl

def get_logger(name=__name__, level=logging.INFO):
    """Initializes python logger."""

    logger = logging.getLogger(name)
    logger.setLevel(level)

    # this ensures all logging levels get marked with the rank zero decorator
    # otherwise logs would get multiplied for each GPU process in multi-GPU setup
    for level in ("debug", "info", "warning", "error", "exception", "fatal", "critical"):
        setattr(logger, level, ptl.utilities.rank_zero_only(getattr(logger, level)))

    return logger


log = get_logger(__name__)

class CoolModel(ptl.LightningModule):

    def __init__(self):
        super(CoolModel, self).__init__()
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def my_loss(self, y_hat, y):
        return F.cross_entropy(y_hat, y)

    def training_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        return {'loss': self.my_loss(y_hat, y)}

    def validation_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        return {'val_loss': self.my_loss(y_hat, y)}

    def validation_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        return {'avg_val_loss': avg_loss}

    def configure_optimizers(self):
        return [torch.optim.Adam(self.parameters(), lr=0.02)]

    def train_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    def val_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    def test_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)


from pytorch_lightning import Trainer

model = CoolModel()


trainer = Trainer(max_epochs=1, gpus=4, num_nodes=2, accelerator='ddp')

trainer.fit(model)

On my HPC cluster I have to use Singularity to load the conda environment, instead of the `module` command used in other posts.

run.slurm

#!/bin/bash
#SBATCH --output=./%j_%x.out
#SBATCH --error=./%j_%x.err
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=4
#SBATCH --export=ALL
#SBATCH --time=2-00:00:00
#SBATCH --gres=gpu:4
#SBATCH --mem=10G
#SBATCH --account=cds
#SBATCH -c 4



srun singularity exec --nv  --overlay $SCRATCH/overlay2/overlay-50G-10M.ext3:ro   /scratch/work/public/singularity/cuda11.1-cudnn8-devel-ubuntu18.04.sif /bin/bash -c "
cd
source /ext3/env.sh
conda activate
cd /home/tw2112/codes/s2s/aux_with_neg_wiki/cool_test
python test.py
"

Expected behavior

All eight processes start and training proceeds across both nodes.

Environment

  • CUDA:
    - GPU:
    - NVIDIA Quadro RTX 8000
    - NVIDIA Quadro RTX 8000
    - NVIDIA Quadro RTX 8000
    - NVIDIA Quadro RTX 8000
    - available: True
    - version: 11.1
  • Packages:
    - numpy: 1.21.2
    - pyTorch_debug: False
    - pyTorch_version: 1.8.1
    - pytorch-lightning: 1.4.9
    - tqdm: 4.62.2
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - ELF
    - processor: x86_64
    - python: 3.8.5
    - version: #1 SMP Fri Oct 16 13:38:49 EDT 2020

This is the environment output from one node.

Additional context

I'm new to SLURM, so I may have made a simple mistake.

cc @awaelchli @rohitgr7
