nccl error when loading large data #12905

Open
1 task done
Leo-aetech opened this issue May 20, 2024 · 2 comments
Labels
question Further information is requested

Comments


Leo-aetech commented May 20, 2024

Search before asking

Question

I ran the code to train on Objects365.
I keep getting timeout errors in the data loader.
Because of the NCCL timeout error, I tried changing all the related environment variables as shown below. However, the time limit still does not change and the problem persists. I would appreciate it if you could let me know what I missed.

I have been searching the existing issues and Google for 3 weeks without finding a solution. Please help.

Code
`import os

os.environ["NCCL_TIMEOUT"] = "28800"
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"

os.environ["NCCL_BLOCKING_WAIT"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"`
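
One possible reason the time limit does not change: as far as I know, `NCCL_TIMEOUT` is not a variable that NCCL or PyTorch documents. In PyTorch the collective timeout is normally passed as the `timeout` argument of `torch.distributed.init_process_group`, and `NCCL_BLOCKING_WAIT` only controls whether that timeout is enforced. The sketch below shows that mechanism in isolation. The environment variable name `DDP_TIMEOUT_SECONDS` and both helper functions are hypothetical, and whether or where Ultralytics exposes this call is not shown in this issue.

```python
import os
from datetime import timedelta


def ddp_timeout_from_env(var="DDP_TIMEOUT_SECONDS", default_seconds=1800):
    """Return a timedelta read from a hypothetical env var holding whole seconds.

    Falls back to default_seconds when the variable is unset, not an integer,
    or not positive.
    """
    raw = os.environ.get(var, "")
    try:
        seconds = int(raw)
    except ValueError:
        seconds = default_seconds
    if seconds <= 0:
        seconds = default_seconds
    return timedelta(seconds=seconds)


def init_distributed(timeout):
    """Create the default process group with an explicit collective timeout."""
    # Imported lazily so ddp_timeout_from_env() can be used without torch installed.
    import torch.distributed as dist

    # Relies on RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT being set,
    # which torch.distributed.run does for each worker process.
    dist.init_process_group(backend="nccl", timeout=timeout)
```

In a script launched by `torch.distributed.run`, this would be called as `init_distributed(ddp_timeout_from_env())` before any collective operation runs.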

Additional

log
(data_manager) aetech@aetech:~/PycharmProjects/torch/JSW_test/ultralytics$ python train.py --model yolov8m --ex "Object365 experiment" --run "Yolov8-m-gpu*4"
Current time: 2024-05-20 08:48:09.332108
NCCL_TIMEOUT: 28800
NCCL_DEBUG: INFO
New https://pypi.org/project/ultralytics/8.2.18 available 😃 Update with 'pip install -U ultralytics'
Ultralytics YOLOv8.2.2 🚀 Python-3.9.18 torch-1.12.1+cu116 CUDA:0 (NVIDIA GeForce RTX 3090, 24268MiB)
CUDA:1 (NVIDIA GeForce RTX 3090, 24268MiB)
CUDA:2 (NVIDIA GeForce RTX 3090, 24268MiB)
CUDA:3 (NVIDIA GeForce RTX 3090, 24266MiB)
WARNING ⚠️ Upgrade to torch>=2.0.0 for deterministic training.
engine/trainer: task=detect, mode=train, model=yolov8m.pt, data=Objects365.yaml, epochs=150, time=None, patience=100, batch=32, imgsz=640, save=True, save_period=-1, cache=False, device=[0, 1, 2, 3], workers=0, project=None, name=train100, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=runs/detect/train100
Overriding model.yaml nc=80 with nc=365

                   from  n    params  module                                       arguments
  0                  -1  1      1392  ultralytics.nn.modules.conv.Conv             [3, 48, 3, 2]
  1                  -1  1     41664  ultralytics.nn.modules.conv.Conv             [48, 96, 3, 2]
  2                  -1  2    111360  ultralytics.nn.modules.block.C2f             [96, 96, 2, True]
  3                  -1  1    166272  ultralytics.nn.modules.conv.Conv             [96, 192, 3, 2]
  4                  -1  4    813312  ultralytics.nn.modules.block.C2f             [192, 192, 4, True]
  5                  -1  1    664320  ultralytics.nn.modules.conv.Conv             [192, 384, 3, 2]
  6                  -1  4   3248640  ultralytics.nn.modules.block.C2f             [384, 384, 4, True]
  7                  -1  1   1991808  ultralytics.nn.modules.conv.Conv             [384, 576, 3, 2]
  8                  -1  2   3985920  ultralytics.nn.modules.block.C2f             [576, 576, 2, True]
  9                  -1  1    831168  ultralytics.nn.modules.block.SPPF            [576, 576, 5]
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 11             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 12                  -1  2   1993728  ultralytics.nn.modules.block.C2f             [960, 384, 2]
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 14             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 15                  -1  2    517632  ultralytics.nn.modules.block.C2f             [576, 192, 2]
 16                  -1  1    332160  ultralytics.nn.modules.conv.Conv             [192, 192, 3, 2]
 17            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 18                  -1  2   1846272  ultralytics.nn.modules.block.C2f             [576, 384, 2]
 19                  -1  1   1327872  ultralytics.nn.modules.conv.Conv             [384, 384, 3, 2]
 20             [-1, 9]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 21                  -1  2   4207104  ultralytics.nn.modules.block.C2f             [960, 576, 2]
 22        [15, 18, 21]  1   3987031  ultralytics.nn.modules.head.Detect           [365, [192, 384, 576]]
Model summary: 295 layers, 26067655 parameters, 26067639 gradients, 80.2 GFLOPs

Transferred 469/475 items from pretrained weights
DDP: debug command /home/aetech/anaconda3/envs/data_manager/bin/python -m torch.distributed.run --nproc_per_node 4 --master_port 58543 /home/aetech/.config/Ultralytics/DDP/_temp_wp_ynbyp140653279240144.py
WARNING:__main__:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

Ultralytics YOLOv8.2.11 🚀 Python-3.9.18 torch-1.12.1+cu116 CUDA:0 (NVIDIA GeForce RTX 3090, 24268MiB)
CUDA:1 (NVIDIA GeForce RTX 3090, 24268MiB)
CUDA:2 (NVIDIA GeForce RTX 3090, 24268MiB)
CUDA:3 (NVIDIA GeForce RTX 3090, 24266MiB)
WARNING ⚠️ Upgrade to torch>=2.0.0 for deterministic training.
Overriding model.yaml nc=80 with nc=365
Transferred 469/475 items from pretrained weights
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed ✅
aetech:1311:1311 [0] NCCL INFO Bootstrap : Using enp179s0f0:192.168.0.58<0>
aetech:1311:1311 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
aetech:1311:1311 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
aetech:1311:1311 [0] NCCL INFO NET/Socket : Using [0]enp179s0f0:192.168.0.58<0>
aetech:1311:1311 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.6
aetech:1311:1367 [0] NCCL INFO bootstrap.cc:107 Mem Alloc Size 28 pointer 0x7f8b18000b20
aetech:1311:1368 [0] NCCL INFO init.cc:260 Mem Alloc Size 18872 pointer 0x7f8b10002f70
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b1000d190
aetech:1311:1368 [0] NCCL INFO init.cc:279 Cuda Host Alloc Size 4 pointer 0x7f8b9cc00200
aetech:1311:1368 [0] NCCL INFO init.cc:286 Mem Alloc Size 311296 pointer 0x7f8b1000d810
aetech:1311:1368 [0] NCCL INFO include/enqueue.h:50 Mem Alloc Size 24 pointer 0x7f8b10059820
aetech:1311:1368 [0] NCCL INFO init.cc:305 Mem Alloc Size 16 pointer 0x7f8b10059870
aetech:1311:1368 [0] NCCL INFO init.cc:306 Mem Alloc Size 16 pointer 0x7f8b10059890
aetech:1311:1368 [0] NCCL INFO init.cc:309 Mem Alloc Size 32 pointer 0x7f8b100598b0
aetech:1311:1368 [0] NCCL INFO init.cc:310 Mem Alloc Size 32 pointer 0x7f8b100598e0
aetech:1311:1368 [0] NCCL INFO bootstrap.cc:330 Mem Alloc Size 128 pointer 0x7f8b10059910
aetech:1311:1367 [0] NCCL INFO bootstrap.cc:121 Mem Alloc Size 112 pointer 0x7f8b18008430
aetech:1311:1367 [0] NCCL INFO bootstrap.cc:122 Mem Alloc Size 112 pointer 0x7f8b180084b0
aetech:1312:1312 [1] NCCL INFO Bootstrap : Using enp179s0f0:192.168.0.58<0>
aetech:1314:1314 [2] NCCL INFO Bootstrap : Using enp179s0f0:192.168.0.58<0>
aetech:1315:1315 [3] NCCL INFO Bootstrap : Using enp179s0f0:192.168.0.58<0>
aetech:1312:1312 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
aetech:1314:1314 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
aetech:1315:1315 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
aetech:1315:1315 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
aetech:1312:1312 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
aetech:1314:1314 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
aetech:1314:1314 [2] NCCL INFO NET/Socket : Using [0]enp179s0f0:192.168.0.58<0>
aetech:1312:1312 [1] NCCL INFO NET/Socket : Using [0]enp179s0f0:192.168.0.58<0>
aetech:1315:1315 [3] NCCL INFO NET/Socket : Using [0]enp179s0f0:192.168.0.58<0>
aetech:1314:1314 [2] NCCL INFO Using network Socket
aetech:1312:1312 [1] NCCL INFO Using network Socket
aetech:1315:1315 [3] NCCL INFO Using network Socket
aetech:1314:1369 [2] NCCL INFO init.cc:260 Mem Alloc Size 18872 pointer 0x7f0848002f70
aetech:1315:1371 [3] NCCL INFO init.cc:260 Mem Alloc Size 18872 pointer 0x7f0818002f70
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f084800d190
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f081800d190
aetech:1312:1370 [1] NCCL INFO init.cc:260 Mem Alloc Size 18872 pointer 0x7f47e8002f70
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e800d190
aetech:1314:1369 [2] NCCL INFO init.cc:279 Cuda Host Alloc Size 4 pointer 0x7f084fc00000
aetech:1315:1371 [3] NCCL INFO init.cc:279 Cuda Host Alloc Size 4 pointer 0x7f081fc00000
aetech:1314:1369 [2] NCCL INFO init.cc:286 Mem Alloc Size 311296 pointer 0x7f084800df80
aetech:1314:1369 [2] NCCL INFO include/enqueue.h:50 Mem Alloc Size 24 pointer 0x7f0848059f90
aetech:1314:1369 [2] NCCL INFO init.cc:305 Mem Alloc Size 16 pointer 0x7f0848059fe0
aetech:1314:1369 [2] NCCL INFO init.cc:306 Mem Alloc Size 16 pointer 0x7f084805a000
aetech:1314:1369 [2] NCCL INFO init.cc:309 Mem Alloc Size 32 pointer 0x7f084805a020
aetech:1314:1369 [2] NCCL INFO init.cc:310 Mem Alloc Size 32 pointer 0x7f084805a050
aetech:1312:1370 [1] NCCL INFO init.cc:279 Cuda Host Alloc Size 4 pointer 0x7f47efc00000
aetech:1314:1369 [2] NCCL INFO bootstrap.cc:330 Mem Alloc Size 128 pointer 0x7f084805a080
aetech:1315:1371 [3] NCCL INFO init.cc:286 Mem Alloc Size 311296 pointer 0x7f081800df80
aetech:1315:1371 [3] NCCL INFO include/enqueue.h:50 Mem Alloc Size 24 pointer 0x7f0818059f90
aetech:1315:1371 [3] NCCL INFO init.cc:305 Mem Alloc Size 16 pointer 0x7f0818059fe0
aetech:1315:1371 [3] NCCL INFO init.cc:306 Mem Alloc Size 16 pointer 0x7f081805a000
aetech:1315:1371 [3] NCCL INFO init.cc:309 Mem Alloc Size 32 pointer 0x7f081805a020
aetech:1315:1371 [3] NCCL INFO init.cc:310 Mem Alloc Size 32 pointer 0x7f081805a050
aetech:1312:1370 [1] NCCL INFO init.cc:286 Mem Alloc Size 311296 pointer 0x7f47e800df80
aetech:1315:1371 [3] NCCL INFO bootstrap.cc:330 Mem Alloc Size 128 pointer 0x7f081805a080
aetech:1312:1370 [1] NCCL INFO include/enqueue.h:50 Mem Alloc Size 24 pointer 0x7f47e8059f90
aetech:1312:1370 [1] NCCL INFO init.cc:305 Mem Alloc Size 16 pointer 0x7f47e8059fe0
aetech:1312:1370 [1] NCCL INFO init.cc:306 Mem Alloc Size 16 pointer 0x7f47e805a000
aetech:1312:1370 [1] NCCL INFO init.cc:309 Mem Alloc Size 32 pointer 0x7f47e805a020
aetech:1312:1370 [1] NCCL INFO init.cc:310 Mem Alloc Size 32 pointer 0x7f47e805a050
aetech:1312:1370 [1] NCCL INFO bootstrap.cc:330 Mem Alloc Size 128 pointer 0x7f47e805a080
aetech:1312:1370 [1] NCCL INFO bootstrap.cc:376 Mem Alloc Size 112 pointer 0x7f47e805a110
aetech:1314:1369 [2] NCCL INFO bootstrap.cc:376 Mem Alloc Size 112 pointer 0x7f084805a110
aetech:1315:1371 [3] NCCL INFO bootstrap.cc:376 Mem Alloc Size 112 pointer 0x7f081805a110
aetech:1311:1368 [0] NCCL INFO bootstrap.cc:376 Mem Alloc Size 112 pointer 0x7f8b100599a0
aetech:1315:1371 [3] NCCL INFO bootstrap.cc:381 Mem Alloc Size 112 pointer 0x7f081805a190
aetech:1311:1368 [0] NCCL INFO bootstrap.cc:381 Mem Alloc Size 112 pointer 0x7f8b10059a20
aetech:1315:1371 [3] NCCL INFO bootstrap.cc:383 Mem Alloc Size 12 pointer 0x7f081805a210
aetech:1314:1369 [2] NCCL INFO bootstrap.cc:381 Mem Alloc Size 112 pointer 0x7f084805a190
aetech:1312:1370 [1] NCCL INFO bootstrap.cc:381 Mem Alloc Size 112 pointer 0x7f47e805a190
aetech:1311:1368 [0] NCCL INFO bootstrap.cc:383 Mem Alloc Size 12 pointer 0x7f8b10059aa0
aetech:1314:1369 [2] NCCL INFO bootstrap.cc:383 Mem Alloc Size 12 pointer 0x7f084805a210
aetech:1312:1370 [1] NCCL INFO bootstrap.cc:383 Mem Alloc Size 12 pointer 0x7f47e805a210
aetech:1315:1371 [3] NCCL INFO init.cc:510 Mem Alloc Size 256 pointer 0x7f081805a440
aetech:1314:1369 [2] NCCL INFO init.cc:510 Mem Alloc Size 256 pointer 0x7f084805a440
aetech:1311:1368 [0] NCCL INFO init.cc:510 Mem Alloc Size 256 pointer 0x7f8b10059d90
aetech:1312:1370 [1] NCCL INFO init.cc:510 Mem Alloc Size 256 pointer 0x7f47e805a440
aetech:1312:1370 [1] NCCL INFO init.cc:517 Mem Alloc Size 240 pointer 0x7f47e805ac00
aetech:1314:1369 [2] NCCL INFO init.cc:517 Mem Alloc Size 240 pointer 0x7f084805ac00
aetech:1315:1371 [3] NCCL INFO init.cc:517 Mem Alloc Size 240 pointer 0x7f081805ac00
aetech:1311:1368 [0] NCCL INFO init.cc:517 Mem Alloc Size 240 pointer 0x7f8b1005a4e0
aetech:1312:1370 [1] NCCL INFO graph/topo.cc:582 Mem Alloc Size 9461768 pointer 0x7f47e805ad00
aetech:1314:1369 [2] NCCL INFO graph/topo.cc:582 Mem Alloc Size 9461768 pointer 0x7f084805ad00
aetech:1315:1371 [3] NCCL INFO graph/topo.cc:582 Mem Alloc Size 9461768 pointer 0x7f081805ad00
aetech:1311:1368 [0] NCCL INFO graph/topo.cc:582 Mem Alloc Size 9461768 pointer 0x7f8b1005a5e0
aetech:1314:1369 [2] NCCL INFO graph/topo.cc:539 Mem Alloc Size 1333312 pointer 0x7f0848961c30
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f0848aa7480
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f0848aa74a0
aetech:1314:1369 [2] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f0848aa74c0
aetech:1314:1369 [2] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1314:1369 [2] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f0848aa74e0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f0848aa7500
aetech:1314:1369 [2] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f0848aa7520
aetech:1314:1369 [2] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1314:1369 [2] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f0848aa7540
aetech:1314:1369 [2] NCCL INFO Attribute coll of node net not found
aetech:1314:1369 [2] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:36 Mem Alloc Size 14352 pointer 0x7f084805ad00
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f084805e520
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f0848061d40
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f0848065560
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f0848068d80
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f084806c5a0
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f084806fdc0
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f08480735e0
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f0848076e00
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:36 Mem Alloc Size 57408 pointer 0x7f084807a620
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f0848088670
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08480966c0
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08480a4710
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08480b2760
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08480c07b0
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08480ce800
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08480dc850
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08480ea8a0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO Could not enable P2P between dev 1(=1a000) and dev 0(=19000)
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO Could not enable P2P between dev 0(=19000) and dev 1(=1a000)
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO Could not enable P2P between dev 3(=68000) and dev 2(=67000)
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO Could not enable P2P between dev 2(=67000) and dev 3(=68000)
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:36 Mem Alloc Size 14352 pointer 0x7f08480f88f0
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f08480fc110
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f08480ff930
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f0848103150
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f0848106970
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f084810a190
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f084810d9b0
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f08481111d0
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f08481149f0
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:424 Mem Alloc Size 16 pointer 0x7f0848118210
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:425 Mem Alloc Size 32 pointer 0x7f0848118230
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:36 Mem Alloc Size 14352 pointer 0x7f084805ad00
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f084805e520
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f0848061d40
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f0848065560
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f0848068d80
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f084806c5a0
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f084806fdc0
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f08480735e0
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:36 Mem Alloc Size 57408 pointer 0x7f0848076e00
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f0848084e50
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f0848092ea0
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08480a0ef0
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08480aef40
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08480bcf90
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08480cafe0
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08480d9030
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO Could not enable P2P between dev 1(=1a000) and dev 0(=19000)
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO Could not enable P2P between dev 0(=19000) and dev 1(=1a000)
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO Could not enable P2P between dev 3(=68000) and dev 2(=67000)
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e7080
aetech:1314:1369 [2] NCCL INFO Could not enable P2P between dev 2(=67000) and dev 3(=68000)
aetech:1314:1369 [2] NCCL INFO === System : maxWidth 12.0 totalWidth 12.0 ===
aetech:1314:1369 [2] NCCL INFO CPU/0 (1/1/2)
aetech:1314:1369 [2] NCCL INFO + PCI[12.0] - PCI/17000 (10b5874710b58747)
aetech:1314:1369 [2] NCCL INFO + PCI[12.0] - GPU/19000 (0)
aetech:1314:1369 [2] NCCL INFO + PCI[12.0] - GPU/1A000 (1)
aetech:1314:1369 [2] NCCL INFO + PCI[12.0] - PCI/65000 (10b5874710b58747)
aetech:1314:1369 [2] NCCL INFO + PCI[12.0] - GPU/67000 (2)
aetech:1314:1369 [2] NCCL INFO + PCI[12.0] - GPU/68000 (3)
aetech:1314:1369 [2] NCCL INFO + PCI[3.0] - NIC/B3000
aetech:1314:1369 [2] NCCL INFO ==========================================
aetech:1314:1369 [2] NCCL INFO GPU/19000 :GPU/19000 (0/5000.000000/LOC) GPU/1A000 (2/12.000000/PIX) GPU/67000 (4/12.000000/PHB) GPU/68000 (4/12.000000/PHB) CPU/0 (2/12.000000/PHB)
aetech:1314:1369 [2] NCCL INFO GPU/1A000 :GPU/19000 (2/12.000000/PIX) GPU/1A000 (0/5000.000000/LOC) GPU/67000 (4/12.000000/PHB) GPU/68000 (4/12.000000/PHB) CPU/0 (2/12.000000/PHB)
aetech:1314:1369 [2] NCCL INFO GPU/67000 :GPU/19000 (4/12.000000/PHB) GPU/1A000 (4/12.000000/PHB) GPU/67000 (0/5000.000000/LOC) GPU/68000 (2/12.000000/PIX) CPU/0 (2/12.000000/PHB)
aetech:1314:1369 [2] NCCL INFO GPU/68000 :GPU/19000 (4/12.000000/PHB) GPU/1A000 (4/12.000000/PHB) GPU/67000 (2/12.000000/PIX) GPU/68000 (0/5000.000000/LOC) CPU/0 (2/12.000000/PHB)
aetech:1314:1369 [2] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 10.000000/10.000000, type PHB/PIX, sameChannels 1
aetech:1314:1369 [2] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
aetech:1314:1369 [2] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 10.000000/10.000000, type PHB/PIX, sameChannels 1
aetech:1314:1369 [2] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
aetech:1314:1369 [2] NCCL INFO Pattern 3, crossNic 0, nChannels 1, speed 10.000000/10.000000, type PHB/PIX, sameChannels 1
aetech:1314:1369 [2] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
aetech:1314:1369 [2] NCCL INFO init.cc:645 Mem Alloc Size 3936 pointer 0x7f08480e7080
aetech:1315:1371 [3] NCCL INFO graph/topo.cc:539 Mem Alloc Size 1333312 pointer 0x7f0818961c30
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f0818aa7480
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f0818aa74a0
aetech:1315:1371 [3] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f0818aa74c0
aetech:1315:1371 [3] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1315:1371 [3] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f0818aa74e0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f0818aa7500
aetech:1315:1371 [3] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f0818aa7520
aetech:1315:1371 [3] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1315:1371 [3] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f0818aa7540
aetech:1315:1371 [3] NCCL INFO Attribute coll of node net not found
aetech:1315:1371 [3] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:36 Mem Alloc Size 14352 pointer 0x7f081805ad00
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f081805e520
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f0818061d40
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f0818065560
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f0818068d80
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f081806c5a0
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f081806fdc0
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f08180735e0
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f0818076e00
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:36 Mem Alloc Size 57408 pointer 0x7f081807a620
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f0818088670
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08180966c0
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08180a4710
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08180b2760
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08180c07b0
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08180ce800
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08180dc850
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08180ea8a0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO Could not enable P2P between dev 1(=1a000) and dev 0(=19000)
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO Could not enable P2P between dev 0(=19000) and dev 1(=1a000)
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO Could not enable P2P between dev 3(=68000) and dev 2(=67000)
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO Could not enable P2P between dev 2(=67000) and dev 3(=68000)
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:36 Mem Alloc Size 14352 pointer 0x7f08180f88f0
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f08180fc110
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f08180ff930
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f0818103150
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f0818106970
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f081810a190
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f081810d9b0
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f08181111d0
aetech:1311:1368 [0] NCCL INFO graph/topo.cc:539 Mem Alloc Size 1333312 pointer 0x7f8b10961510
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f08181149f0
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:424 Mem Alloc Size 16 pointer 0x7f0818118210
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b10aa6d60
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:425 Mem Alloc Size 32 pointer 0x7f0818118230
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b10aa6d80
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:36 Mem Alloc Size 14352 pointer 0x7f081805ad00
aetech:1311:1368 [0] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f081805e520
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b10aa6da0
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f0818061d40
aetech:1311:1368 [0] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f0818065560
aetech:1311:1368 [0] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f0818068d80
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b10aa6dc0
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f081806c5a0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b10aa6de0
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f081806fdc0
aetech:1311:1368 [0] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f08180735e0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b10aa6e00
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:36 Mem Alloc Size 57408 pointer 0x7f0818076e00
aetech:1311:1368 [0] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f0818084e50
aetech:1311:1368 [0] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1312:1370 [1] NCCL INFO graph/topo.cc:539 Mem Alloc Size 1333312 pointer 0x7f47e8961c30
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f0818092ea0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b10aa6e20
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08180a0ef0
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e8aa7480
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08180aef40
aetech:1311:1368 [0] NCCL INFO Attribute coll of node net not found
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08180bcf90
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e8aa74a0
aetech:1311:1368 [0] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08180cafe0
aetech:1312:1370 [1] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f08180d9030
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e8aa74c0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1312:1370 [1] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:36 Mem Alloc Size 14352 pointer 0x7f8b1005a5e0
aetech:1312:1370 [1] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b1005de00
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e8aa74e0
aetech:1315:1371 [3] NCCL INFO Could not enable P2P between dev 1(=1a000) and dev 0(=19000)
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b10061620
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e8aa7500
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b10064e40
aetech:1312:1370 [1] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b10068660
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e8aa7520
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b1006be80
aetech:1312:1370 [1] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1315:1371 [3] NCCL INFO Could not enable P2P between dev 0(=19000) and dev 1(=1a000)
aetech:1312:1370 [1] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b1006f6a0
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e8aa7540
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b10072ec0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b100766e0
aetech:1312:1370 [1] NCCL INFO Attribute coll of node net not found
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1312:1370 [1] NCCL INFO KV Convert to int : could not find value of '8.0 GT/s PCIe' in dictionary, falling back to 60
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:36 Mem Alloc Size 14352 pointer 0x7f47e805ad00
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:36 Mem Alloc Size 57408 pointer 0x7f8b10079f00
aetech:1315:1371 [3] NCCL INFO Could not enable P2P between dev 3(=68000) and dev 2(=67000)
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e805e520
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e8061d40
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e8065560
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e8068d80
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f8b10087f50
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e806c5a0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e806fdc0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e80735e0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e7080
aetech:1315:1371 [3] NCCL INFO Could not enable P2P between dev 2(=67000) and dev 3(=68000)
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e8076e00
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f8b10095fa0
aetech:1315:1371 [3] NCCL INFO === System : maxWidth 12.0 totalWidth 12.0 ===
aetech:1315:1371 [3] NCCL INFO CPU/0 (1/1/2)
aetech:1315:1371 [3] NCCL INFO + PCI[12.0] - PCI/17000 (10b5874710b58747)
aetech:1315:1371 [3] NCCL INFO + PCI[12.0] - GPU/19000 (0)
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:36 Mem Alloc Size 57408 pointer 0x7f47e807a620
aetech:1315:1371 [3] NCCL INFO + PCI[12.0] - GPU/1A000 (1)
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f8b100a3ff0
aetech:1315:1371 [3] NCCL INFO + PCI[12.0] - PCI/65000 (10b5874710b58747)
aetech:1315:1371 [3] NCCL INFO + PCI[12.0] - GPU/67000 (2)
aetech:1315:1371 [3] NCCL INFO + PCI[12.0] - GPU/68000 (3)
aetech:1315:1371 [3] NCCL INFO + PCI[3.0] - NIC/B3000
aetech:1315:1371 [3] NCCL INFO ==========================================
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f47e8088670
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f8b100b2040
aetech:1315:1371 [3] NCCL INFO GPU/19000 :GPU/19000 (0/5000.000000/LOC) GPU/1A000 (2/12.000000/PIX) GPU/67000 (4/12.000000/PHB) GPU/68000 (4/12.000000/PHB) CPU/0 (2/12.000000/PHB)
aetech:1315:1371 [3] NCCL INFO GPU/1A000 :GPU/19000 (2/12.000000/PIX) GPU/1A000 (0/5000.000000/LOC) GPU/67000 (4/12.000000/PHB) GPU/68000 (4/12.000000/PHB) CPU/0 (2/12.000000/PHB)
aetech:1315:1371 [3] NCCL INFO GPU/67000 :GPU/19000 (4/12.000000/PHB) GPU/1A000 (4/12.000000/PHB) GPU/67000 (0/5000.000000/LOC) GPU/68000 (2/12.000000/PIX) CPU/0 (2/12.000000/PHB)
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f47e80966c0
aetech:1315:1371 [3] NCCL INFO GPU/68000 :GPU/19000 (4/12.000000/PHB) GPU/1A000 (4/12.000000/PHB) GPU/67000 (2/12.000000/PIX) GPU/68000 (0/5000.000000/LOC) CPU/0 (2/12.000000/PHB)
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f8b100c0090
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f47e80a4710
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f8b100ce0e0
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f47e80b2760
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f8b100dc130
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f47e80c07b0
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f8b100ea180
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f47e80ce800
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1311:1368 [0] NCCL INFO Could not enable P2P between dev 1(=1a000) and dev 0(=19000)
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f47e80dc850
aetech:1311:1368 [0] NCCL INFO Could not enable P2P between dev 0(=19000) and dev 1(=1a000)
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f47e80ea8a0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1311:1368 [0] NCCL INFO Could not enable P2P between dev 3(=68000) and dev 2(=67000)
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1312:1370 [1] NCCL INFO Could not enable P2P between dev 1(=1a000) and dev 0(=19000)
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1312:1370 [1] NCCL INFO Could not enable P2P between dev 0(=19000) and dev 1(=1a000)
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f81d0
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1311:1368 [0] NCCL INFO Could not enable P2P between dev 2(=67000) and dev 3(=68000)
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:36 Mem Alloc Size 14352 pointer 0x7f8b100f81d0
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b100fb9f0
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1312:1370 [1] NCCL INFO Could not enable P2P between dev 3(=68000) and dev 2(=67000)
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b100ff210
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b10102a30
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b10106250
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f88f0
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b10109a70
aetech:1312:1370 [1] NCCL INFO Could not enable P2P between dev 2(=67000) and dev 3(=68000)
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b1010d290
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:36 Mem Alloc Size 14352 pointer 0x7f47e80f88f0
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b10110ab0
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e80fc110
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b101142d0
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e80ff930
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:424 Mem Alloc Size 16 pointer 0x7f8b10117af0
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:425 Mem Alloc Size 32 pointer 0x7f8b10117b10
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e8103150
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:36 Mem Alloc Size 14352 pointer 0x7f8b1005a5e0
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e8106970
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b1005de00
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b10061620
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e810a190
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b10064e40
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b10068660
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e810d9b0
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b1006be80
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b1006f6a0
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e81111d0
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f8b10072ec0
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:36 Mem Alloc Size 57408 pointer 0x7f8b100766e0
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e81149f0
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f8b10084730
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:424 Mem Alloc Size 16 pointer 0x7f47e8118210
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f8b10092780
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:425 Mem Alloc Size 32 pointer 0x7f47e8118230
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f8b100a07d0
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:36 Mem Alloc Size 14352 pointer 0x7f47e805ad00
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f8b100ae820
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e805e520
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f8b100bc870
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e8061d40
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f8b100ca8c0
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e8065560
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f8b100d8910
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e8068d80
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e806c5a0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e806fdc0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 14352 pointer 0x7f47e80735e0
aetech:1311:1368 [0] NCCL INFO Could not enable P2P between dev 1(=1a000) and dev 0(=19000)
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:36 Mem Alloc Size 57408 pointer 0x7f47e8076e00
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f47e8084e50
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f47e8092ea0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f47e80a0ef0
aetech:1311:1368 [0] NCCL INFO Could not enable P2P between dev 0(=19000) and dev 1(=1a000)
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f47e80aef40
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f47e80bcf90
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f47e80cafe0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:60 Mem Alloc Size 57408 pointer 0x7f47e80d9030
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO Could not enable P2P between dev 1(=1a000) and dev 0(=19000)
aetech:1311:1368 [0] NCCL INFO Could not enable P2P between dev 3(=68000) and dev 2(=67000)
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO Could not enable P2P between dev 0(=19000) and dev 1(=1a000)
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1311:1368 [0] NCCL INFO Could not enable P2P between dev 2(=67000) and dev 3(=68000)
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1311:1368 [0] NCCL INFO === System : maxWidth 12.0 totalWidth 12.0 ===
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1311:1368 [0] NCCL INFO CPU/0 (1/1/2)
aetech:1312:1370 [1] NCCL INFO Could not enable P2P between dev 3(=68000) and dev 2(=67000)
aetech:1311:1368 [0] NCCL INFO + PCI[12.0] - PCI/17000 (10b5874710b58747)
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1311:1368 [0] NCCL INFO + PCI[12.0] - GPU/19000 (0)
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1311:1368 [0] NCCL INFO + PCI[12.0] - GPU/1A000 (1)
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1311:1368 [0] NCCL INFO + PCI[12.0] - PCI/65000 (10b5874710b58747)
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1311:1368 [0] NCCL INFO + PCI[12.0] - GPU/67000 (2)
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1311:1368 [0] NCCL INFO + PCI[12.0] - GPU/68000 (3)
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1311:1368 [0] NCCL INFO + PCI[3.0] - NIC/B3000
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80e7080
aetech:1311:1368 [0] NCCL INFO ==========================================
aetech:1312:1370 [1] NCCL INFO Could not enable P2P between dev 2(=67000) and dev 3(=68000)
aetech:1311:1368 [0] NCCL INFO GPU/19000 :GPU/19000 (0/5000.000000/LOC) GPU/1A000 (2/12.000000/PIX) GPU/67000 (4/12.000000/PHB) GPU/68000 (4/12.000000/PHB) CPU/0 (2/12.000000/PHB)
aetech:1312:1370 [1] NCCL INFO === System : maxWidth 12.0 totalWidth 12.0 ===
aetech:1312:1370 [1] NCCL INFO CPU/0 (1/1/2)
aetech:1311:1368 [0] NCCL INFO GPU/1A000 :GPU/19000 (2/12.000000/PIX) GPU/1A000 (0/5000.000000/LOC) GPU/67000 (4/12.000000/PHB) GPU/68000 (4/12.000000/PHB) CPU/0 (2/12.000000/PHB)
aetech:1312:1370 [1] NCCL INFO + PCI[12.0] - PCI/17000 (10b5874710b58747)
aetech:1312:1370 [1] NCCL INFO + PCI[12.0] - GPU/19000 (0)
aetech:1311:1368 [0] NCCL INFO GPU/67000 :GPU/19000 (4/12.000000/PHB) GPU/1A000 (4/12.000000/PHB) GPU/67000 (0/5000.000000/LOC) GPU/68000 (2/12.000000/PIX) CPU/0 (2/12.000000/PHB)
aetech:1312:1370 [1] NCCL INFO + PCI[12.0] - GPU/1A000 (1)
aetech:1312:1370 [1] NCCL INFO + PCI[12.0] - PCI/65000 (10b5874710b58747)
aetech:1311:1368 [0] NCCL INFO GPU/68000 :GPU/19000 (4/12.000000/PHB) GPU/1A000 (4/12.000000/PHB) GPU/67000 (2/12.000000/PIX) GPU/68000 (0/5000.000000/LOC) CPU/0 (2/12.000000/PHB)
aetech:1312:1370 [1] NCCL INFO + PCI[12.0] - GPU/67000 (2)
aetech:1312:1370 [1] NCCL INFO + PCI[12.0] - GPU/68000 (3)
aetech:1312:1370 [1] NCCL INFO + PCI[3.0] - NIC/B3000
aetech:1312:1370 [1] NCCL INFO ==========================================
aetech:1312:1370 [1] NCCL INFO GPU/19000 :GPU/19000 (0/5000.000000/LOC) GPU/1A000 (2/12.000000/PIX) GPU/67000 (4/12.000000/PHB) GPU/68000 (4/12.000000/PHB) CPU/0 (2/12.000000/PHB)
aetech:1312:1370 [1] NCCL INFO GPU/1A000 :GPU/19000 (2/12.000000/PIX) GPU/1A000 (0/5000.000000/LOC) GPU/67000 (4/12.000000/PHB) GPU/68000 (4/12.000000/PHB) CPU/0 (2/12.000000/PHB)
aetech:1312:1370 [1] NCCL INFO GPU/67000 :GPU/19000 (4/12.000000/PHB) GPU/1A000 (4/12.000000/PHB) GPU/67000 (0/5000.000000/LOC) GPU/68000 (2/12.000000/PIX) CPU/0 (2/12.000000/PHB)
aetech:1312:1370 [1] NCCL INFO GPU/68000 :GPU/19000 (4/12.000000/PHB) GPU/1A000 (4/12.000000/PHB) GPU/67000 (2/12.000000/PIX) GPU/68000 (0/5000.000000/LOC) CPU/0 (2/12.000000/PHB)
aetech:1315:1371 [3] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 10.000000/10.000000, type PHB/PIX, sameChannels 1
aetech:1315:1371 [3] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
aetech:1315:1371 [3] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 10.000000/10.000000, type PHB/PIX, sameChannels 1
aetech:1315:1371 [3] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
aetech:1315:1371 [3] NCCL INFO Pattern 3, crossNic 0, nChannels 1, speed 10.000000/10.000000, type PHB/PIX, sameChannels 1
aetech:1315:1371 [3] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
aetech:1315:1371 [3] NCCL INFO init.cc:645 Mem Alloc Size 3936 pointer 0x7f08180e7080
aetech:1311:1368 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 10.000000/10.000000, type PHB/PIX, sameChannels 1
aetech:1311:1368 [0] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
aetech:1312:1370 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 10.000000/10.000000, type PHB/PIX, sameChannels 1
aetech:1312:1370 [1] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
aetech:1311:1368 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 10.000000/10.000000, type PHB/PIX, sameChannels 1
aetech:1311:1368 [0] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
aetech:1312:1370 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 10.000000/10.000000, type PHB/PIX, sameChannels 1
aetech:1312:1370 [1] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
aetech:1311:1368 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, speed 10.000000/10.000000, type PHB/PIX, sameChannels 1
aetech:1311:1368 [0] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
aetech:1311:1368 [0] NCCL INFO init.cc:645 Mem Alloc Size 3936 pointer 0x7f8b100e6960
aetech:1312:1370 [1] NCCL INFO Pattern 3, crossNic 0, nChannels 1, speed 10.000000/10.000000, type PHB/PIX, sameChannels 1
aetech:1312:1370 [1] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
aetech:1312:1370 [1] NCCL INFO init.cc:645 Mem Alloc Size 3936 pointer 0x7f47e80e7080
aetech:1312:1370 [1] NCCL INFO init.cc:676 Mem Alloc Size 16 pointer 0x7f47e80e7ff0
aetech:1312:1370 [1] NCCL INFO init.cc:677 Mem Alloc Size 16 pointer 0x7f47e80e8010
aetech:1312:1370 [1] NCCL INFO init.cc:695 Mem Alloc Size 32 pointer 0x7f47e80e8030
aetech:1312:1370 [1] NCCL INFO init.cc:736 Mem Alloc Size 512 pointer 0x7f47e80e8060
aetech:1311:1368 [0] NCCL INFO init.cc:676 Mem Alloc Size 16 pointer 0x7f8b100e78d0
aetech:1315:1371 [3] NCCL INFO init.cc:676 Mem Alloc Size 16 pointer 0x7f08180e7ff0
aetech:1312:1370 [1] NCCL INFO graph/connect.cc:265 Mem Alloc Size 512 pointer 0x7f47e80e8270
aetech:1311:1368 [0] NCCL INFO init.cc:677 Mem Alloc Size 16 pointer 0x7f8b100e78f0
aetech:1314:1369 [2] NCCL INFO init.cc:676 Mem Alloc Size 16 pointer 0x7f08480e7ff0
aetech:1315:1371 [3] NCCL INFO init.cc:677 Mem Alloc Size 16 pointer 0x7f08180e8010
aetech:1312:1370 [1] NCCL INFO graph/connect.cc:266 Mem Alloc Size 512 pointer 0x7f47e80e8480
aetech:1311:1368 [0] NCCL INFO init.cc:695 Mem Alloc Size 32 pointer 0x7f8b100e7910
aetech:1315:1371 [3] NCCL INFO init.cc:695 Mem Alloc Size 32 pointer 0x7f08180e8030
aetech:1314:1369 [2] NCCL INFO init.cc:677 Mem Alloc Size 16 pointer 0x7f08480e8010
aetech:1312:1370 [1] NCCL INFO graph/connect.cc:267 Mem Alloc Size 512 pointer 0x7f47e80e8690
aetech:1315:1371 [3] NCCL INFO init.cc:736 Mem Alloc Size 512 pointer 0x7f08180e8060
aetech:1311:1368 [0] NCCL INFO init.cc:736 Mem Alloc Size 512 pointer 0x7f8b100e7940
aetech:1314:1369 [2] NCCL INFO init.cc:695 Mem Alloc Size 32 pointer 0x7f08480e8030
aetech:1312:1370 [1] NCCL INFO graph/connect.cc:268 Mem Alloc Size 512 pointer 0x7f47e80e88a0
aetech:1315:1371 [3] NCCL INFO graph/connect.cc:265 Mem Alloc Size 512 pointer 0x7f08180e8270
aetech:1312:1370 [1] NCCL INFO graph/connect.cc:269 Mem Alloc Size 512 pointer 0x7f47e80e8ab0
aetech:1311:1368 [0] NCCL INFO graph/connect.cc:265 Mem Alloc Size 512 pointer 0x7f8b100e7b50
aetech:1314:1369 [2] NCCL INFO init.cc:736 Mem Alloc Size 512 pointer 0x7f08480e8060
aetech:1315:1371 [3] NCCL INFO graph/connect.cc:266 Mem Alloc Size 512 pointer 0x7f08180e8480
aetech:1312:1370 [1] NCCL INFO graph/connect.cc:270 Mem Alloc Size 512 pointer 0x7f47e80e8cc0
aetech:1311:1368 [0] NCCL INFO graph/connect.cc:266 Mem Alloc Size 512 pointer 0x7f8b100e7d60
aetech:1314:1369 [2] NCCL INFO graph/connect.cc:265 Mem Alloc Size 512 pointer 0x7f08480e8270
aetech:1315:1371 [3] NCCL INFO graph/connect.cc:267 Mem Alloc Size 512 pointer 0x7f08180e8690
aetech:1312:1370 [1] NCCL INFO graph/connect.cc:271 Mem Alloc Size 512 pointer 0x7f47e80e8ed0
aetech:1311:1368 [0] NCCL INFO graph/connect.cc:267 Mem Alloc Size 512 pointer 0x7f8b100e7f70
aetech:1314:1369 [2] NCCL INFO graph/connect.cc:266 Mem Alloc Size 512 pointer 0x7f08480e8480
aetech:1315:1371 [3] NCCL INFO graph/connect.cc:268 Mem Alloc Size 512 pointer 0x7f08180e88a0
aetech:1312:1370 [1] NCCL INFO graph/connect.cc:125 Mem Alloc Size 4 pointer 0x7f47e80e90e0
aetech:1311:1368 [0] NCCL INFO graph/connect.cc:268 Mem Alloc Size 512 pointer 0x7f8b100e8180
aetech:1314:1369 [2] NCCL INFO graph/connect.cc:267 Mem Alloc Size 512 pointer 0x7f08480e8690
aetech:1315:1371 [3] NCCL INFO graph/connect.cc:269 Mem Alloc Size 512 pointer 0x7f08180e8ab0
aetech:1312:1370 [1] NCCL INFO graph/connect.cc:126 Mem Alloc Size 4 pointer 0x7f47e80e9100
aetech:1311:1368 [0] NCCL INFO graph/connect.cc:269 Mem Alloc Size 512 pointer 0x7f8b100e8390
aetech:1314:1369 [2] NCCL INFO graph/connect.cc:268 Mem Alloc Size 512 pointer 0x7f08480e88a0
aetech:1315:1371 [3] NCCL INFO graph/connect.cc:270 Mem Alloc Size 512 pointer 0x7f08180e8cc0
aetech:1312:1370 [1] NCCL INFO graph/connect.cc:127 Mem Alloc Size 4 pointer 0x7f47e80e9120
aetech:1311:1368 [0] NCCL INFO graph/connect.cc:270 Mem Alloc Size 512 pointer 0x7f8b100e85a0
aetech:1314:1369 [2] NCCL INFO graph/connect.cc:269 Mem Alloc Size 512 pointer 0x7f08480e8ab0
aetech:1315:1371 [3] NCCL INFO graph/connect.cc:271 Mem Alloc Size 512 pointer 0x7f08180e8ed0
aetech:1312:1370 [1] NCCL INFO Tree 0 : 0 -> 1 -> 2/-1/-1
aetech:1311:1368 [0] NCCL INFO graph/connect.cc:271 Mem Alloc Size 512 pointer 0x7f8b100e87b0
aetech:1314:1369 [2] NCCL INFO graph/connect.cc:270 Mem Alloc Size 512 pointer 0x7f08480e8cc0
aetech:1315:1371 [3] NCCL INFO graph/connect.cc:125 Mem Alloc Size 4 pointer 0x7f08180e90e0
aetech:1314:1369 [2] NCCL INFO graph/connect.cc:271 Mem Alloc Size 512 pointer 0x7f08480e8ed0
aetech:1312:1370 [1] NCCL INFO Tree 1 : 0 -> 1 -> 2/-1/-1
aetech:1311:1368 [0] NCCL INFO graph/connect.cc:125 Mem Alloc Size 4 pointer 0x7f8b100e89c0
aetech:1315:1371 [3] NCCL INFO graph/connect.cc:126 Mem Alloc Size 4 pointer 0x7f08180e9100
aetech:1314:1369 [2] NCCL INFO graph/connect.cc:125 Mem Alloc Size 4 pointer 0x7f08480e90e0
aetech:1311:1368 [0] NCCL INFO graph/connect.cc:126 Mem Alloc Size 4 pointer 0x7f8b100e89e0
aetech:1315:1371 [3] NCCL INFO graph/connect.cc:127 Mem Alloc Size 4 pointer 0x7f08180e9120
aetech:1312:1370 [1] NCCL INFO Ring 00 : 0 -> 1 -> 2
aetech:1314:1369 [2] NCCL INFO graph/connect.cc:126 Mem Alloc Size 4 pointer 0x7f08480e9100
aetech:1311:1368 [0] NCCL INFO graph/connect.cc:127 Mem Alloc Size 4 pointer 0x7f8b100e8a00
aetech:1314:1369 [2] NCCL INFO graph/connect.cc:127 Mem Alloc Size 4 pointer 0x7f08480e9120
aetech:1312:1370 [1] NCCL INFO Ring 01 : 0 -> 1 -> 2
aetech:1315:1371 [3] NCCL INFO Ring 00 : 2 -> 3 -> 0
aetech:1311:1368 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1
aetech:1312:1370 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
aetech:1315:1371 [3] NCCL INFO Ring 01 : 2 -> 3 -> 0
aetech:1314:1369 [2] NCCL INFO Ring 00 : 1 -> 2 -> 3
aetech:1311:1368 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1
aetech:1315:1371 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
aetech:1312:1370 [1] NCCL INFO Setting affinity for GPU 1 to 0f,ffffffff
aetech:1314:1369 [2] NCCL INFO Ring 01 : 1 -> 2 -> 3
aetech:1311:1368 [0] NCCL INFO Channel 00/02 : 0 1 2 3
aetech:1315:1371 [3] NCCL INFO Setting affinity for GPU 3 to 0f,ffffffff
aetech:1314:1369 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
aetech:1312:1370 [1] NCCL INFO NCCL_BUFFSIZE set by environment to 73400320.
aetech:1311:1368 [0] NCCL INFO Channel 01/02 : 0 1 2 3
aetech:1315:1371 [3] NCCL INFO NCCL_BUFFSIZE set by environment to 73400320.
aetech:1314:1369 [2] NCCL INFO Setting affinity for GPU 2 to 0f,ffffffff
aetech:1311:1368 [0] NCCL INFO Ring 00 : 3 -> 0 -> 1
aetech:1311:1368 [0] NCCL INFO Ring 01 : 3 -> 0 -> 1
aetech:1314:1369 [2] NCCL INFO NCCL_BUFFSIZE set by environment to 73400320.
aetech:1311:1368 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
aetech:1311:1368 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ffffffff
aetech:1311:1368 [0] NCCL INFO NCCL_BUFFSIZE set by environment to 73400320.
aetech:1311:1368 [0] NCCL INFO channel.cc:20 Cuda Alloc Size 16 pointer 0x7f8bc7810400
aetech:1311:1368 [0] NCCL INFO channel.cc:21 Mem Alloc Size 16 pointer 0x7f8b100e78d0
aetech:1311:1368 [0] NCCL INFO channel.cc:24 Cuda Alloc Size 2720 pointer 0x7f8bc7810600
aetech:1311:1368 [0] NCCL INFO channel.cc:25 Mem Alloc Size 2720 pointer 0x7f8b100e89c0
aetech:1311:1368 [0] NCCL INFO channel.cc:34 Cuda Host Alloc Size 1048576 pointer 0x7f8b9cc00400
aetech:1311:1368 [0] NCCL INFO channel.cc:20 Cuda Alloc Size 16 pointer 0x7f8bc7811200
aetech:1311:1368 [0] NCCL INFO channel.cc:21 Mem Alloc Size 16 pointer 0x7f8b100e78f0
aetech:1311:1368 [0] NCCL INFO channel.cc:24 Cuda Alloc Size 2720 pointer 0x7f8bc7811400
aetech:1311:1368 [0] NCCL INFO channel.cc:25 Mem Alloc Size 2720 pointer 0x7f8b100e9f50
aetech:1312:1370 [1] NCCL INFO channel.cc:20 Cuda Alloc Size 16 pointer 0x7f47efe00000
aetech:1312:1370 [1] NCCL INFO channel.cc:21 Mem Alloc Size 16 pointer 0x7f47e8aa7540
aetech:1315:1371 [3] NCCL INFO channel.cc:20 Cuda Alloc Size 16 pointer 0x7f081fe00000
aetech:1315:1371 [3] NCCL INFO channel.cc:21 Mem Alloc Size 16 pointer 0x7f0818aa7540
aetech:1314:1369 [2] NCCL INFO channel.cc:20 Cuda Alloc Size 16 pointer 0x7f084fe00000
aetech:1314:1369 [2] NCCL INFO channel.cc:21 Mem Alloc Size 16 pointer 0x7f0848aa7540
aetech:1311:1368 [0] NCCL INFO channel.cc:34 Cuda Host Alloc Size 1048576 pointer 0x7f8b29800000
aetech:1311:1368 [0] NCCL INFO transport/shm.cc:85 Mem Alloc Size 48 pointer 0x7f8b100eb670
aetech:1312:1370 [1] NCCL INFO channel.cc:24 Cuda Alloc Size 2720 pointer 0x7f47efe00200
aetech:1312:1370 [1] NCCL INFO channel.cc:25 Mem Alloc Size 2720 pointer 0x7f47e80e93c0
aetech:1315:1371 [3] NCCL INFO channel.cc:24 Cuda Alloc Size 2720 pointer 0x7f081fe00200
aetech:1315:1371 [3] NCCL INFO channel.cc:25 Mem Alloc Size 2720 pointer 0x7f08180e93c0
aetech:1314:1369 [2] NCCL INFO channel.cc:24 Cuda Alloc Size 2720 pointer 0x7f084fe00200
aetech:1314:1369 [2] NCCL INFO channel.cc:25 Mem Alloc Size 2720 pointer 0x7f08480e93c0
aetech:1312:1370 [1] NCCL INFO channel.cc:34 Cuda Host Alloc Size 1048576 pointer 0x7f47efc00200
aetech:1315:1371 [3] NCCL INFO channel.cc:34 Cuda Host Alloc Size 1048576 pointer 0x7f081fc00200
aetech:1314:1369 [2] NCCL INFO channel.cc:34 Cuda Host Alloc Size 1048576 pointer 0x7f084fc00200
aetech:1312:1370 [1] NCCL INFO channel.cc:20 Cuda Alloc Size 16 pointer 0x7f47efe00e00
aetech:1312:1370 [1] NCCL INFO channel.cc:21 Mem Alloc Size 16 pointer 0x7f47e80e7fd0
aetech:1315:1371 [3] NCCL INFO channel.cc:20 Cuda Alloc Size 16 pointer 0x7f081fe00e00
aetech:1315:1371 [3] NCCL INFO channel.cc:21 Mem Alloc Size 16 pointer 0x7f08180e7fd0
aetech:1314:1369 [2] NCCL INFO channel.cc:20 Cuda Alloc Size 16 pointer 0x7f084fe00e00
aetech:1314:1369 [2] NCCL INFO channel.cc:21 Mem Alloc Size 16 pointer 0x7f08480e7fd0
aetech:1312:1370 [1] NCCL INFO channel.cc:24 Cuda Alloc Size 2720 pointer 0x7f47efe01000
aetech:1312:1370 [1] NCCL INFO channel.cc:25 Mem Alloc Size 2720 pointer 0x7f47e80eaf10
aetech:1315:1371 [3] NCCL INFO channel.cc:24 Cuda Alloc Size 2720 pointer 0x7f081fe01000
aetech:1315:1371 [3] NCCL INFO channel.cc:25 Mem Alloc Size 2720 pointer 0x7f08180eaf10
aetech:1314:1369 [2] NCCL INFO channel.cc:24 Cuda Alloc Size 2720 pointer 0x7f084fe01000
aetech:1314:1369 [2] NCCL INFO channel.cc:25 Mem Alloc Size 2720 pointer 0x7f08480eaf10
aetech:1312:1370 [1] NCCL INFO channel.cc:34 Cuda Host Alloc Size 1048576 pointer 0x7f47f1600000
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80ec610
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80ec630
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80ec630
aetech:1312:1370 [1] NCCL INFO Could not enable P2P between dev 1(=1a000) and dev 0(=19000)
aetech:1312:1370 [1] NCCL INFO transport/shm.cc:85 Mem Alloc Size 48 pointer 0x7f47e80ec650
aetech:1315:1371 [3] NCCL INFO channel.cc:34 Cuda Host Alloc Size 1048576 pointer 0x7f0821600000
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180ec610
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180ec630
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180ec630
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180ec630
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180ec630
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180ec630
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180ec630
aetech:1315:1371 [3] NCCL INFO Could not enable P2P between dev 3(=68000) and dev 2(=67000)
aetech:1314:1369 [2] NCCL INFO channel.cc:34 Cuda Host Alloc Size 1048576 pointer 0x7f0851600000
aetech:1315:1371 [3] NCCL INFO transport/shm.cc:85 Mem Alloc Size 48 pointer 0x7f08180ec650
aetech:1314:1369 [2] NCCL INFO transport/shm.cc:85 Mem Alloc Size 48 pointer 0x7f08480ec610
aetech:1311:1368 [0] NCCL INFO transport/shm.cc:85 Mem Alloc Size 48 pointer 0x7f8b100ec530
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80ed4f0
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80ed4f0
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80ed4f0
aetech:1312:1370 [1] NCCL INFO Could not enable P2P between dev 1(=1a000) and dev 0(=19000)
aetech:1312:1370 [1] NCCL INFO transport/shm.cc:85 Mem Alloc Size 48 pointer 0x7f47e80ed510
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180ed4f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180ed4f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180ed4f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180ed4f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180ed4f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180ed4f0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180ed4f0
aetech:1315:1371 [3] NCCL INFO Could not enable P2P between dev 3(=68000) and dev 2(=67000)
aetech:1315:1371 [3] NCCL INFO transport/shm.cc:85 Mem Alloc Size 48 pointer 0x7f08180ed510
aetech:1314:1369 [2] NCCL INFO transport/shm.cc:85 Mem Alloc Size 48 pointer 0x7f08480ed4d0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100ed0d0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100ed0f0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100ed0f0
aetech:1311:1368 [0] NCCL INFO Could not enable P2P between dev 0(=19000) and dev 1(=1a000)
aetech:1311:1368 [0] NCCL INFO transport/shm.cc:62 Mem Alloc Size 48 pointer 0x7f8b100ed110
aetech:1311:1368 [0] NCCL INFO Channel 00 : 0[19000] -> 1[1a000] via direct shared memory
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100ee000
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100ee000
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100ee000
aetech:1311:1368 [0] NCCL INFO Could not enable P2P between dev 0(=19000) and dev 1(=1a000)
aetech:1311:1368 [0] NCCL INFO transport/shm.cc:62 Mem Alloc Size 48 pointer 0x7f8b100ee020
aetech:1311:1368 [0] NCCL INFO Channel 01 : 0[19000] -> 1[1a000] via direct shared memory
aetech:1312:1370 [1] NCCL INFO transport/shm.cc:62 Mem Alloc Size 48 pointer 0x7f47e80ee090
aetech:1312:1370 [1] NCCL INFO Channel 00 : 1[1a000] -> 2[67000] via direct shared memory
aetech:1312:1370 [1] NCCL INFO transport/shm.cc:62 Mem Alloc Size 48 pointer 0x7f47e80eefa0
aetech:1312:1370 [1] NCCL INFO Channel 01 : 1[1a000] -> 2[67000] via direct shared memory
aetech:1312:1370 [1] NCCL INFO bootstrap.cc:456 Mem Alloc Size 48 pointer 0x7f47e80efeb0
aetech:1315:1371 [3] NCCL INFO transport/shm.cc:62 Mem Alloc Size 48 pointer 0x7f08180ee090
aetech:1315:1371 [3] NCCL INFO Channel 00 : 3[68000] -> 0[19000] via direct shared memory
aetech:1315:1371 [3] NCCL INFO transport/shm.cc:62 Mem Alloc Size 48 pointer 0x7f08180eefa0
aetech:1315:1371 [3] NCCL INFO Channel 01 : 3[68000] -> 0[19000] via direct shared memory
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480ee070
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480ee090
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480ee090
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480ee090
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480ee090
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480ee090
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480ee090
aetech:1314:1369 [2] NCCL INFO Could not enable P2P between dev 2(=67000) and dev 3(=68000)
aetech:1314:1369 [2] NCCL INFO transport/shm.cc:62 Mem Alloc Size 48 pointer 0x7f08480ee0b0
aetech:1314:1369 [2] NCCL INFO Channel 00 : 2[67000] -> 3[68000] via direct shared memory
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480eefa0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480eefa0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480eefa0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480eefa0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480eefa0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480eefa0
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480eefa0
aetech:1314:1369 [2] NCCL INFO Could not enable P2P between dev 2(=67000) and dev 3(=68000)
aetech:1314:1369 [2] NCCL INFO transport/shm.cc:62 Mem Alloc Size 48 pointer 0x7f08480eefc0
aetech:1314:1369 [2] NCCL INFO Channel 01 : 2[67000] -> 3[68000] via direct shared memory
aetech:1314:1369 [2] NCCL INFO bootstrap.cc:456 Mem Alloc Size 48 pointer 0x7f08480efeb0
aetech:1311:1368 [0] NCCL INFO Connected all rings
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b10aa6e20
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e78a0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100e7940
aetech:1311:1368 [0] NCCL INFO Could not enable P2P between dev 0(=19000) and dev 1(=1a000)
aetech:1311:1368 [0] NCCL INFO transport/shm.cc:85 Mem Alloc Size 48 pointer 0x7f8b100e7960
aetech:1315:1371 [3] NCCL INFO Connected all rings
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e8010
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e8060
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e8060
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e8060
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e8060
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e8060
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180e8060
aetech:1315:1371 [3] NCCL INFO Could not enable P2P between dev 3(=68000) and dev 2(=67000)
aetech:1315:1371 [3] NCCL INFO transport/shm.cc:62 Mem Alloc Size 48 pointer 0x7f08180e8080
aetech:1315:1371 [3] NCCL INFO Channel 00 : 3[68000] -> 2[67000] via direct shared memory
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f41c0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f41c0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f41c0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f41c0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f41c0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f41c0
aetech:1315:1371 [3] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08180f41c0
aetech:1315:1371 [3] NCCL INFO Could not enable P2P between dev 3(=68000) and dev 2(=67000)
aetech:1315:1371 [3] NCCL INFO transport/shm.cc:62 Mem Alloc Size 48 pointer 0x7f08180f41e0
aetech:1315:1371 [3] NCCL INFO Channel 01 : 3[68000] -> 2[67000] via direct shared memory
aetech:1314:1369 [2] NCCL INFO Connected all rings
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e8010
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e8060
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e8060
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e8060
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e8060
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e8060
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480e8060
aetech:1314:1369 [2] NCCL INFO Could not enable P2P between dev 2(=67000) and dev 3(=68000)
aetech:1314:1369 [2] NCCL INFO transport/shm.cc:85 Mem Alloc Size 48 pointer 0x7f08480e8080
aetech:1312:1370 [1] NCCL INFO Connected all rings
aetech:1312:1370 [1] NCCL INFO transport/shm.cc:85 Mem Alloc Size 48 pointer 0x7f47e80e8060
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f2eb0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f2eb0
aetech:1311:1368 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f8b100f2eb0
aetech:1311:1368 [0] NCCL INFO Could not enable P2P between dev 0(=19000) and dev 1(=1a000)
aetech:1311:1368 [0] NCCL INFO transport/shm.cc:85 Mem Alloc Size 48 pointer 0x7f8b100f2ed0
aetech:1312:1370 [1] NCCL INFO transport/shm.cc:85 Mem Alloc Size 48 pointer 0x7f47e80f3e50
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f3e50
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f3e50
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f3e50
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f3e50
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f3e50
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f3e50
aetech:1314:1369 [2] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f08480f3e50
aetech:1314:1369 [2] NCCL INFO Could not enable P2P between dev 2(=67000) and dev 3(=68000)
aetech:1314:1369 [2] NCCL INFO transport/shm.cc:85 Mem Alloc Size 48 pointer 0x7f08480f3e70
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f49f0
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f4a10
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f4a10
aetech:1312:1370 [1] NCCL INFO Could not enable P2P between dev 1(=1a000) and dev 0(=19000)
aetech:1312:1370 [1] NCCL INFO transport/shm.cc:62 Mem Alloc Size 48 pointer 0x7f47e80f4a30
aetech:1314:1369 [2] NCCL INFO transport/shm.cc:62 Mem Alloc Size 48 pointer 0x7f08480f49f0
aetech:1312:1370 [1] NCCL INFO Channel 00 : 1[1a000] -> 0[19000] via direct shared memory
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f5920
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f5920
aetech:1312:1370 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f47e80f5920
aetech:1312:1370 [1] NCCL INFO Could not enable P2P between dev 1(=1a000) and dev 0(=19000)
aetech:1312:1370 [1] NCCL INFO transport/shm.cc:62 Mem Alloc Size 48 pointer 0x7f47e80f5940
aetech:1314:1369 [2] NCCL INFO Channel 00 : 2[67000] -> 1[1a000] via direct shared memory
aetech:1314:1369 [2] NCCL INFO transport/shm.cc:62 Mem Alloc Size 48 pointer 0x7f08480f5900
aetech:1312:1370 [1] NCCL INFO Channel 01 : 1[1a000] -> 0[19000] via direct shared memory
aetech:1314:1369 [2] NCCL INFO Channel 01 : 2[67000] -> 1[1a000] via direct shared memory
aetech:1314:1369 [2] NCCL INFO bootstrap.cc:456 Mem Alloc Size 48 pointer 0x7f08480f6810
aetech:1311:1368 [0] NCCL INFO Connected all trees
aetech:1311:1368 [0] NCCL INFO Latency/AlgBw | Tree/ LL | Tree/ LL128 | Tree/Simple | Ring/ LL | Ring/ LL128 | Ring/Simple | CollNet/ LL | CollNet/ LL128 | CollNet/Simple |
aetech:1311:1368 [0] NCCL INFO Max NThreads | 512 | 640 | 512 | 512 | 640 | 256 | 512 | 640 | 512 |
aetech:1311:1368 [0] NCCL INFO Broadcast | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 4.6/ 3.3 | 12.5/ 0.0 | 14.1/ 10.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 |
aetech:1311:1368 [0] NCCL INFO Reduce | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 4.6/ 2.5 | 12.5/ 0.0 | 14.1/ 10.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 |
aetech:1311:1368 [0] NCCL INFO AllGather | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 6.6/ 4.4 | 17.5/ 0.0 | 25.5/ 13.3 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 |
aetech:1311:1368 [0] NCCL INFO ReduceScatter | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 6.6/ 4.4 | 17.5/ 0.0 | 25.5/ 13.3 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 |
aetech:1311:1368 [0] NCCL INFO AllReduce | 10.4/ 1.2 | 15.8/ 0.0 | 168.0/ 4.6 | 9.6/ 1.7 | 25.0/ 0.0 | 42.6/ 6.7 | 14.4/ 0.0 | 16.2/ 0.0 | 29.7/ 0.0 |
aetech:1311:1368 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
aetech:1311:1368 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
aetech:1311:1368 [0] NCCL INFO graph/paths.cc:539 Mem Alloc Size 16 pointer 0x7f8b100ed0d0
aetech:1311:1368 [0] NCCL INFO init.cc:415 Mem Alloc Size 8 pointer 0x7f8b100ed0d0
aetech:1311:1368 [0] NCCL INFO init.cc:418 Mem Alloc Size 56 pointer 0x7f8b100f57f0
aetech:1311:1368 [0] NCCL INFO init.cc:419 Mem Alloc Size 4 pointer 0x7f8b100f5830
aetech:1311:1368 [0] NCCL INFO init.cc:421 Mem Alloc Size 4 pointer 0x7f8b100f5850
aetech:1311:1368 [0] NCCL INFO init.cc:425 Mem Alloc Size 4 pointer 0x7f8b100f5870
aetech:1311:1368 [0] NCCL INFO NCCL_LAUNCH_MODE set by environment to PARALLEL
aetech:1312:1370 [1] NCCL INFO Connected all trees
aetech:1312:1370 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
aetech:1312:1370 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
aetech:1312:1370 [1] NCCL INFO graph/paths.cc:539 Mem Alloc Size 16 pointer 0x7f47e80ec610
aetech:1312:1370 [1] NCCL INFO init.cc:415 Mem Alloc Size 8 pointer 0x7f47e80ec610
aetech:1312:1370 [1] NCCL INFO init.cc:418 Mem Alloc Size 56 pointer 0x7f47e80f9c90
aetech:1312:1370 [1] NCCL INFO init.cc:419 Mem Alloc Size 4 pointer 0x7f47e80f9cd0
aetech:1312:1370 [1] NCCL INFO init.cc:421 Mem Alloc Size 4 pointer 0x7f47e80f9cf0
aetech:1312:1370 [1] NCCL INFO init.cc:425 Mem Alloc Size 4 pointer 0x7f47e80f9d10
aetech:1312:1370 [1] NCCL INFO NCCL_LAUNCH_MODE set by environment to PARALLEL
aetech:1314:1369 [2] NCCL INFO Connected all trees
aetech:1314:1369 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
aetech:1314:1369 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
aetech:1314:1369 [2] NCCL INFO graph/paths.cc:539 Mem Alloc Size 16 pointer 0x7f08480ee070
aetech:1314:1369 [2] NCCL INFO init.cc:415 Mem Alloc Size 8 pointer 0x7f08480ee070
aetech:1314:1369 [2] NCCL INFO init.cc:418 Mem Alloc Size 56 pointer 0x7f08480f9c70
aetech:1314:1369 [2] NCCL INFO init.cc:419 Mem Alloc Size 4 pointer 0x7f08480f9cb0
aetech:1314:1369 [2] NCCL INFO init.cc:421 Mem Alloc Size 4 pointer 0x7f08480f9cd0
aetech:1314:1369 [2] NCCL INFO init.cc:425 Mem Alloc Size 4 pointer 0x7f08480f9cf0
aetech:1314:1369 [2] NCCL INFO NCCL_LAUNCH_MODE set by environment to PARALLEL
aetech:1311:1368 [0] NCCL INFO bootstrap.cc:456 Mem Alloc Size 48 pointer 0x7f8b100f5890
aetech:1315:1371 [3] NCCL INFO Connected all trees
aetech:1315:1371 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
aetech:1315:1371 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
aetech:1315:1371 [3] NCCL INFO graph/paths.cc:539 Mem Alloc Size 16 pointer 0x7f08180ec610
aetech:1315:1371 [3] NCCL INFO init.cc:415 Mem Alloc Size 8 pointer 0x7f08180ec610
aetech:1315:1371 [3] NCCL INFO init.cc:418 Mem Alloc Size 56 pointer 0x7f08180f6790
aetech:1315:1371 [3] NCCL INFO init.cc:419 Mem Alloc Size 4 pointer 0x7f08180f67d0
aetech:1315:1371 [3] NCCL INFO init.cc:421 Mem Alloc Size 4 pointer 0x7f08180f67f0
aetech:1315:1371 [3] NCCL INFO init.cc:425 Mem Alloc Size 4 pointer 0x7f08180f6810
aetech:1315:1371 [3] NCCL INFO NCCL_LAUNCH_MODE set by environment to PARALLEL
aetech:1315:1371 [3] NCCL INFO bootstrap.cc:456 Mem Alloc Size 48 pointer 0x7f08180f6830
aetech:1312:1370 [1] NCCL INFO init.cc:321 Cuda Alloc Size 16424 pointer 0x7f47efe01c00
aetech:1314:1369 [2] NCCL INFO init.cc:321 Cuda Alloc Size 16424 pointer 0x7f084fe01c00
aetech:1311:1368 [0] NCCL INFO init.cc:321 Cuda Alloc Size 16424 pointer 0x7f8bc7812000
aetech:1311:1368 [0] NCCL INFO comm 0x7f8b10002f70 rank 0 nranks 4 cudaDev 0 busId 19000 - Init COMPLETE
aetech:1311:1311 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x5608bfb1e460
aetech:1311:1311 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f8bd57ff400 recvbuff 0x7f8bd57ff400 count 1 datatype 0 op 0 root 0 comm 0x7f8b10002f70 [nranks=4] stream 0x5608c096c4c0
aetech:1311:1311 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x5608c0c334d0
aetech:1311:1311 [0] NCCL INFO include/utils.h:69 Mem Alloc Size 5944 pointer 0x5608c0c334f0
aetech:1311:1311 [0] NCCL INFO include/utils.h:74 Mem Alloc Size 5944 pointer 0x5608c0c34c30
aetech:1311:1311 [0] NCCL INFO Launch mode Parallel
aetech:1315:1371 [3] NCCL INFO init.cc:321 Cuda Alloc Size 16424 pointer 0x7f081fe01c00
aetech:1311:1311 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f8bd97d7400 recvbuff 0x7f8bd57ffc00 count 8 datatype 0 op 0 root 0 comm 0x7f8b10002f70 [nranks=4] stream 0x5608c096c4c0
aetech:1311:1311 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x5608c0c3cbc0
aetech:1312:1370 [1] NCCL INFO comm 0x7f47e8002f70 rank 1 nranks 4 cudaDev 1 busId 1a000 - Init COMPLETE
aetech:1314:1369 [2] NCCL INFO comm 0x7f0848002f70 rank 2 nranks 4 cudaDev 2 busId 67000 - Init COMPLETE
aetech:1312:1312 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55b4d60b1990
aetech:1314:1314 [2] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55c22e04ab20
aetech:1312:1312 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f47f37d7400 recvbuff 0x7f47f37d7400 count 1 datatype 0 op 0 root 0 comm 0x7f47e8002f70 [nranks=4] stream 0x55b4d6e038c0
aetech:1312:1312 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55b4d6ee7580
aetech:1314:1314 [2] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f08537d7400 recvbuff 0x7f08537d7400 count 1 datatype 0 op 0 root 0 comm 0x7f0848002f70 [nranks=4] stream 0x55c22e1cb7c0
aetech:1312:1312 [1] NCCL INFO include/utils.h:69 Mem Alloc Size 5944 pointer 0x55b4d6ee75a0
aetech:1314:1314 [2] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55c22ee809c0
aetech:1312:1312 [1] NCCL INFO include/utils.h:74 Mem Alloc Size 5944 pointer 0x55b4d6ee8ce0
aetech:1314:1314 [2] NCCL INFO include/utils.h:69 Mem Alloc Size 5944 pointer 0x55c22ee809e0
aetech:1314:1314 [2] NCCL INFO include/utils.h:74 Mem Alloc Size 5944 pointer 0x55c22ee82120
aetech:1315:1371 [3] NCCL INFO comm 0x7f0818002f70 rank 3 nranks 4 cudaDev 3 busId 68000 - Init COMPLETE
aetech:1315:1315 [3] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x560e2814bb60
aetech:1315:1315 [3] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f08237d7400 recvbuff 0x7f08237d7400 count 1 datatype 0 op 0 root 0 comm 0x7f0818002f70 [nranks=4] stream 0x560e28a4e100
aetech:1315:1315 [3] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x560e28f81990
aetech:1315:1315 [3] NCCL INFO include/utils.h:69 Mem Alloc Size 5944 pointer 0x560e28f819b0
aetech:1315:1315 [3] NCCL INFO include/utils.h:74 Mem Alloc Size 5944 pointer 0x560e28f830f0
aetech:1312:1312 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f47f37d7400 recvbuff 0x7f47f37d7e00 count 8 datatype 0 op 0 root 0 comm 0x7f47e8002f70 [nranks=4] stream 0x55b4d6e038c0
aetech:1312:1312 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55b4d78d0170
aetech:1314:1314 [2] NCCL INFO AllGather: opCount 0 sendbuff 0x7f08537d7400 recvbuff 0x7f08537d7e00 count 8 datatype 0 op 0 root 0 comm 0x7f0848002f70 [nranks=4] stream 0x55c22e1cb7c0
aetech:1314:1314 [2] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55c22f6f3df0
aetech:1315:1315 [3] NCCL INFO AllGather: opCount 0 sendbuff 0x7f08237d7400 recvbuff 0x7f08237d7e00 count 8 datatype 0 op 0 root 0 comm 0x7f0818002f70 [nranks=4] stream 0x560e28a4e100
aetech:1315:1315 [3] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x560e2996b430
aetech:1312:1312 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f47f37d7e00 recvbuff 0x7f47f37d7e00 count 7872 datatype 0 op 0 root 0 comm 0x7f47e8002f70 [nranks=4] stream 0x55b4d6e038c0
aetech:1312:1312 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55b4d6ef5320
aetech:1314:1314 [2] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f08537d7e00 recvbuff 0x7f08537d7e00 count 7872 datatype 0 op 0 root 0 comm 0x7f0848002f70 [nranks=4] stream 0x55c22e1cb7c0
aetech:1314:1314 [2] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55c22f8684f0
aetech:1315:1315 [3] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f08237d7e00 recvbuff 0x7f08237d7e00 count 7872 datatype 0 op 0 root 0 comm 0x7f0818002f70 [nranks=4] stream 0x560e28a4e100
aetech:1315:1315 [3] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x560e28f8f6e0
aetech:1311:1311 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f8bd75fd600 recvbuff 0x7f8bd75fd600 count 7872 datatype 0 op 0 root 0 comm 0x7f8b10002f70 [nranks=4] stream 0x5608c096c4c0
aetech:1311:1311 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x5608c1487000
aetech:1311:1311 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f8b2a000000 recvbuff 0x7f8b2a000000 count 104403100 datatype 0 op 0 root 0 comm 0x7f8b10002f70 [nranks=4] stream 0x5608c096c4c0
aetech:1311:1311 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x5608c0c334d0
aetech:1311:1311 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f8bd57ff400 recvbuff 0x7f8bd57ff400 count 616 datatype 0 op 0 root 0 comm 0x7f8b10002f70 [nranks=4] stream 0x5608c096c4c0
aetech:1311:1311 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x5608c0c3cef0
aetech:1315:1315 [3] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f07ea000000 recvbuff 0x7f07ea000000 count 104403100 datatype 0 op 0 root 0 comm 0x7f0818002f70 [nranks=4] stream 0x560e28a4e100
aetech:1315:1315 [3] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x560e2933d1d0
aetech:1312:1312 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f47b2000000 recvbuff 0x7f47b2000000 count 104403100 datatype 0 op 0 root 0 comm 0x7f47e8002f70 [nranks=4] stream 0x55b4d6e038c0
aetech:1312:1312 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55b4d77e6900
aetech:1315:1315 [3] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f08237d7400 recvbuff 0x7f08237d7400 count 616 datatype 0 op 0 root 0 comm 0x7f0818002f70 [nranks=4] stream 0x560e28a4e100
aetech:1315:1315 [3] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x560e28303c80
aetech:1312:1312 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f47f37d7400 recvbuff 0x7f47f37d7400 count 616 datatype 0 op 0 root 0 comm 0x7f47e8002f70 [nranks=4] stream 0x55b4d6e038c0
aetech:1312:1312 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55b4d6274340
aetech:1314:1314 [2] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f0812000000 recvbuff 0x7f0812000000 count 104403100 datatype 0 op 0 root 0 comm 0x7f0848002f70 [nranks=4] stream 0x55c22e1cb7c0
aetech:1314:1314 [2] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55c22f783d40
aetech:1314:1314 [2] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f08537d7400 recvbuff 0x7f08537d7400 count 616 datatype 0 op 0 root 0 comm 0x7f0848002f70 [nranks=4] stream 0x55c22e1cb7c0
aetech:1314:1314 [2] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55c22f59c190
aetech:1312:1312 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f47f37d7400 recvbuff 0x7f47f37d7400 count 1 datatype
1 op 0 root 0 comm 0x7f47e8002f70 [nranks=4] stream 0x55b4d6e038c0
aetech:1312:1312 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55b4d6262aa0
aetech:1314:1314 [2] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f08537d7400 recvbuff 0x7f08537d7400 count 1 datatype
1 op 0 root 0 comm 0x7f0848002f70 [nranks=4] stream 0x55c22e1cb7c0
aetech:1314:1314 [2] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55c22ee8a850
aetech:1315:1315 [3] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f08237d7400 recvbuff 0x7f08237d7400 count 1 datatype
1 op 0 root 0 comm 0x7f0818002f70 [nranks=4] stream 0x560e28a4e100
aetech:1315:1315 [3] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x560e28f94fc0
Traceback (most recent call last):
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/contextlib.py", line 119, in enter
return next(self.gen)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/utils/torch_utils.py", line
49, in torch_distributed_zero_first
dist.barrier(device_ids=[local_rank])
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py",
line 2791, in barrier
work.wait()
RuntimeError: Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/aetech/.config/Ultralytics/DDP/_temp_3c1zm9j8140252851494144.py", line 12, in
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/engine/trainer.py", line 199
, in train
self._do_train(world_size)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/engine/trainer.py", line 313
, in _do_train
self._setup_train(world_size)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/engine/trainer.py", line 277
, in _setup_train
self.train_loader = self.get_dataloader(self.trainset, batch_size=batch_size, rank=RANK, mode="train")
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/models/yolo/detect/train.py"
, line 48, in get_dataloader
with torch_distributed_zero_first(rank): # init dataset *.cache only once if DDP
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/contextlib.py", line 119, in enter
return next(self.gen)
KeyboardInterrupt
Traceback (most recent call last):
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/contextlib.py", line 119, in enter
return next(self.gen)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/utils/torch_utils.py", line
49, in torch_distributed_zero_first
dist.barrier(device_ids=[local_rank])
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py",
line 2791, in barrier
work.wait()
RuntimeError: Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/aetech/.config/Ultralytics/DDP/_temp_3c1zm9j8140252851494144.py", line 12, in
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/engine/trainer.py", line 199
, in train
self._do_train(world_size)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/engine/trainer.py", line 313
, in _do_train
self._setup_train(world_size)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/engine/trainer.py", line 277
, in _setup_train
self.train_loader = self.get_dataloader(self.trainset, batch_size=batch_size, rank=RANK, mode="train")
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/models/yolo/detect/train.py"
, line 48, in get_dataloader
with torch_distributed_zero_first(rank): # init dataset *.cache only once if DDP
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/contextlib.py", line 119, in enter
return next(self.gen)
KeyboardInterrupt
Traceback (most recent call last):
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/contextlib.py", line 119, in enter
return next(self.gen)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/utils/torch_utils.py", line
49, in torch_distributed_zero_first
dist.barrier(device_ids=[local_rank])
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py",
line 2791, in barrier
work.wait()
RuntimeError: Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/aetech/.config/Ultralytics/DDP/_temp_3c1zm9j8140252851494144.py", line 12, in
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/engine/trainer.py", line 199
, in train
self._do_train(world_size)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/engine/trainer.py", line 313
, in _do_train
self._setup_train(world_size)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/engine/trainer.py", line 277
, in _setup_train
self.train_loader = self.get_dataloader(self.trainset, batch_size=batch_size, rank=RANK, mode="train")
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/models/yolo/detect/train.py"
, line 48, in get_dataloader
with torch_distributed_zero_first(rank): # init dataset *.cache only once if DDP
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/contextlib.py", line 119, in enter
return next(self.gen)
KeyboardInterrupt
Traceback (most recent call last):
File "/home/aetech/.config/Ultralytics/DDP/_temp_wp_ynbyp140653279240144.py", line 12, in
results = trainer.train()
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/engine/trainer.py", line 199
, in train
self._do_train(world_size)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/engine/trainer.py", line 313
, in _do_train
self._setup_train(world_size)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/engine/trainer.py", line 277
, in _setup_train
self.train_loader = self.get_dataloader(self.trainset, batch_size=batch_size, rank=RANK, mode="train")
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/models/yolo/detect/train.py"
, line 48, in get_dataloader
with torch_distributed_zero_first(rank): # init dataset *.cache only once if DDP
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/contextlib.py", line 119, in enter
return next(self.gen)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/utils/torch_utils.py", line
49, in torch_distributed_zero_first
dist.barrier(device_ids=[local_rank])
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py",
line 2791, in barrier
work.wait()
RuntimeError: [Rank 3] Caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, TensorShape=[1], Timeout(ms)=10800000) ran for 10800233 milliseconds before timing out.
Traceback (most recent call last):
File "/home/aetech/.config/Ultralytics/DDP/_temp_wp_ynbyp140653279240144.py", line 12, in
results = trainer.train()
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/engine/trainer.py", line 199
, in train
self._do_train(world_size)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/engine/trainer.py", line 313
, in _do_train
self._setup_train(world_size)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/engine/trainer.py", line 277
, in _setup_train
self.train_loader = self.get_dataloader(self.trainset, batch_size=batch_size, rank=RANK, mode="train")
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/models/yolo/detect/train.py"
, line 48, in get_dataloader
with torch_distributed_zero_first(rank): # init dataset *.cache only once if DDP
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/contextlib.py", line 119, in enter
return next(self.gen)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/utils/torch_utils.py", line
49, in torch_distributed_zero_first
dist.barrier(device_ids=[local_rank])
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py",
line 2791, in barrier
work.wait()
RuntimeError: [Rank 2] Caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, TensorShape=[1], Timeout(ms)=10800000) ran for 10800258 milliseconds before timing out.
Traceback (most recent call last):
File "/home/aetech/.config/Ultralytics/DDP/_temp_wp_ynbyp140653279240144.py", line 12, in
results = trainer.train()
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/engine/trainer.py", line 199
, in train
self._do_train(world_size)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/engine/trainer.py", line 313
, in _do_train
self._setup_train(world_size)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/engine/trainer.py", line 277
, in _setup_train
self.train_loader = self.get_dataloader(self.trainset, batch_size=batch_size, rank=RANK, mode="train")
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/models/yolo/detect/train.py"
, line 48, in get_dataloader
with torch_distributed_zero_first(rank): # init dataset *.cache only once if DDP
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/contextlib.py", line 119, in enter
return next(self.gen)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/ultralytics/utils/torch_utils.py", line
49, in torch_distributed_zero_first
dist.barrier(device_ids=[local_rank])
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py",
line 2791, in barrier
work.wait()
RuntimeError: [Rank 1] Caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, TensorShape=[1], Timeout(ms)=10800000) ran for 10800863 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1311 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 1312) of binary: /home/aetech/anaconda3/envs/data_manager/bin/python
Traceback (most recent call last):
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/torch/distributed/run.py", line 765, in <module>
main()
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in
main
run(args)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in
run
elastic_launch(
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line
245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/aetech/.config/Ultralytics/DDP/_temp_wp_ynbyp140653279240144.py FAILED

Failures:
[1]:
time : 2024-05-20_11:48:20
host : aetech
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 1314)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-05-20_11:48:20
host : aetech
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 1315)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-05-20_11:48:20
host : aetech
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1312)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Traceback (most recent call last):
File "/home/aetech/PycharmProjects/torch/JSW_test/ultralytics/train.py", line 44, in
results = model.train(data='Objects365.yaml', epochs=epochs, imgsz=640, device=[0,1,2,3], batch=32)
File "/home/aetech/PycharmProjects/torch/JSW_test/ultralytics/ultralytics/engine/model.py", line 673, in train
self.trainer.train()
File "/home/aetech/PycharmProjects/torch/JSW_test/ultralytics/ultralytics/engine/trainer.py", line 194, in train
raise e
File "/home/aetech/PycharmProjects/torch/JSW_test/ultralytics/ultralytics/engine/trainer.py", line 192, in train
subprocess.run(cmd, check=True)
File "/home/aetech/anaconda3/envs/data_manager/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/home/aetech/anaconda3/envs/data_manager/bin/python', '-m', 'torch.distributed.run', '--nproc_per_node', '4', '--master_port', '58543', '/home/aetech/.config/Ultralytics/DDP/_temp_wp_ynbyp140653279240144.py']' returned non-zero exit status 1.

@Leo-aetech Leo-aetech added the question Further information is requested label May 20, 2024

👋 Hello @Leo-aetech, thank you for your interest in Ultralytics YOLOv8 🚀! We recommend a visit to the Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Join the vibrant Ultralytics Discord 🎧 community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.

Install

Pip install the ultralytics package including all requirements in a Python>=3.8 environment with PyTorch>=1.8.

pip install ultralytics

Environments

YOLOv8 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

Ultralytics CI

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher
Member

It looks like you're encountering a timeout issue with NCCL during your distributed training. This can sometimes be related to network issues or configurations when using multiple GPUs.
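
To check which timeout is actually in force, the limit can be read directly from the error text. The snippet below is a minimal, standard-library-only sketch; the helper name `parse_collective_timeout` and its regular expression are assumptions made for this illustration and are matched against the message format shown in the log above, not taken from PyTorch.

```python
import re
from datetime import timedelta

# Hypothetical helper (not part of PyTorch or Ultralytics): pull the configured
# limit and the elapsed time out of a "collective operation timeout" message.
_TIMEOUT_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\].*?OpType=(?P<op>\w+).*?"
    r"Timeout\(ms\)=(?P<limit>\d+)\) ran for (?P<elapsed>\d+) milliseconds"
)


def parse_collective_timeout(message):
    """Return rank, operation, limit and elapsed time, or None if no match."""
    match = _TIMEOUT_RE.search(message)
    if match is None:
        return None
    return {
        "rank": int(match["rank"]),
        "op": match["op"],
        "limit": timedelta(milliseconds=int(match["limit"])),
        "elapsed": timedelta(milliseconds=int(match["elapsed"])),
    }


# Message copied from the log in this issue.
line = (
    "RuntimeError: [Rank 3] Caught collective operation timeout: "
    "WorkNCCL(SeqNum=6, OpType=ALLREDUCE, TensorShape=[1], "
    "Timeout(ms)=10800000) ran for 10800233 milliseconds before timing out."
)
info = parse_collective_timeout(line)
requested = timedelta(seconds=28800)  # value exported as NCCL_TIMEOUT in the issue

print("limit in force:", info["limit"])
print("limit requested:", requested)
print("setting applied:", info["limit"] == requested)
```

For the message above, the limit in force is 3 hours while the requested value corresponds to 8 hours, which suggests the environment variable never reached the process group.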

Here are a few suggestions you might try:

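  1. Increase the Timeout: You've already set NCCL_TIMEOUT, but your log still reports Timeout(ms)=10800000 (3 hours) rather than the 28800 seconds you requested, so that variable does not appear to take effect. In PyTorch the collective timeout is set by the timeout argument of torch.distributed.init_process_group, so it has to be raised where the process group is created.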

  2. Reduce the Number of Workers: Try reducing the number of data loader workers (workers=0 is already set, which is good for debugging). This can sometimes alleviate timeout issues by reducing system load.

  3. Check Network Configuration: Ensure that your network configuration supports efficient GPU communication. For multi-GPU setups, proper networking is crucial.

  4. Update PyTorch and NCCL: Your log shows torch-1.12.1+cu116. If you can, update PyTorch and NCCL to recent versions, since bug fixes in newer releases might resolve the issue.

  5. Simplify the Task: Temporarily reduce the complexity of the task (e.g., use a smaller dataset or fewer epochs) to see if the issue persists. This can help isolate whether the problem is data-related or configuration-related.
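
The first suggestion can be sketched as follows. In PyTorch the collective timeout is passed as the `timeout` argument of `torch.distributed.init_process_group`. The helper names, the 8-hour default, and the `DDP_TIMEOUT_SECONDS` variable below are assumptions invented for this example; this is not how Ultralytics configures its process group, and where that call lives in your installed version is something you would need to check.

```python
import os
from datetime import timedelta


def process_group_kwargs(default_seconds=28800):
    """Build keyword arguments for torch.distributed.init_process_group.

    DDP_TIMEOUT_SECONDS is a hypothetical variable name used only in this
    sketch; it is not read by PyTorch, NCCL, or Ultralytics.
    """
    seconds = int(os.environ.get("DDP_TIMEOUT_SECONDS", default_seconds))
    return {"backend": "nccl", "timeout": timedelta(seconds=seconds)}


def init_distributed():
    """Create the process group with the timeout chosen above."""
    # Imported lazily so the helper above can be inspected without PyTorch.
    import torch.distributed as dist

    # Assumes RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT were exported by
    # the launcher (for example torch.distributed.run).
    dist.init_process_group(**process_group_kwargs())


kwargs = process_group_kwargs()
print("timeout that would be passed:", kwargs["timeout"])
```

As far as I know, on older PyTorch releases such as 1.12 the NCCL backend only enforces this timeout when NCCL_BLOCKING_WAIT or NCCL_ASYNC_ERROR_HANDLING is set to 1, so treat that pairing as something to verify for your version.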

If these suggestions don't resolve the issue, it might be helpful to provide more details about your setup (like the specific NCCL version and network topology) for further diagnosis.
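
One possible way to gather those details is a small script like the one below. It is only a sketch: `collect_ddp_report` is a hypothetical helper written for this reply, and it assumes that `torch.cuda.nccl.version()` is available in your PyTorch build when CUDA is usable.

```python
import os
import platform


def collect_ddp_report(environ=None):
    """Gather details that are often useful in a distributed-training report."""
    environ = os.environ if environ is None else environ
    report = {
        "python": platform.python_version(),
        # Every NCCL_* variable visible to this process.
        "nccl_env": {
            key: value
            for key, value in sorted(environ.items())
            if key.startswith("NCCL_")
        },
    }
    try:
        import torch
    except ImportError:
        report["torch"] = None  # PyTorch is not installed in this interpreter
        return report

    report["torch"] = torch.__version__
    report["cuda"] = torch.version.cuda
    report["gpu_count"] = torch.cuda.device_count()
    if torch.cuda.is_available():
        # NCCL version that this PyTorch build was compiled against.
        report["nccl"] = torch.cuda.nccl.version()
    return report


for key, value in collect_ddp_report().items():
    print(f"{key}: {value}")
```

For the topology, the output of `nvidia-smi topo -m` shows how the GPUs on the host are connected to each other.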
