log shapes + dtypes in Flight Recorder logs #126554

Closed
c-p-i-o opened this issue May 17, 2024 · 0 comments

c-p-i-o commented May 17, 2024

🐛 Describe the bug

Reported by @wconstab.

Newly added logs in a job show a mismatched dtype for an op, which affects the data size even though the sizes match. The dtype does not appear in the Flight Recorder log.
We need to either switch from logging shapes to sizes, or log shapes + dtypes in Flight Recorder logs.
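
To illustrate why matching shapes alone can hide the problem, here is a minimal sketch in plain Python. The record layout (`shape`, `dtype` keys) and the helper `payload_bytes` are hypothetical and are not the Flight Recorder format; only the element widths are standard.

```python
from math import prod

# Element widths in bytes for a few common dtypes.
DTYPE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2, "int64": 8}


def payload_bytes(shape, dtype):
    """Return the number of bytes a tensor of this shape and dtype occupies."""
    return prod(shape) * DTYPE_BYTES[dtype]


# Hypothetical per-rank view of the same collective: identical shapes,
# different dtypes.
rank0 = {"shape": [1024, 1024], "dtype": "float32"}
rank1 = {"shape": [1024, 1024], "dtype": "bfloat16"}

shapes_match = rank0["shape"] == rank1["shape"]
bytes0 = payload_bytes(rank0["shape"], rank0["dtype"])
bytes1 = payload_bytes(rank1["shape"], rank1["dtype"])

print(f"shapes match: {shapes_match}, bytes: {bytes0} vs {bytes1}")
```

A log that records only the shape would show these two ranks as agreeing, while the amount of data each rank expects to exchange differs by a factor of two.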


Versions

Collecting environment information...
PyTorch version: 2.4.0a0+git468f1d3
Is debug build: False
CUDA used to build PyTorch: 12.0
ROCM used to build PyTorch: N/A

OS: CentOS Stream 9 (x86_64)
GCC version: (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)
Clang version: Could not collect
CMake version: version 3.26.4
Libc version: glibc-2.34

Python version: 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.19.0-0_fbk12_hardened_11583_g0bef9520ca2b-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: 12.0.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100
GPU 1: NVIDIA H100
GPU 2: NVIDIA H100
GPU 3: NVIDIA H100

Nvidia driver version: 525.105.17
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: False

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 184
On-line CPU(s) list: 0-183
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9654 96-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 1
Core(s) per socket: 184
Socket(s): 1
Stepping: 1
BogoMIPS: 4792.78
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean pausefilter pfthreshold v_vmsave_vmload vgif avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm arch_capabilities
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 11.5 MiB (184 instances)
L1i cache: 11.5 MiB (184 instances)
L2 cache: 92 MiB (184 instances)
L3 cache: 2.9 GiB (184 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-183
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] flake8==6.1.0
[pip3] flake8-bugbear==23.3.23
[pip3] flake8-comprehensions==3.12.0
[pip3] flake8-executable==2.1.3
[pip3] flake8-logging-format==0.9.0
[pip3] flake8-pyi==23.3.1
[pip3] flake8-simplify==0.19.3
[pip3] mypy==1.8.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.0
[pip3] optree==0.10.0
[pip3] torch==2.4.0a0+git066ba81
[conda] blas 1.0 mkl
[conda] magma-cuda116 2.6.1 1 pytorch
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-include 2023.1.0 h06a4308_46344
[conda] mkl-service 2.4.0 py310h5eee18b_1
[conda] mkl_fft 1.3.8 py310h5eee18b_0
[conda] mkl_random 1.2.4 py310hdb19cb5_0
[conda] numpy 1.26.0 pypi_0 pypi
[conda] optree 0.10.0 pypi_0 pypi
[conda] torch 2.4.0a0+git066ba81 dev_0
[conda] torchfix 0.4.0 pypi_0 pypi
(/home/cpio/local/a/pytorch-env) [cpio@devvm17556.vll0 ~/local/pytorch/tools/flight_recorder (main)]$

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

@c-p-i-o c-p-i-o self-assigned this May 17, 2024
@c-p-i-o c-p-i-o changed the title from "log shapes + dtpyes in Flight Recorder logs" to "log shapes + dtypes in Flight Recorder logs" May 17, 2024
c-p-i-o added a commit that referenced this issue May 17, 2024
Summary:
Capture dtype in flight recorder.
Mismatched dtypes can lead to hangs.

Newly added logs in a job show a mismatched dtype for an op, which affects the
data size even though the sizes match. The dtype does not appear in the FR
log.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: ed4f620c204622624ead0d8f4e2d5afc8c09b8a6
Pull Request resolved: #126554

ghstack-source-id: 948a33f95169c03852a11f82078a4cde5dbd588d
Pull Request resolved: #126581
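
Once dtypes are captured alongside sizes, a mismatch of the kind described in the commit summary could be spotted by comparing per-rank records. The following is a rough sketch only: the entry fields (`rank`, `seq_id`, `op`, `sizes`, `dtype`) and the function `find_dtype_mismatches` are assumptions made for illustration and do not reflect the actual Flight Recorder dump schema or tooling.

```python
from collections import defaultdict


def find_dtype_mismatches(entries):
    """Group hypothetical per-rank records by sequence id and report
    collectives where ranks disagree on dtype."""
    by_seq = defaultdict(dict)
    for e in entries:
        by_seq[e["seq_id"]][e["rank"]] = (tuple(e["sizes"]), e["dtype"])

    mismatches = []
    for seq_id, per_rank in sorted(by_seq.items()):
        dtypes = {dtype for _, dtype in per_rank.values()}
        if len(dtypes) > 1:
            # Record which rank used which dtype for this collective.
            mismatches.append(
                (seq_id, {rank: dtype for rank, (_, dtype) in sorted(per_rank.items())})
            )
    return mismatches


# Illustrative records: sizes agree everywhere, dtype differs at seq_id 2.
entries = [
    {"rank": 0, "seq_id": 1, "op": "all_reduce", "sizes": [4, 4], "dtype": "float32"},
    {"rank": 1, "seq_id": 1, "op": "all_reduce", "sizes": [4, 4], "dtype": "float32"},
    {"rank": 0, "seq_id": 2, "op": "all_reduce", "sizes": [8], "dtype": "float32"},
    {"rank": 1, "seq_id": 2, "op": "all_reduce", "sizes": [8], "dtype": "bfloat16"},
]

mismatches = find_dtype_mismatches(entries)
print(mismatches)
```

Comparing on sizes alone would report nothing for these records, which is the gap the dtype field is meant to close.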
@mikaylagawarecki mikaylagawarecki added the oncall: distributed Add this issue/PR to distributed oncall triage queue label May 20, 2024