When I run an LLM with tensor parallelism = 4, each thread runs a partial model on one device.
But sometimes some of the 4 threads hang in cudaLaunchKernel.
Here is the backtrace:
#0 futex_abstimed_wait (private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x557072a340cc) at ../sysdeps/nptl/futex-internal.h:284
#1 __pthread_rwlock_wrlock_full (abstime=0x0, clockid=0, rwlock=0x557072a340c0) at pthread_rwlock_common.c:731
#2 __GI___pthread_rwlock_wrlock (rwlock=0x557072a340c0) at pthread_rwlock_wrlock.c:27
#3 0x00007f2cf8d8315f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007f2cf8de4118 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#5 0x00007f2cfedbeac3 in ?? () from /usr/local/cuda/lib64/libcublasLt.so.12
#6 0x00007f2cfedbf3e9 in ?? () from /usr/local/cuda/lib64/libcublasLt.so.12
#7 0x00007f2cfedbf494 in ?? () from /usr/local/cuda/lib64/libcublasLt.so.12
#8 0x00007f2cfedbf624 in ?? () from /usr/local/cuda/lib64/libcublasLt.so.12
#9 0x00007f2cfed978ac in ?? () from /usr/local/cuda/lib64/libcublasLt.so.12
#10 0x00007f2cfeddd2f0 in ?? () from /usr/local/cuda/lib64/libcublasLt.so.12
#11 0x00007f2cfe157edc in ?? () from /usr/local/cuda/lib64/libcublasLt.so.12
#12 0x00007f2cfc8d2d8f in ?? () from /usr/local/cuda/lib64/libcublasLt.so.12
#13 0x00007f2cfbdfa471 in ?? () from /usr/local/cuda/lib64/libcublasLt.so.12
#14 0x00007f2cfbf0123b in ?? () from /usr/local/cuda/lib64/libcublasLt.so.12
#15 0x00007f2cfbf52fa1 in cublasLtTSTMatmul () from /usr/local/cuda/lib64/libcublasLt.so.12
#16 0x00007f2d1ee99d70 in ?? () from /usr/local/cuda/lib64/libcublas.so.12
#17 0x00007f2d1ee9c066 in ?? () from /usr/local/cuda/lib64/libcublas.so.12
#18 0x00007f2d1ee7522e in ?? () from /usr/local/cuda/lib64/libcublas.so.12
#19 0x00007f2d1ee77b50 in ?? () from /usr/local/cuda/lib64/libcublas.so.12
#20 0x00007f2d1ea073ed in cublasGemmStridedBatchedEx () from /usr/local/cuda/lib64/libcublas.so.12
Another thread:
(gdb) bt
#0 futex_abstimed_wait (private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x557072a340cc) at ../sysdeps/nptl/futex-internal.h:284
#1 __pthread_rwlock_wrlock_full (abstime=0x0, clockid=0, rwlock=0x557072a340c0) at pthread_rwlock_common.c:731
#2 __GI___pthread_rwlock_wrlock (rwlock=0x557072a340c0) at pthread_rwlock_wrlock.c:27
#3 0x00007f2cf8d8315f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007f2cf8de4118 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#5 0x00007f2d36a2f923 in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#6 0x00007f2d36a30070 in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#7 0x00007f2d36a300de in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#8 0x00007f2d36a32d47 in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#9 0x00007f2d36a0af5b in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#10 0x00007f2d36a66b2b in cudaLaunchKernel () from /usr/local/cuda/lib64/libcudart.so.12
#11 0x00007f2d3841d015 in void fastertransformer::invokeAddFusedQKVBiasTranspose<__nv_bfloat16>(__nv_bfloat16*, __nv_bfloat16*, __nv_bfloat16*, fastertransformer::PrefixPromptBatchWeightsParam<__nv_bfloat16>, __nv_bfloat16*, __nv_bfloat16 const*, int const*, int, int, int, int, int, int, int, int, float const*, int, CUstream_st*) ()
#12 0x00007f2d378019a0 in fastertransformer::GptContextAttentionLayer<__nv_bfloat16>::forward(fastertransformer::TensorMap*, fastertransformer::TensorMap*, fastertransformer::AttentionWeight<__nv_bfloat16, __nv_bfloat16> const*) ()
The hang is occasional.
Some devices' threads continue past the NCCL AllReduce kernel launch, while other devices hang before the NCCL kernel.
My hardware: 4 * H800
GPU driver: 535.54.03
CUDA: 12.2
NCCL: 2.18.5
And it only hangs at the first iteration.
Why does cudaLaunchKernel() in different threads try to acquire the same lock? In my code, each thread holds one and only one device and one stream, yet the backtraces show that 3 threads are waiting on the same lock.
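For reference, here is a minimal sketch of the threading model I described (not my real code, which is FasterTransformer-based; the dummy kernel just stands in for the real GEMM/attention kernels): one worker thread per GPU, each thread setting its device once and launching work only on its own non-blocking stream.

// Minimal sketch: one thread per GPU, each thread owning exactly one device
// and one stream. Error checking omitted for brevity.
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>
#include <vector>

__global__ void dummyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

void worker(int device) {
    cudaSetDevice(device);                                       // this thread owns exactly one device
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);   // ...and exactly one stream

    const int n = 1 << 20;
    float* buf = nullptr;
    cudaMalloc(&buf, n * sizeof(float));

    // All launches from this thread target its own stream; there is no
    // user-visible sharing between the threads at this level.
    dummyKernel<<<(n + 255) / 256, 256, 0, stream>>>(buf, n);
    cudaStreamSynchronize(stream);

    cudaFree(buf);
    cudaStreamDestroy(stream);
}

int main() {
    std::vector<std::thread> threads;
    for (int dev = 0; dev < 4; ++dev) threads.emplace_back(worker, dev);
    for (auto& t : threads) t.join();
    printf("done\n");
    return 0;
}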
The usual reason for locks in such situations is that there is a shared resource, and only one client is allowed to access the shared resource at a time.
I won’t be able to answer exactly why a shared resource might be needed in any particular kernel launch scenario, or what that shared resource may be.
The CUDA docs make reference to this possibility here:
Any CUDA API call may block or synchronize for various reasons such as contention for or unavailability of internal resources. Such behavior is subject to change and undocumented behavior should not be relied upon.
This presentation may also be of interest, in particular the discussion around slides 35-40 of the associated PDF (look for the PDF download link on that page after you have logged in with your developer credentials), although I’m not suggesting it applies directly to this case or gives detailed information about the lock behavior.
@Robert_Crovella Thank you very much for your reply. The current situation in my code can be simplified as follows:
thread/device 0: Gemm kernel → Gemm kernel (cudaLaunchKernel trying to acquire the lock)
thread/device 1: Gemm kernel → Gemm kernel (cudaLaunchKernel trying to acquire the lock)
thread/device 2: Gemm kernel → Gemm kernel → ncclAllReduce (executing on GPU) → Gemm kernel → Activation (cudaLaunchKernel blocked)
thread/device 3: Gemm kernel → Gemm kernel (cudaLaunchKernel trying to acquire the lock)
I think the reason is that thread 2’s cudaLaunchKernel is blocked waiting for a resource held by the NCCL kernel, while it still holds the lock shared between the 4 threads. The other 3 threads therefore cannot acquire the lock, and the NCCL kernel on device 2 cannot get data from the other 3 devices. A deadlock is formed.
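For clarity, here is a stripped-down sketch of that per-thread call sequence (placeholder kernels instead of the real cuBLAS GEMMs and FasterTransformer activation; this is not my production code). Normally this pattern completes fine; the hang happens when one thread’s later launch blocks inside the driver while the remaining threads have not yet managed to launch their ncclAllReduce.

// Sketch of the per-thread call sequence above, with placeholder kernels
// standing in for the real GEMM/activation kernels. One communicator and one
// stream per device; one worker thread per device.
#include <cuda_runtime.h>
#include <nccl.h>
#include <cstdio>
#include <thread>
#include <vector>

__global__ void fakeGemm(float* x, int n)       { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] *= 2.0f; }
__global__ void fakeActivation(float* x, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] = x[i] > 0.0f ? x[i] : 0.0f; }

void worker(int device, ncclComm_t comm, int n) {
    cudaSetDevice(device);
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    float* buf = nullptr;
    cudaMalloc(&buf, n * sizeof(float));
    cudaMemsetAsync(buf, 0, n * sizeof(float), stream);

    dim3 grid((n + 255) / 256), block(256);
    fakeGemm<<<grid, block, 0, stream>>>(buf, n);                   // "Gemm kernel"
    fakeGemm<<<grid, block, 0, stream>>>(buf, n);                   // "Gemm kernel" -- threads 0/1/3 hang in this launch
    ncclAllReduce(buf, buf, n, ncclFloat, ncclSum, comm, stream);   // tensor-parallel reduction
    fakeGemm<<<grid, block, 0, stream>>>(buf, n);                   // "Gemm kernel"
    fakeActivation<<<grid, block, 0, stream>>>(buf, n);             // "Activation" -- thread 2 hangs in this launch
    cudaStreamSynchronize(stream);

    cudaFree(buf);
    cudaStreamDestroy(stream);
}

int main() {
    const int ndev = 4, n = 1 << 20;
    int devs[ndev] = {0, 1, 2, 3};
    ncclComm_t comms[ndev];
    ncclCommInitAll(comms, ndev, devs);   // one communicator per device, single process

    std::vector<std::thread> threads;
    for (int d = 0; d < ndev; ++d) threads.emplace_back(worker, d, comms[d], n);
    for (auto& t : threads) t.join();
    for (int d = 0; d < ndev; ++d) ncclCommDestroy(comms[d]);
    printf("all threads finished\n");
    return 0;
}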
But I still have 2 questions:
- Why is thread 2’s cudaLaunchKernel blocked? Is it waiting for resources used by the NCCL kernel? What does cudaLaunchKernel need to prepare before dispatching the kernel to the stream? Or is the buffer of kernels queued for execution on the stream already full (see the sketch after this list)?
- My environment is CUDA 12.2 with driver 535.54, where there should be little or no internal contention. Or how can I avoid it in CUDA 12.2?
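Regarding the “buffer of kernels already full” possibility in my first question: my understanding is that the driver keeps a finite queue of pending kernel launches, and once it fills up, a nominally asynchronous launch call blocks until earlier kernels drain. A toy experiment like the following (illustrative only, not my actual code; the queue depth is an undocumented implementation detail) shows that behavior by timing back-to-back launches of a deliberately slow kernel.

// Time back-to-back launches of a slow kernel. Early launches return almost
// immediately; once the driver's finite queue of pending launches is full,
// each launch call blocks for roughly one kernel duration.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void slowKernel(float* x) {
    // Dependent floating-point chain so each kernel occupies the GPU for a few ms.
    float v = x[threadIdx.x];
    for (int i = 0; i < 10000000; ++i) v = v * 1.0000001f + 1e-7f;
    x[threadIdx.x] = v;
}

int main() {
    float* buf = nullptr;
    cudaMalloc(&buf, 256 * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    for (int i = 0; i < 2000; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        slowKernel<<<1, 256, 0, stream>>>(buf);   // nominally asynchronous
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms > 1.0)   // launch took >1 ms: it blocked instead of just enqueuing
            printf("launch %d blocked for %.2f ms\n", i, ms);
    }
    cudaStreamSynchronize(stream);
    cudaFree(buf);
    cudaStreamDestroy(stream);
    return 0;
}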
This is the backtrace of thread 2, which holds the lock:
(gdb) bt
#0 0x00007ffdbfffb794 in ?? ()
#1 0x00007ffdbfffb9ef in clock_gettime ()
#2 0x00007f2d363b40b5 in __GI___clock_gettime (clock_id=4, tp=0x7f29d7fe0610) at ../sysdeps/unix/sysv/linux/clock_gettime.c:38
#3 0x00007f2cf8dc270f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007f2cf8c9c6d6 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#5 0x00007f2cf8ffe106 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#6 0x00007f2cf8d99c37 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#7 0x00007f2cf8d82a4a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#8 0x00007f2cf8d830c3 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#9 0x00007f2cf8de4118 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#10 0x00007f2d36a2f923 in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#11 0x00007f2d36a30070 in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#12 0x00007f2d36a300de in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#13 0x00007f2d36a32d47 in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#14 0x00007f2d36a0af5b in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#15 0x00007f2d36a66b2b in cudaLaunchKernel () from /usr/local/cuda/lib64/libcudart.so.12
#16 0x00007f2d37a78cf2 in void fastertransformer::invokeGenericActivation<fastertransformer::SiluActivation, __nv_bfloat16, __nv_bfloat16>(__nv_bfloat16*, __nv_bfloat16 const*, __nv_bfloat16 const*, __nv_bfloat16 const*, int const*, __nv_bfloat16 const*, int, int, int, float const*, float const*, int const*, int, CUstream_st*) ()
I won’t be able to explain why locks are used in the CUDA runtime API, other than the very brief description I pointed out in the documentation already.
Certainly usage of locks by the CUDA runtime API designers is not intended to lead to a deadlock (that is, a permanent hang). It’s possible that the code you have constructed has a defect, or it’s possible that there is a defect in CUDA. There is no way for me to make any further statements based on what is presented here. I can only suggest possibilities. I’m unlikely to have any further comment here.