CUDA nvcc default stream per-thread doesn't seem to be working

cjoh354 · August 10, 2020, 5:09pm

I’m compiling some code with the flag --default-stream per-thread. However, it doesn’t seem to be creating one stream per thread: when one thread is stuck waiting, all the others hang as well, which is the problem I added the flag to fix. The code is compiled as a DLL, and that DLL is called from a C# program on many different threads created with Task.Factory.StartNew. Each thread is given an index and calls a function named Ordered, passing it that index. At the beginning of Ordered I print “Ordered start: [index]”.

Within Ordered, a kernel is launched (after some memory management), and that kernel launches a second kernel. The second kernel contains a waiting while loop like so:

printf("Index: %i, MsgNum: %i\n", index, MsgNum);

while (index > MsgNum) __nanosleep(100); //Wait until msg index equals next msg num

//do work

atomicAdd(&MsgNum, 1);
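
For reference, the wait loop above can be written out as a self-contained sketch. The kernel name orderedStep, the Order array, and the host driver runOrderedDemo are hypothetical and not taken from my project. The one deliberate change to the loop is that MsgNum is read through a volatile pointer, on the assumption that a plain read could otherwise be kept in a register across iterations. __nanosleep needs compute capability 7.0 or later.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kCount = 4;        // hypothetical number of messages

__device__ int MsgNum = 0;       // next message number allowed to proceed
__device__ int Order[kCount];    // hypothetical log: MsgNum value seen by each index

__global__ void orderedStep(int index)
{
    // Volatile read so every loop iteration reloads MsgNum from memory.
    volatile int *next = &MsgNum;
    while (index > *next) __nanosleep(100);  // wait until it is this index's turn

    Order[index] = *next;        // stand-in for the real work
    __threadfence();             // make the write visible before releasing the next index
    atomicAdd(&MsgNum, 1);
}

// Returns the number of indices that ran out of turn, or -1 on a CUDA error.
int runOrderedDemo()
{
    int zero = 0;
    cudaMemcpyToSymbol(MsgNum, &zero, sizeof(zero));

    cudaStream_t streams[kCount];
    for (int i = 0; i < kCount; ++i)
        cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);

    // Ascending launch order: this completes even if the kernels run one at a time.
    for (int i = 0; i < kCount; ++i)
        orderedStep<<<1, 1, 0, streams[i]>>>(i);

    if (cudaDeviceSynchronize() != cudaSuccess) return -1;

    int seen[kCount];
    cudaMemcpyFromSymbol(seen, Order, sizeof(seen));

    int mismatches = 0;
    for (int i = 0; i < kCount; ++i) {
        printf("Index: %d ran when MsgNum was %d\n", i, seen[i]);
        if (seen[i] != i) ++mismatches;
    }
    for (int i = 0; i < kCount; ++i) cudaStreamDestroy(streams[i]);
    return mismatches;
}

int main()
{
    return runOrderedDemo() == 0 ? 0 : 1;
}
```

The kernels are launched in ascending index order so the sketch finishes even if the GPU runs them one after another. Launching them out of order relies on the kernels actually being resident at the same time, which separate streams permit but do not guarantee.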

If there is no need to wait, everything executes correctly. However, whenever there is a wait, all other threads hang. I consulted someone on this and they said since all threads were using the default stream, one waiting would block the others, so default stream per thread should be used. But the issue still occurs, which can be seen in the program output below:

Ordered start: 0
Index: 0, MsgNum: 0
Ordered start: 1
Index: 1, MsgNum: 1
Ordered start: 2
Index: 2, MsgNum: 2
Ordered start: 6
Ordered start: 8
Ordered start: 5
Ordered start: 7
Ordered start: 9
Ordered start: 4
Ordered start: 3
Index: 9, MsgNum: 3

… And then the program hangs indefinitely.
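
For comparison, here is a minimal sketch of what one stream per host thread looks like when the stream is named explicitly instead of relying on the compile flag. It uses the cudaStreamPerThread handle from the CUDA runtime; the kernel work, the function ordered, and the driver runPerThreadDemo are hypothetical stand-ins, and std::thread takes the place of the C# tasks.

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 4;      // hypothetical number of host threads

// Hypothetical stand-in for the work done inside Ordered.
__global__ void work(int index, int *out)
{
    out[index] = index * 2;
}

// Hypothetical host entry point, called once per host thread.
void ordered(int index, int *d_out)
{
    // cudaStreamPerThread names the calling thread's own default stream,
    // independent of how the file was compiled.
    work<<<1, 1, 0, cudaStreamPerThread>>>(index, d_out);
    // Wait only for this thread's stream, not for the whole device.
    cudaStreamSynchronize(cudaStreamPerThread);
}

// Returns the number of wrong results, or -1 on a CUDA error.
int runPerThreadDemo()
{
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, kThreads * sizeof(int)) != cudaSuccess) return -1;

    std::vector<std::thread> threads;
    for (int i = 0; i < kThreads; ++i)
        threads.emplace_back(ordered, i, d_out);
    for (auto &t : threads) t.join();

    int h_out[kThreads];
    if (cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost) != cudaSuccess) {
        cudaFree(d_out);
        return -1;
    }

    int wrong = 0;
    for (int i = 0; i < kThreads; ++i) {
        printf("out[%d] = %d\n", i, h_out[i]);
        if (h_out[i] != i * 2) ++wrong;
    }
    cudaFree(d_out);
    return wrong;
}

int main()
{
    return runPerThreadDemo() == 0 ? 0 : 1;
}
```

Each host thread here synchronizes only its own stream, so a slow launch on one thread does not by itself stall the host side of the others. Whether the kernels overlap on the device is a separate question that the stream choice alone does not settle.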

I am on Windows 10 with an RTX 2070 Super and CUDA 11.0, and the program is compiled using Visual Studio. I found no option in the properties menu for the default stream, so I added the flag to the command line box. I also tried putting it in the linker command line, even though that doesn’t seem necessary. My compiler output looks like this:

1>------ Build started: Project: orderer_kernel, Configuration: Release x64 ------
1>Compiling CUDA source file kernel.cu...
1>
1>orderer_kernel>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_70,code="sm_70,compute_70" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28314\bin\HostX86\x64" -x cu -rdc=true -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static --default-stream per-thread -use_fast_math -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE -D_WINDLL -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Ox /Fdx64\Release\vc142.pdb /FS /Zi /MD " -o x64\Release\kernel.cu.obj "orderer_kernel\kernel.cu"
1>orderer_kernel/kernel.cu(21): warning C4005: '__CUDACC__': macro redefinition
1>orderer_kernel/kernel.cu(313): warning : function "main" cannot be declared in a linkage-specification
1>
1>orderer_kernel/kernel.cu(21): warning C4005: '__CUDACC__': macro redefinition
1>orderer_kernel/kernel.cu(313): warning : function "main" cannot be declared in a linkage-specification
1>
1>kernel.cu
1>orderer_kernel/kernel.cu(167): warning C4267: 'initializing': conversion from 'size_t' to 'long', possible loss of data
1>orderer_kernel/kernel.cu(168): warning C4267: 'initializing': conversion from 'size_t' to 'long', possible loss of data
1>orderer_kernel/kernel.cu(253): warning C4267: 'initializing': conversion from 'size_t' to 'long', possible loss of data
1>orderer_kernel/kernel.cu(254): warning C4267: 'initializing': conversion from 'size_t' to 'long', possible loss of data
1>Done building project "orderer_kernel.vcxproj".
1>
1>orderer_kernel>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -dlink -o x64\Release\orderer_kernel.device-link.obj -Xcompiler "/EHsc /W3 /nologo /Ox /Zi /Fdx64\Release\vc142.pdb /MD " -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin/crt" -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\lib\x64" cudart_static.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib cudart.lib cudadevrt.lib --default-stream per-thread -gencode=arch=compute_70,code=sm_70 --machine 64 x64\Release\kernel.cu.obj
1>cudart_static.lib
1>kernel32.lib
1>user32.lib
1>gdi32.lib
1>winspool.lib
1>comdlg32.lib
1>advapi32.lib
1>shell32.lib
1>ole32.lib
1>oleaut32.lib
1>uuid.lib
1>odbc32.lib
1>odbccp32.lib
1>cudart.lib
1>cudadevrt.lib
1>kernel.cu.obj
1> Creating library orderer_kernel\x64\Release\orderer_kernel.lib and object orderer_kernel\x64\Release\orderer_kernel.exp
1>LINK : /LTCG specified but no code generation required; remove /LTCG from the link command line to improve linker performance
1>orderer_kernel.vcxproj -> orderer_kernel\x64\Release\orderer_kernel.dll
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========
