CUDA nvcc default stream per-thread doesn't seem to be working

I'm compiling some code with the nvcc flag --default-stream per-thread. However, it doesn't seem to be creating one stream per thread: when one thread is stuck waiting, all the others hang as well, which is the very problem I added the flag to fix. The code is compiled as a DLL, and that DLL is called from a C# program on many different threads created with Task.Factory.StartNew. Each thread is given an index and calls a function named Ordered, passing it that index. At the beginning of Ordered I print "Ordered start: [index]".

Within Ordered, after some memory management, a kernel is launched, and that kernel launches a second kernel. The second kernel contains a busy-wait loop like this:

printf("Index: %i, MsgNum: %i\n", index, MsgNum);

while (index > MsgNum) __nanosleep(100); //Wait until msg index equals next msg num

//do work

atomicAdd(&MsgNum, 1);
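For reference, the pattern above can be sketched as a small self-contained program. This is only an approximation of my setup, under several assumptions: std::thread stands in for the C# tasks, the Ordered function here is a hypothetical stand-in for the one exported from the DLL, the nested kernel launch is collapsed into a single kernel, and MsgNum is read through a volatile pointer so the loop re-reads it from memory on every iteration. It also names cudaStreamPerThread explicitly, so the launch uses the calling thread's own stream whether or not --default-stream per-thread took effect.

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Shared "next message" counter in device global memory (name taken from the post).
__device__ int MsgNum = 0;

// Single-thread kernel: wait for our turn, do the work, pass the turn on.
// Assumption: the volatile read is added in this sketch so the compiler
// cannot keep MsgNum in a register across loop iterations.
__global__ void ordered_kernel(int index)
{
    volatile int *turn = &MsgNum;
    printf("Index: %i, MsgNum: %i\n", index, *turn);

    while (index > *turn) __nanosleep(100);  // __nanosleep needs compute capability 7.0+

    // do work

    atomicAdd(&MsgNum, 1);
}

// Hypothetical stand-in for the DLL's exported Ordered function.
void Ordered(int index)
{
    printf("Ordered start: %i\n", index);

    // Launch into this host thread's own stream, independent of build flags.
    ordered_kernel<<<1, 1, 0, cudaStreamPerThread>>>(index);

    // Block only this host thread until its kernel has finished.
    cudaError_t err = cudaStreamSynchronize(cudaStreamPerThread);
    if (err != cudaSuccess)
        printf("Ordered %i failed: %s\n", index, cudaGetErrorString(err));
}

int main()
{
    const int kThreads = 10;  // same number of callers as in the output below
    std::vector<std::thread> workers;
    for (int i = 0; i < kThreads; ++i)
        workers.emplace_back(Ordered, i);
    for (auto &w : workers)
        w.join();
    return 0;
}
```

This can be built with something like nvcc -arch=sm_70 sketch.cu (the file name is arbitrary). Note that separate streams only allow kernels to overlap; whether they actually run concurrently is up to the driver and the device, so a sketch like this may still hang in the same way.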

If there is no need to wait, everything executes correctly. However, whenever one kernel has to wait, all other threads hang. Someone I consulted said that because all threads were using the default stream, one waiting kernel would block the others, so I should use a per-thread default stream. But the issue still occurs, as the program output below shows:

Ordered start: 0
Index: 0, MsgNum: 0
Ordered start: 1
Index: 1, MsgNum: 1
Ordered start: 2
Index: 2, MsgNum: 2
Ordered start: 6
Ordered start: 8
Ordered start: 5
Ordered start: 7
Ordered start: 9
Ordered start: 4
Ordered start: 3
Index: 9, MsgNum: 3

... and then the program hangs indefinitely.

I am on Windows 10 with an RTX 2070 Super and CUDA 11.0, and the program is compiled with Visual Studio. I found no option for the default stream in the project properties, so I added the flag to the command line box. I also tried adding it to the linker command line, even though that doesn't seem necessary. My compiler output looks like this:

1>------ Build started: Project: orderer_kernel, Configuration: Release x64 ------
1>Compiling CUDA source file kernel.cu…
1>
1>orderer_kernel>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_70,code="sm_70,compute_70" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28314\bin\HostX86\x64" -x cu -rdc=true -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static --default-stream per-thread -use_fast_math -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE -D_WINDLL -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Ox /Fdx64\Release\vc142.pdb /FS /Zi /MD " -o x64\Release\kernel.cu.obj "orderer_kernel\kernel.cu"
1>orderer_kernel/kernel.cu(21): warning C4005: '__CUDACC__': macro redefinition
1>orderer_kernel/kernel.cu(313): warning : function "main" cannot be declared in a linkage-specification
1>
1>orderer_kernel/kernel.cu(21): warning C4005: '__CUDACC__': macro redefinition
1>orderer_kernel/kernel.cu(313): warning : function "main" cannot be declared in a linkage-specification
1>
1>kernel.cu
1>orderer_kernel/kernel.cu(167): warning C4267: 'initializing': conversion from 'size_t' to 'long', possible loss of data
1>orderer_kernel/kernel.cu(168): warning C4267: 'initializing': conversion from 'size_t' to 'long', possible loss of data
1>orderer_kernel/kernel.cu(253): warning C4267: 'initializing': conversion from 'size_t' to 'long', possible loss of data
1>orderer_kernel/kernel.cu(254): warning C4267: 'initializing': conversion from 'size_t' to 'long', possible loss of data
1>Done building project “orderer_kernel.vcxproj”.
1>
1>orderer_kernel>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -dlink -o x64\Release\orderer_kernel.device-link.obj -Xcompiler "/EHsc /W3 /nologo /Ox /Zi /Fdx64\Release\vc142.pdb /MD " -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin/crt" -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\lib\x64" cudart_static.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib cudart.lib cudadevrt.lib --default-stream per-thread -gencode=arch=compute_70,code=sm_70 --machine 64 x64\Release\kernel.cu.obj
1>cudart_static.lib
1>kernel32.lib
1>user32.lib
1>gdi32.lib
1>winspool.lib
1>comdlg32.lib
1>advapi32.lib
1>shell32.lib
1>ole32.lib
1>oleaut32.lib
1>uuid.lib
1>odbc32.lib
1>odbccp32.lib
1>cudart.lib
1>cudadevrt.lib
1>kernel.cu.obj
1> Creating library orderer_kernel\x64\Release\orderer_kernel.lib and object orderer_kernel\x64\Release\orderer_kernel.exp
1>LINK : /LTCG specified but no code generation required; remove /LTCG from the link command line to improve linker performance
1>orderer_kernel.vcxproj -> orderer_kernel\x64\Release\orderer_kernel.dll
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========