Hi.
Thanks a lot for the extensive reply.
I saw the LaunchFailed
only now (see below). I have to address this one.
Sorry, but not allowed to post any code. I know … it would make it easier for me as well. I will work myself through all the points of your post above. Very useful!
# /usr/local/cuda-11.1/bin/ncu --section InstructionStats -s 1 -c 1 test_cuda
==PROF== Profiling "kernel_2" - 1 of 1: 0%....50%....100% - 2 passes
==ERROR== Error: LaunchFailed
...
==PROF== Disconnected from process 12114
==ERROR== An error occurred while trying to profile.
[12114] test_cuda@127.0.0.1
...
---------------------------------------------------------------------- --------------- ------------------------------
Avg. Executed Instructions Per Scheduler (!) n/a
Executed Instructions (!) n/a
Avg. Issued Instructions Per Scheduler (!) n/a
Issued Instructions (!) n/a
---------------------------------------------------------------------- --------------- ------------------------------
Is it possible that this is due to a coding error?
# /usr/local/cuda-11.1/bin/compute-sanitizer test_cuda
========= COMPUTE-SANITIZER
...
========= ERROR SUMMARY: 0 errors
Regarding LaunchFailed
…
http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/online/group__CUDART__TYPES_g3f51e3575c2178246db0a94a430e0038.html
cudaErrorLaunchFailure An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory. The device cannot be used until cudaThreadExit() is called. All existing device memory allocations are invalid and must be reconstructed if the program is to continue using CUDA.
Fishing in the dark … :(
… later …
I have more info.
It seems that all turns around these values.
dim3 blocks (9*1024,1,1);
dim3 threads (1024);
If I change the block to 8*1024
, it always works.
If I change the block to 10*1024
, it always fails.
9*1024
sometime works, sometimes fails.
This must be somehow an indication of a very specific problem.
I checked the code up and down … didn’t find anything wrong.
This must be CUDA related.
A passed run looks like this:
# /usr/local/cuda-11.1/bin/ncu --section InstructionStats -s 1 -c 1 test_cuda
...
==PROF== Profiling "kernel_test" - 1 of 1: 0%....50%....100% - 3 passes
...
==PROF== Disconnected from process 34479
[34479] test_cuda@127.0.0.1
kernel_test(curandStateXORWOW*, int, res_struct*, int*, unsigned long long*), 2020-Nov-27 15:33:33, Context 1, Stream 7
Section: Instruction Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Avg. Executed Instructions Per Scheduler inst 80.148.480
Executed Instructions inst 15.388.508.160
Avg. Issued Instructions Per Scheduler inst 80.148.553,89
Issued Instructions inst 15.388.522.347
---------------------------------------------------------------------- --------------- ------------------------------