CUDA Dynamic Parallelism API and Principles

Originally published at: https://developer.nvidia.com/blog/cuda-dynamic-parallelism-api-principles/

This post is the second in a series on CUDA Dynamic Parallelism. In my first post, I introduced Dynamic Parallelism by using it to compute images of the Mandelbrot set using recursive subdivision, resulting in large increases in performance and efficiency. This post is an in-depth tutorial on the ins and outs of programming with…

Hello, Andrew.

Thanks for your post; it is hard to find good information about CDP.

I have some questions about error checking in CDP. I have a fairly complex CDP code, and I face some problems when I launch a large number of kernels.

Sometimes I get error number 11, but sometimes I just receive a wrong answer and no error at all. If in CUDA 7.5 the pool is a virtualized pool that can handle a large number of kernels, why don't I get an error when I receive a wrong answer? Sometimes I receive a good answer, but not all blocks were processed.

All the best,

Tiago Carneiro.

Hello Tiago,
>> thanks for your post, it is hard to find good information about CDP.
Agree with you on this! As a first step, we gave a talk at this year's GTC on dynamic parallelism. Even though the major focus was on its performance characteristics, the end of the talk also contains some information on error handling. Hopefully that information will be useful to you. The talk is here: https://registration.gputec.... The recording and slides will be up soon.

>> Sometimes I can get the error number 11, but, sometimes, I just receive wrong answer and no error.
Yes. As of today, error handling on the device side with dynamic parallelism is difficult, so the debug strategy depends on the cause of the error:
1. If the error is caused by an out-of-bounds access, you can use the combination of 'nvcc -lineinfo' compilation and 'cuda-memcheck' to pinpoint the source of the error.
2. If the error is suspected to come from CUDA calls on the device side, AFAIK there is currently no easy way to pass this information to the host. Meaning, cudaGetLastError on the host side after the parent kernel completes will not catch it! Hence, I've been using 'cudaGetLastError' on the device side plus device-side printfs to debug such errors. I believe this warrants an nvbug, but I've not found the time plus a real use case to file one.
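To illustrate the device-side check described in point 2, here is a minimal sketch; the kernel names and launch sizes are illustrative, not from the original discussion:

```cuda
// Sketch: catching child-launch errors on the device, since the host
// cannot see them after the parent kernel returns.
// Build with: nvcc -arch=sm_35 -rdc=true sketch.cu -lcudadevrt
#include <cstdio>

__global__ void childKernel(int *data)
{
    data[threadIdx.x] *= 2;
}

__global__ void parentKernel(int *data)
{
    if (threadIdx.x == 0) {
        childKernel<<<1, 32>>>(data);
        // cudaGetLastError() is available in device code via the device
        // runtime; it catches launch failures (e.g. an exhausted
        // pending-launch pool) that the host never sees.
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            printf("block %d: child launch failed: %s\n",
                   blockIdx.x, cudaGetErrorString(err));
        }
    }
}
```

Pairing this check with a device-side printf at each launch site narrows down which launch failed, which is exactly the strategy described above.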

Finally, if you can give us a mini-app demonstrating these issues, that will be of tremendous help in making a case to fix and improve device-side error handling!

Regards,
Thejaswi

Hello, Thejaswi.

Thanks for sending me the GTC talk, it will be very useful, and thank you for your explanation, it helps me a lot.

Please, tell me, how can I send you the app with the different error situations?

Regards,

Tiago Carneiro.

Hi Tiago,
You can send us the app in either of two ways:
1. The recommended approach is to become a registered developer [https://developer.nvidia.co...] and then file a bug from the portal, attaching the app. That way you can follow up on the progress.
2. You can contact me through s n a n dit ale AT nvidia.com.
Kindly also provide a README describing the steps to build and run your app, so we can reproduce the issues.
Regards,
Thejaswi

Hello Thejaswi,

Thank you for your help!

No problem. I'm preparing a version of the program with a Makefile and a README to reproduce the issue, and I'll send it to you soon.

I've been a registered NVIDIA developer since 2012, but I didn't know about this way of filing bugs. That's great, I'll do it too.

All the best,

Tiago Carneiro.

PS: Is the correct contact s n a n dit ale or s n a n dot ale?

Hi Tiago, it's "dit".

Dear Thejaswi Rao,
Thank you for your previous support.

Now I'm facing new CDP challenges. I'm trying to write a CDP code that allocates memory dynamically, and I'm facing the following problem.

Even if I set cudaLimitMallocHeapSize large enough, as the problem size grows the application just returns "CUDA error: an illegal memory access was encountered".

Doing some debugging, I've found that the problem is with the allocations. It looks like if I perform a huge number of allocations on the device side, the GPU returns this error.

It seems that even if I set cudaLimitMallocHeapSize large enough, CUDA may be unable to perform a huge number of allocations.

Have you ever faced this situation?

Best regards,

Tiago Carneiro.
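One relevant detail here: device-side malloc draws from the fixed heap configured by cudaLimitMallocHeapSize and, when that heap is exhausted (or fragmented), it returns NULL rather than raising an error; dereferencing an unchecked pointer then surfaces later as exactly the "illegal memory access" message above. A minimal sketch of the defensive pattern (kernel name and sizes are illustrative):

```cuda
// Sketch: device-side malloc returns NULL on heap exhaustion; checking
// the pointer turns a delayed "illegal memory access" into a clear report.
// Build with: nvcc -arch=sm_35 -rdc=true sketch.cu -lcudadevrt
#include <cstdio>

__global__ void allocatingKernel(size_t nbytes)
{
    int *buf = (int *)malloc(nbytes);
    if (buf == NULL) {
        // Report and bail out instead of crashing with an illegal access.
        printf("thread %d: device malloc of %zu bytes failed\n",
               threadIdx.x, nbytes);
        return;
    }
    buf[0] = 42;   // safe only after the NULL check
    free(buf);
}

int main()
{
    // Enlarge the device heap from the host before the first launch.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256u * 1024u * 1024u);
    allocatingKernel<<<1, 32>>>(1024);
    cudaDeviceSynchronize();
    return 0;
}
```

Note that many small allocations can fail from fragmentation or per-allocation overhead even when the total heap size looks sufficient, which may explain failures that grow with problem size.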

Hi Tiago. Can you provide an as-simple-as-possible program that reproduces the problem somewhere (e.g. on Github, or as a Gist)?

Dear Mark,

thank you for your answer!

Yes, I can. I'm going to prepare a simple code.

All the best,

Tiago Carneiro.

Hello Andrew,
I tried to implement a simple dynamic parallelism program from the latest CUDA Programming Guide.
I am facing this error:
error: calling a __global__ function("childKernel") from a __global__ function("parentKernel") is only allowed on the compute_35 architecture or above

But my GPU is a Tesla K40c with compute capability 3.5.
I used the CUDA 8.0 toolkit.
The rest of my CUDA programs run fine.
Are there any initial conditions I should set for dynamic parallelism?
Thank you.

Make sure you compile with at least the "-arch=sm_35" flag to NVCC. The default is sm_20, which doesn't support dynamic parallelism.
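To illustrate this advice, a minimal dynamic-parallelism program and the flags it needs might look like the following; the file and kernel names are illustrative:

```cuda
// Minimal dynamic-parallelism test for checking the toolchain setup.
// Build with relocatable device code and the device runtime library, e.g.:
//   nvcc -arch=sm_35 -rdc=true test.cu -lcudadevrt -o test
#include <cstdio>

__global__ void childKernel()
{
    printf("child: block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

__global__ void parentKernel()
{
    childKernel<<<2, 2>>>();       // launch from the device
    cudaDeviceSynchronize();       // wait for the child before returning
}

int main()
{
    parentKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

In Visual Studio the same three ingredients map to project settings: code generation compute_35,sm_35 or higher, "Generate Relocatable Device Code: Yes (-rdc=true)", and cudadevrt.lib in the linker inputs.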

Hello Mark,
I use Visual Studio 2013, on a Windows 10 system.
I changed the Code Generation option in the CUDA C/C++ Device section of the configuration properties from "compute_20,sm_20" to "compute_35,sm_35" (I did this for all configurations and all platforms).
I am getting the following error -
"kernel launch from __device__ or __global__ functions requires separate compilation mode"
Thank you for your valuable help,
Ameya

I was able to resolve that error by adding
"nvcc --gpu-architecture=sm_35 --device-c"
to the command line.
But unfortunately it led to the following error:
error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvcc.exe" -gencode=arch=compute_35,code=\"sm_35,compute_35\" --use-local-env --cl-version 2013 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include" -G --keep-dir Debug -maxrregcount=0 --machine 32 --compile -cudart static nvcc --gpu-architecture=sm_35 --device-c kernel.cu -g -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /FS /Zi /RTC1 /MDd " -o Debug\kernel.cu.obj "c:\Users\awadekar\documents\visual studio 2013\Projects\Dynamic Parallelism 3\Dynamic Parallelism 3\kernel.cu"" exited with code 1.

Hello Mark,
I am still unable to implement dynamic parallelism on my system (I now use Visual Studio 2015).
I am getting the following error (screenshot uploaded):
https://uploads.disquscdn.c...
I am unable to understand the reason for this error.
I would really appreciate if you could clarify.
Thank you,
Ameya.

Hi Ameya, can you post the code you are trying to compile and the full command line somewhere so we can try to reproduce? A GitHub Gist is one way to do this. https://gist.github.com

Dear Ameya Wadekar, according to your screenshot, I think the following flags are missing: "-lcudadevrt -rdc=true".

Best regards,

Tiago Carneiro.

Dear Mark Harris,

I have a series of questions concerning the issues that launching a huge number of kernels may cause.

I performed some experiments in this direction. These experiments launch a huge number of kernels (more than 2048), and I observed that the following issues sometimes happen.

EI) The program halts, cudaGetLastError() on the host returns "an illegal memory access was encountered", and nvprof prints the following message: "==18406== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion."

EII) The program does not halt, takes much longer (from 10x to 100x longer), and returns a wrong answer. Neither the host nor nvprof reports an error. In this situation, nvprof returns no information about the kernels launched by CDP, although it does report the kernel launched by the host. According to a global counter, CDP launches all the kernels it is supposed to launch.

We have noticed that EI) and EII) are more likely to happen when CDP kernels dynamically allocate memory.

Some remarks:

1) If we set the heap (cudaLimitMallocHeapSize) to a big size, e.g., 65% of global memory, problems EI) and EII) happen much more often.

2) Another interesting observation is that the error in EI) is the same error we get when the heap is not big enough.

3) The device never gets an error from cudaGetLastError().

Now, I have the following questions:

q1) Can the device get any error by calling cudaGetLastError()?

q2) Do the errors that cudaGetLastError(), nvprof, and cuda-memcheck return in CDP applications make sense, or are they just random errors?

q3) As said before, errors happen more often for bigger heaps. Does the device need memory beyond what cudaLimitDevRuntimePendingLaunchCount and cudaLimitDevRuntimeSyncDepth reserve to manage the kernels that CDP launches?

If the answer to q3) is yes, is that why errors EI) and EII) happen more often when we set a big heap?

All the best,

Tiago Carneiro.
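For readers following these questions: the device-runtime limits mentioned above are all configured from the host before the first kernel launch, and each one reserves its own slice of device memory. A sketch of how they are set and queried (the sizes are illustrative, not recommendations):

```cuda
// Sketch: configuring the device-runtime limits discussed in the thread.
// Each limit carves out device memory, so a very large heap plus large
// launch buffers can compete for the same global memory.
#include <cstdio>

int main()
{
    // Space for device-side malloc/free.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256u * 1024u * 1024u);

    // Buffer for device-launched kernels that have not yet begun executing.
    // If it overflows, launches fail with errors visible only on the device.
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);

    // Nesting depth at which device-side cudaDeviceSynchronize() may be
    // called; each reserved level costs additional device memory.
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);

    size_t value;
    cudaDeviceGetLimit(&value, cudaLimitDevRuntimePendingLaunchCount);
    printf("pending launch count: %zu\n", value);
    return 0;
}
```

Querying the limits back with cudaDeviceGetLimit is a quick way to confirm the runtime actually accepted the requested sizes.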

Dear Tiago,
I was working in Visual Studio, so I couldn't figure it out there.
But now I have the code working.
Thank you,
Ameya