Dynamic SM with Dynamic Parallelism

Is it possible to increase the dynamic shared memory limit (cudaFuncAttributeMaxDynamicSharedMemorySize) for kernels launched with Dynamic Parallelism? My kernel runs fine with <= 48K of dynamically allocated shared memory under Dynamic Parallelism, and I can also launch the kernel directly from host code with > 48K, but I was not able to call a kernel that needs > 48K of shared memory from another kernel.
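To illustrate the pattern I mean, here is a minimal sketch (kernel names, sizes, and bodies are placeholders, not my actual code; assumes CC 7.5 and compilation with -rdc=true for the device-side launch):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Child kernel that wants more than 48 KB of dynamic shared memory.
__global__ void childKernel(float *data)
{
    extern __shared__ float smem[];  // size given at launch, e.g. 58800 bytes
    smem[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    data[threadIdx.x] = smem[threadIdx.x];
}

// Parent kernel: the same child launched from here with > 48 KB fails,
// apparently because cudaFuncSetAttribute is a host-only API and the
// opt-in does not seem to apply to device-side launches.
__global__ void parentKernel(float *data)
{
    childKernel<<<1, 256, 58800>>>(data);  // fails with > 48 KB
}

int main()
{
    float *data;
    cudaMalloc(&data, 256 * sizeof(float));

    // Host-side opt-in: after this, a DIRECT host launch with 58800 bytes works.
    cudaFuncSetAttribute(childKernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, 58800);
    childKernel<<<1, 256, 58800>>>(data);  // OK from the host
    parentKernel<<<1, 1>>>(data);          // fails via Dynamic Parallelism
    printf("last error: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
    cudaFree(data);
    return 0;
}
```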

Code links:
Calling the kernel from host code (works with shared memory > 48K):

Kernel itself: src/TileProcessor.cuh · lwir16 · Elphel / tile_processor_gpu · GitLab

Attempt to use Dynamic Parallelism (works with <= 48K, fails with > 48K):
Host code: test_tp.cu:1674 (same file, link removed to fit into 3 links allowed)

Outer kernel called from host:
TileProcessor.cuh:2796 (same file, link removed to fit into 3 links allowed)

Inner kernel call: src/TileProcessor.cuh · lwir16 · Elphel / tile_processor_gpu · GitLab

Andrey

I don’t believe it is possible. If this is an important feature for you, you may wish to file a bug, requesting an enhancement…

Robert, thank you for your reply. I am not an experienced CUDA developer, so I am first trying to understand whether this functionality is not yet implemented or I am just doing something wrong. I think it is an important feature, and not just for me, especially for devices with later compute capabilities that have larger shared memory. In my case (7.5), I was just unlucky to land between 48K (available with statically allocated shared memory) and 64K, the maximum total shared memory. Later devices have more shared memory but the same 48K limit for static allocation.

I find DP to be a very convenient feature. Among its other advantages, I enjoy chaining multiple kernels so that each one starts only after the previous ones have finished, with each kernel having a different grid configuration. I can then expose just the top-level kernel to the host, which in turn calls all the others. It is especially convenient to launch kernels from Java (with jcuda) in a single call. The kernel itself is already tested from inside the convenient Nsight environment; that debug capability is unavailable in Java, where otherwise I would have to replicate the top kernel's functionality.
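The chaining pattern I mean looks roughly like this (stage names, bodies, and grid sizes are invented for illustration; assumes -rdc=true). Device-side launches from the same thread into the same default stream execute in order, so each stage starts only after the previous one completes:

```cuda
// Placeholder stage kernels with different per-element work.
__global__ void stage1(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

__global__ void stage2(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

// The only kernel exposed to the host (launched in a single call, e.g. from
// jcuda, as topLevelKernel<<<1, 1>>>(buf, n)). Each stage gets its own grid
// configuration; same-stream ordering serializes stage1 before stage2.
__global__ void topLevelKernel(float *buf, int n)
{
    stage1<<<(n + 255) / 256, 256>>>(buf, n);
    stage2<<<(n + 63) / 64, 64>>>(buf, n);
}
```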

Andrey
PS. Line numbers in my earlier code links changed because it is the branch I’m currently working on.

Hello Robert,

I followed your advice and filed a bug report: https://developer.nvidia.com/nvidia_bug/3503453. It all went well at first, but since 01/24 I have not been able to post my reply. I have tried many times, from different computers and browsers, and I keep getting the same error message:
An AJAX HTTP error occurred.
HTTP Result Code: 500
Debugging information follows.
Path: /system/ajax
StatusText: Internal Server Error
ResponseText:

Can anything be done about it? I feel very uncomfortable, because it looks as if it were my fault that I am failing to respond to the request for additional details on my bug report. Could that bug report be supplemented with my response, or at least hidden until the website error is resolved?

Andrey

Yuki, thank you. I still have to reply here, as that AJAX error has not gone away (I hoped the state would change after your last post there).

Andrey

Yuki, maybe I’m wrong, but I suspect that the website problem may somehow be related to the “publicly visible log.txt” - I do not see any. Where should it be? Search does not find “log.txt” anywhere, neither on the page itself nor in the source view. Maybe it broke something with permissions, and now only NVIDIA personnel can post on that page? Can you yourself see that log.txt?

Andrey

Hi Yuki,

I’m still getting that AJAX error when trying to reply to you on the bug report page: https://developer.nvidia.com/nvidia_bug/3503453

Additionally, I asked another developer who is registered with NVIDIA; he cannot access that page at all.

Where should I respond to you? For now I’m doing it here; please see my response below.


Hi Yuki,

I do not understand your recommendation. In the textures_accumulate kernel, I need 58800 bytes of shared memory; I pass that amount to this sub-kernel in both DP and non-DP modes. It works in non-DP mode and fails in DP mode. What do you mean by “not define its shared memory in sub kernel”? Where should I define it?

Andrey

Hi Andrey,

I just saw your reply here to my latest email. Sorry for the bug report system breakage; we have been checking with our IT engineers on it.
How about we focus on the email conversation and bring the ticket result back here? Thanks.

Yuki,

Maybe we should refile this bug report as a new one to work around the web problems?

Andrey