NVRTC with Dynamic Parallelism

According to the NVRTC documentation, the dynamic parallelism feature is not "yet" implemented in NVRTC:
http://docs.nvidia.com/cuda/nvrtc/index.html#known-issues

Does NVIDIA plan to support this feature in a later version?
(I'd like to know, if someone from NVIDIA can answer.)

Or is there a workaround to generate an equivalent binary that can launch a kernel function from inside another kernel function?

Thanks,

Background:

My GPU kernel processes table joins inside PostgreSQL. The GPU kernel is built with NVRTC.
When the join involves three or more tables, the CPU launches the kernel function multiple times - for example, first to join table-A and table-B, then to join (table-A+B) with table-C.
Once the (table-A+B) join tries to generate more rows than the temporary buffer can hold, the kernel returns an error to the CPU, which then retries the entire sequence of kernel calls with a larger result buffer or a reduced input table.

If we could use dynamic parallelism here, a controller kernel function, rather than CPU code, could retry step i of N whenever a sub-kernel returns a lack-of-buffer error.

[current implementation]

  1. CPU launches the N GPU kernel steps.
  2. GPU kernel step-1 completes successfully.
  3. GPU kernel step-2 fails with a lack-of-buffer error.
  4. CPU assigns a proper problem size, then launches the entire GPU kernel sequence again.
  5. GPU kernel step-1 completes successfully.
  6. GPU kernel step-2 completes successfully.
  7. CPU gets the expected result.
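
For reference, the host-side retry loop is roughly the sketch below; launch_join_step(), read_kernel_status(), STATUS_NO_SPACE and the buffer-doubling policy are simplified placeholders, not the actual code.

  #include <cuda.h>

  /* Sketch of the current CPU-driven retry loop (placeholders throughout) */
  void run_join_task(CUstream stream, size_t result_bufsz)
  {
      for (;;)
      {
          launch_join_step(stream, 1, result_bufsz);  /* table-A JOIN table-B     */
          launch_join_step(stream, 2, result_bufsz);  /* (table-A+B) JOIN table-C */
          cuStreamSynchronize(stream);

          if (read_kernel_status(stream) == STATUS_NO_SPACE)
          {
              result_bufsz *= 2;  /* enlarge the result buffer (or shrink input) */
              continue;           /* re-run the entire kernel sequence           */
          }
          break;                  /* both steps succeeded                        */
      }
  }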

[what I want to do]

  1. CPU launches a controller GPU kernel.
  2. The controller launches the GPU kernel for step-1, which completes successfully.
  3. The controller launches the GPU kernel for step-2, but gets an error.
  4. The controller reduces the problem size, then launches step-2 again.
  5. CPU gets a part of the result, then launches the controller GPU kernel again for the remaining portion.

The latter is a much smarter and more beneficial implementation, especially when N > 5.
It is not an easy job to predict the number of rows generated by a table join.
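
For illustration, a controller kernel under dynamic parallelism could look roughly like the sketch below. The struct, the status codes, join_step() and the halving policy are placeholder assumptions, and it would need relocatable device code plus cudadevrt, which NVRTC cannot produce today.

  /* Sketch only: one controller thread launches each step as a child kernel
   * and retries with a reduced problem size on buffer overflow. */
  enum { STATUS_OK = 0, STATUS_NO_SPACE = 1 };

  struct kern_args
  {
      int          status;        /* written back by the child kernel */
      unsigned int nitems[8];     /* problem size of each step        */
      /* ... join arguments ... */
  };

  __global__ void join_step(kern_args *kargs, int step, unsigned int nitems);

  __global__ void controller_kernel(kern_args *kargs, int nsteps)
  {
      if (blockIdx.x != 0 || threadIdx.x != 0)
          return;                               /* single controller thread */

      for (int step = 0; step < nsteps; step++)
      {
          unsigned int nitems = kargs->nitems[step];

          for (;;)
          {
              unsigned int nblocks = (nitems + 255) / 256;

              kargs->status = STATUS_OK;
              join_step<<<nblocks, 256>>>(kargs, step, nitems);
              cudaDeviceSynchronize();          /* wait for the child kernel */

              if (kargs->status != STATUS_NO_SPACE)
                  break;
              nitems /= 2;                      /* retry with a smaller size */
          }
      }
  }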

do you eventually want all results, or just the first x results?

in the case of the former, even if you use a dual buffer, the device would eventually need to synchronize with the host (the host emptying its buffer)
synchronous or sequential kernels - whatever you want to call them - can communicate with each other via global memory
each kernel can easily dump a trace in global memory for the next kernel to either abort (no error), or carry on where the previous kernel stopped
no need for dp
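
for illustration, something like the sketch below - the struct and the kernel are placeholders, not anyone's actual code

  /* a small status block in global memory, shared by consecutive kernels */
  struct kern_trace
  {
      int          status;    /* 0 = OK, nonzero = error from a previous step */
      unsigned int resume;    /* index where the previous kernel stopped      */
  };

  __global__ void join_step2(kern_trace *trace, unsigned int nrows)
  {
      if (trace->status != 0)
          return;             /* an earlier step failed - abort cheaply */

      unsigned int index = trace->resume + blockIdx.x * blockDim.x + threadIdx.x;
      if (index >= nrows)
          return;

      /* join row 'index'; on lack of buffer, set trace->status atomically and
       * record the position in trace->resume so the next launch can carry on */
  }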

There are two difficulties if we use the existing infrastructure supported by NVRTC.

First, even if a later kernel checks the result of the previous kernel's execution, we cannot change the number of threads launched for that later GPU kernel once it has been launched. For optimization reasons, we want to launch the smallest number of GPU threads that is still larger than the number of source rows. That number is predictable for the first step, but it is not an easy job to estimate the exact number of rows generated by a join prior to execution.

Second, if we try to control GPU kernel launches from the CPU according to the status of previous steps, at least one CPU thread has to monitor the completion of the GPU kernels. However, that is not easy to implement in software that runs as a PostgreSQL extension. So all we can do is enqueue the async DMA send/recv and the GPU kernel call at once, and register a callback on task completion.
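
In other words, the per-task flow is roughly the sketch below with the driver API; the grid/block sizes, the buffers and the callback body are placeholders.

  #include <cuda.h>

  /* "Enqueue everything at once, then get a callback" pattern */
  static void CUDA_CB task_complete_cb(CUstream stream, CUresult status, void *data)
  {
      /* runs on a CUDA-internal thread once everything queued before it has
       * finished; wake up the PostgreSQL backend from here */
  }

  void enqueue_task(CUstream stream, CUfunction kern, CUdeviceptr d_buf,
                    void *h_buf, size_t len, void *kern_args[],
                    unsigned int grid_x, unsigned int block_x)
  {
      cuMemcpyHtoDAsync(d_buf, h_buf, len, stream);            /* DMA send   */
      cuLaunchKernel(kern, grid_x, 1, 1, block_x, 1, 1,
                     0, stream, kern_args, NULL);              /* GPU kernel */
      cuMemcpyDtoHAsync(h_buf, d_buf, len, stream);            /* DMA recv   */
      cuStreamAddCallback(stream, task_complete_cb, NULL, 0);  /* completion */
  }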

These are the reasons why I'd like to apply dynamic parallelism here.

BTW, I found that a kernel function which calls another kernel function is transformed into a pair of cudaGetParameterBufferV2() and cudaLaunchDeviceV2() calls.

cudaGetParameterBufferV2() packs the function pointer, the grid/block size for each dimension, and the required shared memory size.
Then cudaLaunchDeviceV2() takes the parameter buffer returned by cudaGetParameterBufferV2(), along with the user-defined arguments.

Has anyone tried to call these functions directly?
Both of them have the __device__ attribute; however, I'm uncertain how we can use these functions inside a kernel.
In particular, the second argument of cudaLaunchDeviceV2() is a cudaStream_t. Is that a structure accessible from a kernel?
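
For reference, the device-side source form that nvcc transforms into those calls is the usual triple-angle-bracket launch written in device code (compiled with -rdc=true and linked against cudadevrt). A rough sketch, with child_kernel as a placeholder; note that a device-side stream has to be created with the cudaStreamNonBlocking flag:

  __global__ void child_kernel(int *data, unsigned int nitems)
  {
      unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < nitems)
          data[i] += 1;                  /* dummy per-row work */
  }

  __global__ void parent_kernel(int *data, unsigned int nitems)
  {
      if (blockIdx.x == 0 && threadIdx.x == 0)
      {
          cudaStream_t stream;
          cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

          unsigned int nblocks = (nitems + 255) / 256;
          child_kernel<<<nblocks, 256, 0, stream>>>(data, nitems);

          cudaStreamDestroy(stream);
      }
  }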

“For optimization reasons, we want to launch the smallest number of GPU threads that is still larger than the number of source rows. That number is predictable for the first step, but it is not an easy job to estimate the exact number of rows generated by a join prior to execution.”

i fail to see the mentioned optimization

if you have kernel threads that are sufficiently long-lived such that redundant threads are actually a concern, and if you have a sufficient number of kernel threads, you can easily build in flexibility in kernel dimensions, by moving to semi-persistent threads, in my opinion

no device can seat 30k threads at the same time
hence, there should be little difference between issuing kernel blocks with 30k threads, and kernel blocks with 1k threads, doing the work of 30k threads
if a thread is going to get up, just so that another can sit down, why should the 1st thread get up in the first place?

“if you have kernel threads that are sufficiently long-lived such that redundant threads are actually a concern, and if you have a sufficient number of kernel threads, you can easily build in flexibility in kernel dimensions, by moving to semi-persistent threads, in my opinion”

In my observation, a massive number of redundant kernel threads caused significant performance degradation, even if each one just exits the kernel after checking the length of the array (of input rows) to be processed.
A typical failure scenario: the CPU estimates that the step-2 join will generate 20K rows as an intermediate result, so it also launches the step-3 GPU kernel with 20K threads; however, step-2 actually generates only 100 rows, so 99.5% of the threads exit with no work.
We cannot know the exact number of rows generated by a join until it executes, so it is fundamentally not easy to launch a GPU kernel that fits the problem size, unless we can determine the number at run time.

If we keep GPU threads running to process a variable number of items, they will consume more device resources than necessary; those resources should instead be assigned to concurrent sessions that use other GPU contexts. I cannot imagine this approach working well…

"
If we would keep GPU threads running to process variable number of items, it will consume device resource than necessity; that shall be assigned to the concurrent session that uses another GPU context. I cannot imagine these approach works well… "

you are referring to a persistent solution
i referred to a semi-persistent solution
there is a difference

“A typical failure scenario: the CPU estimates that the step-2 join will generate 20K rows as an intermediate result, so it also launches the step-3 GPU kernel with 20K threads; however, step-2 actually generates only 100 rows, so 99.5% of the threads exit with no work.”

this would support an observation that each thread of the kernel's thread blocks actually does not do much - its overhead is comparable to its execution time
this in turn may support a semi-persistent solution
it is clear that you need more flexibility in terms of your kernel dimensions
one approach is certainly to have some master - host or device side - adjust and guess best dimensions
this in turn seems wasteful itself - what is the master’s guessing accuracy?
in my view, a semi-persistent solution would permit such dimension flexibility
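
to make that concrete: with a grid-stride style kernel, the same modest launch dimensions serve 100 rows or 100K rows, because the row count is read at run time from memory written by the previous step - a generic sketch, process_row() being a placeholder

  __device__ void process_row(int *rows, unsigned int i);   /* placeholder */

  __global__ void join_step(const unsigned int *nrows_ptr, int *rows)
  {
      unsigned int nrows  = *nrows_ptr;   /* written by the previous kernel */
      unsigned int stride = gridDim.x * blockDim.x;

      for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
           i < nrows;
           i += stride)
      {
          process_row(rows, i);           /* each thread handles many rows */
      }
  }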