GTX Titan and dynamic parallelism

Dynamic parallelism is a nice conceptual candidate for the nested loop:

for each octree node o
    for each sample s in the octree node o
        do things

So: one kernel launch over all nodes of the octree, and inside that kernel, one kernel launch per node for all the samples in that node. But my GTX Titan is brought to its knees by this approach. Could someone appropriate at NVIDIA get in contact with me, or start a conversation here?
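Here is roughly the shape I mean (a toy sketch, not my real code; OctreeNode, processSamples and the sample buffer are just placeholders), compiled with -arch=sm_35 -rdc=true -lcudadevrt:

// Placeholder node record: which samples belong to the node.
struct OctreeNode {
    int firstSample;   // index of the node's first sample
    int numSamples;    // number of samples in this node
};

// Child kernel: one thread per sample of a single node.
__global__ void processSamples(const OctreeNode node, float *samples)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s < node.numSamples) {
        // "do things" with this node's sample
        samples[node.firstSample + s] *= 2.0f;
    }
}

// Parent kernel: one thread per octree node, each thread launches a child grid.
__global__ void processNodes(const OctreeNode *nodes, int numNodes, float *samples)
{
    int o = blockIdx.x * blockDim.x + threadIdx.x;
    if (o < numNodes) {
        int threads = 256;
        int blocks  = (nodes[o].numSamples + threads - 1) / threads;
        processSamples<<<blocks, threads>>>(nodes[o], samples);
    }
}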

What my experiments show is that it is faster to run the second for-loop serially rather than spawning a kernel inside, as sketched below. This is bad, very bad. The process also freezes the driver. So my conclusion is that dynamic parallelism is not yet where it needs to be… I think NVIDIA should get in contact with me so I can give them the program under discretion policies regarding copyright. Or is there a way to provide feedback on a regular basis rather than through the forums? For instance by enrolling as a CUDA developer?
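The serial variant that turned out faster for me looks roughly like this (again only a sketch, reusing the placeholder types from above):

// Serial variant: the parent thread loops over its node's samples itself
// instead of launching a child grid.
__global__ void processNodesSerial(const OctreeNode *nodes, int numNodes, float *samples)
{
    int o = blockIdx.x * blockDim.x + threadIdx.x;
    if (o < numNodes) {
        for (int s = 0; s < nodes[o].numSamples; ++s) {
            // same "do things" as before, done by one thread per node
            samples[nodes[o].firstSample + s] *= 2.0f;
        }
    }
}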

[Note: I have no hands-on experience with dynamic parallelism]

As far as I know, on currently shipping hardware the latency of launching a kernel from the device is roughly the same as launching it from the host. The latency of launching a kernel from the host is on the order of 4-5 microseconds. So I suspect the ratio of kernel launches to “do things” is too high to yield any benefit for your use case.
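If you want a rough ballpark for the launch overhead on your own system, something like the following quick sketch (an empty kernel launched many times, timed with CUDA events; it measures back-to-back launch throughput, which is only an approximation of single-launch latency) gives you a number to weigh against the cost of "do things":

#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    const int N = 10000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    emptyKernel<<<1, 1>>>();            // warm-up launch
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < N; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average launch overhead: %.2f us\n", ms * 1000.0f / N);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}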

You are welcome to file a bug, attaching your code as a repro app. The bug reporting form is linked from the registered developer website. Signing up as a registered developer should be painless, the applications are typically approved within one business day.

I think that the scheduler needs improvement… I will profile it with NSIGHT to look at the timings.

Also, let's hope the new NSIGHT 3.2 can debug kernel-within-kernel launches…

By the way, for NSIGHT and Dynamic Parallelism you need the display handled by a different GPU, which makes my decision to buy an i7 with the HD4000 look wise…
So what you need is a motherboard with on-board video outputs and an Intel CPU with an integrated GPU.
I am using RealVNC to connect to my server, since the computer sits in my office at the university. The VNC server is tied to the HD4000, leaving the GTX Titan free for CUDA. It's a perfect environment. I am glad I have finally built a server to see how far Dynamic Parallelism can make it into one of my applications.

One point: I do not want to deter anyone from using Dynamic Parallelism. Just take it easy with it. It can actually work in my case too, but I have to do some further processing to gather all the results and launch the kernel inside each node only once. Just take it easy is my advice. There is a certain limit you can reach, and beyond it dynamic parallelism becomes counterproductive. It is fascinating to explore these limits and see how NVIDIA will gradually push them back…
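One possibility for keeping the number of child launches down (just a sketch of the general idea, reusing the placeholder types from above; the details depend on the application) is to flatten the nesting into a single grid over all samples, using a precomputed sample-to-node map:

// Flattened variant: one grid over all samples, no device-side launches.
// sampleToNode is a hypothetical precomputed map from sample index to node index.
__global__ void processAllSamples(const OctreeNode *nodes,
                                  const int *sampleToNode,
                                  float *samples, int totalSamples)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s < totalSamples) {
        const OctreeNode &node = nodes[sampleToNode[s]];
        // same "do things" as before, with the node data available if needed
        samples[s] *= 2.0f;
    }
}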