Adaptive Parallel Computation with CUDA Dynamic Parallelism

Originally published at: https://developer.nvidia.com/blog/introduction-cuda-dynamic-parallelism/

Early CUDA programs had to conform to a flat, bulk-parallel programming model. Programs had to perform a sequence of kernel launches, and for best performance each kernel had to expose enough parallelism to use the GPU efficiently. For applications consisting of “parallel for” loops, the bulk-parallel model is not too limiting, but some parallel…
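To make the contrast concrete, dynamic parallelism lets a kernel launch child kernels directly on the GPU, sized to the work it discovers at run time, with no round trip to the host. A minimal sketch (the kernel names here are illustrative, not from the post; device-side launches require a compute capability 3.5+ GPU and compilation with `nvcc -arch=sm_35 -rdc=true`):

```cuda
#include <cstdio>

// Child grid: handles one region's work.
__global__ void child_kernel(int parent_block)
{
    printf("child of block %d, thread %d\n", parent_block, threadIdx.x);
}

// Parent kernel: with dynamic parallelism it can launch child grids
// directly from the device.
__global__ void parent_kernel()
{
    if (threadIdx.x == 0) {
        // Device-side launch: the <<<...>>> syntax is legal inside a
        // kernel when compiled with relocatable device code (-rdc=true).
        child_kernel<<<1, 4>>>(blockIdx.x);
        // The parent grid is not considered complete until all of its
        // child grids have completed.
    }
}

int main()
{
    parent_kernel<<<2, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```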

Pre-DP, this would require multiple kernel launches from the host side. Do you have timings for such an approach, for those with pre-3.5 devices?

I’ve done a comparison like this for the Triplet Finder algorithm for the PANDA experiment. The plot is available from GTC On-Demand (http://on-demand.gputechcon..., slide 33), and will also be presented in the third part of this blog post series.

Great. Thanks!

I went to high school with Rico Mariani :)

Hi whoever you are :)

DB!

In the code of the mandelbrot_block_k function,
else if (depth + 1 > MAX_DEPTH && d / SUBDIV < MIN_SIZE)
should be
else if (depth + 1 < MAX_DEPTH && d / SUBDIV > MIN_SIZE)
Also, in the main function, mandelbrot_block_k is called with complex(-1.5, 1) for the cmin parameter, but it should be complex(-1.5, -1) according to the code on GitHub.
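For readers cross-checking this report, the condition in question sits in the branch that decides whether a block subdivides further. A rough reconstruction of the surrounding control flow with the corrected condition in place (the helper names and signatures here are assumptions based on the post's description, not a verbatim quote):

```cuda
// Hypothetical sketch of the per-block decision in mandelbrot_block_k:
// evaluate the dwell along the region's border, then either fill the
// region, recurse via a device-side launch, or fall back to per-pixel work.
int comm_dwell = border_dwell(w, h, cmin, cmax, x0, y0, d);  // assumed helper
if (threadIdx.x == 0 && threadIdx.y == 0) {
    if (comm_dwell != DIFF_DWELL) {
        // Uniform border: the whole region shares one dwell value.
        dwell_fill_k<<<grid, bs>>>(dwells, w, x0, y0, d, comm_dwell);
    } else if (depth + 1 < MAX_DEPTH && d / SUBDIV > MIN_SIZE) {
        // Corrected condition: recurse only while the depth limit has
        // not been reached AND the subregions stay large enough.
        mandelbrot_block_k<<<dim3(SUBDIV, SUBDIV), bs>>>(
            dwells, w, h, cmin, cmax, x0, y0, d / SUBDIV, depth + 1);
    } else {
        // Too deep or too small to subdivide: compute every pixel.
        mandelbrot_pixel_k<<<grid, bs>>>(dwells, w, h, cmin, cmax, x0, y0, d);
    }
}
```

With the original (inverted) comparisons, the recursion branch could never be taken sensibly, which is why the commenter's fix matters.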