Adaptive Parallel Computation with CUDA Dynamic Parallelism

Originally published at:

Early CUDA programs had to conform to a flat, bulk parallel programming model. Programs had to perform a sequence of kernel launches, and for best performance each kernel had to expose enough parallelism to efficiently use the GPU. For applications consisting of “parallel for” loops the bulk parallel model is not too limiting, but some parallel…

Before dynamic parallelism, this would require multiple kernel launches from the host side. Do you have timings for that approach, for those of us with pre-3.5 devices?

I’ve done a comparison like this for the Triplet Finder algorithm for the PANDA experiment. The plot is available from GTC On-Demand (http://on-demand.gputechcon..., slide 33), and will also be presented in the third part of this blog post series.

Great. Thanks!

I went to high school with Rico Mariani :)

Hi whoever you are :)


In the mandelbrot_block_k function, the line
else if (depth + 1 > MAX_DEPTH && d / SUBDIV < MIN_SIZE)
should be
else if (depth + 1 < MAX_DEPTH && d / SUBDIV > MIN_SIZE)
Also, in the main function, mandelbrot_block_k is called with complex(-1.5, 1) for the cmin parameter, but it should be complex(-1.5, -1), according to the code on GitHub.
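
To make the effect of the corrected comparison concrete, here is a minimal host-side C++ sketch of the three-way decision it belongs to. It is not the post's kernel: the choose_action helper, the Action enum, and the constant values are assumptions for illustration only, so check the GitHub code for the real definitions. With the reversed comparisons, the subdivide branch could only fire once the depth limit was already exceeded and the rectangle was already too small, so a top-level rectangle would never be subdivided.

```cpp
#include <cassert>

// Assumed values for illustration; the real constants live in the GitHub source.
constexpr int MAX_DEPTH = 4;   // maximum recursion depth
constexpr int SUBDIV    = 4;   // each rectangle splits into SUBDIV x SUBDIV children
constexpr int MIN_SIZE  = 32;  // smallest child side length worth another launch

// Hypothetical names, not taken from the post.
enum class Action { Fill, Subdivide, PerPixel };

// uniform_border: true if every border pixel of the rectangle has the same dwell
// depth:          current recursion depth (0 at the top level)
// d:              side length of the current rectangle, in pixels
Action choose_action(bool uniform_border, int depth, int d) {
    assert(depth >= 0 && d > 0);
    if (uniform_border) {
        return Action::Fill;       // fill the whole rectangle with the common dwell
    }
    // Corrected condition: recurse only while below the depth limit
    // and while the children would still be larger than MIN_SIZE.
    if (depth + 1 < MAX_DEPTH && d / SUBDIV > MIN_SIZE) {
        return Action::Subdivide;  // the kernel would launch child grids here
    }
    return Action::PerPixel;       // fall back to computing each pixel directly
}
```

Under these assumed constants, a 1024-pixel rectangle at depth 0 with a non-uniform border is subdivided, while the same rectangle at depth 3, or a 128-pixel rectangle at depth 0, falls through to the per-pixel case.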