Is this strategy not suitable for dynamic parallelism?

neoideo · January 9, 2014, 9:34am

I am trying to improve one aspect of a CUDA program I have running on a Tesla K40 GPU.

In my original CUDA implementation, I launch a 2D grid of 16x16 blocks. In the algorithm, each thread computes (1) an initial value and (2) fine-grained results based on (1). The problem is that (1) is actually the same for all threads within a block, so it is recomputed redundantly by every thread.

My first attempt at optimizing this was to have only the first thread of each block compute the common value and store it in shared memory, followed by a __syncthreads() call. Afterwards, all threads could do (2) using the initial value just as before, with the advantage that it is computed only once per block. I expected this to be faster, but in the end the performance was the same as the original solution, so I discarded it and went back to the original.
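For reference, a minimal sketch of that shared-memory variant. The body of compute_common_val, the output buffer, and the per-thread arithmetic are hypothetical stand-ins, not the real algorithm; only the broadcast pattern (one thread writes, __syncthreads(), everyone reads) reflects what is described above.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for step (1): one value per block.
__device__ int compute_common_val(int block_id) { return 3 * block_id + 1; }

__global__ void block_kernel(int *out, int width)
{
    __shared__ int common_val;  // one copy per block

    // Only thread (0,0) of the block computes the common value.
    if (threadIdx.x == 0 && threadIdx.y == 0)
        common_val = compute_common_val(blockIdx.y * gridDim.x + blockIdx.x);
    __syncthreads();  // all threads wait until the value is visible

    // Step (2): per-thread work that reads the shared value (placeholder math).
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < width)
        out[y * width + x] = common_val + threadIdx.y * blockDim.x + threadIdx.x;
}

int main()
{
    const int blocks = 45;           // assumed grid size, 45 x 45 blocks
    const int width  = blocks * 16;  // 16x16 threads per block
    const size_t bytes = size_t(width) * width * sizeof(int);

    int *d_out = nullptr;
    cudaMalloc(&d_out, bytes);

    dim3 dimblock(16, 16, 1);
    dim3 dimgrid(blocks, blocks, 1);
    block_kernel<<<dimgrid, dimblock>>>(d_out, width);
    cudaDeviceSynchronize();

    std::vector<int> h_out(size_t(width) * width);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    printf("out[0] = %d\n", h_out[0]);

    cudaFree(d_out);
    return 0;
}
```

If step (1) is cheap compared with step (2), this pattern would be expected to save very little, which may be why the timing did not change.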

Given that I recently got a Tesla K40, I thought it was the perfect moment to try the very attractive dynamic parallelism feature. What I did was first launch a coarse kernel, where each thread represents one block of the original CUDA implementation. The idea is to compute the initial value in this coarse kernel, and then have each thread launch a child kernel of 16x16 threads, passing the common value as an argument. On paper this sounded like a good optimization, but in practice I get horrible performance, up to 20x slower.

This is roughly how I intended to use dynamic parallelism:

__global__ void coarse_kernel( parameters ){
    // one coarse thread stands in for one block of the original kernel
    int common_val = compute_common_val();
    dim3 dimblock(16, 16, 1);
    dim3 dimgrid(1, 1, 1);
    // each coarse thread launches its own single-block child grid
    child_kernel <<< dimgrid, dimblock >>> (common_val, parameters);
}

__global__ void child_kernel( int common_val, parameters ){
    // use common value
    do_computations(common_val, parameters);
}

A realistic case, for example, would be a 45 x 45 coarse grid, where each coarse thread stands for one 16x16 block and spawns a new child kernel of just 16x16 threads. As you can see, there will be a great many child kernels (45 x 45 = 2025 launches in this case).
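To make that configuration concrete, here is a self-contained sketch of the 45 x 45 case. The build line, compute_common_val, the output buffer, and the per-thread arithmetic are assumptions for illustration; only the launch structure (one single-block child grid per coarse thread) follows the description above.

```cuda
// Assumed build line for a K40: nvcc -arch=sm_35 -rdc=true dp_sketch.cu -lcudadevrt
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define COARSE 45  // coarse grid is COARSE x COARSE threads
#define TILE   16  // each child kernel is one TILE x TILE block

// Hypothetical stand-in for the per-tile initial value.
__device__ int compute_common_val(int tile_id) { return 3 * tile_id + 1; }

__global__ void child_kernel(int common_val, int tile_x, int tile_y,
                             int *out, int width)
{
    int x = tile_x * TILE + threadIdx.x;
    int y = tile_y * TILE + threadIdx.y;
    // Placeholder for the fine-grained work that uses the common value.
    out[y * width + x] = common_val + threadIdx.y * TILE + threadIdx.x;
}

__global__ void coarse_kernel(int *out, int width)
{
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    int ty = blockIdx.y * blockDim.y + threadIdx.y;
    if (tx >= COARSE || ty >= COARSE) return;

    int common_val = compute_common_val(ty * COARSE + tx);
    dim3 dimblock(TILE, TILE, 1);
    dim3 dimgrid(1, 1, 1);
    // One device-side launch per coarse thread: 45 * 45 = 2025 launches,
    // each carrying only 256 threads of work.
    child_kernel<<<dimgrid, dimblock>>>(common_val, tx, ty, out, width);
}

int main()
{
    const int width = COARSE * TILE;
    const size_t bytes = size_t(width) * width * sizeof(int);

    int *d_out = nullptr;
    cudaMalloc(&d_out, bytes);

    dim3 dimblock(16, 16, 1);
    dim3 dimgrid((COARSE + 15) / 16, (COARSE + 15) / 16, 1);
    coarse_kernel<<<dimgrid, dimblock>>>(d_out, width);
    cudaDeviceSynchronize();  // the parent grid finishes only after its children

    std::vector<int> h_out(size_t(width) * width);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    printf("out[0] = %d\n", h_out[0]);

    cudaFree(d_out);
    return 0;
}
```

Written this way, the total amount of fine-grained work is the same as in the original kernel (2025 blocks of 256 threads), but it is split across 2025 separate child launches instead of one grid.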

So in the end I just went back to the original CUDA implementation, but it is disappointing that neither of the ideas above worked in practice.

What kind of performance problem or limitation am I facing here? Does anyone know what approach I should take?
