Bad performance problems and discussion

bcesarg6 · May 17, 2016, 11:10pm

Hi! I’m fairly new to CUDA.

I’m coding a parallel version of a heuristic graph coloring algorithm, and the performance of my kernel is worrying me: the kernel execution time increases a lot when I increase the number of threads in the kernel launch:

64 Threads -> 1.62s Kernel execution time
128 Threads -> 1.64s Kernel execution time
256 Threads -> 1.77s Kernel execution time
512 Threads -> 2.21s Kernel execution time
1024 Threads -> 8.00s Kernel execution time

With 2048+ threads the kernel reaches its time limit and gives me a "the launch timed out and was terminated" error.

My Block and Grid organization works this way:

dim3 block,grid;

void threads_setup(aco_t* aco_info){
    n_threads = aco_info->n_threads;

    if( n_threads > 64){
        block.x = 8;
        block.y = 8;
        int dim = n_threads / 64;
        if (dim > 4){
            int dim2 = dim / 2;
            grid.x = dim2;
            grid.y = 2;
        }else{
            grid.x = dim;
        }
    } else {
        block.x = 8;
        block.y = 8;
        grid.x = 1;
    }
}
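
A side note on the setup above: because it uses integer division, a thread count that is not a multiple of 64 (or a block count above 4 that is odd) ends up launching fewer threads than requested. For example, 320 threads gives dim = 5, dim2 = 2, and a 2 x 2 grid of 64-thread blocks, which is 256 threads. The sizes timed above are all exact multiples, so this would not explain those timings. Below is a minimal host-side sketch of the usual round-up calculation, in plain C++ so it compiles without the CUDA toolkit. `Dim3`, `LaunchConfig`, `make_launch_config` and `original_thread_count` are hypothetical names, with `Dim3` standing in for CUDA's `dim3`.

```cpp
// Hypothetical stand-in for CUDA's dim3 (all components default to 1).
struct Dim3 {
    unsigned x, y, z;
    Dim3(unsigned x_ = 1, unsigned y_ = 1, unsigned z_ = 1) : x(x_), y(y_), z(z_) {}
};

struct LaunchConfig {
    Dim3 block;
    Dim3 grid;
};

// 1D launch configuration that rounds the block count up, so that
// grid.x * block.x >= n_threads for any n_threads.
LaunchConfig make_launch_config(unsigned n_threads, unsigned block_size = 64) {
    LaunchConfig cfg;
    cfg.block = Dim3(block_size);
    unsigned blocks = (n_threads + block_size - 1) / block_size;  // ceiling division
    cfg.grid = Dim3(blocks == 0 ? 1 : blocks);                    // never launch 0 blocks
    return cfg;
}

// Emulates how many threads the original threads_setup() launches,
// assuming grid starts out as (1, 1, 1) before the call.
unsigned original_thread_count(unsigned n_threads) {
    unsigned grid_x = 1, grid_y = 1;
    if (n_threads > 64) {
        unsigned dim = n_threads / 64;   // truncates
        if (dim > 4) {
            grid_x = dim / 2;            // truncates again for odd dim
            grid_y = 2;
        } else {
            grid_x = dim;
        }
    }
    return grid_x * grid_y * 64;         // blocks are always 8 x 8
}
```

With rounding up, the launch can contain more threads than requested (100 threads becomes 2 blocks of 64), so the kernel would need a guard such as `if (threadID >= n_threads) return;` near the top.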

Like I said, I’m new to this, so the threads_setup function is written this way only because I thought 64 threads in a 2D block was a nice way of doing it. My kernel function doesn’t actually make any use of the 2D layout, since I run this code to get my thread ID:

int threadID;
if(gridDim.y > 1){
    threadID = ((blockIdx.x * (blockDim.x * blockDim.y)) + (blockIdx.y * (blockDim.x * blockDim.y * gridDim.x))) + ((threadIdx.x * blockDim.x) + threadIdx.y);
} else {
    threadID = ((blockIdx.x * (blockDim.x * blockDim.y))) + ((threadIdx.x * blockDim.x) + threadIdx.y);
}
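
For comparison, here is a sketch of the more conventional flattening, emulated on the host in plain C++ so it can be checked without a GPU. `Dim3`, `flat_thread_id` and `ids_are_unique` are hypothetical names; in a real kernel the four arguments would be the built-in `blockIdx`, `threadIdx`, `blockDim` and `gridDim`. It relies on two observations: `blockIdx.y` is always 0 when `gridDim.y` is 1, so the `if` is not needed; and the `threadIdx.x * blockDim.x + threadIdx.y` term only yields a dense, collision-free range when the block is square (8 x 8 works, 4 x 8 would not).

```cpp
#include <vector>

// Hypothetical stand-in for CUDA's dim3 / uint3 (components default to 1 or are set explicitly).
struct Dim3 {
    unsigned x, y, z;
    Dim3(unsigned x_ = 1, unsigned y_ = 1, unsigned z_ = 1) : x(x_), y(y_), z(z_) {}
};

// Flatten a 2D grid of 2D blocks into one global thread ID.
// threadIdx.x varies fastest, matching how threads are ordered inside a block.
unsigned flat_thread_id(Dim3 blockIdx, Dim3 threadIdx, Dim3 blockDim, Dim3 gridDim) {
    unsigned threads_per_block = blockDim.x * blockDim.y;
    unsigned block_id = blockIdx.y * gridDim.x + blockIdx.x;    // row-major block index
    unsigned local_id = threadIdx.y * blockDim.x + threadIdx.x; // row-major index in block
    return block_id * threads_per_block + local_id;
}

// Walks every (block, thread) pair of an emulated launch and reports whether
// each ID in [0, total threads) is produced exactly once.
bool ids_are_unique(Dim3 gridDim, Dim3 blockDim) {
    unsigned total = gridDim.x * gridDim.y * blockDim.x * blockDim.y;
    std::vector<bool> seen(total, false);
    for (unsigned by = 0; by < gridDim.y; ++by)
        for (unsigned bx = 0; bx < gridDim.x; ++bx)
            for (unsigned ty = 0; ty < blockDim.y; ++ty)
                for (unsigned tx = 0; tx < blockDim.x; ++tx) {
                    unsigned id = flat_thread_id(Dim3(bx, by, 0), Dim3(tx, ty, 0),
                                                 blockDim, gridDim);
                    if (id >= total || seen[id]) return false;
                    seen[id] = true;
                }
    return true;
}
```

Calling `ids_are_unique(Dim3(8, 2), Dim3(8, 8))` emulates the 1024-thread configuration from the setup function; the same check can be run on a non-square block such as `Dim3(4, 8)`.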

Another important detail about my kernel is that it is made of 5 functions that are called many times by a main function inside the kernel, so the times above are the sum of all these functions.

That’s everything I can think of that could be slowing my code down so much. What is the real problem? Does the block and grid organization influence performance this much? Is it wrong to have a lot of functions in the kernel? Is there an implicit sync barrier between function calls inside the kernel? Or is none of that the reason, and the problem is within my kernel?

If someone could point me in the right direction it would really help! Thanks for your attention!

BulatZiganshin · May 17, 2016, 11:20pm

if that’s your first cuda program, i suggest starting with much simpler examples. remember: a program doesn’t get a magic boost once you recompile it with CUDA; unleashing GPU power needs much more work than CPU programming

you can share the code and maybe someone will look into it. as a wild guess, it may be, for example, due to cache thrashing

"Does the block and grid organization influence performance this much?"

you can check that yourself
