Bad performance problems and discussion

Hi! I’m kind of a newbie at CUDA.

I’m coding a parallel version of a heuristic graph coloring algorithm, and the performance of my kernel is worrying me. The kernel execution time increases a lot when I increase the number of threads in the kernel launch:

64 Threads -> 1.62s Kernel execution time
128 Threads -> 1.64s Kernel execution time
256 Threads -> 1.77s Kernel execution time
512 Threads -> 2.21s Kernel execution time
1024 Threads -> 8.00s Kernel execution time

With 2048+ threads the kernel hits its time limit and gives me a "the launch timed out and was terminated" error.

My Block and Grid organization works this way:

dim3 block, grid;

void threads_setup(aco_t* aco_info){
    n_threads = aco_info->n_threads;

    if (n_threads > 64){
        // 8 x 8 = 64 threads per block
        block.x = 8;
        block.y = 8;
        int dim = n_threads / 64;   // number of 64-thread blocks
        if (dim > 4){
            // split the blocks over two grid rows
            int dim2 = dim / 2;
            grid.x = dim2;
            grid.y = 2;
        } else {
            grid.x = dim;
        }
    } else {
        block.x = 8;
        block.y = 8;
        grid.x = 1;
    }
}
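
For comparison, a common way to size a launch is to fix the block size and round the number of blocks up, so that no requested threads are lost to integer division (in the function above, `n_threads / 64` and `dim / 2` both round down). This is only a minimal host-side sketch, assuming 1D blocks and a 1D grid. `LaunchDims` and `make_launch_dims` are hypothetical names, and the struct stands in for `dim3` so the snippet compiles as plain C++ without the CUDA headers.

```cpp
// Hypothetical stand-in for CUDA's dim3 (1D only), so this builds as plain C++.
struct LaunchDims {
    unsigned block_x;  // threads per block
    unsigned grid_x;   // blocks in the grid
};

// Round the block count up so that grid_x * block_x >= n_threads.
// threads_per_block is an assumed tuning parameter, usually a multiple
// of the warp size (32).
LaunchDims make_launch_dims(unsigned n_threads, unsigned threads_per_block) {
    LaunchDims d;
    d.block_x = threads_per_block;
    d.grid_x = (n_threads + threads_per_block - 1) / threads_per_block;
    if (d.grid_x == 0) {
        d.grid_x = 1;  // always launch at least one block
    }
    return d;
}
```

With `n_threads = 1000` and 64 threads per block this gives 16 blocks, i.e. 1024 threads, so the kernel would need a guard such as `if (threadID < n_threads)` to keep the extra threads idle.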

Like I said, I’m a newbie, so I wrote threads_setup this way just because I thought 64 threads in a 2D block was a nice way of doing it. My kernel doesn’t actually make any use of the 2D layout, since I run this code to get my threadID:

int threadID;
if (gridDim.y > 1){
    threadID = ((blockIdx.x * (blockDim.x * blockDim.y)) + (blockIdx.y * (blockDim.x * blockDim.y * gridDim.x))) + ((threadIdx.x * blockDim.x) + threadIdx.y);
} else {
    threadID = ((blockIdx.x * (blockDim.x * blockDim.y))) + ((threadIdx.x * blockDim.x) + threadIdx.y);
}
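
One way to sanity-check an ID formula like this is to replay it on the host over every thread of a launch. The sketch below does that for the formula from the post and for the more conventional row-major form, where `threadIdx.x` varies fastest. The `Idx` struct and all function names are hypothetical stand-ins for the CUDA built-ins, so it builds as plain C++. Under these assumptions, the posted formula gives unique IDs for 8x8 blocks but would not for non-square blocks, and threads that are neighbours in x get IDs 8 apart. If threadID is used to index global memory, that stride may hurt coalescing, though whether it matters here depends on the kernel, which isn't shown. The `gridDim.y > 1` branch also looks unnecessary, since `blockIdx.y` is 0 whenever `gridDim.y` is 1.

```cpp
#include <set>

// Hypothetical stand-in for threadIdx / blockIdx / blockDim / gridDim.
struct Idx { unsigned x, y; };

// Formula from the post: threadIdx.x is multiplied by blockDim.x.
unsigned posted_id(Idx tid, Idx bid, Idx bdim, Idx gdim) {
    unsigned block_size = bdim.x * bdim.y;
    return bid.x * block_size + bid.y * block_size * gdim.x
         + tid.x * bdim.x + tid.y;
}

// Conventional row-major form: threadIdx.x varies fastest.
unsigned row_major_id(Idx tid, Idx bid, Idx bdim, Idx gdim) {
    unsigned block_size = bdim.x * bdim.y;
    unsigned block_id = bid.y * gdim.x + bid.x;
    return block_id * block_size + tid.y * bdim.x + tid.x;
}

// True if the formula assigns every thread of the launch a distinct
// ID in the range [0, total number of threads).
template <typename F>
bool ids_are_a_permutation(F f, Idx bdim, Idx gdim) {
    std::set<unsigned> seen;
    unsigned total = bdim.x * bdim.y * gdim.x * gdim.y;
    for (unsigned by = 0; by < gdim.y; ++by)
        for (unsigned bx = 0; bx < gdim.x; ++bx)
            for (unsigned ty = 0; ty < bdim.y; ++ty)
                for (unsigned tx = 0; tx < bdim.x; ++tx) {
                    unsigned id = f(Idx{tx, ty}, Idx{bx, by}, bdim, gdim);
                    if (id >= total || !seen.insert(id).second) {
                        return false;  // out of range or duplicate
                    }
                }
    return seen.size() == total;
}
```

For example, `ids_are_a_permutation(posted_id, Idx{8, 8}, Idx{16, 2})` checks the 2048-thread configuration from the setup function, and swapping in `Idx{16, 4}` for the block shape shows where the posted formula stops working.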

Another important detail about my kernel: it is made up of 5 functions that are called many times by a main function inside the kernel, so the times above are the sum over all of these functions.

That’s all I can think of that could be slowing my code down so much. What is the real problem? Does the block and grid organization influence performance this much? Is it wrong to have a lot of functions in the kernel? Is there an implicit sync barrier between function calls inside the kernel? Or is none of that the reason, and the problem is in my kernel code itself?

Please, if someone could point me in the right direction it would really help! Thanks for your attention!

If that’s your first CUDA program, I suggest starting with much simpler examples. Remember: a program doesn’t get a magic boost just because you recompiled it with CUDA. Unleashing GPU power needs much more work than CPU programming.

You can share the code and maybe someone will look into it. As a wild guess, it may be due to cache thrashing, for example.

Does the block and grid organization influence performance this much?

You can check that yourself.