About grid size and performance

I have a program that creates a large number of thread blocks. Each block's x and y IDs are used in the program to "parallelize a loop" and compute something.
For example, if a serial program has for(x = 0; x < 128; x++), then to parallelize it I launch with gridDim.x = 128 and each block runs the loop over its own 128/gridDim.x values of x starting from its blockIdx.x, i.e. each block runs the loop body once. If gridDim.x = 64, each block runs the loop body twice. I do the same with gridDim.y.
(In the same way, there are many threads per block in two dimensions, and each thread's x and y IDs are used to parallelize further loops.) Sometimes a result is written to global memory, sometimes it is not.
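
Roughly, the structure looks like this (an illustrative sketch only; the kernel name, the output buffer and the per-x computation are placeholders for what my real code does):

[codebox]
__global__ void work_kernel(float *out)              // 'out' is a placeholder output buffer
{
    const int N = 128;                               // length of the original serial loop
    int span = N / gridDim.x;                        // x values handled by each block
    for (int x = blockIdx.x * span; x < (blockIdx.x + 1) * span; ++x) {
        float val = (float)(x * x);                  // stand-in for the real per-x computation
        if (threadIdx.x == 0 && threadIdx.y == 0)
            out[x] = val;                            // sometimes a result is written to global memory
    }
}
[/codebox]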

Now I was noting down some performance statistics and saw that the program's total execution time was greater with a grid size of 128 (following the example) but was lowest with a grid size of 32. This is strange, because with a grid size of 32 each block is actually doing more work. What is the reason for this?
Reducing the block size gave the expected results, but reducing the grid size gave counterintuitive results.

It depends on your application.

If your application is memory-bound, then using one thread to handle a lot of data is a good choice, because you can use more registers to prefetch global data and then do a reduction or other computations on that register pool.

Generally speaking, using one thread to handle multiple data elements also saves index computation.
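
For example (just a sketch with illustrative names, assuming a simple per-thread sum over four elements):

[codebox]
// Illustrative only: each thread prefetches four elements into registers,
// then reduces them, so one thread handles several data elements and the
// index computation is amortized over the four loads.
__global__ void sum4_per_thread(const float *in, float *out, int n)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = tid * 4;
    if (base + 3 < n) {
        float r0 = in[base + 0];            // prefetch global data into registers
        float r1 = in[base + 1];
        float r2 = in[base + 2];
        float r3 = in[base + 3];
        out[tid] = (r0 + r1) + (r2 + r3);   // reduce in registers
    }
}
[/codebox]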

That's the thing: the program never needs to read any data; it only writes to global memory. Also, the program actually does more computation than memory writing.

That may be something to look out for, but the index computation is much less work than the actual computation taking place in each thread. And when I halve the grid size, that computation doubles. I hope I understood your response correctly.

OK, I managed to get the Visual Profiler to run my program for different grid sizes. I still can't figure out any reason why a smaller grid size performs better than a larger one. Also, something I noticed in the profiler counters is that the "instructions" count for the larger grid size is a lot higher (almost twice) than the count for half the grid size. If anything this should have been the opposite, because with half the grid size each block is actually doing more work!

How many threads per block? What is the occupancy?

If you only write to memory, the writes should not be a problem, because the thread does not need to wait for a response from the memory controller.

I keep the threads per block constant (at 256) and only change the grid size. The occupancy is 0.33. The program's performance is about the same for grid sizes of 64 and 32 (continuing with the example I started with). A grid size of 16 also performs better than a grid size of 128. If it helps, my device has just 2 multiprocessors and 16 cores.

That is only true if the program is compute-bound. If it is memory-bound, back-pressure from the memory controllers will still cause the cores to stall on writes.

Now I have a version of the code in which there are hardly any results to write (maybe just one result for a kernel launch that takes about 55 seconds). Even then, the performance is better with 1/2 or 1/3 of the grid size compared to a grid size of 128 (following the example from previous posts).

Do all your blocks need the same time to finish, or are there differences in the amount of work done in each one?

There may be a difference based on whether they find a result or not. There are branch instructions in the kernel code.

That could explain why you see better results with fewer blocks, at least if you are not on Fermi. On pre-Fermi cards, a new wave of blocks only starts once all blocks of the previous wave have finished. Thus the device is used most efficiently if all blocks have equal runtime.

In your case, with more work done in each block, runtime differences can even out, so overall device utilization is better.

To improve device utilization even further, you can roll your own block scheduler. Something along these lines will do:

[codebox]
__device__ void myKernel(dim3 myBlockIdx, parameters...)
{
    ...   // do work based on myBlockIdx instead of blockIdx
}

__device__ unsigned int block_counter;   // global work counter shared by all blocks

__global__ void block_scheduler(parameters...)
{
    __shared__ dim3 myBlockIdx;

    while (1) {
        // one thread per block grabs the next logical block index
        if ((threadIdx.x == 0) && (threadIdx.y == 0) && (threadIdx.z == 0)) {
            unsigned int block_index = atomicAdd(&block_counter, 1);
            myBlockIdx.x = block_index % gridDim.x;
            myBlockIdx.y = block_index / gridDim.x;
            myBlockIdx.z = 0;
        }
        __syncthreads();                  // make myBlockIdx visible to all threads

        if (myBlockIdx.y >= gridDim.y)    // all logical blocks have been handed out
            break;

        myKernel(myBlockIdx, parameters...);
        __syncthreads();                  // keep thread 0 from overwriting myBlockIdx too early
    }
}

int main(int argc, char *argv[])
{
    ...

    // replace your kernel call by this sequence:
    const unsigned int zero = 0;
    CUDA_CALL(cudaMemcpyToSymbol(block_counter, &zero, sizeof(block_counter), 0, cudaMemcpyHostToDevice));  // clear block counter
    block_scheduler<<<dimGrid, dimBlock>>>(parameters...);

    ...
}
[/codebox]