Launch configuration vs Performance

Hi folks, I am currently employing CUDA to solve certain variants of knapsack problems. What is relevant now is that the computation is in a single dimension. Also, before the computation part, each thread (thread block) needs to fetch certain data from global memory to shared memory.

Below are two launch configurations and corresponding kernels.

(a)

In this version, considering there are X cells (of a row) are to be computed, X threads are launched such that the thread block size is 768 and the number of thread blocks is B = X / 768. Each thread needs to compute a single cell.

Launch configuration: B, 768

I have checked that neither shared memory or register usage limits the occupancy and is equal to 100% for large enough X

Kernel:

j = global_id; // = blockIdx.x * blockDim.x + threadIdx.x;

fetch_data_to_shared_memory();

__syncthreads();

compute_using_shared_memory(j);  // This involves global memory fetches and computes the jth cell

In this kernel, each of the B thread blocks need to fetch data from global memory to shared memory and then go on to the computation part.

(b)

I was trying to improve shared memory usage and ended with the following version where just enough thread blocks are launched, to keep occupancy at 100% and each thread computes several cells.

Launch configuration: 28, 768 // I am working on a Tesla M2050 with 14 Compute units and hence 28 thread blocks.

Kernel:

j = global_id; // = blockIdx.x * blockDim.x + threadIdx.x;

fetch_data_to_shared_memory();

__syncthreads();

while(all_cells_computed) {

  compute_using_shared_memory(j);  // This involves global memory fetches and computes jth cell

  j += blockDim.x * gridDim.x;     // j is updated

}

In this kernel, the fetch from global memory to shared memory is done only 28 times (once by one thread block). The additional task in this kernel is the updation of j and the check if all cells are computed.

I see that version (b) performs worse, but I am unable to figure out the reason. Any idea why it is so?

/Bharath