grid/block/thread size confusion linear addressing

Hi all,
I have a list of float4's A (x, y, z, mass) representing particles, and I want to calculate their force contribution on another particle P. On the host side I copy the location of particle P into constant memory and the list A into global memory.

The idea of my kernel is to have each thread grab a single element of the list (one particle), work out the force calculation (the distance between that particle and P), and then dump the result into another list in shared memory on the card.

The final part of the plan is to reduce the list of results using one of the reduction kernels from the SDK and hand the final result back to my application.

I have managed to implement a kernel which reads a location in global memory, does the calculation, and then writes the answer to another part of global memory. I compare the answers against some CPU code which does the same thing, and it works, but only up to a certain size.

So with all that preamble, the important bits of my code are as follows:

__global__ void sumlist(float4* active_list, float3* partial_acc_list)
{
    int thid = (blockIdx.x * blockDim.x) + threadIdx.x;
    /* read active_list[thid]          */
    /* do processing ...               */
    /* write to partial_acc_list[thid] */
}

I invoke the kernel as follows:
#define BLOCK_SIZE 256

int list_size = 131072;
dim3 dimBlock(BLOCK_SIZE, 1);
dim3 dimGrid(list_size/BLOCK_SIZE, 1);
sumlist<<<dimBlock, dimGrid>>>(d_active_list, d_partial_acc_list );

The code works if list_size <= 131072 and otherwise returns a bunch of NaNs. I figure this is because something is going wrong with the memory addressing (thid) in the kernel. Since I'm only using a linear array, I've made the blocks and grid linear, which might be wrong, and I've tried to force it such that there are exactly enough threads so that each element of the list gets processed. Should I be using some kind of 2D memory addressing instead, so that I don't go off the rails?

Is this a terrible way to proceed? Should I have some kind of fixed block/thread size and then have the kernel grab and process as many list elements as needed so that the whole calculation is done (i.e. many particles per thread)?
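For reference, the "many particles per thread" idea is usually written as a grid-stride loop: launch a fixed number of blocks and have each thread step through the list by the total thread count. A minimal sketch (the kernel name and placeholder body are illustrative, not from the original post):

```cuda
// Fixed launch size; each thread strides by the total number of
// threads in the grid, so any list length is covered.
__global__ void sumlist_strided(const float4* active_list,
                                float3* partial_acc_list,
                                int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        float4 p = active_list[i];
        /* ... force calculation against P goes here ... */
        partial_acc_list[i] = make_float3(p.x, p.y, p.z); /* placeholder */
    }
}
```

With this shape you can launch, say, `sumlist_strided<<<128, 256>>>(...)` regardless of n; threads whose starting index is past the end simply do zero loop iterations.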

In my application the actual list size is going to be highly variable, and it is unlikely ever to be a power of two. I was naively going to get around this by having the kernel check whether its index was >= list_length and, if so, do nothing. This seems like a bad idea because of the time asymmetry it could introduce.

Hopefully someone can correct my misunderstanding of the way this should work.

My card is an 8800 Ultra.

You have dimGrid and dimBlock swapped in your call:

sumlist<<<dimBlock, dimGrid>>>(d_active_list, d_partial_acc_list );

and with them swapped you are launching blocks of list_size/256 threads, which hits the 512-thread-per-block limit at list_size = 131072 (131072/256 = 512) and exceeds it beyond that.

It should be:
sumlist<<<dimGrid, dimBlock>>>(d_active_list, d_partial_acc_list );

Thanks for the pointer about the kernel invocation. I swapped dimGrid and dimBlock and now it works up to 2^20 elements, but that seems very strange, because as you point out, shouldn't you be limited in the number of threads you can launch?

Also, how should I handle irregularly sized lists? Is it OK to introduce some branching and have the kernel check whether it's past the end of the list? And (again) CUDA seems to be only about 2.5 times faster than the CPU. How can I go about speeding things up? How do you check whether your reads are coalesced? Is that even possible with a float4? (I could use float3s instead and pass in another list of floats for the masses.)


float3 reads are uncoalesced because the load gets broken into a 64-bit and a 32-bit load, so neither load is contiguous and neither can be coalesced. float4s are coalesced, but only get half the throughput of 32-bit or 64-bit coalesced reads, for some unknown reason. A workaround for this is to use shared memory as a staging area, using the whole block to read in your particle list as floats. Then use a float4 pointer to read the same array in shared memory as float4 values and do the arithmetic. Since you have lots of blocks, the scheduler will be able to interleave the "read" phase and the "compute" phase of different blocks and hide some of the memory load time.
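A sketch of the staging idea described above, assuming the list length is a multiple of the block size (the kernel name and placeholder result are illustrative; the float4 buffer is passed in reinterpreted as a plain float pointer so the loads coalesce at 32-bit width):

```cuda
#define BLOCK_SIZE 256

__global__ void sumlist_staged(const float* active_list_f,
                               float3* partial_acc_list)
{
    __shared__ float stage[4 * BLOCK_SIZE];   // one float4 per thread

    // Read phase: the whole block loads 4*BLOCK_SIZE consecutive
    // floats, fully coalesced.
    int base = blockIdx.x * 4 * blockDim.x;
    for (int i = threadIdx.x; i < 4 * blockDim.x; i += blockDim.x)
        stage[i] = active_list_f[base + i];
    __syncthreads();

    // Compute phase: view shared memory as float4 and do the arithmetic.
    float4 p = ((float4*)stage)[threadIdx.x];
    /* ... distance / force calculation against P goes here ... */
    partial_acc_list[blockIdx.x * blockDim.x + threadIdx.x] =
        make_float3(p.x, p.y, p.z);           /* placeholder result */
}
```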

You can launch at most 512 threads per block, but the total number of threads in the grid (blocks × threads per block) can be much, much larger.