grid/block/thread size confusion linear addressing

Hi all,
I have a list of float4's A (x, y, z, mass) representing particles, and I want to calculate their force contribution on another particle P. On the host side I copy the location of particle P into constant memory and the list A into global memory.

The idea of my kernel is to have each thread grab a single element of the list (one particle), work out the force calculation (the distance between that particle and P), and then dump the result into another list in shared memory on the card.

The final part of the plan is to reduce the list of results using one of the reduction kernels from the SDK and hand the final result back to my application.

I have managed to implement a kernel which reads a location in global memory, does the calculation, and then writes the answer to another part of global memory. I compare the answers against some CPU code which does the same thing, and it works, but only up to a certain size.

So with all that preamble, the important bits of my code are as follows:

__global__ void sumlist(float4* active_list, float3* partial_acc_list)
{
    int thid = (blockIdx.x * blockDim.x) + threadIdx.x;
    /* read active_list[thid]          */
    /* do processing ...               */
    /* write to partial_acc_list[thid] */
}

I invoke the kernel as follows:
#define BLOCK_SIZE 256

int list_size = 131072;
dim3 dimBlock(BLOCK_SIZE, 1);
dim3 dimGrid(list_size/BLOCK_SIZE, 1);
sumlist<<<dimBlock, dimGrid>>>(d_active_list, d_partial_acc_list );

The code works if list_size <= 131072 and otherwise returns a bunch of NaNs. I figure this is because something is going wrong with the memory addressing (thid) in the kernel. Since I'm only using a linear array, I've made the blocks and grid linear, which might be wrong, and I've tried to force it such that there are exactly enough threads so that each element of the list gets processed. Should I be using some kind of 2D memory addressing instead, so that I don't go off the rails?

Is this a terrible way to proceed? Should I have some kind of fixed block/thread size and then have the kernel grab and process as many list elements as needed so that the whole calculation is done (i.e. many particles per thread)?
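For reference, the "many particles per thread" idea is usually written as a grid-stride loop: launch a fixed number of blocks and have each thread step through the list by the total thread count. A minimal sketch (the kernel name and placeholder body are illustrative, not from the original post):

```cuda
// Fixed launch size; each thread strides by the total number of
// threads in the grid, so any list length is covered.
__global__ void sumlist_strided(const float4* active_list,
                                float3* partial_acc_list,
                                int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        float4 p = active_list[i];
        /* ... force calculation against P goes here ... */
        partial_acc_list[i] = make_float3(p.x, p.y, p.z); /* placeholder */
    }
}
```

With this shape you can launch, say, `sumlist_strided<<<128, 256>>>(...)` regardless of n; threads whose starting index is past the end simply do zero loop iterations.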

In my application the actual list size is going to be highly variable, and it is unlikely ever to be a power of two. I was naively going to get around this by having the kernel check whether its index was >= list_length and, if so, do nothing. This seems like a bad idea because of the time asymmetry it could introduce.

Hopefully someone can correct my misunderstanding of the way this should work.

My card is an 8800 Ultra.

You have dimGrid and dimBlock swapped in your call:

sumlist<<<dimBlock, dimGrid>>>(d_active_list, d_partial_acc_list );

and with them swapped you are launching blocks of list_size/256 threads, which hits the 512-thread-per-block limit at list_size = 131072 (131072/256 = 512) and exceeds it beyond that.

It should be:
sumlist<<<dimGrid, dimBlock>>>(d_active_list, d_partial_acc_list );

Thanks for the pointer about the kernel invocation. I swapped dimGrid and dimBlock and now it works up to 2^20 elements, but that seems very strange, because as you point out, shouldn't you be limited in the number of threads you can launch?

Also, how should I handle irregularly sized lists? Is it OK to introduce some branching and have the kernel check whether it's past the end of the list? And (again) CUDA seems to be only about 2.5 times faster than the CPU. How can I go about speeding things up? How do you check whether your reads are coalesced? Is that even possible with a float4? (I could use float3s instead and pass in another list of floats for the masses.)


float3 reads are uncoalesced because the load gets broken into a 64-bit and a 32-bit load, so neither load is contiguous and neither can be coalesced. float4s are coalesced, but only get half the throughput of 32-bit or 64-bit coalesced reads, for some unknown reason. A workaround for this is to use shared memory as a staging area, using the whole block to read in your particle list as floats. Then use a float4 pointer to read the same array in shared memory as float4 values and do the arithmetic. Since you have lots of blocks, the scheduler will be able to interleave the "read" phase and the "compute" phase of different blocks and hide some of the memory load time.
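A sketch of the staging idea described above, assuming the list length is a multiple of the block size (the kernel name and placeholder result are illustrative; the float4 buffer is passed in reinterpreted as a plain float pointer so the loads coalesce at 32-bit width):

```cuda
#define BLOCK_SIZE 256

__global__ void sumlist_staged(const float* active_list_f,
                               float3* partial_acc_list)
{
    __shared__ float stage[4 * BLOCK_SIZE];   // one float4 per thread

    // Read phase: the whole block loads 4*BLOCK_SIZE consecutive
    // floats, fully coalesced.
    int base = blockIdx.x * 4 * blockDim.x;
    for (int i = threadIdx.x; i < 4 * blockDim.x; i += blockDim.x)
        stage[i] = active_list_f[base + i];
    __syncthreads();

    // Compute phase: view shared memory as float4 and do the arithmetic.
    float4 p = ((float4*)stage)[threadIdx.x];
    /* ... distance / force calculation against P goes here ... */
    partial_acc_list[blockIdx.x * blockDim.x + threadIdx.x] =
        make_float3(p.x, p.y, p.z);           /* placeholder result */
}
```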

You can launch at most 512 threads per block, but the total number of threads in the grid (blocks × threads per block) can be much, much larger.