I am wondering if I need to manually group data when writing to global memory from a kernel to improve efficiency.
This specific question came up while I was working on my Monte Carlo simulator: when a particle hits the boundary, I need to save a set of metrics to global memory. The number of metrics depends on the user's input; for example, users can choose to save the propagation time, the path length, etc.
I am planning to use shared memory to accumulate the metrics inside each thread, and then, when a particle hits the boundary, dump these numbers to global memory with a loop:
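For context, here is a minimal sketch of what I mean; the names `shared_metric`, `global_metric`, `numcount`, and `detid` are placeholders I am using for illustration, not my actual code:

```cuda
// Illustrative sketch only: each thread stages numcount floats in shared
// memory and, on a boundary hit, copies them to global memory one at a time.
extern __shared__ float shared_metric[];   // numcount floats per thread

__device__ void dump_metrics(float *global_metric, int numcount, int detid)
{
    // Pointer to this thread's slice of the shared-memory staging buffer.
    float *local = shared_metric + threadIdx.x * numcount;

    for (int i = 0; i < numcount; i++)     // one 4-byte store per metric
        global_metric[detid * numcount + i] = local[i];
}
```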
numcount (1~6) is passed in from the host. For this kind of write, should I make shared_metric/global_metric a float4 to reduce the total number of write transactions, or will CUDA automatically group the writes to get the best performance? Are there any potential efficiency issues?
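To make the float4 alternative concrete, a hedged sketch of what I have in mind (again with assumed names; this variant fixes numcount at 4 and assumes the staging buffer and the global destination are 16-byte aligned):

```cuda
// Hypothetical float4 variant: pack four consecutive metrics into one
// vectorized store, so the compiler emits a single 128-bit (16-byte)
// transaction instead of four separate 4-byte stores.
extern __shared__ float shared_metric[];   // 4 floats per thread, 16B aligned

__device__ void dump_metrics_vec(float4 *global_metric, int detid)
{
    // Reinterpret this thread's four floats as one float4; valid only if
    // the shared-memory offset is 16-byte aligned.
    float4 v = *reinterpret_cast<float4 *>(shared_metric + threadIdx.x * 4);
    global_metric[detid] = v;              // single vectorized store
}
```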