I am wondering if I need to manually group data when writing to global memory from a kernel to improve efficiency.
This specific question came up while I was working on my Monte Carlo simulator: when a particle hits the boundary, I need to save a set of metrics to global memory. The number of metrics depends on the user's input; for example, users can choose to save the propagation time, the path length, etc.
I am planning to use shared memory to accumulate the metrics inside each thread, and then, when a particle hits the boundary, dump these numbers to global memory with a loop:
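For context, here is a minimal sketch of what I mean; the names `shared_metric`, `global_metric`, `numcount`, and `detid` are placeholders I am using for illustration, not my actual code:

```cuda
// Illustrative sketch only: each thread stages numcount floats in shared
// memory and, on a boundary hit, copies them to global memory one at a time.
extern __shared__ float shared_metric[];   // numcount floats per thread

__device__ void dump_metrics(float *global_metric, int numcount, int detid)
{
    // Pointer to this thread's slice of the shared-memory staging buffer.
    float *local = shared_metric + threadIdx.x * numcount;

    for (int i = 0; i < numcount; i++)     // one 4-byte store per metric
        global_metric[detid * numcount + i] = local[i];
}
```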
numcount (1~6) is passed in from the host. For this kind of write, should I make shared_metric/global_metric a float4 to reduce the total number of write transactions, or will CUDA automatically group the writes to get the best performance? Are there any potential efficiency issues?
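To make the float4 alternative concrete, a hedged sketch of what I have in mind (again with assumed names; this variant fixes numcount at 4 and assumes the staging buffer and the global destination are 16-byte aligned):

```cuda
// Hypothetical float4 variant: pack four consecutive metrics into one
// vectorized store, so the compiler emits a single 128-bit (16-byte)
// transaction instead of four separate 4-byte stores.
extern __shared__ float shared_metric[];   // 4 floats per thread, 16B aligned

__device__ void dump_metrics_vec(float4 *global_metric, int detid)
{
    // Reinterpret this thread's four floats as one float4; valid only if
    // the shared-memory offset is 16-byte aligned.
    float4 v = *reinterpret_cast<float4 *>(shared_metric + threadIdx.x * 4);
    global_metric[detid] = v;              // single vectorized store
}
```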