Hi, I have a program that writes a value to global memory coalesced as float2 values, so every thread is writing 8 bytes apart from the others. I have tried changing this to float and saw the same results. I’m only writing about 3MB of data between all the threads in a kernel with 224 blocks and 160 threads. It’s taking about 400us to write all of those values. That’s only about 7.5GB/s throughput. Does anyone know what could be wrong? I’m using a Titan.
What GPU are you using?
Can you provide a reproducible?
If each thread calculates an address (usually using blockIdx and threadIdx and issues a single float2 store then you will not be able to saturate the write data path. The latency of launching each warp and calculating the address will greatly limit the throughput. Try having each thread perform several writes.
Thanks Greg, I ended up figuring out what the problem was. It wasn’t the global writes that were taking long, but the loop above that which had more iterations than it should have. When I removed the global writes it was optimizing out the loop.