About global memory

Hi guys,

I’ve got a few questions about CUDA’s global memory:

  1. I’ve got a simple kernel in which each thread reads a single element from a matrix, increments it, and writes the result back to the same position in global memory. All memory accesses are coalesced, so I would expect the number of coalesced reads and writes to be the same. However, in the profile generated by the CUDA profiler, the number of coalesced writes is four times higher than the number of coalesced reads. Can anyone tell me why that is? (My platform is a GeForce 8500 GT with CUDA 1.1.)
  2. The manual says that a global memory access takes 400 to 600 clock cycles. Suppose I’ve got 16 coalesced global memory accesses followed by 16 individual (non-coalesced) accesses. Can I expect the total to be roughly 400 + 400 * 16 clock cycles, assuming a single memory access takes 400 cycles?
  3. Is there a quick way to estimate a kernel’s memory requirements by inspecting either the kernel code or the PTX code? If there is, how does the estimate change when memory coalescing is taken into account?
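
To make question 2 concrete, here is the arithmetic behind my estimate as a small C++ sketch. It rests on two assumptions of mine that may well be wrong: the 16 coalesced accesses count as a single transaction, and transactions are fully serialized with no overlap between them. The names `kLatencyCycles` and `naive_cycles` are made up for illustration.

```cpp
#include <cstdint>

// Assumed latency of one global memory transaction, in clock cycles
// (the low end of the 400 to 600 range quoted in the manual).
constexpr std::uint64_t kLatencyCycles = 400;

// Naive serialized model (an assumption, not measured behaviour):
// each coalesced group counts as one transaction, each uncoalesced
// access counts as its own transaction, and no latency is overlapped.
constexpr std::uint64_t naive_cycles(std::uint64_t coalesced_groups,
                                     std::uint64_t uncoalesced_accesses,
                                     std::uint64_t latency = kLatencyCycles) {
    return (coalesced_groups + uncoalesced_accesses) * latency;
}

// The case from question 2: one coalesced group of 16 accesses,
// followed by 16 uncoalesced accesses.
static_assert(naive_cycles(1, 16) == 400 + 400 * 16,
              "matches the estimate in question 2");
```

With the upper bound of 600 cycles, the same model gives `naive_cycles(1, 16, 600)`, i.e. 10200 cycles. What I’d like to know is whether this kind of serialized model is anywhere near what the hardware actually does.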