About global memory

128Cores · October 19, 2008, 1:47pm

Hi guys,

I’ve got a few questions about CUDA’s global memory:

I’ve got a simple kernel in which each thread reads a single element from a matrix, increments it, and writes the result back to the same position in global memory. All memory accesses are coalesced, so I would expect the number of coalesced reads and writes to be the same. However, in the profile generated by the CUDA profiler, the number of coalesced writes is four times higher than the number of coalesced reads. Can anyone tell me why that is? (My platform is an 8500 GT with CUDA 1.1.)
In the manual, it says that the global memory access time is 400 to 600 cycles. Suppose I’ve got 16 global memory accesses that are coalesced followed by 16 individual memory accesses (not coalesced), can I expect the total clock cycles to be roughly 400 + 400 * 16? (assume a memory access takes 400 cc)
Is there any fast way of estimating a kernel’s memory requirement by inspecting either the kernel code or PTX code? If there is, how will it be different if memory coalescing is taken into account?
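
For reference, the kernel in the first question looks roughly like the sketch below. This is not the exact code: the names `incrementMatrix` and `runIncrement` are made up for illustration, and it assumes a row-major matrix of `float` whose width is a multiple of 16, so that each half-warp touches 16 consecutive, aligned 4-byte words.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread reads one element, increments it,
// and writes it back to the same position in global memory.
__global__ void incrementMatrix(float *m, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        // Threads that are consecutive in x access consecutive addresses.
        int idx = row * width + col;
        m[idx] = m[idx] + 1.0f;   // one global read, one global write
    }
}

// Hypothetical host helper: runs the kernel once and returns true
// if every element came back incremented by exactly one.
bool runIncrement(int width, int height)
{
    size_t n = (size_t)width * height;
    std::vector<float> host(n);
    for (size_t i = 0; i < n; ++i) host[i] = (float)(i % 100);

    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);   // one matrix row segment per half-warp
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    incrementMatrix<<<grid, block>>>(dev, width, height);

    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(host.data(), dev, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (size_t i = 0; i < n; ++i)
        if (host[i] != (float)(i % 100) + 1.0f) return false;
    return true;
}
```

With a 16 x 16 block, each row of the block is one half-warp, which is why I would expect every read and every write here to be counted as coalesced.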

Thanks,
J.G
