Global Memory Coalescing on Devices with Compute Capability 1.2 and Higher

Hi,

I have a question about writing from shared memory to global memory. Any help is appreciated.

The CUDA Programming Guide 2.3.1 says the following:

The global memory access by all threads of a half-warp is coalesced into a single memory transaction as soon as the words accessed by all threads lie in the same segment of size equal to:

32 bytes if all threads access 1-byte words,
64 bytes if all threads access 2-byte words,
128 bytes if all threads access 4-byte or 8-byte words.
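To make the rule above concrete, here is a small host-side C++ sketch of the segment sizes as quoted. The function names `segment_bytes` and `words_per_segment` are my own, for illustration only; they are not part of any CUDA API.

```cpp
#include <cassert>
#include <cstddef>

// Segment size in bytes for a given word size, following the quoted rule.
// Returns 0 for word sizes the quoted rule does not mention.
std::size_t segment_bytes(std::size_t word_bytes) {
    switch (word_bytes) {
        case 1: return 32;
        case 2: return 64;
        case 4:
        case 8: return 128;
        default: return 0;
    }
}

// Number of words of the given size that fit in one segment.
std::size_t words_per_segment(std::size_t word_bytes) {
    const std::size_t seg = segment_bytes(word_bytes);
    assert(seg != 0 && "word size not covered by the quoted rule");
    return seg / word_bytes;
}
```

For 2-byte short integers this gives a 64-byte segment holding 32 words.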

So let’s say I have N threads in a 1-D block and S short integers (2 bytes each) in a 1-D array in shared memory, and I use these N threads to write the S short integers to global memory. How many memory transactions will this take?

Assume thread 0 writes S0, thread 1 writes S1, …, thread N-1 writes S(N-1); then thread 0 writes S(N), thread 1 writes S(N+1), and so on until all S short integers are written.
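The write pattern above can be walked through on the host to see which segments each half-warp touches. This is only a rough sketch under a simplified model, not a statement of what the hardware actually does. It assumes (none of this is in the guide text quoted above) that the global array starts on a 64-byte boundary, that a half-warp is 16 consecutive threads, and that each half-warp issues one transaction per distinct 64-byte segment it touches in a pass. The name `count_transactions` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts transactions for the pattern "thread t writes element base + t,
// then base advances by N" under the simplified model described above.
std::size_t count_transactions(std::size_t S, std::size_t N) {
    assert(N > 0);
    const std::size_t kHalfWarp = 16;         // threads per half-warp (assumed)
    const std::size_t kWordsPerSegment = 32;  // 64-byte segment / 2-byte short
    std::size_t total = 0;
    // Each pass covers elements [base, base + N).
    for (std::size_t base = 0; base < S; base += N) {
        // Visit the block one half-warp at a time.
        for (std::size_t first = 0; first < N; first += kHalfWarp) {
            std::set<std::size_t> segments;   // distinct segments touched
            for (std::size_t t = first; t < first + kHalfWarp && t < N; ++t) {
                const std::size_t idx = base + t;
                if (idx < S) segments.insert(idx / kWordsPerSegment);
            }
            total += segments.size();
        }
    }
    return total;
}
```

Calling `count_transactions(S, N)` for a few block sizes gives numbers to compare against a hand calculation. The model ignores any reduction in transaction size and any misalignment of the array base.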

So the number of memory segments is S/32 (each 64-byte segment holds 32 short integers), and each segment will cause ceiling(32/N) transactions. The total number of memory transactions is therefore:
(S/32) * ceiling(32/N).
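For reference, the estimate above written out as a small C++ sketch with integer arithmetic. The names `ceil_div` and `estimated_transactions` are mine, and rounding S/32 up when S is not a multiple of 32 is my own assumption, since the formula does not say how to handle that case.

```cpp
#include <cassert>
#include <cstddef>

// Integer ceiling division; b must be nonzero.
std::size_t ceil_div(std::size_t a, std::size_t b) {
    assert(b != 0);
    return (a + b - 1) / b;
}

// The estimate from this post: (S / 32) * ceiling(32 / N).
// S / 32 is rounded up here (assumption) to cover a partial last segment.
std::size_t estimated_transactions(std::size_t S, std::size_t N) {
    return ceil_div(S, 32) * ceil_div(32, N);
}
```

For example, `estimated_transactions(64, 16)` evaluates to 2 * 2 = 4.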

Is the above calculation correct?