Coalesced Memory access related doubt

Dear All,

I am having a little trouble understanding coalesced memory access and how exactly to achieve it. The following is my kernel code:

__global__ void myKernel(int * sum, int W, int H, int dw, int dh, myType * classifier, int stgCnt, int clfCnt, bool * out)
{
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    int ty = blockIdx.y * blockDim.y + threadIdx.y;
    int threadNum = threadIdx.x + threadIdx.y * blockDim.x;
    int idx = (blockIdx.x + (blockIdx.y * gridDim.x)) * blockDim.x * blockDim.y + (threadIdx.x + threadIdx.y * blockDim.x);

    __shared__ int shar_sum[36*28];

    for (int i = 0; i < 4; i++)
    {
        if (threadNum < 36*28/4)
        {
            int temp = 0;
            if (tx/2 + dw < W || ty + dh < H)
            {
                int * iptr = &sum[blockIdx.y*W*(BLOCK_SIZE) + (BLOCK_SIZE/2)*blockIdx.x];
                temp = iptr[4*(threadNum%7) + i + (threadNum/7)*W];
            }
            shar_sum[4*threadNum + i] = temp;
        } // if threadNum
    } // i
    __syncthreads();

    //........ Processing
}

In this code each thread reads 4 integers (16 bytes) from global memory into shared memory. Now, how do I determine whether this access is coalesced or not? Do I have to do something at kernel launch to make sure the accesses are coalesced? Someone please help me understand this.

Regards

The coalescing rules are explained pretty clearly in Appendix G of the current programming guide, but the simple rule is that the threads of a half-warp have to read from the same aligned segment of global memory (whose size depends on the word size being read) for the read to be coalesced, resulting in a single 32, 64 or 128 byte load from global memory servicing the whole half-warp of threads.

Thanks for the reply. I already tried to understand this by reading the programming guide, but I was not able to. The programming guide says the following:

  • Find the memory segment that contains the address requested by the lowest numbered active thread. The segment size depends on the size of the words accessed by the threads:
     32 bytes for 1-byte words,
     64 bytes for 2-byte words,
     128 bytes for 4-, 8- and 16-byte words.
  • Find all other active threads whose requested address lies in the same segment.
  • Reduce the transaction size, if possible:
     If the transaction size is 128 bytes and only the lower or upper half is used, reduce the transaction size to 64 bytes;
     If the transaction size is 64 bytes (originally or after reduction from 128 bytes) and only the lower or upper half is used, reduce the transaction size to 32 bytes.
  • Carry out the transaction and mark the serviced threads as inactive.
  • Repeat until all threads in the half-warp are serviced.

Now, what does “32 bytes for 1-byte words” mean here? Does it mean that if I am accessing 1 byte of data from global memory, the hardware will actually read 32 bytes to get it?
What does “reduce the transaction size” mean?
I know these questions might seem very trivial to you, but I am finding this difficult to understand. I am new to CUDA.

No. It means that if the threads of a warp are reading a byte each, and those bytes all fall into the same 32-byte-long chunk of memory, the hardware will coalesce the read requests into a single 32-byte load transaction. So all the threads get their data in one transaction, as opposed to 32 separate transactions, which is much faster.
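
If it helps, the “reduce the transaction size” steps can be followed mechanically. Here is a little host-side sketch of my own (not NVIDIA code, just an illustration of the algorithm quoted above) that services one half-warp of 4-byte loads, for which the segment size is 128 bytes:

#include <stdio.h>

int main()
{
    // byte addresses requested by the 16 threads of one half-warp
    unsigned addr[16];
    for (int t = 0; t < 16; ++t)
        addr[t] = t * 4;                   // perfectly sequential 4-byte loads

    bool served[16] = { false };
    int transactions = 0;
    for (;;) {
        // find the lowest numbered thread not yet serviced
        int first = -1;
        for (int t = 0; t < 16; ++t)
            if (!served[t]) { first = t; break; }
        if (first < 0) break;              // every thread serviced

        // find the 128-byte segment containing its address, then mark
        // every other thread whose address lies in the same segment
        unsigned seg = addr[first] & ~127u;
        unsigned lo = ~0u, hi = 0;
        for (int t = 0; t < 16; ++t)
            if (!served[t] && (addr[t] & ~127u) == seg) {
                served[t] = true;
                if (addr[t] < lo) lo = addr[t];
                if (addr[t] > hi) hi = addr[t];
            }

        // "reduce the transaction size, if possible"
        unsigned size = 128;
        if (lo >= seg + 64 || hi < seg + 64) {          // only one half used
            size = 64;
            unsigned base = (lo >= seg + 64) ? seg + 64 : seg;
            if (lo >= base + 32 || hi < base + 32)      // only one quarter used
                size = 32;
        }
        ++transactions;
        printf("transaction %d: %u bytes\n", transactions, size);
    }
    printf("total transactions: %d\n", transactions);
    return 0;
}

With the sequential addresses above, it prints a single 64-byte transaction (128 reduced to 64, because the 16 ints only occupy the lower half of the segment). Change the addr[] initialization to a scattered pattern and you will see the transaction count climb toward 16.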

Thanks again for the quick reply.

So in the code which I sent, each thread is reading 4 consecutive integers (i.e. 16 bytes each). That means one warp will be reading 16*32 = 512 bytes of data. So does it mean that the hardware will now take only 512/128 = 4 transactions to load this data? Does that mean the code which I sent is already coalesced? Is there anything more that could be done for this kind of access?

No. Coalescing is an assembly instruction level process, not some sort of long range, out of order execution one. Your code performs 4 discrete integer loads per thread, and across a warp each load doesn’t fall into the same 32-integer-word (128-byte) segment of memory. The loads in your code are not coalesced.
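
You can check this by working out the word offsets your load expression generates across one half-warp. A quick host loop (my sketch; W = 640 is just an assumed image width, and I fix i = 0):

#include <stdio.h>

int main()
{
    const int W = 640;  // assumed row pitch, for illustration only
    // offsets produced by: iptr[4*(threadNum%7) + i + (threadNum/7)*W], with i = 0
    for (int threadNum = 0; threadNum < 16; ++threadNum) {
        int off = 4 * (threadNum % 7) + (threadNum / 7) * W;
        printf("thread %2d -> word %4d (byte %5d)\n", threadNum, off, 4 * off);
    }
    return 0;
}

Adjacent threads are 4 words (16 bytes) apart, and threads 7 and 14 jump to entirely different rows, so the 16 requests of a half-warp are spread over several 128-byte segments instead of one, and only every 4th word of each touched segment is actually used.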

Sorry, but I couldn’t really understand what you said here. Can you give some example to explain this?

I really don’t know how to state it more clearly. There are even nice diagrams in Appendix G of recent versions of the programming guide.

Here is some code off the top of my head. Two skeleton kernels that might do the same thing (never been compiled let alone run, so caveat emptor). One should have coalesced reads. The other should not.

__global__ void coalesced(float *x)
{
    __shared__ float _buff[512];
    // assume 128 threads per block; each block stages 512 consecutive values
    int base = 4 * blockDim.x * blockIdx.x;
    // each thread reads 4 values to shared memory; on each pass of the
    // loop, consecutive threads read consecutive addresses
    for (int i = threadIdx.x; i < 512; i += 128)
        _buff[i] = x[base + i];
}

__global__ void uncoalesced(float *x)
{
    __shared__ float _buff[512];
    // assume 128 threads per block; each block stages 512 consecutive values
    int base = 4 * blockDim.x * blockIdx.x;
    // each thread reads 4 consecutive values, so adjacent threads are
    // always 16 bytes apart and their reads cannot be combined
    for (int i = 0; i < 4; i++)
        _buff[4*threadIdx.x + i] = x[base + 4*threadIdx.x + i];
}

I also was not clear on this until I wrote my own example. Here is another example with an explanation (perhaps a bit pedantic, but folks can skip my explanation). Cut and paste the following code into a program (and adapt it if you don’t use Windows).

#include <stdio.h>
#include <malloc.h>
#include <sys/timeb.h>
#include <time.h>
// The following includes are for MS VC++ IntelliSense to flag problems
// in the CUDA syntax that NVCC handles automatically, e.g. "__global__".
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_functions.h>
#include <device_launch_parameters.h>

// NO COALESCING -- ELEMENTS IN EACH HALF WARP OUT OF SEQUENCE.
// NOTE: ON FERMI CARDS THIS DOESN'T MATTER, BECAUSE THEY USE CACHING.
__device__ int reverse[32] = { 0,
     2,  1,  3,  4,  5,  6,  7,  8,  9, 10,
    11, 12, 13, 14, 15, 16, 18, 17, 19, 20,
    21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 };

__global__ void no_coalesce(int * data, int n, int iter)
{
    // assume one block of size 32.
    int idx = threadIdx.x;
    __shared__ int sr[32];
    sr[idx] = reverse[idx];
    __syncthreads();
    for (int i = 0; i < iter; ++i)
        data[sr[idx]] += n;
}

// NO COALESCING -- SOME ELEMENTS FETCHED ARE NOT CONTIGUOUS.
__device__ int extended[32] = { 0,
     1, 333,  3,  4,  5,  6,  7,  8,  9, 10,
    11, 12, 13, 14, 15, 566, 17, 18, 19, 20,
    21, 22, 222, 24, 25, 26, 27, 28, 29, 30, 444 };

__global__ void no_coalesce2(int * data, int n, int iter)
{
    // assume one block of size 32.
    int idx = threadIdx.x;
    __shared__ int ex[32];
    ex[idx] = extended[idx];
    __syncthreads();
    for (int i = 0; i < iter; ++i)
        data[ex[idx]] += n;
}

// FULL COALESCING -- EACH HALF WARP ACCESSES A CONTIGUOUS, IN-SEQUENCE
// 64-BYTE SEGMENT.
__device__ int forward[32] = { 0,
     1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
    11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
    21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 };

__global__ void coalesce(int * data, int n, int iter)
{
    // assume one block of size 32.
    int idx = threadIdx.x;
    __shared__ int sf[32];
    sf[idx] = forward[idx];
    __syncthreads();
    for (int i = 0; i < iter; ++i)
        data[sf[idx]] += n;
}

int main(int argc, char ** argv)
{
    argc--; argv++;
    // First arg is an int, the cuda device number.
    int rvdev = cudaSetDevice(atoi(*argv));

    // Setup for "in" host array.
    int n = 32;        // number of elements of the array changed by the kernels.
    int length = 5000; // actual length of the array (renamed from "extended" so
                       // it does not shadow the device array of the same name).
    int * in = (int*)malloc(length * sizeof(int));
    for (int i = 0; i < length; ++i)
        in[i] = 0;

    // Setup for "out" host array.
    int * out = (int*)malloc(length * sizeof(int));

    // Timers.
    struct _timeb t1;
    struct _timeb t2;
    struct _timeb t3;
    struct _timeb t4;

    /////////////////////////////////////////////
    // Test no_coalesce example.
    /////////////////////////////////////////////
    printf("Starting GPU test v1 ...\n");
    _ftime(&t1);
    int * din;
    int rv1 = cudaMalloc(&din, length * sizeof(int));
    _ftime(&t2);
    int rv2 = cudaMemcpy(din, in, length * sizeof(int), cudaMemcpyHostToDevice);
    _ftime(&t3);
    int internal_iters = 10000000;
    int block_size = 32;
    int blocks = 1;
    dim3 block(block_size);
    dim3 grid(blocks);
    no_coalesce<<<grid, block>>>(din, n, internal_iters);
    cudaThreadSynchronize();
    int rv3 = cudaGetLastError();
    if (rv3)
        printf("last error %d\n", rv3);
    _ftime(&t4);
    printf("N Time t4-t3 %f\n", (double)(t4.time - t3.time + ((double)(t4.millitm - t3.millitm)) / 1000));
    int rv4 = cudaMemcpy(out, din, length * sizeof(int), cudaMemcpyDeviceToHost);
    //for (int i = 0; i < block_size; ++i)
    //    printf("%d %d\n", i, out[i]);

    /////////////////////////////////////////////
    // Test coalesce example.
    /////////////////////////////////////////////
    // Reset device "in" array.
    int rv5 = cudaMemcpy(din, in, length * sizeof(int), cudaMemcpyHostToDevice);
    _ftime(&t3);
    coalesce<<<grid, block>>>(din, n, internal_iters);
    cudaThreadSynchronize();
    int rv6 = cudaGetLastError();
    if (rv6)
        printf("last error %d\n", rv6);
    _ftime(&t4);
    printf("C Time t4-t3 %f\n", (double)(t4.time - t3.time + ((double)(t4.millitm - t3.millitm)) / 1000));
    int rv7 = cudaMemcpy(out, din, length * sizeof(int), cudaMemcpyDeviceToHost);
    //for (int i = 0; i < block_size; ++i)
    //    printf("%d %d\n", i, out[i]);

    /////////////////////////////////////////////
    // Test no_coalesce2 example.
    /////////////////////////////////////////////
    // Reset device "in" array.
    int rv8 = cudaMemcpy(din, in, length * sizeof(int), cudaMemcpyHostToDevice);
    _ftime(&t3);
    no_coalesce2<<<grid, block>>>(din, n, internal_iters);
    cudaThreadSynchronize();
    int rv9 = cudaGetLastError();
    if (rv9)
        printf("last error %d\n", rv9);
    _ftime(&t4);
    printf("E Time t4-t3 %f\n", (double)(t4.time - t3.time + ((double)(t4.millitm - t3.millitm)) / 1000));
    int rv10 = cudaMemcpy(out, din, length * sizeof(int), cudaMemcpyDeviceToHost);
    //for (int i = 0; i < block_size; ++i)
    //    printf("%d %d\n", i, out[i]);

    cudaFree(din);
    free(in);
    free(out);
    return 0;
}

This code accesses an array in three different patterns. For an explanation of the rules for coalescing (of global memory), see section G.3.2 of the NVIDIA CUDA C Programming Guide (PG).

“coalesce” accesses a 128-byte array sequentially. First, all data is accessed in words (from the PG, “The size of the words accessed by the threads must be 4, 8, or 16 bytes;”, and in the code “data[sf[idx]] += n;”, “data” is an array of 4-byte values). Next, all data for each half warp lies within 64 bytes (“If this size is: 4, all 16 words must lie in the same 64-byte segment”; “data” is only accessed from &data[0] through &data[15] in the first half warp and &data[16] through &data[31] in the second half warp). Finally, all data is accessed in sequence: &data[0] through &data[15] in the first half warp, &data[16] through &data[31] in the second half warp. The array “forward” specifies the indices of the access. Notice that these rules apply to a compute capability 1.0 or 1.1 device like the GeForce 9800 GT.

“no_coalesce” accesses the same 128-byte array in a non-sequential manner, so coalescing does not occur. While all data is accessed in words, and all data for each half warp lies within 64 bytes, “data” is NOT accessed in sequence for each half warp. In the first half warp, “data” is accessed in the order &data[0], &data[2], &data[1], &data[3], &data[4], …, &data[15]; in the second half warp, in the order &data[16], &data[18], &data[17], &data[19], …, &data[31]. This violates the rule that says the kth thread must access the kth word.

“no_coalesce2” accesses the array in a non-contiguous manner, so coalescing does not occur. While all data is accessed in words, the data for each half warp does NOT lie within one contiguous 64-byte segment. The first two accesses are contiguous (&data[0], &data[1]), but the third is not (&data[333]).

On my GeForce 9800 GT, the output from this example is:

Starting GPU test v1 ...
N Time t4-t3 1.156000
C Time t4-t3 0.800000
E Time t4-t3 1.157000

These times show coalescing for the 2nd case (“C”, i.e. “coalesce”) only. The other two do not have coalesced access of global memory.

However, if I run this on my GeForce GTX 470, the output is:

Starting GPU test v1 ...
N Time t4-t3 0.626000
C Time t4-t3 0.626000
E Time t4-t3 0.626000

These times show no difference between the access patterns. Coalescing, in the form described above, does not apply on 2.0 devices, whose global memory accesses go through a cache. (Or, if it does still exist in some form, which I don’t believe it does, it is a much more complicated access pattern, and NVIDIA does not describe it in the Programming Guide.)

Ken D.

Dear avidday and Ken D.,

Thanks a lot for taking the trouble to explain coalesced memory access to me. Because of the examples you provided, I think I now have some idea about coalescing. I also tried to make the accesses in my kernel coalesced, but I am not sure whether this is the best way to do it. Please have a look at the following code and let me know if it is correct, or whether there is anything better that can be done.

My device is of compute capability 1.2. I have a matrix of dimension WxH and my kernel launch uses a block size of 16x16. I need to load a 36x36 tile into shared memory using the 16x16 threads of a block. Here is what I have done:

__global__ void myKernel(int * sum, int W, int H, int dw, int dh, myType * classifier, int stgCnt, int clfCnt, bool * out)
{
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    int ty = blockIdx.y * blockDim.y + threadIdx.y;

    __shared__ int shar_sum[36*36];

    // each thread will load three rows of the 36x36 tile.
    for (int j = 3*threadIdx.y; j < 3*(threadIdx.y+1) && j < 36; j++)
    {
        for (int i = threadIdx.x; i < 36; i += 16)
        {
            shar_sum[i + (j*36)] = sum[tx + (i - threadIdx.x) + (ty + (j - threadIdx.y))*W];
        } // i
    } // j
    __syncthreads();

    //........ Processing
}

In this code each thread in a half warp reads consecutive elements, so a total of 64 bytes is read per half warp. First of all, is this correct as far as coalescing is concerned? Secondly, would it be better to load 128 bytes per half warp instead of 64?

regards

Hi Anshu,

It’s difficult to understand your program, because it has no bounds checks and you don’t give the size of the grid, nor W and H. This is a problem if W and H are not multiples of 16, resulting in out of bounds accesses of “sum”. For the sake of analysis, I assume that you meant to make the dimensions of “shar_sum” a multiple of your block size, not 36. To simplify the problem even more, I just made W, H, and the sides of “shar_sum” all 32 instead of 36. Thus the 16 by 16 block fits perfectly into “sum”; grid = (2,2,1) and block = (16,16,1). I took your code and added printf’s to see how it accesses the array “sum” versus threadIdx.x with constant threadIdx.y. Here is the code and partial output for the first iteration of j in the for-loop:

__global__ void myKernel(int * sum, int W, int H)
{
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    int ty = blockIdx.y * blockDim.y + threadIdx.y;
    __shared__ int shar_sum[32*32];
    // each thread will load three rows of the 32x32 matrix.
    for (int j = 3*threadIdx.y; j < 3*(threadIdx.y+1) && j < 32; j++)
    {
        for (int i = threadIdx.x; i < 32; i += 16)
        {
            printf("threadIdx.x, y = %d, %d, tx, ty = %d, %d, i = %d, j = %d, shar_sum[%d], sum[%d]\n",
                   threadIdx.x, threadIdx.y, tx, ty, i, j,
                   i + (j*32), tx + (i - threadIdx.x) + (ty + (j - threadIdx.y))*W);
            shar_sum[i + (j*32)] = sum[tx + (i - threadIdx.x) + (ty + (j - threadIdx.y))*W];
        } // i
    } // j
    __syncthreads();

    //........ Processing
}

Output:

threadIdx.x, y = 0, 0, tx, ty = 0, 0, i = 0, j = 0, shar_sum[0], sum[0]
threadIdx.x, y = 1, 0, tx, ty = 1, 0, i = 1, j = 0, shar_sum[1], sum[1]
threadIdx.x, y = 2, 0, tx, ty = 2, 0, i = 2, j = 0, shar_sum[2], sum[2]
threadIdx.x, y = 3, 0, tx, ty = 3, 0, i = 3, j = 0, shar_sum[3], sum[3]
threadIdx.x, y = 4, 0, tx, ty = 4, 0, i = 4, j = 0, shar_sum[4], sum[4]
threadIdx.x, y = 5, 0, tx, ty = 5, 0, i = 5, j = 0, shar_sum[5], sum[5]
threadIdx.x, y = 6, 0, tx, ty = 6, 0, i = 6, j = 0, shar_sum[6], sum[6]
threadIdx.x, y = 7, 0, tx, ty = 7, 0, i = 7, j = 0, shar_sum[7], sum[7]
threadIdx.x, y = 8, 0, tx, ty = 8, 0, i = 8, j = 0, shar_sum[8], sum[8]
threadIdx.x, y = 9, 0, tx, ty = 9, 0, i = 9, j = 0, shar_sum[9], sum[9]
threadIdx.x, y = 10, 0, tx, ty = 10, 0, i = 10, j = 0, shar_sum[10], sum[10]
threadIdx.x, y = 11, 0, tx, ty = 11, 0, i = 11, j = 0, shar_sum[11], sum[11]
threadIdx.x, y = 12, 0, tx, ty = 12, 0, i = 12, j = 0, shar_sum[12], sum[12]
threadIdx.x, y = 13, 0, tx, ty = 13, 0, i = 13, j = 0, shar_sum[13], sum[13]
threadIdx.x, y = 14, 0, tx, ty = 14, 0, i = 14, j = 0, shar_sum[14], sum[14]
threadIdx.x, y = 15, 0, tx, ty = 15, 0, i = 15, j = 0, shar_sum[15], sum[15]

(next iteration of i)

threadIdx.x, y = 0, 0, tx, ty = 0, 0, i = 16, j = 0, shar_sum[16], sum[16]
threadIdx.x, y = 1, 0, tx, ty = 1, 0, i = 17, j = 0, shar_sum[17], sum[17]
threadIdx.x, y = 2, 0, tx, ty = 2, 0, i = 18, j = 0, shar_sum[18], sum[18]
threadIdx.x, y = 3, 0, tx, ty = 3, 0, i = 19, j = 0, shar_sum[19], sum[19]
threadIdx.x, y = 4, 0, tx, ty = 4, 0, i = 20, j = 0, shar_sum[20], sum[20]
threadIdx.x, y = 5, 0, tx, ty = 5, 0, i = 21, j = 0, shar_sum[21], sum[21]
threadIdx.x, y = 6, 0, tx, ty = 6, 0, i = 22, j = 0, shar_sum[22], sum[22]
threadIdx.x, y = 7, 0, tx, ty = 7, 0, i = 23, j = 0, shar_sum[23], sum[23]
threadIdx.x, y = 8, 0, tx, ty = 8, 0, i = 24, j = 0, shar_sum[24], sum[24]
threadIdx.x, y = 9, 0, tx, ty = 9, 0, i = 25, j = 0, shar_sum[25], sum[25]
threadIdx.x, y = 10, 0, tx, ty = 10, 0, i = 26, j = 0, shar_sum[26], sum[26]
threadIdx.x, y = 11, 0, tx, ty = 11, 0, i = 27, j = 0, shar_sum[27], sum[27]
threadIdx.x, y = 12, 0, tx, ty = 12, 0, i = 28, j = 0, shar_sum[28], sum[28]
threadIdx.x, y = 13, 0, tx, ty = 13, 0, i = 29, j = 0, shar_sum[29], sum[29]
threadIdx.x, y = 14, 0, tx, ty = 14, 0, i = 30, j = 0, shar_sum[30], sum[30]
threadIdx.x, y = 15, 0, tx, ty = 15, 0, i = 31, j = 0, shar_sum[31], sum[31]

Now, let’s assume that one half warp maps to threadIdx.x = 0 … 15 with threadIdx.y = 0. (I would have to check this, but it seems reasonable.) From the output, “sum” is accessed sequentially, contiguously, and in 4-byte words, one word per thread in sequence. This should result in coalesced global memory accesses. Whether the same holds in your original 36-wide code, I don’t know. As for changing this to work on 8-byte quantities instead of 4, who knows.

For these things, I usually use printf to check the functionality of the program. But, since you run on a 1.2 device, you don’t have access to printf in the kernel. In that case, you should try to run it in a debugger, or on an emulator that does support printf. Another option is to use the CUDA profiler, because it should show you the number of coalesced accesses. But I’ve had problems with the profiler displaying the correct number of coalesced accesses, so I don’t use it that often.

Ken

Dear Ken,
My grid launch parameters are as follows: gridDim(38,28) and blockDim(16,16). I am working with a 640x480 image. For each block of 16x16 threads I need to load a 36x36 tile, which is a requirement of my algorithm: each thread in my kernel needs a 20x20 window of image data, so for a 16x16 block I need to load (16+20)x(16+20) = 36x36 of the image. If loading 32x32 instead of 36x36 would give a significant performance improvement, I could reduce my block size to 12x12 instead of 16x16, although I would still have to check the overall effect, because more blocks would then be required than before.
From what I see, in my code too each half warp reads 16 consecutive integers, and the next 16 are read by the next half warp. But as the row width of 36 is not a multiple of 16, the last 4 elements of each row will not be read in a coalesced way. Still, it is better than the previous uncoalesced load from global memory. I might be wrong in some of the arguments here, so please point out anything you find wrong. Also, would it be better to load the data in 128-byte segments instead of 64-byte segments?

I have another weird question, and I know I am wrong in this, but I still need someone to point out why. Here it goes: accesses are coalesced if the data accessed by a half warp is contiguous. So what happens if only the first thread of each half warp loads data, i.e. data is loaded by threads 0, 16, 32, and so on? Will the access in this case also be coalesced, even though only one thread per half warp is participating in the read?

If you are loading 4-byte words (so int, unsigned int, float), that is not possible. For the purposes of this discussion you can assume that the transaction size of a coalesced read is always 16 words per half-warp. If the word size is 4 bytes, the resulting load will always be 16*4 = 64 bytes. To get a 128-byte transaction would require loading an 8-byte word type (so long, double, float2, int2).
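
In case it helps, here is a minimal, untested sketch of that idea (the kernel name and launch configuration are mine): reading integers through an int2 pointer, so that each thread issues one 8-byte load and a half-warp maps onto a single 128-byte transaction, assuming the source pointer is suitably aligned.

__global__ void load128(const int2 * src, int * dst)
{
    // assume one block of 16 threads (one half-warp), for illustration
    __shared__ int tile[32];

    // each thread reads one 8-byte int2: 16 threads * 8 bytes = one
    // 128-byte transaction, provided src sits on a 128-byte boundary
    int2 v = src[threadIdx.x];
    tile[2*threadIdx.x]     = v.x;
    tile[2*threadIdx.x + 1] = v.y;
    __syncthreads();

    // copy back out, just so the compiler does not discard the loads
    dst[2*threadIdx.x]     = tile[2*threadIdx.x];
    dst[2*threadIdx.x + 1] = tile[2*threadIdx.x + 1];
}

Note this only pays off if your data is laid out so that each thread can consume two adjacent integers; pointers returned by cudaMalloc are aligned much more strictly than 128 bytes, so alignment only becomes a concern if you offset into the allocation.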

No, and it would make no sense for it to be so. In that case the hardware is presented with two choices: read 1 word to service the single thread’s load request, or read 16 words. The latter would be an enormous waste of memory bandwidth. The whole point of coalescing is that the memory controller can issue a single transaction to fetch 16 words for a half-warp, instead of 16 separate one-word transactions. The reduction in the number of transactions is what improves performance.

Dear avidday,

Thanks for the clarification :). Now I know where my thinking was wrong.