Visual profiler to see if mem access is coalesced

I’m having trouble telling why my memory accesses aren’t coalesced.

The goal is to loop through a large block of data: 512 MB of short ints. I’ve written my kernel so that each thread reads between 2 and 16 bytes of data (1-8 short ints) at once. Adjacent threads in a block process adjacent chunks of data. I was expecting the best performance at N_CHUNK=4, where each thread reads 8 bytes at once and a half-warp can be serviced by a single 128-byte coalesced read.

Instead, I observe the best performance when N_CHUNK=1 and each thread reads only 2 bytes at once. Regardless of the parameter, the visual profiler shows that the achieved global memory throughput is at most 40% of peak. When N_CHUNK=4, the ‘Memory Throughput Analysis’ says the memory access pattern isn’t coalesced, but I don’t understand why.

My questions:

  • What’s the fastest access pattern for a simple kernel like this? Wouldn’t it be better to have each thread read 8 bytes at once rather than 2 bytes at once?

  • Am I making a silly mistake somewhere in this experiment?

  • Why wouldn’t the memory access be coalesced in the N_CHUNK=4 (8 bytes per thread) case?

  • In the N_CHUNK=1 (2 bytes per thread) case, why don’t I get higher memory throughput?

Thanks for your attention!

The following code is an excerpt from a self-contained .cu file posted here: https://gist.github.com/1318144

const int N_DATA = (512*512*512*2);

const int N_BYTES = (N_DATA*2);

const int N_GRID = 512*16;

const int N_BLOCK = 512;

const int N_CHUNK = 1;

const int N_FAN = N_DATA/N_GRID/N_BLOCK/N_CHUNK;

const int K = 13;

const int N_LOOPS = 10;

/*

  N_DATA = (N_GRID) * (N_FAN) * (N_BLOCK) * (N_CHUNK) = 512*512*512*2

 */

__global__ void incr_data1(short int *data) {

  // Outer loop skips by strides of N_BLOCK*N_CHUNK

  for (int i = 0; i < N_FAN; i++) {

    int idx = blockIdx.x*(N_FAN*N_BLOCK*N_CHUNK) + i*(N_BLOCK*N_CHUNK) + threadIdx.x*(N_CHUNK);

    // Inner loop processes a chunk of N_CHUNK short ints (2*N_CHUNK bytes)

    #pragma unroll

    for (int j = 0; j < N_CHUNK; j++, idx++) {

      data[idx] += K;

    }

  }

}

int main(void) {

  short int *data_gpu;

  cudaMalloc((void **) &data_gpu, N_BYTES);

  dim3 dimBlock(N_BLOCK,1,1);

  dim3 dimGrid(N_GRID,1,1);

  // Run the kernel several times

  for (int i = 0; i < N_LOOPS; i++) {      

    incr_data1<<<dimGrid, dimBlock>>>(data_gpu);

  }

  // Kernel launches are asynchronous; wait for them before the program exits
  cudaDeviceSynchronize();

  cudaFree(data_gpu);

}

Output from visual profiler, when N_CHUNK=1

Grid size:  [8192  1  1]

Block size:  [512  1  1]

Limiting Factor

Achieved Occupancy:  0.99 ( Theoretical Occupancy:  1.00 )

Achieved global memory throughput:  10.68 ( Peak global memory throughput(GB/s):  25.60 )

“At once” really means exactly that, i.e. within the same memory transaction. Each iteration of the inner loop, however, generates a new transaction, so for N_CHUNK>1 the accesses don’t qualify as “at once”.

To enable wider memory transactions, operate on larger types, i.e. short2 or short4 (provided the data is suitably aligned). In that case the compiler is usually even smart enough to combine subsequent loads and stores of the components into wider transactions:

const int N_DATA = (512*512*512);

const int N_BYTES = (N_DATA*4);

const int N_GRID = 512*16;

const int N_BLOCK = 512;

const int N_FAN = N_DATA/N_GRID/N_BLOCK;

const int K = 13;

const int N_LOOPS = 10;

__global__ void incr_data1(short2 *data) {

  // loop skips by strides of N_BLOCK

  for (int i = 0; i < N_FAN; i++) {

    int idx = blockIdx.x*(N_FAN*N_BLOCK) + i*N_BLOCK + threadIdx.x;

    // explicitly process 4 bytes (2 short ints) at once (a chunk)

    data[idx].x += K;

    data[idx].y += K;

  }

}

(beware: code is completely untested!)

Further speedups can be achieved by using short4 instead of short2, and potentially by unrolling the loop a few times (#pragma unroll 4 or similar) to have more transactions in flight (although with enough active threads per SM this brings no further improvement).
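To make that concrete, a short4 variant of the kernel might look like the following. This is an untested sketch: it assumes the constants are redefined so that N_DATA counts short4 elements (the original short count divided by 4, with N_FAN adjusted accordingly), and that data is 8-byte aligned (cudaMalloc’s base pointer is).

```cuda
// Untested sketch of the short4 suggestion. N_DATA here counts short4
// elements, so N_FAN = N_DATA/N_GRID/N_BLOCK with the new N_DATA.
__global__ void incr_data_short4(short4 *data) {
    #pragma unroll 4                 // keep a few transactions in flight
    for (int i = 0; i < N_FAN; i++) {
        int idx = blockIdx.x*(N_FAN*N_BLOCK) + i*N_BLOCK + threadIdx.x;
        short4 v = data[idx];        // one 64-bit load per thread
        v.x += K; v.y += K; v.z += K; v.w += K;
        data[idx] = v;               // one 64-bit store per thread
    }
}
```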

Thank you for the suggestions Tera!

You’re absolutely right that using short4 improved the memory throughput. I was even able to see 64-bit load and store instructions (a short4’s worth) in the disassembly, when before I would only see 16-bit instructions (one short).

$ cuobjdump ./cuda_512mb_short --dump-sass | egrep "GLD|GST"

	/*0068*/     /*0xd00e040180800780*/ 	GLD.S64 R0, global14 [R2];

	/*00c8*/     /*0xd00e0401a0800780*/ 	GST.S64 global14 [R2], R0;

	/*0068*/     /*0xd00e080180800780*/ 	GLD.S64 R0, global14 [R4];

	/*00b8*/     /*0xd00e0801a0800780*/ 	GST.S64 global14 [R4], R0;

	/*00c8*/     /*0xd00e080180800780*/ 	GLD.S64 R0, global14 [R4];

	/*0128*/     /*0xd00e0801a0800780*/ 	GST.S64 global14 [R4], R0;
Here’s the inner loop rewritten to use short4:

    #pragma unroll

    for (int j = 0; j < N_CHUNK; j+=4, idx++) {

      short4 *d = (short4 *) data;

      d[(idx+j)/4].x += K;

      d[(idx+j)/4].y += K;

      d[(idx+j)/4].z += K;

      d[(idx+j)/4].w += K;

    }

Now ‘short8’ isn’t a built-in type, so I tried to create one with a struct of short4s. Nothing I tried resulted in something like a GLD.S128. What’s more, if I create a struct with all the shorts laid out like the following, it only produces S16 load/store instructions. In other words, of a short[4] array, a struct of four shorts, and a short4, only the short4 gets compiled into 64-bit instructions.

struct __align__(16) short8  {

    short s0, s1, s2, s3, s4, s5, s6, s7;

};

So my current questions:

  • How can you get a 128-bit load to be compiled in, using short ints? I think a 128-bit float4 would be loaded in one instruction, no problem.

  • Are there any example kernels that are known to max-out the memory throughput?

  • Why wouldn’t a struct of {s0,s1,s2…} be treated by the compiler the same as a vector type?

I’m having the same problem even with a simple ‘copy’ kernel that I would expect to saturate global memory. When I run this kernel with the visual profiler, I still see <50% utilization of the memory bandwidth.

Does anyone know of a kernel or example code that is known to fully max out its memory bandwidth?

__global__ void copy_data1(short int *data, short int *out) {

  // Outer loop skips by strides of N_BLOCK*N_CHUNK                                                                                                                                 

  for (int i = 0; i < N_FAN; i++) {

    int idx = blockIdx.x*(N_FAN*N_BLOCK*N_CHUNK) + i*(N_BLOCK*N_CHUNK) + threadIdx.x*(N_CHUNK);

    // Inner loop processes the chunk as short4s (8 bytes at a time)

    #pragma unroll

    for (int j = 0; j < N_CHUNK; j+=4, idx++) {

      short4 *d = (short4 *) data;

      short4 *o = (short4 *) out;

      o[(idx+j)/4].x = d[(idx+j)/4].x;

      o[(idx+j)/4].y = d[(idx+j)/4].y;

      o[(idx+j)/4].z = d[(idx+j)/4].z;

      o[(idx+j)/4].w = d[(idx+j)/4].w;

    }

  }

}

You could probably read it in as an int4 and then extract the exact components you want. I’d use a union that contains an int4 or two short4s to make the code readable.
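For illustration, the union approach could look something like this. It's an untested sketch: it assumes N_CHUNK=8 so that each thread’s chunk is 16 bytes and 16-byte aligned.

```cuda
// Untested sketch of the union approach: the 128-bit load/store goes
// through the int4 member, while the arithmetic sees the same 16 bytes
// as eight shorts. Assumes N_CHUNK = 8 and 16-byte-aligned chunks.
union short8_u {
    int4  vec;     // moved with one GLD.S128 / GST.S128
    short s[8];    // element-wise view for the adds
};

__global__ void incr_data_union(short int *data) {
    for (int i = 0; i < N_FAN; i++) {
        int idx = blockIdx.x*(N_FAN*N_BLOCK*N_CHUNK)
                + i*(N_BLOCK*N_CHUNK) + threadIdx.x*N_CHUNK;
        short8_u u;
        u.vec = *((int4 *) &data[idx]);     // one 128-bit load
        for (int j = 0; j < 8; j++)
            u.s[j] += K;
        *((int4 *) &data[idx]) = u.vec;     // one 128-bit store
    }
}
```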

Thank you, Justin. You were right! Using an int4 caused the compiler to use 128-bit memory instructions. I disregarded your ‘union’ advice, so I have a nasty-looking pointer juggle going on in this kernel. But it’s simple enough.

struct __align__(16) short8  {

    short s0, s1, s2, s3, s4, s5, s6, s7;

};

__global__ void incr_data3(short int *data) {

    for (int i = 0; i < N_FAN; i++) {

        int idx = blockIdx.x*(N_FAN*N_BLOCK*N_CHUNK) + i*(N_BLOCK*N_CHUNK) + threadIdx.x*(N_CHUNK);

        int4 *dd = (int4 *) &data[idx];

        int4 d_ = *dd;

        short8 d = *((short8 *) &d_);

        d.s0 += K;

        d.s1 += K;

        d.s2 += K;

        d.s3 += K;

        d.s4 += K;

        d.s5 += K;

        d.s6 += K;

        d.s7 += K;

        *dd = *((int4 *) &d);

    }

}

$ cuobjdump ./cuda_512mb_short --dump-sass | egrep "GLD|GST"

	/*0040*/     /*0xd00e0a0180a00780*/ 	GLD.S128 R0, global14 [R5];

	/*0188*/     /*0xd00e0a01a0a00780*/ 	GST.S128 global14 [R5], R0;

And sure enough this kernel runs even faster. But it still doesn’t bring me quite up to 50% memory utilization. Has anyone actually observed 100% utilization of memory bandwidth in the compute profiler? If so, on what sort of kernel?

I tried using the “bandwidthTest” example in the GPU Computing SDK. Even bandwidthTest shows 95.5 GB/s (for device-to-device transfer) when I’m expecting 133.3 GB/s, i.e. only about 72% utilization.

Device to Device Bandwidth, 1 Device(s)

   Transfer Size (Bytes)	Bandwidth(MB/s)

   33554432			95493.7

Based on my experiments in this thread, it seems that about 50% of peak bandwidth is the most my CUDA kernels can achieve, and even a device-to-device API call can’t do better than about 70% of peak. Can anyone corroborate this or show me a counterexample?

For your runs, is N_CHUNK=1? For anything other than N_CHUNK=1 you are going to have uncoalesced accesses. Is data aligned on 128-byte boundaries? Have you tried increasing your problem size and/or block size to see if it increases your throughput?

Finally, another optimization that often helps is to manually unroll and process two independent elements per thread. Basically like this:

short d1 = data[idx];
short d2 = data[idx + N_BLOCK/2];
// computation for d1
// computation for d2
data[idx] = d1;
data[idx + N_BLOCK/2] = d2;

You should be able to get better than 50% utilization. I have seen many kernels that get much higher than that.
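Applied to the increment kernel above, the two-independent-elements idea might look like the following untested sketch. It works on an int4 view of the data (n4 = the short count divided by 8), and assumes the data is 16-byte aligned and n4 is a multiple of twice the block size.

```cuda
// Untested sketch: each thread handles two int4 elements per iteration,
// blockDim.x apart, so the two 128-bit loads are independent and can
// overlap in flight. n4 = number of int4 elements (shorts / 8).
__global__ void incr_data_x2(int4 *data, int n4) {
    int stride = gridDim.x * blockDim.x * 2;
    for (int base = blockIdx.x * blockDim.x * 2; base < n4; base += stride) {
        int4 d1 = data[base + threadIdx.x];               // first load
        int4 d2 = data[base + threadIdx.x + blockDim.x];  // independent second load
        short *s1 = (short *) &d1;
        short *s2 = (short *) &d2;
        for (int j = 0; j < 8; j++) { s1[j] += K; s2[j] += K; }
        data[base + threadIdx.x] = d1;
        data[base + threadIdx.x + blockDim.x] = d2;
    }
}
```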