Load or L2 Bottleneck?

I’m trying to understand the performance bottleneck on the M1000M in the following situation. The algorithm is nearly embarrassingly parallel: every thread is identical and independent of every other thread.

The Visual Profiler indicates stalls due to memory dependency. The Global Memory Alignment and Access Pattern analysis shows 32 transactions/access versus the ideal of 4 or 8 for loads. The disassembly window shows numerous places where “L2 transactions = 100% of max L2 transactions”. I’m not sure what this last part means.
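
For reference, the kind of pattern that produces 32 transactions per access is something like the following (a hypothetical kernel, not my actual code): when the addresses a warp loads are scattered or widely strided, each of the 32 threads touches a different 32-byte L2 sector, so one warp load request becomes 32 transactions instead of the ideal 4.

__global__ void scattered_load(const float* __restrict__ in, float* out, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // With stride >= 8 floats (32 bytes), every thread of the warp hits a
    // different 32-byte sector: 32 transactions per load request.
    out[tid] = in[tid * stride];
}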

Experimentally, I find that doubling the number of registers, doubling the number of threads, or running more than one kernel per SM yields little improvement. However, if I run four identical kernels, one on each available SM, I get an almost linear speedup (the run-time for four is about the same as for one). If the L2 is shared among the SMs, this tells me that neither global memory bandwidth nor L2 transactions is the limitation.

If I comment out a few load statements, I initially see only a small improvement; however, once I comment out enough of them, the run-time suddenly drops by 10x. Finally, I attempted to reorder most memory accesses so that each thread would access four or eight bytes contiguously across threads. I never quite got that code working, but I also didn’t see a performance improvement, so I abandoned the optimization.

Based on this, I believe either there aren’t enough load units, or the problem has something to do with the L2, but I don’t understand the issue. If this is correct (or even if it isn’t), is there anything I can do besides the obvious step of reducing the number of memory accesses?

You could try to improve the efficiency of your global memory accesses. Take a look at the gld_efficiency (and also gst_efficiency) metric and see what is reported. If it is lower than about 70%, you might want to try to improve your access patterns, striving for coalesced loads (and stores).
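
As a minimal sketch of what coalesced means (a hypothetical kernel, not your code): adjacent threads of a warp read adjacent elements, so a 4-byte-per-thread load is serviced by the ideal 4 transactions per request.

__global__ void coalesced_copy(const float* __restrict__ in, float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // Threads 0..31 of a warp read one contiguous 128-byte span:
        // 4 transactions per request, gld_efficiency near 100%.
        out[i] = in[i];
}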

The L2 is shared among the SMs. My guess is you are maxing out LD/ST unit capacity, but you may also be just maxing out on either L2 bandwidth or main memory bandwidth. There are metrics which can shed light on all of these.

The two most important optimization tasks for any CUDA program are:

  1. Make efficient use of memory
  2. Expose enough parallelism

I’m pretty sure the profiler is telling you that item 1 is not well done.

Thanks, they are both under 20%. I know I’m accessing memory poorly; I’m just not sure it’s THE problem.

I don’t think L2 or main memory bandwidth can be the bottleneck. If that were true, I wouldn’t be able to run four kernels on the four SMs and get almost the same run-time as running a single kernel on one SM.

After messing with this for a few days, I discovered that there are vectorized loads and stores. It wasn’t too hard to partially vectorize the code, and I finally see some performance gains. I have a couple of issues I’m wondering about.
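
By vectorized loads I mean something like the following (a simplified sketch, not my actual kernel): a float4 load compiles to a single 16-byte load instruction instead of four separate 4-byte loads, so it issues to the LD/ST units once.

// Simplified sketch (not my actual code). Requires the pointers to be
// 16-byte aligned, which is true for allocations from cudaMalloc.
__global__ void vec_copy(const float* __restrict__ in, float* __restrict__ out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // i indexes float4 elements
    if (i < n4) {
        float4 v = reinterpret_cast<const float4*>(in)[i];   // one 16-byte load
        reinterpret_cast<float4*>(out)[i] = v;               // one 16-byte store
    }
}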

(1) Is there any benefit (in terms of minimizing usage of LD/ST units) to using vectorized types in the following situation?

float4 a[N], b[N];
// each component of a[0] comes from a different, data-dependent element of b
a[0].x = b[cachedInd[ind[0]]].x;
a[0].y = b[cachedInd[ind[1]]].y;
a[0].z = b[cachedInd[ind[2]]].z;
a[0].w = b[cachedInd[ind[3]]].w;

The left-hand side is (eventually) a contiguous store, but there’s no way for me to know the indices into b ahead of time, and in any case, they’re not contiguous. It may actually be faster to recompute cachedInd on every access, but that would be a time-consuming experiment.

(2) At some point, I’m going to run into a bandwidth limitation. I believe the optimal memory strategy is a 4-byte load in each thread of a warp, for a total of 128 contiguous bytes. If I’m accessing float4 (16 bytes) contiguously across threads, is this, from a bandwidth perspective, the same as optimally coalesced access? Or is it more like strided single-float loads? Or something else?
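
To make the comparison concrete, this is the kind of thing I mean (simplified, not my real kernel): in pattern A each thread of a warp loads one float, so the warp covers 128 contiguous bytes; in pattern B each thread loads one float4, so the warp covers 512 contiguous bytes.

__global__ void pattern_compare(const float* __restrict__ inF, const float4* __restrict__ inV,
                                float* __restrict__ outF, float4* __restrict__ outV, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Pattern A: 4 bytes per thread; one warp reads 128 contiguous bytes.
    outF[i] = inF[i];

    // Pattern B: 16 bytes per thread; one warp reads 512 contiguous bytes
    // spread over four 128-byte lines.
    outV[i] = inV[i];
}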