Very low global memory bandwidth, advice wanted

Hi folks, I’m writing some matrix multiplication code (trying to do what CUTLASS does). In the last stage, i.e. writing back to global memory, the profiled bandwidth is astonishingly low: 44 MB/s.

The write-back lines look like this:

#pragma unroll
for (unsigned ia = 0; ia < A; ia++) {
    #pragma unroll
    for (unsigned ib = 0; ib < B; ib++) {
        // Column in C: consecutive tidX within a warp hits consecutive
        // addresses, which is what should make these stores coalesce.
        unsigned col = BLOCK_SIZE * (bidX * A + ia) + tidX;
        // Row in C: each (ia, ib) pair selects a different output tile.
        unsigned row = BLOCK_SIZE * (bidY * B + ib) + tidY;
        d_c[col + WIDTH * row] = result[ia][ib];
    }
}

bidX/bidY and tidX/tidY are just the blockIdx and threadIdx components.

Under my settings, the grid is 16x16 blocks and each block is 32x32 threads. With 32 warps per block, there should be 8192 warps in flight, which seemed enough to hide the latency of the memory operations?

From Nsight, I see that there are (per warp)

  • 4 store requests
  • 4 store transactions

I think this means the stores are coalesced; is that correct?
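
For reference, my back-of-the-envelope check (assuming result holds floats and A = B = 2, so the unrolled loops emit 4 store instructions per warp): each warp-wide store covers

\[
32\ \text{lanes} \times 4\,\mathrm{B/lane} = 128\,\mathrm{B}
\]

of contiguous memory, the minimum possible footprint, so the 1:1 request-to-transaction ratio looks fully coalesced to me.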

But given that, I wonder why the memory bandwidth is still so low. I do notice that there are a lot of integer calculations for the indexing.

FWIW, I’m using a 780 Ti (SM 3.5), and this design achieves 100% occupancy. However, only 5.16 of the 62.68 active warps are eligible for scheduling per cycle on average, and the kernel achieves only 2.83 IPC.

Thanks for your help!

Are you working with float quantities here?
Is result a thread-local array? I assume it is not in shared memory.

Have you benchmarked just this piece of code in its own kernel? If not, you may want to do that to remove the effects of other code you have not shown here. For example, unless there is a __syncthreads() before this, it’s not guaranteed that all warps are executing this code at the same time.
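
For example, a minimal sketch of such an isolated benchmark. The constants here (A = B = 2, a 1024x1024 C matrix, your 16x16 grid of 32x32 blocks, float data) are all guesses based on your description; substitute your real values:

#include <cstdio>
#include <cuda_runtime.h>

// All of these constants are assumptions taken from the post; adjust as needed.
#define BLOCK_SIZE 32
#define A 2
#define B 2
#define GRID_DIM 16
#define WIDTH  (GRID_DIM * A * BLOCK_SIZE)   // 1024 columns
#define HEIGHT (GRID_DIM * B * BLOCK_SIZE)   // 1024 rows

// Performs only the write-back, with synthesized per-thread "results",
// so the store bandwidth can be timed without the compute phase.
__global__ void writeback_only(float *d_c)
{
    const unsigned tidX = threadIdx.x, tidY = threadIdx.y;
    const unsigned bidX = blockIdx.x,  bidY = blockIdx.y;

    #pragma unroll
    for (unsigned ia = 0; ia < A; ia++) {
        #pragma unroll
        for (unsigned ib = 0; ib < B; ib++) {
            unsigned col = BLOCK_SIZE * (bidX * A + ia) + tidX;
            unsigned row = BLOCK_SIZE * (bidY * B + ib) + tidY;
            d_c[col + WIDTH * row] = (float)(ia * B + ib);  // dummy value
        }
    }
}

int main()
{
    float *d_c;
    const size_t bytes = (size_t)WIDTH * HEIGHT * sizeof(float);
    cudaMalloc(&d_c, bytes);

    dim3 grid(GRID_DIM, GRID_DIM), block(BLOCK_SIZE, BLOCK_SIZE);
    writeback_only<<<grid, block>>>(d_c);   // warm-up launch
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    writeback_only<<<grid, block>>>(d_c);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("write-back bandwidth: %.2f GB/s\n", bytes / ms / 1.0e6);
    cudaFree(d_c);
    return 0;
}

If this by itself reports something close to the device’s peak, then the low number you saw is being diluted by the rest of the kernel.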

Furthermore, depending on how you are computing your bandwidth, it may not be surprising to see low bandwidth if your code is compute-bound.

I solved the problem by employing more ILP, but I still want to ask:

  1. Since it’s Nsight calculating the bandwidth, not me, the reported result shouldn’t be affected by whether the application is compute-bound or I/O-bound, right?

  2. Take Kepler as an example: each SMX has 4 warp schedulers and 32 LD/ST units. Does that mean the maximum concurrency is 32 loads and stores, and that it would take 4 cycles (assuming the processor starts idle) to fully saturate the pipeline (8 issues per cycle)?

  3. (Off topic) Did CUDA 10 remove support for launching cuBLAS from kernels?

Thanks for your answer!

  1. My point is that you were worried about low bandwidth. Because an (optimized) matrix multiply is fundamentally a compute-bound algorithm on GPUs, I was suggesting the low bandwidth might not be a problem. But since you haven’t shown what you are doing other than the data-write part, I really have no idea. I wasn’t suggesting that a measurement would be fundamentally affected by being compute-bound or latency-bound. But if you were measuring this by dividing transactions by total kernel time, for a compute-bound kernel I would expect that number to be relatively low (a worked example follows after this list). If, OTOH, you were somehow measuring bandwidth by just timing the execution of the code snippet you had shown, I would expect that to show high bandwidth (subject to some caveats about the previous code not shown).

  2. If an SM has 32 LD/ST units, I would generally say it can handle one LD or ST instruction (from one warp, i.e. warp-wide) per cycle; the arithmetic is sketched after this list. Each warp scheduler could (in theory) have up to two of those types of instructions waiting to be issued in any particular cycle. I don’t understand the rest of your statement.

  3. Yes, it’s documented in the release notes.
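
To make point 1 concrete with purely hypothetical numbers (I don’t know your matrix size or kernel duration): if C were a 1024x1024 float matrix and the whole kernel ran for about 95 ms of mostly compute, dividing stored bytes by total kernel time would give

\[
\frac{1024 \times 1024 \times 4\,\mathrm{B}}{95\,\mathrm{ms}} \approx 44\,\mathrm{MB/s},
\]

even if the stores themselves were perfectly coalesced and fast.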
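
And the arithmetic behind point 2, assuming each LD/ST unit services one lane per cycle:

\[
\frac{32\ \text{LD/ST units}}{32\ \text{lanes per warp}} = 1\ \text{warp-wide LD/ST instruction issued per cycle per SMX}.
\]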