Very low global memory bandwidth, advice wanted

Hi folks, I’m writing some matrix multiplication code (trying to do what CUTLASS does). In the last stage, i.e. writing back to global memory, the profiled bandwidth is astonishingly low: 44 MB/s.

The write-back lines look like this:

#pragma unroll
for (unsigned ia = 0; ia < A; ia++) {
    #pragma unroll
    for (unsigned ib = 0; ib < B; ib++) {
        // Column in C: consecutive tidX within a warp hits consecutive
        // addresses, which is what should make these stores coalesce.
        unsigned col = BLOCK_SIZE * (bidX * A + ia) + tidX;
        // Row in C: each (ia, ib) pair selects a different output tile.
        unsigned row = BLOCK_SIZE * (bidY * B + ib) + tidY;
        d_c[col + WIDTH * row] = result[ia][ib];
    }
}

bidX/bidY and tidX/tidY are just the blockIdx and threadIdx components.

Under my settings, the grid is 16x16 blocks and each block is 32x32 threads. With 32 warps per block, there should be 8192 warps in flight, which seemed enough to hide the latency of the memory operations?

From Nsight, I see that there are (per warp)

  • 4 store requests
  • 4 store transactions

I think this means the stores are coalesced; is that correct?
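
For reference, my back-of-the-envelope check (assuming result holds floats and A = B = 2, so the unrolled loops emit 4 store instructions per warp): each warp-wide store covers

\[
32\ \text{lanes} \times 4\,\mathrm{B/lane} = 128\,\mathrm{B}
\]

of contiguous memory, the minimum possible footprint, so the 1:1 request-to-transaction ratio looks fully coalesced to me.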

But given that, I wonder why the memory bandwidth is still so low. I do notice that there are a lot of integer calculations for the indexing.

FWIW, I’m using a 780 Ti (SM 3.5), and this design achieves 100% occupancy. However, only 5.16 of the 62.68 active warps are eligible for scheduling per cycle on average, and the kernel achieves only 2.83 IPC.

Thanks for your help!

Are you working with float quantities here?
Is result a thread-local array? I assume it is not in shared memory.

Have you benchmarked just this piece of code in its own kernel? If not, you may want to do that to remove the effects of other code you have not shown here. For example, unless there is a __syncthreads() before this, it’s not guaranteed that all warps are executing this code at the same time.
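
For example, a minimal sketch of such an isolated benchmark. The constants here (A = B = 2, a 1024x1024 C matrix, your 16x16 grid of 32x32 blocks, float data) are all guesses based on your description; substitute your real values:

#include <cstdio>
#include <cuda_runtime.h>

// All of these constants are assumptions taken from the post; adjust as needed.
#define BLOCK_SIZE 32
#define A 2
#define B 2
#define GRID_DIM 16
#define WIDTH  (GRID_DIM * A * BLOCK_SIZE)   // 1024 columns
#define HEIGHT (GRID_DIM * B * BLOCK_SIZE)   // 1024 rows

// Performs only the write-back, with synthesized per-thread "results",
// so the store bandwidth can be timed without the compute phase.
__global__ void writeback_only(float *d_c)
{
    const unsigned tidX = threadIdx.x, tidY = threadIdx.y;
    const unsigned bidX = blockIdx.x,  bidY = blockIdx.y;

    #pragma unroll
    for (unsigned ia = 0; ia < A; ia++) {
        #pragma unroll
        for (unsigned ib = 0; ib < B; ib++) {
            unsigned col = BLOCK_SIZE * (bidX * A + ia) + tidX;
            unsigned row = BLOCK_SIZE * (bidY * B + ib) + tidY;
            d_c[col + WIDTH * row] = (float)(ia * B + ib);  // dummy value
        }
    }
}

int main()
{
    float *d_c;
    const size_t bytes = (size_t)WIDTH * HEIGHT * sizeof(float);
    cudaMalloc(&d_c, bytes);

    dim3 grid(GRID_DIM, GRID_DIM), block(BLOCK_SIZE, BLOCK_SIZE);
    writeback_only<<<grid, block>>>(d_c);   // warm-up launch
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    writeback_only<<<grid, block>>>(d_c);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("write-back bandwidth: %.2f GB/s\n", bytes / ms / 1.0e6);
    cudaFree(d_c);
    return 0;
}

If this by itself reports something close to the device’s peak, then the low number you saw is being diluted by the rest of the kernel.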

Furthermore, depending on how you are computing your bandwidth, it may not be surprising to see low bandwidth if your code is compute-bound.

I solved the problem by employing more ILP, but I still want to ask:

  1. Since it’s Nsight calculating the bandwidth, not me, the reported result shouldn’t be affected by whether the application is compute-bound or I/O-bound, right?

  2. Take Kepler as an example: each SMX has 4 warp schedulers and 32 LD/ST units. Does that mean the maximum concurrency is 32 loads and stores, and that it would take 4 cycles (assuming the processor starts idle) to fully saturate the pipeline (8 issues per cycle)?

  3. (Off topic) Did CUDA 10 remove support for launching cuBLAS from kernels?

Thanks for your answer!

  1. My point is that you were worried about low bandwidth. Because an (optimized) matrix multiply is fundamentally a compute-bound algorithm on GPUs, I was suggesting the low bandwidth might not be a problem. But since you haven’t shown what you are doing other than the data-write part, I really have no idea. I wasn’t suggesting that a measurement would be fundamentally affected by being compute-bound or latency-bound. But if you were measuring this by dividing transactions by total kernel time, for a compute-bound kernel I would expect that number to be relatively low (a worked example follows after this list). If, OTOH, you were somehow measuring bandwidth by just timing the execution of the code snippet you had shown, I would expect that to show high bandwidth (subject to some caveats about the previous code not shown).

  2. If an SM has 32 LD/ST units, I would generally say it can handle one LD or ST instruction (from one warp, i.e. warp-wide) per cycle; the arithmetic is sketched after this list. Each warp scheduler could (in theory) have up to two of those types of instructions waiting to be issued in any particular cycle. I don’t understand the rest of your statement.

  3. Yes, it’s documented in the release notes.
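
To make point 1 concrete with purely hypothetical numbers (I don’t know your matrix size or kernel duration): if C were a 1024x1024 float matrix and the whole kernel ran for about 95 ms of mostly compute, dividing stored bytes by total kernel time would give

\[
\frac{1024 \times 1024 \times 4\,\mathrm{B}}{95\,\mathrm{ms}} \approx 44\,\mathrm{MB/s},
\]

even if the stores themselves were perfectly coalesced and fast.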
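
And the arithmetic behind point 2, assuming each LD/ST unit services one lane per cycle:

\[
\frac{32\ \text{LD/ST units}}{32\ \text{lanes per warp}} = 1\ \text{warp-wide LD/ST instruction issued per cycle per SMX}.
\]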