Maximum Gflops when operands are in memory?

Fellow developers,

I’m writing a general 2D convolution function. Basically, my question is: what is the maximum Gflops that can be achieved when the operands are in memory (texture, shared, or const)? I know FFT convolution is asymptotically much better and I will try it next, but first I want to know what the hardware is capable of. I’m using a Tesla C1060 (240 SPs @ 1.293 GHz).

I first tried using a texture for the input and const mem for the filter, as done in the SDK example, and get comparable performance: ~80 Gflops.

Now I’ve written another version that uses shared mem for the image and gets ~200 Gflops. My method is similar to the SDK’s convolutionSeparable, except that I modified it so it doesn’t have to load all the ghost regions when shared mem isn’t big enough (my filter is 129 x 47).
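
For reference, the general shape of my shared-mem version looks something like the sketch below (not my actual kernel: KW/KH are small here for illustration, and with a 129 x 47 filter the tile plus halo no longer fits in the 16 KB of shared memory, which is exactly what my modification works around). It assumes a launch with dim3(TILE_W, TILE_H) blocks:

#define KW 5  // filter width  (illustrative; mine is 129)
#define KH 5  // filter height (illustrative; mine is 47)
#define TILE_W 16
#define TILE_H 16

__constant__ float d_filter[KH * KW];

__global__ void conv2dShared(const float *in, float *out, int w, int h)
{
    __shared__ float tile[TILE_H + KH - 1][TILE_W + KW - 1];

    // Cooperatively stage the tile plus its halo, clamping at the borders.
    for (int y = threadIdx.y; y < TILE_H + KH - 1; y += TILE_H)
        for (int x = threadIdx.x; x < TILE_W + KW - 1; x += TILE_W) {
            int ix = (int)(blockIdx.x * TILE_W) + x - KW / 2;
            int iy = (int)(blockIdx.y * TILE_H) + y - KH / 2;
            tile[y][x] = in[min(max(iy, 0), h - 1) * w
                          + min(max(ix, 0), w - 1)];
        }
    __syncthreads();

    int gx = blockIdx.x * TILE_W + threadIdx.x;
    int gy = blockIdx.y * TILE_H + threadIdx.y;
    if (gx >= w || gy >= h) return;

    // One shared fetch plus one const fetch per MAD: exactly the
    // instruction mix the benchmarks below try to bound.
    float sum = 0.0f;
    for (int ky = 0; ky < KH; ++ky)
        for (int kx = 0; kx < KW; ++kx)
            sum += tile[threadIdx.y + ky][threadIdx.x + kx]
                 * d_filter[ky * KW + kx];
    out[gy * w + gx] = sum;
}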

I’m trying to figure out a good upper bound on Gflops. I wrote several microbenchmarks in PTX that measure the following:

  1. MAD throughput
  2. const mem bandwidth
  3. shared mem bandwidth
  4. texture fetch bandwidth
  5. combined shared mem, const mem throughput
  6. combined texture cache, const mem throughput

The benchmarks are basically an unrolled loop:

// Dynamically allocated shared memory, sized at kernel launch; the loads
// below address the shared window directly by byte offset.
.extern .shared .align 4 .b8 smem[];

.entry SharedMemThroughput(.param .u32 n)
{
    .reg .u32 i, end;
    .reg .pred again;
    .reg .f32 f<16>;

    // Each thread starts at its own 4-byte slot and runs n iterations
    // of 16 loads each.
    cvt.u32.u16  i, %tid.x;
    mul.lo.u32   i, i, 4;
    ld.param.u32 end, [n];
    mul.lo.u32   end, end, 4;
    add.u32      end, i, end;

loop:
    add.u32 i, i, 4;
    ld.shared.f32 f0, [i];
    ld.shared.f32 f1, [i + 4];
    ld.shared.f32 f2, [i + 8];
    ld.shared.f32 f3, [i + 12];
    ld.shared.f32 f4, [i + 16];
    ld.shared.f32 f5, [i + 20];
    ld.shared.f32 f6, [i + 24];
    ld.shared.f32 f7, [i + 28];
    setp.lt.u32 again, i, end; // hopefully we hide the latency

    ld.shared.f32 f8, [i + 32];
    ld.shared.f32 f9, [i + 36];
    ld.shared.f32 f10, [i + 40];
    ld.shared.f32 f11, [i + 44];
    ld.shared.f32 f12, [i + 48];
    ld.shared.f32 f13, [i + 52];
    ld.shared.f32 f14, [i + 56];
    ld.shared.f32 f15, [i + 60];
@again bra loop; // another 24 cycles latency?

    // Reduce and store the values so the compiler doesn't treat the
    // loads as dead code.
    add.f32 f0, f0, f1;
    add.f32 f0, f0, f2;
    add.f32 f0, f0, f3;
    add.f32 f0, f0, f4;
    add.f32 f0, f0, f5;
    add.f32 f0, f0, f6;
    add.f32 f0, f0, f7;
    add.f32 f0, f0, f8;
    add.f32 f0, f0, f9;
    add.f32 f0, f0, f10;
    add.f32 f0, f0, f11;
    add.f32 f0, f0, f12;
    add.f32 f0, f0, f13;
    add.f32 f0, f0, f14;
    add.f32 f0, f0, f15;

    st.shared.f32 [0], f0;
    ret;
}

The PTX is then JITed by the driver (a sketch of the timing harness follows the results below). I use the maximum thread block size (probably not needed, since there are no long-latency ops to hide), set the grid size to a multiple of the number of multiprocessors (30) to keep them all busy, and get these results:

  1. 510 Gflops, not bad; the theoretical peak is 620 Gflops when all 240 SPs do a MAD every cycle (240 SPs x 2 flops/MAD x 1.293 GHz ≈ 620).
  2. 249 G const fetches/s (all threads access the same location)
  3. 276 G shared fetches/s => 1104 GB/s at 4 bytes/fetch
  4. 25 G tex fetches/s (4 floats/fetch)
  5. 125 G const fetches/s, ~125 G shared fetches/s :(
  6. todo
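
For reference, here’s roughly the harness I use, sketched with the driver API (error checking omitted; ptx_src holds the PTX above as a string, and the grid/block/shared sizes are just the ones described above, nothing special):

#include <cuda.h>
#include <stdio.h>

extern const char ptx_src[]; /* the PTX kernel above, as a string */

int main(void)
{
    CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;
    CUevent start, stop;
    float ms;

    /* n kept small enough that the addresses stay inside the shared
       window: 511 * 4 + n * 4 + 60 bytes < 12288 */
    unsigned int n = 2048;
    void *args[] = { &n };

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoadData(&mod, ptx_src); /* the driver JITs the PTX here */
    cuModuleGetFunction(&fn, mod, "SharedMemThroughput");

    cuEventCreate(&start, 0);
    cuEventCreate(&stop, 0);
    cuEventRecord(start, 0);
    /* 120 blocks = 4 x 30 multiprocessors, 512-thread blocks,
       12 KB of dynamic shared memory for the loads to hit */
    cuLaunchKernel(fn, 120, 1, 1, 512, 1, 1, 12288, 0, args, 0);
    cuEventRecord(stop, 0);
    cuEventSynchronize(stop);
    cuEventElapsedTime(&ms, start, stop);

    /* 16 ld.shared per loop iteration, n iterations per thread */
    double fetches = 120.0 * 512.0 * n * 16.0;
    printf("%.1f G shared fetches/s\n", fetches / (ms * 1e-3) / 1e9);
    return 0;
}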

The halved throughput in #5 compared to #2 and #3 implies the shared memory and constant cache aren’t being accessed in parallel. If that’s the case, it caps the convolution: each MAD in the inner loop consumes one shared fetch (an image sample) and one const fetch (a filter tap), so ~125 G fetch pairs/s allows only ~125 G MADs/s, i.e. an upper bound of 250 Gflops. I don’t see any reason why the constant cache and shared memory can’t be accessed in parallel, since they’re separate interfaces. I’ve heard MAD and MUL can be issued in the same cycle, but can ld.const and ld.shared be dual-issued?

Aha, I’ve realized that to get above the 250 Gflops limit, you have to load blocks into registers and maximize their reuse. Then the inner loop is just multiply-adds, which run at the 620 Gflops peak. Ideally each input sample would be used kw * kh (the filter dimensions) times, so if you could load everything into registers, you’d reduce the number of loads by a factor of kw * kh. In practice, Tesla probably only has enough registers for 4x4 blocks, so the load reduction will be much smaller. Stay tuned for the actual performance.
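
To make that concrete, here’s a sketch of the register-blocking idea (again not measured code; it reuses KW, KH and d_filter from the sketch above, and BY/BX are just my guess at what the register file can afford). Each thread computes a BY x BX block of outputs, and every input sample it loads feeds up to min(KH, BY) * min(KW, BX) multiply-adds:

#define BX 4  // output block width per thread (a guess at what fits)
#define BY 4  // output block height per thread

__global__ void conv2dRegBlock(const float *in, float *out, int w, int h)
{
    int x0 = (blockIdx.x * blockDim.x + threadIdx.x) * BX;
    int y0 = (blockIdx.y * blockDim.y + threadIdx.y) * BY;

    // With constant bounds the compiler can fully unroll these loops
    // and keep the accumulators in registers.
    float sum[BY][BX] = {};

    // Walk the thread's input footprint once: one load, many MADs.
    for (int r = 0; r < BY + KH - 1; ++r)
        for (int c = 0; c < BX + KW - 1; ++c) {
            int sy = min(max(y0 + r - KH / 2, 0), h - 1);
            int sx = min(max(x0 + c - KW / 2, 0), w - 1);
            float s = in[sy * w + sx];
            for (int i = 0; i < BY; ++i)
                for (int j = 0; j < BX; ++j) {
                    int ky = r - i, kx = c - j;
                    if (ky >= 0 && ky < KH && kx >= 0 && kx < KW)
                        sum[i][j] += s * d_filter[ky * KW + kx];
                }
        }

    for (int i = 0; i < BY; ++i)
        for (int j = 0; j < BX; ++j)
            if (y0 + i < h && x0 + j < w)
                out[(y0 + i) * w + x0 + j] = sum[i][j];
}

In a real kernel the loads would still be staged through a shared-mem tile; the point is just that the number of multiply-adds per load goes up by the reuse factor.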

BTW, has anyone already used the technique I described? What performance did you get?

From the lack of responses, I take it that ld.shared and ld.const can’t be dual-issued. NVIDIA, please add this feature to your future architectures. It’s a shame to have a memory interface sitting idle every other cycle, and it shouldn’t be expensive to implement: even the original Pentium could dual-issue many instruction pairs.