Fellow developers,
I’m writing a general 2D convolution function. Basically, my question is: what is the maximum Gflops that can be achieved when the operands are in memory (texture, shared, or const)? I know FFT-based convolution is asymptotically much better and I’ll try it next, but first I want to know what the hardware is capable of. I’m using a Tesla C1060 (240 SPs @ 1.293 GHz).
I first tried using a texture for the input image and const mem for the filter, as done in the SDK example, and got comparable performance: ~80 Gflops.
Now I’ve written another version that uses shared mem for the image and get ~200 Gflops. My method is similar to the SDK’s convolutionSeparable, except I modified it so it doesn’t have to load all the ghost regions when shared mem isn’t big enough (my filter is 129 x 47).
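To make the access pattern concrete, the inner loop of my shared-mem version boils down to something like the sketch below (simplified, with made-up names and tile sizes; not my actual kernel, which also has to deal with the apron not fitting in 16 KB):

__constant__ float d_filter[129 * 47];   /* filter taps live in const mem */

#define FW 129                           /* filter width  */
#define FH 47                            /* filter height */
#define TILE_PITCH 160                   /* made-up tile row pitch, in floats */

__global__ void conv2dTile(float *out, int outPitch)
{
    extern __shared__ float tile[];      /* image tile (+ apron), loaded beforehand */
    float acc = 0.0f;

    /* the part that matters for the question below: every MAD pairs one
       shared-mem read (a pixel) with one const-mem read (a filter tap) */
    for (int fy = 0; fy < FH; ++fy)
        for (int fx = 0; fx < FW; ++fx)
            acc += tile[(threadIdx.y + fy) * TILE_PITCH + threadIdx.x + fx]
                 * d_filter[fy * FW + fx];

    out[(blockIdx.y * blockDim.y + threadIdx.y) * outPitch
        + blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

Every thread in a warp reads the same filter tap at the same time, so the const cache broadcasts it, and consecutive threadIdx.x values hit consecutive shared banks, so there are no conflicts.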
Now I’m trying to establish a good Gflops upper bound. I wrote several benchmarks in PTX that measure the following:
1. MAD throughput
2. const mem bandwidth
3. shared mem bandwidth
4. texture fetch bandwidth
5. combined shared mem + const mem throughput
6. combined texture cache + const mem throughput
The benchmarks are all basically an unrolled loop; here’s the shared-mem one:
.entry SharedMemThroughput(.param .u32 n)
{
    .reg .u32  i, end;
    .reg .pred again;
    .reg .f32  f<16>;                     // f0..f15

    cvt.u32.u16  i, %tid.x;               // each thread starts at its own word
    mul.lo.u32   i, i, 4;
    ld.param.u32 end, [n];
    mul.lo.u32   end, end, 4;             // n loop iterations, 4-byte stride
    add.u32      end, i, end;
loop:
    add.u32 i, i, 4;
    ld.shared.f32 f0, [i];
    ld.shared.f32 f1, [i + 4];
    ld.shared.f32 f2, [i + 8];
    ld.shared.f32 f3, [i + 12];
    ld.shared.f32 f4, [i + 16];
    ld.shared.f32 f5, [i + 20];
    ld.shared.f32 f6, [i + 24];
    ld.shared.f32 f7, [i + 28];
    setp.lt.u32 again, i, end;            // hopefully we hide the latency
    ld.shared.f32 f8, [i + 32];
    ld.shared.f32 f9, [i + 36];
    ld.shared.f32 f10, [i + 40];
    ld.shared.f32 f11, [i + 44];
    ld.shared.f32 f12, [i + 48];
    ld.shared.f32 f13, [i + 52];
    ld.shared.f32 f14, [i + 56];
    ld.shared.f32 f15, [i + 60];
@again bra loop;                          // another 24 cycles latency?

    // sum and store the loaded values so the compiler doesn't treat them as dead
    add.f32 f0, f0, f1;
    add.f32 f0, f0, f2;
    add.f32 f0, f0, f3;
    add.f32 f0, f0, f4;
    add.f32 f0, f0, f5;
    add.f32 f0, f0, f6;
    add.f32 f0, f0, f7;
    add.f32 f0, f0, f8;
    add.f32 f0, f0, f9;
    add.f32 f0, f0, f10;
    add.f32 f0, f0, f11;
    add.f32 f0, f0, f12;
    add.f32 f0, f0, f13;
    add.f32 f0, f0, f14;
    add.f32 f0, f0, f15;
    st.shared.f32 [0], f0;
    ret;
}
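The PTX is then JIT-compiled at runtime through the driver API. The harness looks roughly like this (a sketch: error checking dropped, and ptxSource is assumed to hold the code above as a string):

#include <cuda.h>
#include <stdio.h>

extern const char ptxSource[];   /* the PTX above, embedded as a string */

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;
    CUevent start, stop;
    float ms;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* JIT the PTX and grab the kernel */
    cuModuleLoadData(&mod, ptxSource);
    cuModuleGetFunction(&fn, mod, "SharedMemThroughput");

    /* n chosen so the highest shared address touched
       (tid*4 + 4n + 60) stays inside the 8 KB allocation */
    int n = 1024, threads = 512, blocks = 120;   /* 120 = 4 x 30 SMs */
    cuFuncSetBlockShape(fn, threads, 1, 1);      /* max block size on GT200 */
    cuFuncSetSharedSize(fn, 8192);
    cuParamSeti(fn, 0, n);
    cuParamSetSize(fn, 4);

    cuEventCreate(&start, 0);
    cuEventCreate(&stop, 0);
    cuEventRecord(start, 0);
    cuLaunchGrid(fn, blocks, 1);
    cuEventRecord(stop, 0);
    cuEventSynchronize(stop);
    cuEventElapsedTime(&ms, start, stop);

    /* 16 ld.shared per loop iteration, n iterations per thread */
    double fetches = 16.0 * n * threads * blocks;
    printf("%.0f G shared fetches/s\n", fetches / (ms * 1e6));
    return 0;
}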
I use the maximum thread block size (probably not needed, since there are no long-latency ops to hide) and set the grid size to a multiple of the number of multiprocessors (30) to keep them all busy the whole time. The results:
1. 510 Gflops. Not bad; the theoretical peak is 620 Gflops (240 SPs x 1.293 GHz x 2 flops per MAD) when every SP issues a MAD each cycle.
2. 249 G const fetches/s (all threads access the same location)
3. 276 G shared fetches/s => 1104 GB/s
4. 25 G tex fetches/s (4 floats/fetch)
5. 125 G const fetches/s and ~125 G shared fetches/s :(
6. todo
The halved bandwidth in #5 compared to #2 and #3 implies that shared memory and the constant cache aren’t being accessed in parallel. If that’s the case, then since each MAD of the convolution consumes one shared fetch (a pixel) and one const fetch (a filter tap), 125 G paired fetches/s caps me at 125 G MADs/s = 250 Gflops. I don’t see any reason why the constant cache and shared memory can’t be accessed in parallel, since they’re separate interfaces. I’ve heard a MAD and a MUL can be issued in the same cycle, but can an ld.const and an ld.shared be dual-issued?
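If anyone wants to poke at this without writing PTX, a CUDA C version of the combined shared + const test could look something like the sketch below (names and sizes are mine). The compiler may schedule things differently from my hand-written PTX, so it’s worth disassembling the cubin with decuda to check that you really get interleaved shared and const accesses:

__constant__ float coef[256];

__global__ void combinedThroughput(float *out, int n)
{
    __shared__ float smem[2048];      /* contents don't matter for throughput */

    float acc = 0.0f;
    int i = threadIdx.x;

    for (int k = 0; k < n; ++k) {
        /* vary the const index with k so the compiler can't hoist the
           const reads out of the loop and keep them in registers */
        int c = (k & 31) * 8;
        #pragma unroll
        for (int j = 0; j < 8; ++j)
            acc += smem[i + j] * coef[c + j];  /* one shared + one const per MAD */
        i = (i + 8) & 1023;           /* stay inside the shared array */
    }
    out[threadIdx.x] = acc;           /* keep acc live */
}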