I have pushed my CUDA C kernel as far as it can go in terms of obvious optimizations. I generated the .ptx file and have been reading the CUDA PTX ISA document, which helps me interpret some of the lower-level details.
Is there such a thing as 'low-hanging fruit' at this level, i.e. obvious candidates for low-level optimization that can be spotted in the PTX?
Take the following section as an example:
BB2_7:
mul.wide.s32 %rd19, %r34, 48;
mov.u64 %rd20, _Z11simple_backPKfS0_S0_Pffiifiiii$__cuda_local_var_795740_31_non_const_projMat;
add.s64 %rd21, %rd20, %rd19;
ld.shared.f32 %f16, [%rd21+32];
ld.shared.f32 %f17, [%rd21+36];
mul.ftz.f32 %f18, %f17, %f2;
fma.rn.ftz.f32 %f19, %f16, %f1, %f18;
ld.shared.f32 %f20, [%rd21+40];
fma.rn.ftz.f32 %f21, %f20, %f3, %f19;
ld.shared.f32 %f22, [%rd21+44];
add.ftz.f32 %f23, %f21, %f22;
ld.shared.f32 %f24, [%rd21];
ld.shared.f32 %f25, [%rd21+4];
mul.ftz.f32 %f26, %f25, %f2;
fma.rn.ftz.f32 %f27, %f24, %f1, %f26;
ld.shared.f32 %f28, [%rd21+8];
fma.rn.ftz.f32 %f29, %f28, %f3, %f27;
ld.shared.f32 %f30, [%rd21+12];
add.ftz.f32 %f31, %f29, %f30;
div.approx.ftz.f32 %f32, %f31, %f23;
cvt.rni.ftz.f32.f32 %f33, %f32;
cvt.rzi.ftz.s32.f32 %r4, %f33;
ld.shared.f32 %f34, [%rd21+16];
ld.shared.f32 %f35, [%rd21+20];
mul.ftz.f32 %f36, %f35, %f2;
fma.rn.ftz.f32 %f37, %f34, %f1, %f36;
ld.shared.f32 %f38, [%rd21+24];
fma.rn.ftz.f32 %f39, %f38, %f3, %f37;
ld.shared.f32 %f40, [%rd21+28];
add.ftz.f32 %f41, %f39, %f40;
div.approx.ftz.f32 %f42, %f41, %f23;
cvt.rni.ftz.f32.f32 %f43, %f42;
cvt.rzi.ftz.s32.f32 %r5, %f43;
setp.gt.s32 %p5, %r4, 0;
setp.le.s32 %p6, %r4, %r10;
and.pred %p7, %p5, %p6;
setp.gt.s32 %p8, %r5, 0;
and.pred %p9, %p7, %p8;
setp.le.s32 %p10, %r5, %r11;
and.pred %p11, %p9, %p10;
@!%p11 bra BB2_10;
bra.uni BB2_8;
I believe that section maps to 32-bit floating-point multiply-adds of register values with shared memory values (the projMat array), followed by two divisions and a bounds check, all within a for loop.
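To make that reading concrete, here is a rough reconstruction of what the block appears to compute, worked out from the PTX alone rather than copied from the kernel source. All names (Projected, projectPoint, x, y, z, maxU, maxV) are placeholders I made up, and the row-major 3x4 layout is an assumption based on the 48-byte stride and the load offsets. The macros are only there so the sketch also compiles as plain C++.

```cpp
#include <math.h>

// Lets the same sketch compile as plain C++ when nvcc is not used.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical result type: rounded coordinates plus the bounds test.
struct Projected {
    int u;
    int v;
    bool inside;
};

// m points at one 3x4 matrix stored as 12 consecutive floats (48 bytes,
// matching the "mul.wide.s32 ..., 48" index scaling). x, y, z stand in for
// %f1, %f2, %f3 and maxU, maxV for %r10, %r11. All names are placeholders.
__host__ __device__ inline Projected projectPoint(const float* m,
                                                  float x, float y, float z,
                                                  int maxU, int maxV)
{
    // Byte offsets 32..44: the shared denominator (%f23 in the PTX).
    float w = m[8] * x + m[9] * y + m[10] * z + m[11];

    // Byte offsets 0..12 and 16..28: the two numerators (%f31 and %f41).
    float un = m[0] * x + m[1] * y + m[2] * z + m[3];
    float vn = m[4] * x + m[5] * y + m[6] * z + m[7];

    Projected p;
    // Plain '/' here; the PTX shows div.approx.ftz, which suggests the
    // original was built with fast-math division. rintf matches cvt.rni,
    // and the int cast matches cvt.rzi.
    p.u = (int)rintf(un / w);
    p.v = (int)rintf(vn / w);

    // The setp/and.pred chain: 0 < u <= maxU and 0 < v <= maxV.
    p.inside = p.u > 0 && p.u <= maxU && p.v > 0 && p.v <= maxV;
    return p;
}
```

One thing this makes visible is that both div.approx instructions use the same denominator (%f23), and that each row is one mul, two fma and one add fed by four shared loads.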
I'm not asking for an exact answer; I would just like to get an idea of what to look for in the PTX.
Note: this is for Kepler; otherwise I would use maxas (thanks Scott G) to delve into the low-level details.
Thanks!