Hi,
Sorry for cross-posting - I originally posted this question in the General forum, but I guess
this is a much better place…
I have a question regarding the compilation of a simple CUDA kernel. It is by no means
functional; it is just the simplest code I managed to come up with that shows the problem I have.
I would like to apply filters to a grid of points (who doesn’t…)… There are only a few filters,
so I store them in constant memory to benefit from its caching. Depending on some condition decided at runtime,
I would like every thread to use the appropriate set of coefficients.
The code below shows the basic idea:
[codebox]
__device__ __constant__ float coeff1[3*3];
__device__ __constant__ float coeff2[3*3];
__global__ void test(float *out){
    extern __shared__ float shm[];
    float * kernel = coeff1;
    if(out[0]>0) kernel = coeff2;
    out[0] =
        kernel[0]*shm[0] + kernel[1]*shm[1] + kernel[2]*shm[2] +
        kernel[3]*shm[3] + kernel[4]*shm[4] + kernel[5]*shm[5] +
        kernel[6]*shm[6] + kernel[7]*shm[7] + kernel[8]*shm[8] ;
}
[/codebox]
Now, the compiler output for this scenario differs substantially from the output
for just a single set of filter coefficients (that is, with the if statement above commented out).
FYI, I am using version 2.2 of the CUDA compilation tools (V0.2.1221).
This is the output of decuda for the code without the if statement:
[codebox]
// Disassembling _Z4testPf
000000: d00b8005 20000780 mov.b32 $ofs1, 0x000005c0
000008: c491d201 00200780 mul.rn.f32 $r0, s[$ofs1+0x0024], c0[$ofs1+0x0044]
000010: e490d001 00200780 mad.rn.f32 $r0, s[$ofs1+0x0020], c0[$ofs1+0x0040], $r0
000018: e492d401 00200780 mad.rn.f32 $r0, s[$ofs1+0x0028], c0[$ofs1+0x0048], $r0
000020: e493d601 00200780 mad.rn.f32 $r0, s[$ofs1+0x002c], c0[$ofs1+0x004c], $r0
000028: e494d801 00200780 mad.rn.f32 $r0, s[$ofs1+0x0030], c0[$ofs1+0x0050], $r0
000030: e495da01 00200780 mad.rn.f32 $r0, s[$ofs1+0x0034], c0[$ofs1+0x0054], $r0
000038: e496dc01 00200780 mad.rn.f32 $r0, s[$ofs1+0x0038], c0[$ofs1+0x0058], $r0
000040: e497de01 00200780 mad.rn.f32 $r0, s[$ofs1+0x003c], c0[$ofs1+0x005c], $r0
000048: e498e005 00200780 mad.rn.f32 $r1, s[$ofs1+0x0040], c0[$ofs1+0x0060], $r0
000050: 10000801 4400c780 mov.b32 $r0, s[0x0010]
000058: d00e0005 a0c00781 mov.end.u32 g[$r0], $r1
[/codebox]
and this is for the code with the if statement:
[codebox]
// Disassembling _Z4testPf
000000: 10000801 4400c780 mov.b32 $r0, s[0x0010]
000008: d00e0005 80c00780 mov.u32 $r1, g[$r0]
000010: b07c03fd 600107c8 set.gt.u16.f32.f32 $p0|$o127, $r1, $r124
000018: 10008005 0000005f mov.b32 $r1, 0x000005c0
000020: 10000005 2440c280 @$p0.ne mov.b32 $r1, c1[0x0000]
000028: 00000205 c0000780 movsh.b32 $ofs1, $r1, 0x00000000
000030: 1100f204 mov.half.b32 $r1, s[0x0024]
000034: c4810204 mul.half.rn.f32 $r1, $r1, c0[$ofs1+0x0004]
000038: 14000009 2400c780 mov.b32 $r2, c0[$ofs1+0x0000]
000040: e002d009 00204780 mad.rn.f32 $r2, s[0x0020], $r2, $r1
000048: 10001405 4400c780 mov.b32 $r1, s[0x0028]
000050: e4820209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x0008], $r2
000058: 10001605 4400c780 mov.b32 $r1, s[0x002c]
000060: e4830209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x000c], $r2
000068: 10001805 4400c780 mov.b32 $r1, s[0x0030]
000070: e4840209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x0010], $r2
000078: 10001a05 4400c780 mov.b32 $r1, s[0x0034]
000080: e4850209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x0014], $r2
000088: 10001c05 4400c780 mov.b32 $r1, s[0x0038]
000090: e4860209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x0018], $r2
000098: 10001e05 4400c780 mov.b32 $r1, s[0x003c]
0000a0: e4870209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x001c], $r2
0000a8: 10002005 4400c780 mov.b32 $r1, s[0x0040]
0000b0: e4880205 00008780 mad.rn.f32 $r1, $r1, c0[$ofs1+0x0020], $r2
0000b8: d00e0005 a0c00781 mov.end.u32 g[$r0], $r1
// segment: const (1:0000)
0000: 000005e4
[/codebox]
It seems that in the first listing the mad instructions take both operands directly from
shared and constant memory (s[…] and c0[…]), while in the second listing each shared-memory
value is first moved into a register with a separate mov.
Is this necessary for some architectural reason I do not understand, or can it be avoided?
I would rather have the first version of the code - it is much faster…
In the original thread cbuchner1 did offer me a hack to get around this problem: copy
the entire kernel body into every possible branch. That is not really practical with around 10 different filters
and a few more lines of code than I have above. Does anyone have an idea how to force the compiler to do the
right thing?
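For reference, the workaround mentioned above could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the names apply3x3, test_branches and run_filter, as well as the extra in parameter used to fill shared memory, are made up for illustration and are not part of my real code. The idea is to write the sum once in a device function and call it with a literal array name in each branch, so that (assuming the compiler inlines the call) every copy sees a compile-time-constant address. Whether this really yields the direct-operand mad form would still need to be confirmed with decuda.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Two 3x3 coefficient sets in constant memory, as in the kernel above.
__device__ __constant__ float coeff1[3 * 3];
__device__ __constant__ float coeff2[3 * 3];

// Hypothetical helper: the 9-tap sum written once in source.
// Assumption: the compiler inlines it at each call site, so the
// coefficient address is a known constant in every copy.
__device__ float apply3x3(const float *kernel, const float *shm)
{
    return kernel[0] * shm[0] + kernel[1] * shm[1] + kernel[2] * shm[2] +
           kernel[3] * shm[3] + kernel[4] * shm[4] + kernel[5] * shm[5] +
           kernel[6] * shm[6] + kernel[7] * shm[7] + kernel[8] * shm[8];
}

// Variant of the test kernel with the body duplicated per branch.
// The `in` parameter is an addition so the sketch has defined input.
__global__ void test_branches(float *out, const float *in)
{
    extern __shared__ float shm[];
    // A single-thread launch is assumed, so one thread fills all 9 values.
    for (int i = 0; i < 9; ++i) shm[i] = in[i];

    if (out[0] > 0)
        out[0] = apply3x3(coeff2, shm);  // copy specialised for coeff2
    else
        out[0] = apply3x3(coeff1, shm);  // copy specialised for coeff1
}

// Hypothetical host helper: uploads both filters, stores `selector` in
// out[0], runs one thread and returns the resulting out[0].
float run_filter(float selector, const float *c1, const float *c2,
                 const float *in)
{
    float *d_out = 0, *d_in = 0, result = 0.0f;
    cudaMemcpyToSymbol(coeff1, c1, 9 * sizeof(float));
    cudaMemcpyToSymbol(coeff2, c2, 9 * sizeof(float));
    cudaMalloc((void **)&d_out, sizeof(float));
    cudaMalloc((void **)&d_in, 9 * sizeof(float));
    cudaMemcpy(d_out, &selector, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_in, in, 9 * sizeof(float), cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory for shm[].
    test_branches<<<1, 1, 9 * sizeof(float)>>>(d_out, d_in);
    cudaError_t err = cudaGetLastError();
    assert(err == cudaSuccess);

    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    cudaFree(d_in);
    return result;
}
```

Calling run_filter with a negative selector should go through the coeff1 branch, and a positive selector through the coeff2 branch. This keeps the source short, but the machine code is still duplicated once per filter, so with around 10 filters the kernel grows accordingly.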
Thanks!