Compiler-generated code for constant memory access - a question


Sorry for cross-posting - I originally posted this question in the General forum, but I guess this is a much better place…

I have a question regarding the compilation of a simple CUDA kernel. It is by no means functional - it is just the simplest code I managed to come up with that shows the problem I have. I would like to apply filters to a grid of points (who doesn't…). The filters are few, so I store them in constant memory for caching. Depending on some condition decided at runtime, I would like every thread to use the appropriate set of coefficients.

The code below shows the basic idea:


__device__ __constant__ float coeff1[3*3];
__device__ __constant__ float coeff2[3*3];

__global__ void test(float *out){
    extern __shared__ float shm[];
    float *kernel = coeff1;
    if(out[0] > 0) kernel = coeff2;
    out[0] =
        kernel[0]*shm[0] + kernel[1]*shm[1] + kernel[2]*shm[2] +
        kernel[3]*shm[3] + kernel[4]*shm[4] + kernel[5]*shm[5] +
        kernel[6]*shm[6] + kernel[7]*shm[7] + kernel[8]*shm[8];
}
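For completeness, the coefficients are filled in from the host with cudaMemcpyToSymbol before the launch; roughly like this (a host-side sketch with placeholder values, the function name is just for illustration):

// Host-side sketch (placeholder filter values): fill the two constant arrays
// and launch the kernel with 3x3 floats of dynamic shared memory.
void upload_and_run(float *d_out, dim3 grid, dim3 block)
{
    float h_coeff1[3*3] = { 0 };   // first filter, real values omitted
    float h_coeff2[3*3] = { 0 };   // second filter, real values omitted
    cudaMemcpyToSymbol(coeff1, h_coeff1, sizeof(h_coeff1));
    cudaMemcpyToSymbol(coeff2, h_coeff2, sizeof(h_coeff2));
    test<<<grid, block, 3*3*sizeof(float)>>>(d_out);
}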



Now, there is a substantial difference between the output generated in this scenario and the output generated for just a single set of filter coefficients (with the if statement above commented out).

FYI, I am using version 2.2 of the CUDA compilation tools (V0.2.1221).

This is the output of decuda for the code without the if statement:


// Disassembling _Z4testPf

000000: d00b8005 20000780 mov.b32 $ofs1, 0x000005c0

000008: c491d201 00200780 mul.rn.f32 $r0, s[$ofs1+0x0024], c0[$ofs1+0x0044]

000010: e490d001 00200780 mad.rn.f32 $r0, s[$ofs1+0x0020], c0[$ofs1+0x0040], $r0

000018: e492d401 00200780 mad.rn.f32 $r0, s[$ofs1+0x0028], c0[$ofs1+0x0048], $r0

000020: e493d601 00200780 mad.rn.f32 $r0, s[$ofs1+0x002c], c0[$ofs1+0x004c], $r0

000028: e494d801 00200780 mad.rn.f32 $r0, s[$ofs1+0x0030], c0[$ofs1+0x0050], $r0

000030: e495da01 00200780 mad.rn.f32 $r0, s[$ofs1+0x0034], c0[$ofs1+0x0054], $r0

000038: e496dc01 00200780 mad.rn.f32 $r0, s[$ofs1+0x0038], c0[$ofs1+0x0058], $r0

000040: e497de01 00200780 mad.rn.f32 $r0, s[$ofs1+0x003c], c0[$ofs1+0x005c], $r0

000048: e498e005 00200780 mad.rn.f32 $r1, s[$ofs1+0x0040], c0[$ofs1+0x0060], $r0

000050: 10000801 4400c780 mov.b32 $r0, s[0x0010]

000058: d00e0005 a0c00781 mov.end.u32 g[$r0], $r1


and this is for the code with the if statement:


// Disassembling _Z4testPf

000000: 10000801 4400c780 mov.b32 $r0, s[0x0010]

000008: d00e0005 80c00780 mov.u32 $r1, g[$r0]

000010: b07c03fd 600107c8 $p0|$o127, $r1, $r124

000018: 10008005 0000005f mov.b32 $r1, 0x000005c0

000020: 10000005 2440c280 @$ mov.b32 $r1, c1[0x0000]

000028: 00000205 c0000780 movsh.b32 $ofs1, $r1, 0x00000000

000030: 1100f204 mov.half.b32 $r1, s[0x0024]

000034: c4810204 mul.half.rn.f32 $r1, $r1, c0[$ofs1+0x0004]

000038: 14000009 2400c780 mov.b32 $r2, c0[$ofs1+0x0000]

000040: e002d009 00204780 mad.rn.f32 $r2, s[0x0020], $r2, $r1

000048: 10001405 4400c780 mov.b32 $r1, s[0x0028]

000050: e4820209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x0008], $r2

000058: 10001605 4400c780 mov.b32 $r1, s[0x002c]

000060: e4830209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x000c], $r2

000068: 10001805 4400c780 mov.b32 $r1, s[0x0030]

000070: e4840209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x0010], $r2

000078: 10001a05 4400c780 mov.b32 $r1, s[0x0034]

000080: e4850209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x0014], $r2

000088: 10001c05 4400c780 mov.b32 $r1, s[0x0038]

000090: e4860209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x0018], $r2

000098: 10001e05 4400c780 mov.b32 $r1, s[0x003c]

0000a0: e4870209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x001c], $r2

0000a8: 10002005 4400c780 mov.b32 $r1, s[0x0040]

0000b0: e4880205 00008780 mad.rn.f32 $r1, $r1, c0[$ofs1+0x0020], $r2

0000b8: d00e0005 a0c00781 mov.end.u32 g[$r0], $r1

// segment: const (1:0000)

0000: 000005e4


It seems that in the first example the mad instructions operate directly on the shared and constant memory operands, while in the second example the data is first moved into registers. Is this necessary for some architectural reason I do not understand, or can it be avoided? I would much rather have the first version of the code - it is much faster…

In the original thread cbuchner1 offered me a hack to get around this problem - duplicating the entire computation in every possible branch. That is not really practical with around 10 different filters and a few more lines of code than I have above. Does anyone have an idea how to force the compiler to do the right thing?
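For reference, the suggested hack looks roughly like this (my own sketch of the idea, not cbuchner1's actual code; the kernel name is made up). Each branch names its constant array directly, so the constant offsets stay known at compile time:

// Sketch of the branch-duplication workaround: each branch indexes its
// constant array with literal offsets, so no runtime pointer into constant
// memory is needed.
__global__ void test_duplicated(float *out){
    extern __shared__ float shm[];
    float acc;
    if(out[0] > 0){
        // second filter
        acc = coeff2[0]*shm[0] + coeff2[1]*shm[1] + coeff2[2]*shm[2] +
              coeff2[3]*shm[3] + coeff2[4]*shm[4] + coeff2[5]*shm[5] +
              coeff2[6]*shm[6] + coeff2[7]*shm[7] + coeff2[8]*shm[8];
    } else {
        // first filter: the same expression duplicated with the other array
        acc = coeff1[0]*shm[0] + coeff1[1]*shm[1] + coeff1[2]*shm[2] +
              coeff1[3]*shm[3] + coeff1[4]*shm[4] + coeff1[5]*shm[5] +
              coeff1[6]*shm[6] + coeff1[7]*shm[7] + coeff1[8]*shm[8];
    }
    out[0] = acc;
}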


How much faster is it, and on which GPU? Can you get the instruction/cycle ratios in each case (using the profiler)?

What is the occupancy?

I’m asking because I would have expected version 2 to be at least as fast, or even faster…

BTW, did you try with CUDA 3.0? (extracting assembly from cubins is harder, but you can still compare performance)

If you're using a Fermi, that might be because of dual issue; see [post="1002997"]my speedy SGEMM[/post] thread.

I think that’s an example of what I call “performance (non)portability”.

Thanks for your answers.

The described code uses 15 registers, occupancy is 1, and the instruction throughput varies between the two cases, but goes up to 0.85. I am using a Tesla C1060.

CUDA 3.0 is currently not an option for me. I am forced to use an old driver (185.18.36) on my Ubuntu machine, since I have problems starting X otherwise - I have another NVIDIA card for display, and somehow I could not get the newer drivers to display anything on the Quadro. That is, I guess, a problem for another thread ;)

Now, about the performance…

With the if statement and different filters the code is ~25% slower, which I personally consider to be a disaster.

I have written a simple program (attached, 5.38 KB) which implements what we have discussed. I have also included the trick by cbuchner1. For a setup of 4096x4096 points, the fps results are as follows:

no if's, same filter everywhere - 463 fps

4 if's, different filters at the boundaries - 340 fps

cbuchner1's trick, different filters at the boundaries - 370 fps

So unfortunately the trick did not really work - too many if statements, I guess.

In the attached implementation I did not use different filters, but rather a simple offset into the same filter array, so that the values are close together in memory and the cache still works.
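In essence it looks like this (a sketch only; the array name, the number of filters and the way the offset is picked are illustrative, not the exact code from the attachment):

// Sketch of the single-array variant: all filters sit back to back in one
// constant array and a runtime offset picks the active one, so the values
// stay close together in constant memory and the cache keeps working.
__device__ __constant__ float coeffs[4*3*3];    // e.g. 4 filters, 3x3 each

__global__ void test_offset(float *out, int filterId){
    extern __shared__ float shm[];
    int off = 9*filterId;                        // offset chosen at runtime
    out[0] =
        coeffs[off+0]*shm[0] + coeffs[off+1]*shm[1] + coeffs[off+2]*shm[2] +
        coeffs[off+3]*shm[3] + coeffs[off+4]*shm[4] + coeffs[off+5]*shm[5] +
        coeffs[off+6]*shm[6] + coeffs[off+7]*shm[7] + coeffs[off+8]*shm[8];
}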

Anyway, the filters differ only at the boundaries, so the effect should be negligible if it were due to some memory read latency (most of the points are processed using the same filter anyway).


I am really lost here. This does not seem rational. As it looks now, it will be faster to first process the whole grid with the same filter, and afterward do some special treatment of the boundaries. Not really elegant.

As your filter is only 3x3 in size, it may be worth sacrificing 9 registers to store the kernel. Load the filter kernel into registers once after the if conditions, then process several pixels in a loop to amortize the setup time.
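Something along these lines (just a sketch; the kernel name, the pixel loop and the shared-memory layout are made up for illustration):

// Sketch of the register-caching idea: pick the coefficient set once, copy it
// into registers, then reuse it for several pixels per thread.
__global__ void test_regs(float *out, int nPixels){
    extern __shared__ float shm[];
    const float *kernel = (out[0] > 0) ? coeff2 : coeff1;
    float k0 = kernel[0], k1 = kernel[1], k2 = kernel[2],
          k3 = kernel[3], k4 = kernel[4], k5 = kernel[5],
          k6 = kernel[6], k7 = kernel[7], k8 = kernel[8];
    // process several pixels per thread to amortize the setup cost
    for(int p = 0; p < nPixels; ++p){
        const float *s = shm + 9*p;              // illustrative data layout
        out[p] = k0*s[0] + k1*s[1] + k2*s[2] +
                 k3*s[3] + k4*s[4] + k5*s[5] +
                 k6*s[6] + k7*s[7] + k8*s[8];
    }
}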

Sorry, can't do. It is only 3x3 here, in the test application; in reality it is bigger. Plus, I need registers for other things as well - my current register count is 15, and adding more would limit my occupancy.