Compiler generated code for constant memory access - a question

Hi all,

I have a question regarding the compilation of a simple CUDA kernel. It is by no means functional; it is just the simplest code I managed to come up with that shows the problem I have. I would like to apply a filter to a grid of points (who doesn't…). The filters are few, so I store them in constant memory for caching. Depending on some condition decided at runtime, I would like every thread to use the appropriate set of coefficients.

The code below shows the basic idea:


__device__ __constant__ float coeff1[3*3];
__device__ __constant__ float coeff2[3*3];

__global__ void test(float *out) {
    extern __shared__ float shm[];

    const float *kernel = coeff1;
    if (out[0] > 0) kernel = coeff2;

    out[0] =
        kernel[0]*shm[0] + kernel[1]*shm[1] + kernel[2]*shm[2] +
        kernel[3]*shm[3] + kernel[4]*shm[4] + kernel[5]*shm[5] +
        kernel[6]*shm[6] + kernel[7]*shm[7] + kernel[8]*shm[8];
}



Now, there is a substantial difference between the output generated in this scenario and the output generated for just a single set of filter coefficients (the if statement above commented out). FYI, I am using version 2.2 of the CUDA compilation tools (V0.2.1221).

This is the output of decuda for the code without the if statement:


// Disassembling _Z4testPf
000000: d00b8005 20000780 mov.b32 $ofs1, 0x000005c0
000008: c491d201 00200780 mul.rn.f32 $r0, s[$ofs1+0x0024], c0[$ofs1+0x0044]
000010: e490d001 00200780 mad.rn.f32 $r0, s[$ofs1+0x0020], c0[$ofs1+0x0040], $r0
000018: e492d401 00200780 mad.rn.f32 $r0, s[$ofs1+0x0028], c0[$ofs1+0x0048], $r0
000020: e493d601 00200780 mad.rn.f32 $r0, s[$ofs1+0x002c], c0[$ofs1+0x004c], $r0
000028: e494d801 00200780 mad.rn.f32 $r0, s[$ofs1+0x0030], c0[$ofs1+0x0050], $r0
000030: e495da01 00200780 mad.rn.f32 $r0, s[$ofs1+0x0034], c0[$ofs1+0x0054], $r0
000038: e496dc01 00200780 mad.rn.f32 $r0, s[$ofs1+0x0038], c0[$ofs1+0x0058], $r0
000040: e497de01 00200780 mad.rn.f32 $r0, s[$ofs1+0x003c], c0[$ofs1+0x005c], $r0
000048: e498e005 00200780 mad.rn.f32 $r1, s[$ofs1+0x0040], c0[$ofs1+0x0060], $r0
000050: 10000801 4400c780 mov.b32 $r0, s[0x0010]
000058: d00e0005 a0c00781 mov.end.u32 g[$r0], $r1


and this is for the code with the if statement:


// Disassembling _Z4testPf
000000: 10000801 4400c780 mov.b32 $r0, s[0x0010]
000008: d00e0005 80c00780 mov.u32 $r1, g[$r0]
000010: b07c03fd 600107c8 $p0|$o127, $r1, $r124
000018: 10008005 0000005f mov.b32 $r1, 0x000005c0
000020: 10000005 2440c280 @$ mov.b32 $r1, c1[0x0000]
000028: 00000205 c0000780 movsh.b32 $ofs1, $r1, 0x00000000
000030: 1100f204 mov.half.b32 $r1, s[0x0024]
000034: c4810204 mul.half.rn.f32 $r1, $r1, c0[$ofs1+0x0004]
000038: 14000009 2400c780 mov.b32 $r2, c0[$ofs1+0x0000]
000040: e002d009 00204780 mad.rn.f32 $r2, s[0x0020], $r2, $r1
000048: 10001405 4400c780 mov.b32 $r1, s[0x0028]
000050: e4820209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x0008], $r2
000058: 10001605 4400c780 mov.b32 $r1, s[0x002c]
000060: e4830209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x000c], $r2
000068: 10001805 4400c780 mov.b32 $r1, s[0x0030]
000070: e4840209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x0010], $r2
000078: 10001a05 4400c780 mov.b32 $r1, s[0x0034]
000080: e4850209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x0014], $r2
000088: 10001c05 4400c780 mov.b32 $r1, s[0x0038]
000090: e4860209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x0018], $r2
000098: 10001e05 4400c780 mov.b32 $r1, s[0x003c]
0000a0: e4870209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x001c], $r2
0000a8: 10002005 4400c780 mov.b32 $r1, s[0x0040]
0000b0: e4880205 00008780 mad.rn.f32 $r1, $r1, c0[$ofs1+0x0020], $r2
0000b8: d00e0005 a0c00781 mov.end.u32 g[$r0], $r1
// segment: const (1:0000)
0000: 000005e4


It seems that in the first example the mads operate directly on the memory operands, while in the second example the data is first moved into registers. Is this necessary for some architectural reason I do not understand, or can it be avoided? I would much rather have the first version of the code - it is much faster…

I'd appreciate any help.


This is what you want to do.


__device__ __constant__ float coeff1[3*3];
__device__ __constant__ float coeff2[3*3];

__global__ void test(float *out) {
    extern __shared__ float shm[];

    if (out[0] <= 0) {
        out[0] =
            coeff1[0]*shm[0] + coeff1[1]*shm[1] + coeff1[2]*shm[2] +
            coeff1[3]*shm[3] + coeff1[4]*shm[4] + coeff1[5]*shm[5] +
            coeff1[6]*shm[6] + coeff1[7]*shm[7] + coeff1[8]*shm[8];
    } else {
        out[0] =
            coeff2[0]*shm[0] + coeff2[1]*shm[1] + coeff2[2]*shm[2] +
            coeff2[3]*shm[3] + coeff2[4]*shm[4] + coeff2[5]*shm[5] +
            coeff2[6]*shm[6] + coeff2[7]*shm[7] + coeff2[8]*shm[8];
    }
}



If the compiler tries to be smart, be smarter. There may be an issue with branch divergence, however.

Thanks. This trick did indeed produce the code I wanted. However, this solution becomes difficult when you have, say, 10 sets of coefficients and a few more lines of code with each of them. Both from the programming point of view and in the size of the compiled kernel, this is a bit awkward…

Do you think this is a bug in the compiler, or is it intended behavior? Maybe I should post to a different forum (e.g. Development)? I am a bit new here…

Thanks again…

When the compiler does not know from which constant memory address to pull the coefficients (because you change the pointer at run time), it will generate more generic code to access the data. I’d say this is “intended behavior”.

The CUDA developer forum may be better suited to very specific programming questions; this is more of a "general" area.


I am a bit puzzled by certain 'identifiers' in the code. What is the meaning of c0[], s[], and g[]? Also, any references would be appreciated.