Compiler generated code for constant memory access - a question

angainor · June 3, 2010, 8:13pm

Hi all,

I have a question about how a simple CUDA kernel is compiled. The kernel is by no means functional; it is just the simplest code I could come up with that shows the problem.

I would like to apply filters to a grid of points (who doesn’t…). There are only a few filters, so I store them in constant memory to benefit from its cache. Depending on a condition decided at runtime, every thread should use the appropriate set of coefficients.

The code below shows the basic idea:

[codebox]

__device__ __constant__ float coeff1[3*3];
__device__ __constant__ float coeff2[3*3];

__global__ void test(float *out){
    // dynamically sized shared memory must be declared as an unsized array
    extern __shared__ float shm[];

    // default coefficient set; switched at runtime below
    const float *kernel = coeff1;
    if(out[0]>0) kernel = coeff2;

    out[0] =
        kernel[0]*shm[0] +   kernel[1]*shm[1] +   kernel[2]*shm[2] +
        kernel[3]*shm[3] +   kernel[4]*shm[4] +   kernel[5]*shm[5] +
        kernel[6]*shm[6] +   kernel[7]*shm[7] +   kernel[8]*shm[8] ;
}

[/codebox]

Now, there is a substantial difference between the code generated in this scenario and the code generated for a single set of filter coefficients (i.e. with the if statement above commented out).

FYI, I am using version 2.2 of the CUDA compilation tools (V0.2.1221).

This is the decuda output for the code without the if statement:

[codebox]

// Disassembling _Z4testPf

000000: d00b8005 20000780 mov.b32 $ofs1, 0x000005c0

000008: c491d201 00200780 mul.rn.f32 $r0, s[$ofs1+0x0024], c0[$ofs1+0x0044]

000010: e490d001 00200780 mad.rn.f32 $r0, s[$ofs1+0x0020], c0[$ofs1+0x0040], $r0

000018: e492d401 00200780 mad.rn.f32 $r0, s[$ofs1+0x0028], c0[$ofs1+0x0048], $r0

000020: e493d601 00200780 mad.rn.f32 $r0, s[$ofs1+0x002c], c0[$ofs1+0x004c], $r0

000028: e494d801 00200780 mad.rn.f32 $r0, s[$ofs1+0x0030], c0[$ofs1+0x0050], $r0

000030: e495da01 00200780 mad.rn.f32 $r0, s[$ofs1+0x0034], c0[$ofs1+0x0054], $r0

000038: e496dc01 00200780 mad.rn.f32 $r0, s[$ofs1+0x0038], c0[$ofs1+0x0058], $r0

000040: e497de01 00200780 mad.rn.f32 $r0, s[$ofs1+0x003c], c0[$ofs1+0x005c], $r0

000048: e498e005 00200780 mad.rn.f32 $r1, s[$ofs1+0x0040], c0[$ofs1+0x0060], $r0

000050: 10000801 4400c780 mov.b32 $r0, s[0x0010]

000058: d00e0005 a0c00781 mov.end.u32 g[$r0], $r1

[/codebox]

and this is for the code with the if statement:

[codebox]

// Disassembling _Z4testPf

000000: 10000801 4400c780 mov.b32 $r0, s[0x0010]

000008: d00e0005 80c00780 mov.u32 $r1, g[$r0]

000010: b07c03fd 600107c8 set.gt.u16.f32.f32 $p0|$o127, $r1, $r124

000018: 10008005 0000005f mov.b32 $r1, 0x000005c0

000020: 10000005 2440c280 @$p0.ne mov.b32 $r1, c1[0x0000]

000028: 00000205 c0000780 movsh.b32 $ofs1, $r1, 0x00000000

000030: 1100f204 mov.half.b32 $r1, s[0x0024]

000034: c4810204 mul.half.rn.f32 $r1, $r1, c0[$ofs1+0x0004]

000038: 14000009 2400c780 mov.b32 $r2, c0[$ofs1+0x0000]

000040: e002d009 00204780 mad.rn.f32 $r2, s[0x0020], $r2, $r1

000048: 10001405 4400c780 mov.b32 $r1, s[0x0028]

000050: e4820209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x0008], $r2

000058: 10001605 4400c780 mov.b32 $r1, s[0x002c]

000060: e4830209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x000c], $r2

000068: 10001805 4400c780 mov.b32 $r1, s[0x0030]

000070: e4840209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x0010], $r2

000078: 10001a05 4400c780 mov.b32 $r1, s[0x0034]

000080: e4850209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x0014], $r2

000088: 10001c05 4400c780 mov.b32 $r1, s[0x0038]

000090: e4860209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x0018], $r2

000098: 10001e05 4400c780 mov.b32 $r1, s[0x003c]

0000a0: e4870209 00008780 mad.rn.f32 $r2, $r1, c0[$ofs1+0x001c], $r2

0000a8: 10002005 4400c780 mov.b32 $r1, s[0x0040]

0000b0: e4880205 00008780 mad.rn.f32 $r1, $r1, c0[$ofs1+0x0020], $r2

0000b8: d00e0005 a0c00781 mov.end.u32 g[$r0], $r1

// segment: const (1:0000)

0000: 000005e4

[/codebox]

It seems that in the first example the mad instructions operate directly on memory operands, while in the second example the shared memory data is first moved into registers.

Is this necessary for some architectural reason I do not understand, or can it be avoided?

I would rather have the first version of the code, since it is much faster…

I’d appreciate any help.

Thanks!

cbuchner1 · June 3, 2010, 9:11pm

This is what you want to do.

[codebox]

__device__ __constant__ float coeff1[3*3];
__device__ __constant__ float coeff2[3*3];

__global__ void test(float *out){
    extern __shared__ float shm[];

    // Each branch names its array directly, so the compiler knows the
    // constant memory addresses at compile time.
    // As in the original kernel: coeff2 when out[0] > 0, coeff1 otherwise.
    if(out[0]>0)
    {
        out[0] =
            coeff2[0]*shm[0] +   coeff2[1]*shm[1] +   coeff2[2]*shm[2] +
            coeff2[3]*shm[3] +   coeff2[4]*shm[4] +   coeff2[5]*shm[5] +
            coeff2[6]*shm[6] +   coeff2[7]*shm[7] +   coeff2[8]*shm[8] ;
    }
    else
    {
        out[0] =
            coeff1[0]*shm[0] +   coeff1[1]*shm[1] +   coeff1[2]*shm[2] +
            coeff1[3]*shm[3] +   coeff1[4]*shm[4] +   coeff1[5]*shm[5] +
            coeff1[6]*shm[6] +   coeff1[7]*shm[7] +   coeff1[8]*shm[8] ;
    }
}

[/codebox]

If the compiler tries to be smart, be smarter.

There may be an issue with branch divergence however.

angainor · June 4, 2010, 12:32pm

Thanks. This trick did indeed produce the code I wanted. However, the solution becomes awkward when you have, say, 10 sets of coefficients and a few more lines of code using them, both from the programming point of view and in terms of the size of the compiled kernel.

Do you think this is a bug in the compiler, or is it intended behavior? Maybe I should post to a different forum (e.g. Development)? I am a bit new here…

Thanks again…

cbuchner1 · June 4, 2010, 12:36pm

When the compiler does not know from which constant memory address to pull the coefficients (because you change the pointer at run time), it will generate more generic code to access the data. I’d say this is “intended behavior”.

The CUDA developer forum may be more suited for very specific programming questions. This is more a “general” area.

Christian
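
One way to keep the per-branch code without writing it out by hand ten times is to let the compiler do the duplication. The sketch below is not from the thread and rests on several assumptions: all coefficient sets are packed into one hypothetical `coeffs[NSETS][TAPS]` constant array, and the names `apply_filter`, `test_sets` and `run_filter` are made up for illustration. Because the set index is a template parameter, each instantiation refers to fixed constant memory offsets, so the compiler should be able to emit the direct form. This is worth confirming with a disassembler on your own toolkit, and the compiled kernel still grows with the number of sets.

```cuda
#include <cuda_runtime.h>

#define TAPS  9   // 3x3 filter, as in the thread
#define NSETS 2   // hypothetical; the same pattern extends to 10 sets

// Assumption: all coefficient sets live in a single constant array.
__constant__ float coeffs[NSETS][TAPS];

// SET is a compile-time value, so coeffs[SET][i] has a fixed offset
// in each instantiation once the loop is unrolled.
template <int SET>
__device__ float apply_filter(const float *shm) {
    float acc = 0.0f;
#pragma unroll
    for (int i = 0; i < TAPS; ++i)
        acc += coeffs[SET][i] * shm[i];
    return acc;
}

__global__ void test_sets(float *out) {
    extern __shared__ float shm[];

    // Demo data only (the helper below launches a single thread): 1, 2, ..., 9.
    for (int i = 0; i < TAPS; ++i)
        shm[i] = float(i + 1);

    // The runtime condition selects an instantiation, not a pointer.
    int set = (out[0] > 0) ? 1 : 0;
    switch (set) {
        case 0: out[0] = apply_filter<0>(shm); break;
        case 1: out[0] = apply_filter<1>(shm); break;
    }
}

// Hypothetical host helper: set 0 is all 1.0f, set 1 is all 2.0f.
// `selector` is written to out[0] before the launch; the filtered value is returned.
float run_filter(float selector) {
    float h[NSETS][TAPS];
    for (int i = 0; i < TAPS; ++i) {
        h[0][i] = 1.0f;
        h[1][i] = 2.0f;
    }
    cudaMemcpyToSymbol(coeffs, h, sizeof(h));

    float *d_out = 0;
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_out, &selector, sizeof(float), cudaMemcpyHostToDevice);

    test_sets<<<1, 1, TAPS * sizeof(float)>>>(d_out);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return result;
}
```

Calling `run_filter(-1.0f)` exercises set 0 and `run_filter(1.0f)` exercises set 1. The branch divergence caveat mentioned earlier in the thread applies here as well whenever threads of one warp choose different sets.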

lsolano · July 13, 2010, 6:44am

I am a bit puzzled by certain ‘identifiers’ in the code. What is the meaning of c0, s and g? Also, any references would be appreciated.

[codebox]

// Disassembling _Z4testPf

000000: d00b8005 20000780 mov.b32 $ofs1, 0x000005c0

000008: c491d201 00200780 mul.rn.f32 $r0, s[$ofs1+0x0024], c0[$ofs1+0x0044]

000010: e490d001 00200780 mad.rn.f32 $r0, s[$ofs1+0x0020], c0[$ofs1+0x0040], $r0

000018: e492d401 00200780 mad.rn.f32 $r0, s[$ofs1+0x0028], c0[$ofs1+0x0048], $r0

000020: e493d601 00200780 mad.rn.f32 $r0, s[$ofs1+0x002c], c0[$ofs1+0x004c], $r0

000028: e494d801 00200780 mad.rn.f32 $r0, s[$ofs1+0x0030], c0[$ofs1+0x0050], $r0

000030: e495da01 00200780 mad.rn.f32 $r0, s[$ofs1+0x0034], c0[$ofs1+0x0054], $r0

000038: e496dc01 00200780 mad.rn.f32 $r0, s[$ofs1+0x0038], c0[$ofs1+0x0058], $r0

000040: e497de01 00200780 mad.rn.f32 $r0, s[$ofs1+0x003c], c0[$ofs1+0x005c], $r0

000048: e498e005 00200780 mad.rn.f32 $r1, s[$ofs1+0x0040], c0[$ofs1+0x0060], $r0

000050: 10000801 4400c780 mov.b32 $r0, s[0x0010]

000058: d00e0005 a0c00781 mov.end.u32 g[$r0], $r1

[/codebox]

Thanks.
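
Judging from the listings in this thread, the prefixes appear to name memory spaces: `c0[...]` is the constant memory bank that holds the `__constant__` arrays, `s[...]` is shared memory, and `g[...]` is global memory. On the hardware generation targeted here, kernel arguments are also passed through shared memory, which would explain why `mov.b32 $r0, s[0x0010]` fetches the `out` pointer before the final store to `g[$r0]`. The sketch below ties each space to CUDA source; the names `weights`, `tile`, `spaces` and `run_spaces` are hypothetical, and the comments describe the expected mapping rather than verified disassembler output.

```cuda
#include <cuda_runtime.h>

// Constant memory: reads of this array would show up as c0[...] operands.
__constant__ float weights[4];

__global__ void spaces(float *out) {
    // Shared memory: accesses would show up as s[...] operands.
    extern __shared__ float tile[];

    tile[threadIdx.x] = weights[threadIdx.x];   // constant -> shared
    __syncthreads();

    // Global memory: this store would show up as a write to g[...].
    out[threadIdx.x] = 2.0f * tile[threadIdx.x];
}

// Hypothetical host helper: loads weights {1, 2, 3, 4}, runs four threads,
// and returns the sum of the four outputs.
float run_spaces() {
    const float h_w[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    cudaMemcpyToSymbol(weights, h_w, sizeof(h_w));

    float *d_out = 0;
    cudaMalloc(&d_out, 4 * sizeof(float));

    spaces<<<1, 4, 4 * sizeof(float)>>>(d_out);

    float h_out[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    return h_out[0] + h_out[1] + h_out[2] + h_out[3];
}
```

The memory space qualifiers in the CUDA programming guide and the state space section of the PTX ISA documentation are the closest official references; decuda itself is an unofficial tool, so its exact notation is not covered there.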
