Using volatile: what have I actually done?

While trying to reduce the number of registers used by a kernel that will at some point be part of a much larger one, I've experimented with the volatile keyword.

This is what the kernel looked like at first:

__global__ void convolve(float* result)
{
	const int idx = (blockIdx.y * blockDim.x * gridDim.x) + blockIdx.x * blockDim.x + threadIdx.x;
	const int z = idx / (DATA_W * DATA_H);
	const int y = (idx - z * DATA_W * DATA_H) / DATA_W;
	const int x = (idx - z * DATA_W * DATA_H - y * DATA_W);

	if (x < DATA_W && y < DATA_H && z < DATA_D)
	{
		float sum = 0;
		for (int kz = -(KERNEL_D - KERNEL_Z - 1); kz <= KERNEL_Z; kz++)
		{
			for (int ky = -(KERNEL_H - KERNEL_Y - 1); ky <= KERNEL_Y; ky++)
			{
				for (int kx = -(KERNEL_W - KERNEL_X - 1); kx <= KERNEL_X; kx++)
				{
					int dx = (idx - z * DATA_W * DATA_H - y * DATA_W) + kx;
					int dy = y + ky;
					int dz = z + kz;

					// clamp the sampling coordinates to the volume borders
					if (dx < 0) dx = 0;
					if (dy < 0) dy = 0;
					if (dz < 0) dz = 0;
					if (dx >= DATA_W) dx = DATA_W - 1;
					if (dy >= DATA_H) dy = DATA_H - 1;
					if (dz >= DATA_D) dz = DATA_D - 1;

					float filterVal = d_Kernel[((KERNEL_Z - kz) * KERNEL_H + (KERNEL_Y - ky)) * KERNEL_W + (KERNEL_X - kx)];
					float pixelVal = tex3D(texData, dx, dy, dz);
					sum += filterVal * pixelVal;
				}
			}
		}
		result[(z * DATA_H + y) * DATA_W + x] = sum;
	}
}

Where the KERNEL_WHATEVER values are const ints in the header.
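For reference, judging by the 7s and 3s that show up in the PTX dump further down this thread, the declarations presumably look something like this (a reconstruction, not the actual header):

// Hypothetical reconstruction of the header constants; the values are only
// inferred from the literals in the PTX dump posted below.
const int KERNEL_W = 7, KERNEL_H = 7, KERNEL_D = 7;  // kernel size
const int KERNEL_X = 3, KERNEL_Y = 3, KERNEL_Z = 3;  // kernel center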

I've put those values in registers, as far as I can tell anyway… and the register count has gone down, from 18 to 12:

__global__ void convolve(float* result)
{
	const int idx = (blockIdx.y * blockDim.x * gridDim.x) + blockIdx.x * blockDim.x + threadIdx.x;
	const int z = idx / (DATA_W * DATA_H);
	const int y = (idx - z * DATA_W * DATA_H) / DATA_W;
	const int x = (idx - z * DATA_W * DATA_H - y * DATA_W);

	// modified: copy the compile-time constants into volatile locals
	volatile int kkd = KERNEL_D, kkh = KERNEL_H, kkz = KERNEL_Z, kky = KERNEL_Y, kkx = KERNEL_X, kkw = KERNEL_W;

	if (x < DATA_W && y < DATA_H && z < DATA_D)
	{
		float sum = 0;
		for (int kz = -(kkd - kkz - 1); kz <= kkz; kz++)          // modified
		{
			for (int ky = -(kkh - kky - 1); ky <= kky; ky++)      // modified
			{
				for (int kx = -(kkw - kkx - 1); kx <= kkx; kx++)  // modified
				{
					int dx = (idx - z * DATA_W * DATA_H - y * DATA_W) + kx;
					int dy = y + ky;
					int dz = z + kz;

					if (dx < 0) dx = 0;
					if (dy < 0) dy = 0;
					if (dz < 0) dz = 0;
					if (dx >= DATA_W) dx = DATA_W - 1;
					if (dy >= DATA_H) dy = DATA_H - 1;
					if (dz >= DATA_D) dz = DATA_D - 1;

					float filterVal = d_Kernel[((kkz - kz) * kkh + (kky - ky)) * kkw + (kkx - kx)];  // modified
					float pixelVal = tex3D(texData, dx, dy, dz);
					sum += filterVal * pixelVal;
				}
			}
		}
		result[(z * DATA_H + y) * DATA_W + x] = sum;
	}
}

So… this is probably the worst question anyone can ask… but… what have I actually done?

Edit… well, it seems I can't use bold inside code blocks, so the modified lines are marked with // modified comments instead.

I’m not really sure why you’re using the volatile keyword here - isn’t it for variables that can change value at any time? These are quite the opposite - compile-time constants.

The reg count decreasing is probably expected, as I don’t think you can store a volatile in a register. Are you sure that it hasn’t just moved them into local memory?

I thought the volatile keyword was used to force something to stay in a register.

lmem is still 0 as per the cubin file.

Looking at the PTX, it seems to be doing just that:

mov.s32  %r1, 7;
mov.s32  %r2, %r1;
mov.s32  %r3, 7;
mov.s32  %r4, %r3;
mov.s32  %r5, 3;
mov.s32  %r6, %r5;
mov.s32  %r7, 3;
mov.s32  %r8, %r7;
mov.s32  %r9, 3;
mov.s32  %r10, %r9;
mov.s32  %r11, 7;
mov.s32  %r12, %r11;

7 and 3 are the values for kernel sizes and kernel centers…

Well, I can honestly say the volatile keyword still eludes me!

Volatile means that the value can change at any time. Thus, you are forcing the compiler NOT to keep 7 and 3 sitting around in the same register all the time, but to reload them every time they are needed. You can confirm this by comparing the two PTX outputs: see if one loads 7 and 3 into a register once while the other loads them every loop iteration. (The list-source option can help identify where the loops are.)
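To make that concrete, here is a minimal toy kernel (hypothetical, not from this thread) showing the two behaviours side by side:

// Toy sketch: a const local can be folded into an immediate or kept in one
// register for the whole kernel, while a volatile local must be re-read at
// every use.
__global__ void demo(float* out)
{
	const int n_const = 7;     // compiler may fold this into the instruction
	volatile int n_vol = 7;    // compiler must reload this on each access
	float sum = 0.0f;
	for (int i = 0; i < n_vol; i++)  // fresh read of n_vol every iteration
		sum += (float)n_const;
	out[threadIdx.x] = sum;
}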

The volatile keyword is used for variables that you do not want included in the compiler's optimization and that should therefore be written back to memory. Without the volatile keyword, you can't really be sure when an update to a variable becomes visible to other threads (unless you call __syncthreads()).
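The classic example of that visibility use case is the warp-synchronous reduction pattern; a hedged sketch (warpReduce and sdata are illustrative names, not from this thread):

// Within a warp, declaring the shared array volatile forces each thread to
// re-read its neighbours' latest stores instead of caching stale values in
// registers. Assumes sdata holds at least 64 elements.
__device__ void warpReduce(volatile float* sdata, int tid)
{
	sdata[tid] += sdata[tid + 32];
	sdata[tid] += sdata[tid + 16];
	sdata[tid] += sdata[tid + 8];
	sdata[tid] += sdata[tid + 4];
	sdata[tid] += sdata[tid + 2];
	sdata[tid] += sdata[tid + 1];
}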

Thanks for the replies guys.

The 3 and 7 literals are not loaded again in the PTX, but %r2, %r4, %r6… are, registers which contain the 3s and 7s. So they seem to be copied back from memory… which brings me to this:

I guess this is a stupid question, but I'm only wondering about this now…
KERNEL_D and all of these are declared as const int in the header of the .cu file.
Where do they reside when the kernel is executing? I didn't copy them over to CUDA memory.
I'm saving 6 registers by using 6 volatile variables… so where are the values copied from? Local memory is still lmem=0.

As the volatile keyword is used to avoid read/write optimization for shared memory variables, my guess is that the volatile variables are placed in shared memory. Have you checked your shared memory usage?

It will probably take decuda to figure out for certain. My guess is that the literals are just stored in the literal portion of the mov assembly instruction; in other words, the values are stored in the code segment.

Umm, wouldn't they get copied in each time from constant memory? EDIT: I don't believe PTX stores constants in the code segment, does it?

Anyway, that’s a cool trick to force things out of registers.
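For what it's worth, a compile-time const int and actual constant memory are two different things; a quick sketch (c_filter is an illustrative name, not from this thread):

// A compile-time constant: nvcc can fold it into the instruction stream as
// an immediate (exactly what the "mov.s32 %r1, 7" above shows); no device
// memory is involved.
const int KERNEL_W = 7;

// A __constant__ variable: lives in device constant memory and must be
// filled from the host, e.g. with cudaMemcpyToSymbol().
__constant__ float c_filter[7 * 7 * 7];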

I do have this at the beginning of the PTX:

.entry _Z8convolvePf
{
	.reg .u16 %rh<5>;
	.reg .u32 %r<131>;
	.reg .f32 %f<13>;
	.reg .pred %p<17>;
	.param .u32 __cudaparm__Z8convolvePf_result;
	// kkd = 0
	// kkh = 4
	// kkz = 8
	// kky = 12
	// kkx = 16
	// kkw = 20
	// kz = 24
	// ky = 28
	// kx = 32
	.loc	14	42	0
	//  38
	//  39
	//  40
	//  41
	//  42  __global__ void convolve(float* result)

My x86 assembler is a bit rusty (I know this isn't x86, but I guess it can relate somehow!), but it looks like it's placing the variables on some kind of stack, where 0, 4, 8… are the offsets. Something like what MisterAnderson said, anyway.

Shared memory usage stays the same (says the cubin) with or without the volatile variables.

Ailleur,

Whatever you have done is going to help many people reduce their register count!

Thanks!

It would be interesting to know how this impacts the performance of a kernel where occupancy isn't increased.

I gave it a quick glance some days ago when I first tried it. Time went up a bit, and the instruction count too, but AFAIK it was within 5%.

And this was done with only one run, so not the most scientific method.

I'll do a more thorough investigation later today.

I have tried to do the same thing with DATA_W and such (see first post), but the count hasn't gone down. I guess I'm even more lost as to what is happening. Guess I'll have to bang my head on the decuda'd code for this one.
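For reference, that attempt looked roughly like this (a reconstruction, not the exact code):

// Hypothetical reconstruction of the DATA_W attempt: the same volatile
// trick, applied to the data dimensions instead of the kernel constants.
__global__ void convolve(float* result)
{
	const int idx = (blockIdx.y * blockDim.x * gridDim.x) + blockIdx.x * blockDim.x + threadIdx.x;

	volatile int ddw = DATA_W, ddh = DATA_H, ddd = DATA_D;

	const int z = idx / (ddw * ddh);
	const int y = (idx - z * ddw * ddh) / ddw;
	const int x = (idx - z * ddw * ddh - y * ddw);

	// ... rest of the kernel as in the first post, with ddd used in the
	// bounds check in place of DATA_D ...
}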