How to force variables into registers, local memory or shared memory?

Hi everybody,

I am seeing some “strange” behaviour from the CUDA compiler concerning memory usage.

Well, I have a kernel that fails to execute with “too many resources requested for launch”; the cubin file contains:

code {
    name = _Z6stereoPfS_PKhS1_j
    lmem = 16
    smem = 48
    reg = 27
    bar = 0
    bincode {

    }
    const {
        segname = const
        segnum = 1
        offset = 0
        bytes = 20
        mem {

        }
    }
}

The problem is the number of registers required per thread.
Once I added a “fake” instruction that assigns a variable to itself (bestd[pix] = bestd[pix]), the kernel executes perfectly, and the cubin file contains:

code {
    name = _Z6stereoPfS_PKhS1_j
    lmem = 16
    smem = 48
    reg = 18
    bar = 0
    bincode {

    }
    const {
        segname = const
        segnum = 1
        offset = 0
        bytes = 24
        mem {

        }
    }
}
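(For reference, the same per-kernel figures can be printed at compile time instead of reading them out of the cubin, by passing the verbose flag through to ptxas; the file name below is just a placeholder:)

nvcc --ptxas-options=-v -c stereo.cu

ptxas then prints the register, lmem and smem usage of every kernel in the file.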

How can a “fake” instruction change the register usage of a kernel? Could you please help me!

Thanks

Maybe you had that instruction inserted inside a loop which was unrolled by the compiler.

Indeed, the instruction I added is in a for loop (4 iterations). I tried to force the compiler not to unroll the loop (#pragma unroll 1), but nothing changed!
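(For completeness, the pragma has to sit directly in front of the loop it applies to. A minimal made-up example of the placement; sum4, data and out are placeholder names, not my real code:)

__global__ void sum4(const float *data, float *out)
{
    float acc = 0.0f;

    #pragma unroll 1                      // ask the compiler not to unroll this loop
    for(int pix=0; pix<4; pix++)
        acc += data[threadIdx.x*4 + pix];

    out[threadIdx.x] = acc;
}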

Not sure #pragma unroll 1 would stop unrolling. But anyway, 4 iterations is very few… that should not cause the drastic change you are seeing.

Or is your code in a non-global (__device__) function that you call from many places in the global kernel?
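(If so, that could matter: on these 1.x devices, __device__ functions are inlined at every call site, so each call adds its temporaries to the caller's register count. A made-up sketch; the real signature of diff() in your code may well be different:)

// assumed signature, for illustration only
__device__ float diff(float a, float b)
{
    float t = a - b;                // every inlined call needs registers for its temporaries
    return t < 0.0f ? -t : t;       // e.g. an absolute difference
}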

Anyway, if you are sure that only one C statement has been effectively added, I think what you are seeing is extremely strange. Only an NVIDIA compiler guy can answer this baffling situation.

Here is the general structure of my kernel function. I have two loops of 4 iterations that do almost the same thing. If I comment out one or both of the “strange” instructions, the kernel fails to launch because of the jump in register usage (from 18 to 27)!

__global__ void stereo( float *disparityLeft,
                        float *disparityRight,
                        const unsigned char *left,
                        const unsigned char *right,
                        size_t width )
{
    (variable declarations and loading from device memory...)

    for(int pix=0; pix<4; pix++) {
        ...
        for(d=STEREO_MIND; d<=STEREO_MAXD; d++) {
            sum=0;
            for(is=ii-STEREO_RADIUS_W; is<=ii+STEREO_RADIUS_W; is++) {
                sum += diff(sleft[is],sright[is-d]);
            }
            ...
            __syncthreads();
            for(i=-STEREO_RADIUS_H; i<=STEREO_RADIUS_H; i++,is+=STEREO_THREADS_W) {
                ...
            }
            if(sum < bestcs) {
                bestcs = sum;
                bestd[pix] = d-1;
            }
        }
        // the "strange" instruction: when I comment this line out, the number of registers used by the kernel increases by 9!
        bestd[pix] = bestd[pix];
    }

    (Write results to shared memory.)

    for(int pix=0; pix<4; pix++) {
        ...
        for(d=STEREO_MIND; d<=STEREO_MAXD; d++) {
            sum=0;
            for(is=ii-STEREO_RADIUS_W; is<=ii+STEREO_RADIUS_W; is++) {
                sum += diff(sleft[is+d],sright[is]);
            }
            ...
            __syncthreads();
            for(i=-STEREO_RADIUS_H; i<=STEREO_RADIUS_H; i++,is+=STEREO_THREADS_W) {
                ...
            }
            if(sum < bestcs) {
                bestcs = sum;
                bestd[pix] = d-1;
            }
        }
        // the "strange" instruction: when I comment this line out, the number of registers used by the kernel increases by 9!
        bestd[pix] = bestd[pix];
    }

    (Write results to shared memory.)
}
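(Coming back to the thread title: as far as I understand, placement can only be steered indirectly in CUDA C. A minimal made-up sketch of the three cases; placementSketch and its variables are illustrative, not taken from my real kernel:)

__global__ void placementSketch(float *out)
{
    float bestcs = 1e30f;          // plain scalar local: normally kept in a register
    float bestd[4];                // local array: usually spilled to local memory (lmem)
                                   // when it is indexed with a value unknown at compile time
    __shared__ float tile[64];     // explicitly placed in shared memory (smem)

    int tid = threadIdx.x;
    tile[tid & 63] = bestcs;
    __syncthreads();

    for(int pix=0; pix<4; pix++)                // if this loop gets unrolled, the indices become
        bestd[pix] = tile[(tid + pix) & 63];    // constants and bestd[] may live in registers instead

    out[tid] = bestd[tid & 3];     // a truly dynamic index keeps the array in local memory
}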

Hello! I have a similar strange situation (I already posted a topic about it). I have a kernel that works properly until I add the definition of a fake variable
const FMatrix p = divuA12+divuA22;
This p is never used later in the kernel, yet with that declaration the kernel doesn't produce any computations. It's also strange that in emulation mode everything is fine.

It’s also interesting that before the declaration of p the cubin had
lmem = 2856
smem = 24
reg = 32

and after the declaration it has
lmem = 3156
smem = 24
reg = 34

The kernel is launched with 256 threads per block and about 1000 blocks.

You are exhausting the register resources on the device: 34 * 256 = 8704 > 8192 (the number of registers per multiprocessor). If you check for errors after your kernel call, it should report “too many resources requested for launch”.
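A minimal check looks something like this (dummy plus the 1000-block / 256-thread launch are placeholders standing in for your real kernel and sizes):

#include <cstdio>

__global__ void dummy(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}

int main()
{
    float *d_out;
    cudaMalloc((void**)&d_out, 1000 * 256 * sizeof(float));

    dummy<<<1000, 256>>>(d_out);

    cudaError_t err = cudaGetLastError();    // reports launch failures such as
    if(err != cudaSuccess)                   // "too many resources requested for launch"
        printf("launch failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}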

As to exactly why adding an unused variable to your kernel increases the register count, I can’t say; you must be confusing the dead-code elimination or register allocation passes. If you post the full code, an NVIDIA rep may file a bug report with the compiler team.
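One possible workaround, if the extra registers cannot be avoided at the source level, is to cap the per-thread register count at compile time and let the excess spill to local memory; the file name below is a placeholder:

nvcc --maxrregcount=32 --ptxas-options=-v -c kernel.cu

(32 registers * 256 threads = 8192, which just fits into one multiprocessor.)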