How to force the compiler to use registers instead of shared memory?

For example, if part of my code is as follows:

__shared__ float sh[N];
register float tmp = sh[i];
a[0] += tmp;
a[1] += tmp;
a[2] += tmp;
a[3] += tmp;

I want to load sh[i] into a register first, and then use that register for the computation that follows. However, according to the file generated by decuda, no register is used; instead, every reference to “tmp” in the source code is replaced by a shared memory access. I checked the PTX file, and a register is used there, so I assume everything is fine up to the PTX that nvcc emits, and it is ptxas that replaces the register with shared memory accesses.

Any idea how to force the use of a register? Thanks a lot! It is very important for me!

Maybe a volatile on the register declaration would do? Or a __threadfence_block() after reading the sh[i] value?
This is actually strange, because I believed that shared values are cached in registers instead…
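For reference, the two suggestions above might look roughly like the sketch below. The kernel name accumulate, the array size N and the host code are made up for illustration, and nothing here guarantees that ptxas keeps tmp in a register; the generated machine code still has to be inspected with a disassembler.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 64  // hypothetical shared array size, not from the original post

__global__ void accumulate(const float* in, float* out, int i)
{
    __shared__ float sh[N];
    sh[threadIdx.x] = in[threadIdx.x];
    __syncthreads();

    float a[4] = {0.0f, 1.0f, 2.0f, 3.0f};

    // Suggestion 1: mark the local copy volatile so the compiler is
    // discouraged from folding it back into the shared memory operand.
    volatile float tmp = sh[i];

    // Suggestion 2 (alternative): put a fence right after the read.
    // __threadfence_block();

    a[0] += tmp;
    a[1] += tmp;
    a[2] += tmp;
    a[3] += tmp;

    for (int k = 0; k < 4; ++k)
        out[4 * threadIdx.x + k] = a[k];
}

int main()
{
    float h_in[N], h_out[4 * N];
    for (int k = 0; k < N; ++k) h_in[k] = (float)k;

    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    accumulate<<<1, N>>>(d_in, d_out, 3);  // every thread reads sh[3]
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    printf("out[0..3] = %f %f %f %f\n", h_out[0], h_out[1], h_out[2], h_out[3]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```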

Maybe you are at the limit of register usage?

Thanks for your reply, but unfortunately, neither works. Besides, __threadfence_block() has the annoying side effect of preventing the compiler from unrolling. I am using the GTX 200 series, and I am sure there are plenty of registers to use. It’s a shame we can’t write assembly code ourselves.

There is a way, but it is not officially supported.

How do you propose to do it?

Check out the third-to-last and fourth-to-last posts in this thread:
http://forums.nvidia.com/index.php?showtopic=161368

Remember, however, that even if you inline some PTX assembly code, it is still subject to the optimisations applied when PTX is translated to machine code, so it may get transformed.
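As a rough illustration of what inlining PTX for this load could look like, here is a sketch using the asm() syntax of current CUDA toolkits. The helper load_shared_f32, the kernel name and N are hypothetical, and __cvta_generic_to_shared is assumed to be available (it is in recent toolkits, not necessarily the one discussed in this thread). As noted above, ptxas may still rewrite the result.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 64  // hypothetical shared array size

// Hypothetical helper: load one float from shared memory via inline PTX.
__device__ __forceinline__ float load_shared_f32(const float* p)
{
    // Convert the generic pointer to an address in the shared state space.
    unsigned addr = static_cast<unsigned>(__cvta_generic_to_shared(p));
    float v;
    asm volatile("ld.shared.f32 %0, [%1];" : "=f"(v) : "r"(addr) : "memory");
    return v;
}

__global__ void accumulate_ptx(const float* in, float* out, int i)
{
    __shared__ float sh[N];
    sh[threadIdx.x] = in[threadIdx.x];
    __syncthreads();

    float a[4] = {0.0f, 1.0f, 2.0f, 3.0f};
    float tmp = load_shared_f32(&sh[i]);  // explicit ld.shared into a register

    a[0] += tmp;
    a[1] += tmp;
    a[2] += tmp;
    a[3] += tmp;

    for (int k = 0; k < 4; ++k)
        out[4 * threadIdx.x + k] = a[k];
}

int main()
{
    float h_in[N], h_out[4 * N];
    for (int k = 0; k < N; ++k) h_in[k] = (float)k;

    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    accumulate_ptx<<<1, N>>>(d_in, d_out, 3);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", h_out[0]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```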

You can use cudasm to do this:

mov tmp, sh[i];
add a[0], a[0], tmp;
add a[1], a[1], tmp;
add a[2], a[2], tmp;
add a[3], a[3], tmp;

I suggest a workaround that I used in this thread:

http://forums.nvidia.com/index.php?showtopic=159033

See especially section 4.2 of the report.

Step 1: change “float tmp = sh[i];” to “float tmp = sh[i] * 4.0f;”.

nvcc then has no choice but to translate it to

MUL tmp, sh[i], 4.0f

(you can use decuda to check this).

float tmp = sh[i] * 4.0f;
a[0] += tmp;
a[1] += tmp;
a[2] += tmp;
a[3] += tmp;

Step 2: use decuda to generate the .asm file.

Step 3: find the assembly code corresponding to “float tmp = sh[i] * 4.0f;”.

It should have a form like

mul.rn.f32 $r1, s[0x0050], c1[0x0040]

Then change that instruction to

lsd.b32 $r1, s[0x0050]

or

mov.b32 $r1, s[0x0050]

(please see the report for a detailed description).

Step 4: use cudasm to assemble the modified .asm file into a binary file.

Step 5: use the driver API to load the binary file.

If you want to use this method, you need the patience to take care of each step.
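Step 5 could be sketched as below with the driver API. This is only an outline under assumptions: the file name kernel.cubin, the kernel name accumulate (exported with extern "C" so the name is not mangled) and its parameter list (input pointer, output pointer, index) are hypothetical. It also uses cuLaunchKernel from current toolkits; older toolkits launched kernels through a different set of driver calls.

```cuda
#include <cstdio>
#include <cuda.h>

#define N 64  // hypothetical thread count and array size

// Abort main() with a message if a driver API call fails.
#define CHECK(call)                                              \
    do {                                                         \
        CUresult r_ = (call);                                    \
        if (r_ != CUDA_SUCCESS) {                                \
            const char* msg_ = NULL;                             \
            cuGetErrorString(r_, &msg_);                         \
            fprintf(stderr, "%s failed: %s\n", #call,            \
                    msg_ ? msg_ : "unknown error");              \
            return 1;                                            \
        }                                                        \
    } while (0)

int main()
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
    CHECK(cuCtxSetCurrent(ctx));

    // Load the hand-modified binary (hypothetical file and kernel names).
    CHECK(cuModuleLoad(&mod, "kernel.cubin"));
    CHECK(cuModuleGetFunction(&fn, mod, "accumulate"));

    float h_in[N], h_out[4 * N];
    for (int k = 0; k < N; ++k) h_in[k] = (float)k;

    CUdeviceptr d_in, d_out;
    CHECK(cuMemAlloc(&d_in, sizeof(h_in)));
    CHECK(cuMemAlloc(&d_out, sizeof(h_out)));
    CHECK(cuMemcpyHtoD(d_in, h_in, sizeof(h_in)));

    int i = 3;  // index into the shared array, as in sh[i]
    void* args[] = { &d_in, &d_out, &i };

    // One block of N threads, no dynamic shared memory, default stream.
    CHECK(cuLaunchKernel(fn, 1, 1, 1, N, 1, 1, 0, NULL, args, NULL));
    CHECK(cuCtxSynchronize());
    CHECK(cuMemcpyDtoH(h_out, d_out, sizeof(h_out)));

    printf("out[0] = %f\n", h_out[0]);

    CHECK(cuMemFree(d_in));
    CHECK(cuMemFree(d_out));
    CHECK(cuModuleUnload(mod));
    CHECK(cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```

The program has to be linked against the driver library (for example with -lcuda), and the binary must have been built for the architecture of the device it is loaded on.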