OpenCL and CUDA register usage optimization

I’m currently writing an OpenCL kernel (I suppose the same applies in CUDA), and I’m trying to optimize it for NVIDIA GPUs.

My kernel currently uses 63 registers. It is very big, so it uses all the available GPU registers. I’m looking for a way to:

  1. See which variables are kept in registers and which end up in global memory (it seems that when there are not enough registers, the compiler stores the remaining variables in global memory).

  2. Specify which variables are more important (i.e. which ones should stay in registers). Some of my variables are declared but rarely used, so is there a way to give priorities?

Are there other optimization strategies when all the registers are already in use?
So far I have tried the following:
a) using the ‘volatile’ keyword (to be used with care, and only in specific cases)
b) putting code in scope blocks, i.e. {…}, so that variables have only a limited scope
c) In CUDA we can use ‘cudaDeviceSetCacheConfig’, but I can’t find any corresponding function in OpenCL
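
To illustrate what I mean by (b), here is a minimal sketch. It is written as plain host C++ so that it compiles anywhere; `process`, `in` and `i` are hypothetical names and not my real kernel. In the real code the same braces sit inside the kernel function. The sketch only shows the structure, not whether the compiler actually ends up using fewer registers.

```cpp
// Hypothetical stand-in for a kernel body (assumption: the real code is an
// OpenCL/CUDA kernel; this is plain C++ purely for illustration).
static float process(const float* in, int i) {
    float acc = 0.0f;  // the only variable that lives across both stages

    {   // Stage 1: a and b exist only inside this block.
        float a = in[i] * 2.0f;
        float b = a + 1.0f;
        acc += a * b;
    }

    {   // Stage 2: c is declared only after a and b have gone out of scope.
        float c = in[i] - 3.0f;
        acc += c * c;
    }

    return acc;
}
```

The idea is that temporaries from one stage are no longer visible in the next one, so only `acc` has to stay alive across the whole function.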

BTW: I have also tried reading the PTX code and searching for all the “.reg” keywords, but the PTX is unreadable: I don’t know which register is used for which variable in my code. I haven’t found any way to get the correspondence!