Some questions about PTX and cubin. How to know where my registers are used for?


I have some questions about ptx code. My current code is using too many registers. I invoked (64, 2) threads for one block and each of the threads in my algorithm explicitly uses about 100 registers. But when I compile it with --ptxas-options=-v, it reports 121 registers plus 120 bytes local memory usage. Now I have the following questions:

  1. I looked at ptx code and there was no local memory usage. I mean I searched “.local” but found non. So is this simply because I use too many registers? But I use only 100 registers explicitly in my algorithm! Could other temporary variables take so many registers?

  2. The only local memory usage that can be observed is due to careless programming, right? (e.g. can’t determine the index of an array during compilation) Only in this case, we can clearly see there is local memory declared in ptx code. If the local memory usage is only due to too many registers used, we can’t tell this from ptx code alone.

  3. As I can see in ptx code, there are about 900 temporary registers declared. When compiled from ptx file to cubin file, it’s further optimized and thus most of these registers are eliminated, right?

  4. How can I know where my registers are used and what they are used for? Or which part of my code results in local memory usage? I can’t see any local memory usage from ptx code now.

  5. Decuda is only available before sdk 3.0 and after that, cubin file uses ELF format which disabled decuda. Is this correct? By the way, why I can’t find any decuda download available online now? This link ( seems unavailable any more.

  6. In programming guide page 17, there is a sample code like the followings. I wonder how to allocate a two dimensional array like A[N][N] on device memory? I know we can use linear memory instead and that’s why I think maybe it’s an error in the manual.


// Kernel definition

__global__ void MatAdd(float A[N][N], float B[N][N],

float C[N][N])


int i = blockIdx.x * blockDim.x + threadIdx.x;

int j = blockIdx.y * blockDim.y + threadIdx.y;

if (i < N && j < N)

C[i][j] = A[i][j] + B[i][j];


int main()



// Kernel invocation

dim3 dimBlock(16, 16);

dim3 dimGrid((N + dimBlock.x – 1) / dimBlock.x,

(N + dimBlock.y – 1) / dimBlock.y);

MatAdd<<<dimGrid, dimBlock>>>(A, B, C);


Sincerely waiting for your reply and your help. Thanks.

ptx uses an infinite number of registers. Register allocation is done and register spilling is introduced on compiling from ptx to cubin, so you won’t see the local memory use in ptx code (unless a variable is explicitly moved to local memory because it’s address is taken or it’s an array which is used with non-constant indices).

Apparently decuda has recently been updated for use with elf as well. However, I have no experience with that, still use it with old cubin files only.

  1. The compiler will use as many registers as it sees fit, for optimization/subexpression elimination purposes.

  2. Correct, as far as I know.

  3. nvcc generates PTX in what’s called static single assignment form, so the number of registers used in nvcc’s --ptx output bears no relation to the number of registers coming out of ptxas.

  4. Hard to say. You can try looking at the disassembly from decuda/nouveau for more details.

5a. I’ve written a script to make it possible to use decuda or the nouveau disassemblers (nv50dis/nvc0dis) with the ELF cubin files from CUDA 3.0/3.1:…t&p=1080943. I’ve only tried it on Linux, though.

5b. You can still get decuda from there, you just need to git clone the repository.

  1. A two-dimensional array with constant bounds like that is allocated row-major anyway, so you can do it with linear memory. Alternatively you could define a 2d shared or local array as long as N is known at compile time.

Oh, thank you so much. :-)