Some questions about PTX and cubin: how can I tell what my registers are used for?

Hello,

I have some questions about PTX code. My current code uses too many registers. I launch (64, 2) threads per block, and each thread in my algorithm explicitly uses about 100 registers. But when I compile with --ptxas-options=-v, it reports 121 registers plus 120 bytes of local memory. Now I have the following questions:

  1. I looked at the PTX code and there was no local memory usage; I searched for “.local” but found none. Is this simply because I use too many registers? But I explicitly use only about 100 registers in my algorithm! Could other temporary variables take up so many registers?

  2. The only local memory usage that can be observed in PTX is due to careless programming, right? (e.g. an array index that can’t be determined at compile time) Only in that case can we clearly see local memory declared in the PTX code. If local memory is used only because too many registers are needed, we can’t tell that from the PTX code alone.

  3. As far as I can see in the PTX code, about 900 temporary registers are declared. When the PTX file is compiled to a cubin file, it is further optimized and most of these registers are eliminated, right?

  4. How can I tell where my registers are used and what they are used for? Or which part of my code results in local memory usage? I can’t see any local memory usage in the PTX code right now.

  5. Decuda only works with cubin files from before SDK 3.0; after that, the cubin file uses the ELF format, which breaks decuda. Is this correct? By the way, why can’t I find a decuda download online anymore? This link (http://github.com/laanwj/decuda/downloads) seems to be unavailable now.

  6. On page 17 of the programming guide there is a sample like the following. I wonder how to allocate a two-dimensional array like A[N][N] in device memory? I know we can use linear memory instead, which is why I think this may be an error in the manual.

[codebox]
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation
    dim3 dimBlock(16, 16);
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                 (N + dimBlock.y - 1) / dimBlock.y);
    MatAdd<<<dimGrid, dimBlock>>>(A, B, C);
}
[/codebox]

I'm sincerely looking forward to your reply and your help. Thanks.

PTX uses an unlimited number of virtual registers. Register allocation is done and register spilling is introduced when compiling from PTX to cubin, so you won’t see the local memory use in the PTX code (unless a variable is explicitly placed in local memory because its address is taken or it’s an array accessed with non-constant indices).
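To make that second case concrete, here is a minimal sketch (my own made-up example, not your code) of a kernel whose per-thread array is indexed with a value only known at run time; such an array typically ends up in local memory and shows up as an explicit .local declaration in the PTX:

[codebox]
// Hypothetical example: an array indexed with a runtime value usually cannot
// be kept in registers and is placed in local memory instead. Compiling with
// --ptxas-options=-v should report the resulting lmem usage, and the PTX
// should contain an explicit .local declaration for buf.
__global__ void dynamicIndexExample(const int *in, float *out)
{
    float buf[16];                    // per-thread array
    int k = in[threadIdx.x] & 15;     // index only known at run time

    for (int i = 0; i < 16; ++i)
        buf[i] = i * 0.5f;

    out[threadIdx.x] = buf[k];        // non-constant index -> likely local memory
}
[/codebox]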

Apparently decuda has recently been updated to work with ELF as well. However, I have no experience with that; I still use it with old cubin files only.

  1. The compiler will use as many registers as it sees fit, for optimization/subexpression-elimination purposes (a small sketch after this list illustrates this).

  2. Correct, as far as I know.

  3. nvcc generates PTX in what’s called static single assignment form, so the number of registers used in nvcc’s --ptx output bears no relation to the number of registers coming out of ptxas.

  4. Hard to say. You can try looking at the disassembly from decuda/nouveau for more details.
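As an illustration of point 1 (again a made-up sketch, not the original code): the compiler may hold a common subexpression in an extra temporary register instead of recomputing it, so the final register count can exceed what the programmer declared explicitly.

[codebox]
// Hypothetical sketch: x*y appears twice, so the compiler will typically
// evaluate it once and keep the result in an extra temporary register,
// on top of the variables declared explicitly in the source.
__global__ void cseExample(float *out, float x, float y)
{
    float a = x * y + 1.0f;   // x*y computed here...
    float b = x * y - 1.0f;   // ...and reused here from a temporary register
    out[threadIdx.x] = a * b;
}
[/codebox]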

5a. I’ve written a script to make it possible to use decuda or the nouveau disassemblers (nv50dis/nvc0dis) with the ELF cubin files from CUDA 3.0/3.1: http://forums.nvidia.com/index.php?showtop…t&p=1080943. I’ve only tried it on Linux, though.

5b. You can still get decuda from there, you just need to git clone the repository.

  6. A two-dimensional array with constant bounds like that is laid out contiguously in row-major order anyway, so you can do it with linear memory (a sketch follows below). Alternatively, you could define a 2D shared or local array, as long as N is known at compile time.
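Here is a minimal sketch of the linear-memory approach (my own example with made-up names, untested): allocate N*N floats with cudaMalloc and index them row-major as row * N + col.

[codebox]
// Minimal sketch (assumed names, not from the manual): the same matrix add
// using flat linear device memory indexed row-major as row * N + col.
#include <cuda_runtime.h>

#define N 256

__global__ void MatAddLinear(const float *A, const float *B, float *C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[j * N + i] = A[j * N + i] + B[j * N + i];
}

int main()
{
    float *dA, *dB, *dC;
    size_t bytes = N * N * sizeof(float);
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);
    // ... fill dA and dB with cudaMemcpy from host data ...
    dim3 dimBlock(16, 16);
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                 (N + dimBlock.y - 1) / dimBlock.y);
    MatAddLinear<<<dimGrid, dimBlock>>>(dA, dB, dC);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
[/codebox]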

Oh, thank you so much. :-)