Apparently the -cubin option for nvcc, and --ptxas-options="-v", report the number of registers used per thread by a kernel.

But how is this calculated?

And are the variables that use those registers listed anywhere in any output?

Firstly - we don’t need all our thread titles in caps. It’s annoying.

You can either examine the PTX assembly (-ptx compiler flag), or I think decuda may be what you want (I haven't actually used it myself).

OK, sorry for the caps. I wanted attention drawn to my problems ASAP because I've been trying to sort this problem out for a while now. There is me, a Tesla machine and this forum. That's it. I do not have any work colleagues or supervisors to ask. I've been given a Tesla to program as part of my PhD without any local human help on hand.

Yes, that decuda looks interesting and could be what I want. Thanks.

The ptx code emitted by nvcc will appear to use a very large number of registers because the compiler is using static single assignment form. Every register in the ptx file is assigned only once, because this makes some optimizations easier to apply.

The job of ptxas is to take the ptx code and convert it to a cubin file, collapsing the register usage to the minimum possible. decuda reverses this, and will show you human-readable code with the actual register assignment.

Only utterly unoptimized compilers have a direct 1-1 correspondence between variables and registers.

P.S. Just curious: How does this info move you past a stumbling block?

The stumbling block was that I did not know that my CUDA program was returning an error message that told me that with 400 particles there were not enough resources to launch the kernel. I did not know how to catch this error, or that such an error existed. Now I know this I can amend my code so that I use more shared memory instead of registers, and/or more than one kernel.
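For anyone hitting the same wall: a minimal sketch of catching that launch error with the runtime API (the kernel name and launch configuration here are placeholders, not the poster's actual code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* ... */ }

int main() {
    float *d_data = 0;
    cudaMalloc((void **)&d_data, 400 * sizeof(float));

    // 400 threads in a single block -- this launch can fail if the
    // kernel needs more registers than the multiprocessor can supply.
    myKernel<<<1, 400>>>(d_data);

    // Launch failures such as "too many resources requested for launch"
    // are reported after the fact, so check right after the launch:
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}
```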

Your reaction suggests that you have only 1 block. To make efficient use of the GPU, you need to have lots of blocks (thousands, say).

That's not really true. Due to the way DDR works, my kernel's performance drops twofold if I have more than one block per multiprocessor. But yes, obviously you should have more than one block in total.

Well, I would call it a very important general guideline. Anybody who tailors his CUDA code to the specific GPU he has now has made CUDA code that does not scale. If you designed for an 8800GTX (16 MPs) and you move to a GTX280 (30 MPs), you get something like half the possible performance.

You probably have a relatively small problem; in that case I can imagine you can indeed gain performance by deviating from such a guideline. (Makes mental note to check one of my kernels for performance-gain possibilities ;))

Yes, at the moment I am using only one block of 400 particles. This is because a book I am currently reading gives a small example of a fluid flow for 400 particles, so to help understand both the mathematical algorithm and the Tesla it seemed like a good idea to implement that (I understand that the max threads for one block is 512).

Eventually I will probably be working with millions of particles, and increasing the number of multiprocessors to 5 or 6, so I expect grid dimensions will play a massive role in performance. But for now I'm just understanding how to (a) write CUDA code, (b) compile it, (c) get the correct results, which are compared to a run on a single CPU on a different machine.

The next problem I will be looking at is a larger and more complex example of fluid flow which will allow many experiments on variable grid dimensions.

512 is the maximum number of threads in a block. In reality, the limit depends on the resource usage of your kernel. If your kernel uses more than 8192/512 = 16 registers, you cannot use 512 threads per block, because a multiprocessor has only 8192 registers available.

A multiprocessor can actually support 768 threads, so if you made your thread blocks 256 threads each, you could run 3 blocks per multiprocessor (but then your kernel cannot use more than 10 registers, since 8192/768 = 10.67).

These numbers are different for GT200: there you can have 1024 threads per MP, and 16384 registers available.

So for your problem it is probably wise to make thread blocks of 256 threads. Then you use 2 blocks and do something like the following in your kernel:

unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
if (index < num_particles) {
    // ... per-particle work ...
}
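For completeness, a sketch of the matching host-side launch, assuming a kernel named updateParticles and a device pointer d_pos (both names hypothetical):

```cuda
int num_particles = 400;
int threadsPerBlock = 256;

// Round up so all particles are covered: (400 + 255) / 256 = 2 blocks.
// The index guard in the kernel skips the 112 surplus threads.
int numBlocks = (num_particles + threadsPerBlock - 1) / threadsPerBlock;

updateParticles<<<numBlocks, threadsPerBlock>>>(d_pos, num_particles);
```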

Yeah, that Tesla was a big waste of money seeing as a gtx 260 would be doing you a lot of good right about now.

Btw, you didn’t answer my question in the other thread. Have you tried -maxrregcount=N ?

I also want to calculate the number of registers per thread. How do I do it? How do I use --ptxas-options?

nvcc --ptxas-options -v ???

Just add --ptxas-options="-v" to your normal compile job:
nvcc -o foo foo.cu --ptxas-options="-v"


If you're asking me the question about maxrregcount, then no, I have not. Not yet, anyway. I did some brief reading up on what it does, and I understand it. But I think I'm best off amending my code so that multiple blocks are used, which I'm going to have to do anyway if I will eventually be using millions of threads/particles (and which will require more than one kernel), and trying to get the profiler working so I can learn how to tune the code. At the moment the profiler does not work, in that no profiler output file is produced, or if it is, it is not in the working directory.

nvcc -o yourfile.exe yourfile.cu --ptxas-options="-v"

It prints out the number of registers used per thread for each kernel, plus the constant and shared memory use per kernel.

In visual studio, right-click the main .cu file, go to properties, and edit the Custom Build Step.

That’s a good strategy. Btw, the profiler is not a “real” profiler, technically, although its info is very useful. Use the clock() function and some macros to do real profiling.
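As a sketch of what that clock()-based approach can look like, here is a minimal device-side timing pattern (kernel name and the idea of storing one elapsed count per block are my assumptions, not a fixed recipe):

```cuda
__global__ void timedKernel(float *data, clock_t *elapsed) {
    clock_t start = clock();

    // ... the work you want to time ...

    // clock() returns a per-multiprocessor cycle counter, so the
    // difference is only meaningful within one block; let a single
    // thread record it for this block.
    if (threadIdx.x == 0)
        elapsed[blockIdx.x] = clock() - start;
}
```

Divide the recorded cycle count by the device's clock rate to get a rough wall time for the timed region.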