Why is code compiled with -arch=sm_13 slower?

I have a small code that tries to calculate the infinite sum of 1/(n*log(n)*log(n)). I am using CUDA 2.1. When

I compile with -arch=sm_13 and without it, I get drastically different register usages:

nvcc -O3 -use_fast_math --ptxas-options="-v -O3" slowsum.cu -o slowsum

In file included from slowsum.cu:6:

nvcchelp.h:14:47: warning: backslash and newline separated by space

ptxas info : Compiling entry function '_Z20kernel_slowsum_floatliPf'

ptxas info : Used 8 registers, 4136+40 bytes smem, 4 bytes cmem[1]

nvcc -O3 -arch=sm_13 -use_fast_math --ptxas-options="-v -O3" slowsum.cu -o slowsum

In file included from slowsum.cu:6:

nvcchelp.h:14:47: warning: backslash and newline separated by space

ptxas info : Compiling entry function '_Z20kernel_slowsum_floatliPf'

ptxas info : Used 20 registers, 4136+40 bytes smem, 4 bytes cmem[1]

This change in register usage makes a significant difference in performance because of occupancy. My question is why does this happen?








#include "nvcchelp.h"

#define ANSWER 2.109742801237

#define REALTYPE float

__shared__ REALTYPE sdata[1024];

__global__ void kernel_slowsum_float(long N, int width, REALTYPE *total) {

REALTYPE n=(REALTYPE)((blockIdx.x*blockDim.x+threadIdx.x)*width)+2.0;

REALTYPE sum=0.0;


    for (int i=0;i<width;i++) {
            sum += 1.0/(n*log(n)*log(n));
            n += 1.0;
    }

// Now do the reduction
    unsigned int tid = threadIdx.x;

    sdata[tid] = sum;
    __syncthreads();

    for(unsigned int s=1; s < blockDim.x; s *= 2) {
            int index = 2 * s * tid;

            if (index < blockDim.x)
                        sdata[index] += sdata[index + s];
            __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) {
    //      printf("Working on %d %d %f\n", blockIdx.x,threadIdx.x,sdata[0]);
            total[blockIdx.x] = sdata[0];
    }
}



int main (int argc, char** argv) {

int deviceCount;

    int dev=0;

    cudaDeviceProp deviceProp;

/* Initialize */
    cudaGetDeviceCount(&deviceCount);

    if (deviceCount == 0) {
            fprintf(stderr, "Error: No devices supporting CUDA.\n");
            exit(1);
    }

    cudaGetDeviceProperties(&deviceProp, dev);

    if (deviceProp.major < 1) {
            fprintf(stderr,"Error: Device does not support CUDA\n");
            exit(1);
    }

    printf("Using device: %d %s\n", dev, deviceProp.name);


/* Setup the threads and grids */

    long N=10;

    int p=8;

    for(int i=1;i<p;i++) N=N*10;

    int width=1000;

    int block_size=500;

printf("Element size: %ld\n", N);

    printf("Block size %d\n", block_size);

    printf("Grid size %ld\n", N/width/block_size);

    dim3 dimBlock(block_size,1);

    int grid_size=N/width/block_size;

    dim3 dimGrid(grid_size,1); 

printf("%d %ld %ld\n", block_size, N, N/width/block_size);

    /* Kernel */

    double t1 = gettime();


REALTYPE *total;

    total=(REALTYPE *) malloc(sizeof(REALTYPE)*(grid_size));

    for (int i=0;i<grid_size;i++) {
            total[i] = 0.0;
    }

REALTYPE *d_total;

    cudaMalloc((void **) &d_total, sizeof(REALTYPE)*(grid_size));

    cudaMemcpy(d_total, total, sizeof(REALTYPE)*(grid_size), cudaMemcpyHostToDevice);

    kernel_slowsum_float<<<dimGrid,dimBlock>>> (N,width,d_total);

    cudaMemcpy(total, d_total, sizeof(REALTYPE)*(grid_size), cudaMemcpyDeviceToHost);


    REALTYPE sum=0.0;

    for (int i=0;i<grid_size;i++) {
            printf("Got %d %lf\n", i, total[i]);
            sum += total[i];
    }

printf("Time to compute first kernel: %lf\n", gettime()-t1);

    printf("What did I get: %lf\n", sum);

CUT_EXIT(argc, argv);
}

Because all of your constants are doubles, you’re going to have a lot of double precision operations. Append an f to your constants to make sure they’re floats. (This is true on CPUs too.)

Wow. That fixed the register usage. I find that really surprising, but I am glad that fixed it.


To store a double, two registers must be used together. That can inflate your register count very quickly.

It would be really nice if they changed the default behavior of constants to float, and had you explicitly declare doubles like 3.1415926535d or something.

Why would it be really nice?

The concept of unadorned floating point constants being implicitly double is rooted in the ANSI C89 standard and clearly understood by everyone who ever actually read it or any of the standards that have followed it. This isn’t a peculiarity of nvcc, it is a basic feature of the C language. Every standards-compliant C compiler works this way.

Why should nvidia go out of their way to make nvcc ignore the basic behaviour of standard C just to accommodate the sloppy programming habits of those that do not understand the language they are working in?

On a slightly less offensive note (hey, I agree with you in principle ;)), it may be time to consider supporting a less low-level language in addition to C for CUDA. If you look through the forum, a lot of people struggle more with C than with CUDA. Heck, there seem to be people hell-bent on learning CUDA as their first language. On one hand, this is cool for CUDA, as more users are always a plus. On the other hand, CUDA is not really designed to be used by people who do not exactly understand what is going on both at the hardware level and at the compiler level. So I suspect there will be a lot of people rejecting CUDA simply because C is not a language you pick up on a rainy afternoon.

Strictly speaking, there is no need to understand pointers, the C++ object model, or even CUDA’s memory model to write some cuda code. You could set up your language so that transferring objects and structs between host and device “just works”. You could make doubles the special case and not the default…

But then again, that would probably mean designing YAPL…

Maybe something that cross-compiles to OpenCL would be a good idea. Hm.

After comparing the CUDA syntax and the OpenCL syntax on wiki, I have to say that CUDA is the winner. Even though it doesn’t have the word “open” in its name, I find the runtime API code easier to understand than the example on wiki. I’m really hoping that ATi starts using the CUDA language :D. I can always hope.

Because in 20 years of C programming on multiple platforms I have never had to worry about this. If the variable is float, the compiler should figure out that the constant is float. Or the compiler should complain when you try to cast a variable to a different type and give me a warning. Since it is critical to performance in CUDA, this would be very helpful for making sure I am not making mistakes.

Uh, yeah you do. You don’t in x86 because the only time it matters is when you are dealing with SSE and therefore have to pack and explicitly cast everything yourself anyway, but on any platform where you have dramatically different performance between single and double you will encounter this.

CUDA is as close to C as we can reasonably make it (okay actually it’s as close to the C subset of C++ plus templates as we can reasonably make it), so we are going to follow standard C behavior here. “Doesn’t act like x86” is not a sensible reason for changing the behavior.

To stir the mix a little more, I’ll just chip in that Fortran does the opposite - all (floating point) literal constants are assumed to be single precision, unless tagged otherwise. This can cause confusion, as people specify pi to 20 decimal places, and you have to tell them that everything after the first six was thrown away (prior to promotion back up to double precision…).

I strongly support the stance that CUDA should follow the standard C practice in this (well, if [font=“Courier New”]nvfc[/font] is ever released, that should follow standard Fortran practice). If it doesn’t you have a mess - what happens to the literals in the host code, which just get handed off to the system compiler? Is [font=“Courier New”]nvcc[/font] supposed to quietly re-write them to be ‘correct’ (if that can even be determined)?

Actually, you should have been worrying about it. C has very clearly defined constant typing and casting behaviour, and if you are expecting that the compiler will “nanny” you and introduce hidden, silent casting, you probably need to find a different language to program in. What you are proposing fundamentally breaks every line of standards-compliant C code using double precision arithmetic that has been written since about 1980.

It really shouldn’t. You cannot have it both ways. Either the compiler permits the full range of C casting behaviour at the behest (and responsibility) of the programmer, or it doesn’t permit any. You can have C90 or Java. Take your pick.

I don’t have to in x86 because SSE is not part of the C standard; it is the job of the compiler to use it. I could write assembly code, but that isn’t portable (having NVIDIA make CUDA portable to multi-core/non-x86 architectures can be left for another post). Yes, I technically write 0.5 instead of 0.5f, but basic compiler analysis should pick that up and do the correct thing (just treat it as a float).

The issue isn’t x86. I have never had to explicitly specify on any architecture (Sparc, Alpha, Itanium, x86, x86_64). I just talked to another guy who has been doing HPC longer than I have (ie. he cares about performance) and it has never been an issue.

The issue isn’t what the standard says, it is what analysis the compiler can do. If I have the code:

float x;

x = 0.0;

The compiler should see that I am setting the value to a float and assume that I mean 0.0f. Even if it doesn’t, it could detect that I am trying to assign a double to a float, which will cause a loss of precision, and generate a warning for not typecasting the variable correctly.

And I have now learned my lesson that in CUDA I should always set the types of my floats. I am ok with that.


I have learned my lesson. I get it. However, I am curious about your last comment. If I do the following:

float x;

double y;

x = y;

I can get most compilers to tell me that I have a casting problem. What is the difference between y being a variable and a constant? What am I trying to have both ways? That I want auto-casting for constants but also want the warnings?

Ignoring the efficiency of autovectorization for a moment, in standard x86 floating point there’s no difference between float and double except what the x87 results are cast to at the end of each (chain of) computations. Every actual computation in x87 is 80-bit; just the storage formats differ. The compiler is picking this up in CUDA; it’s just that if you multiply a float by a double, the C spec actually says that you want a double-precision multiply. It’s not doing the wrong thing.

And for maximum confusion, GCC defaults on x86-64 platforms to using the SSE registers and instructions even for scalar calculations because it avoids the bizarre x87 register stack architecture. So on those platforms, the intermediate calculations really are only 64-bit. I don’t know if that means that intermediate calculations for single precision are correspondingly 32-bit on x86-64. (There is no speed difference when using SSE in this scalar mode, so I don’t think it matters as far as performance is concerned.)

Haha, I love compilers. x86 has a lot of weird archaic design considerations and lots of other things behave differently!