I have a small program that tries to calculate the infinite sum of 1/(n*log(n)*log(n)). I am using CUDA 2.1. When
I compile with -arch=sm_13 and without it, I get drastically different register usage:
nvcc -O3 -use_fast_math --ptxas-options="-v -O3" slowsum.cu -o slowsum
In file included from slowsum.cu:6:
nvcchelp.h:14:47: warning: backslash and newline separated by space
ptxas info : Compiling entry function '_Z20kernel_slowsum_floatliPf'
ptxas info : Used 8 registers, 4136+40 bytes smem, 4 bytes cmem[1]
nvcc -O3 -arch=sm_13 -use_fast_math --ptxas-options="-v -O3" slowsum.cu -o slowsum
In file included from slowsum.cu:6:
nvcchelp.h:14:47: warning: backslash and newline separated by space
ptxas info : Compiling entry function '_Z20kernel_slowsum_floatliPf'
ptxas info : Used 20 registers, 4136+40 bytes smem, 4 bytes cmem[1]
This change in register usage makes a significant difference in performance because of occupancy. My question
is: why does this happen?
Thanks,
Craig
[codebox]
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#include<string.h>
#include “nvcchelp.h”
#define ANSWER 2.109742801237
#define REALTYPE float
__shared__ REALTYPE sdata[1024];
__global__ void kernel_slowsum_float(long N, int width, REALTYPE *total) {
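// Each thread accumulates `width' consecutive terms of 1/(n*log(n)^2), starting at n = (global thread index)*width + 2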
REALTYPE n=(REALTYPE)((blockIdx.x*blockDim.x+threadIdx.x)*width)+2.0;
REALTYPE sum=0.0;
REALTYPE v;
for (int i=0;i<width;i++) {
v=log(n);
sum+=1.0/(n*v*v);
n+=1.0;
}
// Now do the reduction
unsigned int tid = threadIdx.x;
sdata[tid] = sum;
__syncthreads();
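// Tree reduction in shared memory: the stride s doubles each pass until the block's sum is left in sdata[0]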
for(unsigned int s=1; s < blockDim.x; s *= 2)
{
int index = 2 * s * tid;
if (index + s < blockDim.x) // blockDim.x is not a power of two here, so make sure sdata[index + s] holds a valid partial sum
{
sdata[index] += sdata[index + s];
}
__syncthreads();
}
// write result for this block to global mem
if (tid == 0) {
// printf("Working on %d %d %f\n", blockIdx.x,threadIdx.x,sdata[0]);
total[blockIdx.x] = sdata[0];
}
}
int main (int argc, char** argv) {
int deviceCount;
int dev=0;
cudaDeviceProp deviceProp;
/* Initialize */
cudaGetDeviceCount(&deviceCount);
if (deviceCount == 0) {
fprintf(stderr, "Error: No devices supporting CUDA.\n");
exit(1);
}
cudaGetDeviceProperties(&deviceProp, dev);
if (deviceProp.major < 1) {
fprintf(stderr,"Error: Device does not support CUDA\n");
exit(1);
}
printf("Using device: %d\n", dev, deviceProp.name);
cudaSetDevice(dev);
/* Setup the threads and grids */
long N=10;
int p=8;
for(int i=1;i<p;i++) N=N*10;
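/* Total number of terms: N = 10^p */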
int width=1000;
int block_size=500;
printf("Element size: %ld\n", N);
printf("Block size %ld\n", block_size);
printf("Grid size %ld\n", N/width/block_size);
dim3 dimBlock(block_size,1);
int grid_size=N/width/block_size;
dim3 dimGrid(grid_size,1);
printf("%d %ld %ld\n", block_size, N, N/width/block_size);
/* Kernel */
double t1;
t1=gettime();
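/* Host and device buffers for the per-block partial sums (one entry per block) */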
REALTYPE *total;
total=(REALTYPE *) malloc(sizeof(REALTYPE)*(grid_size));
for (int i=0;i<grid_size;i++) {
total[i]=0.0;
}
REALTYPE *d_total;
cudaMalloc((void **) &d_total, sizeof(REALTYPE)*(grid_size));
cudaMemcpy(d_total, total, sizeof(REALTYPE)*(grid_size), cudaMemcpyHostToDevice);
kernel_slowsum_float<<<dimGrid,dimBlock>>> (N,width,d_total);
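/* Copy the per-block partial sums back and finish the sum on the host */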
cudaMemcpy(total, d_total, sizeof(REALTYPE)*(grid_size), cudaMemcpyDeviceToHost);
REALTYPE sum=0.0;
for (int i=0;i<grid_size;i++) {
printf("Got %d %lf\n", i, total[i]);
sum+=total[i];
}
printf("Time to compute first kernel: %lf\n", gettime()-t1);
printf("What did I get: %lf\n", sum);
CUT_EXIT(argc, argv);
}
[/codebox]
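P.S. In case it is relevant: I could also try a single-precision-only version of the inner loop (logf() and f-suffixed literals instead of log() and double literals). This is just a sketch I have not benchmarked, so I don't know yet whether it changes the register count:
[codebox]
// Single-precision-only variant of the per-thread loop (untested sketch)
for (int i=0;i<width;i++) {
v=logf(n); // logf() keeps the logarithm in float
sum+=1.0f/(n*v*v); // f-suffixed literals avoid promotion to double
n+=1.0f;
}
[/codebox]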