double precision

Hello,

if I compile the code below with the “double” type, I obtain the following error messages:

pgcc -o c2.exe c2.c -ta=nvidia,cc11 -Minfo=accel -fast
NOTE: your trial license will expire in 0 days, 9.13 hours.
main:
     60, Accelerator region ignored
     63, Accelerator restriction: invalid loop
     64, Accelerator restriction: datatype not supported: s

What can I do? Thanks a lot!

In [1] one can read that double precision is supported and that CUDA 64-bit code can be “looped” on 32-bit processors. I use pgi9-0.2, and the graphics card details read as follows:

pgaccelinfo
Device Number:                 0
Device Name:                   GeForce 9800 GTX/9800 GTX+
Device Revision Number:        1.1
Global Memory Size:            536150016
Number of Multiprocessors:     16
Number of Cores:               128
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 16384
Registers per Block:           8192
Warp Size:                     32
Maximum Threads per Block:     8192
Maximum Block Dimensions:      512 x 512 x 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          262144B
Texture Alignment              256B
Clock Rate:                    1674 MHz

[1] http://www.pgroup.com/resources/accel.htm#dp
The source reads as follows:

typedef double dtype;

int main( int argc, char* argv[] )
{
    int n;      /* size of the vector */

    dtype *a;  /* the vector */
    dtype *r;  /* the results */
    dtype *e;  /* expected results */

    dtype s, c;
    struct timeval t1, t2, t3;

    long cgpu, chost;
    int i,*t;
    if( argc > 1 )
        n = atoi( argv[1] );
    else
        n = 100000;
    if( n <= 0 ) n = 100000;

    a = (dtype*)malloc(n*sizeof(dtype));
    r = (dtype*)malloc(n*sizeof(dtype));
    e = (dtype*)malloc(n*sizeof(dtype));

    for( i = 0; i < n; ++i ) a[i] = (dtype)(i+1) * 2.0;
    acc_init( acc_device_nvidia );

    gettimeofday( &t1, NULL );
    #pragma acc region
    {
        for( i = 0; i < n; ++i ){
            s = sin(a[i]);
            c = cos(a[i]);
            r[i] = s*s + c*c;
        }
    }
    gettimeofday( &t2, NULL );
}

Hi spam.me,

While I get different errors than you, I’m able to work around them by adding header files and using the “-Msafeptr” flag. Note that “double” is supported; however, we currently only support 64-bit Linux host objects.

cat cc.c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/time.h>
#include <math.h>
#include <accel.h>
#include <accelmath.h>


typedef double dtype;

int main( int argc, char* argv[] )
{
    int n;      /* size of the vector */

    dtype *a;  /* the vector */
    dtype *r;  /* the results */
    dtype *e;  /* expected results */

    dtype s, c;
    struct timeval t1, t2, t3;

    long cgpu, chost;
    int i,*t;
    if( argc > 1 )
        n = atoi( argv[1] );
    else
        n = 100000;
    if( n <= 0 ) n = 100000;

    a = (dtype*)malloc(n*sizeof(dtype));
    r = (dtype*)malloc(n*sizeof(dtype));
    e = (dtype*)malloc(n*sizeof(dtype));

    for( i = 0; i < n; ++i ) a[i] = (dtype)(i+1) * 2.0;
    acc_init( acc_device_nvidia );

    gettimeofday( &t1, NULL );
    #pragma acc region
    {
        for( i = 0; i < n; ++i ){
            s = sin(a[i]);
            c = cos(a[i]);
            r[i] = s*s + c*c;
        }
    }
    gettimeofday( &t2, NULL );
}
% pgcc -ta=nvidia -Minfo=accel -fast cc.c -Msafeptr -o cc.exe
main:
     39, Generating copyin(a[0:n-1])
         Generating copyout(r[0:n-1])
     41, Loop is parallelizable
         Accelerator kernel generated
         41, #pragma for parallel, vector(256)
             Using register for a
% cc.exe
%

Thanks for your fast reply.

Of course, the headers are included in the source file, and PGI runs under a 64-bit Linux system (currently Xubuntu).

But please note that the compiler error occurs when I compile the source with

pgcc -ta=nvidia,cc11 ...

If I apply

pgcc -ta=nvidia,cc13

the compilation succeeds, but I can’t run the program; I obtain the following error message:

call to cuModuleLoadData returned error 300: Invalid Source



Hi spam.me,

Sorry for my misunderstanding. As far as I can tell, NVIDIA did not support double precision until compute capability 1.3, so you will not be able to use “double” with your card.

  • Mat

Dear mkcolg,

yes, my test graphics card has no 64-bit processors, but in CUDA it is possible to “loop” 64-bit operations in such a way that the 32-bit processors can compute the problem. Of course, doubles emulated on 32-bit processors run slowly, but I want to run some benchmarks before I ask my university to pay for some Tesla cards.

Is it planned to include such 64-to-32-bit loop options in the compiler/pragmas, or are they already available?

Hi spam.me,

We had not heard of this technique before, nor can we find it in NVIDIA’s CUDA programming guide (version 2.1). We do see where they will ‘demote’ 64-bit to 32-bit, but I don’t think this is what you want.

In any event, I don’t think it is something we would support, given that new NVIDIA cards do support double precision.

  • Mat

Dear mkcolg,

sorry for the confusion. The technique I mentioned is not part of CUDA but of ATI Stream. If you’re interested, please take a look at [1], page 3-36/122, “Double-Precision Arithmetic”.

Have a nice day!


[1] http://developer.amd.com/gpu_assets/Stream_Computing_User_Guide.pdf