Double precision in CUDA 2.3

Hi to everyone,

I am new to CUDA computing and I would like to ask the more experienced members a question about double precision in the CUDA 2.3 environment. First of all, I have a GTX 295 card which, according to the CUDA programming guide, supports compute capability 1.3, meaning that it supports double-precision numbers. I am using the nvcc compiler through Microsoft Visual Studio 2008 (the code is actually built using one of the SDK sample programs as a template).

So I was experimenting with a simple code, which can be found here ( [url="My first CUDA program! | /// Parallel Panorama ///"]http://llpanorama.wordpress.com/2008/05/21...t-cuda-program/[/url] ). I have modified it a bit as follows (in short, I changed the way the array a_h is allocated and which array elements are calculated):

#include <stdio.h>
#include <cuda.h>

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host
int main(void)
{

float *a_h = new float[60]; // Pointer to host array
float *a_d; // Pointer to device array
const int N = 60; // Number of elements in arrays
size_t size = N * sizeof(float);
// a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i*2;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int nblocksize=500;
int Nblocks=N/nblocksize+ (N%nblocksize == 0 ? 0 : 1);
square_array <<< Nblocks, nblocksize >>> (a_d, N);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
// Cleanup
delete [] a_h; cudaFree(a_d);
}

When I run the executable, the program works as it should and gives the correct results. But when I replace all float declarations with double, I get a warning:

warning : Double is not supported. Demoting to float

Also, the results are wrong (to be precise, it is as if the array elements were never processed by the device, since what is printed is just the elements initialized by the host). Why do I get the warning and the incorrect results? Am I missing anything here?

Thanks in advance.

compile with -arch sm_13
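To make this concrete, here is a minimal sketch (assumed file name, illustrative only) of a double-precision kernel; without the sm_13 target, nvcc 2.3 silently demotes the doubles and emits the warning you saw:

```cuda
// Build with the 1.3 target so doubles stay doubles:
//   nvcc -arch sm_13 square_double.cu -o square_double
// Without -arch sm_13, nvcc 2.3 prints:
//   warning : Double is not supported. Demoting to float
__global__ void square_array(double *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)                    // guard against extra threads
        a[idx] = a[idx] * a[idx];
}
```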

You are absolutely right, I forgot to enable the 1.3 compute architecture in the compiler options. Thank you very much for your quick response!


I get the same error.

I specifically bought a device with support for doubles, a GTX 260.

In my common.mk, the versions are included:

# Add new SM Versions here as devices with new Compute Capability are released

SM_VERSIONS := sm_10 sm_11 sm_12 sm_13

Is this perhaps a different reference to the versions?

There is a clear error in your __global__ function.

__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
a[idx] = a[idx] * a[idx];
}

The second parameter, N, is never used, so if the number of threads launched is not equal to N, your result is of course wrong.
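The usual fix for that mismatch is a bounds check, so that threads whose index falls past the end of the array simply do nothing. A sketch:

```cuda
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)                    // threads with idx >= N fall through
        a[idx] = a[idx] * a[idx];
}
```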

#include <stdio.h>
#include <cuda.h>

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;

float temp = 0.0f;
for (int i = idx; i < N; i += blockDim.x * gridDim.x)
{
	temp = a[i];
	a[i] = temp * temp;
}

}

// main routine that executes on the host
int main(void)
{
const int N = 21; // Number of elements in arrays
float *a_h=new float[N];
float *a_d; // Pointer to host & device arrays
size_t size = N * sizeof(float);

// a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++)
    a_h[i] = (float)i*2;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int nblocksize=512; //I edit here. or 512
int Nblocks=(N - 1)/nblocksize + 1; //I also edit here.
square_array <<< Nblocks, nblocksize >>> (a_d, N);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++)
    printf("%d %f\n", i, a_h[i]);
// Cleanup
delete [] a_h;
cudaFree(a_d);

}