Problem with double precision: unpredictable results, different runs give different errors or no error

Hi all,

I’m facing a nasty problem in my first steps with CUDA. I first thought I was doing something wrong in my own code, so I tried to reproduce it with a sample from the new GPU SDK.

It seems related to using double (I was not able to reproduce it with float).

I get unpredictable wrong results at different places, and sometimes good runs (i.e. I get PASSED some times and FAILED other times) when doing 5 or 6 runs one right after another. It only appears for “long runs”, and from previous tests on similar code it mostly appears to “swap” some indices of the array that are contiguous two by two.

I first suspected a problem with my card, as if it had defective memory, but the errors on two contiguous indices make me wonder whether it is some kind of wrong prediction somewhere, or thread concurrency when accessing the double-precision ALU. Any thoughts? Can anyone confirm whether they see the same kind of problem, or not?

Thanks in advance.

B.

So here is the code taken from vectorAdd in the C/src folder of the SDK, adapted to illustrate my problem (in short, I just replaced float with double and used a bigger N):

[codebox]/*
 * Copyright 1993-2010 NVIDIA Corporation.  All rights reserved.
 *
 * NVIDIA Corporation and its licensors retain all intellectual property and
 * proprietary rights in and to this software and related documentation.
 * Any use, reproduction, disclosure, or distribution of this software
 * and related documentation without an express license agreement from
 * NVIDIA Corporation is strictly prohibited.
 *
 * Please refer to the applicable NVIDIA end user license agreement (EULA)
 * associated with this source code for terms and conditions that govern
 * your use of this NVIDIA software.
 */

/* Vector addition: C = A + B.
 *
 * This sample is a very basic sample that implements element by element
 * vector addition. It is the same as the sample illustrating Chapter 3
 * of the programming guide with some additions like error checking.
 */

// Includes
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <cutil_inline.h>

// Variables
double* h_A;
double* h_B;
double* h_C;
double* d_A;
double* d_B;
double* d_C;
bool noprompt = false;

// Functions
void Cleanup(void);
void RandomInit(double*, int);
void ParseArguments(int, char**);

// Device code
__global__ void VecAdd(const double* A, const double* B, double* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
int main(int argc, char** argv)
{
    printf("Vector addition\n");
    int N = 512*240*60;
    size_t size = N * sizeof(double);
    ParseArguments(argc, argv);

    // Allocate input vectors h_A and h_B in host memory
    h_A = (double*)malloc(size);
    if (h_A == 0) Cleanup();
    h_B = (double*)malloc(size);
    if (h_B == 0) Cleanup();
    h_C = (double*)malloc(size);
    if (h_C == 0) Cleanup();

    // Initialize input vectors
    RandomInit(h_A, N);
    RandomInit(h_B, N);

    // Allocate vectors in device memory
    cutilSafeCall( cudaMalloc((void**)&d_A, size) );
    cutilSafeCall( cudaMalloc((void**)&d_B, size) );
    cutilSafeCall( cudaMalloc((void**)&d_C, size) );

    // Copy vectors from host memory to device memory
    cutilSafeCall( cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice) );
    cutilSafeCall( cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice) );

    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    cutilCheckMsg("kernel launch failure");
#ifdef _DEBUG
    cutilSafeCall( cudaThreadSynchronize() );
#endif

    // Copy result from device memory to host memory
    // h_C contains the result in host memory
    cutilSafeCall( cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost) );

    // Verify result
    int i;
    for (i = 0; i < N; ++i) {
        double sum = h_A[i] + h_B[i];
        if (fabs(h_C[i] - sum) > 1e-5)
        {
            printf("%d / %d  : %g = %g\n", i, N, h_C[i], sum);
            break;
        }
    }
    printf("%s \n", (i == N) ? "PASSED" : "FAILED");

    Cleanup();
}

void Cleanup(void)
{
    // Free device memory
    if (d_A)
        cudaFree(d_A);
    if (d_B)
        cudaFree(d_B);
    if (d_C)
        cudaFree(d_C);

    // Free host memory
    if (h_A)
        free(h_A);
    if (h_B)
        free(h_B);
    if (h_C)
        free(h_C);

    cutilSafeCall( cudaThreadExit() );

    if (!noprompt) {
        printf("\nPress ENTER to exit...\n");
        fflush(stdout);
        fflush(stderr);
        getchar();
    }

    exit(0);
}

// Allocates an array with random double entries.
void RandomInit(double* data, int n)
{
    for (int i = 0; i < n; ++i)
        data[i] = rand() / (double)RAND_MAX;
}

// Parse program arguments
void ParseArguments(int argc, char** argv)
{
    for (int i = 0; i < argc; ++i)
        if (strcmp(argv[i], "--noprompt") == 0 ||
            strcmp(argv[i], "-noprompt") == 0)
        {
            noprompt = true;
            break;
        }
}
[/codebox]
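
Side note: a variant of the verification loop that counts and prints every mismatching index, instead of stopping at the first one, makes the two-by-two pattern easier to see. A rough sketch of that change (not part of the SDK sample):

[codebox]// Verification variant: count and report every mismatch instead of stopping at the first
int errors = 0;
for (int i = 0; i < N; ++i) {
    double sum = h_A[i] + h_B[i];
    if (fabs(h_C[i] - sum) > 1e-5) {
        // print the index, the GPU value and the expected CPU value
        printf("mismatch at %d / %d : got %g, expected %g\n", i, N, h_C[i], sum);
        ++errors;
    }
}
printf("%s (%d mismatches)\n", (errors == 0) ? "PASSED" : "FAILED", errors);
[/codebox]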

I’m using a Tesla C1060 card (output of deviceQuery):

[codebox]

CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: “Tesla C1060”

CUDA Driver Version: 3.10

CUDA Runtime Version: 3.10

CUDA Capability Major revision number: 1

CUDA Capability Minor revision number: 3

Total amount of global memory: 4294770688 bytes

Number of multiprocessors: 30

Number of cores: 240

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 16384 bytes

Total number of registers available per block: 16384

Warp size: 32

Maximum number of threads per block: 512

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum memory pitch: 2147483647 bytes

Texture alignment: 256 bytes

Clock rate: 1.30 GHz

Concurrent copy and execution: Yes

Run time limit on kernels: No

Integrated: No

Support host page-locked memory mapping: Yes

Compute mode: Default (multiple host threads can use this device simultaneously)

Concurrent kernel execution: No

Device has ECC support enabled: No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.10, CUDA Runtime Version = 3.10, NumDevs = 1, Device = Tesla C1060

[/codebox]

Of course, I have changed common.mk to compile for sm_13 rather than sm_10.
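
For reference, building it standalone outside the SDK makefiles should amount to something like this (the $SDK path and the cutil library name are just guesses for a 64-bit Linux install of SDK 3.1):

[codebox]# double precision needs compute capability 1.3, hence -arch=sm_13
nvcc -arch=sm_13 -I$SDK/C/common/inc vectorAdd.cu -L$SDK/C/lib -lcutil_x86_64 -o vectorAdd
[/codebox]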

Edit : adding my configuration, if that matters

My config :

Linux 64 (Debian sid)

Tesla C1060

Cuda toolkit latest (I guess 3.1)

SDK latest (I guess 3.1)

Only using nvcc as far as I know:

nvcc: NVIDIA ® Cuda compiler driver

Copyright © 2005-2010 NVIDIA Corporation

Built on Mon_Jun__7_18:56:31_PDT_2010

Cuda compilation tools, release 3.1, V0.2.1221

CPU : Intel Xeon E5620 @ 2.4 GHz

RAM : 6122556 kB (≈ 6 GB)

I don’t use any output / display on this machine.

Back again,

More testing, still some strange behavior, but this time it is the same with float; it is just harder to reproduce. In that case, it is not 2 but 4 consecutive indices out of 8192 that are at fault. Could it be defective memory? Is there some way to run tests for that?

This is becoming very annoying; I cannot trust the results. Has anybody seen the same kind of problem, or is my card defective?

Thank you in advance.

B.

cuda-memtest performs extensive memory tests: http://sourceforge.net/projects/cudagpumemtest/
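
The basic idea, if you want a quick sanity check of your own before running the full tool, is to write a known pattern into a large device buffer, read it back, and compare. A very rough sketch (nowhere near as thorough as cuda-memtest; the buffer size and patterns are arbitrary):

[codebox]// Rough sketch of a write-pattern / read-back memory check
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Grid-stride loop keeps the grid size within compute 1.3 limits for large buffers
__global__ void FillPattern(unsigned int* buf, size_t n, unsigned int pattern)
{
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        buf[i] = pattern ^ (unsigned int)i;   // index-dependent pattern
}

int main(void)
{
    const size_t n = 64 * 1024 * 1024;        // 64M words = 256 MB
    unsigned int* d_buf;
    unsigned int* h_buf = (unsigned int*)malloc(n * sizeof(unsigned int));
    cudaMalloc((void**)&d_buf, n * sizeof(unsigned int));

    const unsigned int patterns[] = { 0x00000000u, 0xFFFFFFFFu, 0xAAAAAAAAu, 0x55555555u };
    for (int p = 0; p < 4; ++p) {
        // fill device memory with the pattern, copy back, and verify on the host
        FillPattern<<<1024, 256>>>(d_buf, n, patterns[p]);
        cudaMemcpy(h_buf, d_buf, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);

        size_t bad = 0;
        for (size_t i = 0; i < n; ++i)
            if (h_buf[i] != (patterns[p] ^ (unsigned int)i))
                ++bad;
        printf("pattern 0x%08X : %lu bad words\n", patterns[p], (unsigned long)bad);
    }

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}
[/codebox]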

Hello again,

I have tested (just for fun, since I’m not an expert) cuda-memcheck, and it reports no errors at all; in that case no problems occur (all the results are correct, no more test failures, at least over the tens of runs done).
If I run under cuda-gdb, there are fewer errors, but still some. And of course, cuda-gdb with “set cuda memcheck on” gives no errors at all …

:wacko:

Any idea on what to do now?

Regards,

B.

Thanks, doing the tests now …

And the stress test is failing … OK, so what’s next? Returning the card to NVIDIA? Something else? Is there a way to blacklist bad memory blocks in the driver?

Thanks again.

Regards,

B.

I’m not aware of any way of blacklisting bad memory. If this were a consumer card, I’d say reduce the memory clock rate, but it is a Tesla C1060. NVIDIA unfortunately doesn’t have a direct sales/warranty replacement service to contact (that I’m aware of). When we had to replace several S1070’s under warranty, we had to go through the vendor we purchased them from.
