Hi all,
I’m facing a nasty problem in my first steps with CUDA. I first thought I was doing something wrong in my own code, so I tried to reproduce it with a sample from the new GPU SDK.
The problem is related to double precision (I was not able to reproduce it with float).
I get unpredictable wrong results at different places, and sometimes good runs (i.e. I get PASSED some times and FAILED other times), when doing 5 or 6 runs one right after another. It only appears for “long runs”, and from previous tests on similar code it mostly seems to “swap” pairs of contiguous indices in the array.
I first suspected defective memory on my card, but the errors at 2 contiguous indices make me wonder whether it is some kind of misprediction somewhere, or thread concurrency when accessing the double ALU… Any thoughts? Can anyone confirm whether they see the same kind of problem, or not?
Thanks in advance.
B.
So here’s the code, taken from vectorAdd in the C/src folder and adapted to illustrate my problem (in short, I just replaced float with double and used a bigger N):
[codebox]/*
 * Copyright 1993-2010 NVIDIA Corporation. All rights reserved.
 *
 * NVIDIA Corporation and its licensors retain all intellectual property and
 * proprietary rights in and to this software and related documentation.
 * Any use, reproduction, disclosure, or distribution of this software
 * and related documentation without an express license agreement from
 * NVIDIA Corporation is strictly prohibited.
 *
 * Please refer to the applicable NVIDIA end user license agreement (EULA)
 * associated with this source code for terms and conditions that govern
 * your use of this NVIDIA software.
 */
/* Vector addition: C = A + B.
 *
 * This sample is a very basic sample that implements element by element
 * vector addition. It is the same as the sample illustrating Chapter 3
 * of the programming guide with some additions like error checking.
 */
// Includes
#include <stdio.h>
#include <string.h>  // strcmp
#include <math.h>    // fabs
#include <cutil_inline.h>
// Variables
double* h_A;
double* h_B;
double* h_C;
double* d_A;
double* d_B;
double* d_C;
bool noprompt = false;
// Functions
void Cleanup(void);
void RandomInit(double*, int);
void ParseArguments(int, char**);
// Device code
__global__ void VecAdd(const double* A, const double* B, double* C, int N)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < N)
C[i] = A[i] + B[i];
}
// Host code
int main(int argc, char** argv)
{
printf("Vector addition\n");
int N = 512*240*60;
size_t size = N * sizeof(double);
ParseArguments(argc, argv);
// Allocate input vectors h_A and h_B in host memory
h_A = (double*)malloc(size);
if (h_A == 0) Cleanup();
h_B = (double*)malloc(size);
if (h_B == 0) Cleanup();
h_C = (double*)malloc(size);
if (h_C == 0) Cleanup();
// Initialize input vectors
RandomInit(h_A, N);
RandomInit(h_B, N);
// Allocate vectors in device memory
cutilSafeCall( cudaMalloc((void**)&d_A, size) );
cutilSafeCall( cudaMalloc((void**)&d_B, size) );
cutilSafeCall( cudaMalloc((void**)&d_C, size) );
// Copy vectors from host memory to device memory
cutilSafeCall( cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice) );
cutilSafeCall( cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice) );
// Invoke kernel
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
cutilCheckMsg("kernel launch failure");
#ifdef _DEBUG
cutilSafeCall( cudaThreadSynchronize() );
#endif
// Copy result from device memory to host memory
// h_C contains the result in host memory
cutilSafeCall( cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost) );
// Verify result
int i;
for (i = 0; i < N; ++i) {
double sum = h_A[i] + h_B[i];
if (fabs(h_C[i] - sum) > 1e-5)
{
printf("%d / %d : %g != %g\n", i, N, h_C[i], sum);
break;
}
}
printf("%s \n", (i == N) ? "PASSED" : "FAILED");
Cleanup();
}
void Cleanup(void)
{
// Free device memory
if (d_A)
cudaFree(d_A);
if (d_B)
cudaFree(d_B);
if (d_C)
cudaFree(d_C);
// Free host memory
if (h_A)
free(h_A);
if (h_B)
free(h_B);
if (h_C)
free(h_C);
cutilSafeCall( cudaThreadExit() );
if (!noprompt) {
printf("\nPress ENTER to exit...\n");
fflush( stdout);
fflush( stderr);
getchar();
}
exit(0);
}
// Allocates an array with random double entries.
void RandomInit(double* data, int n)
{
for (int i = 0; i < n; ++i)
data[i] = rand() / (double)RAND_MAX;
}
// Parse program arguments
void ParseArguments(int argc, char** argv)
{
for (int i = 0; i < argc; ++i)
if (strcmp(argv[i], "--noprompt") == 0 ||
strcmp(argv[i], "-noprompt") == 0)
{
noprompt = true;
break;
}
}
[/codebox]
I’m using a Tesla C1060 card (deviceQuery output):
[codebox]
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: “Tesla C1060”
CUDA Driver Version: 3.10
CUDA Runtime Version: 3.10
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294770688 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.30 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: No
Device has ECC support enabled: No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.10, CUDA Runtime Version = 3.10, NumDevs = 1, Device = Tesla C1060
[/codebox]
Of course, I have changed common.mk to compile for sm_13 rather than sm_10.
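For reference, if I recall correctly, without the sm_13 architecture flag nvcc on this generation demotes double to float (with a “Double is not supported” warning), so the flag matters here. A minimal command line, assuming the source file is named vectorAdd.cu:

```shell
# Compute capability 1.3 is required for native double precision on the
# C1060; without -arch=sm_13 the compiler demotes double to float.
nvcc -arch=sm_13 -o vectorAdd vectorAdd.cu
```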
Edit: adding my configuration, in case it matters.
My config:
Linux 64 (Debian sid)
Tesla C1060
CUDA toolkit: latest (I guess 3.1)
SDK: latest (I guess 3.1)
Only using nvcc, as far as I know:
nvcc: NVIDIA ® Cuda compiler driver
Copyright © 2005-2010 NVIDIA Corporation
Built on Mon_Jun__7_18:56:31_PDT_2010
Cuda compilation tools, release 3.1, V0.2.1221
CPU : Intel Xeon E5620 @ 2.4GHz
RAM : 6122556 B
I don’t use any display output on this machine.