I am using a GeForce 9800 GT and ran into a crash when I tried a simple kernel with many global memory writes. I hope someone can tell me what is going on. Thanks! :)
-
Operating System
Linux version 2.6.22.5-31 -
Synopsis description of the problem
The kernel just crashes and returns without making any change to the data in global memory. -
Detailed description of the problem
The crash occurs with the following test:
const int aLength = 20000000;      // length of array
const int loop_count = 55000000;   // number of writes to global memory

__global__ void test(int* A, clock_t* F)
{
    int bx = blockIdx.x;
    F[0] = clock();                // timestamp before the write loop
    for (int k = 0; k < loop_count; k++)
    {
        A[k/2000] = 5;
    }
    F[1] = clock();                // timestamp after the write loop
}
int main(int argc, char** argv)
{
    int m_size_A = aLength * sizeof(int);
    int* h_A = (int*) malloc(m_size_A);                      // matrix A on host
    clock_t* h_F = (clock_t*) malloc(16 * sizeof(clock_t));  // clock values on host
    for (int i = 0; i < aLength; i++)
        h_A[i] = 0;
    for (int i = 0; i < 16; i++)
        h_F[i] = 0;

    int* d_A;       // A on device memory
    clock_t* d_F;   // to get the clock info from the kernel
    CUDA_SAFE_CALL(cudaMalloc((void**) &d_A, m_size_A));
    CUDA_SAFE_CALL(cudaMemcpy(d_A, h_A, m_size_A, cudaMemcpyHostToDevice));
    CUDA_SAFE_CALL(cudaMalloc((void**) &d_F, 16 * sizeof(clock_t)));
    CUDA_SAFE_CALL(cudaMemcpy(d_F, h_F, 16 * sizeof(clock_t), cudaMemcpyHostToDevice));

    // setup execution parameters: a single thread in a single block
    dim3 threads(1);
    dim3 grid(1);

    // execute the kernel
    test<<< grid, threads >>>(d_A, d_F);

    // check if kernel execution generated an error
    CUT_CHECK_ERROR("Kernel execution failed");

    // copy result from device to host
    CUDA_SAFE_CALL(cudaMemcpy(h_A, d_A, m_size_A, cudaMemcpyDeviceToHost));
    CUDA_SAFE_CALL(cudaMemcpy(h_F, d_F, 16 * sizeof(clock_t), cudaMemcpyDeviceToHost));

    printf("For b_Write with %d mem accesses: A[0]=%d, A[1]=%d, %s\n",
           loop_count, h_A[0], h_A[1], (h_A[0] > 0 ? "Success" : "Failure"));
    printf("clock before loop: %lu, clock after loop: %lu\n",
           (unsigned long) h_F[0], (unsigned long) h_F[1]);

    // clean up memory
    free(h_A);
    CUDA_SAFE_CALL(cudaFree(d_A));
    free(h_F);
    CUDA_SAFE_CALL(cudaFree(d_F));

    return 0;
}
Running the test 10 times for each loop_count value, the kernel always crashes when loop_count > 55000000, always succeeds when loop_count < 30000000, and sometimes succeeds in between. When it fails, the host array is left unchanged and no intermediate execution result makes it back to the host. What can cause this problem? Is it due to too many global memory accesses? Is that not allowed by CUDA?
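In case it is useful for diagnosing this, here is a minimal sketch of how the launch could be checked with the standard CUDA runtime API instead of the SDK's CUT_CHECK_ERROR macro, so the failure reports a concrete error string:

// Sketch: explicit error checks around the launch so the runtime's
// own error string is printed instead of a generic macro message.
test<<< grid, threads >>>(d_A, d_F);
cudaError_t err = cudaGetLastError();   // errors detected at launch
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));
err = cudaThreadSynchronize();          // errors during kernel execution
if (err != cudaSuccess)
    printf("kernel error: %s\n", cudaGetErrorString(err));

The error string reported there should at least narrow down whether the launch itself fails or the kernel is terminated while it runs.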
-
CUDA toolkit release version
CUDA 2.0 -
SDK release version
comes with CUDA 2.0 -
Compiler for CPU host code
gcc version 4.2.1 (SUSE Linux) -
System description including:
GPU : GeForce 9800 GT
CPU : Intel® Core™2 Quad CPU Q9450 @ 2.66GHz