Problem report--Crash with too many global memory accesses

I am using a GeForce 9800 GT and hit this crash when I tried a simple kernel with many global memory writes. I hope someone can tell me what is going on. Thanks~:)

  • Operating System
    Linux version

  • Synopsis description of the problem
The kernel simply crashes and returns without making any change to the data in global memory.

  • Detailed description of the problem
The crash occurs with the following test:

const double aLength = 20000000;   //length of array
const int loop_count = 55000000;   //number of writes to global memory

__global__ void test(int* A, clock_t* F)
{
    int bx = blockIdx.x;
    F[0] = clock();                // clock before loop
    for (int k = 0; k < loop_count; k++)
        A[bx] = k;                 // write to global memory
    F[1] = clock();                // clock after loop
}

int main(int argc, char** argv)
{
    int m_size_A = aLength * sizeof(int);
    int* h_A = (int*) malloc(m_size_A);                     //array A on host
    clock_t* h_F = (clock_t*) malloc(16 * sizeof(clock_t));

    for (int i = 0; i < aLength; i++)
        h_A[i] = 0;
    for (int i = 0; i < 16; i++)
        h_F[i] = 0;

    int* d_A;       //A on device memory
    clock_t* d_F;   //to get the clock info in kernel

    CUDA_SAFE_CALL(cudaMalloc((void**) &d_A, m_size_A));
    CUDA_SAFE_CALL(cudaMemcpy(d_A, h_A, m_size_A, cudaMemcpyHostToDevice));

    CUDA_SAFE_CALL(cudaMalloc((void**) &d_F, 16 * sizeof(clock_t)));
    CUDA_SAFE_CALL(cudaMemcpy(d_F, h_F, 16 * sizeof(clock_t), cudaMemcpyHostToDevice));

    // setup execution parameters
    dim3 threads(1);
    dim3 grid(1);

    // execute the kernel
    test<<< grid, threads >>>(d_A, d_F);

    // check if kernel execution generated an error
    CUT_CHECK_ERROR("Kernel execution failed");

    // copy result from device to host
    CUDA_SAFE_CALL(cudaMemcpy(h_A, d_A, m_size_A, cudaMemcpyDeviceToHost));
    CUDA_SAFE_CALL(cudaMemcpy(h_F, d_F, 16 * sizeof(clock_t), cudaMemcpyDeviceToHost));

    printf("For b_Write with %d mem accesses: A[0]=%d, A[1]=%d, %s\n",
           loop_count, h_A[0], h_A[1], (h_A[0] > 0 ? "Success" : "Failure"));
    printf("clock before loop:%u, clock after loop:%u\n",
           (unsigned) h_F[0], (unsigned) h_F[1]);

    // clean up memory
    free(h_A);
    free(h_F);
    CUDA_SAFE_CALL(cudaFree(d_A));
    CUDA_SAFE_CALL(cudaFree(d_F));

    return 0;
}

With each loop_count value executed 10 times, this always crashes when loop_count > 55000000, always succeeds when loop_count < 30000000, and sometimes succeeds in between. When it fails, the host array is not changed and no intermediate execution result can be copied back to the host. What can cause this problem? Is it due to too many global memory accesses? Is that not allowed by CUDA?
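For what it's worth, one way to see the actual failure reason is to query the runtime right after the launch instead of relying on CUT_CHECK_ERROR. A minimal sketch, assuming the CUDA 2.0 runtime API:

```cuda
// After the launch, synchronize and ask the runtime what happened.
// Kernel launches are asynchronous, so the error often only appears
// once the kernel has actually finished (or been killed).
test<<< grid, threads >>>(d_A, d_F);
cudaThreadSynchronize();                       // block until the kernel completes or fails
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    // A watchdog kill typically reports "the launch timed out and was terminated"
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```

If the watchdog is the culprit, the error string will say so directly.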

  • CUDA toolkit release version
    CUDA 2.0

  • SDK release version
    comes with CUDA 2.0

  • Compiler for CPU host code
    gcc version 4.2.1 (SUSE Linux)

  • System description including:
    GPU : GeForce 9800 GT
    CPU : Intel® Core™2 Quad CPU Q9450 @ 2.66GHz

Sounds like a watchdog problem. Are you running X on the same graphics card? What are your kernel execution times?

I wasn’t aware clock() worked on the GPU. I don’t quite trust that. Also - why is aLength a double?
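(For the record, clock() is available in device code: it reads a per-multiprocessor cycle counter, and the SDK ships a "clock" sample that uses it. A minimal sketch of the usual pattern:)

```cuda
// clock() in device code returns a per-multiprocessor cycle counter,
// not wall-clock time, and it can wrap around on long kernels.
__global__ void timed_writes(int* A, clock_t* F, int n)
{
    clock_t start = clock();
    for (int k = 0; k < n; k++)
        A[blockIdx.x] = k;          // the global write being timed
    F[blockIdx.x] = clock() - start; // elapsed cycles on this block's multiprocessor
}
```

Note that the counter is per-multiprocessor, so differences are only meaningful between readings taken by the same block.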

Yes, the machine only has one graphics card, so it also runs X. The kernel crashes even without the clock() function. When loop_count = 10000000, the execution time is 2478.0.

I wanted to test with a huge array, so I made it a double at the beginning.

How did you calculate the time? Profiler? If you used a timer, did you run cudaThreadSynchronize() after launching the kernel and before you stopped it? I’m not sure what the units are, but if it’s milliseconds then I imagine the watchdog will be causing you problems with a count of ~20,000,000. IIRC it’ll terminate your kernel if it runs for more than 5 seconds.
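(To make the synchronize point concrete, here is a host-side timing sketch, assuming Linux gettimeofday; without the synchronize you would only time the asynchronous launch itself:)

```cuda
#include <sys/time.h>
#include <stdio.h>

// Wall-clock time in milliseconds (POSIX gettimeofday).
static double wall_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e3 + tv.tv_usec / 1e3;
}

// In main(), around the launch:
//   double t0 = wall_ms();
//   test<<< grid, threads >>>(d_A, d_F);
//   cudaThreadSynchronize();   // the launch returns immediately; wait for completion
//   double t1 = wall_ms();
//   printf("kernel time: %.1f ms\n", t1 - t0);
```

If the measured time approaches 5000 ms on a display GPU, the watchdog kill is the expected outcome.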

Ok - but double is floating point. Surely you want an unsigned integer or long integer data type?

It still crashes even if I delete all the timer-related code and clock(), and also change the double to an int. The reason I used double was just that its limit is much larger than long int’s, although I may never want to use a number that big… -----.-----