CUDA 1.0 device crashes during cudaMemcpy with large looped kernel. Help, please!

I am doing some benchmarking for a project, and the code seems to work fine until I start increasing the size of a loop inside the kernel. This is on a CUDA 1.0 device (an 8800 GTS with 320 MB of RAM).

To narrow down what was causing the issue, I wrote the following code to verify that it wasn’t the data size:
typedef unsigned int UINT32;        /* 4 bytes */
typedef unsigned long long UINT64;  /* 8 bytes */

__global__ void CUDA_TEST( void *kp, void *ret, UINT32 loop ) {
    UINT32 *k = (UINT32 *)kp;
    UINT64 val = 0;

    for (UINT64 i = 0; i < loop; i++)
        val += *k;
    *((UINT64 *)ret) = val;
}

extern "C" void CUDA_TEST_WRAP( void *kp, void *ret, UINT32 loop ) {

    unsigned int *CUDA_TEST_KP;
    UINT64 *CUDA_TEST_RET;
    cutilSafeCall( cudaMalloc( (void**) &CUDA_TEST_KP, sizeof(int) ) );
    cutilSafeCall( cudaMalloc( (void**) &CUDA_TEST_RET, sizeof(UINT64) ) );
    int mycount = 2000000000;
    cutilSafeCall( cudaMemcpy( CUDA_TEST_KP, &mycount, sizeof(int),
                               cudaMemcpyHostToDevice ) );
    UINT64 testloop = 10000000;
    CUDA_TEST<<<1,1>>>( CUDA_TEST_KP, CUDA_TEST_RET, testloop );  /* launch config not shown in original post */
    cudaThreadSynchronize();
    printf("made it through\n");
    UINT64 my_result;
    cutilSafeCall( cudaMemcpy( &my_result, CUDA_TEST_RET, sizeof(UINT64),
                               cudaMemcpyDeviceToHost ) );
    printf("worked for %I64u repeats and returned %I64u\n", testloop, my_result);
}

The code above works fine for ANY value of mycount. I have tried testloop values up to 10,000,000. However, once I set testloop to 100,000,000, it crashes while trying to execute the cudaMemcpy. I get the “made it through” message in all cases.

I have confirmed that UINT32 and UINT64 are indeed the sizes expected (4 and 8 bytes, respectively).

Can anyone direct me toward what the issue might be? Is my issue related to a bad coding assumption? Or is it just an issue of trying something beyond the abilities of the CUDA 1.0 device? Any help would be GREATLY appreciated.

How long is the kernel running before it crashes? Is there a screen attached to the GPU? You might just be hitting the watchdog timer.

Thank you VERY much for the input. It gives me something to look into.

I don’t have the number handy, but I recall it lasting maybe a second on a “good” run. I’ll have a real number tomorrow.

Shouldn’t the cudaThreadSynchronize ensure that the CUDA code has finished before the memcpy runs? The memcpy is where it is crashing, as far as I can tell.

There is a screen attached to the GPU.

I’ll take a look into the watchdog timer.