I am doing some benchmarking for a project, and the code seems to work fine until I start increasing the size of a loop inside the kernel. This is on a CUDA 1.0 device (8800 GTS with 320 MB of RAM).
To narrow down the cause, I created the following code to verify that it wasn't the data size causing the issue:
typedef unsigned int UINT32;        /* 4 bytes */
typedef unsigned long long UINT64;  /* 8 bytes */
__global__ void CUDA_TEST( void *kp, void *ret, UINT32 loop) {
    UINT32 *k = (UINT32 *)kp;
    UINT64 val = 0;
    /* Add the single input value to the accumulator 'loop' times. */
    for (UINT64 i = 0; i < loop; i++)
    {
        val += *k;
    }
    *((UINT64 *)ret) = val;
}
extern "C" void CUDA_TEST_WRAP( void *kp, void *ret, UINT32 loop) {
    /* One block, one thread. */
    CUDA_TEST<<<1,1>>>(kp,ret,loop);
}
TESTING CODE:
unsigned int *CUDA_TEST_KP;   /* device pointer to the input value */
UINT64 *CUDA_TEST_RET;        /* device pointer to the result */
cutilSafeCall( cudaMalloc( (void**) &CUDA_TEST_KP,sizeof(int)));
cutilSafeCall( cudaMalloc( (void**) &CUDA_TEST_RET,sizeof(UINT64)));
int mycount = 2000000000;
cutilSafeCall( cudaMemcpy( CUDA_TEST_KP, &mycount, sizeof(int),
    cudaMemcpyHostToDevice) );
UINT64 testloop = 10000000;
CUDA_TEST_WRAP(CUDA_TEST_KP,CUDA_TEST_RET,testloop);
cudaThreadSynchronize();   /* return value is not checked here */
printf("made it through\n");
UINT64 my_result;
cutilSafeCall( cudaMemcpy(&my_result, CUDA_TEST_RET, sizeof(UINT64),
    cudaMemcpyDeviceToHost));
printf("worked for %I64u repeats and returned %I64u\n",testloop,my_result);
cudaFree(CUDA_TEST_KP);
cudaFree(CUDA_TEST_RET);
The code above works fine for any value of mycount, and I have tested testloop values up to 10,000,000. However, once I set testloop to 100,000,000, it crashes while executing the cudaMemcpy that copies the result back to the host. I get the “made it through” message in all cases.
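Because the test code discards the return value of the synchronize call, "made it through" is printed whether or not the kernel finished successfully. Kernel launches are asynchronous, so an error raised while the kernel is running is only reported by a later runtime call. The following is a minimal sketch, not the original test harness, of how each step could be checked to see where the failure is first reported. The `check` helper and the `d_k` and `d_ret` names are hypothetical, and it uses `cudaDeviceSynchronize`, the newer name for `cudaThreadSynchronize` in current toolkits, in place of the cutil macros.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

typedef unsigned int UINT32;
typedef unsigned long long UINT64;

// Same accumulation loop as in the post, with typed pointers.
__global__ void CUDA_TEST(const UINT32 *k, UINT64 *ret, UINT32 loop) {
    UINT64 val = 0;
    for (UINT64 i = 0; i < loop; i++) {
        val += *k;
    }
    *ret = val;
}

// Hypothetical helper: report which step failed, then stop.
static void check(cudaError_t err, const char *step) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", step, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main() {
    UINT32 *d_k = NULL;    // device copy of the input value
    UINT64 *d_ret = NULL;  // device location for the result
    UINT32 count = 2000000000u;
    UINT32 loop = 10000000u;  // raise this to reproduce the problem

    check(cudaMalloc((void **)&d_k, sizeof(UINT32)), "cudaMalloc(d_k)");
    check(cudaMalloc((void **)&d_ret, sizeof(UINT64)), "cudaMalloc(d_ret)");
    check(cudaMemcpy(d_k, &count, sizeof(UINT32), cudaMemcpyHostToDevice),
          "cudaMemcpy to device");

    CUDA_TEST<<<1, 1>>>(d_k, d_ret, loop);
    // Errors in the launch itself (bad configuration, etc.).
    check(cudaGetLastError(), "kernel launch");
    // Errors raised while the kernel was running surface here.
    check(cudaDeviceSynchronize(), "kernel execution");

    UINT64 result = 0;
    check(cudaMemcpy(&result, d_ret, sizeof(UINT64), cudaMemcpyDeviceToHost),
          "cudaMemcpy to host");
    printf("loop=%u result=%llu\n", loop, result);

    cudaFree(d_k);
    cudaFree(d_ret);
    return 0;
}
```

If the message names "kernel execution" rather than "cudaMemcpy to host", the kernel itself failed and the copy was only the first checked call to report it.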
I have confirmed that UINT32 and UINT64 are indeed the expected sizes (4 and 8 bytes respectively).
Can anyone direct me toward what the issue might be? Is my issue related to a bad coding assumption? Or is it just an issue of trying something beyond the abilities of the CUDA 1.0 device? Any help would be GREATLY appreciated.