Kernel cannot run in parallel with CPU code: OK in XP, fails in Vista

I'm writing a CUDA program on Windows Vista. My CUDA host function has the following structure:


My_kernel <<< … >>> ( … );
cudaThreadSynchronize(); ---- (1)

… ---- (2)

return;

where (1) takes about 3 ms waiting for all threads to complete, and (2) is some processing that is independent of the kernel and takes about 2 ms.
I want the CPU to run (2) while the CUDA threads are still working, and to synchronize only after (2) is done; ideally that would cut the total from about 5 ms (3 + 2) down to about 3 ms (max(3, 2)). So I modified the code to:


My_kernel <<< … >>> ( … );

… ---- (2)

cudaThreadSynchronize(); ---- (1)

return;

With this arrangement, the kernel launch takes only negligible time, and (2) still takes 2 ms to do its work.
But (1), the cudaThreadSynchronize() call, still takes 3 ms waiting for the threads; if the kernel really ran concurrently with (2), it should only have to wait the remaining ~1 ms. The overall time does not decrease at all.

To test this, I replaced (2) with some dummy code like this:

My_kernel <<< … >>> ( … );

int i;
int dummy = 1;
for (i = 0; i < 10000000; i++)
    dummy = dummy * 2 % 10; ---- (2)

dump_value[0] = dummy; // keep the compiler from optimizing the dummy loop away

cudaThreadSynchronize(); ---- (1)

While (2) now takes 100~1000 ms and is completely independent of (1), cudaThreadSynchronize() still takes 3 ms to do the synchronization. It seems that the CUDA threads do not actually run until cudaThreadSynchronize() is called.

I also tried adding redundant code to the kernel so that it takes 40 ms to complete; cudaThreadSynchronize() then also takes 40 ms, even after the CPU has spent 100~1000 ms on the dummy code. In other words, the CPU is not blocked by the launched kernel, but the kernel also does not run while the CPU is doing its work.
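One way to check this hypothesis without a profiler is to poll the default stream from inside the dummy loop. This is my own diagnostic sketch, reusing My_kernel and dump_value from the snippet above; cudaStreamQuery() returns cudaErrorNotReady while queued work is still pending and cudaSuccess once it has finished. (Caveat: the query itself might force the driver to submit the kernel, much like the cudaEventRecord() workaround described later, so the result should be read with care.)

My_kernel <<< … >>> ( … );

int i;
int dummy = 1;
int done_at = -1; // first iteration at which the kernel was seen finished

for (i = 0; i < 10000000; i++) {
    dummy = dummy * 2 % 10;
    // Poll only occasionally so the query overhead stays negligible.
    if (done_at < 0 && (i % 100000) == 0 &&
        cudaStreamQuery(0) == cudaSuccess)
        done_at = i;
}

dump_value[0] = dummy;
printf("Kernel first seen finished at iteration %d\n", done_at); // -1 = never

cudaThreadSynchronize();

If done_at stays at -1 through the whole loop and only the final synchronize completes the work, the kernel genuinely is not running concurrently with the CPU.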

The time is measured by inline assembly that reads the CPU's timestamp counter:

#define ReadTSC(x) __asm cpuid \
                   __asm rdtsc \
                   __asm mov dword ptr x, eax \
                   __asm mov dword ptr x+4, edx

so I can compute the time cost like this:

ReadTSC(start_tick);
cudaThreadSynchronize();
ReadTSC(end_tick);
printf("Launch kernel : %.4f (ms)\n", float(end_tick - start_tick) / 2400000); // 2.4 GHz CPU => 2,400,000 ticks per ms

I am sure this measurement code does not affect the processing time: the overall time cost of the host function stays the same when I comment all of it out, and the times I get are consistent with the total running time of the host function.
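As an aside (my addition, not part of the original measurement): raw RDTSC values can drift between cores on a multi-core CPU such as the Q6600, and thread migration can make tick deltas unreliable. A cross-check with the Windows QueryPerformanceCounter API would rule that out; a minimal sketch wrapping the same synchronize call:

#include <windows.h>
#include <stdio.h>

LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq); // counter ticks per second

QueryPerformanceCounter(&t0);
cudaThreadSynchronize();
QueryPerformanceCounter(&t1);

printf("Synchronize : %.4f (ms)\n",
       1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart);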

Strangely, when I moved my code to two colleagues' computers, both of them ran it IN PARALLEL, which is exactly what I want.
The differences between our machines:

Me:
CPU: Intel Core 2 Quad Q6600 2.40 GHz
Graphics: NVIDIA GeForce 9800 GTX+
OS: Windows Vista 32-bit

Colleague 1:
CPU: Intel Core 2 Duo E8400 3.00 GHz
Graphics: NVIDIA GeForce 9600
OS: Windows XP SP3

Colleague 2:
CPU: Intel Core 2 Quad Q6600 2.40 GHz
Graphics: NVIDIA GeForce 9800 GTX+
OS: Windows XP SP3

So it seems to be a bug in CUDA on Windows Vista. Does anyone have the same problem? I'm considering filing a bug report.

post your full code

This is weird and I am investigating!

For testing, I modified the matrixMul sample code from the NVIDIA CUDA SDK:

void
runTest(int argc, char** argv)
{
    CUT_DEVICE_INIT(argc, argv);

    // set seed for rand()
    srand(2006);

    // allocate host memory for matrices A and B
    unsigned int size_A = WA * HA;
    unsigned int mem_size_A = sizeof(float) * size_A;
    float* h_A = (float*) malloc(mem_size_A);
    unsigned int size_B = WB * HB;
    unsigned int mem_size_B = sizeof(float) * size_B;
    float* h_B = (float*) malloc(mem_size_B);

    // initialize host memory
    randomInit(h_A, size_A);
    randomInit(h_B, size_B);

    // allocate device memory
    float* d_A;
    CUDA_SAFE_CALL(cudaMalloc((void**) &d_A, mem_size_A));
    float* d_B;
    CUDA_SAFE_CALL(cudaMalloc((void**) &d_B, mem_size_B));

    // copy host memory to device
    CUDA_SAFE_CALL(cudaMemcpy(d_A, h_A, mem_size_A,
                              cudaMemcpyHostToDevice));
    CUDA_SAFE_CALL(cudaMemcpy(d_B, h_B, mem_size_B,
                              cudaMemcpyHostToDevice));

    // allocate device memory for result
    unsigned int size_C = WC * HC;
    unsigned int mem_size_C = sizeof(float) * size_C;
    float* d_C;
    CUDA_SAFE_CALL(cudaMalloc((void**) &d_C, mem_size_C));

    // allocate host memory for the result
    float* h_C = (float*) malloc(mem_size_C);

    // create and start timer
    unsigned int timer = 0;

    // setup execution parameters
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(WC / threads.x, HC / threads.y);

    // execute the kernel
    matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);

    CUT_SAFE_CALL(cutCreateTimer(&timer));
    CUT_SAFE_CALL(cutStartTimer(timer));
    cudaThreadSynchronize();    // ------------------------------------------( A )
    CUT_SAFE_CALL(cutStopTimer(timer));
    printf("Processing time (CUDA) : %f (ms) \n", cutGetTimerValue(timer));
    CUT_SAFE_CALL(cutDeleteTimer(timer));

    // copy result from device to host
    CUDA_SAFE_CALL(cudaMemcpy(h_C, d_C, mem_size_C,
                              cudaMemcpyDeviceToHost));

    // compute reference solution
    float* reference = (float*) malloc(mem_size_C);

    CUT_SAFE_CALL(cutCreateTimer(&timer));
    CUT_SAFE_CALL(cutStartTimer(timer));
    computeGold(reference, h_A, h_B, HA, WA, WB);    // ----------------------( B )
    CUT_SAFE_CALL(cutStopTimer(timer));
    printf("Processing time (CPU): %f (ms) \n", cutGetTimerValue(timer));
    CUT_SAFE_CALL(cutDeleteTimer(timer));

    // check result
    CUTBoolean res = cutCompareL2fe(reference, h_C, size_C, 1e-6f);
    printf("Test %s \n", (1 == res) ? "PASSED" : "FAILED");
    if (res != 1) printDiff(reference, h_C, WC, HC);

    // clean up memory
    free(h_A);
    free(h_B);
    free(h_C);
    free(reference);
    CUDA_SAFE_CALL(cudaFree(d_A));
    CUDA_SAFE_CALL(cudaFree(d_B));
    CUDA_SAFE_CALL(cudaFree(d_C));
}

For larger matrices, ( A ) costs ~10 ms on CUDA and ( B ) costs ~200 ms on the CPU. After reordering:

// execute the kernel
matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);

// compute reference solution
float* reference = (float*) malloc(mem_size_C);

CUT_SAFE_CALL(cutCreateTimer(&timer));
CUT_SAFE_CALL(cutStartTimer(timer));
computeGold(reference, h_A, h_B, HA, WA, WB);    // ----------------------( B )
CUT_SAFE_CALL(cutStopTimer(timer));
printf("Processing time (CPU): %f (ms) \n", cutGetTimerValue(timer));
CUT_SAFE_CALL(cutDeleteTimer(timer));

CUT_SAFE_CALL(cutCreateTimer(&timer));
CUT_SAFE_CALL(cutStartTimer(timer));
cudaThreadSynchronize();    // ------------------------------------------( A )
CUT_SAFE_CALL(cutStopTimer(timer));
printf("Processing time (CUDA) : %f (ms) \n", cutGetTimerValue(timer));
CUT_SAFE_CALL(cutDeleteTimer(timer));

// copy result from device to host
CUDA_SAFE_CALL(cudaMemcpy(h_C, d_C, mem_size_C,
                          cudaMemcpyDeviceToHost));

On my colleagues' computers, ( A ) now costs ~0 ms, as expected: the kernel needs only ~10 ms, so it should finish well before the ~200 ms of ( B ) is done. But on my computer, ( A ) still costs the full 10 ms.

I just tested on another computer with Windows Vista, and the problem remains, but I also found an easy workaround:

My_Kernel <<< … >>> ( … );

cudaEvent_t evt;
cudaEventCreate(&evt);

cudaEventRecord(evt, NULL); // force the CUDA threads to start

… ---- let the CPU do its work before the threads complete

cudaEventDestroy(evt);

cudaThreadSynchronize();

CUDA events are usually used for non-busy waiting, but I found that cudaEventRecord() makes the threads actually run, whereas on Windows Vista they normally don't start until synchronization. Something important but strange: you must put cudaEventDestroy(evt) AFTER your CPU code, or the threads will not run in parallel with the CPU and the workaround does not work.
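Putting the workaround together with the earlier dummy-loop test, here is a minimal sketch; this is my own assembly of the pieces above, with My_kernel, dump_value, and the launch configuration as placeholders from the earlier snippets, not a definitive implementation:

My_kernel <<< grid, threads >>> ( /* … */ );

// Recording an event right after the launch appears to force Vista's
// driver to actually submit the kernel, so it can overlap with CPU work.
cudaEvent_t evt;
cudaEventCreate(&evt);
cudaEventRecord(evt, 0);        // 0 = default stream

// CPU work that is independent of the kernel, as in the dummy test above.
int dummy = 1;
for (int i = 0; i < 10000000; i++)
    dummy = dummy * 2 % 10;
dump_value[0] = dummy;

// Destroy the event only AFTER the CPU work; destroying it earlier
// re-introduces the serialization observed on Vista.
cudaEventDestroy(evt);

cudaThreadSynchronize();        // should now return almost immediately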