I'm writing a CUDA program on Windows Vista. My CUDA host function has the following structure:
…
My_kernel <<< … >>> ( … );
cudaThreadSynchronize(); ---- (1)
…
… ---- (2)
…
…
return;
where (1) takes about 3 ms waiting for all threads to complete, and (2) is some processing that is independent of the kernel and takes about 2 ms.
I want the CPU to run (2) while the CUDA threads are working, and synchronize after (2) is done. So I modified the code to:
…
My_kernel <<< … >>> ( … );
…
… ---- (2)
…
cudaThreadSynchronize(); ---- (1)
…
return;
In this setting, the kernel launch takes only negligible time, and (2) still takes 2 ms to do its work.
But (1), the cudaThreadSynchronize() call, still takes 3 ms to wait for all threads, so the overall time does not decrease at all.
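For reference, the overlap I'm trying to get looks like this in self-contained form (a minimal sketch; my_kernel, do_cpu_work, and the launch configuration are placeholders for my actual code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *data)      // stand-in for my real kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

static void do_cpu_work(void)               // stand-in for (2): ~2 ms of CPU work
{
    /* ... processing that is independent of the kernel ... */
}

void host_function(float *d_data, int n)
{
    my_kernel<<<n / 256, 256>>>(d_data);    // launch should return immediately (asynchronous)

    do_cpu_work();                          // (2): should run on the CPU while the GPU works

    cudaThreadSynchronize();                // (1): should now wait only ~1 ms instead of 3 ms
}
```

If the launch really is asynchronous, the synchronize call only has to wait for whatever GPU time is left after (2) finishes.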
To test this, I replaced (2) with some dummy code like this:
My_kernel <<< … >>> ( … );
int i;
int dummy = 1;
for (i = 0; i < 10000000; i++)
    dummy = dummy * 2 % 10; ---- (2)
dump_value[0] = dummy; // keep the compiler from optimizing the dummy loop away
cudaThreadSynchronize(); ---- (1)
While (2) now takes 100~1000 ms and is completely independent of (1), cudaThreadSynchronize() still takes 3 ms to do the synchronization. It seems that the CUDA threads do not actually run until cudaThreadSynchronize() is called.
I also tried adding redundant code to the kernel function so that it takes 40 ms to complete, but the cudaThreadSynchronize() time was also 40 ms, even after the CPU had run 100~1000 ms of dummy code. In other words, the CPU is not blocked by the launched kernel, but the kernel also does not run while the CPU is doing its work.
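One way to observe whether the kernel makes any progress while the CPU is busy is to record an event right after the launch and poll it from inside the dummy loop. This is a sketch, assuming my_kernel is defined elsewhere as above; the launch configuration is a placeholder:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *data);     // kernel defined elsewhere

void check_overlap(float *d_data, int n)
{
    cudaEvent_t done;
    cudaEventCreate(&done);

    my_kernel<<<n / 256, 256>>>(d_data);
    cudaEventRecord(done, 0);               // enqueued immediately after the kernel

    int i, dummy = 1;
    int finished_at = -1;
    for (i = 0; i < 10000000; i++) {
        dummy = dummy * 2 % 10;
        // Poll occasionally: cudaSuccess means all work before the event is done.
        if (finished_at < 0 && i % 1000000 == 0 &&
            cudaEventQuery(done) == cudaSuccess)
            finished_at = i;
    }
    printf("dummy = %d, kernel finished at iteration %d\n", dummy, finished_at);
    // If finished_at is still -1 when the loop ends, the kernel never ran
    // concurrently with the CPU loop.

    cudaEventDestroy(done);
}
```

If the kernel is truly overlapping with the CPU work, the event should report cudaSuccess partway through the loop rather than only after the final synchronize.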
The time is measured with inline assembly that reads the CPU time-stamp counter (note the line continuations, which the multi-line macro needs):
#define ReadTSC(x) __asm cpuid \
                   __asm rdtsc \
                   __asm mov dword ptr x, eax \
                   __asm mov dword ptr x+4, edx
so I can compute the time cost like this:
ReadTSC(start_tick);
cudaThreadSynchronize();
ReadTSC(end_tick);
printf("cudaThreadSynchronize : %.4f (ms)\n", float(end_tick - start_tick) / 2400000); // 2.4 GHz CPU, so 2,400,000 ticks per ms
I am sure this measurement code does not affect the process time: the overall time cost of the host function stays the same when I comment it all out, and the times I get are consistent with the running time of the host function.
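As a cross-check on the RDTSC measurement (which on a multi-core CPU can drift between cores), the kernel's own duration can be timed with CUDA events, which use GPU-side timestamps. A sketch, again with my_kernel and the launch configuration as placeholders:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *data);     // kernel defined elsewhere

void time_kernel(float *d_data, int n)
{
    cudaEvent_t start, stop;
    float elapsed_ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);              // timestamp before the kernel
    my_kernel<<<n / 256, 256>>>(d_data);
    cudaEventRecord(stop, 0);               // timestamp after the kernel
    cudaEventSynchronize(stop);             // wait until the stop event has occurred

    cudaEventElapsedTime(&elapsed_ms, start, stop);  // GPU-measured milliseconds
    printf("kernel time : %.4f (ms)\n", elapsed_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

This measures only the GPU-side execution time, so it is independent of whatever the CPU does in between.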
Strangely, when I moved my code to two colleagues' computers, both of them ran it IN PARALLEL, which is exactly what I want.
The differences between us:
I:
CPU : Intel Core 2 Quad Q6600 2.40GHz
Graphics : NVIDIA GeForce 9800 GTX+
OS : Windows Vista 32 bits
Fellow 1:
CPU : Intel Core 2 Duo E8400 3.00GHz
Graphics : NVIDIA GeForce 9600
OS : Windows XP SP3
Fellow 2:
CPU : Intel Core 2 Quad Q6600 2.40GHz
Graphics : NVIDIA GeForce 9800 GTX+
OS : Windows XP SP3
So it seems to be a bug in CUDA on Windows Vista. Does anyone have the same problem? I'm considering filing a bug report.