After finally figuring out the installation, linking, and compiling of the sample codes on both Linux and Windows 7 (if anyone has questions about the Windows setup, including how to use VS 2010, email me and I will be glad to help; Linux is a relatively simple matter, I think), I can now concentrate on writing a simple test to compare the CPU and the GPU.
The code basically compares the time spent doing some relatively CPU-intensive calculations on the CPU against doing the same calculation on the GPU, but repeated N times in parallel. The kernel code looks something like this:
[i]
__global__ void kernel(float* c) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    float x = 0.0f;
    int i, j;
    // some simple add/subtract/multiply work inside a big nested loop
    for (i = 0; i < 100000; ++i) {
        for (j = 0; j < 100000; ++j) {
            // x = …;
        }
    }
    // write the result out
    c[idx] = x;
}
[/i]
The main function body that calls the GPU calc looks something like this:
[i]
// …
float v[5000];
float* dv;
cudaMalloc((void**)&dv, 5000 * sizeof(float));

clock_t start = clock();
kernel<<<40, 1024>>>(dv);
cudaMemcpy(v, dv, 5000 * sizeof(float), cudaMemcpyDeviceToHost);
clock_t end = clock();

cout << (float)(end - start) / (float)CLOCKS_PER_SEC * 1000. << " (ms)" << endl;
[/i]
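As an aside, I believe CUDA events can also be used for this kind of timing; the sketch below is just that pattern applied to the same placeholder kernel and buffer, not code from my actual run:
[i]
// rough sketch: timing the same launch with CUDA events
// ('kernel', 'dv' and the 40x1024 launch config are the placeholders from above)
cudaEvent_t evStart, evStop;
cudaEventCreate(&evStart);
cudaEventCreate(&evStop);

cudaEventRecord(evStart, 0);
kernel<<<40, 1024>>>(dv);
cudaEventRecord(evStop, 0);
cudaEventSynchronize(evStop);   // block the CPU until the kernel has finished

float msElapsed = 0.0f;
cudaEventElapsedTime(&msElapsed, evStart, evStop);
cout << msElapsed << " (ms)" << endl;

cudaEventDestroy(evStart);
cudaEventDestroy(evStop);
[/i]
(I have not actually run this event-based variant; all the numbers I mention below come from the clock()-based timing above.)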
Now, this code runs fine and produces some timing results (which weren't what I expected, but I will leave that for another post…). But here is the part that currently confuses the hell out of me:
If I comment out the last line of the kernel code, i.e. change it from:
c[idx] = x;
to:
// c[idx] = x;
then the timing output says 0.000 (ms)!! I.e., no time spent on the GPU at all!!
And here's the even weirder part: if I change this last line to:
if (idx < -100) c[0] = x;
i.e. a line that will never execute because idx never goes negative, then the timing result goes back to non-zero, the same as with the original code!! And I get the same timing result with this scenario:
if (idx < 2) c[idx] = x;
So it looks like, as long as the output vector is either accessed or merely 'mentioned' in the kernel code, I get the same non-zero timing result; it does not seem to matter how often the output vector is actually accessed, or whether it is accessed at all.
I thought that maybe the CPU code went ahead without waiting for the GPU to finish (hence the zero timing result), but according to the programming guide, calling 'cudaMemcpy…' after the kernel launch means the CPU waits for the kernel to finish. To be sure, I also threw in a call to 'cudaThreadSynchronize()' right after the kernel call; same result, i.e. elapsed time = 0.0.
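Just to be explicit about where that sync call sits relative to the timing, the modified section looks roughly like this (the commented-out cudaGetLastError() line is only a note about where an error check could go; it was not part of my run):
[i]
clock_t start = clock();
kernel<<<40, 1024>>>(dv);
cudaThreadSynchronize();   // explicitly wait here for the kernel to finish
// cudaError_t err = cudaGetLastError();   // (hypothetical check, not in my run)
cudaMemcpy(v, dv, 5000 * sizeof(float), cudaMemcpyDeviceToHost);
clock_t end = clock();
cout << (float)(end - start) / (float)CLOCKS_PER_SEC * 1000. << " (ms)" << endl;
[/i]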
So my guess is that there's something I don't quite understand going on with respect to waiting for the GPU to finish, and with how the way the output vector is touched in the kernel affects the measured execution time.
Now I am basically stuck, as I can't produce a simple test showing that a parallel calculation on the GPU is indeed N× faster than on the CPU. Any help would be GREATLY APPRECIATED!! (I am even willing to pay for the advice if needed!!)
Thx
Trent