Stymied by my first simple CUDA test, need help!

After finally figuring out all the installation, linking, and compiling of the sample codes on both Linux and Windows 7 (if anyone has questions about the Windows setup, including how to use VS 2010, email me and I will be glad to help; Linux is a relatively simple matter, I think), I am now able to concentrate on writing a simple test code to compare the CPU and the GPU.

The code basically compares the time spent doing some relatively CPU-intensive calculations on the CPU with doing the same calculation on the GPU, but repeated N times in parallel. The kernel code looks something like this:

[i]
__global__ void kernel(float* c) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    float x = 0.0f;  // initialized so the write below is well-defined even with the loop body commented out
    int i, j;
    // doing some simple add/subtract/multiply work in a big loop
    for (i = 0; i < 100000; ++i) {
        for (j = 0; j < 100000; ++j) {
            // x = ...;
        }
    }

    // write the result out, one thread per element
    c[idx] = x;
}
[/i]

The main function body that calls the GPU calc looks something like this:

[i]// ...

float v[5000];
float* dv;
cudaMalloc((void**)%dv, 5000 * sizeof(float));

clock_t start = clock();

// 40 blocks x 125 threads = 5000 threads, one per element of v
kernel<<<40, 125>>>(dv);
cudaMemcpy(v, dv, 5000 * sizeof(float), cudaMemcpyDeviceToHost);

clock_t end = clock();

cout << (float)(end - start) / CLOCKS_PER_SEC * 1000. << " (ms)" << endl;
[/i]

Now, this code runs fine, and it produces some timing results (which weren't what I expected, but I will leave that for another post…). But this is the part that currently confuses the hell out of me:

if I comment out the last line of the kernel code, i.e., change from:

c[idx] = x;

to:

// c[idx] = x;

then the timing output says 0.000 (ms)!!! I.e., no time is spent on the GPU at all!!

And here's the even weirder part: I can change this last line to:

if (idx < -100) c[0] = x;

i.e., this line will never get executed because idx can never go negative, but the timing result goes back to non-zero, the same as the original code!! And I get the same timing result in this scenario:

if (idx < 2) c[idx] = x;

So it looks like as long as the output vector is accessed, or even just 'mentioned', anywhere in the code, the kernel produces the same non-zero timing result. It does not matter how often the output vector actually gets accessed, or even whether it gets accessed at all.

I thought maybe the CPU code went ahead and didn't wait for the GPU to finish (which would explain a timing result of zero), but according to the programming guide, calling cudaMemcpy(…) after the kernel launch means the host waits for the GPU code to finish. To be sure, I also threw in a call to cudaThreadSynchronize() after the kernel call: same result, i.e., elapsed time = 0.0.
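In other words, the timed region looked roughly like this (a sketch from memory, not the exact code; same kernel and buffers as above):

[i]clock_t start = clock();

kernel<<<40, 125>>>(dv);
cudaThreadSynchronize();  // explicitly block the CPU until the kernel finishes

cudaMemcpy(v, dv, 5000 * sizeof(float), cudaMemcpyDeviceToHost);

clock_t end = clock();
[/i]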

So my guess is that there's something I don't quite understand going on with respect to waiting for the GPU to finish, and how the way the output vector is manipulated in the code affects the timing of the execution.

Now I am basically stuck, as I can't really produce a simple test showing that parallel calculation on the GPU is indeed N x faster than the CPU. Any help would be GREATLY APPRECIATED!! (I am even willing to pay for the advice if needed!!)

Thx

Trent

zhuq1688@gmail.com

It would be nice if you could use the forum's code-text function next time.

The behavior you observed when deleting c[idx] = x; is a testament to today's compiler techniques, I guess ;) The compiler probably does not compile the code that modifies x at all. This is only an assumption, but it would also explain the second phenomenon, because nvcc knows that idx can't be < 0.

Can someone confirm that?

Was the cudaMalloc((void**)%dv, 5000*sizeof(float)) line a typo on the part of your post (and not in your actual code)? The % would have to be an & for it to compile.

One possibility for the 0 timing is that your kernel is silently failing. I'd advise checking all your operations' error codes and seeing whether they all complete successfully.
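Something along these lines (a quick untested sketch; dv, v, and the launch configuration are taken from your snippet above):

[i]#include <cstdio>

// wrap every CUDA runtime call and report any failure code
#define CUDA_CHECK(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) \
        fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                __FILE__, __LINE__, cudaGetErrorString(err)); \
} while (0)

CUDA_CHECK(cudaMalloc((void**)&dv, 5000 * sizeof(float)));

kernel<<<40, 125>>>(dv);
CUDA_CHECK(cudaGetLastError());        // catches launch-time failures
CUDA_CHECK(cudaThreadSynchronize());   // catches errors during kernel execution

CUDA_CHECK(cudaMemcpy(v, dv, 5000 * sizeof(float), cudaMemcpyDeviceToHost));
[/i]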

The code compiles and produces the desired output when I do pass the output back from device to host. The issue is trying to make sense of what kind of parallel speed-up I can get, compared to my naive expectation.

The best I can guess is that the CUDA compiler looks at the kernel function during compilation, and if it sees that the function never writes any output, it simply instructs the GPU to do nothing; whereas if a write is present, even one whose condition will never be met, the work is kept. It'd be nice if someone from NVIDIA could confirm this.

That is correct. The compiler has very aggressive dead code elimination, and will remove code that has no effect (i.e., is not involved in a write to global memory) unless you trick the compiler.
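For example, if the loop result actually feeds the global write, and the trip counts are only known at run time, the arithmetic cannot be removed. A rough sketch (the loop body here is made up, just to give the threads something to compute):

[i]__global__ void kernel(float* c, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    float x = 0.0f;
    // n is passed in from the host, so the compiler cannot
    // evaluate the loops at compile time...
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            x += (float)i * 0.5f - (float)j * 0.25f;

    // ...and x contributes to a global memory write, so the code
    // that computes it is not dead and cannot be eliminated
    c[idx] = x;
}
[/i]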