GPU not actually calculating

I’m using:
NVIDIA GeForce 9500M GS (compute capability 1.1)
Ubuntu 9.04 notebook remix
CUDA toolkit 2.2
NVIDIA driver 185.18.08 beta
newest version of Eclipse
gcc 4.3

I’ve installed anything and everything I’ve seen recommended on the forums, changed the paths, and tried many other things. I can build all the SDK examples by running “make” in the appropriate directory, and every one of them runs and ends with TEST PASSED.

My problem starts when I actually look at the code and try to modify it. I’ve set up Eclipse to build an arbitrary example, histogram64. At first glance it compiles and runs fine for 1 iteration: the GPU average time is some number of milliseconds and the CPU time is a little longer. However, if I crank the iteration count up as far as I can, the timing lines look like this:

Running GPU histogram (1000000000 iterations)…
histogram64GPU() time (average) : 0.000000 msec //1915341817155.926270 MB/sec


histogram64CPU() time : 12.749000 msec //748.038549 MB/sec

The GPU numbers look fishy to me, and it doesn’t seem to take any time to run at all. The only warning in Eclipse is “Unresolved inclusion: <cutil_inline.h>”. I have a similar problem when trying this example:
http://www.ddj.com/hpc-high-performance-computing/207402986
Here Eclipse can’t resolve the inclusion <cuda.h>, and it flags both __global__ and incrementArrayOnDevice as syntax errors. This example ends up “running” but produces no output at all. Something is definitely messed up.
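A timing of 0.000000 msec can mean the kernels never ran. One way to check is to test for launch errors and to wait for the GPU before reading the timer. What follows is only a minimal sketch under stated assumptions: scaleKernel and averageKernelMs are hypothetical names invented for illustration, and this is not the SDK's histogram64 code or its timing method. It uses only runtime API calls (cudaGetLastError, CUDA events).

```cuda
// check_launch.cu : build with nvcc, for example
//   nvcc -o check_launch check_launch.cu
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel (assumption): doubles every element.
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * data[i];
}

// Launches the kernel `iterations` times and returns the average time
// per launch in msec, or -1 on failure. The first CUDA error seen is
// stored in *status.
float averageKernelMs(int n, int iterations, cudaError_t *status)
{
    float *d_data = 0;
    cudaEvent_t start, stop;
    float totalMs = 0.0f;

    cudaError_t err = cudaMalloc((void **)&d_data, n * sizeof(float));
    if (err != cudaSuccess) { *status = err; return -1.0f; }
    cudaMemset(d_data, 0, n * sizeof(float));
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start, 0);
    for (int it = 0; it < iterations; ++it) {
        scaleKernel<<<blocks, threads>>>(d_data, n);
        // A launch that fails returns immediately; without this check
        // the loop would still "finish" and look extremely fast.
        err = cudaGetLastError();
        if (err != cudaSuccess) break;
    }
    cudaEventRecord(stop, 0);

    // Launches are asynchronous: wait for the stop event before
    // trusting the elapsed time. This call can also report errors
    // from the kernels that ran before it.
    cudaError_t syncErr = cudaEventSynchronize(stop);
    if (err == cudaSuccess) err = syncErr;
    if (err == cudaSuccess) cudaEventElapsedTime(&totalMs, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);

    *status = err;
    return (err == cudaSuccess) ? totalMs / iterations : -1.0f;
}

int main()
{
    cudaError_t status;
    float ms = averageKernelMs(1 << 20, 100, &status);
    if (status != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(status));
        return 1;
    }
    printf("average kernel time: %f msec\n", ms);
    return 0;
}
```

If this prints a CUDA error instead of a time, the near-zero average comes from failed launches rather than from a fast GPU.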

Does anyone have any suggestions for what to look for? I have no experience with Linux and only program in MATLAB, so please be very clear if possible.

The remaining part is just me venting…
This is how far I’ve gotten after spending about a week straight, 8 hours a day, trying to set this CUDA stuff up. Now that my computer is set to dual boot Ubuntu and Vista I’ve gotten MUCH farther, since nothing worked in Vista, but still. My advisor is considering buying a 1U Tesla unit or two as he upgrades all of his compute nodes, but if I can’t show him any potential speedup, that will go out the window. That will probably be the case, considering I’ve spent so long just trying to get to a programmable state with CUDA that I could have had a fully parallel MATLAB implementation of my code debugged and running by now. I guess I just wish NVIDIA had made this more efficient to set up, with better documentation and programming tutorials. It seems like many of the posts here are about installation issues or bugs after software upgrades rather than actual CUDA-related topics!

It sounds like however you have set up your build system in Eclipse, it is broken. The symptoms you report when trying to compile that example from DDJ are consistent with compiling CUDA code with the regular C compiler and not nvcc.

For what it is worth, I copied that code from DDJ into a text file, hacked together a 4 line Makefile from an existing one in the SDK, and it compiled and ran without error. The total time required was about 60 seconds.
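To rule Eclipse out, a single .cu file built by hand with nvcc is a quick test. The sketch below is not the DDJ listing; it only borrows the kernel name incrementArrayOnDevice mentioned above. The helper incrementMatchesHost and the file name increment.cu are hypothetical names chosen for illustration. It prints a result explicitly, so a silent run is easy to tell apart from a successful one.

```cuda
// increment.cu : must be compiled with nvcc, not plain gcc, for example
//   nvcc -o increment increment.cu
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread adds 1 to one array element.
__global__ void incrementArrayOnDevice(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] = a[idx] + 1.0f;
}

// Hypothetical helper (assumption): runs the kernel on n elements and
// returns true only if every CUDA call succeeded and the GPU result
// matches the same computation done on the host.
bool incrementMatchesHost(int n)
{
    size_t bytes = n * sizeof(float);
    float *h_in = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    float *d_a = 0;

    bool ok = (h_in != 0 && h_out != 0);
    if (ok) {
        for (int i = 0; i < n; ++i) h_in[i] = (float)i;
    }

    ok = ok && cudaMalloc((void **)&d_a, bytes) == cudaSuccess;
    ok = ok && cudaMemcpy(d_a, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        int blockSize = 256;
        int nBlocks = (n + blockSize - 1) / blockSize;
        incrementArrayOnDevice<<<nBlocks, blockSize>>>(d_a, n);
        ok = (cudaGetLastError() == cudaSuccess);  // did the launch work?
    }

    // This copy waits for the kernel to finish before returning.
    ok = ok && cudaMemcpy(h_out, d_a, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    if (ok) {
        for (int i = 0; i < n; ++i) {
            if (h_out[i] != h_in[i] + 1.0f) { ok = false; break; }
        }
    }

    cudaFree(d_a);
    free(h_in);
    free(h_out);
    return ok;
}

int main()
{
    bool ok = incrementMatchesHost(1000);
    printf("%s\n", ok ? "PASSED" : "FAILED");
    return ok ? 0 : 1;
}
```

If this builds and prints PASSED from a terminal but not from Eclipse, the problem is in the Eclipse project settings rather than in the CUDA installation.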

It might be helpful if you build, run and post the output of the deviceQuery example in the SDK as a first step.

OK, so I set up Eclipse exactly as described at this link:
http://lifeofaprogrammergeek.blogspot.com/…evelopment.html

And here is my deviceQuery output:
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA

Device 0: “GeForce 9500M GS”
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 536150016 bytes
Number of multiprocessors: 4
Number of cores: 32
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 0.95 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)

Test PASSED

Press ENTER to exit…

Using “make” in the SDK files gives the same result. But I tried the DDJ example and it compiles with no errors, but the second I run it the console has a bar that says
Again, the only warnings in the Eclipse editor are the unresolved inclusions and the syntax errors for __global__ and such.

Thanks for helping though!!!

I should also say that I added printf calls to show me when the DDJ example enters and leaves main. The program completes instantly, not after 60 seconds or so like you’ve reported.

OK, so you have a perfectly functional CUDA installation which obviously works correctly. So am I to understand that your problem is mostly that you want to work in Eclipse, but can’t get it to work?

BTW: The 60 seconds referred to the total time it took to cut and paste the code from DDJ, write the makefile, compile the code, and run it to confirm that it is valid and functional.