Sample Program Template - 8600M GT Outperforms T10 Tesla

I just installed a Tesla S1070 on a Dell XPS running Linux.

I installed the SDK, compiled the sample program “Template”, and ran it. It reported a run time of:
~400 ms

I had previously been developing on my laptop, a MacBook Pro with an 8600M GT (much slower than a T10, I assume). Running the Template sample program there gives a run time of:
~30 ms.

Interestingly enough, running any other sample program produces the more logical result: the T10 outperforms my 8600M GT, by far.

Unfortunately, I used the Template program as the basis for my own program. On my laptop, my program's GPU path significantly outperforms a serial CPU-based algorithm. However, when run on the Tesla S1070 attached to the Dell XPS with an i7 CPU, the CPU significantly outperforms the GPU algorithm.

I suspect that whatever makes the unmodified Template sample run slower on a T10 than on an 8600M GT is also the reason my own program fails to show the same speedup on the XPS machine as it does on my laptop.

Any ideas as to why this is happening?

There is something wrong with your setup.

This is on an S1070 running CUDA 2.2:
[cuda@compute-0-1 ~]$ /usr/local/NVIDIA_CUDA_SDK/bin/linux/release/template -noprompt
Using device 0: Tesla C1060
Processing time: 46.374001 (ms)

The template code is a trivial test: it uses 1 block of 32 threads, and it exists just to show the basic setup.
You should run more than one block, with more than 32 threads per block.
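To illustrate the point, here is a minimal sketch (not the SDK's actual template.cu; the kernel and sizes are made up for illustration) contrasting the template-style tiny launch with one big enough to occupy a T10:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    // Template-style launch: 1 block of 32 threads.
    // A T10 has 30 multiprocessors, so 29 of them sit idle.
    scale<<<1, 32>>>(d_data, 32);

    // A launch that can actually fill the chip: enough blocks
    // to cover all n elements at 256 threads per block.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, n);

    cudaThreadSynchronize();  // CUDA 2.x-era name; newer toolkits use cudaDeviceSynchronize()
    cudaFree(d_data);
    return 0;
}
```

With only 1 block, launch overhead and idle multiprocessors dominate, which is why the template's timing says nothing about the relative speed of the two GPUs.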

That is what I suspected.

I installed the latest 2.2 driver, sdk, toolkit, debugger.

So, that managed to reduce the run-time down to ~100 ms.

My laptop (2.1 driver) 8600M GT still runs a lean mean ~30 ms.

Which is faster than your Tesla S1070, too!!

Furthermore, after installing 2.2, running any CUDA program is preceded by about a 5-8 second pause, and then the program runs.

I added some print statements to the reduction sample to test this, and found that it now takes 5-8 seconds to get through cudaChooseDevice().
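For anyone wanting to reproduce the measurement, a small sketch of how the startup pause can be isolated (assuming Linux; `gettimeofday()` is used rather than CUDA events because the delay occurs before any CUDA context exists):

```cuda
#include <cstdio>
#include <cstring>
#include <sys/time.h>
#include <cuda_runtime.h>

// Host wall-clock in milliseconds.
static double now_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main()
{
    cudaDeviceProp prop;
    memset(&prop, 0, sizeof(prop));
    prop.major = 1;  // accept any CUDA-capable device

    int dev;
    double t0 = now_ms();
    cudaChooseDevice(&dev, &prop);  // first CUDA call: pays driver/context init cost
    double t1 = now_ms();
    cudaSetDevice(dev);

    printf("cudaChooseDevice took %.1f ms (device %d)\n", t1 - t0, dev);
    return 0;
}
```

If the pause shows up here, it is initialization cost on the first CUDA call rather than anything specific to cudaChooseDevice() itself.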

Could this have anything to do with running in CentOS? As I understand it, CentOS is binary compatible with RHEL.

Thanks for your help!

Has anyone else encountered an issue with CUDA 2.2 slowing down the cudaChooseDevice() function?

If that’s on a cluster or some machine with no X running, it’s probably just a delay due to the kernel module initialization.
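One common workaround on headless Linux machines is to keep the NVIDIA kernel module initialized between runs so each CUDA program doesn't pay the setup cost again. A sketch, assuming your driver's `nvidia-smi` supports persistence mode (older drivers may not; the background-loop fallback is the classic alternative):

```shell
# Enable persistence mode so the kernel module stays initialized
# even with no X server holding it open. Requires root.
nvidia-smi -pm 1

# Fallback for drivers without persistence mode: keep nvidia-smi
# polling in the background so the module never tears down.
nohup nvidia-smi -l 30 > /dev/null 2>&1 &
```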

Oh! Thank you! It’s because the X server was off. Interesting. :)

Interestingly, that change also affected the timing of the template program, reducing it by roughly 20 ms: it now runs in 83 ms. Not sure why this makes a difference to the timing.

Is there any documentation on how the cuda timer works?
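For reference, the SDK samples time with the cutil helper library, which is just a host-side wall clock, so it also measures launch and initialization overhead. The runtime's documented alternative is CUDA events, which timestamp on the GPU itself. A minimal sketch (the kernel here is a made-up placeholder):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy(float *x) { x[threadIdx.x] += 1.0f; }

int main()
{
    float *d;
    cudaMalloc((void **)&d, 32 * sizeof(float));
    cudaMemset(d, 0, 32 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);   // timestamp on the GPU, before the kernel
    busy<<<1, 32>>>(d);
    cudaEventRecord(stop, 0);    // timestamp after the kernel
    cudaEventSynchronize(stop);  // block the host until 'stop' is reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

Event timing excludes host-side startup cost, which may explain why the cutil-based numbers in the template move around when the driver initialization behavior changes.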