I just installed a Tesla S1070, on a Dell XPS running Linux.
I installed the SDK, compiled the sample program “Template”, and ran it; it reported a run time of:
~400 ms
I had previously been developing on my laptop, a MacBook Pro with an 8600M GT (much slower than a T10, I assume), and running the Template sample program there gives a run time of:
~30 ms.
Interestingly enough, running any other sample program produces the more logical result: the T10 outperforms my 8600M GT, by far.
Unfortunately, I used the Template program as the basis for my own program. On my laptop, the GPU version of my program outperforms a serial CPU-based algorithm. However, on the Tesla S1070 attached to the Dell XPS with an i7 CPU, the CPU significantly outperforms the GPU algorithm.
I suspect that whatever makes the unmodified template sample run slower on a T10 than on an 8600M GT is also the reason my own program fails to reproduce my laptop results on the XPS machine.
This is on a S1070 running CUDA 2.2
[cuda@compute-0-1 ~]$ /usr/local/NVIDIA_CUDA_SDK/bin/linux/release/template -noprompt
Using device 0: Tesla C1060
Processing time: 46.374001 (ms)
Test PASSED
The template code is a trivial test: it uses 1 block of 32 threads, and it is there just to show the basic setup.
You should run more than 1 block and more than 32 threads per block.
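To illustrate the advice above, here is a minimal sketch of a launch configuration that uses many blocks and a reasonable block size, rather than the template's single block of 32 threads. The `scale()` kernel, `N`, and `threadsPerBlock` are my own illustrative choices, not part of the SDK template; `cudaThreadSynchronize()` is the CUDA 2.2-era synchronization call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: doubles each element of an array.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail block
        data[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;                // 1M elements, enough to fill a T10
    const int threadsPerBlock = 256;      // well above 32
    const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // round up

    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));

    scale<<<blocks, threadsPerBlock>>>(d_data, N);
    cudaThreadSynchronize();              // wait for the kernel to finish

    printf("launched %d blocks of %d threads\n", blocks, threadsPerBlock);
    cudaFree(d_data);
    return 0;
}
```

With only 1 block of 32 threads, a GPU with many multiprocessors (like the T10) sits almost entirely idle, so a small mobile part can look faster on such a micro-test.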
Oh! Thank you! It’s because the X server was off. Interesting. :)
Interestingly, that change also affected the timing of the template program, reducing it by 20! It now runs in 83 ms. I'm not sure why this makes a difference to the timing.
Is there any documentation on how the cuda timer works?
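Not an official reference, but one alternative worth knowing about: the CUDA runtime provides event-based timing (`cudaEventRecord` / `cudaEventElapsedTime`), which measures elapsed time between events recorded in the GPU's stream, rather than host wall-clock time. The sketch below shows the basic pattern; the `dummy` kernel and launch dimensions are illustrative placeholders, not taken from the template sample.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy() {}  // stand-in for real work

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);       // record in the default stream
    dummy<<<64, 256>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);      // block until the stop event completes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Because the events are recorded around just the kernel launch, one-time costs such as driver and context initialization don't pollute the measurement the way they can with a host-side timer started before the first CUDA call.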