Tesla S1070 performance issues

I have been doing some performance profiling with a Tesla S1070 under 64-bit Fedora 11, running a multithreaded program in which each thread simultaneously manages its own GPU. The program is nothing special and just uses pthreads and CUDA. These are the numbers I get from each thread, using the CUDA event mechanism for timing, synchronously running an empty kernel, and copying 512K device->host (a simplified sketch of the per-thread measurement is included below):
# of threads/devices --->|  1   |  2   |  3   |  4   |
ms. to launch a kernel ->| 0.1  | 2.0  | 3.1  | 4.1  |
ms. to copy memory ----->| 0.05 | 1.9  | 2.9  | 3.9  |

(I hope the table turns out ok in web view…)

Notice the HUGE (20x) jump from 1 to 2 threads, but then the roughly linear increase from 2 to 3 to 4. Can anybody else verify this kind of performance trend?

  • Is there some sort of big nasty lock happening in the Nvidia driver that is causing this problem? Or some other Linux Nvidia driver problem?
  • Or is it a hardware issue with the Tesla S1070? (I don’t believe so, since multiple Teslas suffer the same scaling problem.)
  • Any insight as to why this behavior is observed?
  • How can I get around this “latency” problem? Using two Tesla S1070s, it actually takes over 8 ms just to launch a kernel, which is quite a lot of time wasted doing no computation on either GPU or CPU.
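
In case it helps, this is roughly the shape of what each thread does. This is a simplified sketch rather than my actual code: thread creation and device selection are omitted, measureOneGpu and emptyKernel are placeholder names, and 512K is taken to mean 512 KB.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void emptyKernel(void) {}

/* Per-thread measurement, simplified: time an empty kernel launch and a
 * 512 KB device->host copy using CUDA events. */
static void measureOneGpu(void)
{
    const size_t bytes = 512 * 1024;
    void *d_buf = NULL, *h_buf = NULL;
    cudaMalloc(&d_buf, bytes);
    cudaMallocHost(&h_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* time an empty kernel launch */
    cudaEventRecord(start, 0);
    emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float kernelMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, start, stop);

    /* time a synchronous 512 KB device->host copy */
    cudaEventRecord(start, 0);
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float copyMs = 0.0f;
    cudaEventElapsedTime(&copyMs, start, stop);

    printf("kernel launch: %.2f ms, copy: %.2f ms\n", kernelMs, copyMs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
}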

(By the way, somebody needs to fix the Linux CUDA 2.3 Nvidia driver links on the Nvidia website. They all seem to reference 190.16 when the driver is actually 190.18.)

I would need to see code before I can give any sort of reasonable answer.

The slowdown on copy is “expected behaviour”: the GPUs in the Tesla share one or two PCIe slots depending on your HW configuration. It seems your test is perfectly synchronised, so there’s definitely some bus contention.

A possible problem I can think of is “warmup”. What happens if you do this once and measure performance only when you run your test again, from within the same program? Run a loop: is performance only weird in the first iteration, or across many iterations 2…n? A rough sketch of what I mean is below.
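
Roughly like this (just a sketch; emptyKernel and ITERS are placeholder names, and the loop count is arbitrary):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void emptyKernel(void) {}

int main(void)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* warm-up: the first launch pays one-time initialisation costs, so don't time it */
    emptyKernel<<<1, 1>>>();
    cudaThreadSynchronize();

    /* steady state: time iterations 1..N and average */
    const int ITERS = 100;
    cudaEventRecord(start, 0);
    for (int i = 0; i < ITERS; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average launch time: %.3f ms\n", ms / ITERS);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}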

Thanks for the help! I actually discovered an ID-10T error that was causing the problem… I was setting the CUDA device from the main thread, not from the individual threads, so the numbers were the result of all threads actually using the same device rather than separate devices. I did notice the “warmup” behavior you spoke of: multiple calls within the program are faster than the first.
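
For anyone else who hits this, the fix was simply moving the cudaSetDevice() call into each worker thread, since the device binding is per host thread. A simplified sketch (worker, ids, and the thread-count cap are illustrative, not my actual code):

#include <cuda_runtime.h>
#include <pthread.h>

static void *worker(void *arg)
{
    int dev = *(int *)arg;

    /* RIGHT: the device binding is per host thread, so each worker must
     * select its own GPU before making any other CUDA calls. */
    cudaSetDevice(dev);

    /* ... kernel launches, copies, and timing for this GPU ... */
    return NULL;
}

int main(void)
{
    int devCount = 0;
    cudaGetDeviceCount(&devCount);

    /* WRONG (what I was doing): calling cudaSetDevice(i) here, on the main
     * thread, does nothing for the workers -- they all end up on the
     * default device. */

    pthread_t threads[16];
    int ids[16];
    for (int i = 0; i < devCount && i < 16; ++i) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < devCount && i < 16; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}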