Tesla S1070 performance issues

I have been doing some performance profiling with a Tesla S1070 under 64-bit Fedora 11, running a multithreaded program in which each thread simultaneously manages its own GPU. The program is nothing special and just uses pthreads and CUDA. These are the numbers I get from each thread, using the CUDA event mechanism for timing, synchronously running an empty kernel, and copying 512K device->host (a simplified sketch of the per-thread measurement is included below):
# of threads/devices --->|  1   |  2   |  3   |  4   |
ms. to launch a kernel ->| 0.1  | 2.0  | 3.1  | 4.1  |
ms. to copy memory ----->| 0.05 | 1.9  | 2.9  | 3.9  |

(I hope the table turns out ok in web view…)

Notice the HUGE (20x) jump from 1 to 2 threads, but then the roughly linear increase from 2 to 3 to 4. Can anybody else verify this kind of performance trend?

  • Is there some sort of big nasty lock happening in the Nvidia driver that is causing this problem? Or some other Linux Nvidia driver problem?
  • Or is it a hardware issue with the Tesla S1070? (I don’t believe so, since multiple Teslas suffer the same scaling problem.)
  • Any insight as to why this behavior is observed?
  • How can I get around this “latency” problem? Using two Tesla S1070s, it actually takes over 8 ms just to launch a kernel, which is quite a lot of time wasted doing no computation on either GPU or CPU.
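
In case it helps, this is roughly the shape of what each thread does. This is a simplified sketch rather than my actual code: thread creation and device selection are omitted, measureOneGpu and emptyKernel are placeholder names, and 512K is taken to mean 512 KB.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void emptyKernel(void) {}

/* Per-thread measurement, simplified: time an empty kernel launch and a
 * 512 KB device->host copy using CUDA events. */
static void measureOneGpu(void)
{
    const size_t bytes = 512 * 1024;
    void *d_buf = NULL, *h_buf = NULL;
    cudaMalloc(&d_buf, bytes);
    cudaMallocHost(&h_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* time an empty kernel launch */
    cudaEventRecord(start, 0);
    emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float kernelMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, start, stop);

    /* time a synchronous 512 KB device->host copy */
    cudaEventRecord(start, 0);
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float copyMs = 0.0f;
    cudaEventElapsedTime(&copyMs, start, stop);

    printf("kernel launch: %.2f ms, copy: %.2f ms\n", kernelMs, copyMs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
}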

(By the way, somebody needs to fix the Linux CUDA 2.3 Nvidia driver links on the Nvidia website. They all seem to reference 190.16 when the driver is actually 190.18.)

I would need to see code before I can give any sort of reasonable answer.

The slowdown on copy is “expected behaviour”: the GPUs in the Tesla share one or two PCIe slots depending on your HW configuration. It seems your test is perfectly synchronised, so there’s definitely some bus contention.

A possible problem I can think of is “warmup”. What happens if you do this once and measure performance only when you run your test again, from within the same program? Run a loop: is performance only weird in the first iteration, or across many iterations 2…n? A rough sketch of what I mean is below.
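
Roughly like this (just a sketch; emptyKernel and ITERS are placeholder names, and the loop count is arbitrary):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void emptyKernel(void) {}

int main(void)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* warm-up: the first launch pays one-time initialisation costs, so don't time it */
    emptyKernel<<<1, 1>>>();
    cudaThreadSynchronize();

    /* steady state: time iterations 1..N and average */
    const int ITERS = 100;
    cudaEventRecord(start, 0);
    for (int i = 0; i < ITERS; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average launch time: %.3f ms\n", ms / ITERS);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}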

Thanks for the help! I actually discovered an ID-10T error that was causing the problem… I was setting the CUDA device from the main thread, not from the individual threads, so the numbers were the result of all threads actually using the same device rather than separate devices. I did notice the “warmup” behavior you spoke of: multiple calls within the program are faster than the first.
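
For anyone else who hits this, the fix was simply moving the cudaSetDevice() call into each worker thread, since the device binding is per host thread. A simplified sketch (worker, ids, and the thread-count cap are illustrative, not my actual code):

#include <cuda_runtime.h>
#include <pthread.h>

static void *worker(void *arg)
{
    int dev = *(int *)arg;

    /* RIGHT: the device binding is per host thread, so each worker must
     * select its own GPU before making any other CUDA calls. */
    cudaSetDevice(dev);

    /* ... kernel launches, copies, and timing for this GPU ... */
    return NULL;
}

int main(void)
{
    int devCount = 0;
    cudaGetDeviceCount(&devCount);

    /* WRONG (what I was doing): calling cudaSetDevice(i) here, on the main
     * thread, does nothing for the workers -- they all end up on the
     * default device. */

    pthread_t threads[16];
    int ids[16];
    for (int i = 0; i < devCount && i < 16; ++i) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < devCount && i < 16; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}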