I have been doing some performance profiling with a Tesla S1070 under Fedora 11 64-bit, running a multithreaded program in which each thread manages its own GPU concurrently. The program is nothing special; it just uses pthreads and CUDA. These are the numbers I get from each thread when timing, via the CUDA event mechanism, a synchronous launch of an empty kernel and a 512 KB device->host copy:
[code]
# of threads/devices --->|    1 |    2 |    3 |    4 |
ms to launch a kernel -->| 0.1  | 2.0  | 3.1  | 4.1  |
ms to copy memory ------>| 0.05 | 1.9  | 2.9  | 3.9  |
[/code]
(I hope the table turns out ok in web view…)
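For reference, here is roughly what each thread is doing. This is a minimal sketch with error checking stripped and the buffer handling simplified; names like COPY_BYTES and worker are mine, not from the real program:
[code]
/* Per-thread timing harness sketch. Build with: nvcc timing.cu -lpthread */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <cuda_runtime.h>

#define COPY_BYTES (512 * 1024)  /* the 512 KB device->host copy */

__global__ void empty_kernel(void) {}

static void *worker(void *arg)
{
    int dev = (int)(long)arg;
    cudaSetDevice(dev);                 /* one GPU per thread */

    char *d_buf, *h_buf;
    cudaMalloc((void **)&d_buf, COPY_BYTES);
    h_buf = (char *)malloc(COPY_BYTES);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* Time a synchronous empty-kernel launch as seen by this thread. */
    cudaEventRecord(start, 0);
    empty_kernel<<<1, 1>>>();
    cudaThreadSynchronize();            /* the CUDA 2.x sync call */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float kernel_ms = 0.0f;
    cudaEventElapsedTime(&kernel_ms, start, stop);

    /* Time the device->host copy. */
    cudaEventRecord(start, 0);
    cudaMemcpy(h_buf, d_buf, COPY_BYTES, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float copy_ms = 0.0f;
    cudaEventElapsedTime(&copy_ms, start, stop);

    printf("dev %d: launch %.2f ms, copy %.2f ms\n", dev, kernel_ms, copy_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    free(h_buf);
    return NULL;
}

int main(void)
{
    int n = 0;
    cudaGetDeviceCount(&n);             /* 4 GPUs on one S1070 */
    pthread_t tid[16];
    for (int i = 0; i < n && i < 16; ++i)
        pthread_create(&tid[i], NULL, worker, (void *)(long)i);
    for (int i = 0; i < n && i < 16; ++i)
        pthread_join(tid[i], NULL);
    return 0;
}
[/code]
The numbers in the table above come from running this with the thread count varied from 1 to 4.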
Notice the HUGE (20x) jump from 1 to 2 threads, but then the linear increase from 2 to 3 to 4. Can anybody else verify this kind of performance trend?
- Is there some sort of big nasty lock in the NVIDIA driver that is causing this? Or some other Linux NVIDIA driver problem?
- Or is it a hardware issue with the Tesla S1070? (I don't believe so, since multiple Teslas suffer the same scaling problem.)
- Any insight as to why this behavior is observed?
- How can I get around this “latency” problem? Extrapolating the linear trend, using two Tesla S1070s (8 GPUs, one thread each) would take over 8 ms just to launch a kernel. That is quite a lot of time wasted doing no computation on either the GPUs or the CPU.
(By the way, somebody needs to fix the Linux CUDA 2.3 NVIDIA driver links on the NVIDIA website. They all seem to reference 190.16 when the actual driver is 190.18.)