GPU Utilization Drops after Consecutive Executions

Sorry, I am not a driver guy, nor a Linux kernel hacker. As best I know, CUDA's pinned memory is implemented via mmap() on Linux, so I assume any performance implications that apply to mmap'ed memory (including the speed of allocation/deallocation) apply equally to pinned memory in CUDA.
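For concreteness, the two allocation paths being compared in this thread look roughly like the sketch below (minimal and illustrative; the buffer size is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 16 * 1024 * 1024;
    float *pageable, *pinned;

    /* ordinary pageable allocation: the driver has to stage transfers
       to/from this buffer through an internal page-locked buffer */
    pageable = (float *)malloc(bytes);

    /* page-locked ("pinned") allocation: the GPU can DMA directly */
    cudaError_t err = cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* ... copy to/from the device with cudaMemcpy as usual ... */

    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}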

[Later:] I am not sure what kind of host<->device throughput you are expecting. Here is some data from a somewhat older Nehalem-based system with an M2090, running 64-bit Linux and CUDA 4.2:

model name      : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz

pinned
------
^^^^ h2d: bytes=  16777216  time=     2821.92 usec  rate=5945.31MB/sec
^^^^ d2h: bytes=  16777216  time=     2659.08 usec  rate=6309.40MB/sec

paged
-----
^^^^ h2d: bytes=  16777216  time=     3317.12 usec  rate=5057.77MB/sec
^^^^ d2h: bytes=  16777216  time=     4931.93 usec  rate=3401.76MB/sec

You can probably get higher throughput from paged memory on the latest available platforms, but I have no such systems at my disposal. Maybe a helpful CUDA user can post results from a state-of-the-art system.
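For anyone who wants to compare numbers from their own system: output in the format above can be produced with a simple cudaEvent-based timing loop. A minimal sketch (not the exact program used above; a real benchmark would average over many repetitions, and MB is taken as 10^6 bytes to match the rates shown):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 16777216;   /* same transfer size as above */
    float *h_buf, *d_buf;
    float ms;
    cudaEvent_t start, stop;

    cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocDefault);
    cudaMalloc((void **)&d_buf, bytes);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* time a single host-to-device copy */
    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);   /* milliseconds */

    printf("^^^^ h2d: bytes=%10zu  time=%12.2f usec  rate=%.2fMB/sec\n",
           bytes, ms * 1000.0, ((double)bytes / 1.0e6) / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}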

Hi ekimd,

Going back to your original problem: the behavior you're describing (idling of one GPU negatively affecting the performance of the other) is very peculiar. Can you generate a bit more logs for us so that I have more info for our HW guys to work with?

Can you gather a full log from nvidia-smi running in a loop (so we have continuous data over time), showing the behavior you're describing, while running the CUDA apps in a separate terminal?

nvidia-smi -q -l 5 > nvsmi_stdout.txt

A log from the other terminal with timestamps would be helpful as well, plus the output of nvidia-bug-report.sh.

Can you attach it all to the post?

Thanks,

Przemyslaw Zych

Motherboards using the X58 chipset with triple channel memory can copy to/from pageable memory pretty quickly. My GTX 580 in a single socket X58 system gets:

Host-to-device: pageable=5409 MB/sec, pinned=5814 MB/sec
Device-to-host: pageable=4395 MB/sec, pinned=6092 MB/sec

Here you go. Thanks for the help!
nvsmi_stdout.txt (352 KB)

njuffa and seibert:

Thanks for posting your specs. Here are some numbers collected from the various machines I have. All are using CUDA 4.2 with driver version 295.49, and all transfers are 33554432 bytes. Bandwidths are in MB/s.

H-to-D  D-to-H  Memory    Description
------  ------  --------  -----------
1727.0  1593.7  pageable  M2090 (second device is idle)  Dual Xeon E5630 (2.53GHz), 36GB DDR3-1333 running at 1066 MHz
3627.5  3201.7  pageable  M2090 (second device active)   Dual Xeon E5630 (2.53GHz), 36GB DDR3-1333 running at 1066 MHz
5736.3  5529.7  pinned    M2090 (second device is idle)  Dual Xeon E5630 (2.53GHz), 36GB DDR3-1333 running at 1066 MHz
5738.4  5528.6  pinned    M2090 (second device active)   Dual Xeon E5630 (2.53GHz), 36GB DDR3-1333 running at 1066 MHz
4021.6  3547.6  pageable  GTX 480   Dual Xeon E5630 (2.53GHz), 24GB DDR3-1333 running at 1066 MHz
5729.1  5804.7  pinned    GTX 480   Dual Xeon E5630 (2.53GHz), 24GB DDR3-1333 running at 1066 MHz
5386.8  5148.3  pageable  GTX 460   i7-950 (3.06GHz), 24GB DDR2 1033 MHz
5720.6  6064.9  pinned    GTX 460   i7-950 (3.06GHz), 24GB DDR2 1033 MHz
2808.2  2960.4  pageable  GTX 280M  i7-920 (2.67GHz), 6GB DDR3
2910.6  3265.2  pinned    GTX 280M  i7-920 (2.67GHz), 6GB DDR3
1528.6  1501.9  pageable  C1060     Quad-Core Opteron 2378 (2.4GHz), 16GB DDR2 266 MHz
1561.2  1593.7  pinned    C1060     Quad-Core Opteron 2378 (2.4GHz), 16GB DDR2 266 MHz

Both Xeon hosts are using an Intel 5520 IOH 36D chipset (which is supposedly a lot like X58). The M2090s are on a Super Micro X8DTG-DF motherboard, and the GTX 480s are on a Super Micro X8DTG-QF motherboard.

As you can see, several natural questions arise. Why does the i7 perform better with DDR2 memory than either of my Xeons (does the difference in clock speed matter that much)? Why is the pageable memcpy() so much slower than the pinned one on the Xeon hosts?

FYI, as stated earlier, for the dual-CPU hosts, NUMA issues have already been worked out (i.e., I am using numactl with the correct options to maximize performance).
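For anyone reproducing this: besides wrapping the application in numactl, the same placement can be done programmatically. A sketch assuming libnuma (link with -lnuma); node 0 is purely illustrative, since the correct node depends on which IOH the GPU hangs off:

#include <stdio.h>
#include <numa.h>            /* libnuma; link with -lnuma */
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 33554432;   /* same transfer size as the table above */

    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available on this system\n");
        return 1;
    }

    /* run on the NUMA node closest to the GPU's PCIe root complex */
    numa_run_on_node(0);

    /* allocate host memory on that node, then page-lock it so the
       GPU can DMA to/from it directly */
    void *h_buf = numa_alloc_onnode(bytes, 0);
    cudaHostRegister(h_buf, bytes, cudaHostRegisterDefault);

    /* ... run the bandwidth test ... */

    cudaHostUnregister(h_buf);
    numa_free(h_buf, bytes);
    return 0;
}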

Hi ekimd,

I ran the log files by our HW team, and they can't think of a mechanism in our driver/GPUs that would cause this kind of behavior. The suggestions they made point more towards a chipset issue.
To resolve this problem you'll need to go through the official process with your OEM, since these kinds of issues take more resources and time to debug.

Sorry I couldn’t be more helpful.

Regards,
Przemyslaw Zych

Thanks for the help. I'm not surprised that this would be a chipset issue, given that I see the problem when the bus drops from PCIe gen 2 to gen 1.

A workaround would be to set a minimum performance state on the M2090, if there is any way to do that. If I could keep the lowest state at P11, this wouldn't be a problem. The only other thing I can do is have a watchdog timer that "pings" the GPUs every minute to keep them active (a sketch of what I mean is below). If I do that, I'm not sure what the wear-and-tear implications of having the clocks constantly jump around would be.
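For what it's worth, the keep-alive could be as simple as launching an empty kernel on each device once a minute; a minimal sketch (whether this reliably holds the M2090 above its lowest performance state is an assumption on my part):

#include <unistd.h>
#include <cuda_runtime.h>

/* trivial kernel whose only purpose is to generate periodic GPU
   activity so the driver does not drop the card into a low-power state */
__global__ void ping(void) { }

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    for (;;) {
        for (int dev = 0; dev < ndev; ++dev) {
            cudaSetDevice(dev);
            ping<<<1, 1>>>();
            cudaDeviceSynchronize();
        }
        sleep(60);   /* once a minute, as described above */
    }
    return 0;
}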

Have you tried flushing the Linux kernel I/O cache when your memcpy starts getting slow? You can drop the caches this way:

sync && echo 1 > /proc/sys/vm/drop_caches

Has this problem been fixed in the meantime? I am having the same problem in a very similar setup with two M2090 cards. By the way, I have not found a way to force the performance states.