GPU Utilization Drops after Consecutive Executions

In my quest for optimal GPU performance, I’ve noticed a strange occurrence.

On my M2090, the first three times I execute my program I get great performance: [font=“Courier New”]nvidia-smi[/font] shows the GPU utilization at 90% and the code takes only 10 seconds to run. However, on the 4th run and every run thereafter, I see only 55% utilization and the code takes 15 seconds. I’ve used cuda-memcheck but it doesn’t show anything amiss.

If I unload the CUDA kernel module and reload it, I get back to 90% utilization for three executions, and then it’s back down to 55% from then on.

Has anyone else experienced this or know what’s going on? This doesn’t happen on my C1060 or my GTX 460, all using the same driver version (295.49). My M2090 devices are in a SuperMicro SuperServer 1026GT-TF-FM205.

As an aside, this happens with lots of different programs, not just one kernel.

Update:

The slowdown is due to the reduced speed of [font=“Courier New”]memcpy()[/font]. I can reproduce the behavior using just the [font=“Courier New”]bandwidthTest[/font] from the SDK. I have to run the program between 30 and 70 times, but once the slowdown occurs, it never returns to the original speed. Note that this only happens with paged memory; pinned memory shows no slowdown.
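
For anyone who wants to reproduce this without the SDK, here is a minimal sketch (not the [font=“Courier New”]bandwidthTest[/font] source, just the same idea; the 32 MB size simply matches the Quick Mode transfer): it times a batch of host-to-device copies from a pageable buffer and prints the average bandwidth. Run it repeatedly from the shell, like [font=“Courier New”]bandwidthTest[/font], and watch for the point where the number drops.

// pageable_bw.cu -- build with: nvcc pageable_bw.cu -o pageable_bw
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sys/time.h>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 32 << 20;       // 32 MB, matching the Quick Mode transfer size
    const int    reps  = 20;

    char *h_buf = (char *)malloc(bytes); // pageable host memory
    memset(h_buf, 1, bytes);             // fault the pages in before timing

    char *d_buf = NULL;
    cudaMalloc((void **)&d_buf, bytes);

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice); // blocking for pageable memory
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    printf("Host to Device (pageable): %.1f MB/s\n",
           reps * (bytes / (1024.0 * 1024.0)) / secs);

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}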

Example output:

./bandwidthTest Starting...

Running on...

Device 0: Tesla M2090

 Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory

   Transfer Size (Bytes)	Bandwidth(MB/s)

   33554432			3616.0

Device to Host Bandwidth, 1 Device(s), Paged memory

   Transfer Size (Bytes)	Bandwidth(MB/s)

   33554432			3223.7

Device to Device Bandwidth, 1 Device(s)

   Transfer Size (Bytes)	Bandwidth(MB/s)

   33554432			120620.7

Then, after 65 executions:

./bandwidthTest Starting...

Running on...

Device 0: Tesla M2090

 Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory

   Transfer Size (Bytes)	Bandwidth(MB/s)

   33554432			1811.0

Device to Host Bandwidth, 1 Device(s), Paged memory

   Transfer Size (Bytes)	Bandwidth(MB/s)

   33554432			1572.0

Device to Device Bandwidth, 1 Device(s)

   Transfer Size (Bytes)	Bandwidth(MB/s)

   33554432			120697.9

To add a little more information:

I previously thought that once the slowdown occurs, it never returns to normal. That turns out to be untrue: the speed returns to normal if I switch CUDA devices.

For example:

[font=“Courier New”]./bandwidthTest --device=0
./bandwidthTest --device=1
./bandwidthTest --device=0[/font]
Now, the speed returns to normal for a while until it slows down again.

At this point I think there is sufficient evidence pointing to a bug in the NVIDIA driver. I also tried version 295.20, but to no avail.

Maybe the bug is on the CPU side. When you use dynamic arrays and make many copies, the computer starts to use virtual memory, which is much slower. Try using pinned memory or static arrays and see if the same thing happens.

My understanding was that [font=“Courier New”]bandwidthTest[/font] does use static arrays. I have 36 GB of memory, so I don’t see how the memory could be going virtual. When I use pinned memory I do not see the slowdown.

For what it’s worth, [font=“Courier New”]hwloc-ls[/font] lists the following:

NUMANode L#0 (P#0 24GB) + Socket L#0 + L3 L#0 (12MB)

    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0

      PU L#0 (P#0)

      PU L#1 (P#8)

    L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1

      PU L#2 (P#1)

      PU L#3 (P#9)

    L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2

      PU L#4 (P#2)

      PU L#5 (P#10)

    L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3

      PU L#6 (P#3)

      PU L#7 (P#11)

  NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (12MB)

    L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4

      PU L#8 (P#4)

      PU L#9 (P#12)

    L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5

      PU L#10 (P#5)

      PU L#11 (P#13)

    L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6

      PU L#12 (P#6)

      PU L#13 (P#14)

    L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7

      PU L#14 (P#7)

      PU L#15 (P#15)

  HostBridge L#0

    PCIBridge

      PCI 8086:10d3

        Net L#0 "eth0"

    PCIBridge

      PCI 10de:1091

    PCIBridge

      PCI 10de:1091

    PCIBridge

      PCI 8086:10c9

        Net L#1 "eth1"

      PCI 8086:10c9

        Net L#2 "eth2"

    PCIBridge

      PCI 102b:0532

    PCI 8086:3a22

      Block L#3 "sda"

      Block L#4 "sdb"

      Block L#5 "sdc"

      Block L#6 "sdd"

      Block L#7 "sde"

Another note.

It appears that the cause is the PCIe link of the second Tesla idling down from generation 2.0 to 1.1.

As soon as I run something on the second device, the speed of the first device jumps up. I verified this with the following command:

watch -n 0.5 'nvidia-smi -q | grep Current'
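
Until the underlying cause is found, here is a crude workaround sketch based on that observation (it assumes the otherwise idle M2090 enumerates as device 1; adjust the index and the interval for your system). It just keeps touching the second GPU so that its PCIe link should stay at generation 2. Note that this also keeps that GPU out of its low-power idle state, so it costs some power.

// keepalive.cu -- build with: nvcc keepalive.cu -o keepalive
#include <cstdio>
#include <unistd.h>
#include <cuda_runtime.h>

__global__ void nop() {}

int main()
{
    if (cudaSetDevice(1) != cudaSuccess) {   // assumed index of the otherwise idle M2090
        fprintf(stderr, "cudaSetDevice(1) failed\n");
        return 1;
    }
    for (;;) {
        nop<<<1, 1>>>();                     // trivial kernel launch
        cudaDeviceSynchronize();             // keeps the GPU active and its link negotiated up
        sleep(1);                            // poke it once per second; tune as needed
    }
    return 0;
}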

This is somewhat outside my area of expertise, but if this is a multi-socket system, the memory transfer speeds could be affected by NUMA issues, i.e. where the host-side memory buffers are physically located: at the “local” CPU or a “remote” one. It would probably make sense to address memory affinity via numactl.

On most architectures, I would suspect NUMA issues could be a problem. However, on this architecture, both CPUs are connected by QPI, and both CPUs are also connected via QPI to the IOH-36D (basically X58) northbridge. It is to this northbridge that both PCIe 2.0 x16 slots are attached. The upshot is that both CPUs have equal access to both M2090s. As long as I don’t force one CPU to use the memory local to the other CPU (e.g. [font=“Courier New”]numactl -N 0 -m 1[/font]), NUMA shouldn’t be a problem.

As I indicated earlier, there is a 100% correlation between the slowdowns and one of the M2090s going into “idle” mode (which I see as [font=“Courier New”]nvidia-smi[/font] reporting the device at PCIe generation 1).

Note that this is not a persistence mode issue.

Hi,

I’m not sure whether the following is really relevant, but I experienced the same sort of issue with MPI communications (long ago). As in your case, it was on a NUMA machine, where the network card was attached primarily to one of the NUMA nodes. At the beginning, all data transfers (including main memory to network card) ran at full speed, but after a while the performance slowly degraded to around half of what was expected.
I finally identified the cause of this behaviour as the IO caches: the system was using the available memory to cache all IOs on the machine, but was allocating this memory linearly starting from the first NUMA node. This memory is not really seen as used by the system, so one can still explicitly allocate all of it. Simply put, any call to malloc that could be satisfied without evicting the cache was preferred by the system. Hence, despite not really using any memory on the machine, my explicit allocations were progressively redirected to the second, then third, etc. NUMA nodes, slowly degrading the bandwidth for accessing my device attached to NUMA node 0.
A good way of identifying this sort of problem is to force your code to run on NUMA node 0 with “numactl -N 0 -m 0” when you see degraded performance. If you suddenly get back your expected bandwidth, you’re probably in the same situation.
Another good indicator of such a problem is the “free” command: if the “cached” figure gets high, you might be facing this problem.
If so, I can give you some hints on how to configure your kernel to avoid this trap as much as possible.
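
If it helps, here is a quick sketch (it assumes libnuma’s development headers are installed and is not code from your setup; build with gcc -std=gnu99 check_pages.c -lnuma) that reports which NUMA node the pages of a freshly malloc’d transfer buffer actually land on. If the caching effect I described is in play, you should see pages drifting away from node 0 as the page cache fills up.

/* check_pages.c -- report the NUMA placement of a 32 MB pageable buffer */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <numaif.h>

int main(void)
{
    const size_t bytes = 32UL << 20;          /* 32 MB, same as the bandwidthTest transfer */
    char *buf = malloc(bytes);
    memset(buf, 0, bytes);                    /* fault the pages in first */

    long page = sysconf(_SC_PAGESIZE);
    int counts[8] = {0};                      /* assume at most 8 NUMA nodes */

    for (size_t off = 0; off < bytes; off += page) {
        int node = -1;
        /* MPOL_F_NODE | MPOL_F_ADDR returns the node the page at this address lives on */
        if (get_mempolicy(&node, NULL, 0, buf + off, MPOL_F_NODE | MPOL_F_ADDR) == 0 &&
            node >= 0 && node < 8)
            counts[node]++;
    }

    for (int n = 0; n < 8; ++n)
        if (counts[n])
            printf("node %d: %d pages\n", n, counts[n]);

    free(buf);
    return 0;
}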

Thanks for the ideas.

I’ve tried [font=“Courier New”]numactl[/font] with every imaginable option but it doesn’t improve my performance.

After many executions of my program, the output of [font=“Courier New”]free[/font] shows

total       used       free     shared    buffers     cached

Mem:      37007096    5048128   31958968          0     134488    4140648

-/+ buffers/cache:     772992   36234104

Swap:     16777208          0   16777208

which seems to indicate I don’t have a caching problem.

ekimd,

Can you also check whether any other performance properties (link width, PCIe generation, performance state, clocks) are affected?

e.g. with:

watch -n 0.5 'nvidia-smi -q | grep -A 2 "Generation\|Link\|  Clocks\|GPU 0\|Performance"'

Maybe your board is overheating or a hardware brake (HW slowdown) is being pulled.

Thanks for the info.

When I run the bandwidthTest on either device 0 or 1, I get the following:

GPU 0000:02:00.0

    Product Name                : Tesla M2090

    Display Mode                : Disabled

--

        GPU Link Info

            PCIe Generation

                Max             : 2

                Current         : 2

            Link Width

                Max             : 16x

                Current         : 16x

--

    Performance State           : P0

    Memory Usage

        Total                   : 5375 MB

--

    Clocks

        Graphics                : 650 MHz

        SM                      : 1301 MHz

--

After that, the device goes through several performance states and ends up at:

GPU 0000:02:00.0

    Product Name                : Tesla M2090

    Display Mode                : Disabled

--

        GPU Link Info

            PCIe Generation

                Max             : 2

                Current         : 1

            Link Width

                Max             : 16x

                Current         : 16x

--

    Performance State           : P12

    Memory Usage

        Total                   : 5375 MB

--

    Clocks

        Graphics                : 50 MHz

        SM                      : 101 MHz

--

As I mentioned earlier, the odd thing is that whenever the second device drops from PCIe generation 2 to 1, I experience the slow bandwidths on the first device. Is this possibly a bug in the Intel IOH-36D (5520 Tylersburg) chipset?

Also, is it normal on a system such as this to see such a wide disparity between the transfer speeds of paged vs. pinned memory?

ekimd,

Is the second nvidia-smi log from a moment when some CUDA application is running? If it’s not, it’s quite normal for the GPU to go to the idle performance state (P12) and clock down when nothing is running on it.

If a CUDA application is running and you still see P12, then most likely one of the events below is in effect:

  • your board is overheating; unfortunately the M2090 uses an external temperature sensor, so you’ll need to go through the motherboard’s BMC to query the temperature (I’m not quite sure how to do that).
  • your board is over its power budget. “nvidia-smi -q -d POWER” should report Power Management as enabled, along with some power readings. When running just bandwidthTest this is very unlikely.
  • maybe your power connector is loose and it’s not supplying enough power to the board. Try reattaching it.
  • maybe your board is not seated correctly in the PCIe slot and the system is negotiating a lower PCIe link speed.
  • are you running some software that adjusts the board’s state (e.g. I’ve heard of applications that use the X server’s NVIDIA API to adjust the state of the GPU)?

Otherwise it’s an unusual situation, and it’s probably best if you contact your OEM, provide an nvidia-bug-report.sh log and full nvidia-smi logs from both the “normal” and the “bad” state (while running a CUDA application), and go through the official bug reporting procedure.

Hope this helps,
Przemyslaw Zych

The first log is from just after running [font=“Courier New”]bandwidthTest[/font]; the latter is from a minute or so afterward.

The idling of the performance state is exactly what I expect to see.

What I don’t expect is for the performance state of the second GPU device to affect the performance of the first. That’s why I suggested this may be a bug in Intel’s chipset… (Reminder: this is a system with two M2090 GPUs.)

In one of the posts you mentioned that the PCIe generation drops while you run the bandwidth test. Is it the only parameter that changes on the other GPU, or do its P-state, clocks, etc. change as well?

Thanks for the quick reply.

I monitored the second GPU device, and its exact performance state does not affect the bandwidth of the first GPU device. However, as soon as the second device’s PCIe generation drops from 2 to 1, I immediately notice a drop in bandwidth on the first device. I can overcome this by running something on the second device, thereby forcing it back to PCIe generation 2.

Regarding your question about the disparity between paged and pinned host memory for host<->device transfers: this is primarily a function of the host’s system memory throughput. For paged host memory, the data first needs to be copied to a pinned staging buffer provided by the driver, and is DMAed to the device from there. This means the data is touched three times (read, write, DMA) on the host before it winds up on the device. With pinned memory, the data on the host is touched only once (DMA). With the four-channel memory subsystems of modern host systems you should see paged throughput much closer to pinned throughput than on older host platforms, although the addition of PCIe 3 support once again moves the goalposts.
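
For completeness, here is a minimal sketch (not taken from bandwidthTest) of the two ways to get onto the single-DMA path described above: allocate the host buffer pinned from the start with cudaHostAlloc, or pin an existing, page-aligned buffer in place with cudaHostRegister. The flags and alignment requirements below are based on the CUDA 4.x runtime; check the documentation for your toolkit version.

// pinned_copy.cu -- build with: nvcc pinned_copy.cu -o pinned_copy
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 32 << 20;

    // Option 1: allocate pinned host memory directly.
    char *h_pinned = NULL;
    cudaHostAlloc((void **)&h_pinned, bytes, cudaHostAllocDefault);

    // Option 2: pin a buffer you already own (page-aligned and page-sized here).
    char *h_existing = NULL;
    if (posix_memalign((void **)&h_existing, 4096, bytes) != 0)
        return 1;
    cudaHostRegister(h_existing, bytes, 0 /* default flags */);

    char *d_buf = NULL;
    cudaMalloc((void **)&d_buf, bytes);

    // Both copies below can DMA straight from host memory,
    // skipping the driver's pinned staging buffer.
    cudaMemcpy(d_buf, h_pinned,   bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_buf, h_existing, bytes, cudaMemcpyHostToDevice);

    cudaHostUnregister(h_existing);
    free(h_existing);
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}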

Thanks for the info; that is very interesting. I had assumed that since this system pairs an M2090 with server-grade components, the paged memory numbers would be a lot closer to what I see with consumer-grade hardware using a GTX 460.

Is there a good resource for finding out all the implications of pinned memory usage? For example, will I have problems allocating buffers of 300 MB or more? (I have 36 GB of RAM.) How do I determine how much pinned memory is available on the system? Are there any drawbacks to performing CPU computations on data stored in pinned memory? And why is pinned memory allocation slower than paged memory allocation when there is plenty of free system memory?

Thank you for your help.