Slow training of neural networks on GPU

Hey everyone,

I am experiencing slow neural-network training on the GPU. I know this because I have another, inferior, machine to compare against.

The inferior hardware is a laptop (Lenovo X1 Carbon) connected to an eGPU enclosure containing a 1080 Ti. The superior hardware is a desktop with an Intel i9-10940X processor and an ASUS Pro WS X299 SAGE II motherboard, which holds three GPUs: two 2080 Tis and one 1080 Ti.

When training on the desktop, I am seeing 2-4x slower speeds compared to training on the laptop+eGPU. I used nvidia-smi to make sure the GPUs were being used, and they were.

The software on both systems is the same: Ubuntu 18.04, with PyTorch installed via Anaconda3 in exactly the same way. This gives me a conda environment with PyTorch 1.7.1, CUDA 11, and cuDNN 8.0.5. I also tested this with TensorFlow code and got the same result: the desktop training was much slower.
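
For reference, a quick sanity check along these lines (just a sketch) shows what each environment reports for versions and visible GPUs:

import torch
import tensorflow as tf

# PyTorch build info and visible GPUs
print('torch', torch.__version__, 'cuda', torch.version.cuda,
      'cudnn', torch.backends.cudnn.version())
print('cuda available:', torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))

# TensorFlow build info and visible GPUs
print('tensorflow', tf.__version__)
print(tf.config.list_physical_devices('GPU'))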

Any help would be much appreciated! Thank you!

Edit: Here is the TensorFlow code I’m using to test this:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import time

# Tiny MLP, just to exercise the GPU.
model = keras.Sequential(
    [
        layers.Dense(2, activation="relu", name="layer1"),
        layers.Dense(3, activation="relu", name="layer2"),
        layers.Dense(1, name="layer3"),
    ]
)

# Dummy data: 100,000 samples with 3 features each.
inp = tf.ones((100000, 3))
labels = tf.zeros((100000, 1))

model.compile(
    optimizer=keras.optimizers.RMSprop(),
    loss=keras.losses.MeanSquaredError(),
    metrics=[keras.metrics.MeanSquaredError()],
)

tic = time.perf_counter()
model.fit(
    inp,
    labels,
    batch_size=1000,
    epochs=500,
)
toc = time.perf_counter()

print(f'wallclock: {toc - tic}')

Consider reporting the actual performance metrics for particular training runs here so people with similar setups can compare to what they are seeing. I would be surprised if anyone can pinpoint the issue from the scant information provided so far.

On the laptop, it takes about 7 seconds to run, while on the desktop it takes around 16 seconds. I’m not training anything to improve, say, classification performance; I’m just putting the GPUs through their paces to test the speed.

The TensorFlow code I am using is the same as the code now shown in the edit to the original post above.

I suspect not all three GPUs are running at x16 PCIe bandwidth, so that is something to check. The CPU you are using supports 48 PCIe lanes, and 3 x 16 = 48, which would leave no lanes for any other motherboard functions.

Looking at Appendix A-1 of the motherboard manual (Pro WS X299 SAGE II Manual | Server & Workstation | ASUS Malaysia), it looks like the total number of PCIe lanes feeding the six slots is 32. So you could have one card in slot 1 OR 3 and one in slot 4, 5, 6, or 7, with the appropriate BIOS setting adjusted to give x16, or use two slots at x8.

With the system training, run nvidia-smi on each card and look at the PCIe speed/width readings.

Thanks for the reply! Hmm, well I did try removing one of the GPUs and running the above code again, and I got the same result.

I am noticing in nvidia-smi that the desktop GPU utilization is low even for large batches, unless the model size is really large.

Basically, the laptop GPU idles at around 10-20% with a monitor connected, and jumps to 50% when training. The desktop GPU idles at around 1% with a low-resolution monitor connected, but then “only” jumps up to 6-7% when training. I can get the utilization up to 50% if I increase the model size, though.
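
In case it helps, this is roughly how the utilization can be polled from a second terminal while training runs; just a sketch, assuming the nvidia-ml-py (pynvml) package is installed, and it reads the same counters nvidia-smi displays:

import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

# Print GPU and memory-controller utilization once a second for a minute.
for _ in range(60):
    for i, h in enumerate(handles):
        util = pynvml.nvmlDeviceGetUtilizationRates(h)
        print(f'GPU{i}: gpu {util.gpu}%  mem {util.memory}%')
    time.sleep(1)

pynvml.nvmlShutdown()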

The memory I am using is G.SKILL Ripjaws V Series 64GB (4 x 16GB) 288-pin DDR4-3200. I am using two kits for a total of 128GB.

Hopefully that sheds some more light on things.

Again, with the system training, run nvidia-smi on each card. What are the PCIe speed/width readings? This needs to be done under load, as the PCIe speeds are slowed when the cards are idle.

I’m sorry but which arguments for nvidia-smi should I be running to monitor this?

Sorry, nvidia-smi -q should give you this for all cards (this is on Linux; probably the same on Windows).

You are looking for the “GPU Link Info” section for each card.

Thank you! I think the values I’m looking for are Tx Throughput and Rx Throughput?

For the laptop, I have Tx: ~22,000 KB/s and Rx: ~140,000 KB/s.
For the desktop, I have Tx: ~10,000 KB/s and Rx: 40,000 KB/s.

That does illuminate things, I feel. Is there a recommendation for a fix, or somewhere I can start looking?

No, this bit here (ignore the N/A values):

GPU Link Info
    PCIe Generation
        Max     : N/A
        Current : N/A
    Link Width
        Max     : N/A
        Current : N/A

And on the desktop this should occur three times - once for each card.
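
If it is easier to grab this programmatically, a sketch along these lines (assuming the nvidia-ml-py / pynvml bindings are installed) should report the same link info for every card; run it while training is going:

import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    # Current vs. maximum PCIe generation and link width, per card.
    gen_cur = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(h)
    gen_max = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(h)
    width_cur = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)
    width_max = pynvml.nvmlDeviceGetMaxPcieLinkWidth(h)
    print(f'GPU{i}: PCIe gen {gen_cur}/{gen_max}, width x{width_cur}/x{width_max}')
pynvml.nvmlShutdown()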

Ohhh, sorry. Yeah, the PCIe Generation Max says 3 for all cards. And for each card under load, the Current goes from 1 to 3.

For the Link Width, both Max and Current are x16 all the time.

(thanks for your patience)

OK, that would indicate all three cards are operating at their maximum levels PCIe-wise, although for the reasons I mentioned in my first post, I do not see how this is possible.

I’m out of ideas here, although the throughput figures you posted earlier would support some degree of throttling occurring there.

Brainstorming:

Are both systems returning the same (correct) results? If results are incorrect, they can be generated arbitrarily fast.

Were the performance numbers for the two systems accidentally swapped? Is the executable the exact same binary for both systems, or built locally? Is one build perhaps a debug build, the other a release build? Are any performance-impacting environment variables in effect on either system?
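
For the environment-variable point, a trivial check run on both machines (just a sketch; the variable list is illustrative, not exhaustive) would rule out the obvious candidates:

import os

# A few variables worth comparing between the two machines.
for var in ('CUDA_VISIBLE_DEVICES', 'CUDA_LAUNCH_BLOCKING',
            'OMP_NUM_THREADS', 'MKL_NUM_THREADS',
            'TF_FORCE_GPU_ALLOW_GROWTH'):
    print(var, '=', os.environ.get(var, '<unset>'))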

It would be best to create a more controlled experiment by running the laptop and the desktop each with just the 1080 Ti. Then instrument the code with timestamps, or profile as appropriate, to track down where the time is spent on each system (CPU/GPU/storage). Any significant shift in the percentage split between components between the systems should be an excellent pointer to follow up on.
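
For the instrumentation, CUDA events give a GPU-side timing that is not skewed by asynchronous kernel launches. A minimal sketch, using an illustrative stand-in model rather than your actual network:

import torch

device = torch.device('cuda:0')
# Illustrative stand-in model and data, not the poster's actual network.
model = torch.nn.Linear(3, 3).to(device)
lossfunc = torch.nn.MSELoss()
inp = torch.zeros(64, 3, device=device)
labels = torch.zeros(64, 3, device=device)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for _ in range(1000):
    loss = lossfunc(model(inp), labels)
    loss.backward()
end.record()

# Wait for all queued GPU work before reading the timer.
torch.cuda.synchronize()
print(f'GPU time for 1000 steps: {start.elapsed_time(end):.1f} ms')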

Is it possible much of the total app execution time goes to mass-storage I/O, for example? Maybe an SSD vs. HDD situation, perhaps?

Thanks for the suggestions!

I examined the CPU usage of both the laptop and the desktop and saw that the laptop under load was running at about 4.4 GHz on all cores, while the desktop under load was staying at the minimum clock of 1.2 GHz on all cores (all of this while training with the GPUs). This prompted me to check the CPU frequency ratios in the BIOS.
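
(For anyone wanting to check the same thing: the per-core clocks can be watched while training with a small script like this sketch, assuming the psutil package is installed; grepping "MHz" in /proc/cpuinfo shows the same information.)

import time
import psutil

# Print the current clock of each core once a second while training runs.
for _ in range(30):
    freqs = psutil.cpu_freq(percpu=True)
    print(' '.join(f'{f.current:.0f}' for f in freqs), 'MHz')
    time.sleep(1)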

So I went into the BIOS and, instead of using the default CPU frequency ratio of Auto, manually set it to the advertised 4.8 GHz, which seemed to fix things in PyTorch. Actually, when I was stress-testing the CPU under the Auto setting, the maximum an individual core would reach was about 4.2 GHz. So I went back and manually set the CPU frequency ratio to 4.1 GHz just as a test, and even there I saw a 2x speedup over the Auto setting.

So definitely something fishy is going on with the motherboard.

Also, what’s even weirder is that PyTorch has sped up now, but TensorFlow has not. Pretty weird!

Thanks for the suggestions though! Any idea what could be going on? Perhaps I should contact ASUS.

While both CPUs and GPUs use dynamic clocking, I haven’t seen a CPU for a desktop system clocked as low as 1.2 GHz when in use.

Intel CPUs typically have a base clock (3.3 GHz for the i9-10940X) and, if thermals and power allow, can boost the clock above that baseline up to a maximum boost clock (often limited to the case where only one core is active), which is 4.8 GHz for your CPU. Sometimes the CPU has to lower the clock below the base clock when wide SIMD (e.g. AVX2 or AVX-512) is being used, but that is normally on the order of a couple hundred MHz below base.

Yeah, that is pretty weird. I think I ended up pinpointing a large part of the problem. It wasn’t necessarily setting the CPU Core Ratio from Auto to manual; it was disabling Intel SpeedShift. This alleviated part of the problem: when I now run the PyTorch code at the bottom of this post, the desktop beats the laptop.

But this isn’t the end of the story, because the TensorFlow code in the original post edit still runs slower on the desktop+GPU than on the laptop+eGPU, about 2x slower. And some PyTorch code also still runs slightly slower than on the laptop. But this was enlightening nonetheless. Hmmm…

import torch
import time

device = torch.device('cuda:0')

# Small two-layer network, just to exercise the GPU.
class NNClass(torch.nn.Module):
    def __init__(self):
        super(NNClass, self).__init__()
        self.fc1 = torch.nn.Linear(3, 128)
        self.fclast = torch.nn.Linear(128, 3)

    def forward(self, inp):
        out = self.fc1(inp)
        return self.fclast(out)

theNN = NNClass().to(device)
lossfunc = torch.nn.MSELoss()
inp = torch.zeros(size=(64, 3), device=device)
labels = torch.zeros(size=(64, 3), device=device)

# Pure speed test: forward + backward only, no optimizer step or zero_grad.
tic = time.perf_counter()
for idx in range(10000000):
    Fout = theNN(inp)
    loss = lossfunc(Fout, labels)
    loss.backward()

    if idx % 1000 == 0:
        print(f'idx: {idx}')
    if idx == int(3e4):
        break

toc = time.perf_counter()
print(f'wallclock: {toc - tic}')

Honestly, I have no idea about “Core Ratio” and “Intel SpeedShift”. The SBIOS in my Dell workstation has relatively few user-selectable settings, and I have only looked at a subset of those. I guess there is an advantage to being given less rope to get entangled in.

There are all kinds of benchmarking apps out there with associated databases that allow comparison with other systems. Maybe try one of those to see whether your desktop falls short in a particular discipline, and then take it from there.

Hello, I’m experiencing exactly the same problem as you!
I have the same CPU/motherboard combination, but with two RTX 3090 GPUs.
My finding is that the performance varies with the motherboard BIOS version.
If I use BIOS version 0501 and the Auto setting for the CPU ratio, the cores will run up to 4.3 GHz,
but if I use BIOS version 0702 or 0901 (the latest one), they will run at 1.2 GHz.