RTX 6000 Ada slower than GeForce RTX 3050 in Python with TensorFlow 2?

Hi everyone,

I am seeing a noticeable performance difference between our old and our new setup.

I am running a biological analysis (with the mRNA trajectory inference tool UniTVelo) that uses TensorFlow 2 for GPU acceleration, inside a mambaorg/micromamba:jammy-cuda-11.8.0 Docker image; the same image is used on both systems for comparison.

Our old system had an NVIDIA GeForce RTX 3050; the new system runs an NVIDIA RTX 6000 Ada. The old system takes 30 min for a complete run of the analysis, whereas the new system needs 42 min.

Is this expected? Or is it somehow due to the fact that TensorFlow 2 does not yet support CUDA 12 (hence the CUDA 11.8 Docker image)?


+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX 6000 Ada Gene...    On  | 00000000:01:00.0 Off |                    0 |
| 35%   65C    P2              94W / 300W |  44110MiB / 46068MiB |     17%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| 30%   55C    P2    67W / 130W |   6738MiB /  8192MiB |     85%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Kind regards,
Chris

This GPU has compute capability 8.9 and CUDA 11.8 should support it. However, you would want to find out whether the CUDA-accelerated software you are using includes SASS (machine code) for sm_89 in the fat binary, otherwise there will be overhead for JIT compilation. You may find out that you must build some or all of the software from source code to include sm_89 support.
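
One quick way to check this from inside the container is to compare the compute capability TensorFlow reports for the GPU with the compute capabilities the installed wheel was built for. This is only a minimal sketch: tf.config.experimental.get_device_details and tf.sysconfig.get_build_info exist in current TensorFlow 2 releases, but the exact dictionary keys (e.g. cuda_compute_capabilities) can vary between versions, so treat missing keys as "unknown".

import tensorflow as tf

# Compute capability of the GPUs TensorFlow can see at runtime.
for gpu in tf.config.list_physical_devices("GPU"):
    details = tf.config.experimental.get_device_details(gpu)
    print(gpu.name, details.get("device_name"), details.get("compute_capability"))

# What the installed wheel was built against; key names may differ by release.
build = tf.sysconfig.get_build_info()
print("Built against CUDA:", build.get("cuda_version"))
print("Embedded compute capabilities:", build.get("cuda_compute_capabilities"))

If sm_89 (compute capability 8.9) is not in the second list, the first run on the RTX 6000 Ada will pay a JIT compilation penalty from the embedded PTX.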

Comparing the performance of two systems only works well if exactly one variable changes between them, in other words, when we are conducting a well-controlled experiment. Are the two systems in question configured identically (same hardware, same software with the same versions, identical configuration files and settings), except that the GPUs differ? If not, you will probably need to do some system-level profiling followed by component-level profiling to see exactly where the performance difference comes from.
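
As a first component-level check, a raw GPU micro-benchmark helps separate pure compute throughput from the rest of the pipeline. The sketch below is my own illustration (a large FP32 matmul, nothing to do with UniTVelo); run it on both systems inside the same container:

import time
import tensorflow as tf

def bench_matmul(n=8192, iters=20):
    with tf.device("/GPU:0"):
        a = tf.random.normal((n, n))
        b = tf.random.normal((n, n))
        tf.linalg.matmul(a, b).numpy()  # warm-up: excludes allocation and JIT cost
        start = time.perf_counter()
        for _ in range(iters):
            c = tf.linalg.matmul(a, b)
        c.numpy()  # force synchronization before stopping the clock
        elapsed = time.perf_counter() - start
    print(f"{n}x{n} matmul: {elapsed / iters * 1e3:.1f} ms/iter, "
          f"{2 * n**3 * iters / elapsed / 1e12:.2f} TFLOP/s")

bench_matmul()

If the RTX 6000 Ada wins clearly here but loses on the full analysis, the bottleneck is most likely outside the GPU (CPU, storage, data loading, or one-time JIT compilation at startup).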

I do not know anything about your software stack, but I would assume that its performance does not rely solely on the GPU, but also on other system components, such as the CPU (cores and clock frequencies), system memory (size and speed), mass storage (speed grade of the NVMe drive), and even interconnects (version and width of the PCIe slots).

Were these systems purchased from an experienced (and NVIDIA-approved) system integrator, or are they self-configured? If the latter, are you confident that the GPUs are in the correct PCIe slots, and that their power and cooling needs are being met?
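
If you want to quickly rule out a degraded PCIe link or a power/thermal cap, nvidia-smi can report both. Here is a small sketch that wraps the standard query fields from Python (assuming nvidia-smi is on the PATH); run it on both systems, ideally while the analysis is active, since the link generation can downshift at idle:

import subprocess

# Query PCIe link state, power draw/limit, and temperature for each GPU.
fields = ",".join([
    "name",
    "pcie.link.gen.current", "pcie.link.gen.max",
    "pcie.link.width.current", "pcie.link.width.max",
    "power.draw", "power.limit",
    "temperature.gpu",
])
print(subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv"],
    capture_output=True, text=True, check=True,
).stdout)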

If your machines are actually rented by the hour, which cloud provider and which instance types are you using, and what virtualization software is involved?

Is this expected?

In a system where nothing changes other than replacing the GeForce RTX 3050 with an RTX 6000 Ada, GPU-accelerated software should see a very significant performance increase: the latter GPU offers (from memory!) something like 2x the memory bandwidth and 4x the computational throughput of the former, plus a much larger GPU memory, which by itself is a great performance benefit for many HPC applications.

Thank you very much for the swift response and helpful pointers! I will give them some thought and run more tests.