Cuda function much slower on Arm machine

Hi, I have a couple of Cuda functions that I have been using to evaluate Cuda performance. I ran them on my Windows laptop with a Xeon and T2000 mobile GPU and for the fast function (1 kernel) got a runtime of 0.4ms and the slower function (9 kernels) had a runtime of 24ms.

I then ran the exact same code recompiled for an Ubuntu machine running an Arm CPU and A4000 GPU. For the slower of the two functions I saw a substantial increase in performance from 24ms to 5ms moving to the more powerful GPU. But the faster of the two functions actually got a lot slower at around 12ms.

I’m at a bit of a loss as to what could be causing this disparity in performance. I have checked occupancy and that shouldn’t be an issue and I used the same input data on both machines. Does anyone have any tips for what might be causing it? Unfortunately I won’t be able to share the code since it uses confidential IP.


Edit: Breaking down the function in question, both the kernel runtime and the deallocation of memory take disproportionately longer on the Arm machine than on my laptop.

I would generally not expect an A4000 to be slower than a T2000 GPU when executing device code, i.e. the duration of a kernel itself. Some possibilities:

  • you are not properly measuring things. The best kernel duration measurements come from a profiler. Also, if you are mixing measurement of host code and device code together, that may further cloud the comparison. There are other possible sources of measurement error if you are deviating from good benchmarking practice.
  • you are compiling the code differently in each case, for example running a debug build (-G) on the A4000, but a release build on the T2000. I would also generally suggest compiling explicitly for the architecture in each case, e.g. -arch=sm_86 for the Ampere case and -arch=sm_75 for the Turing case. However, I think this item is unlikely to be a causal factor.
  • the code generation for the Ampere family GPU is unexpectedly worse than the code generation for the Turing family GPU.

There are probably other possibilities as well, such as not doing an apples-to-apples comparison (different work size, etc.) The above description is covering the device-code side only. When it comes to host code, I know of no reasons to conclude that an unidentified Arm processor will always be faster than an unidentified x86 processor. It might well be that the CUDA runtime API takes longer running on the Arm processor than on your particular x86 in your laptop.

Yes, I am sure you probably think that none of this applies to your case. In that case I have no further suggestions, absent any actual example. But you do have inspection tools like the profilers available.

So I have been messing around with this. It seems that the very first call to a CUDA kernel loads the entire CUDA binary. When I swap the order of the slow and fast functions, the slow function is MUCH slower and the fast function is much faster (still not as fast as I would expect, but no longer several orders of magnitude slower). Builds on both machines used the same CMake file, set to CMake release mode.

I will try with the specific arch parameters. But I was under the impression that if you don’t define one then it will automatically pick the host arch. Feel free to correct me if I am wrong in this.
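For reference, as far as I know nvcc does not detect the installed GPU by default; it targets a fixed default architecture unless -arch is given (or, in a CMake build, CMAKE_CUDA_ARCHITECTURES is set). One way to check what a kernel was actually built for is to compare the device's compute capability with the attributes the runtime reports for the kernel. The sketch below is only an illustration: `dummy` is a hypothetical placeholder for a real kernel, and device 0 is assumed to be the GPU of interest.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder; substitute a kernel from the real application.
__global__ void dummy() {}

int main() {
    // Compute capability of the GPU in the machine (device 0 assumed).
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;

    // Architecture versions the runtime associates with this kernel.
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, dummy) != cudaSuccess) return 1;

    std::printf("device: %s (sm_%d%d)\n", prop.name, prop.major, prop.minor);
    std::printf("kernel binaryVersion: %d, ptxVersion: %d\n",
                attr.binaryVersion, attr.ptxVersion);
    return 0;
}
```

A ptxVersion lower than the device's compute capability (for example 52 on an sm_86 GPU) would suggest the kernel was not built specifically for that GPU and was JIT-compiled from PTX at load time, which adds to first-call cost.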

I guess I was thrown off by the fact that on my Windows machine it didn’t seem to do this

It might be lazy loading in one case and not in the other. When doing careful analysis like this, it’s important to make as much as possible, including CUDA version, the same between the cases. See here. And again, if you were measuring kernel execution time with a profiler, this effect would be irrelevant. So I suspect your timing methods are confusing you.
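A minimal sketch of timing a kernel so that one-time startup costs (context creation, module loading) are excluded: do one untimed warm-up launch, then time repeated launches with CUDA events. The `scale` kernel, the problem size, and the iteration count are assumptions made up for illustration. Setting the environment variable CUDA_MODULE_LOADING=EAGER on both machines is another way to make loading behavior comparable.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel under test.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Print the error and leave main() if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main() {
    const int n = 1 << 20;
    const int block = 256;
    const int grid = (n + block - 1) / block;

    float *d = nullptr;
    CHECK(cudaMalloc(&d, n * sizeof(float)));
    CHECK(cudaMemset(d, 0, n * sizeof(float)));

    // Warm-up launch (not timed): absorbs context creation and
    // any lazy loading of the kernel's module.
    scale<<<grid, block>>>(d, n, 2.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Timed region: device-side time for `iters` back-to-back launches.
    const int iters = 100;
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < iters; ++i)
        scale<<<grid, block>>>(d, n, 2.0f);
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaGetLastError());

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    std::printf("average kernel time: %.4f ms\n", ms / iters);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d));
    return 0;
}
```

If the first (warm-up) launch is timed separately, the gap between it and the steady-state average gives a rough idea of how much of the original measurement was startup overhead.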

Ah, I think that’s it. The windows machine is 12.0 and the Linux machine is 12.2. If lazy loading only became default in 12.2 that sounds like that would be the problem

if you were measuring kernel execution time with a profiler,

I agree that if I was just measuring kernel execution time then this probably wouldn’t be an issue. Part of my evaluation was to time a full ‘module’, which in my case means the time it takes to allocate GPU memory, copy all the needed data from the CPU, run the kernel, and copy back the data that would be passed on to the next kernel.
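A rough sketch of how such a 'module' could be timed stage by stage with a host clock, so that one-time initialization does not get attributed to whichever function happens to run first. Everything here is illustrative: `addOne`, the helper `timeMs`, and the buffer size are hypothetical, and the real module would replace the kernel and copies.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the module's kernel.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Host wall-clock time, in milliseconds, taken by a callable.
template <typename F>
double timeMs(F &&f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n, 1.0f);
    float *dev = nullptr;

    // Force context creation first so it is reported on its own line.
    double tInit = timeMs([&] { cudaFree(nullptr); });
    double tAlloc = timeMs([&] { cudaMalloc(&dev, bytes); });
    double tH2D = timeMs([&] {
        cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    });
    // Launches are asynchronous, so synchronize inside the timed region.
    // With lazy loading, the first launch also includes module load time.
    double tKernel = timeMs([&] {
        addOne<<<(n + 255) / 256, 256>>>(dev, n);
        cudaDeviceSynchronize();
    });
    double tD2H = timeMs([&] {
        cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    });
    double tFree = timeMs([&] { cudaFree(dev); });

    // Coarse error check; a real benchmark would check every call.
    if (cudaGetLastError() != cudaSuccess || host[0] != 2.0f) {
        std::fprintf(stderr, "CUDA error or unexpected result\n");
        return 1;
    }

    std::printf("init   %8.3f ms\nalloc  %8.3f ms\nH2D    %8.3f ms\n"
                "kernel %8.3f ms\nD2H    %8.3f ms\nfree   %8.3f ms\n",
                tInit, tAlloc, tH2D, tKernel, tD2H, tFree);
    return 0;
}
```

Running the whole sequence twice in one process and reporting only the second pass is a simple way to see steady-state numbers next to the first-call numbers.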

In reality we didn’t need timings that were accurate to the millisecond. We just needed it to be faster than our CL version on the various machines we support.

Thanks again for the help 👍

If you are interested in understanding application-level performance, it is important to separate device (GPU) performance from host (CPU) performance. So do look at and compare kernel execution times between systems.

Ideally, in a GPU-accelerated application, the GPU works on the parallel portion of the workload, while the CPU takes care of the serial portion of the workload. As GPU performance historically has grown faster than CPU performance, the serial portion of a workload can become performance limiting to the overall application (see Amdahl’s Law). That is not just a theoretical concern but is observed in practice.
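As a reminder of what Amdahl's Law says, here is the usual form with a small worked example. The symbols are chosen for illustration: p is the fraction of runtime that can be accelerated and s is the speedup of that fraction; the numbers are made up, not measurements from this thread.

```latex
% Overall speedup when a fraction p of the runtime is accelerated by a factor s
S(s) = \frac{1}{(1 - p) + \dfrac{p}{s}}

% Example: p = 0.9, s = 10
S(10) = \frac{1}{0.1 + 0.09} \approx 5.26

% Limit as the accelerated part becomes infinitely fast
\lim_{s \to \infty} S(s) = \frac{1}{1 - p} = 10 \quad \text{for } p = 0.9
```

In other words, with 10% of the runtime left on the host, no GPU can deliver more than a 10x overall speedup, and a slower CPU makes that remaining 10% proportionally more expensive.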

Unfortunately, I still see people building GPU-accelerated systems where the host system is underpowered, with the two most common problems being low-frequency CPUs and undersized system memory. Host-device interconnect is another potential issue, but I have not personally come across that as a limiting factor.