CUDA 10 upgrade for K2200 from CUDA 8 reducing performance drastically

I have GPU code compiled with CUDA 8 that takes around 420 ms on a K2200. When I run the same code on the same hardware after upgrading to CUDA 10, the execution time increases drastically, to 550 ms.

After running the NVIDIA profiler, I could see that cudaMalloc and some other CUDA APIs (cudaStreamCreate, cudaStreamSynchronize, cudaMemcpyAsync, cudaLaunch) were taking more time, though my kernels showed some improvement.

Any ideas…?

It is unclear what exactly you measured and how. Did you account for cold-start effects and measurement noise, e.g. by running multiple times in a loop and recording the best of n runs? Did you make sure the machine was idle before you started to run your application?
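As a rough illustration of the "best of n runs" approach, here is a minimal host-side sketch in C++. The names `TimingStats` and `time_runs` are hypothetical, chosen only for this example, and the workload is passed in as a callable.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <limits>

// Summary of repeated timings, in milliseconds.
struct TimingStats {
    double best_ms;  // fastest timed run: the one least disturbed by noise
    double mean_ms;  // average over the timed runs
};

// Hypothetical helper: runs `work` `warmup` times untimed to absorb
// cold-start cost, then `n` times timed with a monotonic clock.
TimingStats time_runs(const std::function<void()>& work, int n, int warmup = 1) {
    assert(n > 0 && warmup >= 0);
    using clock = std::chrono::steady_clock;

    for (int i = 0; i < warmup; ++i) work();  // not included in the statistics

    double best = std::numeric_limits<double>::max();
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        const auto t0 = clock::now();
        work();
        const auto t1 = clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        best = std::min(best, ms);
        sum += ms;
    }
    return {best, sum / n};
}
```

When the callable wraps CUDA work, it should end with cudaDeviceSynchronize() so that asynchronous kernel launches and copies fall inside the measured interval. Reporting the best run alongside the mean makes it easier to see how much of a difference is noise.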

Your description suggests that the slowdown is on the host side. How far apart were the performance measurements taken? Did you take one set of measurements just before you installed CUDA 10, and the next one just after it?

Host performance can be influenced by many factors. You state that host hardware did not change. But host performance could also change due to environmental factors, for example because of dynamic CPU clocks.

It is certainly possible that additional functionality in CUDA has increased the amount of host-side work necessary in the driver and runtime. I don’t use CUDA 10, but I have not seen any reports suggesting that this is the case. I would assume that if a fairly significant slowdown had occurred, someone else would have reported it (CUDA 10 has been out for a while).

What is the host platform?

Yes, I accounted for the cold start. The data I am sharing is the average of five executions, excluding the first one.
How do I check the idle state?
I was previously running CUDA 8, so yes, these executions are from right before and right after installing CUDA 10, respectively.
But dynamic CPU clocks would have affected my previous setup too…
Timings, CUDA 10 vs CUDA 8:
cudaMalloc-> 82.118ms vs 26.075ms
cudaFree-> 50.893ms vs 17.503ms
cudaLaunch-> 14.372ms vs 13.115ms
My kernels have improved: 298.2856 vs 300.1971

OK, one difference: nvidia-smi on my setups shows:
cuda 10: K2200 36C
cuda 8: K2200 39C

Detected 1 CUDA Capable device(s)

Device 0: “Quadro K2200”
CUDA Driver Version / Runtime Version 10.0 / 10.0
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 4043 MBytes (4239785984 bytes)
( 5) Multiprocessors, (128) CUDA Cores/MP: 640 CUDA Cores
GPU Max Clock rate: 1124 MHz (1.12 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1


Device 0: “Quadro K2200”
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 4041 MBytes (4237557760 bytes)
( 5) Multiprocessors, (128) CUDA Cores/MP: 640 CUDA Cores
GPU Max Clock rate: 1124 MHz (1.12 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Quadro K2200

CUDA 8 --------------------------------------------
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel® Xeon® CPU E5-1620 v3 @ 3.50GHz
Stepping: 2
CPU MHz: 1199.980
BogoMIPS: 6984.08
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 10240K
NUMA node0 CPU(s): 0-7

CUDA 10--------------------------------------
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel® Xeon® CPU E5-1650 v3 @ 3.50GHz
Stepping: 2
CPU MHz: 1199.980
BogoMIPS: 6983.91
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-11

You said it is the same hardware. But these are two different systems with two different CPUs: a 4-core Xeon E5-1620 v3 for CUDA 8 and a 6-core Xeon E5-1650 v3 for CUDA 10. And by your own statements and measurements, the difference is on the runtime API side, not the device code side.
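A quick tally of the figures posted above supports that reading. The sketch below is only arithmetic on those numbers. It assumes the first value in each posted pair is the CUDA 10 timing and the second the CUDA 8 timing (consistent with the kernels being described as improved), and the names `Timing`, `posted_api_timings`, and `total_increase_ms` are made up for this example.

```cpp
#include <cassert>
#include <vector>

struct Timing {
    const char* name;
    double cuda10_ms;  // assumed: first number in each posted pair
    double cuda8_ms;   // assumed: second number in each posted pair
};

// Host-side API figures copied from the post above.
std::vector<Timing> posted_api_timings() {
    return {
        {"cudaMalloc", 82.118, 26.075},
        {"cudaFree",   50.893, 17.503},
        {"cudaLaunch", 14.372, 13.115},
    };
}

// Sum of (CUDA 10 - CUDA 8) over all rows, in milliseconds.
double total_increase_ms(const std::vector<Timing>& rows) {
    assert(!rows.empty());
    double sum = 0.0;
    for (const Timing& r : rows) sum += r.cuda10_ms - r.cuda8_ms;
    return sum;
}
```

Calling total_increase_ms(posted_api_timings()) gives about 90.7 ms (56.0 + 33.4 + 1.3), against a reported end-to-end increase of roughly 130 ms (550 minus 420), while the kernel figures differ by under 2. The remaining 40 ms or so would have to come from calls for which no numbers were posted, or from measurement noise.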

In my view, it is certainly possible that differences in CPU behavior give rise to the difference you are trying to track down.

Amdahl’s Law tells us that the acceleration achievable for a partially parallelized task is limited by the serial portion.

To first order, the host-side work of any CUDA API function is such serial work, which therefore benefits from high single-thread CPU performance. Your two different setups are both a good starting point in that regard, as the base frequency for both is 3.5 GHz, but it seems one is faster in practice.
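For reference, Amdahl's Law in its usual form is shown below, where p is the fraction of the run time that is accelerated and s is the speedup of that fraction. The worked numbers are a rough sketch using figures from this thread, under the assumption that the roughly 300 ms of kernel time is the accelerated portion and the rest of the 420 ms CUDA 8 run is serial host-side work.

```latex
S(s) = \frac{1}{(1 - p) + \frac{p}{s}}, \qquad \lim_{s \to \infty} S(s) = \frac{1}{1 - p}

% CUDA 8 run: about 300 ms of kernel time in a 420 ms total (assumed split)
p \approx \frac{300}{420} \approx 0.71, \qquad S_{\max} = \frac{1}{1 - p} \approx \frac{420}{120} = 3.5
```

Under that assumption, even an infinitely fast GPU would make the whole run at most about 3.5 times faster, and any growth in the serial host-side part lowers that ceiling further.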

Detailed analysis needs to use controlled experiments, where exactly one variable is changed for any given experiment. Here you changed at least two: the CUDA version and the CPU.