[Sorry, slow typer here. I started composing my answer before cbuchner1’s answer was there. Please excuse the redundancy]
If you want useful feedback, you should post the complete code you used to measure this. Otherwise it's impossible to know what you measured and how. Was your experiment a controlled experiment, i.e. other than swapping the GPU, were no other changes (hardware or software) made to the system?
I assume that you measured the execution time of the first cudaMalloc() call in your code. Since the CUDA runtime uses lazy initialization, what you actually measured in that case is mostly CUDA context initialization time. As part of initialization, CUDA maps all CPU memory and all GPU memory into a single unified virtual address space, using the appropriate operating system API calls. The more total memory there is in the system, the longer this mapping process takes.
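One way to separate the two costs is to trigger context initialization explicitly before the first real allocation. The cudaFree(0) idiom forces the lazy initialization to happen at a point of your choosing. Here is a minimal timing sketch (my own, not taken from your code) assuming a working CUDA toolchain; compile with nvcc:

```cpp
// Time CUDA context initialization separately from cudaMalloc() itself.
// cudaFree(0) is a common idiom that forces the runtime's lazy
// context creation without allocating anything.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    using clk = std::chrono::steady_clock;

    // First CUDA runtime call: this pays the context-creation cost.
    auto t0 = clk::now();
    cudaFree(0);
    auto t1 = clk::now();

    // Context already exists, so this measures cudaMalloc() alone.
    void *d = nullptr;
    auto t2 = clk::now();
    cudaMalloc(&d, 1 << 20);   // allocate 1 MiB of device memory
    auto t3 = clk::now();
    cudaFree(d);

    printf("context init: %.1f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());
    printf("cudaMalloc  : %.1f ms\n",
           std::chrono::duration<double, std::milli>(t3 - t2).count());
    return 0;
}
```

If the large time you observed moves from the cudaMalloc() line to the cudaFree(0) line, that confirms you were measuring initialization, not allocation.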
The OS API calls made during CUDA context initialization are serial code, so their execution speed depends primarily on single-thread CPU performance and secondarily on system memory performance.