Why does cudaMalloc take longer on V100 vs M2075?


For the same code, I see that cudaMalloc takes longer on the Volta V100 GPU than on the Tesla M2075 GPU.

And the difference is quite large:

  • V100: cudaMalloc = 314.52 ms
  • M2075: cudaMalloc = 93.27 ms

Can someone explain why?

Thank you in advance,

Tesla M2075 is Fermi architecture.

I think the Kepler architecture first introduced unified memory. There may be more overhead involved in doing the necessary bookkeeping on newer architectures as a result.

Also: how did you benchmark this?

If you just did a cudaMalloc() in a main() function and timed that, you would most likely also be timing the creation of the CUDA context.
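A minimal sketch of this effect (assuming a CUDA-capable system and compilation with nvcc): the first cudaMalloc() absorbs the lazy context initialization, while a second, identical call measures the allocation alone.

```cpp
// Time the first cudaMalloc() (includes lazy context creation)
// versus a second one (allocation cost only).
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

static double timed_malloc(void **p, size_t bytes) {
    auto t0 = std::chrono::steady_clock::now();
    cudaMalloc(p, bytes);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    void *a = nullptr, *b = nullptr;
    // First call: triggers context creation (address-space mapping, etc.)
    printf("first  cudaMalloc: %.2f ms\n", timed_malloc(&a, 1 << 20));
    // Second call: measures the allocation by itself
    printf("second cudaMalloc: %.2f ms\n", timed_malloc(&b, 1 << 20));
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

On most systems the first call is orders of magnitude slower than the second, which is why an uncontrolled timing of "cudaMalloc" mostly measures initialization.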


[Sorry, slow typer here. I started composing my answer before cbuchner1’s answer was there. Please excuse the redundancy]

If you desire useful feedback, you would want to post the complete code you used to measure this. Otherwise it's impossible to know what you measured and how. Was your experiment a controlled one, i.e. other than swapping the GPU, were no other changes (hardware or software) made to the system?

I assume that you measured the execution time for the first instance of cudaMalloc() in your code. Since the CUDA runtime uses lazy initialization, what you actually measured in that case is mostly CUDA context initialization time. As part of initialization, CUDA maps all CPU memory and all GPU memory into a single unified address space, using appropriate operating system API calls. The more total memory there is in a system, the longer this mapping process will take.

The OS API calls used during CUDA context initialization represent serial code, and their execution speed will depend primarily on single-thread CPU performance and secondarily on system memory performance.


Thank you for your answers.

For more clarification: I used nvprof to get the cudaMalloc time, and I used the transpose kernel from the CUDA samples.

Sure, when I use CPU timers the initialization time will be included. But the values I showed are taken from the profiler.
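For reference, a sketch of how one might inspect this with nvprof (the `./transpose` binary name is an assumption based on the sample used). Note that even in the profiler's API view, the first cudaMalloc call still absorbs the lazy context initialization, since it is part of that call's wall-clock time.

```shell
# Per-call API trace: shows the duration of each individual runtime API call,
# so the oversized first cudaMalloc is visible separately from later ones.
nvprof --print-api-trace ./transpose

# Default summary view: cudaMalloc typically dominates the "API calls" section
# precisely because the first call includes context creation.
nvprof ./transpose
```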

@njuffa You say that "as part of initialization, CUDA maps all CPU memory and all GPU memory into a single unified address space". Isn't that only true if we allocate unified memory? Because there is no use of unified memory in the code.

“unified virtual address space” is different from “unified memory”. See the CUDA Programming Guide.

The former was introduced with compute capability 2.0 (Fermi architecture) if I recall correctly, the latter was introduced with compute capability 3.0 (Kepler architecture).
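A small sketch of the distinction (assuming a CUDA-capable system, CUDA 10 or later for the `type` field of cudaPointerAttributes): under unified virtual addressing, even a pointer from a plain cudaMalloc() lives in the single address space and can be queried from the host, with no cudaMallocManaged() (unified memory) involved.

```cpp
// UVA vs. unified memory: an ordinary device allocation already
// participates in the unified virtual address space.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void *d = nullptr;
    cudaMalloc(&d, 1024);  // plain device allocation, not managed memory

    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, d) == cudaSuccess) {
        // Under UVA this identifies the pointer as device memory and
        // reports which device it resides on.
        printf("memory type: %d, device: %d\n", (int)attr.type, attr.device);
    }
    cudaFree(d);
    return 0;
}
```

Unified memory (cudaMallocManaged) is a separate, later feature that lets the same allocation migrate between host and device; UVA only guarantees a single, non-overlapping address space.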