cudaMalloc() vs cudaMallocManaged() wrt to cudaMemcpy()


CUDA guide states that in new NVIDIA GPU, kernel can run in parallel to DMA operations. I am testing this statement by running an experiment on GTX 1080.

This is my setup. There are two different application, A and B, both using CUDA. I am measuring the time taken by application A with and without application B running in the background. Application B just do cudaMemcpyAsync() to and from GPU (No kernel launches). Application A just does kernel launches (No cudaMemcpy()/cudaMemcpyAsync()). If the statement in CUDA guide is correct, the time taken by application A should be independent of whether application B is running or not in parallel.

When application B uses cudaMalloc() for allocating GPU memory, the time taken by application A with and without application B running in the background is same.

But when application B uses cudaMallocManaged(), the runtime of application A significantly increases as compared to when only application A is running.

I am making sure application A and B run on separate CPU cores. Also, each application has their own CUDA streams so I am not using cudaDeviceSynchronize().

Can someone explain to me why this is happening? You should be able to recreate this experiment on your end also.

It’s not typical to use cudaMemcpyAsync with an allocation created by cudaMallocManaged, so I don’t really understand your test.

But I can use cudaMemcpyAync() with cudaMallocManaged(), correct? It is giving me the functionally correct output (I use cudaMemcpy() sometime to reduce the page faults overhead, though I know I can use cudaMemPrefetch() for that. But I think it is a separate discussion as which to use when).

Anyways, though it might not be typical to use cudaMallocManaged() with cudaMemcpy(), but still as I believe it is a valid scenario, I can’t make sense of the behavior that I am seeing. Also, I believe you would see the same behaviour if I used cudaMemprefetch() instead of cudaMemcpyAync() because both are doing the same thing, doing a DMA operation underneath (though semantics are different)

cudaMemPrefetchAsync is the right tool to use if demand-paging is in effect. If demand-paging is not in effect, there’s no reason I can think of to use any of these with managed allocations, if your goal is movement of data between host and device.

Anyway, you seem to think that the right thing to do to others who may be trying to help you is to suggest that they spend a lot of time creating a test case which may or may not match what you are doing, when you already have the test case and it may be simply a matter of copy and paste for you to provide it.

You’re welcome to do what you want, of course, but if I were looking for help, I would try to make it as easy as possible for others to help me.

PFA the tarball. I had to write the test from scratch because my original test was integated with another project (So this code might seem unclean but it shows the difference in timing). But yes, it makes sense that it will be easier for me to send it to you.

Way to recreate test:

  1. Build (make)
  2. Run ./A alone. It takes around 100us on GTX 1080. Kill it.
  3. Run ./B_unmanaged. Now in another shell run ./A. A takes about 130us. That is okay I suppose. Almost same. Kill both processes.
    $) Run ./B_managed. Now in another shell run ./A. A now takes about 250-350 us. Almost 3x increase in time. Kill both processes.

test.tar (20 KB)

Based on what I see the kernel execution times don’t change (make your A app run for just a few iterations, run your B app, then profile a run of your A app.)

I think what you are witnessing is variability in the context switching times between the two processes.

In my case, I see a mix of execution times for the managed case, where sometimes the A measurement is the same as the unmanaged case, and sometimes it is ~3x higher. But the underlying kernel execution times are not changing.

Anyway, it’s not recommended to run multiple processes and context-switch between them if you want maximum performance/throughput. Instead, launch your work from a single process, or barring that, use CUDA MPS.

How about using ‘taskset’ command to give both processes (A and B) different cores? That is what I am doing to avoid them sharing cores. I am also using ‘schedtool’ command to give them real time priority.

So the command I am running is (my machine has 8 CPU threads):
sudo taskset -c 4-7 schedtool -R -p 99 -e ./A # Run A on 4-7 threads with 99 real time priority
sudo taskset -c 0-3 schedtool -R -p 1 -e ./B_managed # Run B on 0-3 threads with 1 real time priority

Can you try this on your end? Thanks.

Also, what do you mean “the underlying kernel execution times are not changing”? Do you mean the time as shown by nvprof?

Also, in actual scenario I am using MPS but I want some sort of isolation between two application (similar to having isolation between A and B).

What I am guessing is that there is some sort of locking between cudaMemcpy() and kernel launch (inside CUDA library), but only in case of cudaMallocManaged(). Is this true? Any way to bypass this?

Thanks for the quick responses though.

also, you haven’t done a proper cudaStreamCreate in your posted codes. Although I don’t think that is contributing anything anomalous here.

I think that when you define a global stream variable, it has a default constructor that gets called initializing it.

That’s not correct.