Large allocations with cudaMallocManaged slow down synchronization

Hi,

For my work I need to handle large data sets, and I'm starting to use unified memory (UMA). I want to use two cards, so I start two streams. Sometimes I need to synchronize the results, and I observe that cudaDeviceSynchronize becomes dramatically slow. The cause seems to be the large amount of memory allocated with cudaMallocManaged. The following code took about 40 s on my system (2x 2080 Ti, i9-7900X):

#include <iostream>

float* x;

__global__ void writeZero2Zero(float* p)
{
    p[0] = 0;
}

int main()
{
    cudaMallocManaged(&x, 1024ull * 1024 * 1024 * 8);   // 8 GiB managed allocation
    for (int i = 0; i < 100; i++)
    {
        std::cout << i << "\n";
        dim3 block(1), grid(1);
        writeZero2Zero<<<grid, block>>>(x);
        cudaDeviceSynchronize();
    }
    return 0;
}

The speed of cudaDeviceSynchronize seems to correlate linearly with the amount of allocated memory.

Can anybody explain this behavior to me?

bye,
loki

Which operating system? (I'm guessing Windows.)

Yes, it's Windows 10, CUDA 10.2, driver version 456.71, Visual Studio 2019.

On Windows, a managed memory allocation (if not already migrated to the device) is migrated en masse (all at once) to the device at any kernel launch point. After a kernel launch, the entire allocation is migrated en masse back to host memory at the point of the next cudaDeviceSynchronize() call. You have an 8 GB allocation, and because of the construction of your loop body, on each pass through the loop the entire 8 GB is migrated to device memory and then migrated back to host memory. Over a PCIe gen3 link, each of these two operations takes the better part of a second. I'm actually a bit surprised it only takes 40 seconds to run the loop; I would have expected over 100 seconds.
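As a rough sanity check on that estimate, a back-of-envelope sketch (the ~12 GB/s effective PCIe gen3 throughput is an assumed figure):

#include <cstdio>

int main()
{
    const double bytes     = 8.0 * 1024 * 1024 * 1024;   // the 8 GiB managed allocation
    const double bandwidth = 12e9;                        // assumed effective PCIe gen3 bytes/s
    const double oneWay    = bytes / bandwidth;           // one en-masse migration, one direction
    std::printf("one direction: ~%.2f s, 100 iterations both ways: ~%.0f s\n",
                oneWay, 100.0 * 2.0 * oneWay);            // roughly 0.7 s and 140 s
    return 0;
}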

If you want to see better performance than that, move the cudaDeviceSynchronize() call to a point just after the closing curly-brace of the for loop body.
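Applied to your code, that would look something like this sketch (same loop, relying on the behavior described above):

for (int i = 0; i < 100; i++)
{
    std::cout << i << "\n";
    dim3 block(1), grid(1);
    writeZero2Zero<<<grid, block>>>(x);   // launches queue up; the data stays resident on the device
}
cudaDeviceSynchronize();                  // one migration back to host, after the loop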

Thank you for the information. I guess UMA is thus not an option for me. The simple example above was just to reproduce the issue. What I actually need is something like this:

float a[2];
float x[1 << 30];     // i.e. 2^30 elements

for (i = 0; i < 2000; ++i)
{
    setDevice(0);
    kernelA(a);       // work on a[0]
    setDevice(1);
    kernelA(a + 1);   // work on a[1]
    sync();           // kernelB needs the entire a, so a sync is needed
    setDevice(0);
    kernelB(a, x);
    setDevice(1);
    kernelB(a, x);
}

So moving cudaDeviceSynchronize outside the loop wouldn't work.

Just out of curiosity:

  • You say that the entire allocation is migrated to device memory, but I can run the simple example even after allocating 13 GiB for x, while the 2080 Ti only has 11 GiB - isn't that a contradiction?

  • Regarding the transfer speed: I measured ~11 GiB/s, which is ~43 s / 800 GiB. Could it be that for some reason the transfer only happened in one direction?

Bye,
Ingo

In general, documentation for managed memory behavior is in the CUDA C++ programming guide, section L. Not sure how to find the CUDA C++ programming guide? Go to docs.nvidia.com and click on the “CUDA Toolkit Documentation” link. Hopefully you can navigate from there.

In your original question, your allocation amount seemed to be below the memory size of your GPU. Therefore it’s possible the explanation I had given you was approximately correct (although I believe I made a mistake - see below).

Managed memory on Windows follows the pre-Pascal regime/behavioral description. Even so, there are at least two behavioral possibilities in a multi-GPU scenario.

First, the system topology may permit both GPUs to be in a peer relationship. In that case, the data will be migrated en masse, at kernel launch, to the GPU that needs it. If both GPUs need it, one processor will be given a peer mapping to the data on the other processor.

Second, if the system topology does not permit the GPUs to be in a peer relationship, the allocation will instead be made as if it were a pinned host allocation, and both GPUs will be given a pointer to that host allocation. In this case no migration of data is necessary.

Based on your initial description I assumed we were in the first category. However, in the first category, if you try to allocate more than the memory available on the GPU, you get an out-of-memory error from the cudaMallocManaged call - I just tested that on my Windows system and that is what I get.

Now that you report that you can successfully allocate more memory than that, I would assume your GPUs cannot be placed into a peer relationship, which would put them into the second category. In that case, however, the whole statement about migration does not hold water. Unfortunately, I don't have a system exactly like that to test on.
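One way to check which category a given system falls into is to query peer accessibility directly. A minimal sketch (assuming the two GPUs enumerate as devices 0 and 1):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);   // can device 0 access device 1's memory?
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);   // can device 1 access device 0's memory?
    std::printf("peer access 0->1: %d, 1->0: %d\n", canAccess01, canAccess10);
    return 0;
}

If both queries return 1, the GPUs can be placed in a peer relationship (the first category described above); otherwise you are in the second category.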

I would also like to amend a previous remark that I believe was in error. In the first category, data migration occurs en masse to the GPU at the point of kernel launch. However, I misspoke about the return path. After a subsequent cudaDeviceSynchronize() it is true that the data can be accessed again in host memory, but I'm not certain there is an en-masse data copy. I think the copy in this direction can still be demand-paged. The net effect would be that the forced transfers only happen in one direction, which would seem to align with your transfer-rate calculation (but is inconsistent with the other data point).

Anyway I’m afraid I may have just confused matters.

Moving on to your question: I believe you can still use unified memory in the case you describe. As we have now discussed, cudaDeviceSynchronize() has the special characteristic in the pre-Pascal regime (i.e. the Windows regime) that it makes the data visible again on the host, and it will therefore trigger a transfer of the data at subsequent kernel calls.

But cudaDeviceSynchronize() is not the only way to enforce what you want. First of all if you only had 1 GPU, stream semantics would suffice. With 2 GPUs, you could try using cudaStreamSynchronize(), and if that doesn’t work I’m fairly confident that using events would.

The cudaStreamSynchronize() method would look like this:

setDevice(0);
kernelA(a, stream1);      // work on a[0]
setDevice(1);
kernelA(a+1, stream2);    // work on a[1]
setDevice(0);
cudaStreamSynchronize(stream1);   // kernelB needs the entire a, so sync is needed
setDevice(1);
cudaStreamSynchronize(stream2);
setDevice(0);
kernelB(a, x, stream1);
setDevice(1);
kernelB(a, x, stream2);

the event method would look like this:

setDevice(0);
kernelA(a, stream1);                 // work on a[0]
cudaEventRecord(evt0, stream1);
setDevice(1);
kernelA(a+1, stream2);               // work on a[1]
cudaEventRecord(evt1, stream2);
cudaStreamWaitEvent(stream1, evt1);  // stream1 (device 0) waits for device 1's kernelA
cudaStreamWaitEvent(stream2, evt0);  // stream2 (device 1) waits for device 0's kernelA
setDevice(0);
kernelB(a, x, stream1);
setDevice(1);
kernelB(a, x, stream2);

Just as your code is shorthand, I'm only trying to give shorthand ideas here; these are not exact recipes.
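If it helps, here is a fuller sketch of the event idea - still illustrative rather than an exact recipe. The kernels are trivial placeholders, error checking is omitted, and I've added a second pair of events so that one device's kernelB has finished reading a before the other device's kernelA overwrites it on the next iteration:

#include <cstdio>

// Trivial placeholder kernels standing in for the real kernelA / kernelB.
__global__ void kernelA(float* a)                 { a[0] += 1.0f; }
__global__ void kernelB(const float* a, float* x) { x[0] = a[0] + a[1]; }

int main()
{
    float *a, *x;
    cudaMallocManaged(&a, 2 * sizeof(float));
    cudaMallocManaged(&x, 2 * sizeof(float));

    cudaStream_t s0, s1;
    cudaEvent_t evtA0, evtA1, evtB0, evtB1;
    cudaSetDevice(0); cudaStreamCreate(&s0); cudaEventCreate(&evtA0); cudaEventCreate(&evtB0);
    cudaSetDevice(1); cudaStreamCreate(&s1); cudaEventCreate(&evtA1); cudaEventCreate(&evtB1);

    for (int i = 0; i < 2000; ++i)
    {
        cudaSetDevice(0);
        kernelA<<<1, 1, 0, s0>>>(a);            // work on a[0]
        cudaEventRecord(evtA0, s0);
        cudaSetDevice(1);
        kernelA<<<1, 1, 0, s1>>>(a + 1);        // work on a[1]
        cudaEventRecord(evtA1, s1);

        cudaStreamWaitEvent(s0, evtA1, 0);      // device 0's stream waits for device 1's kernelA
        cudaStreamWaitEvent(s1, evtA0, 0);      // device 1's stream waits for device 0's kernelA

        cudaSetDevice(0);
        kernelB<<<1, 1, 0, s0>>>(a, x);         // needs all of a
        cudaEventRecord(evtB0, s0);
        cudaSetDevice(1);
        kernelB<<<1, 1, 0, s1>>>(a, x + 1);     // needs all of a
        cudaEventRecord(evtB1, s1);

        cudaStreamWaitEvent(s0, evtB1, 0);      // next iteration's kernelA on device 0 must not
        cudaStreamWaitEvent(s1, evtB0, 0);      // overwrite a before the other kernelB has read it
    }

    cudaSetDevice(0); cudaDeviceSynchronize();  // sync once per device, outside the loop,
    cudaSetDevice(1); cudaDeviceSynchronize();  // before touching managed data on the host
    std::printf("x[0]=%f x[1]=%f\n", x[0], x[1]);
    return 0;
}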

Thank you very much again. Actually I'm working with streams anyway, so I know that cudaStreamSynchronize is just as slow as cudaDeviceSynchronize. But I'll try the event method and let you know how it works.

Bye,
Ingo

Going back to your observation: I've just run a test case on a system with an RTX 2070 on Windows 10, CUDA 11.0, and I'm not able to reproduce anything like your observation of 40 seconds for a loop with 100 kernel calls. So I think something else is going on here.

Just to be sure:

#include <time.h> 
#include <iostream>

__global__ void writeZero2Zero(float* p)
{
    p[0] = 0;
}

int main()
{
    float* x;
    cudaMallocManaged(&x, 1024ull * 1024 * 1024 * 8);
    time_t start = time(NULL);
    for (int i = 0; i < 100; i++)
    {
        dim3 block(1), grid(1);
        writeZero2Zero<<<grid, block>>>(x);
        cudaDeviceSynchronize();
    }
    std::cout << difftime(time(NULL), start) << "\n";
    return 0;
}

→ 39

When I run that code (but changing 8 to 4) I get:

0
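Side note: time()/difftime() only resolves whole seconds, so "0" just means under a second. If you want a finer-grained number for the fast case, something like this sketch with std::chrono would give it (same kernel and loop, with the 4 GiB size I used):

#include <chrono>
#include <iostream>

__global__ void writeZero2Zero(float* p)
{
    p[0] = 0;
}

int main()
{
    float* x;
    cudaMallocManaged(&x, 1024ull * 1024 * 1024 * 4);   // 4 GiB, the fast case
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 100; i++)
    {
        writeZero2Zero<<<1, 1>>>(x);
        cudaDeviceSynchronize();
    }
    auto stop = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration<double>(stop - start).count() << " s\n";
    return 0;
}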

For me, anyway, it would be interesting if you could run your executable from a command prompt where the environment variable:

CUDA_VISIBLE_DEVICES=0

is set. i.e.

  1. set the environment var
  2. open a new command prompt
  3. run your compiled executable from above
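For example, from a command prompt (myapp.exe is a placeholder for your executable name):

set CUDA_VISIBLE_DEVICES=0
myapp.exe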

When setting CUDA_VISIBLE_DEVICES to any value, I get 0 as well.

EDIT:

When setting it to 0,1, it is slow again. I forgot that CUDA_VISIBLE_DEVICES defines a list, not just a number.

So your observation is a function of the dual-GPU behavior on your particular machine, under the Windows managed memory system. That's what I had suspected; however, it's still not clear to me exactly what is going on. Based on the timing, though, it seems like some migration is occurring per loop iteration. To eliminate that, I would avoid synchronization inside the loop, and instead use stream semantics to guarantee the consistency of data as you pass it from one kernel to the next.

And if you have two GPUs, the only sensible choices for CUDA_VISIBLE_DEVICES are:

  • 0
  • 1
  • 0,1
  • 1,0

You don't want to set it to just "any value". If you're not sure what it does, it's documented here.