cudaMemcpyAsync + cudaDeviceSynchronize lead to lots of GPU page faults

Dear all:

I am running a simple code across 4 V100s that copies a large block of data from GPU n to GPU n+1 and then uses that data on each GPU.

I see lots of GPU page faults with it. Can anyone help me, please?

//copy from gpu n to gpu n+1
for (int gpuid = 0; gpuid < num_gpus; gpuid++)
{
  checkCudaErrors (cudaSetDevice (gpuid));
  if (gpuid > 0)
    checkCudaErrors (cudaMemcpyAsync (d_pool1V[gpuid] + sz / (2 * sizeof (float)),
                                      d_pool1V[gpuid - 1],
                                      int (fract * sz / 2),
                                      cudaMemcpyDefault));
}

//wait for all copies to finish
for (int gpuid = 0; gpuid < num_gpus; gpuid++)
{
  checkCudaErrors (cudaSetDevice (gpuid));
  checkCudaErrors (cudaDeviceSynchronize ());
}

//accessing the newly arrived data produces lots of GPU page faults

Some more complete code looks like this:

auto t1 = std::chrono::high_resolution_clock::now ();
for (int iter = 0; iter < iterations; ++iter)
{
  //copy from gpu n-1 to gpu n
  for (int gpuid = 0; gpuid < num_gpus; gpuid++)
  {
    checkCudaErrors (cudaSetDevice (gpuid));
    if (gpuid > 1)
    {
      checkCudaErrors (cudaMemcpyAsync (d_dataV[gpuid] + sz / (2 * sizeof (float)),
                                        d_dataV[gpuid - 1],
                                        int (fract * sz / 2),
                                        cudaMemcpyDefault));
//    checkCudaErrors (cudaMemPrefetchAsync (d_pool1V[gpuid], sz, gpuid));
    }
  }

  //sync
  for (int gpuid = 0; gpuid < num_gpus; gpuid++)
  {
    checkCudaErrors (cudaSetDevice (gpuid));
    checkCudaErrors (cudaDeviceSynchronize ());
  }

  //visit the newly arrived data
  for (int gpuid = 0; gpuid < num_gpus; gpuid++)
  {
    checkCudaErrors (cudaSetDevice (gpuid));
    ssyinitfloat <<< numberOfBlocks, threadsPerBlock >>> (d_dataV[gpuid], sz);
  }
}

If gpuid is 0, doesn’t this:

d_dataV[gpuid - 1]

generate invalid indexing?

No, I already guard it with if (gpuid > 1) to prevent that problem.

Are the allocations like d_dataV[…] created with cudaMallocManaged?

Yes, they are all allocated with cudaMallocManaged after calling cudaSetDevice on each GPU.
I have since resolved the problem by setting the cudaMemAdviseSetPreferredLocation hint, which eliminated all of those GPU page faults.
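
For reference, this is roughly what that fix looks like; a minimal sketch, assuming each d_dataV[gpuid] is sz bytes and should normally live on GPU gpuid:

for (int gpuid = 0; gpuid < num_gpus; gpuid++)
{
  checkCudaErrors (cudaSetDevice (gpuid));
  checkCudaErrors (cudaMallocManaged (&d_dataV[gpuid], sz));
  //advise the UM system that this allocation should ordinarily reside on this GPU
  checkCudaErrors (cudaMemAdvise (d_dataV[gpuid], sz,
                                  cudaMemAdviseSetPreferredLocation, gpuid));
}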

But one thing I don't understand is: why does a simple cudaMemcpy with the default advice generate so many page faults?

cudaMemcpy isn't really the correct API to use with managed allocations in a demand-paging environment.

The location of a managed allocation can vary (the runtime migrates it from one processor to another, on demand). cudaMemcpy simply moves data; it doesn't necessarily affect the runtime's opinion of where the data should be migrated to.

Use migration APIs such as cudaMemPrefetchAsync to migrate a managed allocation to its intended location, and use the memory hints API to help the runtime make these decisions.
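
For example (a sketch using the variable names from the code above, not a drop-in fix), each buffer can be migrated to its GPU before the kernels touch it:

for (int gpuid = 0; gpuid < num_gpus; gpuid++)
{
  checkCudaErrors (cudaSetDevice (gpuid));
  //migrate this GPU's buffer onto the GPU before any kernel touches it
  checkCudaErrors (cudaMemPrefetchAsync (d_dataV[gpuid], sz, gpuid));
}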

To simplify:

A managed allocation can be resident either on the host or on the device. cudaMemcpy doesn’t affect this.

My application needs to partition its working set evenly onto 4 GPUs, and most of the time each GPU accesses its own data.

But after each iteration, each GPU needs to copy some data from its neighbour and place the newly arrived data at the boundary of its own data, to form a contiguous buffer that is then fed into cuDNN for the next iteration.

So what should I do?

You can use cudaMemcpy, or even memcpy, to copy data from one managed allocation to another. But that doesn’t necessarily affect the preferred location of any of the allocations.

If you witness page faults, it's because the data being touched is not currently migrated to that processor. You'll need to fix that.

You might want to take a look at some of the information I already indicated, such as the memory hints API.
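
Putting the two together, one possible shape for the per-iteration exchange (a sketch under the assumptions above; the boundary region is the int (fract * sz / 2) bytes that were just copied in):

for (int gpuid = 1; gpuid < num_gpus; gpuid++)
{
  checkCudaErrors (cudaSetDevice (gpuid));
  //copy the neighbour's boundary data into this GPU's buffer
  checkCudaErrors (cudaMemcpyAsync (d_dataV[gpuid] + sz / (2 * sizeof (float)),
                                    d_dataV[gpuid - 1],
                                    int (fract * sz / 2),
                                    cudaMemcpyDefault));
  //then migrate the freshly written region to this GPU so the next
  //kernel launch does not fault on it
  checkCudaErrors (cudaMemPrefetchAsync (d_dataV[gpuid] + sz / (2 * sizeof (float)),
                                         int (fract * sz / 2), gpuid));
}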

Thank you. Just as you mentioned, I removed all of these page faults by setting the preferred-location hint on the managed memory.

My concern is: can cudaMemcpyAsync really lead to page faults? Its semantics are very different from a direct load/store. A direct load/store means I want the original data, so for loads and stores it makes sense to migrate the data to the new GPU.
But cudaMemcpyAsync means I just want the contents sent to another GPU; I don't want the original data there. So why does it raise page faults?

No, it doesn’t. Not in a UM system.

cudaMemcpyAsync copies data from one allocation to another allocation.

In traditional CUDA, copying data to an allocation also implied the device, because an allocation is associated with a device.

In unified memory (UM), a UM allocation does not necessarily imply a particular device. The location of the allocation is movable, and is controlled by the UM system. Therefore, cudaMemcpyAsync doesn’t explicitly indicate anything about devices.
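
To illustrate the contrast (a minimal sketch, not code from this thread):

//traditional CUDA: the allocation has a fixed home on the current device
checkCudaErrors (cudaSetDevice (0));
float *d0;
checkCudaErrors (cudaMalloc (&d0, sz));           //always resident on GPU 0

//unified memory: the allocation has no fixed home; the UM system
//migrates it on demand to whichever processor touches it
float *m;
checkCudaErrors (cudaMallocManaged (&m, sz));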

Regardless of the above statements, cudaMemcpyAsync itself doesn't cause page faults. In demand-paging UM, page faults arise when either the host (CPU) or a device (GPU) touches an allocation that is not currently migrated to that processor.