cudaMemcpyAsync + cudaDeviceSynchronize lead to lots of GPU page faults

Dear all:

I am running a simple code across 4 V100s that copies a large block of data from GPU n to GPU n+1 and then uses that data on each GPU.

I see lots of GPU page faults with it. Can anyone help me, please?

//copy from gpu n to gpu n+1
for (int gpuid = 0; gpuid < num_gpus; gpuid++)
{
  checkCudaErrors (cudaSetDevice (gpuid));
  if (gpuid > 0)
    checkCudaErrors (cudaMemcpyAsync (d_pool1V[gpuid] + sz / (2 * sizeof (float)),
                                      d_pool1V[gpuid - 1],
                                      int (fract * sz / 2),
                                      cudaMemcpyDefault));
}

//wait for all copies to finish
for (int gpuid = 0; gpuid < num_gpus; gpuid++)
{
  checkCudaErrors (cudaSetDevice (gpuid));
  checkCudaErrors (cudaDeviceSynchronize ());
}

//accessing the newly arrived data produces lots of GPU page faults

Some more complete code looks like this:

auto t1 = std::chrono::high_resolution_clock::now ();
for (int iter = 0; iter < iterations; ++iter)
{
  //copy from gpu n-1 to gpu n
  for (int gpuid = 0; gpuid < num_gpus; gpuid++)
  {
    checkCudaErrors (cudaSetDevice (gpuid));
    if (gpuid > 1)
    {
      checkCudaErrors (cudaMemcpyAsync (d_dataV[gpuid] + sz / (2 * sizeof (float)),
                                        d_dataV[gpuid - 1],
                                        int (fract * sz / 2),
                                        cudaMemcpyDefault));
//    checkCudaErrors (cudaMemPrefetchAsync (d_pool1V[gpuid], sz, gpuid));
    }
  }

  //sync
  for (int gpuid = 0; gpuid < num_gpus; gpuid++)
  {
    checkCudaErrors (cudaSetDevice (gpuid));
    checkCudaErrors (cudaDeviceSynchronize ());
  }

  //visit the newly arrived data
  for (int gpuid = 0; gpuid < num_gpus; gpuid++)
  {
    checkCudaErrors (cudaSetDevice (gpuid));
    ssyinitfloat <<< numberOfBlocks, threadsPerBlock >>> (d_dataV[gpuid], sz);
  }
}

If gpuid is 0, doesn’t this:

d_dataV[gpuid - 1]

generate invalid indexing?

No, I already guard it with if (gpuid > 1) to prevent that problem.

Are the allocations like d_dataV[…] created with cudaMallocManaged?

Yes, they are all allocated with cudaMallocManaged after calling cudaSetDevice on each GPU.
I have since resolved the problem by setting the cudaMemAdviseSetPreferredLocation hint, which eliminated all of those GPU page faults.
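
For reference, this is roughly what that fix looks like; a minimal sketch, assuming each d_dataV[gpuid] is sz bytes and should normally live on GPU gpuid:

for (int gpuid = 0; gpuid < num_gpus; gpuid++)
{
  checkCudaErrors (cudaSetDevice (gpuid));
  checkCudaErrors (cudaMallocManaged (&d_dataV[gpuid], sz));
  //advise the UM system that this allocation should ordinarily reside on this GPU
  checkCudaErrors (cudaMemAdvise (d_dataV[gpuid], sz,
                                  cudaMemAdviseSetPreferredLocation, gpuid));
}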

But one thing I don't understand is: why does a simple cudaMemcpy with the default advice generate so many page faults?

cudaMemcpy isn't really the correct API to use with managed allocations in a demand-paging environment.

The location of a managed allocation can vary (the runtime migrates it from one processor to another, on demand). cudaMemcpy simply moves data; it doesn't necessarily affect the runtime's opinion of where the data should be migrated to.

Use migration APIs such as cudaMemPrefetchAsync to migrate a managed allocation to its intended location, and use the memory hints API to help the runtime make these decisions.
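
For example (a sketch using the variable names from the code above, not a drop-in fix), each buffer can be migrated to its GPU before the kernels touch it:

for (int gpuid = 0; gpuid < num_gpus; gpuid++)
{
  checkCudaErrors (cudaSetDevice (gpuid));
  //migrate this GPU's buffer onto the GPU before any kernel touches it
  checkCudaErrors (cudaMemPrefetchAsync (d_dataV[gpuid], sz, gpuid));
}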

To simplify:

A managed allocation can be resident either on the host or on the device. cudaMemcpy doesn’t affect this.

My application needs to partition its working set evenly onto 4 GPUs, and most of the time each GPU accesses its own data.

But after each iteration, each GPU needs to copy some data from its neighbour and place the newly arrived data at the boundary of its own data, to form a contiguous buffer that is then fed into cuDNN for the next iteration.

So what should I do?

You can use cudaMemcpy, or even memcpy, to copy data from one managed allocation to another. But that doesn’t necessarily affect the preferred location of any of the allocations.

If you witness page faults, it's because the data being touched is not currently migrated to that processor. You'll need to fix that.

You might want to take a look at some of the information I already indicated, such as the memory hints API.
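
Putting the two together, one possible shape for the per-iteration exchange (a sketch under the assumptions above; the boundary region is the int (fract * sz / 2) bytes that were just copied in):

for (int gpuid = 1; gpuid < num_gpus; gpuid++)
{
  checkCudaErrors (cudaSetDevice (gpuid));
  //copy the neighbour's boundary data into this GPU's buffer
  checkCudaErrors (cudaMemcpyAsync (d_dataV[gpuid] + sz / (2 * sizeof (float)),
                                    d_dataV[gpuid - 1],
                                    int (fract * sz / 2),
                                    cudaMemcpyDefault));
  //then migrate the freshly written region to this GPU so the next
  //kernel launch does not fault on it
  checkCudaErrors (cudaMemPrefetchAsync (d_dataV[gpuid] + sz / (2 * sizeof (float)),
                                         int (fract * sz / 2), gpuid));
}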

Thank you. Just as you mentioned, I removed all of these page faults by setting the preferred-location hint on the managed memory.

My concern is: can cudaMemcpyAsync really lead to page faults? Its semantics are very different from a direct load/store. A direct load/store means I want the original data, so for loads and stores it makes sense to migrate the data to the new GPU.
But cudaMemcpyAsync means I just want the contents sent to another GPU; I don't want the original data there. So why does it raise page faults?

No, it doesn’t. Not in a UM system.

cudaMemcpyAsync copies data from one allocation to another allocation.

In traditional CUDA, copying data to an allocation also implied the device, because an allocation is associated with a device.

In unified memory (UM), a UM allocation does not necessarily imply a particular device. The location of the allocation is movable, and is controlled by the UM system. Therefore, cudaMemcpyAsync doesn’t explicitly indicate anything about devices.
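
To illustrate the contrast (a minimal sketch, not code from this thread):

//traditional CUDA: the allocation has a fixed home on the current device
checkCudaErrors (cudaSetDevice (0));
float *d0;
checkCudaErrors (cudaMalloc (&d0, sz));           //always resident on GPU 0

//unified memory: the allocation has no fixed home; the UM system
//migrates it on demand to whichever processor touches it
float *m;
checkCudaErrors (cudaMallocManaged (&m, sz));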

Regardless of the above statements, cudaMemcpyAsync itself doesn't cause page faults. In demand-paging UM, page faults arise when either the host (CPU) or a device (GPU) touches an allocation that is not currently migrated to that processor.