Performance Issue using cudaMemPrefetchAsync


I’m testing a task-based runtime which uses Unified Memory to run CUDA kernels. The problem I have is that when I add prefetch operations, performance degrades a lot.

The problem only appears when I need to run a lot of kernels (hundreds/thousands); when only a few kernels are involved there is no issue. I have tested it with a microbenchmark, which divides an array of data into a number of blocks and runs a kernel for each block. With many small blocks, the performance is horrible.

#include <cstdio>
#include <cuda_runtime.h>

#define DEVICE 0

__global__ void kernel(long int n, float *x, float *y){
        long int i = blockIdx.x * blockDim.x + threadIdx.x;
        if(i < n) x[i] = x[i] + y[i];
}

void init_data(long int N, float *x, float *y){
        for(long int i = 0; i < N; i++){
                x[i] = i;
                y[i] = i+2;
        }
}

void call_kernels(long int N, long int BS, float *x, float *y){
        long int blockSize = 128;
        long int gridSize = (BS + blockSize - 1) / blockSize;

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        for (long int i = 0; i < N; i += BS){
                cudaMemPrefetchAsync(x, BS*sizeof(float), DEVICE, stream);
                cudaMemPrefetchAsync(y, BS*sizeof(float), DEVICE, stream);
                kernel<<<gridSize, blockSize>>>(BS, &x[i], &y[i]);
                cudaMemPrefetchAsync(x, BS*sizeof(float), cudaCpuDeviceId, stream);
                cudaMemPrefetchAsync(y, BS*sizeof(float), cudaCpuDeviceId, stream);
        }
        cudaStreamDestroy(stream);
}

void check_results(long int N, float *x){
        for(long int i = 0; i < N; ++i){
                if(x[i] != i+i+2){
                        printf("Error when checking results in position %ld, is %f, should be %ld\n", i, x[i], i+i+2);
                }
        }
        printf("Results are correct\n");
}

int main(){
        long unsigned int size = (long int) 2 * 1024 * 1024 * 1024 / sizeof(float); // two GB of floats

        float *x, *y;

        cudaMallocManaged((void **) &x, size*sizeof(float), cudaMemAttachGlobal);
        cudaMallocManaged((void **) &y, size*sizeof(float), cudaMemAttachGlobal);

        long int BS = size / 8192;
        init_data(size, x, y);
        call_kernels(size, BS, x, y);
        cudaDeviceSynchronize(); // wait for all outstanding async work before checking on the CPU
        check_results(size, x);

        cudaFree(x);
        cudaFree(y);
        return 0;
}

When I remove the prefetch calls the program runs well; with the prefetch calls, however, performance degrades when there are many blocks. Another version of the same program, which does not use Unified Memory and instead issues cudaMemcpy calls where the cudaMemPrefetchAsync calls are located, also runs well.

Is this due to overhead on the prefetch operation or am I doing something wrong?

Also, I am running on a Pascal Titan X.

Thank you!

You’re not prefetching the data actually used by the kernel.

Shouldn’t this:

cudaMemPrefetchAsync(x, BS*sizeof(float), DEVICE, stream);

be this:

cudaMemPrefetchAsync(&(x[i]), BS*sizeof(float), DEVICE, stream);

and likewise for y?

I hadn’t seen that (copy & paste :S), thank you. I also realized I was launching the kernel in the default stream, while the intention was to launch it on the same stream as the prefetch calls.

Fixing that, however, makes the performance issue worse: the gap between the version with prefetches and the version with none is now even larger (with the prefetching version being the slower one).

I am not sure whether this is an error in the way I am using prefetches, a limitation of the GPU (unfortunately I don’t have access to any P100 to compare against), or whether the prefetch mechanism is simply not designed for many small calls.
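For reference, this is what the inner loop looks like after applying both fixes (offset pointers, and the kernel launched on the same stream as the prefetches):

```cuda
// Fixed loop: prefetch the block the kernel actually touches,
// and launch the kernel on the same stream as the prefetches.
for (long int i = 0; i < N; i += BS){
        cudaMemPrefetchAsync(&x[i], BS*sizeof(float), DEVICE, stream);
        cudaMemPrefetchAsync(&y[i], BS*sizeof(float), DEVICE, stream);
        kernel<<<gridSize, blockSize, 0, stream>>>(BS, &x[i], &y[i]);
        cudaMemPrefetchAsync(&x[i], BS*sizeof(float), cudaCpuDeviceId, stream);
        cudaMemPrefetchAsync(&y[i], BS*sizeof(float), cudaCpuDeviceId, stream);
}
```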

I have just measured the time each iteration takes. The first iterations complete very quickly, but after 20 or so the CPU seems to need about a second to process each iteration, even though the calls are asynchronous.
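The measurement itself is nothing special, just host-side wall-clock time around each loop body, something like:

```cuda
// Host-side timing of one iteration. Every call in the body is
// asynchronous, so each iteration should return almost immediately;
// a jump to ~1 s per iteration means some call has become blocking.
auto t0 = std::chrono::steady_clock::now();
// ... the prefetches and the kernel launch for this block ...
auto t1 = std::chrono::steady_clock::now();
printf("iteration time: %.3f ms\n",
       std::chrono::duration<double, std::milli>(t1 - t0).count());
```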

I think it may have something to do with the prefetch call usage of the CPU, however, I am not sure how to measure this. Any ideas?

Thank you.

You may be running out of space in the async queue. I wouldn’t normally expect this to happen after 20 calls, but I haven’t studied this case.

If you launch enough kernels back-to-back, for example, eventually the kernel launches stop being asynchronous. The CPU thread blocks in that case, waiting for a queue slot to open up. This effectively causes the CPU thread to wait for the duration of one kernel execution at each iteration, in my example.

Thank you, this might be the issue I am having. Do I understand correctly that the prefetch calls may also stop being asynchronous, or does this only happen for kernel launches?

Any async call can convert to a synchronous call if you fill up a corresponding queue. The nature and depth of these queues are unpublished AFAIK, but they can be observed through microbenchmarking.
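A minimal microbenchmark along these lines (a sketch; the queue depth and the 1 ms threshold are assumptions, and the observed behavior is device/driver dependent): time each launch from the host. As long as a launch only enqueues work it returns in microseconds; the first launch that has to wait for a queue slot shows up as a large jump.

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// A deliberately long-running kernel: spin until `cycles` device clocks pass.
__global__ void spin(long long cycles){
        long long start = clock64();
        while (clock64() - start < cycles) { }
}

int main(){
        for (int i = 0; i < 2048; ++i){
                auto t0 = std::chrono::steady_clock::now();
                spin<<<1, 1>>>((long long)1e8);
                auto t1 = std::chrono::steady_clock::now();
                double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
                if (us > 1000.0){ // launch blocked: the async queue is full
                        printf("launch %d blocked for %.0f us; queue depth is roughly %d\n",
                               i, us, i);
                        break;
                }
        }
        cudaDeviceSynchronize();
        return 0;
}
```

The same wrapper around cudaMemPrefetchAsync instead of the kernel launch would show whether the prefetch calls hit the same limit.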

Then I think I may have found the issue, thank you very much!