Data copies of the same size take very different amounts of time in different programs

I used OpenACC to accelerate my program, but I found that copying data from the GPU back to the CPU wastes a lot of time. I tried to solve this problem but failed.
First, I modified my functions to use #pragma acc update host and cudaMemcpy, and compared the two in a small test program:

nvc++ -o copy -acc -gpu=cc87 -cuda -cudalib copy.c

CUDA to CPU t_12=119156 microsecond
Openacc to CPU t_23=181720 microsecond

Accelerator Kernel Timing data
/home/orin/Downloads/copy.c
  main  NVIDIA  devicenum=0
    time(us): 210,041
    42: data region reached 1 time
        42: data copyin transfers: 10
             device time(us): total=28,704 max=3,435 min=1,588 avg=2,870
    53: update directive reached 1 time
        53: data copyout transfers: 10
             device time(us): total=181,337 max=24,605 min=13,725 avg=18,133



#include <stdio.h>
#include <cuda_runtime.h>
#include <cufft.h>
#include "device_launch_parameters.h"
#include "cublas_v2.h"



#if defined(_WIN32) || defined(_WIN64)
#include <sys/timeb.h>
#define gettime(a) _ftime(a)
#define usec(t1,t2) ((((t2).time-(t1).time)*1000+((t2).millitm-(t1).millitm))*100)
typedef struct _timeb timestruct;
#else
#include <sys/time.h>
#define gettime(a) gettimeofday(a,NULL)
#define usec(t1,t2) (((t2).tv_sec-(t1).tv_sec)*1000000+((t2).tv_usec-(t1).tv_usec))
typedef struct timeval timestruct;
#endif
      
int main()
{
    int M_Array = 200*200, Fs = 2000;
    float* a = (float*)malloc(M_Array * Fs * sizeof(float));
    float* b = (float*)malloc(M_Array * Fs * sizeof(float));
    float* c = (float*)malloc(M_Array * Fs * sizeof(float));
    long long t_12, t_23;
    timestruct t1, t2, t3;

    for (int i = 0; i < M_Array * Fs; i++)
        a[i] = 0.1 * i;

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* a is copied to the device, b is allocated on the device only */
    #pragma acc enter data copyin(a[0:M_Array*Fs]) create(b[0:M_Array*Fs])

    /* device-to-device copy so that b holds valid data */
    #pragma acc host_data use_device(a,b)
    cublasScopy(handle, M_Array*Fs, a, 1, b, 1);

    gettime(&t1);

    /* device-to-host transfer with cudaMemcpy */
    #pragma acc host_data use_device(a)
    cudaMemcpy(c, a, M_Array * Fs * sizeof(float), cudaMemcpyDeviceToHost);

    gettime(&t2);

    /* device-to-host transfer with the OpenACC update directive */
    #pragma acc update host(b[0:M_Array*Fs])

    gettime(&t3);

    t_12 = usec(t1, t2);
    t_23 = usec(t2, t3);
    printf("CUDA to CPU t_12=%lld microsecond\n", t_12);
    printf("Openacc to CPU t_23=%lld microsecond\n", t_23);

    return 0;
}

There was a noticeable improvement in the test program, but when I replaced update with cudaMemcpy in the large program, the runtime did not change: it still took about 0.5 seconds to copy the same amount of data to the CPU.

I then tried page-locked (pinned) memory. It performed well in the test program, but it still did not help in the main program.

nvc++ -o test -acc -gpu=cc87 -cuda -cudalib -Minfo=accel  test.c

update to CPU t_12=92836 microsecond cc[1]=1.000000
cudaMemcpy to CPU t_23=11097 microsecond bb[1]=1.000000

Accelerator Kernel Timing data
  main  NVIDIA  devicenum=0
    time(us): 103,862
    59: compute region reached 1 time
        64: kernel launched 1 time
            grid: [65535]  block: [128]
            elapsed time(us): total=7,632 max=7,632 min=7,632 avg=7,632
    59: data region reached 3 times
        59: data copyin transfers: 10
             device time(us): total=11,345 max=1,211 min=708 avg=1,134
    69: update directive reached 1 time
        69: data copyout transfers: 10
             device time(us): total=92,517 max=16,835 min=7,517 avg=9,251



#include <cstdio>
#include <iostream>
#include <condition_variable>
#include <cstring>
#include <cstdlib>

#include<openacc.h>
#include <cuda_runtime.h>
#include <cufft.h>
#include "device_launch_parameters.h"
#include "cublas_v2.h"


#if defined(_WIN32) || defined(_WIN64)
#include <sys/timeb.h>
#define gettime(a) _ftime(a)
#define usec(t1,t2) ((((t2).time-(t1).time)*1000+((t2).millitm-(t1).millitm))*100)
typedef struct _timeb timestruct;
#else
#include <sys/time.h>
#define gettime(a) gettimeofday(a,NULL)
#define usec(t1,t2) (((t2).tv_sec-(t1).tv_sec)*1000000+((t2).tv_usec-(t1).tv_usec))
typedef struct timeval timestruct;
#endif


int main()
{
    int NN = 200*2000*200;

    /* page-locked (pinned) host buffer for the cudaMemcpy test */
    float *aa;
    cudaHostAlloc((void **)&aa, NN * sizeof(*aa), cudaHostAllocDefault);

    /* ordinary pageable host buffers */
    float* bb = (float*)malloc(NN * sizeof(float));
    float* cc = (float*)malloc(NN * sizeof(float));

    for (int i = 0; i < NN; i++)
        bb[i] = i + 1/2;      /* note: 1/2 is integer division, so bb[i] == i */

    long long t_12, t_23;
    timestruct t1, t2, t3;

    #pragma acc enter data copyin(bb[0:NN]) create(cc[0:NN])

    #pragma acc kernels
    #pragma acc loop independent
    for (int i = 0; i < NN; i++)
        cc[i] = bb[i];

    gettime(&t1);
    /* device-to-host transfer with the OpenACC update directive (pageable cc) */
    #pragma acc update host(cc[0:NN])
    gettime(&t2);

    /* device-to-host transfer with cudaMemcpy into the pinned buffer aa */
    #pragma acc host_data use_device(bb)
    cudaMemcpy(aa, bb, NN * sizeof(float), cudaMemcpyDeviceToHost);
    gettime(&t3);

    t_12 = usec(t1, t2);
    t_23 = usec(t2, t3);
    printf("update to CPU t_12=%lld microsecond cc[1]=%f\n", t_12, cc[1]);
    printf("cudaMemcpy to CPU t_23=%lld microsecond bb[1]=%f\n", t_23, bb[1]);

    return 0;
}

I suspect the main program may simply be using too much memory: when it starts running, it occupies about 10 GB. I allocate many variables on the GPU through OpenACC, but in the end only one float array of size 200 * 200 * 2000 (about 320 MB) is returned to the host. An array of that size should not take this long to copy.

I tried to write a program that simulates my large one, but I cannot reproduce that effect. The large program runs multiple threads, and the code below models the work of just one of them:

OpenACC to CPU t_12=95194 microsecond  F4[2]=1.000000
CUDA to CPU t_23=63221 microsecond   F5[2]=1.000000
OpenACC to CPU t_12=60883 microsecond  F4[2]=1.000000
CUDA to CPU t_23=48682 microsecond   F5[2]=1.000000
OpenACC to CPU t_12=61164 microsecond  F4[2]=1.000000
CUDA to CPU t_23=48196 microsecond   F5[2]=1.000000
OpenACC to CPU t_12=60692 microsecond  F4[2]=1.000000
CUDA to CPU t_23=46595 microsecond   F5[2]=1.000000
OpenACC to CPU t_12=63451 microsecond  F4[2]=1.000000
CUDA to CPU t_23=47711 microsecond   F5[2]=1.000000



#include <stdio.h>
#include <iostream>
#include <fstream>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <vector>
#include <cstring>
#include <cstdlib>
#include <arpa/inet.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>


#include <openacc.h>
#include <cuda_runtime.h>
#include <cufft.h>
#include "device_launch_parameters.h"
#include "cublas_v2.h"

#if defined(_WIN32) || defined(_WIN64)
#include <sys/timeb.h>
#define gettime(a) _ftime(a)
#define usec(t1,t2) ((((t2).time-(t1).time)*1000+((t2).millitm-(t1).millitm))*100)
typedef struct _timeb timestruct;
#else
#include <sys/time.h>
#define gettime(a) gettimeofday(a,NULL)
#define usec(t1,t2) (((t2).tv_sec-(t1).tv_sec)*1000000+((t2).tv_usec-(t1).tv_usec))
typedef struct timeval timestruct;
#endif
      

int main()
{
    int M = 200, N = 200, NN = 2000, MM = 30000;

    float* A1 = (float*)malloc(M * MM * sizeof(float));
    float* A2 = (float*)malloc(M * MM * sizeof(float));
    double* A3 = (double*)malloc(M * MM * sizeof(double));

    float* F3 = (float*)malloc(NN * M * N * sizeof(float));

    cuDoubleComplex* B1 = (cuDoubleComplex*)malloc(MM * M * sizeof(cuDoubleComplex));
    cuDoubleComplex* B2 = (cuDoubleComplex*)malloc(NN * M * sizeof(cuDoubleComplex));
    cuDoubleComplex* B3 = (cuDoubleComplex*)malloc(NN * MM * sizeof(cuDoubleComplex));
    float* B4 = (float*)malloc(MM * M * sizeof(float));
    double* B5 = (double*)malloc(MM * M * sizeof(double));

    cuDoubleComplex** B6 = (cuDoubleComplex**)malloc(N * sizeof(cuDoubleComplex*));
    cuDoubleComplex** B7 = (cuDoubleComplex**)malloc(N * sizeof(cuDoubleComplex*));
    cuDoubleComplex** B8 = (cuDoubleComplex**)malloc(N * sizeof(cuDoubleComplex*));

    float* C1 = (float*)malloc(MM * M * sizeof(float));
    float* C2 = (float*)malloc(MM * sizeof(float));
    float* C3 = (float*)malloc(MM * M * sizeof(float));

    double* D1 = (double*)malloc(MM * M * sizeof(double));
    double* D2 = (double*)malloc(MM * M * sizeof(double));
    double* D3 = (double*)malloc(MM * M * sizeof(double));
    double* D4 = (double*)malloc(MM * M * sizeof(double));

    cuDoubleComplex* E1 = (cuDoubleComplex*)malloc(MM * M * sizeof(cuDoubleComplex));
    cuDoubleComplex* E2 = (cuDoubleComplex*)malloc(MM * M * sizeof(cuDoubleComplex));
    cuDoubleComplex* E3 = (cuDoubleComplex*)malloc(MM * M * sizeof(cuDoubleComplex));
    cuDoubleComplex* E4 = (cuDoubleComplex*)malloc(MM * M * sizeof(cuDoubleComplex));
    cuDoubleComplex* E5 = (cuDoubleComplex*)malloc(MM * M * sizeof(cuDoubleComplex));

    cuDoubleComplex* E6 = (cuDoubleComplex*)malloc(NN * M * sizeof(cuDoubleComplex));

    cuDoubleComplex* E7 = (cuDoubleComplex*)malloc(NN * M * N * sizeof(cuDoubleComplex));
    cuDoubleComplex* E8 = (cuDoubleComplex*)malloc(NN * M * N * sizeof(cuDoubleComplex));
    cuDoubleComplex* F1 = (cuDoubleComplex*)malloc(NN * M * N * sizeof(cuDoubleComplex));
    cuDoubleComplex* F2 = (cuDoubleComplex*)malloc(NN * M * N * sizeof(cuDoubleComplex));

    float* F4 = (float*)malloc(NN * M * N * sizeof(float));
    float* F5 = (float*)malloc(NN * M * N * sizeof(float));

    for (int i = 0; i < NN * M * N; i++)
        F3[i] = i / 2;

    long long t_12, t_23;
    timestruct t1, t2, t3;

    /* allocate the same set of device arrays the large program uses */
    #pragma acc enter data copyin(A1[0:M*MM],A2[0:M*MM],A3[0:M*MM])
    #pragma acc enter data create(B1[0:MM*M],B2[0:NN*M],B3[0:NN*MM],B4[0:MM*M],B5[0:MM*M])
    #pragma acc enter data create(B6[0:N],B7[0:N],B8[0:N])
    #pragma acc enter data create(C1[0:MM*M],C2[0:MM],C3[0:MM*M])
    #pragma acc enter data create(D1[0:MM*M],D2[0:MM*M],D3[0:MM*M],D4[0:MM*M])
    #pragma acc enter data create(E1[0:MM*M],E2[0:MM*M],E3[0:MM*M],E4[0:MM*M],E5[0:MM*M],E6[0:NN*M],E7[0:NN*M*N],E8[0:NN*M*N])
    #pragma acc enter data copyin(F1[0:NN*M*N],F2[0:NN*M*N],F3[0:NN*M*N])
    #pragma acc enter data create(F4[0:NN*M*N])

    int k = 0;
    while (k < 5)
    {
        #pragma acc kernels
        #pragma acc loop independent
        for (int i = 0; i < NN; i++)
            F4[i] = F3[i];

        gettime(&t1);
        /* device-to-host transfer with the OpenACC update directive */
        #pragma acc update host(F4[0:NN*M*N])
        gettime(&t2);

        /* device-to-host transfer with cudaMemcpy */
        #pragma acc host_data use_device(F3)
        cudaMemcpy(F5, F3, NN * M * N * sizeof(float), cudaMemcpyDeviceToHost);
        gettime(&t3);

        t_12 = usec(t1, t2);
        t_23 = usec(t2, t3);
        printf("OpenACC to CPU t_12=%lld microsecond  F4[2]=%f\n", t_12, F4[2]);
        printf("CUDA to CPU t_23=%lld microsecond   F5[2]=%f\n", t_23, F5[2]);
        k++;
    }
    return 0;
}

In reality, every one of these arrays participates in the computation, and at the end a result F3 is produced. I transfer it to the CPU to finish a certain calculation there. But the simulation performs much better than my actual program, and I don't know why.

Could the problem be that my large program mixes OpenACC and CUDA compilation?

bantch_modify:bantch_modify.o  cudaCode.o 
	nvc++ -acc -gpu=cc87 -cuda -cudalib -Minfo=accel  -o bantch_modify   bantch_modify.o cudaCode.o
bantch_modify.o:bantch_modify.c
	nvc++ -acc -gpu=cc87 -cudalib -c bantch_modify.c
cudaCode.o:cudaCode.cu
	nvcc -c cudaCode.cu

In my actual runs, the copy took over 0.5 seconds.

In reality, every one of these arrays participates in the computation, and at the end a result F3 is produced. I transfer it to the CPU to finish a certain calculation there. But the simulation performs much better than my actual program, and I don't know why.

I’m not clear on what you mean, but given your examples, it sounds like you’re basically asking about the performance of data transfers between the host and device.

Data movement is often the most expensive part of GPU programming. The general advice is to offload all of the compute, so I’d recommend seeing if the “certain calculation” can be put on the GPU as well. Even if it has to run serially there, that may be less expensive than copying the data back and forth between the host and device.
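
Just as an illustration (I don’t know what your “certain calculation” is, so the sum below is a hypothetical stand-in): if the host-side work can be expressed as another accelerated region that reads F4 while it is still resident on the device, only a small result has to cross the bus instead of the whole array.

    /* Hypothetical stand-in for the host-side "certain calculation":
       reduce F4 on the device instead of copying all of it back.
       Assumes F4 is already present on the device (enter data create),
       as in the simulation program above. */
    double sum = 0.0;
    #pragma acc parallel loop reduction(+:sum) present(F4[0:NN*M*N])
    for (int i = 0; i < NN*M*N; i++)
        sum += F4[i];
    /* only the scalar "sum" comes back to the host, not the full array */
    printf("sum = %f\n", sum);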

Of course, I don’t have your full code, so I don’t know for sure, but it’s something for you to consider.

Now, page-locked memory (aka “pinned” memory) can help in some cases, but while it improves the transfer time itself, the extra overhead the host OS incurs to allocate the pinned memory often far outweighs any gain in transfer time. It generally only helps when there are few allocations but many data transfers.
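
A minimal sketch of that “allocate once, transfer many times” pattern, loosely adapted from your simulation loop (host_buf is a name I made up, and error checking is omitted):

    /* pay the pinned-allocation cost once, outside the loop,
       then reuse the same buffer for every device-to-host transfer */
    size_t bytes = (size_t)NN * M * N * sizeof(float);
    float *host_buf;
    cudaHostAlloc((void**)&host_buf, bytes, cudaHostAllocDefault);

    for (int k = 0; k < 5; k++) {
        /* ... device computation that produces F4 ... */
        #pragma acc host_data use_device(F4)
        cudaMemcpy(host_buf, F4, bytes, cudaMemcpyDeviceToHost);
        /* ... host-side work on host_buf ... */
    }
    cudaFreeHost(host_buf);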

For OpenACC, you can enable the use of page-locked memory via the flag “-gpu=pinned”. The compiler will then replace the malloc calls with calls to cuMemHostAlloc.
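
For example, adding it to your earlier compile line would look something like this (untested on my side; the -gpu sub-options can be combined with commas):

nvc++ -o test -acc -gpu=cc87,pinned -cuda -cudalib -Minfo=accel test.c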

DMA transfers do need to be done from pinned memory so the OS doesn’t swap the pages out while they’re being transferred. When not using “-gpu=pinned”, the OpenACC runtime uses a double-buffering scheme: it copies from virtual (pageable) memory into one pinned buffer, launches that transfer asynchronously, and then fills the second buffer while the first one is transferring. This can effectively hide much of the virtual-to-pinned copy time.
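
To make that concrete, here is a rough sketch of the staging idea for the device-to-host direction (my own simplified illustration, not the actual OpenACC runtime code; dev_src, host_dst, and the chunk size are all made up):

    #include <cuda_runtime.h>
    #include <string.h>

    /* double-buffered device-to-host copy: the DMA into one pinned buffer
       overlaps with the pinned-to-pageable memcpy out of the other buffer */
    static void staged_copy_d2h(const float *dev_src, float *host_dst, size_t n)
    {
        const size_t CHUNK = 1 << 22;          /* 4M floats per buffer (arbitrary) */
        float *pin[2];
        cudaStream_t strm;
        cudaHostAlloc((void**)&pin[0], CHUNK * sizeof(float), cudaHostAllocDefault);
        cudaHostAlloc((void**)&pin[1], CHUNK * sizeof(float), cudaHostAllocDefault);
        cudaStreamCreate(&strm);

        size_t prev_off = 0, prev_len = 0;
        int cur = 0;
        for (size_t off = 0; off < n; off += CHUNK, cur ^= 1) {
            size_t len = (n - off < CHUNK) ? (n - off) : CHUNK;
            /* start the DMA from device memory into pinned buffer "cur" */
            cudaMemcpyAsync(pin[cur], dev_src + off, len * sizeof(float),
                            cudaMemcpyDeviceToHost, strm);
            /* while it runs, drain the previously filled pinned buffer */
            if (prev_len)
                memcpy(host_dst + prev_off, pin[cur ^ 1], prev_len * sizeof(float));
            cudaStreamSynchronize(strm);       /* wait for the DMA just issued */
            prev_off = off;
            prev_len = len;
        }
        if (prev_len)                          /* drain the final buffer */
            memcpy(host_dst + prev_off, pin[cur ^ 1], prev_len * sizeof(float));

        cudaFreeHost(pin[0]);
        cudaFreeHost(pin[1]);
        cudaStreamDestroy(strm);
    }

    /* usage, e.g. instead of "#pragma acc update host(F4[0:NN*M*N])":
           #pragma acc host_data use_device(F4)
           staged_copy_d2h(F4, F4_host_copy, (size_t)NN*M*N);
       where F4_host_copy is an ordinary (pageable) host array */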

Since you’re dealing with large arrays, you might see some improvement by adjusting the buffer size via the environment variable NV_ACC_BUFFERSIZE. The caveat is that bigger buffers mean a bigger overhead cost to allocate them.
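
For example, before running the program (I believe the value is given in bytes; treat the exact number as something to experiment with):

export NV_ACC_BUFFERSIZE=33554432    # try a 32 MB transfer buffer
./bantch_modify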