Slow 2D MEMCPY

Hello,
I need to copy a submatrix of a matrix from host to device, but the problem is that the matrix is copied in small parts (12-16 kB each) and the transfer is slow.

I allocate the matrix as a 2D array in the following way and compile with -ta=tesla:pinned:

  double** matrix = (double**) malloc(x*sizeof(double*));
  for (i = 0; i < x; i++)
        matrix[i] = (double*) malloc(y*sizeof(double));

and copyin

#pragma acc enter data copyin (matrix[0:x/2][0:x/2])

When I tried the same code with 1D vectors, the data transfer was fast and the whole array was moved in one block, not in small parts like the 2D matrix.
Is there any way to copy submatrices faster in OpenACC, or something like cudaMemcpy2D?
Thanks.

Is there any way to copy submatrices faster in OpenACC, or something like cudaMemcpy2D?

The problem here is that the compiler can only copy data to the device in contiguous chunks (due to system DMA transfers), so it must loop through the columns and transfer each row one at a time. I believe cudaMemcpy2D works the same way, so it may not help.

For performance, it’s better to linearize the arrays (i.e. make them 1D vectors) and then copy the entire array so a single DMA transfer can be used.
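
For example, a minimal sketch of what that looks like (reusing the x and y dimensions from your snippet above):

  /* one contiguous allocation instead of x separate row allocations */
  double* matrix = (double*) malloc(x*y*sizeof(double));

  /* element (i,j) is now matrix[i*y + j] */

  /* the whole matrix (or any block of whole rows, e.g. matrix[0:(x/2)*y])
     is contiguous, so it moves in a single DMA transfer */
  #pragma acc enter data copyin(matrix[0:x*y])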

-Mat

Thanks Mat. Can I ask one more question?
I want to overlap kernel execution with data transfer. First I used cublasDgemm and it works pretty well, as you can see in the image and code:

    #pragma acc enter data copyin () create () async(1)
    {
    cublasSetKernelStream(stream1);
    #pragma acc host_data use_device()
        cublasDgemm();
    #pragma acc exit data copyout() async(1)
    }

    #pragma acc enter data copyin () create () async(2)
    {
    cublasSetKernelStream(stream2);
    #pragma acc host_data use_device()
        cublasDgemm();
    #pragma acc exit data copyout() async(2)
    }

But when I use my own OpenACC function

void matrixMull(double *a, double *b, double *c, int x, int y, int streamId) {
    double sum = 0;
    #pragma acc kernels present(a[0:x*y], b[0:x*y], c[0:x*y]) async(streamId)
    for (int i = 0; i < x; i++) {
        for (int j = 0; j < y; j++) {
            for (int k = 0; k < y; k++) {
                sum += a[i*y+k] * b[k*y+j];
            }
            c[i*y+j] = sum;
            sum = 0;
        }
    }
}

    #pragma acc enter data copyin () create () async(1)
    {
    matrixMull(1); // one of the parameters is the stream for the kernel
    #pragma acc exit data copyout() async(1)
    }

    #pragma acc enter data copyin () create () async(2)
    {
    matrixMull(2); // one of the parameters is the stream for the kernel
    #pragma acc exit data copyout() async(2)
    }

It doesn’t work the same as with cublasDgemm. Do you know where the problem could be?
Thanks

How is “stream1” created? Are you passing “stream1” to the OpenACC version or the value “1”?

Note that the profile doesn’t quite match up with the code you have. The profile appears to show 1 data stream and 4 compute streams. Not sure why the disconnect. Having a full example would help.

Although you don’t show it, my assumption is that you’re creating “stream1” via “cudaStreamCreate”, so unless you’re setting the async queue to be the same as the CUDA stream (via acc_set_cuda_stream), the data movement and the cublas call will be on separate streams. And if the data movement and the compute kernel that uses that data are on separate streams, you’re likely to get incorrect answers, since the data may not have fully arrived before the compute begins.
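
For example, something along these lines would tie queue 1 and the cuBLAS call to the same CUDA stream (a rough sketch using the acc_set_cuda_stream extension mentioned above and the legacy cuBLAS API you’re already using; the function and array names are just placeholders):

#include <openacc.h>
#include <cublas.h>
#include <cuda_runtime.h>

void run_on_queue1(double *a, double *b, double *c, int n)
{
    cudaStream_t stream1;
    cudaStreamCreate(&stream1);

    /* map OpenACC async queue 1 onto the CUDA stream we just created */
    acc_set_cuda_stream(1, stream1);

    /* data movement on queue 1 now goes to stream1 ... */
    #pragma acc enter data copyin(a[0:n*n], b[0:n*n]) create(c[0:n*n]) async(1)

    /* ... and so does the cublas call, since it is given the same stream */
    cublasSetKernelStream(stream1);
    #pragma acc host_data use_device(a, b, c)
    {
        /* cublasDgemm(...) as in your code */
    }
}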

The typical way of interleaving compute with data movement would be to do something like:

for (i = 0; i < num_blas_calls; ++i) {
    int queue = i % num_streams;  // limit the number of streams to 2 or 4
    #pragma acc enter data copyin () create () async(queue)
    matrixMull(a, b, c, x, y, queue);
}
#pragma acc wait
for (i = 0; i < num_blas_calls; ++i) {
    #pragma acc exit data copyout()
}

Although async is a supported clause on “exit data”, and we did try many different ways to make this work, we were unable to find a reliable way to get a callback from the OS to know that the data transfer from the device was complete. Hence we had to make “exit data” blocking. Not sure if this is the problem you’re encountering, but the workaround is to delay copying the data back until after the compute is finished.
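
Put together, a rough sketch of that pattern (the per-call blocks A[i], B[i], C[i] are hypothetical, not from your code):

int num_streams = 2;                        // keep the number of streams small
for (int i = 0; i < num_blas_calls; ++i) {
    int queue = i % num_streams;
    double *a = A[i], *b = B[i], *c = C[i]; // hypothetical per-call blocks
    #pragma acc enter data copyin(a[0:x*y], b[0:x*y]) create(c[0:x*y]) async(queue)
    matrixMull(a, b, c, x, y, queue);       // compute runs on the same queue
}
#pragma acc wait                            // make sure all compute has finished
for (int i = 0; i < num_blas_calls; ++i) {
    double *c = C[i];
    #pragma acc exit data copyout(c[0:x*y]) // blocking copy back, after the wait
}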

-Mat