In CUDA, there is cudaMemcpy2D, which lets you copy a 2D sub-matrix of a larger matrix on the host to a smaller matrix on the device (or vice versa).
For instance, say A is a 6x6 matrix on the host and B is a 3x3 matrix that was previously allocated on the device. cudaMemcpy2D lets you copy the 3x3 submatrix of A spanning rows 0 to 2 and columns 0 to 2 directly into B on the device.
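To make the example concrete, here is a minimal host-only sketch in C of the strided copy I mean. The helpers `copy_2d` and `copy_top_left_block` are names I made up for illustration; they only mimic the pitch/width/height semantics of cudaMemcpy2D on the host, and the corresponding CUDA call (assuming a device pointer `d_B`) appears in a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-only stand-in for the strided copy that
 * cudaMemcpy2D performs: copy `height` rows of `width` bytes each,
 * where consecutive rows start `spitch` bytes apart in the source
 * and `dpitch` bytes apart in the destination. */
static void copy_2d(void *dst, size_t dpitch,
                    const void *src, size_t spitch,
                    size_t width, size_t height)
{
    assert(width <= dpitch && width <= spitch);
    for (size_t r = 0; r < height; ++r) {
        memcpy((char *)dst + r * dpitch,
               (const char *)src + r * spitch,
               width);
    }
}

/* Fill a 6x6 row-major matrix A with A[r][c] = 10*r + c and copy its
 * top-left 3x3 block (rows 0..2, columns 0..2) into B. */
static void copy_top_left_block(float B[3][3])
{
    float A[6][6];
    for (int r = 0; r < 6; ++r)
        for (int c = 0; c < 6; ++c)
            A[r][c] = (float)(10 * r + c);

    /* With a device pointer d_B (an assumption), the real call would be:
     * cudaMemcpy2D(d_B, 3 * sizeof(float), A, 6 * sizeof(float),
     *              3 * sizeof(float), 3, cudaMemcpyHostToDevice); */
    copy_2d(B, 3 * sizeof(float), A, 6 * sizeof(float),
            3 * sizeof(float), 3);
}
```

The point is that the source pitch (6 floats) and the destination pitch (3 floats) differ, so a single contiguous copy cannot do the job.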
To the best of my understanding based on the specs, OpenACC does not have anything similar. We would have to copy the non-contiguous slices of the matrix individually, right?
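For reference, this is roughly the row-by-row workaround I have in mind, as a sketch only. It assumes B also exists on the host as a flat staging buffer of 9 floats and is not yet present on the device; `push_block_row_by_row` is a made-up name, and without an OpenACC compiler the pragmas are simply ignored and the code runs on the host.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { N = 6, M = 3 };

/* Hypothetical helper: stage rows 0..M-1, columns 0..M-1 of the host
 * matrix A into the flat buffer B, pushing each row to the device
 * with its own update directive. */
static void push_block_row_by_row(float A[N][N], float *B)
{
    assert(A != NULL && B != NULL);

    /* Assumption: B is not yet present on the device, so create it.
     * The caller is expected to issue a matching exit data later. */
    #pragma acc enter data create(B[0:M*M])

    for (int r = 0; r < M; ++r) {
        /* Each row of the submatrix is contiguous on its own. */
        memcpy(&B[r * M], &A[r][0], M * sizeof(float));
        #pragma acc update device(B[r*M:M])
    }
}
```

That is one transfer per row, which is exactly what I was hoping to avoid.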