"Invalid pitch argument error" from cublasSetmatrix Limitation on size of matrix

Hi all,

     I am trying to initialise a matrix on the device using the cublasSetMatrix function. While trying it with many different matrix sizes, I get an "invalid pitch argument" error for the largest matrix that I am trying to transfer (80,000 x 2016). I found an old thread about this on the forum, http://forums.nvidia.com/lofiversion/index.php?t69307.html, which says that cudaMemcpy2D() has a limitation that the pitch of a matrix cannot exceed 65536 elements (for floats). I guess cublasSetMatrix() is a wrapper around cudaMemcpy2D(), right?

      The thread is almost a year old. Does this limitation still hold? In my case, I am not copying the entire matrix onto the graphics card; only a small portion of the entire matrix will actually be copied. So how come I am still limited by this? The thread that I found didn't explain why this limitation exists, nor could I find anything about it in the reference manual or the programming guide. Could someone tell me more about why this limitation exists?

      
      More importantly, could someone suggest a workaround if I need to use these large matrices? I could of course extract the smaller submatrix that I need, store it separately on the host, and then transfer that smaller submatrix. But I was wondering if there was a more direct way to do it?

Thanks in advance for any replies…

Avinash

Yes, the limitation still holds. It is a hardware limitation in the copy engine used by cudaMemcpy2D.

Here are two functions that I wrote to work around the issue. They are for double-precision data, but it is very simple to convert them to float:

#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

/* imin() was not shown in the original post; the obvious min helper is assumed */
static int imin (int a, int b) { return (a < b) ? a : b; }

void copyMatrixH2D (int rows, int cols,
                    const void *A, int lda, void *B, int ldb)
{
  int i, n, nc, nb, bufferSize, lastNc;
  int maxBufferSize = 524288;  /* number of double-precision elements in a 4 MB buffer */
  static int allocateBuffer = 1;
  static double *buffer;

  /* Allocate the staging buffer once, on the first call */
  if (allocateBuffer)
  {
#ifdef PAGEABLE
    buffer = (double *) malloc (maxBufferSize * 8);
#else
    cudaMallocHost ((void **) &buffer, maxBufferSize * 8);
#endif
    allocateBuffer = 0;
  }

  /* Find out how many columns of data fit in the buffer */
  nc = maxBufferSize / rows;
  nc = imin (nc, cols);
  bufferSize = nc * rows * 8;

  /* Find out how many times we need to copy data to the buffer */
  nb = cols / nc;

  for (n = 0; n < nb; n++)
  {
    /* Copy nc columns from the host matrix to the buffer */
    for (i = 0; i < nc; i++)
    {
      memcpy (buffer + i * rows, (const double *) A + lda * (i + n * nc), 8 * rows);
    }

    /* Copy the buffer to the device matrix */
    cudaMemcpy ((double *) B + ldb * n * nc, buffer, bufferSize, cudaMemcpyHostToDevice);
  }

  lastNc = cols - nc * nb;
  if (lastNc == 0) return;

  /* Copy the last columns from the host matrix to the buffer */
  bufferSize = lastNc * rows * 8;
  for (i = 0; i < lastNc; i++)
    memcpy (buffer + i * rows, (const double *) A + lda * (i + nb * nc), 8 * rows);

  /* Copy the buffer to the device matrix */
  cudaMemcpy ((double *) B + ldb * nb * nc, buffer, bufferSize, cudaMemcpyHostToDevice);
}
void copyMatrixD2H (int rows, int cols,
                    const void *A, int lda, void *B, int ldb)
{
  int i, n, nc, nb, bufferSize, lastNc;
  int maxBufferSize = 524288;  /* number of double-precision elements in a 4 MB buffer */
  static int allocateBuffer = 1;
  static double *buffer;

  /* Allocate the staging buffer once, on the first call */
  if (allocateBuffer)
  {
#ifdef PAGEABLE
    buffer = (double *) malloc (maxBufferSize * 8);
#else
    cudaMallocHost ((void **) &buffer, maxBufferSize * 8);
#endif
    allocateBuffer = 0;
  }

  /* Find out how many columns of data fit in the buffer */
  nc = maxBufferSize / rows;
  nc = imin (nc, cols);
  bufferSize = nc * rows * 8;

  /* Find out how many times we need to copy data to the buffer */
  nb = cols / nc;

  for (n = 0; n < nb; n++)
  {
    /* Copy nc columns of the device matrix to the buffer */
    cudaMemcpy (buffer, (const double *) A + lda * n * nc, bufferSize, cudaMemcpyDeviceToHost);

    /* Copy nc columns from the buffer to the host matrix */
    for (i = 0; i < nc; i++)
    {
      memcpy ((double *) B + ldb * (i + n * nc), buffer + i * rows, 8 * rows);
    }
  }

  lastNc = cols - nc * nb;
  if (lastNc == 0) return;

  /* Copy the last columns from the device matrix to the buffer */
  bufferSize = lastNc * rows * 8;
  cudaMemcpy (buffer, (const double *) A + lda * nb * nc, bufferSize, cudaMemcpyDeviceToHost);

  /* Copy the last columns from the buffer to the host matrix */
  for (i = 0; i < lastNc; i++)
    memcpy ((double *) B + ldb * (i + nb * nc), buffer + i * rows, 8 * rows);
}

The usage is similar to cublasSetMatrix/cublasGetMatrix (the limits in this example are for double precision):

if ((*lda) > 32768 || m_gpu > 32768)
{
  copyMatrixH2D (m_gpu, k_gpu, A, *lda, devPtrA, m_gpu);
}
else
{
  status = cublasSetMatrix (m_gpu, k_gpu, sizeof(A[0]), A, *lda, devPtrA, m_gpu);
  if (status != CUBLAS_STATUS_SUCCESS) {
    printf ("!!!! device access error (write A) %d\n", status);
  }
}
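Reading the result back would be handled the same way, with copyMatrixD2H and cublasGetMatrix swapped in. A sketch along the same lines, reusing the variables from the example above (here the host array A is just a placeholder for wherever you want the result):

if ((*lda) > 32768 || m_gpu > 32768)
{
  copyMatrixD2H (m_gpu, k_gpu, devPtrA, m_gpu, A, *lda);
}
else
{
  status = cublasGetMatrix (m_gpu, k_gpu, sizeof(A[0]), devPtrA, m_gpu, A, *lda);
  if (status != CUBLAS_STATUS_SUCCESS) {
    printf ("!!!! device access error (read A) %d\n", status);
  }
}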

Hope this helps.

It most certainly does help… This is awesome. Thanks a lot mfatica.

I have one doubt; it might be a bit silly. The macro PAGEABLE that you use… is it predefined in C? If not, what would go in that macro?

Thanks again,

It is just a define at compile time:
gcc -DPAGEABLE file.c
builds the version that uses a plain malloc (pageable) buffer. If you compile without -DPAGEABLE, the buffer is allocated with cudaMallocHost, i.e. page-locked, which improves transfer performance.

Hi Mfatica,

         I was looking at your function and came up with another question. Last one, I swear…

I was wondering what the logic is behind setting the maximum buffer size to 4 MB. Is this the hardware limitation that you spoke of?
Because the old thread that I found (http://forums.nvidia.com/lofiversion/index.php?t69307.html) says that it is only the pitch of the matrix (leading dimension / number of rows) that cannot exceed 65536. But if the maximum buffer is 4 MB, that would mean I can only have a matrix of fewer than 1048576 floats (the number of floats in 4 MB), irrespective of how that is divided up into rows and columns. I find this a bit confusing. Could you help me out here?

Thanks

The maximum memory pitch is 262144 bytes (it is reported by deviceQuery).

The buffer is used to copy pieces of the original matrix before sending/receiving data to/from the GPU. I played a little bit with the dimensions of this buffer, and this was a good size on my systems; you can experiment with the size. Once the data is in this buffer, it is sent with cudaMemcpy, not cudaMemcpy2D.
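If you want to check the limit programmatically instead of running deviceQuery, something along these lines should work (a minimal sketch using the CUDA runtime API, assuming device 0):

#include <stdio.h>
#include <cuda_runtime.h>

int main (void)
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties (&prop, 0);   /* properties of device 0 */

  /* memPitch is the maximum pitch in bytes allowed by the copy engine,
     the same value that deviceQuery prints */
  printf ("Maximum memory pitch: %lu bytes (%lu doubles per column)\n",
          (unsigned long) prop.memPitch,
          (unsigned long) (prop.memPitch / sizeof (double)));
  return 0;
}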