"Invalid pitch argument error" from cublasSetmatrix Limitation on size of matrix

avi2886 · April 7, 2009, 1:07am

Hi all,

     I am trying to Initialise a matrix on the device using the cublasSetmatrix function. While trying it for many different sized matrices, I get an error saying "invalid pitch argument" for the largest matrix that I am trying to transfer (80,000 x 2016).  I found an old thread about this on the forum  [url="http://forums.nvidia.com/lofiversion/index.php?t69307.html"]http://forums.nvidia.com/lofiversion/index.php?t69307.html[/url], which says that the cudaMemcpy2D() has a limitation that the pitch of a matrix cannot exceed 65536 elemts (for floats).. I guess cublasSetmatrix() is a wrapper around cudamemcpy2D right? 

      The thread is almost an year old. Does this limitation still hold..? In my case, I am not copying the entire matrix onto the graphics card. Only a small portion of the entire matrix will actually be copied. So how come I am still limited by this? The thread that I found didnt explain why this limitation exists. neither could I find anything about it in the Reference manual or the programming guide.. Could someone tell me more about why this limitation exists?

      
      And more importantly, could someone suggest a work around if I need to use these large matrices. I could ofcourse extract the smaller submatrix that I need and store seperately in the host and then transfer that smaller submatrix. But I was wondering if there was a more direct way to do it?

thanks in advance for any replies…

Avinash

mfatica · April 7, 2009, 3:56am

Yes, the limitation still holds. It is an hardware limitation in the copy engine used in cudamemcpy2D.

These are two functions that I wrote to work around the issue, they are for double precision data but it is very simple to convert them to float:

void copyMatrixH2D (int rows, int cols,

					const void *A, int lda, void *B, int ldb)

{

int i,n, nc, nb, bufferSize, lastNc;

  int maxBufferSize=524288; // number of double precision elements in 4MB buffer

  static int allocateBuffer=1;

  static double *buffer;

if(allocateBuffer)

	{

#ifdef PAGEABLE

	 buffer =(double *)malloc(maxBufferSize*8);

#else

	 cudaMallocHost((void **) &buffer, maxBufferSize*8);

#endif

	 allocateBuffer=0;

	}

/* Find out how many columns of data fit in the buffer */

  nc= maxBufferSize/rows;

  nc = imin(nc, cols);

  bufferSize= nc*rows*8;

/* Find out how many times we need to copy data to the buffer */

  nb= cols/nc;

for (n=0; n <nb; n++)

  {

   /* Copy nc colums from host matrix to the buffer */

	  for (i=0; i<nc; i++)

	  {

	   memcpy (buffer+i*rows, (double *)A+lda*(i+n*nc), 8*rows);

	  }

/* Copy the buffer to the device matrix */

	   cudaMemcpy((double *)B+ldb*n*nc, buffer, bufferSize, cudaMemcpyHostToDevice);

  }

lastNc = cols - nc*nb;

  if (lastNc == 0) return;

 /* Copy last colums from host matrix to the buffer */

   bufferSize= lastNc*rows*8;

	  for (i=0; i<lastNc; i++)

	   memcpy (buffer+i*rows, (double *)A+lda*(i+nb*nc), 8*rows);

/* Copy the buffer to the device matrix */

	   cudaMemcpy((double *)B+ldb*nb*nc, buffer, bufferSize, cudaMemcpyHostToDevice);

}

void copyMatrixD2H (int rows, int cols,

					const void *A, int lda, void *B, int ldb)

{

int i,n, nc, nb, bufferSize, lastNc;

  int maxBufferSize=524288; // number of double precision elements in 4MB buffer

  static int allocateBuffer=1;

  static double *buffer;

if(allocateBuffer)

	{

#ifdef PAGEABLE

	 buffer =(double *)malloc(maxBufferSize*8);

#else

	 cudaMallocHost((void **) &buffer, maxBufferSize*8);

#endif

	 allocateBuffer=0;

	}

/* Find out how many columns of data fit in the buffer */

  nc= maxBufferSize/rows;

  nc = imin(nc, cols);

  bufferSize= nc*rows*8;

/* Find out how many times we need to copy data to the buffer */

  nb= cols/nc;

for (n=0; n <nb; n++)

  {

   /* Copy the device matrix to the buffer */

	   cudaMemcpy(buffer,(double *)A+lda*n*nc,  bufferSize, cudaMemcpyDeviceToHost);

   /* Copy nc colums from the buffer to host matrix */

	  for (i=0; i<nc; i++)

	  {

	   memcpy ((double *)B+ldb*(i+n*nc),buffer+i*rows, 8*rows);

	  }

}

lastNc = cols - nc*nb;

  if (lastNc == 0) return;

/* Copy last colums from device matrix to the buffer */

   bufferSize= lastNc*rows*8;

   /* Copy the buffer to the device matrix */

	   cudaMemcpy(buffer,(double *)A+lda*nb*nc, bufferSize, cudaMemcpyDeviceToHost);

	  for (i=0; i<lastNc; i++)

	   memcpy ((double *)B+ldb*(i+nb*nc),buffer+i*rows, 8*rows);

}

The usage is similar to S/GetMatrix (the limitations in this example are for double)

if ( (*lda) > 32768 || m_gpu >32768)

	 {

	  copyMatrixH2D (m_gpu, k_gpu,  A, *lda, devPtrA, m_gpu);

	 }

	else

	{

	 status = cublasSetMatrix (m_gpu, k_gpu, sizeof(A[0]), A, *lda, devPtrA, m_gpu);

	   if (status != CUBLAS_STATUS_SUCCESS) {

		printf ( "!!!! device access error (write A ) %d\n", status);

		}

	}

Hope this help.

avi2886 · April 7, 2009, 6:59pm

It most certainly does help… This is awesome. Thanks a lot mfatica.

I have one doubt. Might be a bit silly. The Macro PAGEABLE that you use… Is it predefined in C? If not what would go in that macro?

Thanks again,

mfatica · April 7, 2009, 7:12pm

It is just a define at compile time.
gcc -DPAGEABLE file.c
will use a page-locked buffer.
It will improve the performance

avi2886 · April 8, 2009, 7:17pm

Hi Mfatica,

         I was looking at your function and came up with another question. last one, I swear..

I was wondering what is the logic behind setting max buffersize as 4MB. Is this the hardware limitation that you spoke of?
Because the old thread that I found [url=“http://forums.nvidia.com/lofiversion/index.php?t69307.html”]http://forums.nvidia.com/lofiversion/index.php?t69307.html[/url] says that its only the pitch of the matrix (leading dimension/no of rows) that cannot exceed 65536. But if the max buffer is 4MB then it would mean I can have a matrix less than 1048576 floats (no of floats in 4MB) irrespective of how that is divided up into rows and cols. I find this a bit confusing. Could you help me out here…?

Thanks

mfatica · April 8, 2009, 9:38pm

The maximum memory pitch is 262144 bytes ( it is reported by deviceQuery).

The buffer is used to copy pieces of the original matrix before sending/receiving data to/from the GPU. I played a little bit with the dimensions of this buffer and this was a good size on my systems. You can experiment with the size.
Once the data is in this buffer, it is sent with cudaMemcpy not cudaMemcpy2D.

Topic		Replies	Views
Avoiding cudaMemcpy2D() because of 65536 pitch limit CUDA Programming and Performance	1	2423	April 14, 2009
cuBLAS fails when matrix has more than 2^31-1 entries? CUDA Programming and Performance	13	891	October 12, 2021
Copying Rectangular matrix from Host to Device : cudaErrorInvalidPitchValue (error 12) CUDA Programming and Performance	1	680	June 8, 2019
cublasDgemm issue :- Copying Rectangular matrix from Host to Device - cudaErrorInvalidPitchValue (error 12) GPU-Accelerated Libraries	1	712	June 7, 2019
Maximum pitch value allowed before cudaErrorInvalidPitchValue CUDA Programming and Performance	2	2795	August 29, 2008
Is there any function in cublas limit the matrix size less than or equal to 48 CUDA Programming and Performance	0	1336	May 25, 2014
Can't get cuMemcpy2D to Work CUDA Programming and Performance	1	2065	October 7, 2010
cudaMemcpy2D(): reason for pitch limits? CUDA Programming and Performance	3	7190	August 20, 2007
CUBLAS alloc limit CUDA Programming and Performance	9	10928	June 1, 2009
Cublas functions, matrix size limit..? Able to allocate too much memory through cublasAlloc CUDA Programming and Performance	0	2382	March 18, 2009

"Invalid pitch argument error" from cublasSetmatrix Limitation on size of matrix

Related topics