Bad performance using MallocPitch and Memcpy2D

Lass · May 23, 2017, 2:35pm

Hi,

I tried to accelerate an image processing function using Pitch, but I have really bad performance.

For instance, with basic cudaMemcpy and cudaMalloc the kernel processed in: 1462 usec (good perf)
Now with memcpy2D and cudaMallocPitch, the kernel processed in: 56299 usec (really bad perf)

Something must be wrong with my code. Do you have any idea ?

Here is the host part:

//image size
	int nY = fpIn.height(); 
	int nX = fpIn.width(); 

//pitch
        size_t pitch1;
	size_t pitch2;

//image ptrs on host
	float* imIn=static_cast<float*>(fpIn.data()); //pointer-> input image
	unsigned short* imRef=static_cast<unsigned short*>(ucOff.data()); //pointer-> ref image


//device ptrs
       float* d_imIn;
       float* d_imRef;

//GPU alloc
       cudaMallocPitch((void**)&d_imIn,&pitch1,nX*sizeof(float),nY);
       cudaMallocPitch((void**)&d_imRef,&pitch2,nX*sizeof(unsigned short),nY);

//copyHTD
       cudaMemcpy2D(d_imIn,pitch1,imIn,nX*sizeof(float),nX*sizeof(float),nY,cudaMemcpyHostToDevice);
       cudaMemcpy2D(d_imRef,pitch2,imRef,nX*sizeof(unsigned short),nX*sizeof(unsigned short),nY,cudaMemcpyHostToDevice);

//launch kernel
	dim3 Db(32,32), Dg(nX/32,nY/32); //image size:1024x1024
	K1_with_pitch<<<Dg,Db>>>(d_imIn, d_imRef, nY, nX, pitch1>>2,pitch2>>2); //pitch in pix not in bytes

//copyDTH
	cudaMemcpy2D(imIn,nX*sizeof(float),d_imIn,pitch1,nX*sizeof(float),nY,cudaMemcpyDeviceToHost);

and the device part:

__global__ void K1_with_pitch(float* imIn, unsigned short* imRef, int nY, int nX, size_t pitch1, size_t pitch2)
{

	int i=blockDim.x*blockIdx.x+threadIdx.x;
	int j=blockDim.y*blockIdx.y+threadIdx.y;

	//if((i>=nY) || (j>=nX)) return;

	if((i<nY) && (j<nX)) imIn[j+i*pitch1] = imIn[j+i*pitch1] - static_cast<float>(imRef[j+i*pitch2]); // I changed "[j+i*Nx]" to "[j+i*pitch]"


}

Thanks.

Robert_Crovella · May 23, 2017, 2:39pm

Your pitch calculations for indexing are not correct.

Please refer to the documentation:

[url]CUDA Runtime API :: CUDA Toolkit Documentation

The pitch value returned by cudaMallocPitch is a quantity in bytes

Even when you fix that, the pitched method may not give any better performance than the unpitched method. Pitched allocations were especially useful on early GPUs, but are of less significance on modern GPUs. Depending on your GPU, its possible that the overhead associated with pitch calculations in the kernel (especially for such a simple kernel) may outweigh any benefit from pitched access (although it should not cause a ~50x performance reduction)

njuffa · May 23, 2017, 3:01pm

cudaMemcpy2D() can be dramatically slower than cudaMemcpy(), especially for extreme aspect ratios (tall matrices with short rows). That is a function of hardware (“copy engines”), and nothing can be done about it. . The higher copy time should be visible in the profiler.

However, replacing such a cudaMemcpy2D() by many piece-wise cudaMemcpy() calls would be no faster and likely quite a bit slower. So when you need a true 2D copy, use cudaMemcpy2D(), otherwise use cudaMemcpy().

Robert_Crovella · May 23, 2017, 3:09pm

I had assumed that the question implied an accurate measurement of kernel-only execution time, but it is unclear. So if the reported timing includes the data copy time as well as kernel execution time, that could possibly explain the large disparity in performance. Also, it probably goes without saying, but the displayed method could not produce the same (i.e. correct) results as the unpitched method. Finally, for completeness, I would assume that running such a code with cuda-memcheck would encounter out-of-bounds accesses, or indeed proper cuda error checking may turn up such.

njuffa · May 23, 2017, 3:23pm

Note that if you must use pitch-linear storage on the device, you could still use a straight cudaMemcpy() if you use the same data layout, i.e. with pitch, on the host. Since host memory usually is not a scarce resource, that might offer the best overall performance.

But as txbob stated, the importance of pitch-linear storage for device-code performance has diminished over the years, as the GPU memory controllers have become more sophisticated.

Lass · May 23, 2017, 3:51pm

Thanks for your answers.

Interesting, I didn’t know that pitched allocations aren’t that usefull on modern GPUs.

The reported timing do not include data copy. That’s the reason why I’m a little bit confused about the perf.

txbob, I didn’t really understand what is not correct with my pitch. Which part do I have to correct exactly ?

njuffa · May 23, 2017, 4:06pm

It looks like you may need to look into your use of pitch:

Lass:

txbob:

Used correctly, the use of pitch-linear memory should result in performance greater-than-or-equal-to the performance with plain old linear memory.

Robert_Crovella · May 23, 2017, 8:04pm

If you read the documentation link I gave you, it gives an exact line of code you can use to perform the pitch indexing in your kernel code.

Lass · May 24, 2017, 9:34am

You mean that way ?

__global__ void K_offsetpitch(float* imIn, unsigned short* imRef, int nY, int nX, size_t pitch1, size_t pitch2)
{

	for (int y=0; y<nY; y++) {
		float* r_imIn= (float*) ((char*) imIn + y * pitch1);
		unsigned short* r_imRef= (unsigned short*) ((char*) imRef + y * pitch2);
		for (int x=0; x<nX; x++) {
			r_imIn[x] = r_imIn[x] - static_cast<float>(r_imRef[x]);
		}
	}

}

and the call of my kernel is this one:

K1_with_pitch<<<Dg,Db>>>(d_imIn, d_imRef, nY, nX, pitch1,pitch2); //pitch in bytes here

I’ve already tried this method, but it didn’t work, moreover I don’t have any errors. Do you see something wrong there ?

Robert_Crovella · May 24, 2017, 4:34pm

This code couldn’t possibly make sense.

Every thread in your grid is doing the exact same thing?

My suggestion if you want help is to provide a short, complete code.

Topic		Replies	Views
need help for cudaMemcpy2D() CUDA Programming and Performance	5	4558	December 8, 2009
Simple Malloc on host CUDA Programming and Performance	4	1256	October 22, 2009
Understanding Memory Pitch Alignment CUDA Programming and Performance	9	12015	October 13, 2015
cudaMallocPitch is giving inconsistent result cudaMallocPitch is giving inconsistent r CUDA Programming and Performance	5	6281	June 28, 2008
Two memory questions, and a CPU question CUDA Programming and Performance	5	2533	March 24, 2008
2D textures bound to Pitch Linear Memory texture doesn't access correct information CUDA Programming and Performance	2	20212	December 21, 2010
Why pitched memory performanced worse than linear memory? CUDA Programming and Performance	5	970	August 24, 2015
cudaMemcpy2D example? CUDA Programming and Performance	5	19540	February 1, 2012
Copying 2D array from host to device CUDA Programming and Performance	7	7260	July 27, 2010
problem with cudaMallocPitch and cudaMemcpy2D CUDA Programming and Performance	4	21743	August 12, 2010

Bad performance using MallocPitch and Memcpy2D

Related topics