Bad performance using cudaMallocPitch and cudaMemcpy2D


I tried to accelerate an image processing function using pitched memory, but I get really bad performance.

For instance, with basic cudaMemcpy and cudaMalloc, the kernel runs in 1462 usec (good perf).
With cudaMemcpy2D and cudaMallocPitch, the kernel runs in 56299 usec (really bad perf).

Something must be wrong with my code. Do you have any idea?

Here is the host part:

//image size
	int nY = fpIn.height(); 
	int nX = fpIn.width(); 

	size_t pitch1;
	size_t pitch2;

//image ptrs on host
	float* imIn = static_cast<float*>( /* ... */ ); //pointer-> input image
	unsigned short* imRef = static_cast<unsigned short*>( /* ... */ ); //pointer-> ref image

//device ptrs
	float* d_imIn;
	float* d_imRef;

//GPU alloc
	cudaMallocPitch((void**)&d_imRef,&pitch2,nX*sizeof(unsigned short),nY);

	cudaMemcpy2D(d_imRef,pitch2,imRef,nX*sizeof(unsigned short),nX*sizeof(unsigned short),nY,cudaMemcpyHostToDevice);

//launch kernel
	dim3 Db(32,32), Dg(nX/32,nY/32); //image size:1024x1024
	K1_with_pitch<<<Dg,Db>>>(d_imIn, d_imRef, nY, nX, pitch1>>2,pitch2>>2); //pitch in pix not in bytes


and the device part:

__global__ void K1_with_pitch(float* imIn, unsigned short* imRef, int nY, int nX, size_t pitch1, size_t pitch2)
{
	int i=blockDim.x*blockIdx.x+threadIdx.x;
	int j=blockDim.y*blockIdx.y+threadIdx.y;

	//if((i>=nY) || (j>=nX)) return;

	if((i<nY) && (j<nX)) imIn[j+i*pitch1] = imIn[j+i*pitch1] - static_cast<float>(imRef[j+i*pitch2]); // I changed "[j+i*Nx]" to "[j+i*pitch]"
}



Your pitch calculations for indexing are not correct.

Please refer to the documentation:

The pitch value returned by cudaMallocPitch is a quantity in bytes

Even when you fix that, the pitched method may not give any better performance than the unpitched method. Pitched allocations were especially useful on early GPUs, but are of less significance on modern GPUs. Depending on your GPU, it's possible that the overhead of the pitch calculations in the kernel (especially for such a simple kernel) outweighs any benefit from pitched access (although it should not cause a ~50x performance reduction).
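The row arithmetic is easiest to get right by keeping the pitch in bytes rather than converting it to elements. A minimal host-side sketch of that arithmetic (`row_ptr` is just a name for this illustration; it mirrors what device code would do): note that shifting a byte pitch by `>>2` yields an element count only for 4-byte types, so `pitch2>>2` under-strides the `unsigned short` rows, which would need `>>1`.

```cpp
#include <cstddef>

// Compute the address of row 'row' inside a pitched allocation.
// 'pitch' is a distance in BYTES (which is what cudaMallocPitch reports),
// so the arithmetic has to go through a char* before re-typing.
template <typename T>
T* row_ptr(void* base, size_t pitch, int row) {
    return reinterpret_cast<T*>(static_cast<char*>(base) + row * pitch);
}

// Element (i, j) of a pitched unsigned short image is then
//   row_ptr<unsigned short>(base, pitch, i)[j]
// whereas indexing with i * (pitch >> 2) strides rows as if the elements
// were 4 bytes wide; an unsigned short image would need pitch >> 1.
```

In device code the same cast through `char*` works unchanged, independent of the element type, which sidesteps this kind of unit error entirely.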

cudaMemcpy2D() can be dramatically slower than cudaMemcpy(), especially for extreme aspect ratios (tall matrices with short rows). That is a function of the hardware (“copy engines”), and nothing can be done about it. The higher copy time should be visible in the profiler.

However, replacing such a cudaMemcpy2D() by many piece-wise cudaMemcpy() calls would be no faster and likely quite a bit slower. So when you need a true 2D copy, use cudaMemcpy2D(), otherwise use cudaMemcpy().

I had assumed that the question implied an accurate measurement of kernel-only execution time, but that is unclear. If the reported timing includes the data copy time as well as the kernel execution time, that could explain the large disparity in performance. Also, it probably goes without saying, but the displayed method could not produce the same (i.e. correct) results as the unpitched method. Finally, for completeness, I would expect that running such a code under cuda-memcheck would report out-of-bounds accesses, and proper CUDA error checking might turn them up as well.

Note that if you must use pitch-linear storage on the device, you could still use a straight cudaMemcpy() if you use the same data layout, i.e. with pitch, on the host. Since host memory usually is not a scarce resource, that might offer the best overall performance.
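A host-side sketch of that idea (`pack_to_pitch` is a hypothetical helper invented here, and the CUDA calls appear only as comments): lay the host copy out with the same row stride as the device pitch, then one flat copy moves the whole image.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: copy a tightly packed host image, srcRowBytes per
// row, into a newly allocated host buffer whose row stride equals the
// device pitch. Host and device then share one layout, so a single flat
// cudaMemcpy of pitch * nY bytes can replace the cudaMemcpy2D.
// The caller frees the returned buffer.
char* pack_to_pitch(const void* src, size_t srcRowBytes, size_t pitch, int nY) {
    char* dst = static_cast<char*>(calloc(nY, pitch));
    for (int y = 0; y < nY; ++y)
        memcpy(dst + y * pitch,
               static_cast<const char*>(src) + y * srcRowBytes,
               srcRowBytes);
    return dst;
}

// Device-side pairing (not runnable here):
//   cudaMallocPitch((void**)&d_imRef, &pitch2, nX * sizeof(unsigned short), nY);
//   char* h = pack_to_pitch(imRef, nX * sizeof(unsigned short), pitch2, nY);
//   cudaMemcpy(d_imRef, h, pitch2 * nY, cudaMemcpyHostToDevice);
```

The trade-off is the extra host-side repacking pass and the padded host allocation, which is usually cheap next to a slow cudaMemcpy2D.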

But as txbob stated, the importance of pitch-linear storage for device-code performance has diminished over the years, as the GPU memory controllers have become more sophisticated.

Thanks for your answers.

Interesting, I didn’t know that pitched allocations aren’t that useful on modern GPUs.

The reported timing does not include the data copies. That’s the reason why I’m a little bit confused about the perf.

txbob, I didn’t really understand what is incorrect about my pitch. Which part exactly do I have to correct?

It looks like you may need to look into your use of pitch:



Used correctly, pitch-linear memory should give performance greater than or equal to that of plain linear memory.

If you read the documentation link I gave you, it gives an exact line of code you can use to perform the pitch indexing in your kernel code.

You mean like this?

__global__ void K_offsetpitch(float* imIn, unsigned short* imRef, int nY, int nX, size_t pitch1, size_t pitch2)
{
	for (int y=0; y<nY; y++) {
		float* r_imIn= (float*) ((char*) imIn + y * pitch1);
		unsigned short* r_imRef= (unsigned short*) ((char*) imRef + y * pitch2);
		for (int x=0; x<nX; x++) {
			r_imIn[x] = r_imIn[x] - static_cast<float>(r_imRef[x]);
		}
	}
}

and the call of my kernel is this one:

K1_with_pitch<<<Dg,Db>>>(d_imIn, d_imRef, nY, nX, pitch1,pitch2); //pitch in bytes here

I’ve already tried this method, but it didn’t work; moreover, I don’t get any errors. Do you see anything wrong there?

This code couldn’t possibly make sense.

Every thread in your grid is doing the exact same thing?
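For reference, a per-thread rewrite of that kernel might look like the sketch below (untested here; both pitches passed in bytes, exactly as cudaMallocPitch returned them). The loops disappear: each thread derives its own pixel from its block and thread indices.

```cuda
// Sketch: one thread per pixel; pitch1/pitch2 are byte pitches.
__global__ void K_offsetpitch(float* imIn, const unsigned short* imRef,
                              int nY, int nX, size_t pitch1, size_t pitch2)
{
    int x = blockDim.x * blockIdx.x + threadIdx.x;  // column
    int y = blockDim.y * blockIdx.y + threadIdx.y;  // row
    if (x >= nX || y >= nY) return;

    float* rowIn = (float*)((char*)imIn + y * pitch1);
    const unsigned short* rowRef =
        (const unsigned short*)((const char*)imRef + y * pitch2);
    rowIn[x] -= (float)rowRef[x];
}
```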

My suggestion if you want help is to provide a short, complete code.