1D array vs. 2D array

[RESOLVED]

Hi,

I wrote a test application that compares 1D arrays to 2D arrays.

There are 3 variants:

  1. 1D array.

  2. 2D array.

  3. 2D array using pitched memory.

Originally I was trying to test the offsetCopy kernel from the Best Practices Guide when I noticed this bizarre result.

The code copies a 1024x1024 matrix in global memory. I execute the kernel 1000 times. It takes 1 second for the 1D version and 7 seconds for the 2D versions (both cases). Okay, what went wrong??

Now I set int szY = 1025, so the pitch actually has an effect and comes out as 1088. The 1D and plain 2D cases run the same as before, but the 2D pitched version now runs in 2.6 seconds. Can someone please explain this??

Here is the full test code:

#include <iostream>
#include <stdexcept>

using namespace std;

#include "../shared/stopwatch.h"

__global__ void offsetCopy1D(float *odata, float *idata, int offset)
{
    int szX = blockDim.x * gridDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    i = i % szX;
    odata[i] = idata[i];
}

__global__ void offsetCopy2D(float *odata, float *idata, int offset)
{
    int szX = blockDim.x * gridDim.x;
    int szY = blockDim.y * gridDim.y;
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset; // RESOLVED: here is the problem, you should swap i with j.
    i = i % szX;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    odata[i*szY + j] = idata[i*szY + j];
}

__global__ void offsetCopy2D_pitch(float *odata, float *idata, size_t pitch, int offset)
{
    int szX = blockDim.x * gridDim.x;
//  int szY = blockDim.y * gridDim.y;
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    i = i % szX;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    odata[i*pitch + j] = idata[i*pitch + j];
}

void test_res()
{
    cudaError_t errorCode = cudaGetLastError();
    if (errorCode != cudaSuccess) {
        cout << "Cuda errorCode = " << cudaGetErrorString(errorCode) << endl;
        throw std::runtime_error("Cuda error!"); // std::exception has no string constructor in standard C++
    }
}

int main(int argc, char** argv)
{
    Stopwatch sw;
    sw.start_print();

    float *din = NULL;
    float *dout = NULL;
    int szX = 1024;
    int szY = 1024;
    dim3 threads = dim3(16, 16);
    dim3 blocks = dim3(szX / threads.x, szY / threads.y);
    bool b2D = true;
    bool bPitch = true;
    size_t pitch;

    if (!bPitch) {
        cudaMalloc((void**)&din, sizeof(float) * szX * szY);
        cudaMalloc((void**)&dout, sizeof(float) * szX * szY);
    } else {
        cudaMallocPitch((void**)&din, &pitch, szY * sizeof(float), szX);
        cudaMallocPitch((void**)&dout, &pitch, szY * sizeof(float), szX);
        pitch /= sizeof(float); // convert pitch from bytes to elements
        cout << "Pitch " << pitch << endl;
    }
    test_res();

    int offset = 0;
    for (int it = 0; it < 1000; it++) {
        if (b2D) {
            if (!bPitch)
                offsetCopy2D<<<blocks, threads>>>(dout, din, offset);
            else
                offsetCopy2D_pitch<<<blocks, threads>>>(dout, din, pitch, offset);
        } else {
            int nBlocks = blocks.x * blocks.y;
            int nThreads = threads.x * threads.y;
            offsetCopy1D<<<nBlocks, nThreads>>>(dout, din, offset);
        }
        test_res();
    }

    cudaFree(din);
    cudaFree(dout);
    cudaThreadExit();
    test_res();
    sw.print();
}

For a bit of background, accessing 1D memory is generally faster, since you only have to read from memory once, whereas with 2D memory you're reading from it twice (once for each dimension). Pitch is introduced to ensure memory accesses are aligned to an address convenient for the GPU, so it can do coalesced reads.

With that in mind, it's not surprising that the 1D memory was consistently faster than the 2D memory. For the case of your szY dimension being 1025, the pitch increased it to 1088 to satisfy the alignment requirements (it didn't have to add any pitch before, since 1024 was presumably already aligned). Because pitched memory is aligned, it should be faster than non-pitched 2D memory, so the performance increase from using pitched memory isn't too surprising.
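To illustrate the padding, here is a minimal standalone sketch (not taken from your code; the sizes just mirror your 1025-wide case) of asking cudaMallocPitch for a padded allocation and the usual byte-based row addressing:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    float *dbuf = NULL;
    size_t pitchBytes = 0;                      // padded row size in bytes, chosen by the runtime
    size_t widthBytes = 1025 * sizeof(float);   // requested row width
    size_t height = 1024;                       // number of rows

    cudaMallocPitch((void**)&dbuf, &pitchBytes, widthBytes, height);
    // With a 1025-float row the pitch typically gets rounded up (e.g. to 1088 floats)
    // so that every row starts on an aligned address.
    printf("pitch = %zu bytes (%zu floats)\n", pitchBytes, pitchBytes / sizeof(float));

    // The usual way to address a pitched allocation: step between rows in bytes,
    // then index within the row:
    // float *row = (float*)((char*)dbuf + r * pitchBytes);
    // float val = row[c];

    cudaFree(dbuf);
    return 0;
}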

Now for your case of the 2D memory being 1024x1024, it makes sense that the pitched and non-pitched memory would have the same dimensions, but I don’t know why they would be 7x slower than 1D memory; that seems like a higher than usual performance hit, but maybe someone else with more experience in the matter can confirm/deny that.

hope that helped

EDIT: sorry I didn’t bother to look at the code you posted or run it myself, I just posted in response to your written results. I’m feeling lazy


Sorry, but your explanation doesn't make sense. Why would the memory be accessed twice? Even if we were talking about an array of pointers to arrays of floats (which I'm not sure applies to contiguous memory), it wouldn't justify a 7x slowdown. Moreover, that is not the case here: the type stays the same, a 1D array; only the grid and the block are now 2D instead of 1D. It's all supposed to be cosmetic.

About the pitched memory, I know what it's supposed to do, but do you have any explanation why 1024x1024 runs in 7 seconds, while 1024x1025 runs in only 2.6 seconds?? I'll emphasize that the second case is larger.

Yeah I mentioned in my edit that when I answered, I didn't look at your code, only your explanation of what was occurring. Typically a "2D array" refers to an array of arrays, not linear memory… As for the 7x slowness, like I also said, I don't think it should be that slow. Maybe you should try making your timing more specific, i.e. timing segments of your code as opposed to the entire thing, so you can see how much time each part of it takes.
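For example, a rough sketch of timing individual launches with CUDA events (just a fragment you'd drop around one of the kernel calls; which kernel and arguments you wrap is up to you):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
offsetCopy2D<<<blocks, threads>>>(dout, din, offset);   // or whichever variant you're measuring
cudaEventRecord(stop);
cudaEventSynchronize(stop);          // wait for the kernel to finish before reading the timer

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
cout << "kernel time: " << ms << " ms" << endl;

cudaEventDestroy(start);
cudaEventDestroy(stop);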

Sorry I couldn’t help you much

Just noticed this, should the second call pass &dout ?

cudaMallocPitch( (void**) &din, &pitch , szY * sizeof(float) , szX);  // din
cudaMallocPitch( (void**) &din, &pitch , szY * sizeof(float) , szX);  // din again !!

@alrikai, don’t sweat it, thanks for the effort.

@kbam, you are definitely right, and I'll edit the code. It explains why the pitched version failed on my friend's card. Unfortunately it didn't change the results at all.

[RESOLVED]

i and j should be swapped. It seems I copied a wrong example from someone who didn't check it thoroughly and used this mistake consistently throughout his code (I won't name names). Counter to my intuition that x and y should behave as in Cartesian coordinates (x for columns, y for rows), I treated x as the first dimension, and since the matrix is row-major, I ended up using x for rows. This also explains why the pitched-memory mistake was faster: since the indices were inverted, the pitch added an offset which happened to align some of the threads' accesses.
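For anyone else hitting this, here is roughly what the fixed 2D kernel looks like once i and j are swapped (a sketch of the fix described above, with x indexing the contiguous dimension so that consecutive threads touch consecutive addresses):

__global__ void offsetCopy2D_fixed(float *odata, float *idata, int offset)
{
    int szX = blockDim.x * gridDim.x;                         // row width = the contiguous dimension
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;   // column index (x)
    i = i % szX;
    int j = blockIdx.y * blockDim.y + threadIdx.y;            // row index (y)
    odata[j * szX + i] = idata[j * szX + i];                  // consecutive threads -> consecutive addresses (coalesced)
}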