1D array vs. 2D array

[RESOLVED]

Hi,

I wrote a test application that compares 1D arrays to 2D arrays.

There are 3 variants:

  1. 1D array.

  2. 2D array.

  3. 2D array using pitched memory.

Originally I was trying to test the offsetCopy kernel from the Best Practices Guide when I noticed this bizarre result.

The code copies a 1024x1024 matrix in global memory. I execute the kernel 1000 times. It takes 1 second for the 1D version and 7 seconds for the 2D versions (both cases). Okay, what went wrong??

Now I set int szY = 1025, so the pitch actually has an effect and comes out as 1088. The 1D and plain 2D cases run the same as before, but the 2D pitched version now runs in 2.6 seconds. Can someone please explain this??

Here is the full test code:

#include <iostream>
#include <stdexcept>

using namespace std;

#include "../shared/stopwatch.h"

__global__ void offsetCopy1D(float *odata, float *idata, int offset)
{
    int szX = blockDim.x * gridDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    i = i % szX;
    odata[i] = idata[i];
}

__global__ void offsetCopy2D(float *odata, float *idata, int offset)
{
    int szX = blockDim.x * gridDim.x;
    int szY = blockDim.y * gridDim.y;
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset; // RESOLVED: here is the problem, you should swap i with j.
    i = i % szX;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    odata[i*szY + j] = idata[i*szY + j];
}

__global__ void offsetCopy2D_pitch(float *odata, float *idata, size_t pitch, int offset)
{
    int szX = blockDim.x * gridDim.x;
//  int szY = blockDim.y * gridDim.y;
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    i = i % szX;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    odata[i*pitch + j] = idata[i*pitch + j];
}

void test_res()
{
    cudaError_t errorCode = cudaGetLastError();
    if (errorCode != cudaSuccess) {
        cout << "Cuda errorCode = " << cudaGetErrorString(errorCode) << endl;
        throw std::runtime_error("Cuda error!"); // std::exception has no string constructor in standard C++
    }
}

int main(int argc, char** argv)
{
    Stopwatch sw;
    sw.start_print();

    float *din = NULL;
    float *dout = NULL;
    int szX = 1024;
    int szY = 1024;
    dim3 threads = dim3(16, 16);
    dim3 blocks = dim3(szX / threads.x, szY / threads.y);
    bool b2D = true;
    bool bPitch = true;
    size_t pitch;

    if (!bPitch) {
        cudaMalloc((void**)&din, sizeof(float) * szX * szY);
        cudaMalloc((void**)&dout, sizeof(float) * szX * szY);
    } else {
        cudaMallocPitch((void**)&din, &pitch, szY * sizeof(float), szX);
        cudaMallocPitch((void**)&dout, &pitch, szY * sizeof(float), szX);
        pitch /= sizeof(float); // convert pitch from bytes to elements
        cout << "Pitch " << pitch << endl;
    }
    test_res();

    int offset = 0;
    for (int it = 0; it < 1000; it++) {
        if (b2D) {
            if (!bPitch)
                offsetCopy2D<<<blocks, threads>>>(dout, din, offset);
            else
                offsetCopy2D_pitch<<<blocks, threads>>>(dout, din, pitch, offset);
        } else {
            int nBlocks = blocks.x * blocks.y;
            int nThreads = threads.x * threads.y;
            offsetCopy1D<<<nBlocks, nThreads>>>(dout, din, offset);
        }
        test_res();
    }

    cudaFree(din);
    cudaFree(dout);
    cudaThreadExit();
    test_res();
    sw.print();
}

For a bit of background, accessing 1D memory is generally faster, since you only have to read from memory once, whereas with 2D memory you're reading from it twice (once for each dimension). Pitch is introduced to ensure memory accesses are aligned to an address convenient for the GPU, so it can do coalesced reads.

With that in mind, it's not surprising that the 1D memory was consistently faster than the 2D memory. For the case of your szY dimension being 1025, the pitch increased it to 1088 to satisfy the alignment requirements (it didn't have to add any pitch before, since 1024 was presumably already aligned). Because pitched memory is aligned, it should be faster than non-pitched 2D memory, so the performance increase from using pitched memory isn't too surprising.
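To illustrate the padding, here is a minimal standalone sketch (not taken from your code; the sizes just mirror your 1025-wide case) of asking cudaMallocPitch for a padded allocation and the usual byte-based row addressing:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    float *dbuf = NULL;
    size_t pitchBytes = 0;                      // padded row size in bytes, chosen by the runtime
    size_t widthBytes = 1025 * sizeof(float);   // requested row width
    size_t height = 1024;                       // number of rows

    cudaMallocPitch((void**)&dbuf, &pitchBytes, widthBytes, height);
    // With a 1025-float row the pitch typically gets rounded up (e.g. to 1088 floats)
    // so that every row starts on an aligned address.
    printf("pitch = %zu bytes (%zu floats)\n", pitchBytes, pitchBytes / sizeof(float));

    // The usual way to address a pitched allocation: step between rows in bytes,
    // then index within the row:
    // float *row = (float*)((char*)dbuf + r * pitchBytes);
    // float val = row[c];

    cudaFree(dbuf);
    return 0;
}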

Now for your case of the 2D memory being 1024x1024, it makes sense that the pitched and non-pitched memory would have the same dimensions, but I don’t know why they would be 7x slower than 1D memory; that seems like a higher than usual performance hit, but maybe someone else with more experience in the matter can confirm/deny that.

hope that helped

EDIT: sorry I didn’t bother to look at the code you posted or run it myself, I just posted in response to your written results. I’m feeling lazy


Sorry, but your explanation doesn't make sense. Why would the memory be accessed twice? Even if we were talking about an array of pointers to arrays of floats (which I'm not sure applies to contiguous memory), it wouldn't justify a 7x slowdown. Moreover, that is not the case here: the type stays the same, a 1D array; only the grid and the block are now 2D instead of 1D. It's all supposed to be cosmetic.

About the pitched memory, I know what it's supposed to do, but do you have any explanation why 1024x1024 runs in 7 seconds, while 1024x1025 runs in only 2.6 seconds?? I'll emphasize that the second case is larger.

Yeah I mentioned in my edit that when I answered, I didn't look at your code, only your explanation of what was occurring. Typically a "2D array" refers to an array of arrays, not linear memory… As for the 7x slowness, like I also said, I don't think it should be that slow. Maybe you should try making your timing more specific, i.e. timing segments of your code as opposed to the entire thing, so you can see how much time each part of it takes.
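For example, a rough sketch of timing individual launches with CUDA events (just a fragment you'd drop around one of the kernel calls; which kernel and arguments you wrap is up to you):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
offsetCopy2D<<<blocks, threads>>>(dout, din, offset);   // or whichever variant you're measuring
cudaEventRecord(stop);
cudaEventSynchronize(stop);          // wait for the kernel to finish before reading the timer

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
cout << "kernel time: " << ms << " ms" << endl;

cudaEventDestroy(start);
cudaEventDestroy(stop);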

Sorry I couldn’t help you much

Just noticed this, should the second call pass &dout ?

cudaMallocPitch( (void**) &din, &pitch , szY * sizeof(float) , szX);  // din
cudaMallocPitch( (void**) &din, &pitch , szY * sizeof(float) , szX);  // din again !!

@alrikai, don’t sweat it, thanks for the effort.

@kbam, you are definitely right, and I'll edit the code. It explains why the pitched version failed on my friend's card. Unfortunately it didn't change the results at all.

[RESOLVED]

i and j should be swapped. It seems I copied a wrong example from someone who didn't check it thoroughly and used this mistake consistently throughout his code (I won't name names). Counter to my intuition that x and y should behave as in Cartesian coordinates (x for columns, y for rows), I treated x as the first dimension, and since the matrix is row-major, I ended up using x for rows. This also explains why the pitched-memory mistake was faster: since the indices were inverted, the pitch added an offset which happened to align some of the threads' accesses.
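For anyone else hitting this, here is roughly what the fixed 2D kernel looks like once i and j are swapped (a sketch of the fix described above, with x indexing the contiguous dimension so that consecutive threads touch consecutive addresses):

__global__ void offsetCopy2D_fixed(float *odata, float *idata, int offset)
{
    int szX = blockDim.x * gridDim.x;                         // row width = the contiguous dimension
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;   // column index (x)
    i = i % szX;
    int j = blockIdx.y * blockDim.y + threadIdx.y;            // row index (y)
    odata[j * szX + i] = idata[j * szX + i];                  // consecutive threads -> consecutive addresses (coalesced)
}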