2D Grid Allocation and Access Performance

I'm sure plenty of topics have been created about this already, but I need help with 2D allocation. In an OpenCL implementation I made, I simply flattened my 2D grid into a 1D grid (copied it) to pass it to the OpenCL kernel.

I realize that this is a hack, and it sort of works, but I need quick rendering times, and constantly converting between 2D and 1D every frame is a lot of overhead. So I looked into other solutions with CUDA and naturally stumbled upon:

cudaMalloc3D()

and

cudaMallocPitch()

and how to use them.

There are two things I can't figure out, however:

  1. When would I use one over the other (pitch vs 3D)? I can’t figure that out via the documentation. I’m allocating a 2D grid.
  2. The access annoys me. It’s great that the memory is aligned and all on the device, but to access every base address of a row I have to do:
float *row = (float *)((char *)basePtr + r * pitch);   // pitch is in bytes, r is the row index

Is there no way to just increment that instead? I suppose it's not the end of the world: in a 1500x1500 grid it's only 1500 address calculations, which beats the ~2.3 million if I had to do it for every element, but it's still a lot of calculations. I realize it's all integer arithmetic, but is that overhead worth the alignment? It feels like a lot, and to be frank, this pitched/3D allocation stuff is a fair bit more complicated than simply dealing with 2D and forcing it into a 1D grid. (A sketch of the access pattern I mean follows below.)
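Roughly what I'm doing per element right now, as a sketch (one thread per cell; grid, pitch, width, and height are just placeholder names):

// Sketch of per-element access into a cudaMallocPitch'd 2D float grid.
// One thread per cell; grid/pitch/width/height are placeholder names.
__global__ void touchCell(float *grid, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row

    if (x < width && y < height) {
        // pitch is in BYTES, so the row start is found via a char* offset
        float *row = (float *)((char *)grid + y * pitch);
        float v = row[x];          // 1 read
        v = v * 2.0f;              // placeholder computation
        row[x] = v;                // 1 write back to the same cell
    }
}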

Thanks!

The pitch is mostly used when your rows don't have a nice power-of-two size (or whatever alignment the hardware prefers);
the pitched allocation pads each row so that your access pattern in memory is more alignment-friendly.

If all your kernel does is calculate the row pointer (plus an offset, for example) and then
perform one simple operation, then obviously the overhead is big.
However, if you have a bigger or more meaningful kernel, the time to calculate this is small compared to the other work the kernel does.

You could also calculate the row pointer in shared memory (once for all threads in a block) and then have the threads add their own offset on top of the shared value. It might be overkill though :)
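Something along these lines (an untested sketch; it assumes each block covers a single row, i.e. blockDim.y == 1, and the names are made up):

// Untested sketch of the shared-memory idea: the row base pointer is computed
// once per block, and every thread just adds its own column offset.
__global__ void touchCellShared(float *grid, size_t pitch, int width, int height)
{
    __shared__ float *rowBase;

    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y;                             // one row per block in y

    if (threadIdx.x == 0 && threadIdx.y == 0)
        rowBase = (float *)((char *)grid + y * pitch);
    __syncthreads();

    if (x < width && y < height) {
        float v = rowBase[x];
        v = v * 2.0f;            // placeholder computation
        rowBase[x] = v;
    }
}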

With the padded allocation, the memory is contiguous, but does that necessarily mean the inside won't be padded (i.e. all the padding is at the end)? I can't think of why you'd want padding on the inside, but I'm just trying to make sure I get it. I've never used a pitched setup like this before.

EDIT:
^ Never mind, the inside IS padded. That's why the access is done the way the documentation shows. For 600 floats at 32 bits (4 bytes) each, it allocated 2560 bytes per row instead of 2400 (times the number of rows, also 600). So each row (the width) is padded. It also means that when copying data out, a 2D memcpy must be used, and the host pitch and device pitch must both be correct (in this case 2400 and 2560 respectively).
/EDIT
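For anyone who hits the same thing, this is roughly what the allocation and copies look like, as a sketch (600x600 floats, error checking omitted; hostGrid is assumed to be a densely packed width*height float buffer on the host):

// Sketch of the pitched allocation and 2D copies described above.
const int width  = 600;                     // floats per row
const int height = 600;                     // rows
size_t hostPitch = width * sizeof(float);   // 2400 bytes per row on the host
size_t devPitch  = 0;                       // filled in by cudaMallocPitch (2560 here)
float *devGrid = NULL;

cudaMallocPitch((void **)&devGrid, &devPitch, width * sizeof(float), height);

// host -> device: source pitch is the host pitch, destination pitch is the device pitch
cudaMemcpy2D(devGrid, devPitch, hostGrid, hostPitch,
             width * sizeof(float), height, cudaMemcpyHostToDevice);

// ... run the kernel ...

// device -> host: the pitches swap roles
cudaMemcpy2D(hostGrid, hostPitch, devGrid, devPitch,
             width * sizeof(float), height, cudaMemcpyDeviceToHost);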

It would be used for many accesses, but it’s being a pain in the neck, so I’d rather not use it if the gains aren’t good.

I also can't figure out which to use for allocation (pitch vs 3D, see the first post). Right now I'm using pitch. But given the way the NVIDIA documentation says to access it, I suspect (based on my results) that there are 'holes' in the allocation (i.e. indices not at the end that don't hold data, even though they are type-aligned).

EDIT:
^ As stated above, there are 'holes' in the allocation. I'm using cudaMallocPitch for the allocation.
/EDIT
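For what it's worth, here is roughly how the two allocation calls from my first question compare for this 600x600 case, as far as I can tell (a sketch; cudaMalloc3D returns its pitch inside a cudaPitchedPtr, and a depth of 1 makes it effectively 2D):

// Sketch: the two allocation calls side by side for the same 600x600 float grid.
// cudaMallocPitch is 2D and returns the pitch directly:
size_t pitch;
float *devGrid;
cudaMallocPitch((void **)&devGrid, &pitch, 600 * sizeof(float), 600);

// cudaMalloc3D is the same idea generalized to a depth; the pitch comes back
// inside a cudaPitchedPtr (width is given in bytes, depth of 1 here):
cudaPitchedPtr devPitchedPtr;
cudaExtent extent = make_cudaExtent(600 * sizeof(float), 600, 1);
cudaMalloc3D(&devPitchedPtr, extent);
// devPitchedPtr.ptr is the base pointer, devPitchedPtr.pitch is the row pitch in bytes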

I'm still wondering whether the alignment is worth the access pattern. Each thread must access one float in my 600x600 float grid, so there are two accesses per thread: one read, some computation, then one write back to that cell. The threads aren't doing contiguous accesses, so simply adding more to an offset doesn't help; each thread picks its cell based on its thread index/block index. Given that random access and the single read/write per thread, I'm not sure I'm getting the full benefit of the aligned allocation.

That being said, it's probably cheaper in terms of crunch time than storing the grid in 2D on the host, shoving it into a 1D buffer, copying it onto the device, running the kernel, copying it off the device, expanding it back into two dimensions, and accessing it there.
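Just to make the comparison concrete, the flatten/expand alternative I'm talking about looks roughly like this (a sketch; hostGrid2D is a placeholder for the existing float** host grid):

// Sketch of the flatten/expand path being compared against:
// pack a float** host grid into one contiguous buffer every frame,
// copy it down, run the kernel, copy it back, and unpack it again.
const int width = 600, height = 600;
float *flat = (float *)malloc(width * height * sizeof(float));
float *devFlat;
cudaMalloc((void **)&devFlat, width * height * sizeof(float));

// 2D -> 1D on the host (hostGrid2D is a placeholder row-pointer layout)
for (int y = 0; y < height; ++y)
    memcpy(&flat[y * width], hostGrid2D[y], width * sizeof(float));

cudaMemcpy(devFlat, flat, width * height * sizeof(float), cudaMemcpyHostToDevice);
// ... run the kernel on the flat buffer ...
cudaMemcpy(flat, devFlat, width * height * sizeof(float), cudaMemcpyDeviceToHost);

// 1D -> 2D back on the host
for (int y = 0; y < height; ++y)
    memcpy(hostGrid2D[y], &flat[y * width], width * sizeof(float));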

Anybody have any input for that?