What are the textbook ways of accessing 2D pitched memory vertically

Say I have a 1920 x 1080 image where each pixel is a custom element, e.g. 8 * sizeof(float). I need to do some scanline operations on this pitched memory. My question is: how do I do a vertical scan over pitched memory more efficiently? Obviously the horizontal scan is easy.

Ways that I can think of are:

  1. Bind the pitched memory to a texture.
  2. Write a rotate kernel for the pitched memory and then do the horizontal scan.

So what is the textbook way of doing this?

If you intend to do multiple vertical operations, then do one vertical operation per thread. The effective access pattern across threads in a warp can be nicely coalesced.
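For illustration (this is not code from the answer above), here is a minimal sketch of that pattern, assuming a hypothetical custom pixel type of 8 floats and a cudaMallocPitch-style allocation:

struct custom { float v[8]; };

__global__ void columnScan(custom* dev_ptr, size_t pitch, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x; // one thread per column
    if (col >= width) return;

    for (int row = 0; row < height; ++row)
    {
        // pitch is in bytes, so cast to a byte pointer before stepping down one row
        custom* row_ptr = (custom*)((char*)dev_ptr + (size_t)row * pitch);
        custom& pixel = row_ptr[col];
        pixel.v[0] += 1.0f; // placeholder for the real per-pixel scanline work
    }
}

Launched with a 1D grid that covers the image width, consecutive threads in a warp touch consecutive custom elements within each row, so each trip through the loop reads one contiguous stretch of memory.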

Cool. Method tested. It’s very fast

Actually, I forgot to call my kernel. The benchmark shows it's very slow when it is actually called. I think it's the cache misses while doing the vertical scan. Do you have an example of launching a kernel vertically?

My current code looks like this:

int row = blockIdx.y * blockDim.y + threadIdx.y; // row index (each thread handles a window of 10 rows)
int col = blockIdx.x * blockDim.x + threadIdx.x; // column index

if (row >= ceilf((float)CustomImageHeight / (float)10) || col >= CustomImageWidth)
{
    return;
}

row *= D; // D: number of rows handled per thread (10, to match the bound check above)

for (int winSize = 0; winSize < 10; ++winSize)
{
    // step down the column: offset by whole rows, since pitch is in bytes
    custom* data_loc = (custom*)((uint8*)dev_ptr + pitch * (row + winSize));
    custom& data = data_loc[col];
    // ... process 'data' ...
}
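Not part of the original exchange, but as a rough sketch: with the indexing above, each thread covers one column position and a window of 10 rows, so the grid only needs to span the full width and one tenth of the height. The launch could look like this, reusing the asker's dev_ptr, pitch and image-size names and assuming a hypothetical kernel name verticalWindowKernel and a 32 x 8 block:

dim3 block(32, 8); // 32 threads along x keeps the per-row accesses coalesced
int rowsPerThread = 10;
int threadRows = (CustomImageHeight + rowsPerThread - 1) / rowsPerThread; // rows of threads, not pixel rows
dim3 grid((CustomImageWidth + block.x - 1) / block.x,
          (threadRows       + block.y - 1) / block.y);
verticalWindowKernel<<<grid, block>>>(dev_ptr, pitch);
cudaDeviceSynchronize();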

For me, the canonical example is summing the rows of a matrix vs. summing the columns of a matrix.

For row sums, you should use a parallel reduction method (across cooperating threads).

For column sums, you can have each thread perform a running sum of a column.

Compare case 1 to case 2 in the final sample code in the answer here:

https://stackoverflow.com/questions/51526082/cuda-parallel-reduction-over-one-axis/51530238#51530238

Case 1 performs a classical parallel reduction for the row sum, with cooperating threads, operating along the x-axis.

Case 2 performs one sum per column per thread, operating along the y-axis.

Note that both of these methods should be able to come close to a throughput equivalent to GPU main memory bandwidth, which is the performance upper bound for any memory-bound kernel.
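A minimal sketch in the spirit of those two cases (this is not the code from the linked answer; a plain row-major float matrix and a power-of-two block size of at most 256 threads are assumed):

// Case 2 style: one thread per column, running sum down the y-axis (coalesced loads)
__global__ void colSums(const float* A, float* out, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= cols) return;
    float s = 0.0f;
    for (int r = 0; r < rows; ++r)
        s += A[r * cols + col]; // adjacent threads read adjacent addresses
    out[col] = s;
}

// Case 1 style: one block per row, classical shared-memory parallel reduction along x
__global__ void rowSums(const float* A, float* out, int rows, int cols)
{
    __shared__ float sdata[256];
    int row = blockIdx.x; // one block per row
    float s = 0.0f;
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        s += A[row * cols + c]; // each thread accumulates a strided slice of its row
    sdata[threadIdx.x] = s;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) // tree reduction
    {
        if (threadIdx.x < stride)
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[row] = sdata[0];
}

These might be launched as, e.g., colSums<<<(cols + 255) / 256, 256>>>(A, out, rows, cols) and rowSums<<<rows, 256>>>(A, out, rows, cols).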