Say I have a 1920 x 1080 image in pitched memory, where each pixel is a custom element, e.g. 8 * sizeof(float) bytes. I need to do some scanline operations on this pitched memory. My question is: how do I scan vertically over pitched memory efficiently? Horizontal scanning is obviously easy.
Ways that I can think of are:
- bind the pitched memory to a texture
- write a rotate kernel for the pitched memory and then do a horizontal scan
So what is the textbook way of doing this?
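If you intend to do multiple vertical operations, then do one vertical operation per thread, with adjacent threads assigned to adjacent columns. At each step down the image, the threads of a warp then read neighboring elements of the same row, so the effective access pattern across the warp can be nicely coalesced.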
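A minimal sketch of the one-thread-per-column idea, not a definitive implementation. It assumes the vertical operation is a running (inclusive) sum down each column. The `Pixel` struct, the kernel `verticalScan`, and the host helper `runVerticalScan` are hypothetical names introduced for illustration, and error checking is omitted for brevity.

```cuda
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical 32-byte pixel standing in for the "custom element" (8 floats).
struct Pixel { float v[8]; };

// One thread per column: each thread walks down its own column and replaces
// every pixel with the running sum of that pixel and all pixels above it.
__global__ void verticalScan(Pixel* img, size_t pitch, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width) return;

    Pixel acc = {};  // running sum, zero-initialized
    for (int row = 0; row < height; ++row) {
        // The pitch is in bytes, so step between rows through a byte pointer.
        Pixel* rowPtr = (Pixel*)((uint8_t*)img + (size_t)row * pitch);
        Pixel p = rowPtr[col];  // neighboring threads read neighboring pixels
        for (int k = 0; k < 8; ++k) acc.v[k] += p.v[k];
        rowPtr[col] = acc;
    }
}

// Host helper (illustrative): fills a width x height image with ones, runs the
// scan, and returns channel 0 of the last row in column `col`.
float runVerticalScan(int width, int height, int col)
{
    std::vector<Pixel> host((size_t)width * height);
    for (auto& p : host)
        for (int k = 0; k < 8; ++k) p.v[k] = 1.0f;

    Pixel* dev = nullptr;
    size_t pitch = 0;
    size_t rowBytes = (size_t)width * sizeof(Pixel);
    cudaMallocPitch((void**)&dev, &pitch, rowBytes, height);
    cudaMemcpy2D(dev, pitch, host.data(), rowBytes, rowBytes, height,
                 cudaMemcpyHostToDevice);

    // 1D launch over columns only; the row loop lives inside the kernel.
    int threads = 128;
    int blocks = (width + threads - 1) / threads;
    verticalScan<<<blocks, threads>>>(dev, pitch, width, height);

    cudaMemcpy2D(host.data(), rowBytes, dev, pitch, rowBytes, height,
                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host[(size_t)(height - 1) * width + col].v[0];
}
```

With an all-ones input, the bottom pixel of every column ends up holding the image height. A 32-byte element is wider than what a single thread loads in one instruction, so splitting the pixel into two `float4` loads may improve the access pattern further; that variant is not tested here.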
Cool. Method tested. It’s very fast.
Actually, I forgot to call my kernel. The benchmark shows it’s very slow once the kernel is actually called. I think the cause is cache misses during the vertical scan. Do you have an example of launching a kernel vertically?
My current code looks like this:
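int row = blockIdx.y * blockDim.y + threadIdx.y; // index of this thread's window of rows
int col = blockIdx.x * blockDim.x + threadIdx.x; // column number
// D is the number of rows handled per thread; the literal 10 in the original
// bounds check and loop is assumed to be the same value.
if (row >= (CustomImageHeight + D - 1) / D || col >= CustomImageWidth)
{
    return;
}
row *= D; // first row of this thread's window
for (int winSize = 0; winSize < D && row + winSize < CustomImageHeight; ++winSize)
{
    // pitch is in bytes, so the row offset must be (row + winSize) * pitch
    custom* data_loc = (custom*)((uint8*)dev_ptr + pitch * (row + winSize));
    custom& data = data_loc[col];
    // ... per-pixel work on data goes here ...
}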
For me, the canonical example is summing the rows of a matrix vs. summing the columns of a matrix.
For row-sums, you should use a parallel reduction method (across threads).
For column sums, you can have each thread perform a running sum of a column.
Compare case 1 to case 2 in the final sample code in the answer here:
https://stackoverflow.com/questions/51526082/cuda-parallel-reduction-over-one-axis/51530238#51530238
Case 1 performs a classical parallel reduction for the row sums, with cooperating threads, operating along the x-axis.
Case 2 performs one column sum per thread, operating along the y-axis.
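To make the two cases concrete, here is a rough sketch of both patterns. It is not the code from the linked answer. It assumes a dense row-major float matrix without pitch, and the kernel names `rowSums` and `colSums` and the host helper `sumBothWays` are hypothetical; error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Row sums: one block per row. Threads cooperate through a shared-memory tree
// reduction, so blockDim.x must be a power of two.
__global__ void rowSums(const float* m, float* out, int width)
{
    extern __shared__ float partial[];
    int row = blockIdx.x;
    float sum = 0.0f;
    // Block-stride loop: consecutive threads read consecutive addresses.
    for (int c = threadIdx.x; c < width; c += blockDim.x)
        sum += m[(size_t)row * width + c];
    partial[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[row] = partial[0];
}

// Column sums: one thread per column, each keeping a running sum. At every
// step the threads of a warp read adjacent elements of the same row.
__global__ void colSums(const float* m, float* out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width) return;
    float sum = 0.0f;
    for (int r = 0; r < height; ++r)
        sum += m[(size_t)r * width + col];
    out[col] = sum;
}

// Host helper (illustrative): computes both sets of sums for a host matrix.
void sumBothWays(const std::vector<float>& m, int width, int height,
                 std::vector<float>& rows, std::vector<float>& cols)
{
    float *dM = nullptr, *dRows = nullptr, *dCols = nullptr;
    cudaMalloc((void**)&dM, m.size() * sizeof(float));
    cudaMalloc((void**)&dRows, height * sizeof(float));
    cudaMalloc((void**)&dCols, width * sizeof(float));
    cudaMemcpy(dM, m.data(), m.size() * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;  // power of two, required by rowSums
    rowSums<<<height, threads, threads * sizeof(float)>>>(dM, dRows, width);
    colSums<<<(width + threads - 1) / threads, threads>>>(dM, dCols, width, height);

    rows.resize(height);
    cols.resize(width);
    cudaMemcpy(rows.data(), dRows, height * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(cols.data(), dCols, width * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dM); cudaFree(dRows); cudaFree(dCols);
}
```

In both kernels the threads of a warp touch contiguous memory on every load. The only difference is whether the threads cooperate on one result (rows) or each own one result (columns).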
Note that both of these methods should be able to come close to the GPU's main memory bandwidth, which is the performance upper bound for any memory-bound kernel.
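As a rough way to judge a benchmark against that bound, the arithmetic can be sketched as below. The helper names `passBytes` and `minKernelTimeMs` are hypothetical, and the 500 GB/s figure in the usage note is only an assumed example value, not a measurement of any particular GPU.

```cuda
#include <cstddef>

// Bytes moved by a kernel that reads every pixel once and writes it once.
size_t passBytes(int width, int height, size_t elemSize)
{
    return 2 * (size_t)width * (size_t)height * elemSize;
}

// Lower bound on kernel time in milliseconds if `bytesMoved` bytes must pass
// through main memory at `bandwidthGBs` gigabytes per second (1 GB = 1e9 bytes).
double minKernelTimeMs(size_t bytesMoved, double bandwidthGBs)
{
    return (double)bytesMoved / (bandwidthGBs * 1e9) * 1e3;
}
```

For the 1920 x 1080 image with 32-byte pixels, `passBytes(1920, 1080, 32)` is 132,710,400 bytes. At an assumed 500 GB/s, the best possible time for one read-plus-write pass is about 0.27 ms, so a measured time far above that suggests the access pattern, not the memory system, is the limit.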