Choosing gridSize and blockSize for better performance on TX2

Hi,

Please help me choose better launch dimensions for my kernel.

    const size_t threads{ 32 };
    const dim3 blockSize( threads, threads );
    const dim3 gridSize{ uint32_t( width / threads ), uint32_t( height / threads ) };
    remap<<< gridSize, blockSize >>>( width, height, mapX0, mapY0, mapX1, mapY1, mapW0, mapW1, frame0, frame1, pano );

Width is 4096 and height is 2048.

Best regards, Viktor.

The most advantageous partitioning depends on the resource usage and data access patterns of the kernel.

In the absence of additional information, a reasonable heuristic is to start with a thread count per block that is a multiple of 32 and between 128 and 256 inclusive. Run the CUDA profiler on this initial version of the application, and base further code changes on what the profiler reports as the bottlenecks in the kernel.
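
As a rough illustration of that heuristic, here is a minimal host-side sketch that lists 2D block shapes worth trying first. The names BlockShape, matchesHeuristic and candidateBlockShapes are hypothetical, and limiting the search to power-of-two shapes is my own assumption to keep the list short. Which shape is actually fastest still has to come from profiling.

```cpp
#include <vector>

// Hypothetical helper type for illustration; in CUDA code the values would go into a dim3.
struct BlockShape {
    unsigned x;
    unsigned y;
};

// True if the total thread count follows the heuristic:
// a multiple of 32 (the warp size) and between 128 and 256 inclusive.
bool matchesHeuristic(unsigned x, unsigned y) {
    const unsigned total = x * y;
    return total % 32 == 0 && total >= 128 && total <= 256;
}

// Enumerate power-of-two 2D shapes that satisfy the heuristic.
// Restricting to powers of two is an assumption, not a CUDA requirement.
std::vector<BlockShape> candidateBlockShapes() {
    std::vector<BlockShape> shapes;
    for (unsigned x = 8; x <= 256; x *= 2) {
        for (unsigned y = 1; y <= 32; y *= 2) {
            if (matchesHeuristic(x, y)) {
                shapes.push_back(BlockShape{ x, y });
            }
        }
    }
    return shapes;
}
```

With this check a 16x16 block (256 threads) passes, while the 32x32 block from the question (1024 threads) does not.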

Here you might want to try a 16x16 thread block (256 threads) for the initial attempt, instead of the current 32x32 block (1024 threads).
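
A minimal sketch of the matching grid computation, host-side arithmetic only. LaunchDims, ceilDiv and makeLaunchDims are hypothetical names, not part of CUDA. Rounding up instead of truncating is my assumption so that sizes which are not a multiple of the block size are still covered; for 4096 x 2048 it makes no difference.

```cpp
#include <cstdint>

// Hypothetical container for illustration; the values would be passed to dim3.
struct LaunchDims {
    uint32_t gridX;
    uint32_t gridY;
    uint32_t blockX;
    uint32_t blockY;
};

// Integer division rounded up, so a partial tile at the edge still gets a block.
constexpr uint32_t ceilDiv(uint32_t n, uint32_t d) {
    return (n + d - 1) / d;
}

// Grid dimensions for a given image size; the block shape defaults to 16x16.
constexpr LaunchDims makeLaunchDims(uint32_t width, uint32_t height,
                                    uint32_t blockX = 16, uint32_t blockY = 16) {
    return LaunchDims{ ceilDiv(width, blockX), ceilDiv(height, blockY), blockX, blockY };
}

// Possible use at the launch site (CUDA):
//   const LaunchDims d = makeLaunchDims( width, height );
//   remap<<< dim3( d.gridX, d.gridY ), dim3( d.blockX, d.blockY ) >>>(
//       width, height, mapX0, mapY0, mapX1, mapY1, mapW0, mapW1, frame0, frame1, pano );
// When the size is not a multiple of the block shape, the kernel must skip
// threads with x >= width or y >= height (assumed; the kernel body is not shown).
```

For width 4096 and height 2048 this gives a 256 x 128 grid of 16x16 blocks.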

Thanks.