Description
I wrote a CUDA kernel which implements the Matlab permute operation.
The kernel produces the same results as Matlab's permute when both operate on the same input data.
I run it on two different platforms:
1. Quadro RTX 3000, which I have in my laptop:
   Nvidia Driver Version: R516.01 (r515_953) / 31.0.15.1601 (4242022)
   CUDA Version: 11.7
   Operating System + Version: Windows 10
2. Bare-metal Xavier NX with JetPack 5.0.2
In general, my CUDA kernels, which usually involve a lot of computation, execute much faster on the Quadro RTX 3000 than on the Xavier NX.
But this permutation kernel, which performs no computation except address offset calculations, is much faster on the Xavier NX than on the Quadro RTX 3000.
I'm guessing it is an issue of cache and global memory sizes, bandwidth, and read/write frequencies…

Please advise whether these are the real reasons.

Is there any technique to reduce this kernel's execution time?

Is there an NVIDIA library that performs the permute operation, which I could use instead of writing my own CUDA kernel?
Implementation description
The operation takes a 3-dimensional array as input with the following sizes:
dim1 = 51
dim2 = 385
dim3 = 1024
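For reference, the working set these dimensions imply can be computed as below (a standalone sketch, assuming 4-byte float elements as in the kernel signature). Each buffer is roughly 77 MiB, much larger than typical GPU L2 caches, so the kernel is bound by global memory traffic.

```cpp
#include <cstdint>

// Standalone sketch: element count and per-buffer byte size implied by the
// dimensions above, assuming 4-byte float elements (as in the kernel).
constexpr std::uint32_t dim1 = 51;
constexpr std::uint32_t dim2 = 385;
constexpr std::uint32_t dim3 = 1024;
constexpr std::uint32_t totalElements = dim1 * dim2 * dim3;                   // 20,106,240
constexpr std::uint64_t bytesPerBuffer = std::uint64_t(totalElements) * 4ull; // ~76.7 MiB
```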
The following is the current revision of my CUDA kernel code:
#include <cstdint>

__global__ void Premute(
    const float* const source,
    float* const output,
    const std::uint32_t totalElements,
    const std::uint32_t dim1Size,
    const std::uint32_t dim2Size,
    const std::uint32_t dim3Size)
{
    std::uint32_t elementIndex = blockIdx.x * blockDim.x + threadIdx.x;

    if (elementIndex >= totalElements)
        return;

    const std::uint32_t beforePremute2DSliceSize = dim2Size * dim1Size;
    const std::uint32_t afterPremute2DSliceSize = dim2Size * dim3Size;

    // Decompose the linear element index into per-dimension indices
    // (dim1 varies fastest in the source layout).
    const std::uint32_t dim1Idx = elementIndex % dim1Size;
    elementIndex /= dim1Size;
    const std::uint32_t dim2Idx = elementIndex % dim2Size;
    elementIndex /= dim2Size;
    const std::uint32_t dim3Idx = elementIndex % dim3Size;

    const float* const srcOffset =
        source +
        dim3Idx * beforePremute2DSliceSize +
        dim2Idx * dim1Size +
        dim1Idx;

    float* const outputOffset =
        output +
        dim1Idx * afterPremute2DSliceSize +
        dim2Idx * dim3Size +
        dim3Idx;

    *outputOffset = *srcOffset; // the original was missing the dereference on the left-hand side
}
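As a sanity check, the same index mapping can be expressed as a plain CPU loop. This is a hypothetical reference implementation mirroring the kernel's arithmetic, not part of the original code:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical CPU reference mirroring the kernel's index arithmetic:
// the source is laid out with dim1 fastest and dim3 slowest; the output
// ends up with dim3 fastest and dim1 slowest.
std::vector<float> PermuteCpu(const std::vector<float>& source,
                              std::uint32_t dim1Size,
                              std::uint32_t dim2Size,
                              std::uint32_t dim3Size)
{
    const std::uint32_t beforePremute2DSliceSize = dim2Size * dim1Size;
    const std::uint32_t afterPremute2DSliceSize = dim2Size * dim3Size;
    const std::uint32_t totalElements = static_cast<std::uint32_t>(source.size());
    std::vector<float> output(source.size());

    for (std::uint32_t elementIndex = 0; elementIndex < totalElements; ++elementIndex)
    {
        std::uint32_t idx = elementIndex;
        const std::uint32_t dim1Idx = idx % dim1Size; idx /= dim1Size;
        const std::uint32_t dim2Idx = idx % dim2Size; idx /= dim2Size;
        const std::uint32_t dim3Idx = idx % dim3Size;

        output[dim1Idx * afterPremute2DSliceSize + dim2Idx * dim3Size + dim3Idx] =
            source[dim3Idx * beforePremute2DSliceSize + dim2Idx * dim1Size + dim1Idx];
    }
    return output;
}
```

Running this on the host with the same input gives a reference result to compare the kernel's output against.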
This is the code which launches the kernel:
const std::uint32_t threadsBlocksX = 1024u;
// Integer ceiling division; the float round-trip in the original version
// can lose precision for large element counts.
const std::uint32_t blocksX = (elementsCount + threadsBlocksX - 1u) / threadsBlocksX;

dim3 dimBlock(threadsBlocksX, 1, 1);
dim3 dimGrid(blocksX, 1, 1);

Premute<<<dimGrid, dimBlock, 0, stream>>>(
    source,
    output,
    elementsCount,
    51,
    385,
    1024);
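The grid-size computation can also be factored into a small helper using integer ceiling division (a sketch; the name `BlocksFor` is my own, not from the original code):

```cpp
#include <cstdint>

// Hypothetical helper: integer ceiling division for the grid size, avoiding
// the float cast round-trip, which can lose precision for large counts.
std::uint32_t BlocksFor(std::uint32_t elementsCount, std::uint32_t threadsPerBlock)
{
    return (elementsCount + threadsPerBlock - 1u) / threadsPerBlock;
}
```

For elementsCount = 51 * 385 * 1024 = 20,106,240 and 1024 threads per block, this yields exactly 19,635 blocks.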