CUDA kernel which performs permute on a 3-dimensional array runs much faster on Xavier NX than on Quadro RTX 3000

I wrote a CUDA kernel which implements the Matlab permute operation.
The kernel operates correctly: its output matches Matlab's permute when both operate on the same input data.

I run it on two different platforms:

  1. Quadro RTX 3000 which I have in my laptop
    Nvidia Driver Version: R516.01 (r515_95-3) / (4-24-2022)
    CUDA Version: 11.7
    Operating System + Version: Windows 10
  2. Xavier NX with JetPack 5.0.2

In general, all my CUDA kernels, which usually involve a lot of computation, run much faster on the Quadro RTX 3000 than on the Xavier NX.

But this permutation kernel, which performs no computation other than address offset calculations, runs much faster on the Xavier NX than on the Quadro RTX 3000.

I’m guessing it is an issue of cache and global memory sizes, bandwidth, and read/write frequencies…

  1. Please advise whether these are the real reasons.

  2. Is there any technique to reduce the execution time of this CUDA kernel?

  3. Is there any NVIDIA library that performs a permute operation, which I could use instead of writing my own CUDA kernel?
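On question 2: permute/transpose kernels are usually limited by uncoalesced global memory access rather than computation, and the standard remedy is to stage tiles through shared memory so that both the global read and the global write are coalesced. A minimal 2-D sketch of that pattern (TILE_DIM, the kernel name, and the row-major layout are illustrative assumptions, not the poster's code):

```cuda
#define TILE_DIM 32

// Tiled transpose: each block loads a TILE_DIM x TILE_DIM tile into shared
// memory with coalesced reads, then writes it back transposed with
// coalesced writes. Launch with blockDim = (TILE_DIM, TILE_DIM).
__global__ void TransposeTiled(const float* src, float* dst, int width, int height)
{
	__shared__ float tile[TILE_DIM][TILE_DIM + 1]; // +1 avoids shared-memory bank conflicts

	int x = blockIdx.x * TILE_DIM + threadIdx.x;
	int y = blockIdx.y * TILE_DIM + threadIdx.y;
	if (x < width && y < height)
		tile[threadIdx.y][threadIdx.x] = src[y * width + x]; // coalesced read

	__syncthreads();

	x = blockIdx.y * TILE_DIM + threadIdx.x; // transposed block origin
	y = blockIdx.x * TILE_DIM + threadIdx.y;
	if (x < height && y < width)
		dst[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}
```

For a 3-D dim1/dim3 swap, the same idea applies per 2-D slice along the middle dimension.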

Implementation description

The operation gets a 3-dimensional array as input with the following sizes:
dim1 = 51
dim2 = 385
dim3 = 1024
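Assuming dim1 is the fastest-varying dimension in memory and the kernel swaps dim1 and dim3 (as described later in the thread), the index mapping can be sketched as a CPU reference implementation; the function name and element order are illustrative:

```cpp
#include <cstdint>
#include <vector>

// CPU reference for a dim1 <-> dim3 permute (Matlab's permute(A, [3 2 1])).
// Element (i1, i2, i3) of the source maps to element (i3, i2, i1) of the output.
std::vector<float> PermuteDim1Dim3(const std::vector<float>& src,
                                   std::uint32_t d1, std::uint32_t d2, std::uint32_t d3)
{
	std::vector<float> out(src.size());
	for (std::uint32_t i3 = 0; i3 < d3; ++i3)
		for (std::uint32_t i2 = 0; i2 < d2; ++i2)
			for (std::uint32_t i1 = 0; i1 < d1; ++i1)
				out[i1 * (d3 * d2) + i2 * d3 + i3] =
					src[i3 * (d2 * d1) + i2 * d1 + i1];
	return out;
}
```

A host-side reference like this is also handy for validating the GPU output element by element.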

The following is my CUDA kernel code:

 __global__ void Premute(
	const float* const source,
	float* const output,
	const std::uint32_t totalElements,
	const std::uint32_t dim1Size,
	const std::uint32_t dim2Size,
	const std::uint32_t dim3Size)
{
	std::uint32_t elementIndex = blockIdx.x * blockDim.x + threadIdx.x;
	std::uint32_t dim1Idx;
	std::uint32_t dim2Idx;
	std::uint32_t dim3Idx;
	std::uint32_t beforePremute2DSliceSize = dim2Size * dim1Size;
	std::uint32_t afterPremute2DSliceSize = dim2Size * dim3Size;

	if (!(elementIndex < totalElements))
	{
		return;
	}

	dim1Idx = elementIndex % dim1Size;
	elementIndex /= dim1Size;
	dim2Idx = elementIndex % dim2Size;
	elementIndex /= dim2Size;
	dim3Idx = elementIndex % dim3Size;

	const float* const srcOffset =
		source +
		dim3Idx * beforePremute2DSliceSize +
		dim2Idx * dim1Size +
		dim1Idx;

	float* const outputOffset =
		output +
		dim1Idx * afterPremute2DSliceSize +
		dim2Idx * dim3Size +
		dim3Idx;

	*outputOffset = *srcOffset;
}

This is the code which launches the kernel:

const std::uint32_t threadsBlocksX = 1024u;
std::uint32_t blocksX = elementsCount / threadsBlocksX;
if (elementsCount % threadsBlocksX != 0u)
{
	++blocksX;
}
dim3 dimBlock(threadsBlocksX, 1, 1);
dim3 dimGrid(blocksX, 1, 1);
Premute<<<dimGrid, dimBlock, 0, stream>>>(
	source, output, elementsCount, dim1Size, dim2Size, dim3Size);
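As an aside, the block count can be computed without any floating-point casts by using integer ceiling division; a small sketch (the function name is illustrative):

```cpp
#include <cstdint>

// Number of blocks needed so that blocks * threadsPerBlock >= elementCount.
// Assumes elementCount + threadsPerBlock does not overflow std::uint32_t.
std::uint32_t BlocksForElements(std::uint32_t elementCount,
                                std::uint32_t threadsPerBlock)
{
	return (elementCount + threadsPerBlock - 1u) / threadsPerBlock;
}
```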


Could you share the GPU resources of both platforms with us?
We suppose the Quadro RTX 3000 is a laptop dGPU, is that correct?

$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery
$ sudo make
$ ./deviceQuery 

On Jetson, have you maximized the device performance?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Do you mean matrix permutation?
If yes, you can check our cuBLAS library.


Thank you for your response,

  1. Attached are the devices queries reports:
    Quadro_RTX3000_Device_Query.txt (3.5 KB)
    This is a laptop dGPU as you said.

    Xavier_NX_Device_Query.txt (3.5 KB)

  2. On the Jetson I’m using Max-N nvpmodel state (20W)

  3. I couldn’t find any API in the cuBLAS specification that implements the Matlab matrix permute operation.
    Please guide me to where I can find it.


Thanks for sharing the device info.

Based on the output, the GPU resources of the Xavier NX and the RTX 3000 are similar.
That can explain why you observe some operations running faster on the dGPU and others faster on the iGPU.


Quadro RTX 3000:

Maximum number of threads per streaming multiprocessor:     1024
Maximum number of threads per block:                        1024
Maximum block threads dimensions:                           1024,1024,64
Maximum grid blocks dimensions:                             2147483647,65535,65535


Xavier NX:

Maximum number of threads per streaming multiprocessor:     2048
Maximum number of threads per block:                        1024
Maximum block threads dimensions:                           1024,1024,64
Maximum grid blocks dimensions:                             2147483647,65535,65535

For the permutation implementation, could you share the number of dimensions of your matrix?

My matrix has 3 dimensions:
dim1 = 1024
dim2 = 385
dim3 = 51

I want to permute dim1 and dim3 in order to get:
dim1 = 51
dim2 = 385
dim3 = 1024


Sorry, cuBLAS doesn’t support 3D matrix permutation.

We do have some similar implementations in TensorRT.
But you will need to wrap your input in a TensorRT buffer first.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.