Why is CUDA tex2D extremely slow on TX2?

tpham · April 26, 2019, 8:26am

Hi guys,

I’m writing a simple image remapping function that uses CUDA texture memory to process a sequence of images (at a resolution of 1600x1200). It takes 0.480 ms on my GTX 1050 but more than 10 ms on the TX2’s GPU, even though the kernel only copies pixels from source to output using tex2D (measured with NVProfiler). Does anyone know how to fix this performance issue? Please help.

Here is my implementation:

template<typename channelType, typename pixelType>
__global__ void remap(
    const cudaTextureObject_t input,
    pixelType* output,
    const float* __restrict__ mapX,
    const float* __restrict__ mapY,
    int width, int height, size_t pixelStep
) {
    // One thread per output pixel.
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height) {
        // The map coordinates are loaded but not used: this test only copies pixels.
        const float xcoord = mapX[y * width + x];
        const float ycoord = mapY[y * width + x];

        // Fetch the source texel at (x, y) and drop the fourth channel.
        uchar4 px = tex2D<uchar4>(input, x, y);
        output[y * width + x] = make_uchar3(px.x, px.y, px.z);
    }
}

Thanks a lot!

kayccc · April 26, 2019, 8:35am

Hi tpham

Have you set your TX2 device to MAX performance?

sudo nvpmodel -m 0
sudo ./jetson_clocks.sh

tpham · April 26, 2019, 8:41am

Hi kayccc,

Will this setting burn out my device? I’ve read in some articles that this approach can cause heat problems. Is there any software technique (such as a memory allocation tip or more optimal APIs) to use on the TX2? I would consider over-clocking my TX2 only as a last resort.

Thanks

AastaLLL · May 3, 2019, 8:30am

Hi,

It won’t. We don’t over-clock the device; we just fix the frequency.
Please try it.

tpham · May 3, 2019, 9:03am

It works like a charm. Thanks so much!

itaowazard · August 6, 2019, 5:18am

Hi,
I decided not to start a new thread because I think my issue with CUDA core (tex2D) performance is similar to the one described above (except that this solution didn’t work for me).
I’m a newbie in CUDA, so I assume I just used the wrong approach to solve my problem. Here is the deal:

I’m trying to implement a video camera format conversion (BG10 to RGBA) on the CUDA cores. I’ve attached an NSight project, with the test data file out.data in the Debug folder, for you to check.

My first implementation processed the image data on the CPU (single thread, single core) and took about 200 ms to convert a single image (1280x720). This is too slow, so the next step was to use the CUDA cores to solve the problem.

My CUDA implementation takes almost the same time to process an image (~180 ms).
I think that 256 CUDA cores should make it much faster.
Can you please check my source code and point out my mistakes?

PS: I’ve tried jetson_clocks and nvpmodel -m 0 (a performance boost of about 25%, but it’s still very slow).
testCUDA.7z (697 KB)

tpham · August 6, 2019, 5:46am

Hi itaowazard,

Static linking and a Release build would significantly improve your kernel speed. I got my sample above running 10x faster with these approaches. Properly dividing the work into grids and blocks also gives better performance. I am not at my computer so I can’t investigate your code, but you can give these a try. Hope it helps!

itaowazard · August 6, 2019, 5:58am

Hi tpham,
Thanks for the advice. Linking is static by default for the CUDA libs,
but the Release build did the trick (I never expected such a difference).
Thanks again.
