Why is CUDA tex2D extremely slow on TX2?

Hi guys,

I’m writing a simple image remapping function using CUDA texture memory to process a sequence of images (at a resolution of 1600x1200). It takes 0.480 ms on my GTX 1050 but more than 10 ms on the TX2’s GPU, even when the kernel only copies pixels from source to output with tex2D (measured by NVProfiler). Does anyone know how to fix this performance issue? Please help.

Here is my implementation:

template<typename channelType, typename pixelType>
__global__ void remap(
    const cudaTextureObject_t input,
    pixelType* output,
    const float* __restrict__ mapX,
    const float* __restrict__ mapY,
    int width, int height, size_t pixelStep
) {
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height) {
        const float xcoord = mapX[y * width + x];
        const float ycoord = mapY[y * width + x];

        // For this timing test I only copy the source pixel straight through;
        // the real remap would sample at (xcoord, ycoord) instead.
        // The +0.5f offsets hit the texel centers for unnormalized coordinates.
        uchar4 px = tex2D<uchar4>(input, x + 0.5f, y + 0.5f);
        output[y * width + x] = make_uchar3(px.x, px.y, px.z);
    }
}

Thanks a lot!

Hi tpham,

Have you set your TX2 device to MAX performance?

sudo nvpmodel -m 0
sudo ./jetson_clocks.sh
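If you want to verify the mode before and after, something like this should work (a sketch; on newer JetPack releases the script is installed as jetson_clocks without the .sh extension):

```shell
# Query the currently active power model (0 = MAXN on the TX2)
sudo nvpmodel -q

# Show the current CPU/GPU/EMC clock settings
sudo ./jetson_clocks.sh --show
```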

Hi kaycc,

Will this setting burn my device? I’ve read in some articles that there can be heat problems with this approach. Is there any software-side technique (such as a memory allocation tip or more optimal APIs) to use on the TX2? I would consider over-clocking my TX2 only as a last resort.



It won’t. We don’t over-clock it; we just fix the clocks at their rated maximum frequency.
Please try it.

It works like a charm. Thanks so much!

I’ve decided not to start a new thread because I think my issue with CUDA core (tex2D) performance is similar to the one described above (except that this solution didn’t work for me).
I’m a newbie in CUDA, so I assume I’ve just used the wrong approach to solve my problem. Here is the deal:

I’m trying to implement video camera format conversion (BG10 to RGBA) on the CUDA cores (I’ve attached an NSight project, with the test data file out.data in the Debug folder, for you to check).

The first implementation processed the image data on the CPU (single thread, single core) and took about 200 ms to convert a single image (1280x720). That is too slow, so the next step was to bring the CUDA cores into solving the problem.

My CUDA implementation takes almost the same time per image (~180 ms).
I think 256 CUDA cores should make it much faster.
Can you please check my source code and point out my mistakes?

PS: I’ve tried jetson_clocks and nvpmodel -m 0 (a performance boost of about 25%, but it’s still very slow).
testCUDA.7z (697 KB)

Hi itaowazard,

Static linking and a Release build would significantly improve your kernel speed; I got my sample above running about 10x faster with these changes. Properly dividing the work into grids and blocks also improves performance. I am not at my computer so I can’t investigate your code, but you can try these. Hope it helps!

Hi tpham,
thanks for the advice. Linking is static by default for the CUDA libs,
but the Release build did the trick (I never expected such a difference).
Thanks again!