Hi,

I have this code for estimating pi using the Monte Carlo method:

```
#include <stdint.h>

typedef float sFloat;
typedef int64_t s64Int;
#define MAX64 0x7FFFFFFFFFFFFFFF  // INT64_MAX

// xorshift64* pseudo-random generator
__host__ __device__ s64Int xorshift64star(s64Int x) {
    x ^= x >> 12; // a
    x ^= x << 25; // b
    x ^= x >> 27; // c
    return x * UINT64_C(2685821657736338717); // xorshift64* multiplier
}

__global__ void pikernel(s64Int samples, s64Int* nIn, s64Int* nOut) {
    s64Int i;
    sFloat xx, yy;
    s64Int ni = 0; // points inside the circle
    s64Int no = 0; // points outside the circle
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // seed the random generator (non-zero, different per thread)
    s64Int offset = tid * samples;
    s64Int x = 12345678 + offset * 89482311;
    s64Int y = 87654321 + offset + 12345678;
    // generate points and count hits
    for (i = 0; i < samples; i++) {
        x = xorshift64star(x);
        y = xorshift64star(y);
        xx = ((sFloat)x) / (sFloat)MAX64;
        yy = ((sFloat)y) / (sFloat)MAX64;
        if (xx * xx + yy * yy > 1.0)
            no++;
        else
            ni++;
    }
    // write per-thread counters back
    nIn[tid] = ni;
    nOut[tid] = no;
}
```

Surprisingly, running it on a Jetson TX1 takes longer than running it on a Jetson TK1. When I raise the GPU frequency using the method described here https://devtalk.nvidia.com/default/topic/952478/maximize-tx1-performance/, the time improves, but it is still worse than the TK1's. Moreover, profiling shows that the TX1 GPU's executed IPC is much lower than the TK1 GPU's, independent of the frequency. Here is a summary of the results:

**Jetson TX1 System**

Device “NVIDIA Tegra X1 (0)”

CUDA toolkit 8.0

g++ 5.4.0

Kernel: pikernel(long, long*, long*)

**Output** using default GPU frequency

Using 2 blocks and 512 threads per block

Estimated value of Pi is 3.141437

Estimation took **2,909,335** usec

**Output** using GPU frequency of 998400000 Hz

Using 2 blocks and 512 threads per block

Estimated value of Pi is 3.141437

Estimation took **1,016,859** usec

**nvprof summary**

| Invocations | Metric Name | Metric Description | Min | Max | Avg |
|---|---|---|---|---|---|
| 1 | **ipc** | Executed IPC | 0.435372 | 0.435372 | **0.435372** |
| 1 | flop_sp_efficiency | FLOP Efficiency (Peak Single) | 0.61% | 0.61% | 0.61% |
| 1 | sm_efficiency | Multiprocessor Activity | 99.94% | 99.94% | 99.94% |
| 1 | warp_execution_efficiency | Warp Execution Efficiency | 95.42% | 95.42% | 95.42% |

**Jetson TK1 System**

Device “GK20A (0)”

CUDA toolkit 6.5

g++ 4.8.4

Kernel: pikernel(__int64, __int64*, __int64*)

**Output**

Using 2 blocks and 512 threads per block

Estimated value of Pi is 3.141437

Estimation took **915,394** usec

**nvprof summary**

| Invocations | Metric Name | Metric Description | Min | Max | Avg |
|---|---|---|---|---|---|
| 1 | **ipc** | Executed IPC | 1.948358 | 1.948358 | **1.948358** |
| 1 | flop_sp_efficiency | FLOP Efficiency (Peak Single) | 2.18% | 2.18% | 2.18% |
| 1 | sm_efficiency | Multiprocessor Activity | 12.10% | 12.10% | 12.10% |
| 1 | warp_execution_efficiency | Warp Execution Efficiency | 98.02% | 98.02% | 98.02% |

Does anyone have any idea why the IPC is so much lower on the TX1?