Lower GPU IPC on TX1 compared to TK1

Hi,

I have this code for estimating pi using Monte Carlo method:

#include <stdint.h>

typedef float sFloat;
typedef int64_t s64Int;

#define MAX64 0x7FFFFFFFFFFFFFFF

// one step of xorshift64*; note the reference algorithm uses unsigned
// 64-bit state, so the signed (arithmetic) shifts here deviate from it
__host__ __device__ s64Int xorshift64star(s64Int x) {
	x ^= x >> 12; 	// a
	x ^= x << 25; 	// b
	x ^= x >> 27; 	// c
	return x * UINT64_C(2685821657736338717);
}

__global__ void pikernel(s64Int samples, s64Int* nIn, s64Int* nOut) {
	s64Int i;	
	sFloat xx, yy;
	s64Int ni = 0;	// points inside the circle
	s64Int no = 0;	// points outside the circle

	int tid = blockIdx.x * blockDim.x + threadIdx.x;

	// initialize the random generator (non-zero)
	s64Int offset = tid * samples;
	s64Int x = 12345678 + offset * 89482311;
	s64Int y = 87654321 + offset + 12345678;

	// generate points and count
	for(i = 0; i < samples; i++) {
		x = xorshift64star(x);
		y = xorshift64star(y);
		xx = ((sFloat)x)/(sFloat)MAX64;	// scale into roughly [-1, 1]
		yy = ((sFloat)y)/(sFloat)MAX64;
		// note: the double literal 1.0 promotes this comparison to double precision
		if (xx * xx + yy * yy > 1.0)
			no++;
		else
			ni++;
	}

	// update output
	nIn[tid] = ni;
	nOut[tid] = no;
}
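
The host side isn't shown above. Assuming it is appended to the kernel in the same .cu file, a minimal harness consistent with the output below (2 blocks of 512 threads, timing in microseconds) could look like the following sketch; the samples value and all host-side names are my own guesses, not the original code:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <cuda_runtime.h>

int main() {
	const int blocks = 2;
	const int threads = 512;
	const int n = blocks * threads;
	const s64Int samples = 1000000;	// samples per thread (assumed value)

	s64Int* hIn = (s64Int*) malloc(n * sizeof(s64Int));
	s64Int* hOut = (s64Int*) malloc(n * sizeof(s64Int));
	s64Int* dIn;
	s64Int* dOut;
	cudaMalloc(&dIn, n * sizeof(s64Int));
	cudaMalloc(&dOut, n * sizeof(s64Int));

	// time only the kernel, as in the measurements below
	struct timeval t0, t1;
	gettimeofday(&t0, NULL);
	pikernel<<<blocks, threads>>>(samples, dIn, dOut);
	cudaDeviceSynchronize();
	gettimeofday(&t1, NULL);

	cudaMemcpy(hIn, dIn, n * sizeof(s64Int), cudaMemcpyDeviceToHost);
	cudaMemcpy(hOut, dOut, n * sizeof(s64Int), cudaMemcpyDeviceToHost);

	// points are uniform over [-1,1] x [-1,1]; the fraction falling inside
	// the unit circle approaches pi/4, so pi ~= 4 * inside / total
	s64Int in = 0, total = 0;
	for (int i = 0; i < n; i++) {
		in += hIn[i];
		total += hIn[i] + hOut[i];
	}

	printf("Using %d blocks and %d threads per block\n", blocks, threads);
	printf("Estimated value of Pi is %f\n", 4.0 * (double) in / (double) total);
	printf("Estimation took %ld usec\n",
		(t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec));

	cudaFree(dIn);
	cudaFree(dOut);
	free(hIn);
	free(hOut);
	return 0;
}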

Surprisingly, running it on the Jetson TX1 takes longer than on the Jetson TK1. When I increase the GPU frequency using the method described at https://devtalk.nvidia.com/default/topic/952478/maximize-tx1-performance/, I get better results, but still worse than on the TK1. Moreover, profiling shows that the TX1's GPU IPC is much lower than the TK1's, independent of the frequency. A sketch of the frequency override is below, followed by a summary of the results.
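
For reference, the override from that thread is essentially a pair of debugfs writes. The paths below are from memory of L4T R24.x and may differ on other releases, so treat this as a sketch (run as root):

cat /sys/kernel/debug/clock/gbus/possible_rates        # list supported GPU rates
echo 998400000 > /sys/kernel/debug/clock/override.gbus/rate
echo 1 > /sys/kernel/debug/clock/override.gbus/state   # enable the override
cat /sys/kernel/debug/clock/gbus/rate                  # confirm the new rate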

Jetson TX1 System
Device “NVIDIA Tegra X1 (0)”
CUDA toolkit 8.0
g++ 5.4.0
Kernel: pikernel(long, long*, long*)

Output using default GPU frequency
Using 2 blocks and 512 threads per block
Estimated value of Pi is 3.141437
Estimation took 2,909,335 usec

Output using GPU frequency of 998400000 Hz
Using 2 blocks and 512 threads per block
Estimated value of Pi is 3.141437
Estimation took 1,016,859 usec

nvprof summary
Invocations  Metric Name                Metric Description            Min       Max       Avg
          1  ipc                        Executed IPC                  0.435372  0.435372  0.435372
          1  flop_sp_efficiency         FLOP Efficiency(Peak Single)  0.61%     0.61%     0.61%
          1  sm_efficiency              Multiprocessor Activity       99.94%    99.94%    99.94%
          1  warp_execution_efficiency  Warp Execution Efficiency     95.42%    95.42%    95.42%

Jetson TK1 System
Device “GK20A (0)”
CUDA toolkit 6.5
g++ 4.8.4
Kernel: pikernel(__int64, __int64*, __int64*)

Output
Using 2 blocks and 512 threads per block
Estimated value of Pi is 3.141437
Estimation took 915,394 usec

nvprof summary
Invocations  Metric Name                Metric Description            Min       Max       Avg
          1  ipc                        Executed IPC                  1.948358  1.948358  1.948358
          1  flop_sp_efficiency         FLOP Efficiency(Peak Single)  2.18%     2.18%     2.18%
          1  sm_efficiency              Multiprocessor Activity       12.10%    12.10%    12.10%
          1  warp_execution_efficiency  Warp Execution Efficiency     98.02%    98.02%    98.02%
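
For anyone reproducing these numbers, both summaries can be collected with nvprof's --metrics option, along these lines (the binary name is a placeholder):

nvprof --metrics ipc,flop_sp_efficiency,sm_efficiency,warp_execution_efficiency ./pi-gpu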

Does anyone have any idea why the IPC is lower on TX1?

Hi,

Sorry for the late reply.
We are investigating this issue and will update you when we know more.

Hi,

Sorry to keep you waiting. Could you provide some more information to help us debug?

  1. How do you compile the CUDA code? Natively on the TX1 and TK1 boards with the 8.0 and 6.5 toolkits respectively, or by cross-compiling on an x86_64 host?

  2. Could you attach the compiled executables and the ComputeCache generated on each setup? (If needed, we can provide the steps.)
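
For reference, the JIT cache normally lives under ~/.nv/ComputeCache (its location is controlled by the CUDA_CACHE_PATH environment variable), so collecting it should be roughly:

tar czf compute-cache.tgz ~/.nv/ComputeCache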

Thanks a lot and sorry for our late reply.

Hi,

Sorry for the late reply; I only just got time to work on this again.

  1. I am compiling the code natively, on the TK1 and TX1 themselves (representative compile lines are sketched after the links below).

  2. I have uploaded the executables and the ComputeCache contents to GitHub:

TK1 executable:
https://github.com/dloghin/snippets/blob/master/CUDA/pi/debug/MaxPerf/JetsonTK1/pi-gpu-tk1

TK1 ComputeCache:
https://github.com/dloghin/snippets/blob/master/CUDA/pi/debug/MaxPerf/JetsonTK1/compute-cache-tk1

TX1 executable:
https://github.com/dloghin/snippets/blob/master/CUDA/pi/debug/MaxPerf/JetsonTX1/pi-gpu-tx1

TX1 ComputeCache:
https://github.com/dloghin/snippets/blob/master/CUDA/pi/debug/MaxPerf/JetsonTX1/compute-cache-tx1
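
The exact compile lines weren't anything special; a representative native build would be something like the following (the source file name and -O2 flag are illustrative):

# on the TX1 (CUDA 8.0); optionally add -arch=sm_53 to target Maxwell directly
nvcc -O2 -o pi-gpu-tx1 pi.cu
# on the TK1 (CUDA 6.5); optionally add -arch=sm_32 to target Kepler directly
nvcc -O2 -o pi-gpu-tk1 pi.cu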

I appreciate your help!

I am wondering if the TX1 is running the most recent L4T. You can check the version with:

head -n 1 /etc/nv_tegra_release

It’s running L4T R24.2.1:

# R24 (release), REVISION: 2.1, GCID: 8028265, BOARD: t210ref, EABI: aarch64, DATE: Thu Nov 10 03:51:59 UTC 2016

Thanks, dloghin.

We will check this information and update you later.

Hi, dloghin

Sorry for the long delay.

This issue is related to our GPU driver code and was recently fixed.
To verify the fix, would you please rebuild the executable or share the whole sample with us?

Thanks.