Hi,

I have this code for estimating pi using the Monte Carlo method:

```
#include <stdint.h>

typedef float sFloat;
typedef int64_t s64Int;
#define MAX64 0x7FFFFFFFFFFFFFFF  // INT64_MAX

// xorshift64* pseudo-random generator
__host__ __device__ s64Int xorshift64star(s64Int x) {
    x ^= x >> 12; // a
    x ^= x << 25; // b
    x ^= x >> 27; // c
    return x * UINT64_C(2685821657736338717); // xorshift64* multiplier
}

__global__ void pikernel(s64Int samples, s64Int* nIn, s64Int* nOut) {
    s64Int i;
    sFloat xx, yy;
    s64Int ni = 0; // points inside the circle
    s64Int no = 0; // points outside the circle
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // seed the random generator (non-zero, different per thread)
    s64Int offset = tid * samples;
    s64Int x = 12345678 + offset * 89482311;
    s64Int y = 87654321 + offset + 12345678;
    // generate points and count hits
    for (i = 0; i < samples; i++) {
        x = xorshift64star(x);
        y = xorshift64star(y);
        xx = ((sFloat)x) / (sFloat)MAX64;
        yy = ((sFloat)y) / (sFloat)MAX64;
        if (xx * xx + yy * yy > 1.0)
            no++;
        else
            ni++;
    }
    // write per-thread counters back
    nIn[tid] = ni;
    nOut[tid] = no;
}
```

Surprisingly, running it on a Jetson TX1 takes longer than running it on a Jetson TK1. When I raise the GPU frequency using the method described here https://devtalk.nvidia.com/default/topic/952478/maximize-tx1-performance/, the time improves, but it is still worse than the TK1's. Moreover, profiling shows that the TX1 GPU's executed IPC is much lower than the TK1 GPU's, independent of the frequency. Here is a summary of the results:

**Jetson TX1 System**

Device “NVIDIA Tegra X1 (0)”

CUDA toolkit 8.0

g++ 5.4.0

Kernel: pikernel(long, long*, long*)

**Output** using default GPU frequency

Using 2 blocks and 512 threads per block

Estimated value of Pi is 3.141437

Estimation took **2,909,335** usec

**Output** using GPU frequency of 998400000 Hz

Using 2 blocks and 512 threads per block

Estimated value of Pi is 3.141437

Estimation took **1,016,859** usec

**nvprof summary**

| Invocations | Metric Name | Metric Description | Min | Max | Avg |
|---|---|---|---|---|---|
| 1 | **ipc** | Executed IPC | 0.435372 | 0.435372 | **0.435372** |
| 1 | flop_sp_efficiency | FLOP Efficiency (Peak Single) | 0.61% | 0.61% | 0.61% |
| 1 | sm_efficiency | Multiprocessor Activity | 99.94% | 99.94% | 99.94% |
| 1 | warp_execution_efficiency | Warp Execution Efficiency | 95.42% | 95.42% | 95.42% |

**Jetson TK1 System**

Device “GK20A (0)”

CUDA toolkit 6.5

g++ 4.8.4

Kernel: pikernel(__int64, __int64*, __int64*)

**Output**

Using 2 blocks and 512 threads per block

Estimated value of Pi is 3.141437

Estimation took **915,394** usec

**nvprof summary**

| Invocations | Metric Name | Metric Description | Min | Max | Avg |
|---|---|---|---|---|---|
| 1 | **ipc** | Executed IPC | 1.948358 | 1.948358 | **1.948358** |
| 1 | flop_sp_efficiency | FLOP Efficiency (Peak Single) | 2.18% | 2.18% | 2.18% |
| 1 | sm_efficiency | Multiprocessor Activity | 12.10% | 12.10% | 12.10% |
| 1 | warp_execution_efficiency | Warp Execution Efficiency | 98.02% | 98.02% | 98.02% |

Does anyone have any idea why the IPC is so much lower on the TX1?