Hi,

I have this code for estimating pi using the Monte Carlo method:

```
#include <stdint.h>

typedef float sFloat;
typedef int64_t s64Int;
#define MAX64 0x7FFFFFFFFFFFFFFF  // INT64_MAX

// xorshift64* pseudo-random generator
__host__ __device__ s64Int xorshift64star(s64Int x) {
    x ^= x >> 12; // a
    x ^= x << 25; // b
    x ^= x >> 27; // c
    return x * UINT64_C(2685821657736338717); // xorshift64* multiplier
}

__global__ void pikernel(s64Int samples, s64Int* nIn, s64Int* nOut) {
    s64Int i;
    sFloat xx, yy;
    s64Int ni = 0; // points inside the circle
    s64Int no = 0; // points outside the circle
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // seed the random generator (non-zero, different per thread)
    s64Int offset = tid * samples;
    s64Int x = 12345678 + offset * 89482311;
    s64Int y = 87654321 + offset + 12345678;
    // generate points and count hits
    for (i = 0; i < samples; i++) {
        x = xorshift64star(x);
        y = xorshift64star(y);
        xx = ((sFloat)x) / (sFloat)MAX64;
        yy = ((sFloat)y) / (sFloat)MAX64;
        if (xx * xx + yy * yy > 1.0)
            no++;
        else
            ni++;
    }
    // write per-thread counters back
    nIn[tid] = ni;
    nOut[tid] = no;
}
```

Surprisingly, running it on a Jetson TX1 takes longer than running it on a Jetson TK1. When I raise the GPU frequency using the method described here https://devtalk.nvidia.com/default/topic/952478/maximize-tx1-performance/, the time improves, but it is still worse than the TK1's. Moreover, profiling shows that the TX1 GPU's executed IPC is much lower than the TK1 GPU's, independent of the frequency. Here is a summary of the results:

**Jetson TX1 System**

Device “NVIDIA Tegra X1 (0)”

CUDA toolkit 8.0

g++ 5.4.0

Kernel: pikernel(long, long*, long*)

**Output** using default GPU frequency

Using 2 blocks and 512 threads per block

Estimated value of Pi is 3.141437

Estimation took **2,909,335** usec

**Output** using GPU frequency of 998400000 Hz

Using 2 blocks and 512 threads per block

Estimated value of Pi is 3.141437

Estimation took **1,016,859** usec

**nvprof summary**

| Invocations | Metric Name | Metric Description | Min | Max | Avg |
|---|---|---|---|---|---|
| 1 | **ipc** | Executed IPC | 0.435372 | 0.435372 | **0.435372** |
| 1 | flop_sp_efficiency | FLOP Efficiency (Peak Single) | 0.61% | 0.61% | 0.61% |
| 1 | sm_efficiency | Multiprocessor Activity | 99.94% | 99.94% | 99.94% |
| 1 | warp_execution_efficiency | Warp Execution Efficiency | 95.42% | 95.42% | 95.42% |

**Jetson TK1 System**

Device “GK20A (0)”

CUDA toolkit 6.5

g++ 4.8.4

Kernel: pikernel(__int64, __int64*, __int64*)

**Output**

Using 2 blocks and 512 threads per block

Estimated value of Pi is 3.141437

Estimation took **915,394** usec

**nvprof summary**

| Invocations | Metric Name | Metric Description | Min | Max | Avg |
|---|---|---|---|---|---|
| 1 | **ipc** | Executed IPC | 1.948358 | 1.948358 | **1.948358** |
| 1 | flop_sp_efficiency | FLOP Efficiency (Peak Single) | 2.18% | 2.18% | 2.18% |
| 1 | sm_efficiency | Multiprocessor Activity | 12.10% | 12.10% | 12.10% |
| 1 | warp_execution_efficiency | Warp Execution Efficiency | 98.02% | 98.02% | 98.02% |

Does anyone have any idea why the IPC is so much lower on the TX1?