Has anyone run benchmarks on the TX1? I got a glmark2 score of 818 on my Shield TV.
simpleMultiCopy performed worse than on the TK1:
[simpleMultiCopy] - Starting...
Using CUDA device [0]: GM20B
[GM20B] has 2 MP(s) x 128 (Cores/MP) = 256 (Cores)
Device name: GM20B
CUDA Capability 5.3 hardware with 2 multi-processors
scale_factor = 1.00
array_size = 4194304
Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
(X) Can overlap two CPU<>GPU data transfers with GPU kernel execution
(Compute Capability >= 2.0 AND (Tesla product OR Quadro 4000/5000/6000/K5000)
Measured timings (throughput):
Memcpy host to device : 15.620518 ms (1.074050 GB/s)
Memcpy device to host : 3.952524 ms (4.244684 GB/s)
Kernel : 5.953629 ms (28.179814 GB/s)
Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 25.526670 ms
Compute can overlap with one transfer: 19.573042 ms
Compute can overlap with both data transfers: 15.620518 ms
Average measured timings over 10 repetitions:
Avg. time when execution fully serialized : 9.440632 ms
Avg. time when overlapped using 4 streams : 5.101471 ms
Avg. speedup gained (serialized - overlapped) : 4.339161 ms
Measured throughput:
Fully serialized execution : 3.554257 GB/s
Overlapped using 4 streams : 6.577403 GB/s
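
For anyone who wants to sanity-check the figures above, here is a minimal sketch that recomputes the throughput and the "theoretical limits" lines from the measured timings. It assumes 4-byte array elements and 1 GB = 1e9 bytes; neither is stated in the log, but both are consistent with the printed numbers. The max-based formulas are my reading of how the limits are derived, not taken from the sample's source.

```python
# Recompute the simpleMultiCopy figures from the timings in the log above.
# Assumptions (not stated in the log): 4-byte elements, 1 GB = 1e9 bytes.
ARRAY_SIZE = 4194304
NBYTES = ARRAY_SIZE * 4  # bytes moved per one-way transfer

# Timings copied from the log, in milliseconds.
h2d_ms = 15.620518
d2h_ms = 3.952524
kernel_ms = 5.953629
serialized_ms = 9.440632
overlapped_ms = 5.101471


def gb_per_s(nbytes, ms):
    """Throughput in GB/s for nbytes moved in ms milliseconds."""
    return nbytes / (ms * 1e-3) / 1e9


h2d_gbs = gb_per_s(NBYTES, h2d_ms)
d2h_gbs = gb_per_s(NBYTES, d2h_ms)

# Assumed formulas for the theoretical limits.
no_overlap_ms = h2d_ms + d2h_ms + kernel_ms          # everything back to back
one_overlap_ms = max(h2d_ms + d2h_ms, kernel_ms)     # kernel hides behind transfers
both_overlap_ms = max(h2d_ms, d2h_ms, kernel_ms)     # slowest stage dominates

# The "measured throughput" lines count both directions of transfer.
serialized_gbs = gb_per_s(2 * NBYTES, serialized_ms)
overlapped_gbs = gb_per_s(2 * NBYTES, overlapped_ms)

print(f"H2D: {h2d_gbs:.6f} GB/s, D2H: {d2h_gbs:.6f} GB/s")
print(f"limits: {no_overlap_ms:.6f} / {one_overlap_ms:.6f} / {both_overlap_ms:.6f} ms")
print(f"serialized: {serialized_gbs:.6f} GB/s, overlapped: {overlapped_gbs:.6f} GB/s")
```

One thing this makes obvious: the average fully serialized run (9.440632 ms) is shorter than the single host-to-device copy timed at the start (15.620518 ms). That first copy may include some one-time overhead, so the 1.07 GB/s host-to-device figure is possibly not representative of steady-state bandwidth.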