Hi guys,
I have a quick question:
Is there a max-performance tuning script for the TK1, like the one for the TX1 linked below?
Cuda 7.0 Jetson TX1 performance and benchmarks
https://devtalk.nvidia.com/default/topic/901337/post/4747186/#4747186
I ask because my TK1 board does not seem to be running at maximum performance yet, even after setting the GPU frequency to 852000 kHz. This is the output of the simpleMultiCopy sample:
sudo ./simpleMultiCopy
[simpleMultiCopy] - Starting...
modprobe: FATAL: Module nvidia not found.
> Using CUDA device [0]: GK20A
[GK20A] has 1 MP(s) x 192 (Cores/MP) = 192 (Cores)
> Device name: GK20A
> CUDA Capability 3.2 hardware with 1 multi-processors
> scale_factor = 1.00
> array_size = 4194304
Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
( ) Can overlap two CPU<>GPU data transfers with GPU kernel execution
(Compute Capability >= 2.0 AND (Tesla product OR Quadro 4000/5000/6000/K5000)
Measured timings (throughput):
Memcpy host to device : 2.529416 ms (6.632842 GB/s)
Memcpy device to host : 2.591583 ms (6.473733 GB/s)
Kernel : 4.475915 ms (37.483322 GB/s)
Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 9.596914 ms
Compute can overlap with one transfer: 5.120999 ms
Compute can overlap with both data transfers: 4.475915 ms
Average measured timings over 10 repetitions:
Avg. time when execution fully serialized : 9.785031 ms
Avg. time when overlapped using 4 streams : 8.434423 ms
Avg. speedup gained (serialized - overlapped) : 1.350608 ms
Measured throughput:
Fully serialized execution : 3.429159 GB/s
Overlapped using 4 streams : 3.978272 GB/s
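For reference, the figures in the log above can be cross-checked with a little arithmetic. This is only a minimal sketch: it assumes each array element is 4 bytes, that "GB/s" means 1e9 bytes per second, and that the serialized/overlapped throughput counts one copy in plus one copy out per repetition. All three assumptions are inferred from the printed numbers, not taken from the sample's source, and the helper name gb_per_s is made up for illustration.

```python
# Cross-check of the simpleMultiCopy figures quoted in the log above.
# Assumptions (inferred from the printed numbers, not from the sample source):
#   - each array element is 4 bytes
#   - "GB/s" means 1e9 bytes per second
#   - serialized/overlapped throughput counts 2 * n_bytes per repetition

ARRAY_SIZE = 4194304          # "array_size" from the log
BYTES_PER_ELEMENT = 4         # assumption
n_bytes = ARRAY_SIZE * BYTES_PER_ELEMENT

# Measured timings from the log, in milliseconds
h2d_ms = 2.529416
d2h_ms = 2.591583
kernel_ms = 4.475915
serialized_ms = 9.785031
overlapped_ms = 8.434423


def gb_per_s(num_bytes, ms):
    """Throughput in GB/s (1e9 bytes/s) for num_bytes moved in ms milliseconds."""
    return num_bytes / (ms * 1e-3) / 1e9


# Theoretical limits, matching the three lines printed by the sample
no_overlap_ms = h2d_ms + d2h_ms + kernel_ms
one_overlap_ms = max(h2d_ms + d2h_ms, kernel_ms)
both_overlap_ms = max(h2d_ms, d2h_ms, kernel_ms)

print("H2D throughput        : %.6f GB/s" % gb_per_s(n_bytes, h2d_ms))
print("D2H throughput        : %.6f GB/s" % gb_per_s(n_bytes, d2h_ms))
print("No overlap            : %.6f ms" % no_overlap_ms)
print("Overlap one transfer  : %.6f ms" % one_overlap_ms)
print("Overlap both transfers: %.6f ms" % both_overlap_ms)
print("Serialized throughput : %.6f GB/s" % gb_per_s(2 * n_bytes, serialized_ms))
print("Overlapped throughput : %.6f GB/s" % gb_per_s(2 * n_bytes, overlapped_ms))
print("Measured speedup      : %.6f ms" % (serialized_ms - overlapped_ms))
```

So the log is internally consistent: the 4-stream run at 8.434423 ms lands much closer to the "no overlap" bound than to the one-transfer-overlap bound of 5.120999 ms.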
Thanks in advance,
-zhi