Is there a max-performance tuning script for the TK1?

Hi guys,

I have a quick question:
Is there a max-performance tuning script for the TK1, like the one for the TX1 linked below?

Cuda 7.0 Jetson TX1 performance and benchmarks
https://devtalk.nvidia.com/default/topic/901337/post/4747186/#4747186

I ask because my TK1 board does not seem to be running at maximum performance yet, even after setting the GPU frequency to 852000 kHz. Here is the output of the simpleMultiCopy sample:

sudo ./simpleMultiCopy
[simpleMultiCopy] - Starting...
modprobe: FATAL: Module nvidia not found.
> Using CUDA device [0]: GK20A
[GK20A] has 1 MP(s) x 192 (Cores/MP) = 192 (Cores)
> Device name: GK20A
> CUDA Capability 3.2 hardware with 1 multi-processors
> scale_factor = 1.00
> array_size   = 4194304


Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
( ) Can overlap two CPU<>GPU data transfers with GPU kernel execution
    (Compute Capability >= 2.0 AND (Tesla product OR Quadro 4000/5000/6000/K5000)

Measured timings (throughput):
 Memcpy host to device  : 2.529416 ms (6.632842 GB/s)
 Memcpy device to host  : 2.591583 ms (6.473733 GB/s)
 Kernel                 : 4.475915 ms (37.483322 GB/s)

Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 9.596914 ms 
Compute can overlap with one transfer: 5.120999 ms
Compute can overlap with both data transfers: 4.475915 ms

Average measured timings over 10 repetitions:
 Avg. time when execution fully serialized      : 9.785031 ms
 Avg. time when overlapped using 4 streams      : 8.434423 ms
 Avg. speedup gained (serialized - overlapped)  : 1.350608 ms

Measured throughput:
 Fully serialized execution             : 3.429159 GB/s
 Overlapped using 4 streams             : 3.978272 GB/s

Thanks in advance,
-zhi

Yes. See http://elinux.org/Jetson/Performance
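For convenience, here is a rough sketch of what such a script could look like on a TK1. The sysfs/debugfs paths and the 852 MHz (GPU) and 924 MHz (EMC) rates are assumptions recalled from that wiki page for L4T R21-era kernels, so please check them against the page and your own board before relying on them. The `SYSROOT` and `APPLY` variables and the `set_node` / `max_perf` names are made up for this example. Note that it also raises the memory (EMC) clock, since memcpy throughput depends on more than the GPU clock.

```sh
#!/bin/sh
# Sketch: pin a Jetson TK1 to maximum clocks until the next reboot.
# Paths and rates are ASSUMPTIONS based on the eLinux Jetson/Performance page;
# verify them on your own L4T release.
# By default this is a dry run. Run as root with APPLY=1 to really write.

SYSROOT="${SYSROOT:-}"   # hypothetical path prefix, only useful for testing
APPLY="${APPLY:-0}"      # hypothetical switch: 1 = write, anything else = dry run

# Write one value to one sysfs/debugfs node, or just report it in dry-run mode.
set_node() {
    node="$SYSROOT$1"
    value="$2"
    if [ "$APPLY" != "1" ]; then
        echo "would write $value to $node"
    elif [ -w "$node" ]; then
        echo "$value" > "$node"
    else
        echo "skipping $node (missing or not writable)" >&2
    fi
}

max_perf() {
    # CPU: stop cpuquiet from unplugging cores, bring all four cores online,
    # then select the "performance" governor.
    set_node /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable 0
    for cpu in 0 1 2 3; do
        set_node "/sys/devices/system/cpu/cpu$cpu/online" 1
    done
    set_node /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor performance

    # GPU: override the gbus clock to 852 MHz (rate is given in Hz).
    set_node /sys/kernel/debug/clock/override.gbus/rate 852000000
    set_node /sys/kernel/debug/clock/override.gbus/state 1

    # Memory: override the EMC clock to 924 MHz (rate is given in Hz).
    set_node /sys/kernel/debug/clock/override.emc/rate 924000000
    set_node /sys/kernel/debug/clock/override.emc/state 1
}

max_perf
```

Saved as, say, maxperf.sh (the name is arbitrary), running `sh maxperf.sh` only prints the planned writes, and `sudo APPLY=1 sh maxperf.sh` applies them. The settings should not survive a reboot, so rerun it before benchmarking.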