Jetson TK1 performance

This is what I found from running some benchmarks on the Jetson TK1.

Peak compute (single): 308 Gflop/s (out of 327 Gflop/s peak)
Peak bandwidth: 13 GB/s (out of ??? GB/s)
L2 cache bandwidth: 32.75 GB/s (out of ??? GB/s)
Shared memory bandwidth: 107.6 GB/s (out of 218 GB/s)
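For context, the peak figures above can be back-computed from the GK20A's specs. The sketch below assumes 192 CUDA cores at 852 MHz and 32 shared memory banks delivering 8 bytes per cycle (Kepler's 8-byte bank mode) — parameters worth double-checking, not gospel:

```cpp
// Peak single-precision compute: one FMA (2 flops) per core per cycle.
double peak_gflops(int cuda_cores, double clock_ghz) {
    return cuda_cores * 2 * clock_ghz;
}

// Peak shared memory bandwidth: banks * bytes per bank per cycle * clock.
double peak_shared_gbs(int banks, int bytes_per_bank, double clock_ghz) {
    return banks * bytes_per_bank * clock_ghz;
}

// peak_gflops(192, 0.852)       -> ~327.2 Gflop/s
// peak_shared_gbs(32, 8, 0.852) -> ~218.1 GB/s
```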

The benchmark I used comes from http://hpcgarage.org/archline,
which was designed for energy-related measurements, so
the L2 and shared memory benchmarks use a pointer-chasing array to minimize integer computation.
I’m not sure whether this affects the performance, but these are the numbers I got.
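For anyone unfamiliar with the technique, here is a minimal host-side sketch of how a pointer-chasing array is typically built — my own illustration, not the archline code:

```cpp
#include <vector>
#include <numeric>
#include <algorithm>
#include <random>

// Build a pointer-chasing index array: a[i] holds the index of the next
// element to visit, forming one random cycle over all n slots. Each load
// depends on the previous one, so no address arithmetic can be overlapped
// with the memory access.
std::vector<int> make_chase(int n, unsigned seed = 42) {
    std::vector<int> perm(n);
    std::iota(perm.begin(), perm.end(), 0);
    std::shuffle(perm.begin() + 1, perm.end(), std::mt19937(seed)); // keep 0 first
    std::vector<int> a(n);
    for (int i = 0; i < n; ++i)
        a[perm[i]] = perm[(i + 1) % n]; // link perm[i] -> perm[i+1]
    return a;
}

// The timed loop is then just: for (k = 0; k < iters; ++k) idx = a[idx];
```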

Anybody have any idea what the peaks for the DRAM and L2 cache are?
Also, if somebody has a benchmark that can achieve higher performance, I’d appreciate a link to it so I can see what they do better.

Thank you.

It would be interesting to see how well nvprof metrics match up with these benchmarks.

What would I be looking at precisely?

Are you suggesting that I compute (# of shared loads / GPU time) and compare it against my performance numbers?

Thank you.

Here are the results for the shared memory test benchmark:
I create 4M threads @ 256 threads/block, and each thread accesses shared memory 512 times.
The GPU clock is fixed at 852 MHz.
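As a sanity check on the reported figures below, the nominal traffic works out as follows (assuming 4-byte loads, per my kernel):

```cpp
// Nominal bytes moved: threads * loads-per-thread * bytes-per-load.
long long bytes_moved(long long threads, long long loads, long long bytes_each) {
    return threads * loads * bytes_each;
}

// Effective bandwidth in GB/s from bytes moved and elapsed milliseconds.
double effective_gbs(long long bytes, double ms) {
    return bytes / 1e9 / (ms / 1e3);
}

// bytes_moved(4194304, 512, 4)     -> 8589934592 (~8.59 GB; close to the
//                                     8.60671 GB the program reports)
// effective_gbs(8589934592, 80.33) -> ~106.9 GB/s, in line with ~107 GB/s
```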

==8881== NVPROF is profiling process 8881, command: ./smtest 512 4194304 256
number of iterations is 10
Results validated: 0 errors
Time taken to load 8.60671 GBs: 80.332932 (ms)
Effective bandwidth: 107.138 (GB/s)
==8881== Profiling application: ./smtest 512 4194304 256
==8881== Profiling result:
==8881== Event result:
Invocations Event Name Min Max Avg
Device “GK20A (0)”
Kernel: clear_cache(int*)
1 elapsed_cycles_sm 565937 565937 565937
Kernel: cache_kernel_512(int, int*, int*)
10 elapsed_cycles_sm 68134270 68139746 68135999

==8859== NVPROF is profiling process 8859, command: ./smtest 512 4194304 256
number of iterations is 10
Results validated: 0 errors
Time taken to load 8.60671 GBs: 81.245384 (ms)
Effective bandwidth: 105.935 (GB/s)
==8859== Profiling application: ./smtest 512 4194304 256
==8859== Profiling result:
==8859== Event result:
Invocations Event Name Min Max Avg
Device “GK20A (0)”
Kernel: clear_cache(int*)
1 shared_load 0 0 0
Kernel: cache_kernel_512(int, int*, int*)
10 shared_load 67108864 67108864 67108864

Something like "nvprof --metrics all" will print all TK1 metrics. It might take a while to run, though. You can list all possible metrics with "nvprof --query-metrics".

You could start out by capturing all the throughput numbers:

nvprof --metrics gld_requested_throughput,gst_requested_throughput,tex_cache_throughput,gst_throughput,gld_throughput,shared_efficiency,gld_efficiency,gst_efficiency,nc_gld_requested_throughput,local_load_throughput,local_store_throughput,shared_load_throughput,shared_store_throughput,nc_l2_read_throughput,nc_gld_throughput,nc_gld_efficiency,l2_texture_read_throughput,l2_read_throughput,l2_write_throughput,l2_atomic_throughput <benchmark>

Actually, I just realized that I’m accessing 4-byte data (int) from shared memory.
This might explain why my bandwidth is ~50% of the peak.
I based my benchmark on an older (Fermi) GPU and forgot to update it to 8-byte data.

Otherwise, each warp accesses shared memory at a rate of 1 access per cycle, which is what the hardware is capable of.
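Assuming the Kepler figures of 32 banks x 8 bytes per cycle at 852 MHz, the 4-byte ceiling works out to roughly half the 8-byte peak, which lines up with my measurement:

```cpp
// Shared memory ceiling for a given per-bank access width, in GB/s.
double shared_ceiling_gbs(int banks, int bytes_per_access, double clock_ghz) {
    return banks * bytes_per_access * clock_ghz;
}

// shared_ceiling_gbs(32, 8, 0.852) -> ~218.1 GB/s (8-byte accesses)
// shared_ceiling_gbs(32, 4, 0.852) -> ~109.1 GB/s (4-byte ints)
// 107.6 GB/s measured / ~109.1 GB/s -> ~98.6% of the 4-byte ceiling
```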

I should figure out how to create a pointer-chasing benchmark based on 64-bit integers, except I’m concerned that this will cause the compiler to insert conversion instructions.

Anyone have any idea how to create a pointer-chasing array based on long int (8-byte data)?
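One possible approach (an untested sketch on my part): make both the array elements and the chase index 64-bit, so the load result feeds directly into the next index with no int/long conversions:

```cpp
#include <vector>
#include <cstdint>

// Pointer-chasing array over 8-byte integers: a[i] holds the next index.
// Keeping the chase variable int64_t as well means the compiled loop is
// just a 64-bit load feeding the next address, with no width conversions.
std::vector<int64_t> make_chase64(int64_t n) {
    std::vector<int64_t> a(n);
    for (int64_t i = 0; i < n; ++i)
        a[i] = (i + 1) % n;  // simple ring; permute for a random-order chase
    return a;
}

// Timed loop: int64_t idx = 0; for (k = 0; k < iters; ++k) idx = a[idx];
```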

Thank you.

There is no 8-byte shared memory bank mode on sm_20 or sm_50, right? It was only implemented on sm_30/sm_35. I’m not sure about sm_32, but I assume they dropped the feature.

Here are the numbers for the shared memory benchmark.

==8907== NVPROF is profiling process 8907, command: ./smtest 512 4194304 256
==8907== Warning: Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
number of iterations is 10
Results validated: 0 errors
Time taken to load 8.60671 GBs: 794.819458 (ms)
Effective bandwidth: 10.8285 (GB/s)
==8907== Profiling application: ./smtest 512 4194304 256
==8907== Profiling result:
==8907== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device “GK20A (0)”
Kernel: clear_cache(int*)
1 gld_requested_throughput Requested Global Load Throughput 6.3316GB/s 6.3316GB/s 6.3316GB/s
1 gst_requested_throughput Requested Global Store Throughput 6.3316GB/s 6.3316GB/s 6.3316GB/s
1 tex_cache_throughput Texture Cache Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 gst_throughput Global Store Throughput 6.3316GB/s 6.3316GB/s 6.3316GB/s
1 gld_throughput Global Load Throughput 6.3316GB/s 6.3316GB/s 6.3316GB/s
1 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
1 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
1 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
1 nc_gld_requested_throughput Requested Non-Coherent Global Load Throu 0.00000B/s 0.00000B/s 0.00000B/s
1 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 nc_l2_read_throughput L2 Throughput (Non-Coherent Reads) 0.00000B/s 0.00000B/s 0.00000B/s
1 nc_gld_throughput Non-Coherent Global Memory Load Throughp 0.00000B/s 0.00000B/s 0.00000B/s
1 nc_gld_efficiency Non-Coherent Global Load Efficiency 0.00% 0.00% 0.00%
1 l2_texture_read_throughput L2 Throughput (Texture Reads) 0.00000B/s 0.00000B/s 0.00000B/s
1 l2_read_throughput L2 Throughput (Reads) 6.3339GB/s 6.3339GB/s 6.3339GB/s
1 l2_write_throughput L2 Throughput (Writes) 6.3319GB/s 6.3319GB/s 6.3319GB/s
1 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
Kernel: cache_kernel_512(int, int*, int*)
10 gld_requested_throughput Requested Global Load Throughput 209.78MB/s 209.80MB/s 209.79MB/s
10 gst_requested_throughput Requested Global Store Throughput 209.78MB/s 209.80MB/s 209.79MB/s
10 tex_cache_throughput Texture Cache Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 gst_throughput Global Store Throughput 209.78MB/s 209.80MB/s 209.79MB/s
10 gld_throughput Global Load Throughput 209.78MB/s 209.80MB/s 209.79MB/s
10 shared_efficiency Shared Memory Efficiency 100.00% 100.00% 100.00%
10 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
10 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
10 nc_gld_requested_throughput Requested Non-Coherent Global Load Throu 0.00000B/s 0.00000B/s 0.00000B/s
10 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 shared_load_throughput Shared Memory Load Throughput 214.82GB/s 214.83GB/s 214.82GB/s
10 shared_store_throughput Shared Memory Store Throughput 419.56MB/s 419.59MB/s 419.58MB/s
10 nc_l2_read_throughput L2 Throughput (Non-Coherent Reads) 0.00000B/s 0.00000B/s 0.00000B/s
10 nc_gld_throughput Non-Coherent Global Memory Load Throughp 0.00000B/s 0.00000B/s 0.00000B/s
10 nc_gld_efficiency Non-Coherent Global Load Efficiency 0.00% 0.00% 0.00%
10 l2_texture_read_throughput L2 Throughput (Texture Reads) 0.00000B/s 0.00000B/s 0.00000B/s
10 l2_read_throughput L2 Throughput (Reads) 209.94MB/s 209.96MB/s 209.95MB/s
10 l2_write_throughput L2 Throughput (Writes) 209.78MB/s 209.80MB/s 209.79MB/s
10 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s

I’m not sure, but if that’s true then I guess the peak shared memory throughput on the Jetson TK1 is 128 bytes (32 × 4 bytes) per cycle, which means I’m getting ~98.7% of the peak shared memory throughput.

Interesting, it looks like nvprof depresses the throughput numbers somewhat. Err, the shared memory throughput numbers look good!

What do you mean?

Actually, this number is HIGHER than what I measure… hmmm do you have any idea why?

214.82GB/s (nvprof) vs. 107 GB/s (measured).

Thanks.

Maybe nvprof is loading 64 bits and discarding 32 bits, but counts all 64 bits towards this throughput measurement…
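That hypothesis is easy to check against the counters: shared_load reported 67,108,864 warp-level load instructions per invocation (4194304 threads / 32 per warp × 512 loads). If nvprof charges each instruction 32 threads × 8 bytes, its number falls out; at 4 bytes per thread, my measurement falls out instead:

```cpp
// Bandwidth implied by the warp-level shared_load count, assuming each
// instruction moves 32 threads x bytes_per_thread bytes.
double implied_gbs(long long warp_loads, int bytes_per_thread, double ms) {
    return warp_loads * 32.0 * bytes_per_thread / 1e9 / (ms / 1e3);
}

// implied_gbs(67108864, 8, 80.33) -> ~213.9 GB/s (close to nvprof's 214.82)
// implied_gbs(67108864, 4, 80.33) -> ~106.9 GB/s (close to my measured 107)
```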

Here are the numbers for the cache test benchmark:

==8944== NVPROF is profiling process 8944, command: ./cachetest 4 4194304 256
==8944== Warning: Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
number of iterations is 10
Results validated: 0 errors
Time taken to load 0.0838861 GBs: 193.035553 (ms)
Effective bandwidth: 0.434563 (GB/s)
==8944== Profiling application: ./cachetest 4 4194304 256
==8944== Profiling result:
==8944== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device “GK20A (0)”
Kernel: cache_kernel_4(int, int*, int*)
10 gld_requested_throughput Requested Global Load Throughput 26.248GB/s 26.258GB/s 26.253GB/s
10 gst_requested_throughput Requested Global Store Throughput 6.5620GB/s 6.5646GB/s 6.5634GB/s
10 tex_cache_throughput Texture Cache Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 gst_throughput Global Store Throughput 6.5620GB/s 6.5646GB/s 6.5634GB/s
10 gld_throughput Global Load Throughput 26.248GB/s 26.258GB/s 26.253GB/s
10 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
10 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
10 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
10 nc_gld_requested_throughput Requested Non-Coherent Global Load Throu 0.00000B/s 0.00000B/s 0.00000B/s
10 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 nc_l2_read_throughput L2 Throughput (Non-Coherent Reads) 0.00000B/s 0.00000B/s 0.00000B/s
10 nc_gld_throughput Non-Coherent Global Memory Load Throughp 0.00000B/s 0.00000B/s 0.00000B/s
10 nc_gld_efficiency Non-Coherent Global Load Efficiency 0.00% 0.00% 0.00%
10 l2_texture_read_throughput L2 Throughput (Texture Reads) 0.00000B/s 0.00000B/s 0.00000B/s
10 l2_read_throughput L2 Throughput (Reads) 26.250GB/s 26.260GB/s 26.255GB/s
10 l2_write_throughput L2 Throughput (Writes) 6.5621GB/s 6.5647GB/s 6.5634GB/s
10 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
Kernel: clear_cache(int*)
1 gld_requested_throughput Requested Global Load Throughput 6.3828GB/s 6.3828GB/s 6.3828GB/s
1 gst_requested_throughput Requested Global Store Throughput 6.3828GB/s 6.3828GB/s 6.3828GB/s
1 tex_cache_throughput Texture Cache Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 gst_throughput Global Store Throughput 6.3828GB/s 6.3828GB/s 6.3828GB/s
1 gld_throughput Global Load Throughput 6.3828GB/s 6.3828GB/s 6.3828GB/s
1 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
1 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
1 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
1 nc_gld_requested_throughput Requested Non-Coherent Global Load Throu 0.00000B/s 0.00000B/s 0.00000B/s
1 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 nc_l2_read_throughput L2 Throughput (Non-Coherent Reads) 0.00000B/s 0.00000B/s 0.00000B/s
1 nc_gld_throughput Non-Coherent Global Memory Load Throughp 0.00000B/s 0.00000B/s 0.00000B/s
1 nc_gld_efficiency Non-Coherent Global Load Efficiency 0.00% 0.00% 0.00%
1 l2_texture_read_throughput L2 Throughput (Texture Reads) 0.00000B/s 0.00000B/s 0.00000B/s
1 l2_read_throughput L2 Throughput (Reads) 6.3839GB/s 6.3839GB/s 6.3839GB/s
1 l2_write_throughput L2 Throughput (Writes) 6.3831GB/s 6.3831GB/s 6.3831GB/s
1 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s

For the cache test, I measure 32.7 GB/s but nvprof says 26.2 GB/s…

Hi, have you tried running the bandwidth test from the CUDA SDK? I would be interested in the results.

Yes, I got about 12.8 GB/s using a 512 MB array.
You get slightly better performance (~13 GB/s) by writing your own benchmark.
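For context, the theoretical DRAM peak — assuming the TK1's 64-bit DDR3L interface at 933 MHz, i.e. 1866 MT/s, a figure worth verifying against your particular board — would be:

```cpp
// Theoretical DRAM peak: bus width in bytes times transfer rate (MT/s).
double dram_peak_gbs(int bus_bits, double megatransfers) {
    return (bus_bits / 8) * megatransfers / 1e3;
}

// dram_peak_gbs(64, 1866) -> ~14.9 GB/s; 12.8-13 GB/s measured would be
// roughly 86-87% of that peak.
```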

Thank you.
Sadly, this is much lower than I was hoping for…