Jetson TK1 performance

This is what I found from running some benchmarks on the Jetson TK1.

Peak compute (single): 308 Gflop/s (out of 327 Gflop/s peak)
Peak bandwidth: 13 GB/s (out of ??? GB/s)
L2 cache bandwidth: 32.75 GB/s (out of ??? GB/s)
Shared memory bandwidth: 107.6 GB/s (out of 218 GB/s)
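For context, the peak figures above can be back-computed from the GK20A's specs. The sketch below assumes 192 CUDA cores at 852 MHz and 32 shared memory banks delivering 8 bytes per cycle (Kepler's 8-byte bank mode) — parameters worth double-checking, not gospel:

```cpp
// Peak single-precision compute: one FMA (2 flops) per core per cycle.
double peak_gflops(int cuda_cores, double clock_ghz) {
    return cuda_cores * 2 * clock_ghz;
}

// Peak shared memory bandwidth: banks * bytes per bank per cycle * clock.
double peak_shared_gbs(int banks, int bytes_per_bank, double clock_ghz) {
    return banks * bytes_per_bank * clock_ghz;
}

// peak_gflops(192, 0.852)       -> ~327.2 Gflop/s
// peak_shared_gbs(32, 8, 0.852) -> ~218.1 GB/s
```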

The benchmark I used comes from http://hpcgarage.org/archline,
which was designed for energy-related measurements, so
the L2 and shared memory benchmarks use a pointer-chasing array to minimize integer computation.
I’m not sure whether this affects the performance, but these are the numbers I got.
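For anyone unfamiliar with the technique, here is a minimal host-side sketch of how a pointer-chasing array is typically built — my own illustration, not the archline code:

```cpp
#include <vector>
#include <numeric>
#include <algorithm>
#include <random>

// Build a pointer-chasing index array: a[i] holds the index of the next
// element to visit, forming one random cycle over all n slots. Each load
// depends on the previous one, so no address arithmetic can be overlapped
// with the memory access.
std::vector<int> make_chase(int n, unsigned seed = 42) {
    std::vector<int> perm(n);
    std::iota(perm.begin(), perm.end(), 0);
    std::shuffle(perm.begin() + 1, perm.end(), std::mt19937(seed)); // keep 0 first
    std::vector<int> a(n);
    for (int i = 0; i < n; ++i)
        a[perm[i]] = perm[(i + 1) % n]; // link perm[i] -> perm[i+1]
    return a;
}

// The timed loop is then just: for (k = 0; k < iters; ++k) idx = a[idx];
```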

Anybody have any idea what the peaks for the DRAM and L2 cache are?
Also, if somebody has a benchmark that can achieve higher performance, I’d appreciate a link to it so I can see what they do better.

Thank you.

It would be interesting to see how well nvprof metrics match up with these benchmarks.

What would I be looking at precisely?

Are you suggesting that I compute (# of shared loads / GPU time) and compare it against my performance numbers?

Thank you.

Here are the results for the shared memory test benchmark:
I create 4M threads @ 256 threads/block, and each thread accesses shared memory 512 times.
The GPU clock is fixed at 852 MHz.
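As a sanity check on the reported figures below, the nominal traffic works out as follows (assuming 4-byte loads, per my kernel):

```cpp
// Nominal bytes moved: threads * loads-per-thread * bytes-per-load.
long long bytes_moved(long long threads, long long loads, long long bytes_each) {
    return threads * loads * bytes_each;
}

// Effective bandwidth in GB/s from bytes moved and elapsed milliseconds.
double effective_gbs(long long bytes, double ms) {
    return bytes / 1e9 / (ms / 1e3);
}

// bytes_moved(4194304, 512, 4)     -> 8589934592 (~8.59 GB; close to the
//                                     8.60671 GB the program reports)
// effective_gbs(8589934592, 80.33) -> ~106.9 GB/s, in line with ~107 GB/s
```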

==8881== NVPROF is profiling process 8881, command: ./smtest 512 4194304 256
number of iterations is 10
Results validated: 0 errors
Time taken to load 8.60671 GBs: 80.332932 (ms)
Effective bandwidth: 107.138 (GB/s)
==8881== Profiling application: ./smtest 512 4194304 256
==8881== Profiling result:
==8881== Event result:
Invocations Event Name Min Max Avg
Device “GK20A (0)”
Kernel: clear_cache(int*)
1 elapsed_cycles_sm 565937 565937 565937
Kernel: cache_kernel_512(int, int*, int*)
10 elapsed_cycles_sm 68134270 68139746 68135999

==8859== NVPROF is profiling process 8859, command: ./smtest 512 4194304 256
number of iterations is 10
Results validated: 0 errors
Time taken to load 8.60671 GBs: 81.245384 (ms)
Effective bandwidth: 105.935 (GB/s)
==8859== Profiling application: ./smtest 512 4194304 256
==8859== Profiling result:
==8859== Event result:
Invocations Event Name Min Max Avg
Device “GK20A (0)”
Kernel: clear_cache(int*)
1 shared_load 0 0 0
Kernel: cache_kernel_512(int, int*, int*)
10 shared_load 67108864 67108864 67108864

Something like "nvprof --metrics all" will print all TK1 metrics. It might take a while to run, though. You can list all possible metrics with "nvprof --query-metrics".

You could start out by capturing all the throughput numbers:

nvprof --metrics gld_requested_throughput,gst_requested_throughput,tex_cache_throughput,gst_throughput,gld_throughput,shared_efficiency,gld_efficiency,gst_efficiency,nc_gld_requested_throughput,local_load_throughput,local_store_throughput,shared_load_throughput,shared_store_throughput,nc_l2_read_throughput,nc_gld_throughput,nc_gld_efficiency,l2_texture_read_throughput,l2_read_throughput,l2_write_throughput,l2_atomic_throughput <benchmark>

Actually, I just realized that I’m accessing 4-byte data (int) from shared memory.
This might explain why my bandwidth is ~50% of the peak.
I based my benchmark on an older (Fermi) GPU and forgot to update it to 8-byte data.

Otherwise, each warp accesses shared memory at a rate of 1 access per cycle, which is what the hardware is capable of.
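Assuming the Kepler figures of 32 banks x 8 bytes per cycle at 852 MHz, the 4-byte ceiling works out to roughly half the 8-byte peak, which lines up with my measurement:

```cpp
// Shared memory ceiling for a given per-bank access width, in GB/s.
double shared_ceiling_gbs(int banks, int bytes_per_access, double clock_ghz) {
    return banks * bytes_per_access * clock_ghz;
}

// shared_ceiling_gbs(32, 8, 0.852) -> ~218.1 GB/s (8-byte accesses)
// shared_ceiling_gbs(32, 4, 0.852) -> ~109.1 GB/s (4-byte ints)
// 107.6 GB/s measured / ~109.1 GB/s -> ~98.6% of the 4-byte ceiling
```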

I should figure out how to create a pointer-chasing benchmark based on 64-bit integers, except I’m concerned that this will cause the compiler to insert conversion instructions.

Anyone have any idea how to create a pointer-chasing array based on long int (8-byte data)?
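One possible approach (an untested sketch on my part): make both the array elements and the chase index 64-bit, so the load result feeds directly into the next index with no int/long conversions:

```cpp
#include <vector>
#include <cstdint>

// Pointer-chasing array over 8-byte integers: a[i] holds the next index.
// Keeping the chase variable int64_t as well means the compiled loop is
// just a 64-bit load feeding the next address, with no width conversions.
std::vector<int64_t> make_chase64(int64_t n) {
    std::vector<int64_t> a(n);
    for (int64_t i = 0; i < n; ++i)
        a[i] = (i + 1) % n;  // simple ring; permute for a random-order chase
    return a;
}

// Timed loop: int64_t idx = 0; for (k = 0; k < iters; ++k) idx = a[idx];
```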

Thank you.

There is no 8-byte shared memory bank mode on sm_20 or sm_50, right? It was only implemented on sm_30/sm_35. I’m not sure about sm_32, but I assume they dropped the feature.

Here are the numbers for the shared memory benchmark.

==8907== NVPROF is profiling process 8907, command: ./smtest 512 4194304 256
==8907== Warning: Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
number of iterations is 10
Results validated: 0 errors
Time taken to load 8.60671 GBs: 794.819458 (ms)
Effective bandwidth: 10.8285 (GB/s)
==8907== Profiling application: ./smtest 512 4194304 256
==8907== Profiling result:
==8907== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device “GK20A (0)”
Kernel: clear_cache(int*)
1 gld_requested_throughput Requested Global Load Throughput 6.3316GB/s 6.3316GB/s 6.3316GB/s
1 gst_requested_throughput Requested Global Store Throughput 6.3316GB/s 6.3316GB/s 6.3316GB/s
1 tex_cache_throughput Texture Cache Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 gst_throughput Global Store Throughput 6.3316GB/s 6.3316GB/s 6.3316GB/s
1 gld_throughput Global Load Throughput 6.3316GB/s 6.3316GB/s 6.3316GB/s
1 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
1 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
1 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
1 nc_gld_requested_throughput Requested Non-Coherent Global Load Throu 0.00000B/s 0.00000B/s 0.00000B/s
1 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 nc_l2_read_throughput L2 Throughput (Non-Coherent Reads) 0.00000B/s 0.00000B/s 0.00000B/s
1 nc_gld_throughput Non-Coherent Global Memory Load Throughp 0.00000B/s 0.00000B/s 0.00000B/s
1 nc_gld_efficiency Non-Coherent Global Load Efficiency 0.00% 0.00% 0.00%
1 l2_texture_read_throughput L2 Throughput (Texture Reads) 0.00000B/s 0.00000B/s 0.00000B/s
1 l2_read_throughput L2 Throughput (Reads) 6.3339GB/s 6.3339GB/s 6.3339GB/s
1 l2_write_throughput L2 Throughput (Writes) 6.3319GB/s 6.3319GB/s 6.3319GB/s
1 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
Kernel: cache_kernel_512(int, int*, int*)
10 gld_requested_throughput Requested Global Load Throughput 209.78MB/s 209.80MB/s 209.79MB/s
10 gst_requested_throughput Requested Global Store Throughput 209.78MB/s 209.80MB/s 209.79MB/s
10 tex_cache_throughput Texture Cache Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 gst_throughput Global Store Throughput 209.78MB/s 209.80MB/s 209.79MB/s
10 gld_throughput Global Load Throughput 209.78MB/s 209.80MB/s 209.79MB/s
10 shared_efficiency Shared Memory Efficiency 100.00% 100.00% 100.00%
10 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
10 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
10 nc_gld_requested_throughput Requested Non-Coherent Global Load Throu 0.00000B/s 0.00000B/s 0.00000B/s
10 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 shared_load_throughput Shared Memory Load Throughput 214.82GB/s 214.83GB/s 214.82GB/s
10 shared_store_throughput Shared Memory Store Throughput 419.56MB/s 419.59MB/s 419.58MB/s
10 nc_l2_read_throughput L2 Throughput (Non-Coherent Reads) 0.00000B/s 0.00000B/s 0.00000B/s
10 nc_gld_throughput Non-Coherent Global Memory Load Throughp 0.00000B/s 0.00000B/s 0.00000B/s
10 nc_gld_efficiency Non-Coherent Global Load Efficiency 0.00% 0.00% 0.00%
10 l2_texture_read_throughput L2 Throughput (Texture Reads) 0.00000B/s 0.00000B/s 0.00000B/s
10 l2_read_throughput L2 Throughput (Reads) 209.94MB/s 209.96MB/s 209.95MB/s
10 l2_write_throughput L2 Throughput (Writes) 209.78MB/s 209.80MB/s 209.79MB/s
10 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s

I’m not sure, but if that’s true then I guess the peak shared memory throughput on the Jetson TK1 is 128 bytes (32 × 4 bytes) per cycle, which means I’m getting ~98.7% of the peak shared memory throughput.

Interesting, it looks like nvprof depresses the throughput numbers somewhat. Err, the shared memory throughput numbers look good!

What do you mean?

Actually, this number is HIGHER than what I measure… hmmm do you have any idea why?

214.82GB/s (nvprof) vs. 107 GB/s (measured).

Thanks.

Maybe nvprof is loading 64 bits and discarding 32 bits, but counts all 64 bits towards this throughput measurement…
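That hypothesis is easy to check against the counters: shared_load reported 67,108,864 warp-level load instructions per invocation (4194304 threads / 32 per warp × 512 loads). If nvprof charges each instruction 32 threads × 8 bytes, its number falls out; at 4 bytes per thread, my measurement falls out instead:

```cpp
// Bandwidth implied by the warp-level shared_load count, assuming each
// instruction moves 32 threads x bytes_per_thread bytes.
double implied_gbs(long long warp_loads, int bytes_per_thread, double ms) {
    return warp_loads * 32.0 * bytes_per_thread / 1e9 / (ms / 1e3);
}

// implied_gbs(67108864, 8, 80.33) -> ~213.9 GB/s (close to nvprof's 214.82)
// implied_gbs(67108864, 4, 80.33) -> ~106.9 GB/s (close to my measured 107)
```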

Here are the numbers for the cache test benchmark:

==8944== NVPROF is profiling process 8944, command: ./cachetest 4 4194304 256
==8944== Warning: Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
number of iterations is 10
Results validated: 0 errors
Time taken to load 0.0838861 GBs: 193.035553 (ms)
Effective bandwidth: 0.434563 (GB/s)
==8944== Profiling application: ./cachetest 4 4194304 256
==8944== Profiling result:
==8944== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device “GK20A (0)”
Kernel: cache_kernel_4(int, int*, int*)
10 gld_requested_throughput Requested Global Load Throughput 26.248GB/s 26.258GB/s 26.253GB/s
10 gst_requested_throughput Requested Global Store Throughput 6.5620GB/s 6.5646GB/s 6.5634GB/s
10 tex_cache_throughput Texture Cache Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 gst_throughput Global Store Throughput 6.5620GB/s 6.5646GB/s 6.5634GB/s
10 gld_throughput Global Load Throughput 26.248GB/s 26.258GB/s 26.253GB/s
10 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
10 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
10 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
10 nc_gld_requested_throughput Requested Non-Coherent Global Load Throu 0.00000B/s 0.00000B/s 0.00000B/s
10 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 nc_l2_read_throughput L2 Throughput (Non-Coherent Reads) 0.00000B/s 0.00000B/s 0.00000B/s
10 nc_gld_throughput Non-Coherent Global Memory Load Throughp 0.00000B/s 0.00000B/s 0.00000B/s
10 nc_gld_efficiency Non-Coherent Global Load Efficiency 0.00% 0.00% 0.00%
10 l2_texture_read_throughput L2 Throughput (Texture Reads) 0.00000B/s 0.00000B/s 0.00000B/s
10 l2_read_throughput L2 Throughput (Reads) 26.250GB/s 26.260GB/s 26.255GB/s
10 l2_write_throughput L2 Throughput (Writes) 6.5621GB/s 6.5647GB/s 6.5634GB/s
10 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
Kernel: clear_cache(int*)
1 gld_requested_throughput Requested Global Load Throughput 6.3828GB/s 6.3828GB/s 6.3828GB/s
1 gst_requested_throughput Requested Global Store Throughput 6.3828GB/s 6.3828GB/s 6.3828GB/s
1 tex_cache_throughput Texture Cache Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 gst_throughput Global Store Throughput 6.3828GB/s 6.3828GB/s 6.3828GB/s
1 gld_throughput Global Load Throughput 6.3828GB/s 6.3828GB/s 6.3828GB/s
1 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
1 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
1 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
1 nc_gld_requested_throughput Requested Non-Coherent Global Load Throu 0.00000B/s 0.00000B/s 0.00000B/s
1 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 nc_l2_read_throughput L2 Throughput (Non-Coherent Reads) 0.00000B/s 0.00000B/s 0.00000B/s
1 nc_gld_throughput Non-Coherent Global Memory Load Throughp 0.00000B/s 0.00000B/s 0.00000B/s
1 nc_gld_efficiency Non-Coherent Global Load Efficiency 0.00% 0.00% 0.00%
1 l2_texture_read_throughput L2 Throughput (Texture Reads) 0.00000B/s 0.00000B/s 0.00000B/s
1 l2_read_throughput L2 Throughput (Reads) 6.3839GB/s 6.3839GB/s 6.3839GB/s
1 l2_write_throughput L2 Throughput (Writes) 6.3831GB/s 6.3831GB/s 6.3831GB/s
1 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s

For the cache test, I measure 32.7 GB/s but nvprof says 26.2 GB/s…

Hi, have you tried running the bandwidth test from the CUDA SDK? I would be interested in the results.

Yes, I got about 12.8 GB/s using a 512 MB array.
You get slightly better performance (~13 GB/s) by writing your own benchmark.
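For context, the theoretical DRAM peak — assuming the TK1's 64-bit DDR3L interface at 933 MHz, i.e. 1866 MT/s, a figure worth verifying against your particular board — would be:

```cpp
// Theoretical DRAM peak: bus width in bytes times transfer rate (MT/s).
double dram_peak_gbs(int bus_bits, double megatransfers) {
    return (bus_bits / 8) * megatransfers / 1e3;
}

// dram_peak_gbs(64, 1866) -> ~14.9 GB/s; 12.8-13 GB/s measured would be
// roughly 86-87% of that peak.
```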

Thank you.
Sadly, this is much lower than I was hoping for…