This is what I found from running some benchmarks on the Jetson TK1.
Peak compute (single): 308 Gflop/s (out of 327 Gflop/s peak)
Peak bandwidth: 13 GB/s (out of ??? GB/s)
L2 cache bandwidth: 32.75 GB/s (out of ??? GB/s)
Shared memory bandwidth: 107.6 GB/s (out of 218 GB/s)
The benchmark I used comes from: http://hpcgarage.org/archline
It was designed for energy-related measurements, so the L2 and shared-memory benchmarks use a pointer-chasing array to minimize integer computation.
I’m not sure whether this affects the performance, but these are the numbers I got.
Does anybody have an idea what the peak DRAM and L2 bandwidths are?
Also, if somebody has a benchmark that achieves higher performance, I’d appreciate a link so I can see what it does better.
Thank you.
It would be interesting to see how well nvprof metrics match up with these benchmarks.
What would I be looking at precisely?
Are you suggesting that I compute (# of shared loads / GPU time) and compare it against my performance numbers?
Thank you.
Here are the results for the shared memory test benchmark:
I launch 4M threads at 256 threads/block, and each thread accesses shared memory 512 times.
The GPU clock is fixed at 852 MHz.
==8881== NVPROF is profiling process 8881, command: ./smtest 512 4194304 256
number of iterations is 10
Results validated: 0 errors
Time taken to load 8.60671 GBs: 80.332932 (ms)
Effective bandwidth: 107.138 (GB/s)
==8881== Profiling application: ./smtest 512 4194304 256
==8881== Profiling result:
==8881== Event result:
Invocations Event Name Min Max Avg
Device “GK20A (0)”
Kernel: clear_cache(int*)
1 elapsed_cycles_sm 565937 565937 565937
Kernel: cache_kernel_512(int, int*, int*)
10 elapsed_cycles_sm 68134270 68139746 68135999
==8859== NVPROF is profiling process 8859, command: ./smtest 512 4194304 256
number of iterations is 10
Results validated: 0 errors
Time taken to load 8.60671 GBs: 81.245384 (ms)
Effective bandwidth: 105.935 (GB/s)
==8859== Profiling application: ./smtest 512 4194304 256
==8859== Profiling result:
==8859== Event result:
Invocations Event Name Min Max Avg
Device “GK20A (0)”
Kernel: clear_cache(int*)
1 shared_load 0 0 0
Kernel: cache_kernel_512(int, int*, int*)
10 shared_load 67108864 67108864 67108864
Something like "nvprof --metrics all" will print all K1 metrics, though it might take a while to run. You can list all possible metrics with "nvprof --query-metrics".
You could start out by capturing all the throughput numbers:
nvprof --metrics gld_requested_throughput,gst_requested_throughput,tex_cache_throughput,gst_throughput,gld_throughput,shared_efficiency,gld_efficiency,gst_efficiency,nc_gld_requested_throughput,local_load_throughput,local_store_throughput,shared_load_throughput,shared_store_throughput,nc_l2_read_throughput,nc_gld_throughput,nc_gld_efficiency,l2_texture_read_throughput,l2_read_throughput,l2_write_throughput,l2_atomic_throughput <benchmark>
Actually, I just realized that I’m accessing 4-byte data (int) from shared memory.
This might explain why my GB/s is 50% of peak.
I based my benchmark on an older (Fermi) GPU and forgot to update it to use 8-byte data.
Otherwise, each warp is accessing shared memory at a rate of 1 access/cycle, which is what the hardware is capable of.
I should figure out how to create a pointer-chasing benchmark based on 64-bit integers, but I’m concerned that this will cause the compiler to insert conversion instructions.
Does anyone have an idea how to create a pointer-chasing array based on long int (8-byte data)?
Thank you.
There is no 8-byte shared memory bank mode on sm_20 or sm_50, right? It was only implemented on sm_30/sm_35. I’m not sure about sm_32, but I assume they dropped the feature.
Here are the numbers for the shared memory benchmark.
==8907== NVPROF is profiling process 8907, command: ./smtest 512 4194304 256
==8907== Warning: Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
number of iterations is 10
Results validated: 0 errors
Time taken to load 8.60671 GBs: 794.819458 (ms)
Effective bandwidth: 10.8285 (GB/s)
==8907== Profiling application: ./smtest 512 4194304 256
==8907== Profiling result:
==8907== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device “GK20A (0)”
Kernel: clear_cache(int*)
1 gld_requested_throughput Requested Global Load Throughput 6.3316GB/s 6.3316GB/s 6.3316GB/s
1 gst_requested_throughput Requested Global Store Throughput 6.3316GB/s 6.3316GB/s 6.3316GB/s
1 tex_cache_throughput Texture Cache Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 gst_throughput Global Store Throughput 6.3316GB/s 6.3316GB/s 6.3316GB/s
1 gld_throughput Global Load Throughput 6.3316GB/s 6.3316GB/s 6.3316GB/s
1 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
1 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
1 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
1 nc_gld_requested_throughput Requested Non-Coherent Global Load Throu 0.00000B/s 0.00000B/s 0.00000B/s
1 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 nc_l2_read_throughput L2 Throughput (Non-Coherent Reads) 0.00000B/s 0.00000B/s 0.00000B/s
1 nc_gld_throughput Non-Coherent Global Memory Load Throughp 0.00000B/s 0.00000B/s 0.00000B/s
1 nc_gld_efficiency Non-Coherent Global Load Efficiency 0.00% 0.00% 0.00%
1 l2_texture_read_throughput L2 Throughput (Texture Reads) 0.00000B/s 0.00000B/s 0.00000B/s
1 l2_read_throughput L2 Throughput (Reads) 6.3339GB/s 6.3339GB/s 6.3339GB/s
1 l2_write_throughput L2 Throughput (Writes) 6.3319GB/s 6.3319GB/s 6.3319GB/s
1 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
Kernel: cache_kernel_512(int, int*, int*)
10 gld_requested_throughput Requested Global Load Throughput 209.78MB/s 209.80MB/s 209.79MB/s
10 gst_requested_throughput Requested Global Store Throughput 209.78MB/s 209.80MB/s 209.79MB/s
10 tex_cache_throughput Texture Cache Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 gst_throughput Global Store Throughput 209.78MB/s 209.80MB/s 209.79MB/s
10 gld_throughput Global Load Throughput 209.78MB/s 209.80MB/s 209.79MB/s
10 shared_efficiency Shared Memory Efficiency 100.00% 100.00% 100.00%
10 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
10 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
10 nc_gld_requested_throughput Requested Non-Coherent Global Load Throu 0.00000B/s 0.00000B/s 0.00000B/s
10 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 shared_load_throughput Shared Memory Load Throughput 214.82GB/s 214.83GB/s 214.82GB/s
10 shared_store_throughput Shared Memory Store Throughput 419.56MB/s 419.59MB/s 419.58MB/s
10 nc_l2_read_throughput L2 Throughput (Non-Coherent Reads) 0.00000B/s 0.00000B/s 0.00000B/s
10 nc_gld_throughput Non-Coherent Global Memory Load Throughp 0.00000B/s 0.00000B/s 0.00000B/s
10 nc_gld_efficiency Non-Coherent Global Load Efficiency 0.00% 0.00% 0.00%
10 l2_texture_read_throughput L2 Throughput (Texture Reads) 0.00000B/s 0.00000B/s 0.00000B/s
10 l2_read_throughput L2 Throughput (Reads) 209.94MB/s 209.96MB/s 209.95MB/s
10 l2_write_throughput L2 Throughput (Writes) 209.78MB/s 209.80MB/s 209.79MB/s
10 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
I’m not sure, but if that’s true then I guess the peak shared memory throughput on the Jetson TK1 is 128 bytes (32 banks × 4 bytes) per cycle, which means I’m getting 98.7% of peak.
Interesting, it looks like nvprof depresses the throughput numbers somewhat. Err, the shared memory throughput numbers look good!
Actually, this number is HIGHER than what I measure… hmmm do you have any idea why?
214.82GB/s (nvprof) vs. 107 GB/s (measured).
Thanks.
Maybe nvprof is loading 64 bits and discarding 32 of them, but counting all 64 bits towards this throughput measurement…
Here are the numbers for the cache test benchmark:
==8944== NVPROF is profiling process 8944, command: ./cachetest 4 4194304 256
==8944== Warning: Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
number of iterations is 10
Results validated: 0 errors
Time taken to load 0.0838861 GBs: 193.035553 (ms)
Effective bandwidth: 0.434563 (GB/s)
==8944== Profiling application: ./cachetest 4 4194304 256
==8944== Profiling result:
==8944== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device “GK20A (0)”
Kernel: cache_kernel_4(int, int*, int*)
10 gld_requested_throughput Requested Global Load Throughput 26.248GB/s 26.258GB/s 26.253GB/s
10 gst_requested_throughput Requested Global Store Throughput 6.5620GB/s 6.5646GB/s 6.5634GB/s
10 tex_cache_throughput Texture Cache Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 gst_throughput Global Store Throughput 6.5620GB/s 6.5646GB/s 6.5634GB/s
10 gld_throughput Global Load Throughput 26.248GB/s 26.258GB/s 26.253GB/s
10 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
10 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
10 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
10 nc_gld_requested_throughput Requested Non-Coherent Global Load Throu 0.00000B/s 0.00000B/s 0.00000B/s
10 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
10 nc_l2_read_throughput L2 Throughput (Non-Coherent Reads) 0.00000B/s 0.00000B/s 0.00000B/s
10 nc_gld_throughput Non-Coherent Global Memory Load Throughp 0.00000B/s 0.00000B/s 0.00000B/s
10 nc_gld_efficiency Non-Coherent Global Load Efficiency 0.00% 0.00% 0.00%
10 l2_texture_read_throughput L2 Throughput (Texture Reads) 0.00000B/s 0.00000B/s 0.00000B/s
10 l2_read_throughput L2 Throughput (Reads) 26.250GB/s 26.260GB/s 26.255GB/s
10 l2_write_throughput L2 Throughput (Writes) 6.5621GB/s 6.5647GB/s 6.5634GB/s
10 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
Kernel: clear_cache(int*)
1 gld_requested_throughput Requested Global Load Throughput 6.3828GB/s 6.3828GB/s 6.3828GB/s
1 gst_requested_throughput Requested Global Store Throughput 6.3828GB/s 6.3828GB/s 6.3828GB/s
1 tex_cache_throughput Texture Cache Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 gst_throughput Global Store Throughput 6.3828GB/s 6.3828GB/s 6.3828GB/s
1 gld_throughput Global Load Throughput 6.3828GB/s 6.3828GB/s 6.3828GB/s
1 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
1 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
1 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
1 nc_gld_requested_throughput Requested Non-Coherent Global Load Throu 0.00000B/s 0.00000B/s 0.00000B/s
1 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 nc_l2_read_throughput L2 Throughput (Non-Coherent Reads) 0.00000B/s 0.00000B/s 0.00000B/s
1 nc_gld_throughput Non-Coherent Global Memory Load Throughp 0.00000B/s 0.00000B/s 0.00000B/s
1 nc_gld_efficiency Non-Coherent Global Load Efficiency 0.00% 0.00% 0.00%
1 l2_texture_read_throughput L2 Throughput (Texture Reads) 0.00000B/s 0.00000B/s 0.00000B/s
1 l2_read_throughput L2 Throughput (Reads) 6.3839GB/s 6.3839GB/s 6.3839GB/s
1 l2_write_throughput L2 Throughput (Writes) 6.3831GB/s 6.3831GB/s 6.3831GB/s
1 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
For the cache test, I measure 32.7 GB/s but nvprof says 26.2 GB/s…
Hi, have you tried running the bandwidth test from the CUDA SDK? I would be interested in the results.
Yes, I got about 12.8 GB/s using a 512 MB array.
You get slightly better performance (~13 GB/s) by writing your own benchmark.
Thank you.
Sadly this is much lower than I was hoping for…