Jetson TX2 Cache Line Size

Hello everyone,

I read this post about caches of CUDA devices: https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/memorystatisticscaches.htm

It states that the cache line size is 128 bytes for the L1 cache and 32 bytes for the L2 cache. Is this also correct for the Jetson TX2 platform?

Thank you for any help.

Best regards.

Hi,

The memory design is different on the Jetson platform.
Here is our document for your reference:
https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-management

For the cache size information, you can get it from the deviceQuery CUDA sample:

$ /usr/local/cuda-10.0/bin/cuda-install-samples-10.0.sh .
$ cd NVIDIA_CUDA-10.0_Samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery

Here is the output on a Jetson Nano:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3965 MBytes (4157140992 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS

Thanks.

Hi, thank you for the information. Unfortunately, the deviceQuery sample only displays the total cache size, not the cache line size I am looking for. Do you have any information on that, or on the size of a memory transaction? I suppose it should be 32 bytes.

Best regards.

Hi,

Sorry, I didn’t notice that you were asking about the cache line size.
Let me check this internally and get back to you.

Thanks.

Hi, I wanted to check back whether you have any news on this question.
Thank you.

Best regards

Hi,

Thanks for your patience.

We are still checking this information for you.
We will update you once we get feedback.

It would be interesting. I am doing some research and I can’t find any information about that either.

Moreover, it would be useful to know if the GPU L2 cache is multi-banked or not.

Hi,

Sorry for keeping you waiting.

We are still checking this with our developer.
We will share more information once we have a conclusion.

Thanks.

Is there any news on this topic?

Hi,

We got some feedback from our developer today.

It should be 128 bytes. You can find this information in our document:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-3-0
A cache line is 128 bytes and maps to a 128 byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses.
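To see how the 32-byte transaction granularity affects a single warp's load, here is a small back-of-the-envelope sketch. It is plain Python, not CUDA; the warp size (32 threads) and 32-byte segment size come from the documentation quoted above, while the two access patterns are just illustrative examples:

```python
# Count the distinct 32-byte segments touched by one warp issuing 4-byte loads.
# Warp size and segment size are taken from the CUDA documentation quoted
# above; the coalesced and strided access patterns are illustrative.

WARP_SIZE = 32   # threads per warp
SEGMENT = 32     # L2 memory transaction size in bytes
WORD = 4         # bytes per 32-bit load (e.g. a float)

def transactions(addresses):
    """Number of 32-byte segments needed to service the given byte addresses."""
    return len({addr // SEGMENT for addr in addresses})

# Coalesced: thread i loads element i -> consecutive addresses 0, 4, 8, ...
coalesced = [i * WORD for i in range(WARP_SIZE)]

# Strided: thread i loads element 8*i -> addresses 0, 32, 64, ... (scattered)
strided = [i * 8 * WORD for i in range(WARP_SIZE)]

print(transactions(coalesced))  # 4 transactions (128 bytes / 32-byte segments)
print(transactions(strided))    # 32 transactions (one segment per thread)
```

The coalesced warp touches 128 contiguous bytes and is serviced with four 32-byte L2 transactions (or a single 128-byte transaction when cached in L1), while the strided warp forces one transaction per thread, which is the over-fetch the documentation's note about L2-only caching is addressing.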

Thanks.