Jetson TX2 Cache Line Size

Hello everyone,

I read this post about caches of CUDA devices: https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/memorystatisticscaches.htm

It states that the cache line size is 128 bytes for the L1 cache and 32 bytes for the L2 cache. Is this also correct for the Jetson TX2 platform?

Thank you for any help.

Best regards.

Hi,

The memory design is different on the Jetson platform.
Here is our document for your reference:
https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-management

For the cache size information, you can get it from the deviceQuery CUDA sample:

$ /usr/local/cuda-10.0/bin/cuda-install-samples-10.0.sh .
$ cd NVIDIA_CUDA-10.0_Samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery

Here is the output on a Jetson Nano:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3965 MBytes (4157140992 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS

Thanks.

Hi, thank you for the information. Unfortunately, the deviceQuery sample only displays the total cache size, not the cache line size I am looking for. Do you have any information on that, or on the size of a memory transaction? I suppose it should be 32 bytes.

Best regards.

Hi,

Sorry, I didn’t notice that you were asking about the cache line size.
Let me check this internally and get back to you.

Thanks.

Hi, I wanted to check back whether you have any news on this question.
Thank you.

Best regards

Hi,

Thanks for your patience.

We are still checking this information for you.
We will update you once we get feedback.

It would be interesting. I am doing some research and I can’t find any information about that either.

Moreover, it would be useful to know if the GPU L2 cache is multi-banked or not.

Hi,

Sorry for keeping you waiting.

We are still checking this with our developer.
We will share more information once we have a conclusion.

Thanks.

Is there any news on this topic?

Hi,

We got some feedback from our developer today.

It should be 128 bytes. You can find this information in our document:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-3-0
A cache line is 128 bytes and maps to a 128 byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses.
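To see how the 32-byte transaction granularity affects a single warp's load, here is a small back-of-the-envelope sketch. It is plain Python, not CUDA; the warp size (32 threads) and 32-byte segment size come from the documentation quoted above, while the two access patterns are just illustrative examples:

```python
# Count the distinct 32-byte segments touched by one warp issuing 4-byte loads.
# Warp size and segment size are taken from the CUDA documentation quoted
# above; the coalesced and strided access patterns are illustrative.

WARP_SIZE = 32   # threads per warp
SEGMENT = 32     # L2 memory transaction size in bytes
WORD = 4         # bytes per 32-bit load (e.g. a float)

def transactions(addresses):
    """Number of 32-byte segments needed to service the given byte addresses."""
    return len({addr // SEGMENT for addr in addresses})

# Coalesced: thread i loads element i -> consecutive addresses 0, 4, 8, ...
coalesced = [i * WORD for i in range(WARP_SIZE)]

# Strided: thread i loads element 8*i -> addresses 0, 32, 64, ... (scattered)
strided = [i * 8 * WORD for i in range(WARP_SIZE)]

print(transactions(coalesced))  # 4 transactions (128 bytes / 32-byte segments)
print(transactions(strided))    # 32 transactions (one segment per thread)
```

The coalesced warp touches 128 contiguous bytes and is serviced with four 32-byte L2 transactions (or a single 128-byte transaction when cached in L1), while the strided warp forces one transaction per thread, which is the over-fetch the documentation's note about L2-only caching is addressing.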

Thanks.