Questions about Tegra X2

I have the following questions about the Tegra X2:

  1. If I use a one-dimensional size for the grid and block, how many total threads does the Jetson Tegra X2 have if I enable both of the TX2's multiprocessors (2) for computation?
  2. The TX2 has 8GB of integrated LPDDR4. Can the 256 CUDA cores access this RAM? And what is the maximum size that can be assigned for the CUDA cores' use? 8GB? Or less?
  3. If I assign one RAM space (say 1 byte of RAM) (0x00000B-0x000001B), and I start 100 threads (1 block), is all these threads' access to this RAM address synchronous, or serial?

If all these 100 threads must access this same RAM address serially, how can we access this RAM address synchronously?

To maintain high thread occupancy and hide latency, you should typically launch many times more threads than the number of CUDA cores in the GPU.
See the CUDA Occupancy Calculator tool and the relevant sections of the CUDA C Programming Guide.

The GPU can access almost the whole 8GB of RAM, minus a couple hundred MB reserved for the Linux kernel.

You could use atomic operations, like atomicAdd() and others, to have all threads access the same memory with deterministic results.

Please see CUDA Memory Fence functions for synchronizing access between threads and blocks.
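As a minimal sketch of the atomic approach described above (the kernel and variable names here are illustrative, not from this thread), all 100 threads update the same RAM address through atomicAdd(); the hardware serializes each read-modify-write internally, but the final result is deterministic:

```cuda
#include <cstdio>

// Each of the 100 threads adds 1 to the same global counter.
__global__ void countThreads(int *counter)
{
    atomicAdd(counter, 1);
}

int main()
{
    int *counter;
    cudaMallocManaged(&counter, sizeof(int));   // unified memory on TX2
    *counter = 0;

    countThreads<<<1, 100>>>(counter);          // 1 block, 100 threads
    cudaDeviceSynchronize();

    printf("counter = %d\n", *counter);         // expect 100
    cudaFree(counter);
    return 0;
}
```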

[quote=“dusty_nv”]

To maintain high thread occupancy and hide latency, you should typically launch many times more threads than the number of CUDA cores in the GPU.
See the CUDA Occupancy Calculator tool and the relevant sections of the CUDA C Programming Guide.

The GPU can access almost the whole 8GB of RAM, minus a couple hundred MB reserved for the Linux kernel.

You could use atomic operations, like atomicAdd() and others, to have all threads access the same memory with deterministic results.

Please see CUDA Memory Fence functions for synchronizing access between threads and blocks.
[/quote]

I checked the link you sent, but it talks about read-modify-write atomic operations on one RAM address, which seem to work serially. My case is different: multiple threads (in one block) read the same RAM address, but modify or write different RAM addresses. So in my case, does the access work serially, or synchronously?

There are a maximum of 1024 threads per block, and practically unlimited blocks per grid (2^31-1 in the x dimension and 65,535 in the y and z dimensions).
See this Compute Capability table for more data on the number of threads.
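To make the arithmetic concrete, here is a hedged one-dimensional sketch (the launch dimensions are chosen for illustration, not mandated by the TX2): the total thread count is simply gridDim.x * blockDim.x, and each thread computes its own global index from the built-in variables.

```cuda
#include <cstdio>

__global__ void whoAmI(void)
{
    // Global thread index in a 1-D launch.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid == 0)
        printf("total threads = %d\n", gridDim.x * blockDim.x);
}

int main()
{
    dim3 grid(64);     // 64 blocks   (hardware limit: 2^31-1 in x)
    dim3 block(256);   // 256 threads per block (hardware limit: 1024)

    whoAmI<<<grid, block>>>();   // 64 * 256 = 16384 threads total
    cudaDeviceSynchronize();
    return 0;
}
```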

The atomic operations operate on one RAM address. The memory fences like __threadfence() do not operate on memory directly, but provide synchronization primitives. See also the __syncthreads() intrinsic, which provides block-level synchronization (the Cooperative Groups API offers an equivalent, this_thread_block().sync()).
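As a minimal sketch of block-level synchronization (a hypothetical shared-memory reduction, not code from this thread; it assumes blockDim.x is 256 and a power of two), each __syncthreads() call makes every thread in the block wait until all shared-memory writes before it are visible:

```cuda
#include <cstdio>

// Sum 256 values within one block using shared memory.
// __syncthreads() separates each phase of the reduction so that no
// thread reads a partial[] slot before its producer has written it.
__global__ void blockSum(const int *in, int *out)
{
    __shared__ int partial[256];
    int t = threadIdx.x;
    partial[t] = in[t];
    __syncthreads();                       // all loads into shared memory done

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride)
            partial[t] += partial[t + stride];
        __syncthreads();                   // wait between reduction steps
    }
    if (t == 0)
        *out = partial[0];
}

int main()
{
    const int n = 256;
    int *in, *out;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = 1;

    blockSum<<<1, n>>>(in, out);
    cudaDeviceSynchronize();
    printf("sum = %d\n", *out);            // expect 256
    cudaFree(in); cudaFree(out);
    return 0;
}
```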

You can have each CUDA thread read the same constant RAM (which will be cached), and then write to a different RAM address (without requiring synchronization). For example, if you were scaling an image by a scalar value, you would have each thread read the scalar before applying it to that thread's individual pixel. You may also be interested in reading about constant memory, which may offer improved performance and caching in this case.
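A sketch of that pattern (the kernel and names are hypothetical): every thread reads the same __constant__ scalar through the constant cache, and each writes only its own output element, so no synchronization is needed.

```cuda
#include <cstdio>

__constant__ float scale;   // one scalar every thread reads via the constant cache

__global__ void scalePixels(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * scale;   // same read address, different write addresses
}

int main()
{
    const int n = 256;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    float s = 2.0f;
    cudaMemcpyToSymbol(scale, &s, sizeof(float));  // set the shared scalar

    scalePixels<<<1, n>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);   // expect 2.0
    cudaFree(in); cudaFree(out);
    return 0;
}
```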

Reading memory will be "semi-parallel" because there are a few caches between the RAM and the GPU.
If many cores read the same address, the cache will make it much faster than if you read it directly from DRAM (which would have to be fully serialized).
This is similar to how many fragments can read from the same small texture at the same time when the GPU does rendering; it's a common occurrence, and seems to be reasonably well optimized on NVIDIA hardware.
If you look at the NVIDIA hardware block diagrams for their GPU architectures, you will see more than one cache block between the shared main DRAM and the actual execution units, and the different shader/compute blocks each have their own small caches.

Hi Snarky,

Thanks. I checked the TX2 HW architecture, and it shows that SM0 and SM1 share a 512KB L2 cache, the A57 CPU0...CPU3 share a 2MB L2 cache, and the Denver CPU0 and CPU1 share a 2MB L2 cache.

Hi dusty_nv.

Thanks.