The GPU can access almost the whole 8GB of RAM, minus a couple hundred MB reserved for the Linux kernel.
You could use atomic operations, such as atomicAdd(), to have all threads access the same memory address with deterministic results.
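As a sketch of that idea (a hypothetical histogram kernel, not code from this thread), atomicAdd() lets many threads update the same addresses without lost updates:

```cuda
// Hypothetical example: many threads increment shared counters.
// A plain bins[data[i]]++ would be a racy read-modify-write;
// atomicAdd() serializes only the conflicting updates, so the
// final counts are deterministic.
__global__ void histogram(const unsigned char *data, int n,
                          unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&bins[data[i]], 1u);  // safe concurrent increment
    }
}
```

Only threads that hit the same bin are serialized against each other; updates to different bins proceed independently.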
Please see CUDA Memory Fence functions for synchronizing access between threads and blocks.
I checked the link you sent, but it describes read-modify-write atomic operations on a single RAM address, which seem to work serially. My case is different: multiple threads (in one block) read the same RAM address, but modify/write different RAM addresses. So in my case, do the reads happen serially or in parallel?
There is a maximum of 1024 threads per block, and a practically unlimited number of blocks per grid (2^31-1 in the X dimension and 65,535 in the Y and Z dimensions).
See this Compute Capability table for more data on the number of threads.
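Rather than hard-coding those limits, you can query them for your particular GPU at runtime. A minimal host-side sketch using the CUDA runtime API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query the actual per-device limits instead of assuming them.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxGridSize: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```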
The atomic operations operate on one RAM address at a time. The memory fences like __threadfence() do not operate on memory directly, but provide synchronization primitives. See also the __syncthreads() intrinsic, which provides block-level synchronization, and the Cooperative Groups API for more flexible synchronization scopes.
You can have each CUDA thread read the same constant RAM address (which will be cached), and then write to a different RAM address, without requiring synchronization. For example, if you were scaling an image by a scalar value, each thread could read the scalar and then apply it to that thread's individual pixel. You may also be interested in constant memory, which may offer improved performance and caching in this case.
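A hypothetical sketch of that pattern (names are illustrative): every thread reads the same scalar, here placed in __constant__ memory, then writes its own pixel, so no synchronization is needed.

```cuda
// Broadcast value in constant memory; it is cached and served
// efficiently when all threads in a warp read the same address.
__constant__ float scale;

__global__ void scalePixels(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // All threads read the same 'scale' (constant cache),
        // but each writes a distinct address: no race, no sync.
        out[i] = in[i] * scale;
    }
}
```

The host would set the scalar once with cudaMemcpyToSymbol(scale, &h_scale, sizeof(float)) before launching the kernel.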
Reading memory will be "semi-parallel" because there are a few caches between the RAM and the GPU.
If many cores read the same address, the cache will make it much faster than if you read it directly from DRAM (which would have to be fully serialized).
This is similar to how many fragments can read from the same small texture at the same time when the GPU does rendering – it’s a common occurrence, and seems to be reasonably well optimized on NVIDIA hardware.
If you look at the NVIDIA hardware block diagrams for their GPU architectures, you will see more than one cache block between shared main DRAM and the actual execution units, and different shader/compute blocks separate out different little caches.