I can imagine two situations (I am not sure whether they can actually be forced to happen in practice, but let’s suppose they can):
- 1 subcore executes 32 threads in parallel (a warp). The memory accesses are perfectly aligned and sequential within the warp.
The second situation is:
- 32 subcores each execute 1 thread in parallel. In principle the memory accesses are still perfectly aligned and sequential across those 32 threads. (It’s the same code/kernel as in situation 1.)
According to the guide, memory requests can be “grouped” (coalesced) into one big memory request. (For some compute capabilities a cache also comes into play, but let’s suppose the data is not in the cache.)
Situation 1: The warp’s memory accesses are probably grouped into 1, 2 or 3 big memory requests (depending on the element type/size).
But what would happen in situation 2? Would the individual memory accesses/requests of the 32 subcores also be grouped into one big memory request?
Or would they become 32 individual memory requests?
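To make the numbers for situation 1 concrete, here is a rough host-side sketch of the arithmetic I have in mind. It is plain C++, not CUDA, and it only models the counting: `segments_touched` is a hypothetical helper of mine, and the 128-byte segment size is an assumption (the real size depends on the compute capability and cache configuration).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical helper (not a CUDA API): counts how many distinct aligned
// segments of `segment_bytes` are touched when `threads` threads each access
// one element of `element_bytes`, with thread i accessing element i
// starting at byte address `base`.
// ASSUMPTION: one "big memory request" corresponds to one aligned segment,
// and the segment size is 128 bytes.
std::size_t segments_touched(std::uint64_t base,
                             std::size_t element_bytes,
                             std::size_t threads = 32,
                             std::size_t segment_bytes = 128) {
    assert(element_bytes > 0 && segment_bytes > 0);
    std::set<std::uint64_t> segments;
    for (std::size_t i = 0; i < threads; ++i) {
        // First and last byte address accessed by thread i.
        std::uint64_t first = base + i * element_bytes;
        std::uint64_t last = first + element_bytes - 1;
        // An element may straddle a segment boundary, so record every
        // segment index between its first and last byte.
        for (std::uint64_t s = first / segment_bytes;
             s <= last / segment_bytes; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}
```

Under these assumptions, `segments_touched(0, 4)` gives 1, `segments_touched(0, 8)` gives 2 and `segments_touched(0, 12)` gives 3, which is where the "1, 2 or 3" for situation 1 comes from; a 16-byte element would give 4. The sketch says nothing about situation 2, since it does not model whether requests from different subcores can be merged.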