Why does reading one byte produce multiple global load L2 transactions?

vincentl13x · August 30, 2018, 9:40am

I’m reading from a global int8 array from different threads, each thread reading one byte from the array. My device is a GTX 1080 and the CUDA version is 9.1.
The NVIDIA Visual Profiler shows Global Load L2 Transactions/Access = 3.1 and Ideal Transactions/Access = 1.

The disassembly looks like this:

@!P3 LDG.E.CI.U8 R9, [R4] ;
@!P3 LDG.E.CI.U8 R7, [R4+0x1] ;
@!P3 LDG.E.CI.U8 R11, [R4+0x2] ;
@!P3 LDG.E.CI.U8 R10, [R4+0x3] ;

My question is how loading 8 bits can require multiple transactions, given that a transaction can be 32, 64, or 128 bytes.

Does "access" in Transactions/Access mean one instruction here?

Robert_Crovella · August 30, 2018, 3:51pm

It’s impossible to say from the SASS you’ve excerpted.

If adjacent threads in a warp are reading bytes, but those bytes are scattered (i.e. the addresses held in R4 differ widely from thread to thread), then you will get multiple transactions per request.

vincentl13x · August 30, 2018, 3:59pm

Thanks for your response.
Indeed, multiple threads are reading bytes scattered across different addresses. The actual code is complex, something like:

for (int i = 0; i < 4; i++)
    ... = data[
        (((((((threadIdx.z*7 + threadIdx.y)*7) + threadIdx.x)/81) % 4)*49))*4 +
        (((((((threadIdx.z*7 + threadIdx.y)*7) + threadIdx.x)/9) % 9)*7)) +
        ((((threadIdx.z*4 + threadIdx.y)*7 + threadIdx.x) % 9)) + i];

How is the number of requests counted? If multiple threads are reading, e.g. A[threadIdx.x], is the number of requests one, or the number of threads?

Robert_Crovella · August 30, 2018, 4:19pm

The request is the LD instruction issued warp-wide.

That single request may generate 1 or more than 1 transaction.

If the global load request goes to the L2, as it seems to be doing in this case, then there will be one transaction per 32-byte memory segment (or cache-line, if you prefer) touched by the request.

So if 8 of the addresses are contained within a single 32-byte segment, those 8 addresses will be “coalesced” by the memory controller into a single transaction. If the addresses are spaced by 32 bytes or more, then the number of transactions will be equal to the number of threads (or addresses, if you prefer). That could be up to 32 transactions per request (the maximum, per warp).
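
To make the counting rule above concrete, here is a minimal host-side sketch in plain C++ (it needs no GPU and should also compile under nvcc). It only models the rule as described in this thread: one transaction per distinct 32-byte segment touched by a warp-wide request. The names kSegmentBytes, count_transactions and strided_warp are hypothetical helpers invented for illustration, not part of CUDA or the profiler, and real hardware may behave differently in details.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Assumed transaction granularity in bytes, taken from the answer above.
constexpr std::uint64_t kSegmentBytes = 32;

// Hypothetical helper: number of distinct 32-byte segments touched by one
// warp-wide request, given the byte address that each active thread reads.
std::size_t count_transactions(const std::vector<std::uint64_t>& addresses) {
    // A warp-wide request carries at most 32 per-thread addresses.
    assert(addresses.size() <= 32);
    std::set<std::uint64_t> segments;
    for (std::uint64_t addr : addresses) {
        segments.insert(addr / kSegmentBytes);  // index of the segment holding addr
    }
    return segments.size();
}

// Hypothetical helper: addresses for a full 32-thread warp in which thread t
// reads one byte at base + t * stride_bytes.
std::vector<std::uint64_t> strided_warp(std::uint64_t base,
                                        std::uint64_t stride_bytes) {
    std::vector<std::uint64_t> addresses;
    for (std::uint64_t t = 0; t < 32; ++t) {
        addresses.push_back(base + t * stride_bytes);
    }
    return addresses;
}
```

Under this model, count_transactions(strided_warp(0, 1)) gives 1 (32 contiguous, aligned bytes fit in one segment), strided_warp(0, 4) gives 4, strided_warp(16, 1) gives 2 (the contiguous bytes straddle a segment boundary), and strided_warp(0, 32) gives 32, the per-warp maximum mentioned above. An average of 3.1 transactions per access would then suggest that each warp-wide load typically touches about three such segments.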
