Why reading one byte produces multiple global load l2 transactions?

I’m reading from a global int8 array from different threads, each thread reading one byte from the array. My device is a GTX 1080 and the CUDA version is 9.1.
NV Visual Profiler shows Global Load L2 Transactions/Access=3.1, Ideal Transactions/Access = 1.

The disassembly is like

@!P3 LDG.E.CI.U8 R9, [R4]         
@!P3 LDG.E.CI.U8 R7, [R4+0x1] ;
@!P3 LDG.E.CI.U8 R11, [R4+0x2] ;
@!P3 LDG.E.CI.U8 R10, [R4+0x3] ;

My question is why it is possible that loading 8 bits require multiple transactions, since transaction size can be 32,64,128 bits.

Is the access in transactions / access an instruction here?

It’s impossible to say from the SASS you’ve excerpted.

If adjacent threads in a warp are reading bytes, but those bytes are scattered (i.e. the address contained in R4, for each thread) then you will get multiple transactions per request.

Thanks for your response.
Indeed there are multiple threads reading bytes scattered at different addresses. The actual code is complex, something like

for (int i = 0; i < 4; i++)
... = data[
(((((((threadIdx.z*7 + threadIdx.y)*7) + threadIdx.x)/81) % 4)*49))*4
(((((((threadIdx.z*7 + threadIdx.y)*7) + threadIdx.x)/9) % 9)*7)) + 
((((threadIdx.z*4 + threadIdx.y)*7 + threadIdx.x) % 9)) + i]

How is the number of request counted? If multiple threads are reading, e.g. A[threadIdx.x], is the number of request one, or the number of threads?

The request is the LD instruction issued warp-wide.

That single request may generate 1 or more than 1 transaction.

If the global load request goes to the L2 as it seems to be doing in this case, then there will be one transaction per 32-byte memory segment (or cache-line, if you prefer).

So if 8 of the addresses are contained within a single 32-byte segment, those 8 addresses will be “coalesced” by the memory controller into a single transaction. If the addresses are spaced by 32 bytes or more, then the number of transactions will be equal to the number of threads (or addresses, if you prefer). That could be up to 32 transactions per request (the maximum, per warp).