Device memory in NVIDIA Visual Profiler

I’m a bit confused about what the NVIDIA Visual Profiler counts as device memory in the kernel memory section of its analysis. The programming guide says that device memory includes “global, local, shared, constant, or texture memory”. If so, why do I get more transactions in Global L1 Cache than reads of device memory? Are the transactions in the device memory section only the uncached ones?

I assume that transactions are performed per warp and can be 32, 64, 96 or 128 bytes. So if every thread reads an 8-bit type, there is one 32-byte transaction.


Thanks for help!

edit: 32 bytes, not bits

Yes, device memory (DRAM) transactions are only the ones that don’t hit in L1 or L2 (or one of the other caches).
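To make the difference between the two counters concrete, here is a toy model in Python. It is only a sketch under strong assumptions: the cache is idealized (unlimited capacity, no evictions, a single level), and `count_transactions` is a made-up helper, not anything the profiler exposes. It just shows why the cache sees every request while DRAM only sees the misses.

```python
def count_transactions(requests):
    """Count cache transactions and DRAM reads for a sequence of requests.

    requests: iterable of segment indices, in the order they are requested.
    Assumption: an idealized cache with unlimited capacity that never evicts.
    """
    cached = set()
    cache_transactions = 0
    dram_reads = 0
    for seg in requests:
        cache_transactions += 1   # every request is seen by the cache
        if seg not in cached:
            dram_reads += 1       # only a miss goes out to DRAM
            cached.add(seg)
    return cache_transactions, dram_reads


# Four warps all read the same four segments, e.g. a small lookup table.
requests = [0, 1, 2, 3] * 4
print(count_transactions(requests))  # (16, 4)
```

A real GPU has finite caches and more than one cache level, so the actual numbers will differ, but the relationship is the same: the cache counter grows with every request and the device memory counter grows only on misses.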

Transactions to device memory (DRAM) are generated at the L2 cache level. The unit in the L2 cache is called a “line” and the unit in DRAM is called a “segment”. Only whole segments can be read from or written to DRAM. A segment is 32 bytes, not 32 bits.

As long as the individual requests of the threads in a warp that contribute to a transaction can be “coalesced” into a set of lines/segments, only those lines/segments will be requested.
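The arithmetic behind coalescing can be sketched in a few lines of Python. This is a simplified model, not how the hardware is implemented: it assumes a warp of 32 threads and 32-byte segments as described above, and `segments_touched` is a hypothetical helper that counts how many distinct segments one warp-wide access covers.

```python
SEGMENT_BYTES = 32  # DRAM segment size, as stated above
WARP_SIZE = 32      # threads per warp


def segments_touched(base, stride_bytes, elem_bytes, threads=WARP_SIZE):
    """Return the number of distinct 32-byte segments one warp access covers.

    Thread t reads elem_bytes bytes starting at base + t * stride_bytes.
    """
    segments = set()
    for t in range(threads):
        first = base + t * stride_bytes
        last = first + elem_bytes - 1
        # An element may straddle a segment boundary, so record every
        # segment it overlaps.
        for seg in range(first // SEGMENT_BYTES, last // SEGMENT_BYTES + 1):
            segments.add(seg)
    return len(segments)


# 32 threads each read one consecutive 1-byte value: 32 bytes in total.
print(segments_touched(base=0, stride_bytes=1, elem_bytes=1))    # 1
# 32 threads each read one consecutive 4-byte value: 128 bytes in total.
print(segments_touched(base=0, stride_bytes=4, elem_bytes=4))    # 4
# Same 4-byte reads, but 128 bytes apart: no two threads share a segment.
print(segments_touched(base=0, stride_bytes=128, elem_bytes=4))  # 32
# Consecutive 1-byte reads that start 16 bytes into a segment.
print(segments_touched(base=16, stride_bytes=1, elem_bytes=1))   # 2
```

The first case matches the one from the question: 32 threads reading an 8-bit type from consecutive, aligned addresses need a single 32-byte segment. The last two cases show how a large stride or a misaligned start increases the number of segments requested for the same amount of useful data.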