I’m a bit confused about what the NVIDIA Visual Profiler counts as device memory in the kernel memory section of its analysis. The programming guide says that device memory includes “global, local, shared, constant, or texture memory”. If so, why do I get more transactions in Global L1 Cache than reads of device memory? Are the transactions in the device memory section only the uncached ones?
I assume that transactions are performed per warp and can be 32, 64, 96 or 128 bytes in size. So if every thread of a warp reads an 8-bit type from consecutive addresses, there is one 32-byte transaction (32 threads × 1 byte).
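To make that assumption concrete, here is a minimal host-side sketch in C++ of how I picture it. It is only a simplified model, not the profiler's actual counting logic: it assumes a warp of 32 threads, 32-byte memory segments, and that thread `i` accesses `elem_size` bytes starting at `base + i * stride_bytes`. The function name `segments_per_warp` and its parameters are hypothetical and made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical helper (not part of any NVIDIA API): counts the distinct
// 32-byte segments touched by one warp under the simplified model above.
// Real hardware behaviour depends on the architecture and cache settings.
std::size_t segments_per_warp(std::uint64_t base,
                              std::size_t elem_size,
                              std::size_t stride_bytes) {
    assert(elem_size > 0);                     // each thread reads at least one byte
    constexpr std::size_t kWarpSize = 32;      // assumed threads per warp
    constexpr std::uint64_t kSegmentBytes = 32; // assumed segment size

    std::set<std::uint64_t> segments;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        // First and last byte address accessed by thread t.
        const std::uint64_t first = base + t * stride_bytes;
        const std::uint64_t last = first + elem_size - 1;
        // Record every 32-byte segment that the access overlaps.
        for (std::uint64_t s = first / kSegmentBytes; s <= last / kSegmentBytes; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}
```

Under this model, `segments_per_warp(0, 1, 1)` (every thread reads one consecutive byte, aligned) touches a single 32-byte segment, while `segments_per_warp(0, 4, 4)` (consecutive 4-byte reads) touches four segments, i.e. 128 bytes in total.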
Thanks for the help!
edit: 32 bytes, not bits