Cache entry size


I want to maximize the throughput of non-coalesced loads of tree nodes from 1D texture memory (coalescing is not possible in my case). To do that, I would like to adjust the size of each node so that it matches the size of a texture cache entry. Does anyone know the size of the data chunk that is transferred from device memory to the cache at once?

I can pack each node into 40 bytes, but it might be better to use 48 bytes if that matches the cache entry better. Any suggestions? For the 40-byte version I would use a texture of floats; for the 48-byte version it would probably be a float4 texture. I could also use a structure of arrays, but I was wondering whether a single array of small (cache-entry-matching) structures would perform better. As I said, coalescing is not possible, so the rule “SOA is better than AOS” should not apply in this case.
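To make the two layouts concrete, here is a minimal host-side sketch in C++. The struct names, the `data` field, and the helper `fetches_per_node` are hypothetical placeholders and not the actual node layout; the sketch only counts how many texel fetches one node needs with a float texture versus a float4 texture, and assumes a 4-byte float.

```cpp
#include <cstddef>

// Hypothetical 40-byte node: 10 floats, read from a texture of floats.
struct Node40 {
    float data[10];
};

// Hypothetical 48-byte node: 12 floats, i.e. 3 float4 texels.
// alignas(16) keeps every node on a float4 boundary.
struct alignas(16) Node48 {
    float data[12];
};

static_assert(sizeof(Node40) == 40, "Node40 is expected to be 40 bytes");
static_assert(sizeof(Node48) == 48, "Node48 is expected to be 48 bytes");

// Number of texel fetches needed to read one node (rounded up),
// given the node size and the texel size in bytes.
constexpr std::size_t fetches_per_node(std::size_t node_bytes,
                                       std::size_t texel_bytes) {
    return (node_bytes + texel_bytes - 1) / texel_bytes;
}
```

Under these assumptions, `fetches_per_node(sizeof(Node40), 4)` gives 10 fetches per node and `fetches_per_node(sizeof(Node48), 16)` gives 3. The larger node therefore needs fewer fetch instructions even though it moves more bytes.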


According to the paper “Demystifying GPU Microarchitecture through Microbenchmarking”, the L1 texture cache line width is 32 bytes and the L2 line width is 256 bytes.
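Taking the quoted line widths at face value, a rough way to compare node sizes is to count how many cache lines each node straddles. This is a minimal sketch in C++ and not a measurement. It assumes the nodes are tightly packed in one array that starts on a line boundary, and the function names are made up for illustration.

```cpp
#include <cstddef>

// Number of cache lines touched by node `index` in a tightly packed
// array whose first byte sits on a cache line boundary.
constexpr std::size_t lines_touched(std::size_t index,
                                    std::size_t node_bytes,
                                    std::size_t line_bytes) {
    const std::size_t first = (index * node_bytes) / line_bytes;
    const std::size_t last = (index * node_bytes + node_bytes - 1) / line_bytes;
    return last - first + 1;
}

// Average number of lines touched per node over the first `count` nodes.
inline double average_lines_touched(std::size_t count,
                                    std::size_t node_bytes,
                                    std::size_t line_bytes) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += lines_touched(i, node_bytes, line_bytes);
    }
    return static_cast<double>(total) / static_cast<double>(count);
}
```

With a 32-byte line, both a 40-byte and a 48-byte node always straddle two lines in this model, while a 32-byte node fits in one. With a 256-byte line, both sizes average 1.125 lines per node over one full alignment period (32 nodes of 40 bytes, or 16 nodes of 48 bytes). By this count alone, neither size has an advantage over the other; only a node size that divides the line width avoids straddling.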

Thanks! The paper also looks interesting at first glance; I’ll give it a try.