While developing an application, I noticed a strange phenomenon. To understand it better, I reduced it to a simplified experiment, which I describe here:
In the experiment, each thread of a warp accesses a data element located at a random position. I perform a very large number of accesses over a large amount of data. The position is generated pseudo-randomly from the thread's global ID in the grid and the current step of the program. I then vary the number of warps and blocks and compare the time taken to access elements of 4 bytes and of 32 bytes. Here is what I get; the values shown are the slowdown factor between the two cases:
Most of the results make sense: the cache lines are 32 bytes, so the number of memory accesses is about the same in both cases. The results for 8 bytes and 16 bytes are consistent with this and are about the same (within the variance of the experiments). The upper-left corner, on the other hand, remains a real mystery to me. Does anyone have an explanation?
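For reference, the experiment described above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the exact code used for the measurements: the hash function, the 256 MiB buffer, the step count, the sweep ranges, and the names `hashPos`, `randomRead`, and `timeRun` are all illustrative. It also assumes the slowdown factor is the 32-byte time divided by the 4-byte time, and it reads each element as consecutive 32-bit words, whereas wider vector loads may behave differently. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// Illustrative integer hash (an assumption, not the original generator):
// mixes the global thread ID and the step into a pseudo-random 32-bit value.
__device__ uint32_t hashPos(uint32_t tid, uint32_t step) {
    uint32_t x = tid * 2654435761u + step * 0x9E3779B9u;
    x ^= x >> 16; x *= 0x85EBCA6Bu;
    x ^= x >> 13; x *= 0xC2B2AE35u;
    x ^= x >> 16;
    return x;
}

// Each thread reads `steps` elements of N bytes at pseudo-random positions.
// An element is read as N/4 consecutive 32-bit words, aligned to N bytes.
template <int N>
__global__ void randomRead(const uint32_t* data, uint32_t numElems,
                           int steps, uint32_t* sink) {
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    uint32_t acc = 0;
    for (int s = 0; s < steps; ++s) {
        size_t base = (size_t)(hashPos(tid, s) % numElems) * (N / 4);
        for (int i = 0; i < N / 4; ++i) acc += data[base + i];
    }
    sink[tid] = acc;  // store the sum so the loads are not optimized away
}

// Runs one warm-up launch, then times a second launch; returns milliseconds.
template <int N>
float timeRun(const uint32_t* data, size_t bytes, uint32_t* sink,
              int blocks, int warpsPerBlock, int steps) {
    int threads = warpsPerBlock * 32;
    uint32_t numElems = (uint32_t)(bytes / N);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    randomRead<N><<<blocks, threads>>>(data, numElems, steps, sink);
    cudaEventRecord(start);
    randomRead<N><<<blocks, threads>>>(data, numElems, steps, sink);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 256u << 20;           // 256 MiB buffer (assumed size)
    const int steps = 1000;                    // accesses per thread (assumed)
    const int maxBlocks = 64, maxWarps = 32;   // sweep limits (assumed)
    uint32_t *data, *sink;
    cudaMalloc(&data, bytes);
    cudaMemset(data, 1, bytes);
    cudaMalloc(&sink, (size_t)maxBlocks * maxWarps * 32 * sizeof(uint32_t));
    for (int b = 1; b <= maxBlocks; b *= 2) {
        for (int w = 1; w <= maxWarps; w *= 2) {
            float t4  = timeRun<4>(data, bytes, sink, b, w, steps);
            float t32 = timeRun<32>(data, bytes, sink, b, w, steps);
            printf("blocks=%2d warps/block=%2d slowdown(32B/4B)=%.2f\n",
                   b, w, t32 / t4);
        }
    }
    cudaFree(data);
    cudaFree(sink);
    return 0;
}
```

Built with something like `nvcc -O3 sketch.cu -o sketch`, this prints one slowdown value per (blocks, warps per block) pair, which gives the same kind of grid as the one shown above.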
Thanks in advance!