I am curious about some configuration details of the global memory and L2 cache of the Tesla K40. I have a few questions that I searched for but couldn't find much information on, so I thought I would ask here. My questions are listed below:
How many banks are present in the global memory of K40?
I saw that the global memory bus width is 384 bits, but what is the width of each bank (that is, if I exceed that width, would the data be placed in the next bank)?
How is the whole global memory divided among the banks; in other words, what is the size of each bank?
I searched for a microbenchmark but couldn't find one. Is there a microbenchmark available that could help me understand these details?
You can think of the 8 banks of memory as 8 independent memory channels just like on Intel CPUs.
The address space is divided round robin across the banks/channels:
word0 → bank 0
word1 → bank 1
…
word7 → bank 7
word8 → bank 0
word9 → bank 1
…
The article says the stride to span all 8 banks is 1 KiB, which means each word is 128 bytes (1024 / 8). However, each bank is only 4 bytes wide, so to transfer 128 bytes it does a burst of 32 accesses (these days, with bandwidth always increasing while latency stays constant, you have to use larger and larger block sizes to utilize the available bandwidth).
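As a rough sketch of the arithmetic above (Python, purely illustrative): the constants are the figures quoted from the article and are not confirmed for the K40, and the function names are made up for this example. It assumes a plain round-robin interleave with no address hashing, which real hardware may not use.

```python
# Figures quoted from the article in this reply; NOT confirmed for the K40.
NUM_BANKS = 8          # number of banks/channels
WORD_BYTES = 128       # interleave granularity (1 KiB stride / 8 banks)
BANK_WIDTH_BYTES = 4   # width of a single bank


def bank_of(addr: int) -> int:
    """Bank that a byte address maps to under simple round-robin interleaving."""
    return (addr // WORD_BYTES) % NUM_BANKS


def burst_length() -> int:
    """Number of bank-width transfers needed to move one 128-byte word."""
    return WORD_BYTES // BANK_WIDTH_BYTES


def stride_bytes() -> int:
    """Address distance after which the bank pattern repeats."""
    return NUM_BANKS * WORD_BYTES
```

With these assumed numbers, bytes 0 to 127 land in bank 0, byte 128 starts bank 1, and byte 1024 wraps back to bank 0.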
I think you can assume that the L2 organization is the same as the external memory.
I don’t think knowing these details will help speed up your code. I seriously doubt that you can access all 8 banks in parallel from a single warp if that’s what you’re thinking.
As long as your memory accesses are aligned to 128 bytes (for each transaction), throughput should be good.
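To make the alignment point concrete, here is a small, simplified model (Python, illustrative only) that counts how many 128-byte segments a single warp's access touches. It assumes 32 threads per warp and 128-byte transactions as described above; the helper name and its parameters are hypothetical, and real hardware may split or cache accesses differently.

```python
SEGMENT_BYTES = 128  # assumed transaction size, as stated in this reply
WARP_SIZE = 32       # threads per warp


def segments_touched(base_addr: int, elem_bytes: int = 4, stride_elems: int = 1) -> int:
    """Count distinct 128-byte segments touched when lane i reads
    elem_bytes at base_addr + i * stride_elems * elem_bytes."""
    segments = set()
    for lane in range(WARP_SIZE):
        first = base_addr + lane * stride_elems * elem_bytes
        last = first + elem_bytes - 1
        # An element can straddle a segment boundary, so record every segment it covers.
        segments.update(range(first // SEGMENT_BYTES, last // SEGMENT_BYTES + 1))
    return len(segments)
```

In this model, an aligned, unit-stride read of 4-byte elements touches a single segment, the same read shifted by 4 bytes touches two, and a stride of 32 elements touches 32 separate segments.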
Thank you for your reply. However, the information you provided does not address my query. I am clear about the information on the website, and I didn't ask anyone to reiterate what is in the link in my first post. I understood what the website provides, but that is for Maxwell. I am seeking the same kind of information (more specifically, answers to the questions I listed) for the Kepler architecture, in particular the K40, and I would like to know whether there is a microbenchmark that would help me understand these details.