Details of Global and L2 cache configuration in Tesla K40

duttasankha · June 29, 2015, 6:21pm

Hi

I am curious about some configuration details of the global memory and L2 cache of the Tesla K40. I searched for answers to a few questions but couldn’t find much information, so I thought I would ask here. My questions are listed below:

How many banks are present in the global memory of the K40?
I saw that the global memory bus width is 384 bits, but what is the width of each bank (that is, if I exceed that width, is the data then allocated in the next bank)?
How is the whole global memory divided among the banks? I mean, what is the size of each bank?
I searched for a microbenchmark but couldn’t find one. Is there a microbenchmark available that could help me understand these details?
How is the L2 connected to each of the global memory banks?
According to this link, http://www.pcper.com/reviews/Graphics-Cards/NVIDIA-Discloses-Full-Memory-Structure-and-Limitations-GTX-970, for Maxwell the L2 banks are the same as the global memory banks and are connected to each bank. Is this the same for the K40 as well? If so, the same questions apply to the L2: is there a microbenchmark to find those details for the L2?

Any help would be really appreciated. Thank you.

Uncle_Joe · June 30, 2015, 5:11am

The answer is right in front of you!

External Media

You can think of the 8 banks of memory as 8 independent memory channels just like on Intel CPUs.

The address space is divided round robin across the banks/channels:

word0 → bank 0
word1 → bank 1
…
word7 → bank 7
word8 → bank 0
word9 → bank 1
…

The article says the stride to span all 8 banks is 1 KiB, so each word is 1 KiB / 8 = 128 bytes. However, each bank is only 4 bytes wide, so to get 128 bytes it does a burst access of length 32 (with bandwidth always increasing while latency stays constant, you have to use larger and larger block sizes to utilize the available bandwidth).
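The round-robin mapping and the arithmetic above can be sketched in a few lines of host-side C++. This is only an illustration of the reasoning: the constants are the figures quoted from the GTX 970 article, not measured K40 values, and `bank_of` is a hypothetical helper name, not part of any CUDA API.

```cpp
#include <cstdint>

// Figures quoted in the post above (GTX 970 article). They are assumptions
// for illustration and have not been verified for the Tesla K40.
constexpr std::uint64_t kNumBanks       = 8;     // independent banks/channels
constexpr std::uint64_t kStrideBytes    = 1024;  // stride that spans all banks
constexpr std::uint64_t kBankWidthBytes = 4;     // 32-bit wide bank

// Bytes that land in one bank before moving on to the next one.
constexpr std::uint64_t kWordBytes = kStrideBytes / kNumBanks;        // 128

// Number of back-to-back 4-byte transfers needed to move one word.
constexpr std::uint64_t kBurstLength = kWordBytes / kBankWidthBytes;  // 32

// Hypothetical helper: bank that a byte address maps to under plain
// round-robin interleaving at kWordBytes granularity.
constexpr std::uint64_t bank_of(std::uint64_t byte_address) {
    return (byte_address / kWordBytes) % kNumBanks;
}

static_assert(kWordBytes == 128, "1 KiB spread over 8 banks");
static_assert(kBurstLength == 32, "128 bytes over a 4-byte wide bank");
```

Under these assumptions `bank_of(0)` is 0, `bank_of(128)` is 1, `bank_of(896)` is 7, and `bank_of(1024)` wraps back to 0. Real hardware may hash the address bits rather than interleave them this simply, so treat the mapping as a mental model only.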

I think you can assume that the L2 organization is the same as the external memory.

I don’t think knowing these details will help speed up your code. I seriously doubt that you can access all 8 banks in parallel from a single warp if that’s what you’re thinking.

As long as your memory accesses are aligned to 128 bytes (for each transaction), then throughput should be good.
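To see why alignment matters, here is a rough host-side C++ model that counts how many 128-byte segments a single warp touches. It assumes the 128-byte transaction size mentioned above and a warp of 32 threads; `segments_touched` is a hypothetical helper, and actual hardware may split accesses at a different granularity.

```cpp
#include <cstdint>
#include <set>

constexpr std::uint64_t kSegmentBytes = 128;  // assumed transaction size
constexpr int kWarpSize = 32;                 // threads per warp

// Hypothetical helper: number of distinct 128-byte segments touched when
// lane i of a warp reads elem_bytes bytes at base + i * stride_bytes.
// elem_bytes must be at least 1.
inline int segments_touched(std::uint64_t base,
                            std::uint64_t stride_bytes,
                            std::uint64_t elem_bytes) {
    std::set<std::uint64_t> segments;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const std::uint64_t first = base + lane * stride_bytes;
        const std::uint64_t last  = first + elem_bytes - 1;
        // An element may straddle a segment boundary, so record every
        // segment between its first and last byte.
        for (std::uint64_t s = first / kSegmentBytes;
             s <= last / kSegmentBytes; ++s) {
            segments.insert(s);
        }
    }
    return static_cast<int>(segments.size());
}
```

In this model, 32 consecutive 4-byte reads starting at an aligned address (`segments_touched(0, 4, 4)`) touch one segment, the same reads starting at byte 64 touch two, and a 128-byte stride touches 32.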

duttasankha · July 1, 2015, 1:30am

Hi

Thank you for your reply. However, the information you provided does not address my query. I am clear about the information on that website, and I didn’t ask anyone to reiterate what is in the link from my first post. I understood everything provided there, but that is for Maxwell. I am seeking the same kind of information (more specifically, answers to the questions I listed) for the Kepler architecture of the K40, and I would like to know whether there is any microbenchmark that would help me understand these details.

sBc-Random · April 8, 2019, 4:01pm

Does anyone have an update for this thread for the Voltas?

Thanks!
