Shared memory bandwidth

Is there any official NVIDIA citation for the actual shared memory bandwidth? Unofficial will do as well…

Any references would be much appreciated

I’m quoting this from memory, so someone will likely correct me if I’m wrong. You should be able to read from shared memory in 1 clock cycle, but there are a couple of caveats. One is the potential for bank conflicts, which reduce the bandwidth. Another is that you will usually need to compute some sort of offset to index into shared memory, and that address arithmetic itself takes more than 1 clock cycle.
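To make the bank-conflict caveat concrete, here is a toy sketch (my own untested example, assuming G80-class hardware with 16 shared memory banks of 32-bit words, where a conflict occurs when two threads of a half-warp hit the same bank):

__global__ void bankDemo(float *out)
{
    // Toy demo of shared memory bank conflicts (untested sketch).
    // Assumes 16 banks of 32-bit words and a 16-thread half-warp.
    __shared__ float tile[16][16];

    int tx = threadIdx.x;                 // 0..15 within a half-warp

    for (int r = 0; r < 16; ++r)          // fill the tile, conflict-free
        tile[r][tx] = (float)(r + tx);
    __syncthreads();

    // Conflict-free read: consecutive threads read consecutive 32-bit
    // words, so each thread of the half-warp hits a different bank.
    float a = tile[0][tx];

    // 16-way conflict: the row stride is 16 words, so tile[tx][0] lands
    // in bank 0 for every thread and the 16 reads are serialized.
    float b = tile[tx][0];

    out[tx] = a + b;
}

The second read alone takes roughly 16 times as long as the first, which is exactly the bandwidth loss I mean.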

In a simple test I performed in this thread: http://forums.nvidia.com/index.php?showtopic=46742&hl= I was able to get an effective total bandwidth of 233.36 GB/s from shared memory. More is probably possible with unrolled loops.

What kind of loops are you talking about, sir?

Thanks.

On page 5 of the CUDA Programming Guide, it says:

“applications can take advantage of it by minimizing overfetch and round-trips to DRAM and therefore becoming less dependent on DRAM memory bandwidth”

My question is:

What could this possibly mean?

It means that by reading data into shared memory once and then reading it multiple times from shared memory (as opposed to from global memory each time), you will get improved performance.

One of the uses of shared memory is as a user-managed cache. The other one is communication between threads in the same threadblock.
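For example, here is a minimal sketch (my own example, not from the Guide) of the user-managed-cache idea, a 1D smoothing filter that assumes blockDim.x == TILE: each input element is fetched from global memory once per block and then read 2*RADIUS+1 times from shared memory.

#define TILE   256
#define RADIUS 3

__global__ void smooth(const float *in, float *out, int n)
{
    // Shared memory as a user-managed cache: one global read per
    // element (plus a small halo), many cheap shared memory reads.
    __shared__ float s[TILE + 2 * RADIUS];

    int g = blockIdx.x * TILE + threadIdx.x;   // global index
    int l = threadIdx.x + RADIUS;              // local index into s[]

    s[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x < RADIUS) {                // load the halo elements
        int left  = g - RADIUS;
        int right = g + TILE;
        s[l - RADIUS] = (left  >= 0) ? in[left]  : 0.0f;
        s[l + TILE]   = (right <  n) ? in[right] : 0.0f;
    }
    __syncthreads();

    // 2*RADIUS+1 reads per thread, all from shared memory, not DRAM.
    float sum = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; ++k)
        sum += s[l + k];

    if (g < n)
        out[g] = sum / (2 * RADIUS + 1);
}

Without the shared array, each thread would issue 2*RADIUS+1 global reads instead of roughly one, and that difference is the “minimizing round-trips to DRAM” the Guide talks about.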

Paulius

Oh, never mind. I already unrolled the loops in that code. Sorry for the confusion. Just look at the forum post I linked to; it explains all the details of my little benchmark.
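Roughly, the timed loop has this shape (a simplified sketch, not the exact code from the linked thread):

#define N_ITER 1024

__global__ void smemRead(float *out)
{
    // Simplified shared memory read benchmark (sketch, assumes
    // blockDim.x == 256).  Each thread does N_ITER 4-byte reads.
    __shared__ float s[256];

    int tx = threadIdx.x;
    s[tx] = (float)tx;
    __syncthreads();

    float acc = 0.0f;
    int i = tx;
    #pragma unroll 8                 // cut the loop overhead
    for (int it = 0; it < N_ITER; ++it) {
        acc += s[i];
        i = (i + 1) & 255;           // stay in bounds, conflict-free
    }

    // Write the result so the compiler can’t optimize the reads away.
    out[blockIdx.x * blockDim.x + tx] = acc;
}

The effective bandwidth is then blocks × threads × N_ITER × 4 bytes divided by the kernel time.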


It means that by reading data into shared memory once and then reading it multiple times from shared memory (as opposed to from global memory each time), you will get improved performance.


Well now, here’s another question. When you read data into shared memory (which is a write, right?) and then immediately read it multiple times FROM shared memory, would there be data dependences? From your experience, or from anywhere in the Programming Guide, how many cycles does it take for a thread to write data to main memory?

Global memory latency for reading is 400 to 600 cycles (according to the Programming Manual); I believe writing has similar latency.

However, it is not clear from your question what you are asking: you begin with reading data into shared memory, then ask something about data dependences (which depend on the algorithm you’re going to implement), and end up with ‘main memory’ latency… So, what was the question? :)

Maybe I mixed it up because I was wondering whether data dependences add to the memory latency as well.


You said: ‘I believe writing has similar latency.’


It’d be nice if we knew exactly what the write latency is, because in general writes take longer than reads.

The write latency matters not at all. A thread has no need to re-read a value from global memory (since it can keep that value in a register), and there is no global synchronization. So writes are “fire-and-forget” in the hardware: the thread keeps executing instructions with no stall after handing the write to the memory controller. This has been stated by an NVIDIA employee in the past, but the search function is down and I can’t find the post. I’m pretty sure the keywords “fire and forget” will lead to it.