Device 0: “GeForce 8800 GTX”
Major revision number: 1
Minor revision number: 0
Total amount of global memory: 804585472 bytes
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1350000 kilohertz
What is the expected maximum memory bandwidth, and how do I get it in my threads?
Has anyone done a study of the bandwidth under different loading schemes in a CUDA thread?
E.g. should I load float4's or char1's, should every thread do a load or only some of them, and should I load via texture fetch or plain device memory access?
Which memory bandwidth are you asking about?
Device memory has an achievable bandwidth of about 70 GB/s; host-to-device bandwidth is limited by the PCI-Express bus and is about 2-2.5 GB/s.
To get good device memory bandwidth you need to make all your memory accesses coalesced (see the Programming Guide for details). You should also avoid float4, as it seems to break coalescing in some way (others have reported problems with it). 8-bit and 16-bit accesses are not coalesced either, so you have to use 32- or 64-bit types to get maximum bandwidth.
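For example, a coalesced pattern looks roughly like this (just a minimal sketch, the kernel and array names are made up):

__global__ void copy_coalesced(const float *in, float *out)
{
    // Thread k of each half-warp touches element k, so consecutive threads
    // read and write consecutive 32-bit words. This coalesces, assuming the
    // arrays come from cudaMalloc and are therefore properly aligned.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = in[idx];
}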
Textures are good for read-only access when you cannot ensure coalescing. If your accesses are coalesced then textures won’t give you any benefit.
If all threads in a warp read the same memory address, then constants may be a good choice, as they are cached.
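For example (again only a sketch, the symbol names are made up):

// One coefficient read by every thread; constant memory is cached, so a
// warp-wide read of the same address is cheap.
__constant__ float c_scale;

__global__ void scale_array(const float *in, float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = in[idx] * c_scale;   // all threads read the same constant
}

// Host side: cudaMemcpyToSymbol(c_scale, &h_scale, sizeof(float));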
Thanks, I was asking about the device memory to shared memory/register bandwidth.
So a simple smem[threadid] = device[threadid] is a good way to go (with correct alignment)? I guess I would get shared memory bank conflicts if I used 64-bit types here, though?
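i.e. something like this hypothetical sketch (assuming 256 threads per block):

__global__ void stage_to_smem(const float *device)
{
    __shared__ float smem[256];                          // one word per thread
    int tid = threadIdx.x;
    smem[tid] = device[blockIdx.x * blockDim.x + tid];   // coalesced load
    __syncthreads();
    // ... work on smem ...
}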
Do you know how much data is read from memory for each coalesced read?
The important thing is that consecutive threads in a warp access consecutive memory locations, regardless of whether it is a 32- or 64-bit read per thread. The hardware then merges the reads together to match the width of the memory bus (384 bits per read in your case).
I guess bank conflicts are unavoidable here. Anyway, coalescing global memory accesses is of much higher importance than shared memory bank conflicts, so just ignore them.
Thinking about device memory reads vs texture fetches.
If my threads were reading a 2D plane ‘row by row’, would the ‘2D locality’ awareness of 2D texture fetches help me when the threads move to the next row?
What does the texture fetching hardware do in response to a 2D texture fetch?
Thanks, now that raises another question in my small brain…
If the G80 has 8 multiprocessors and each can run 32 (16+16) threads at once, that means we get 256 threads active at once, which doesn't seem to correlate with the 384-bit memory reads…
It's not clear how the hardware deals with the partial read at the end. Only the early G80 cards with their unusual 384- and 320-bit buses have widths that don't evenly divide 32 threads x 32 bits = 1024 bits. All the later cards have 256-bit (or narrower) memory buses, which can service a coalesced read for a warp without wasted bits.
Perhaps this problem, along with the cost of a wide memory bus, led to the reduction in memory bus width in the later cards. Even the leaked specs for the upcoming 9800 GX2 and GTX suggest that future cards will be sticking to 256-bit memory interfaces. (This is all pure speculation, so don't take it too seriously.)
What do you mean by “row by row”? If you mean threads in a warp accessing across a row, that can easily be coalesced. If you mean threads in a warp accessing down a column, then just transpose your data and it is coalesced. If threads in a warp access both across rows and down columns but with good 2D locality within the warp, then the 2D texture cache will get you close to coalesced performance. If threads in a warp read semi-randomly with 1D locality, then you can use a 1D texture bound to device memory.
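For the semi-random 1D case, binding a 1D texture to linear device memory might look roughly like this (a sketch using the texture reference API; the names are made up):

texture<float, 1, cudaReadModeElementType> texIn;

__global__ void gather(const int *indices, float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Cached read through the texture unit; no coalescing requirement.
    out[idx] = tex1Dfetch(texIn, indices[idx]);
}

// Host side: cudaBindTexture(0, texIn, d_in, num_elements * sizeof(float));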
I'm still unsure about coalescing global memory accesses.
In the simpleStreams example from NVIDIA there is the following code:
__global__ void init_array(int *g_data, int value)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_data[idx] = value; // uncoalesced on purpose to burn some time
}
But when I look at this code I think it is coalesced, isn't it? Consecutive threads access consecutive memory locations (32-bit integers). Yet the comment says “uncoalesced on purpose to burn some time”… What's the problem here?
Greets burnie
EDIT: In the actual example, an address in global memory is passed instead of value, so I think the “uncoalesced” comment is then correct…
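i.e. I guess the relevant part looks more like this (my guess, not the exact sample code):

__global__ void init_array(int *g_data, const int *factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // The write to g_data[idx] is coalesced, but every thread reads the
    // same word through *factor, which is not a coalesced read pattern.
    g_data[idx] = *factor;
}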