Device 0: “GeForce 8800 GTX”
Major revision number: 1
Minor revision number: 0
Total amount of global memory: 804585472 bytes
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1350000 kilohertz
What is the expected maximum memory bandwidth, and how do I get it in my threads?
Has anyone done a study of the bandwidth under different loading schemes in a CUDA thread?
E.g. should I load float4's or char1's, should every thread do a load or only some of them, and should I load via texture fetch or plain device memory access?
Which memory bandwidth are you asking about?
Device memory has an achievable bandwidth of about 70 GB/s; host-to-device bandwidth is limited by the PCI-Express bus and is about 2-2.5 GB/s.
To get good device memory bandwidth you need to make all your memory accesses coalesced (see the Programming Guide for details). You should also avoid float4, as it seems to break coalescing in some way (others have reported problems with it). 8-bit and 16-bit accesses are not coalesced either, so you have to use 32- or 64-bit types to get maximum bandwidth.
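For example, a coalesced pattern looks roughly like this (just a minimal sketch, the kernel and array names are made up):

__global__ void copy_coalesced(const float *in, float *out)
{
    // Thread k of each half-warp touches element k, so consecutive threads
    // read and write consecutive 32-bit words. This coalesces, assuming the
    // arrays come from cudaMalloc and are therefore properly aligned.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = in[idx];
}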
Textures are good for read-only access when you cannot ensure coalescing. If your accesses are coalesced then textures won’t give you any benefit.
If all threads in a warp read the same memory address, then constants may be a good choice, as they are cached.
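For example (again only a sketch, the symbol names are made up):

// One coefficient read by every thread; constant memory is cached, so a
// warp-wide read of the same address is cheap.
__constant__ float c_scale;

__global__ void scale_array(const float *in, float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = in[idx] * c_scale;   // all threads read the same constant
}

// Host side: cudaMemcpyToSymbol(c_scale, &h_scale, sizeof(float));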
Thanks, I was asking about the device memory to shared memory/register bandwidth.
So a simple smem[threadid] = device[threadid] is a good way to go (with correct alignment)? I guess I would get shared memory bank conflicts if I used 64-bit types here, though?
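i.e. something like this hypothetical sketch (assuming 256 threads per block):

__global__ void stage_to_smem(const float *device)
{
    __shared__ float smem[256];                          // one word per thread
    int tid = threadIdx.x;
    smem[tid] = device[blockIdx.x * blockDim.x + tid];   // coalesced load
    __syncthreads();
    // ... work on smem ...
}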
Do you know how much data is read from memory for each coalesced read?
The important thing is that consecutive threads in a warp access consecutive memory locations, regardless of whether it is a 32- or 64-bit read per thread. The hardware then merges the reads together to match the width of the memory bus (384 bits per read in your case).
I guess bank conflicts are unavoidable here. Anyway, coalescing global memory accesses is of much higher importance than shared memory bank conflicts, so just ignore them.
Thinking about device memory reads vs texture fetches.
If my threads were reading a 2D plane ‘row by row’, would the ‘2D locality’ awareness of 2D texture fetches help me when the threads move to the next row?
What does the texture fetching hardware do in response to a 2D texture fetch?
Thanks, now that raises another question in my small brain…
If the G80 has 8 multiprocessors and each can run 32 (16+16) threads at once, that means we get 256 threads active at once, which doesn't seem to correlate with the 384-bit memory reads…
It's not clear how the hardware deals with the partial read at the end. Only the early G80 cards with their unusual 384- and 320-bit buses have widths that don't evenly divide 32 threads x 32 bits = 1024 bits. All the later cards have 256-bit (or narrower) memory buses, which can service a coalesced read for a warp without wasted bits.
Perhaps this problem, along with the cost of a wide memory bus, led to the reduction in memory bus width in the later cards. Even the leaked specs for the upcoming 9800 GX2 and GTX suggest that future cards will be sticking to 256-bit memory interfaces. (This is all pure speculation, so don't take it too seriously.)
What do you mean by “row by row”? If you mean threads in a warp accessing across a row, that can easily be coalesced. If you mean threads in a warp accessing down a column, then just transpose your data and it is coalesced. If threads in a warp access both across rows and down columns but with good 2D locality within the warp, then the 2D texture cache will get you close to coalesced performance. If threads in a warp read semi-randomly with 1D locality, then you can use a 1D texture bound to device memory.
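For the semi-random 1D case, binding a 1D texture to linear device memory might look roughly like this (a sketch using the texture reference API; the names are made up):

texture<float, 1, cudaReadModeElementType> texIn;

__global__ void gather(const int *indices, float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Cached read through the texture unit; no coalescing requirement.
    out[idx] = tex1Dfetch(texIn, indices[idx]);
}

// Host side: cudaBindTexture(0, texIn, d_in, num_elements * sizeof(float));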
I'm still unsure about coalescing global memory accesses.
In the simpleStreams example from NVIDIA there is the following code:
__global__ void init_array(int *g_data, int value)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_data[idx] = value; // uncoalesced on purpose to burn some time
}
But when I look at this code I think it is coalesced, isn't it? Consecutive threads access consecutive memory locations (32-bit integers). Yet the comment says “uncoalesced on purpose to burn some time”… What's the problem here?
Greets burnie
EDIT: In the actual example, an address in global memory is passed instead of value, so I think the “uncoalesced” comment is then correct…
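i.e. I guess the relevant part looks more like this (my guess, not the exact sample code):

__global__ void init_array(int *g_data, const int *factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // The write to g_data[idx] is coalesced, but every thread reads the
    // same word through *factor, which is not a coalesced read pattern.
    g_data[idx] = *factor;
}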