char global memory access optimization

donniedarko · May 28, 2010, 12:18pm

Hi, I’m really sorry if my question is very trivial, but there is something I couldn’t exactly understand from the programming guide:

I have a coalesced memory space full of chars. When reading from or writing to global memory, does a 4-way bank conflict occur as it does in the shared memory case? What do you recommend? Thank you in advance.

cbuchner1 · May 28, 2010, 12:39pm

No bank conflicts in global memory, but uncoalesced reads - which are worse. This is on Compute 1.1 devices - not sure how 1.2 and better handle consecutive char reads from global memory.

I recommend casting the global memory pointer to int* and reading into shared memory with (N+3)/4 threads, as if it were an int array. N is the number of consecutive chars you need to access. Using int2* pointers and only (N+7)/8 threads might be slightly faster.

Then access shared memory as a char array - you might experience some 4-way bank conflicts here, but this is not as bad as the uncoalesced reads you would have had when accessing the data as chars from global memory. If you do a lot of read accesses to your char array in shared memory, consider creating 4 identical copies of the array at different bank offsets, and let the threads read from the arrays in alternating patterns (i.e. threadIdx.x%4).

Christian
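The thread counts (N+3)/4 and (N+7)/8 above are integer round-up divisions. A minimal host-side sketch of that arithmetic follows; the helper name `loader_threads` is hypothetical and not part of CUDA.

```cpp
#include <cassert>
#include <cstddef>

// Threads needed to cover n_bytes when each thread moves bytes_per_thread
// bytes (4 for an int load, 8 for an int2 load). Rounding up makes sure a
// partial tail is still covered by one extra thread.
std::size_t loader_threads(std::size_t n_bytes, std::size_t bytes_per_thread) {
    assert(bytes_per_thread > 0);
    return (n_bytes + bytes_per_thread - 1) / bytes_per_thread;
}
```

For N = 10 this gives 3 threads with int loads and 2 threads with int2 loads. Because of the rounding, the last thread reads past byte N-1, so the buffers need to be padded to a multiple of the load size.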

tera · May 28, 2010, 12:42pm

Global memory does not have bank conflicts.

Coalescing is a property of the access, not of the memory layout itself. I assume that you meant what is called a ‘packed array’ in Pascal and is the standard layout in C, namely, placing consecutive array members in adjacent bytes in memory.

Whether access from consecutive threads to consecutive array members will get coalesced depends on the compute capability of the device. Devices from compute capability 1.2 on have reorder buffers and will coalesce these just fine, trimming the transaction width to the minimal size necessary.

donniedarko · May 28, 2010, 12:52pm

Thanks for the quick replies. My access is coalesced, but I thought that since I have 8-bit data in memory, 4 threads access the same 32-bit word, and maybe this can cause conflicts as it does in the shared memory case.

cbuchner1 · May 28, 2010, 12:56pm

Have you verified this with the CUDA profiler?

Reading consecutive bytes with consecutive threads on compute 1.1 devices definitely won’t coalesce.

donniedarko · May 28, 2010, 1:05pm

my devices are compute 1.3 (gtx 285 and quadro fx 5800).

donniedarko · May 28, 2010, 1:06pm

my devices are compute 1.3 (gtx 285 and quadro fx 5800).

donniedarko · May 28, 2010, 1:12pm

my devices are compute 1.3 (gtx 285 and quadro fx 5800).

donniedarko · May 31, 2010, 8:11am

I am still confused. Is coalesced access to char global memory possible? What is the best way of reading chars from global memory?

avidday · May 31, 2010, 8:31am

No - coalescing only works for types whose size is a multiple of the word size (i.e. 32 bits/4 bytes). If you want to have coalesced char access, you probably need to think about an access scheme where each thread can read 4 contiguous, 32-bit aligned char values at a time (so the char4 vector type, or unsigned integers which you split into chars afterwards).
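A host-side sketch of the "read one 32-bit word, then split it into chars" idea. The function names are hypothetical, and the byte order assumes a little-endian target (which is what the GPU and x86 hosts use); in a kernel the load would typically be a read through an unsigned int* or char4* pointer instead of memcpy.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Load four consecutive bytes as one 32-bit word. memcpy sidesteps the
// aliasing and alignment pitfalls of a pointer cast in host code.
std::uint32_t load_word(const unsigned char* p) {
    std::uint32_t w;
    std::memcpy(&w, p, sizeof w);
    return w;
}

// Extract byte i (0..3) in memory order, assuming little-endian layout.
unsigned char byte_of(std::uint32_t w, unsigned i) {
    assert(i < 4);
    return static_cast<unsigned char>((w >> (8 * i)) & 0xFFu);
}
```

Each thread then issues a single 4-byte access and does the splitting in registers, which costs a few shifts and masks rather than extra memory transactions.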

cbuchner1 · May 31, 2010, 8:39am

Essentially what I tried to say in post #2 ;)

I just wasn’t 100% sure if Compute 1.2 and higher meanwhile have added a fix for chars and shorts.

Apparently not, because I do not remember reading anything about this in the programming guide.

Christian

avidday · May 31, 2010, 8:47am

As John Wayne would probably have said, “Pilgrim, you can lead a horse to water, but you can’t make it drink”…

The situation might be a little bit relaxed in compute 1.2/1.3 and 2.0, but coalescing still implies 1 transaction per half warp request, and that can’t be done (certainly not in 1.2/1.3, and probably not in 2.0). I have some Fermi cards up and running now, but I haven’t had any time to play with them and see what they do.

Nico · May 31, 2010, 9:10am

I believe there’s a partial fix:

Devices of Compute Capability 1.2 and 1.3

Threads can access any words in any order, including the same words, and a single memory transaction for each segment addressed by the half-warp is issued. This is in contrast with devices of compute capabilities 1.0 and 1.1 where threads need to access words in sequence and coalescing only happens if the half-warp addresses a single segment.

More precisely, the following protocol is used to determine the memory transactions necessary to service all threads in a half-warp:

- Find the memory segment that contains the address requested by the lowest numbered active thread. The segment size depends on the size of the words accessed by the threads:
    - 32 bytes for 1-byte words,
    - 64 bytes for 2-byte words,
    - 128 bytes for 4-, 8- and 16-byte words.
- Find all other active threads whose requested address lies in the same segment.
- Reduce the transaction size, if possible:
    - If the transaction size is 128 bytes and only the lower or upper half is used, reduce the transaction size to 64 bytes;
    - If the transaction size is 64 bytes (originally or after reduction from 128 bytes) and only the lower or upper half is used, reduce the transaction size to 32 bytes.
- Carry out the transaction and mark the serviced threads as inactive.
- Repeat until all threads in the half-warp are serviced.

If I’m interpreting this correctly, for a single warp reading consecutive chars this would add up to two transfers (one per half-warp) of 32 bytes each (coalesced). Each half-warp only uses 16 of those 32 bytes, so it looks like you’re still wasting half the bandwidth.

N.
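The protocol quoted above can be traced on the host. The following is only a sketch of my reading of those steps, not NVIDIA code: the names are hypothetical, all threads are assumed active, and every address is assumed to be aligned to the word size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Transaction {
    std::uint64_t base;  // first byte address covered
    unsigned size;       // 32, 64 or 128 bytes
};

// addr[i] is the byte address requested by thread i of the half-warp;
// word_size is the per-thread access size in bytes.
std::vector<Transaction> half_warp_transactions(
        const std::vector<std::uint64_t>& addr, unsigned word_size) {
    assert(word_size == 1 || word_size == 2 || word_size == 4 ||
           word_size == 8 || word_size == 16);
    const unsigned seg = word_size == 1 ? 32u : word_size == 2 ? 64u : 128u;
    std::vector<bool> active(addr.size(), true);
    std::vector<Transaction> out;

    for (std::size_t lead = 0; lead < addr.size(); ++lead) {
        if (!active[lead]) continue;  // already serviced
        // Segment of the lowest numbered active thread.
        std::uint64_t base = addr[lead] / seg * seg;
        std::uint64_t lo = addr[lead];
        std::uint64_t hi = addr[lead] + word_size - 1;
        // Collect every active thread whose address falls in that segment.
        for (std::size_t i = lead; i < addr.size(); ++i) {
            if (active[i] && addr[i] / seg * seg == base) {
                if (addr[i] < lo) lo = addr[i];
                if (addr[i] + word_size - 1 > hi) hi = addr[i] + word_size - 1;
                active[i] = false;
            }
        }
        // Halve the transaction while only one half is touched (128 -> 64 -> 32).
        unsigned size = seg;
        while (size > 32) {
            const unsigned half = size / 2;
            if (hi < base + half) {
                size = half;                  // only the lower half is used
            } else if (lo >= base + half) {
                base += half;                 // only the upper half is used
                size = half;
            } else {
                break;
            }
        }
        out.push_back({base, size});
    }
    return out;
}

// Thread i requests address start + i * word_size (n threads, 16 by default).
std::vector<std::uint64_t> consecutive(std::uint64_t start, unsigned word_size,
                                       unsigned n = 16) {
    std::vector<std::uint64_t> a(n);
    for (unsigned i = 0; i < n; ++i) a[i] = start + std::uint64_t(i) * word_size;
    return a;
}
```

Under this model, `half_warp_transactions(consecutive(0, 1), 1)` yields one 32-byte transaction for 16 useful bytes, while `half_warp_transactions(consecutive(0, 4), 4)` yields one 64-byte transaction that is fully used. Starting the char reads at a misaligned address such as 24 splits the half-warp across two segments and doubles the transaction count.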

laughingrice · May 31, 2010, 11:03am

As others have said, coalescing depends on 32bit or larger accesses per thread.

On the other hand, coalescing is a device 1.0/1.1 feature. Although the name hasn’t been dropped officially, it’s not so much the right term for 1.2/1.3.

With devices 1.2 and up, the card has coalescing buffers, and if possible, it groups reads/writes into 32, 64 or 128 bytes. These groupings are not sensitive to order but are sensitive to alignment. If you look at the profiler, there are no coalesced reads/writes entries any more but rather 32, 64 and 128 byte entries instead.

With bytes, it’s usually beneficial to access using textures (assuming you only need read access), and then you get caches to combine reads between half warps, as coalescing buffers only combine reads within a half warp. (Actually I found that in a lot of cases, textures can be faster than coalesced reads as well.) The other beneficial thing to do is handle 4 bytes per thread and read/write via shared memory.

donniedarko · May 31, 2010, 2:22pm

How can I control the number of threads in different parts of the kernel? Using if?

cbuchner1 · May 31, 2010, 2:46pm

something like this might work

// let N be the number of consecutive bytes you need
// g_mem (type unsigned char*) points to global memory location of the bytes;
// it must be 4-byte aligned and readable up to the next multiple of 4 past N

// dynamically assigned shared memory, size computed by the host, needs to be
// at least N bytes, rounded up to a multiple of 4
extern __shared__ unsigned char shared[];

// offset of s_bytes within shared[] (a multiple of 4), e.g. pass from the host
unsigned int off1 = 0;

// shared memory array to hold byte data
unsigned char *s_bytes = &shared[off1];

if (threadIdx.x < (N+3)/4)
{
    // perform a coalesced read of 32 bit integers into an unsigned char array
    *(unsigned int*)(&s_bytes[threadIdx.x * 4]) = *(unsigned int*)(&g_mem[threadIdx.x * 4]);
}

// make sure all loads have landed in shared memory before any thread reads them
__syncthreads();

// assuming you have cuPrintf from nVidia developer site...
cuPrintf("Byte %d is %d\n", (unsigned int)threadIdx.x, (unsigned int)s_bytes[threadIdx.x]);

donniedarko · May 31, 2010, 4:16pm

Hey Christian, I can’t thank you enough! Thank you very much for your help. This was the solution I had in mind, but I kept getting errors. There is one more thing I want to ask: at the last step, when reading from shared memory, there are still bank conflicts, right?

cbuchner1 · May 31, 2010, 4:48pm

yes this implementation has bank conflicts.

But it all depends on how you read it ;) Read it with a stride of 4 and you’re fine ;)

Christian
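To see why a stride of 4 helps, here is a rough host-side model of shared memory banking on compute capability 1.x: 16 banks, each 32 bits wide, counted per half-warp. It is a simplification under stated assumptions (the names are hypothetical, and the broadcast mechanism is ignored), so treat the numbers as an illustration rather than a profiler result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Worst-case number of threads in one half-warp that land on the same bank.
// byte_index[i] is the shared memory byte offset read by thread i.
unsigned conflict_degree(const std::vector<std::size_t>& byte_index) {
    const std::size_t kBanks = 16;      // banks on compute capability 1.x
    const std::size_t kBankWidth = 4;   // bytes per bank word
    assert(byte_index.size() <= 16);    // one half-warp at most
    unsigned hits[16] = {0};
    unsigned worst = 0;
    for (std::size_t b : byte_index) {
        const std::size_t bank = (b / kBankWidth) % kBanks;
        if (++hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}

// Thread i reads byte offset i * stride (n threads, 16 by default).
std::vector<std::size_t> strided(std::size_t stride, unsigned n = 16) {
    std::vector<std::size_t> idx(n);
    for (unsigned i = 0; i < n; ++i) idx[i] = i * stride;
    return idx;
}
```

With `strided(1)` (thread i reads byte i) four threads share each bank, which is the 4-way conflict discussed in this thread; with `strided(4)` every thread hits its own bank and the model reports no conflict.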
