char global memory access optimization

Hi, I’m really sorry if my question is very trivial, but there is something I couldn’t exactly understand from the programming guide:

I have a coalesced memory space full of chars. When reading from or writing to global memory, do 4-way bank conflicts occur as they do in the shared memory case? What do you recommend? Thank you in advance.

No bank conflicts in global memory, but uncoalesced reads - which are worse. This is on Compute 1.1 devices - not sure how 1.2 and better handle consecutive char reads from global memory.

I recommend casting the global memory pointer to int* and performing a read into shared memory with (N+3)/4 threads, as if it were an int array, where N is the number of consecutive chars you need to access. Slightly faster might be to use int2* pointers and only (N+7)/8 threads.

Then access shared memory as a char array - you might experience some 4-way bank conflicts here, but this is not as bad as the uncoalesced reads you would have had when accessing the data as chars from global memory. If you do a lot of read accesses to your char array in shared memory, consider creating 4 identical copies of the array at different bank offsets, and let the threads read from the copies in an alternating pattern (i.e. indexed by threadIdx.x % 4).


Global memory does not have bank conflicts.

Coalescing is a property of the access, not of the memory layout itself. I assume that you meant what is called a ‘packed array’ in Pascal and is the standard layout in C, namely, placing consecutive array members in adjacent bytes in memory.

Whether access from consecutive threads to consecutive array members will get coalesced depends on the compute capability of the device. Devices from compute capability 1.2 on have reorder buffers and will coalesce these just fine, trimming the transaction width to the minimal size necessary.

Thanks for the quick replies. My access is coalesced, but I thought that since I have 8-bit data in memory and 4 threads access the same 32-bit word, this might cause conflicts as it does in the shared memory case.

have you verified this with the CUDA profiler?

Reading consecutive bytes with consecutive threads on compute 1.1 devices definitely won’t coalesce.

my devices are compute 1.3 (gtx 285 and quadro fx 5800).

I am still confused. Is coalesced access to char global memory possible? What is the best way of reading chars from global memory?

No - coalescing only works for types whose size is a multiple of the word size (i.e. 32 bits / 4 bytes). If you want coalesced char access, you probably need to think about an access scheme where each thread reads 4 contiguous, 32-bit aligned char values at a time (e.g. via the char4 vector type, or via unsigned integers that you split into chars afterwards).

Essentially what I tried to say in post #2 ;)

I just wasn’t 100% sure if Compute 1.2 and higher meanwhile have added a fix for chars and shorts.

Apparently not, because I do not remember reading anything about this in the programming guide.


As John Wayne would probably have said, “Pilgrim, you can lead a horse to water, but you can’t make it drink”…

The situation might be a little more relaxed on compute 1.2/1.3 and 2.0, but coalescing still implies 1 transaction per half-warp request, and that can’t be done (certainly not on 1.2/1.3, and probably not on 2.0). I have some Fermi cards up and running now, but I haven’t had time to play with them and see what they do.

I believe there’s a partial fix:

If I’m interpreting this correctly, for a single warp this would add up to two transfers (one per half warp) of 32 bytes each (coalesced). So it looks like you’re still wasting half the bandwidth.


As others have said, coalescing depends on 32-bit or larger accesses per thread.

On the other hand, coalescing is a compute 1.0/1.1 feature. Although the name hasn’t been dropped officially, it’s not really the right term for 1.2/1.3 any more.

With devices 1.2 and up, the card has coalescing buffers, and if possible it groups reads/writes into 32, 64 or 128 byte transactions. These groupings are not sensitive to order, but they are sensitive to alignment. If you look at the profiler, there are no coalesced read/write entries any more, but rather 32/64/128 byte transaction entries instead.

With bytes, it’s usually beneficial to access them through textures (assuming you only need read access); you then get caches that combine reads between half warps, whereas the coalescing buffers only combine reads within a half warp. (Actually I found that in a lot of cases textures can be faster than coalesced reads as well.) The other beneficial thing to do is to handle 4 bytes per thread and read/write via shared memory.

how can I control the number of threads in the different parts of the kernel? Using if? :">

something like this might work

// let N be the number of consecutive bytes you need
// g_mem (type unsigned char*) points to the global memory location of the bytes

// dynamically allocated shared memory, size computed by the host, needs to be
// at least N bytes, rounded up to a multiple of 4
extern __shared__ unsigned char shared[];

// offset of s_bytes within shared[], e.g. pass from the host
unsigned int off1 = 0;

// shared memory array to hold the byte data
unsigned char *s_bytes = &shared[off1];

if (threadIdx.x < (N+3)/4)
{
	// perform a coalesced read of 32-bit integers into the byte array
	*(unsigned int*)(&s_bytes[threadIdx.x * 4]) = *(unsigned int*)(&g_mem[threadIdx.x * 4]);
}

// make the data visible to all threads of the block before anyone reads it
__syncthreads();

// assuming you have cuPrintf from the nVidia developer site...
cuPrintf("Byte %d is %d\n", (unsigned int)threadIdx.x, (unsigned int)s_bytes[threadIdx.x]);

Hey Christian, I can’t thank you enough! Thank you very much for your help. This was the solution I had in mind, but I kept getting errors. There is one more thing I want to ask: at the last step, when reading from shared memory, there still exist bank conflicts, right?

Yes, this implementation has bank conflicts.

But it all depends on how you read it ;) Read it with a stride of 4 and you’re fine ;)