Another question about coalesced reads/writes

Hi there, I need some info about that strange monster called “coalesced memory access”.
I have read several topics about this subject, both in this forum and on the Net, but I’m still not able to answer my question:
I have many threads that each need to access and process exactly one byte of data. If each thread loads its byte from global memory into shared memory, the memory access is not coalesced (I think…). So how can I speed up this copy?
The only answer I’ve found is to retrieve the data through a linear texture, so that only the first read accesses global memory and the following ones hit the texture cache.
Is this the only way, or is there a better way to do this?

Thanks for any answers.

Hi Mickey,

First, you need to tell us the compute capability of your target hardware. For compute capability less than 1.2 there is no coalescing for single-byte reads; if your hardware is compute capability 1.2 or higher, single-byte reads can be coalesced. Also, you should use shared memory only when you need to reuse the data you read from device memory; otherwise it’s a waste of resources and time.

Sid.

OK, you’re right. My device is an NVIDIA Quadro FX 770M, so its compute capability is 1.1. My goal is to make my software portable, so it can run on all CUDA-enabled devices. As for your second point: each thread operates on one byte at a time, but it needs to access more than one byte during its execution, so shared memory is the better way (in my opinion, at least :D)

So you’re saying there’s no way to load one byte per thread? Not even with something like cudaMemcpyArray or similar?

thanks

mickey

Use a quarter of the threads to copy one int each; the rest of the threads remain idle. Works like a charm, but it requires some pointer typecasts to keep the compiler happy. If your thread count isn’t a multiple of 4, you’ll have to round up the result of the division by 4.

This thread fills in some of the sticky details about using shared memory for this type of operation.

[topic=“87502”]Coalescing memory access for short to float conversion[/topic]
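A minimal sketch of that scheme (my own untested code; the kernel name and the assumption that blockDim.x is a multiple of 4 are mine). The first blockDim.x/4 threads each load one 32-bit word, so the half-warp issues coalesced 4-byte reads instead of uncoalesced 1-byte reads:

```cuda
// Launch with shared memory size = blockDim.x bytes, e.g.:
//   loadBytes<<<numBlocks, blockSize, blockSize>>>(d_data);
__global__ void loadBytes(const unsigned char *g_data)
{
    extern __shared__ unsigned char s_data[];

    // Reinterpret both pointers as int pointers to move 4 bytes at a time
    // (this is the typecast needed to keep the compiler happy).
    const int *g_ints = reinterpret_cast<const int *>(g_data);
    int       *s_ints = reinterpret_cast<int *>(s_data);

    // Only the first quarter of the threads do the copy; the rest idle.
    if (threadIdx.x < blockDim.x / 4)
        s_ints[threadIdx.x] = g_ints[blockIdx.x * (blockDim.x / 4) + threadIdx.x];

    __syncthreads();   // wait until all bytes of the block's tile are staged

    // Every thread now picks up its own byte from shared memory.
    unsigned char myByte = s_data[threadIdx.x];
    // ... process myByte ...
}
```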

Thanks for your answers, but there is one more thing. Each group of 16 threads operates on 16 bytes: the first 16 threads need the first 16 bytes, the second group of 16 threads accesses the second group of 16 bytes, and so on. In this situation I need each group of 16 threads to access its related data.
threads 0…15 | 16…31 | 32…47
bytes   0…15 | 16…31 | 32…47
In the above situation I need to load ints 0…3 (bytes 0…15) with the first 4 threads, but the second 4 ints (4…7) must be loaded by threads 16…19, and so on. So the reads are only partially coalesced.
Is there any way to optimize these reads/writes further? I’ve thought about wasting some space in global memory by storing each char in its own int, but that wastes 3/4 of the allocated space.
Any other suggestions, please?
Thanks in advance

mickey

Thanks for the link, but I have one question: if I declare a piece of shared memory as “extern”, can it be accessed by different blocks? And if so, how does thread synchronization between blocks work? Through semaphores and things like that?

thanks

mickey

Shared memory is local to each running block; you can’t share it between blocks.

OK, it’s exactly as I thought. So the only way to coalesce reads/writes in my situation is to group the bytes into 16-byte groups and load them with 4 reads of 4 ints, as described two posts ago? Nothing better?

thanks

mickey

It might be better to have each thread read 4 bytes as a 32-bit int, process each of the 4 bytes, and then write the 4 output bytes back as a 32-bit int.

Or use a thread block size that is a multiple of 64: then one half-warp (16 threads) can read 16 ints, which feeds 64 threads with one byte each.
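The first variant could look roughly like this (an untested sketch; the kernel name and the placeholder per-byte operation are mine). Each thread performs one coalesced 32-bit read, processes its 4 bytes in registers, and performs one coalesced 32-bit write:

```cuda
// Assumes the byte count is a multiple of 4 and the buffers are
// laid out as arrays of 32-bit words.
__global__ void processBytes(const unsigned int *in, unsigned int *out)
{
    unsigned int idx  = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int word = in[idx];            // 4 input bytes, one coalesced read

    unsigned int result = 0;
    for (int i = 0; i < 4; ++i) {
        unsigned char b = (word >> (8 * i)) & 0xFF;  // extract byte i
        b = b + 1;                                   // placeholder per-byte work
        result |= (unsigned int)b << (8 * i);        // repack byte i
    }

    out[idx] = result;                      // 4 output bytes, one coalesced write
}
```

This avoids shared memory entirely, which fits the earlier advice that shared memory only pays off when data is reused.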

Ok, thank you so much. I think it’s the best solution.

Does no one know the answer to the question in this post? I’ve been trying to get the profiler to run for a while now, but no success so far… :(

thanks

mickey