Another question about coalesced reads/writes

Hi there, I need some info about that strange monster called “coalesced memory access”.
I have read several topics about this subject, both in this forum and on the Net, but I’m still not able to answer my question:
I have many threads that each need to access and process exactly one byte of data. If each thread loads its byte from global memory into shared memory, the memory access is not coalesced (I think…). So how can I speed up this copy?
The only answer I’ve found is to retrieve the data through a linear texture, so that only the first read accesses global memory and the following ones hit the texture cache.
Is this the only way, or is there a better way to do this?

Thanks for any answers.

Hi Mickey,

First, you need to tell us the compute capability of your target hardware. For compute capability less than 1.2 there is no coalescing for single-byte reads; if your hardware is compute capability 1.2 or higher, single-byte reads can be coalesced. Also, you should use shared memory only when you need to reuse the data you read from device memory; otherwise it’s a waste of resources and time.

Sid.

OK, you’re right. My device is an NVIDIA Quadro FX 770M, so its compute capability is 1.1. My goal is to make my software portable, so it can run on all CUDA-enabled devices. As for your second point: each thread operates on one byte at a time, but it needs to access more than one byte during its execution, so shared memory is the better way (in my opinion, at least :D)

So you’re saying there’s no way to load one byte per thread? Not even with something like cudaMemcpyArray or similar?

thanks

mickey

Use a quarter of the threads to copy one int each; the rest of the threads remain idle. Works like a charm, but it requires some pointer typecasts to keep the compiler happy. If your thread count isn’t a multiple of 4, you’ll have to round up the result of the division by 4.

This thread fills in some of the sticky details about using shared memory for this type of operation.

[topic=“87502”]Coalescing memory access for short to float conversion[/topic]
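A minimal sketch of that scheme (my own untested code; the kernel name and the assumption that blockDim.x is a multiple of 4 are mine). The first blockDim.x/4 threads each load one 32-bit word, so the half-warp issues coalesced 4-byte reads instead of uncoalesced 1-byte reads:

```cuda
// Launch with shared memory size = blockDim.x bytes, e.g.:
//   loadBytes<<<numBlocks, blockSize, blockSize>>>(d_data);
__global__ void loadBytes(const unsigned char *g_data)
{
    extern __shared__ unsigned char s_data[];

    // Reinterpret both pointers as int pointers to move 4 bytes at a time
    // (this is the typecast needed to keep the compiler happy).
    const int *g_ints = reinterpret_cast<const int *>(g_data);
    int       *s_ints = reinterpret_cast<int *>(s_data);

    // Only the first quarter of the threads do the copy; the rest idle.
    if (threadIdx.x < blockDim.x / 4)
        s_ints[threadIdx.x] = g_ints[blockIdx.x * (blockDim.x / 4) + threadIdx.x];

    __syncthreads();   // wait until all bytes of the block's tile are staged

    // Every thread now picks up its own byte from shared memory.
    unsigned char myByte = s_data[threadIdx.x];
    // ... process myByte ...
}
```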

Thanks for your answers, but there is one more thing. Each group of 16 threads operates on 16 bytes: the first 16 threads need the first 16 bytes, the second group of 16 threads accesses the second group of 16 bytes, and so on. In this situation I need each group of 16 threads to access its related data.
threads 0…15 | 16…31 | 32…47
bytes   0…15 | 16…31 | 32…47
In the above situation I need to load ints 0…3 (bytes 0…15) with the first 4 threads, but the second 4 ints (4…7) must be loaded by threads 16…19, and so on. So the reads are only partially coalesced.
Is there any way to optimize these reads/writes further? I’ve thought about wasting some space in global memory by storing each char in its own int, but that wastes 3/4 of the allocated space.
Any other suggestions, please?
Thanks in advance

mickey

Thanks for the link, but I have one question: if I declare a piece of shared memory as “extern”, can it be accessed by different blocks? And if so, how does thread synchronization between blocks work? Through semaphores and things like that?

thanks

mickey

Shared memory is local to each running block; you can’t share it between blocks.

OK, it’s exactly as I thought. So the only way to coalesce reads/writes in my situation is to group the bytes into 16-byte groups and load them with 4 reads of 4 ints, as described two posts ago? Nothing better?

thanks

mickey

It might be better to have each thread read 4 bytes as a 32-bit int, process each of the 4 bytes, and then write the 4 output bytes back as a 32-bit int.

Or use a thread block size that is a multiple of 64: then one half-warp (16 threads) can read 16 ints, which feeds 64 threads with one byte each.
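The first variant could look roughly like this (an untested sketch; the kernel name and the placeholder per-byte operation are mine). Each thread performs one coalesced 32-bit read, processes its 4 bytes in registers, and performs one coalesced 32-bit write:

```cuda
// Assumes the byte count is a multiple of 4 and the buffers are
// laid out as arrays of 32-bit words.
__global__ void processBytes(const unsigned int *in, unsigned int *out)
{
    unsigned int idx  = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int word = in[idx];            // 4 input bytes, one coalesced read

    unsigned int result = 0;
    for (int i = 0; i < 4; ++i) {
        unsigned char b = (word >> (8 * i)) & 0xFF;  // extract byte i
        b = b + 1;                                   // placeholder per-byte work
        result |= (unsigned int)b << (8 * i);        // repack byte i
    }

    out[idx] = result;                      // 4 output bytes, one coalesced write
}
```

This avoids shared memory entirely, which fits the earlier advice that shared memory only pays off when data is reused.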

Ok, thank you so much. I think it’s the best solution.

Does no one know the answer to the question in this post? I’ve been trying to get the profiler to run for a while now, but no success so far… :(

thanks

mickey