I have a kernel in which I have to do some copying based on importance. So I will calculate an array as big as my data that contains, for each element, the index of the old element that has to become the new one. Something along the lines of:
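(a rough in-place sketch with names I made up; the real kernel handles all six arrays the same way)

__global__ void reshuffle(float *x, float *y, float *z,
                          float *vx, float *vy, float *vz,
                          const int *index, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int j = index[i];                              // which old element becomes element i
    x[i]  = x[j];   y[i]  = y[j];   z[i]  = z[j];  // uncoalesced reads, coalesced writes
    vx[i] = vx[j];  vy[i] = vy[j];  vz[i] = vz[j];
}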
It will happen quite often that there are only a few distinct values in index, sorted from low to high, like index = {1 1 1 1 123 123 123 123 123 123 123 123 123 …}
So I will have trouble with uncoalesced access. Is the only way to do this fast to bind x, y, z, vx, vy, vz to textures and sample those? That would likely give me a lot of benefit, since many threads in a block will want the same value from the texture. As far as I understand, accessing the same index of a global array within a warp leads to serialization. Is that correct?
You realize that you’ll be modifying the data you are reading, right? Are you guaranteed to never read after write? From your example it looks like you are not. That means your result could be incorrect if you read from a texture.
If you could guarantee it, then texture would be a good solution.
So I can either make sure that if e.g. 312 appears as a value in my index array, then the value at position 312 is also 312, so I never overwrite input data that is still needed (the elegant, hard solution).
Or I can copy my x, y, z, vx, vy, vz arrays and bind the textures to the copies, as sketched below. That will double my global memory usage, so I will have to see if that is a problem in practice (this I call my dumb solution :D).
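For the dumb solution, the host side would look roughly like this (just a sketch for x; texX, the copy pointer and the function name are names I made up, and the other five arrays would be handled the same way):

// legacy texture reference, declared at file scope
texture<float, 1, cudaReadModeElementType> texX;

void bind_copy_of_x(const float *d_x, float **d_x_copy, int n)
{
    // duplicate the input and bind the texture to the duplicate,
    // so the kernel can freely overwrite the original array
    cudaMalloc((void **)d_x_copy, n * sizeof(float));
    cudaMemcpy(*d_x_copy, d_x, n * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaBindTexture(0, texX, *d_x_copy, n * sizeof(float));
}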
Use simple tex1Dfetch fetches for the reads; the writes should be trivial to coalesce. The warp serialization will be a non-issue, since you are going to be memory bound with this operation.
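For example, the read side can be as simple as this per array (a sketch, assuming a texture reference texX has been bound to the source data as in your snippet above):

__global__ void reshuffle_tex(float *x_out, const int *index, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    x_out[i] = tex1Dfetch(texX, index[i]);   // cached texture read, fully coalesced write
}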
I realize that this might change your data structures elsewhere in your code, but reading float4 textures (one for position, one for velocity) is faster than doing the same accesses as individual float texture reads for x, y, z.
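With packed data that would become something like this (again just a sketch; the texture and array names are illustrative, and the w component is unused padding):

texture<float4, 1, cudaReadModeElementType> texPos;   // x, y, z packed into one float4
texture<float4, 1, cudaReadModeElementType> texVel;   // vx, vy, vz packed into one float4

__global__ void reshuffle_tex4(float4 *pos, float4 *vel, const int *index, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int j = index[i];
    pos[i] = tex1Dfetch(texPos, j);   // one 16-byte fetch instead of three float fetches
    vel[i] = tex1Dfetch(texVel, j);
}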
Also: I’m curious, what is this for? I have this exact same operation happening in my code, though my importance array only has unique values in it. I use it to reshuffle the order of particles in memory to improve memory throughput in other algorithms.
I have indeed been thinking of switching to float4s. The only trouble is that elsewhere in my program (another kernel) I also need to read the written values. So I will have to profile which is faster:
tex1Dfetch float4 + coalesced read (only) of float4
4x tex1Dfetch float + 4x coalesced read (only) of float
You probably have a good guess (given your testing in this field) whether float2s might even be an option, because as far as I remember float2s were not too bad in coalesced accesses, while float4s were ‘slow’.
So the output of this kernel will be used as input for this same kernel (making float4 interesting), but before that as input for another kernel that reads the values coalesced (making float4 not so nice to use).
My solution is just to always read the float4 texture, even when I could use a coalesced read. Unfortunately this is not feasible when you need one kernel to work with many float4 arrays.
However, I do think this one is due for some testing. I made my choice between 4 tex1Dfetch float reads vs 1 float4 read way back in CUDA 0.8. Let’s see if things have changed since then. I’ll try out my key random memory access kernel with the switched data structure and post the results later today.
Well, I will also be occupied with traveling most of next week, so I hope I can implement it Monday morning. If I manage, I will report which version is fastest for me.