I'm not very familiar with textures, but this is what the programming guide says:
"
template<class Type, enum cudaTextureReadMode readMode>
Type tex1D(texture<Type, 1, readMode> texRef,
float x);
"
The tex1D() function can accept a floating-point “x”, so use “0.5*index” with float4… Will that not work?
–edit–
tex1D() requires the texture to be bound to a cudaArray… but if that helps, you could use it.
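For what it's worth, here is a rough sketch of that setup with the (legacy) texture-reference API; the names (texRef, d_array, fetchKernel, etc.) are placeholders, not from any actual code being discussed:

```cuda
// Sketch: bind a cudaArray to a 1D float4 texture reference and fetch
// with a floating-point coordinate. Texture references are deprecated
// in recent CUDA toolkits; this matches the API of that era.
texture<float4, 1, cudaReadModeElementType> texRef;

__global__ void fetchKernel(float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Coordinates are floats; +0.5f addresses the texel centre
        // when the texture uses unnormalized coordinates.
        out[i] = tex1D(texRef, (float)i + 0.5f);
    }
}

void setup(const float4 *h_data, int n)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
    cudaArray *d_array;
    cudaMallocArray(&d_array, &desc, n, 1);
    cudaMemcpyToArray(d_array, 0, 0, h_data, n * sizeof(float4),
                      cudaMemcpyHostToDevice);
    cudaBindTextureToArray(texRef, d_array, desc);
}
```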
Because if we somehow set up n4 = data4[n], then n4.x will always equal one of the ‘x’ values in the data set (or possibly be an interpolation between some of the ‘x’ values).
What was asked for was instead a way for every second thread to do an unaligned access, so that it would be assigned n4.x = data4[n].z … n4.w = data4[n+1].y.
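If textures turn out not to help, that shifted pattern can also be assembled in registers from two aligned loads; a rough sketch, with data4/out/n as placeholder names:

```cuda
// Each thread does two aligned float4 loads and reassembles the
// misaligned element in registers - both global loads stay aligned.
__global__ void shiftedLoad(const float4 *data4, float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + 1 < n) {
        float4 a = data4[i];      // aligned load of element n
        float4 b = data4[i + 1];  // aligned load of element n+1
        float4 v;
        v.x = a.z;  v.y = a.w;    // tail of data4[n]
        v.z = b.x;  v.w = b.y;    // head of data4[n+1]
        out[i] = v;
    }
}
```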
Additional ideas? Mmm …
Use twice the memory - duplicating values - so that your access pattern matches the hardware?
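A sketch of what that duplication could look like on the host, assuming overlapping windows of four floats (all names hypothetical):

```cuda
// Store each window of four floats explicitly, so every thread later
// reads exactly one aligned float4. Consecutive windows overlap by two
// values, which is why this costs roughly twice the memory.
void duplicate(const float *in, float4 *dup, int nWindows)
{
    for (int w = 0; w < nWindows; ++w) {
        dup[w] = make_float4(in[2 * w],     in[2 * w + 1],
                             in[2 * w + 2], in[2 * w + 3]);
    }
}
```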
Have the currently even threads side by side in one warp, executing smoothly, and the odd threads, with the broken access pattern, in another warp?
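One way that regrouping could be sketched (placeholder names; assumes an even total thread count): remap indices so whole warps take the aligned path and other whole warps take the shifted path, avoiding divergence within a warp.

```cuda
__global__ void regrouped(const float4 *data4, float4 *out, int n)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int half = (gridDim.x * blockDim.x) / 2;
    // First half of the threads -> even logical indices (aligned),
    // second half -> odd logical indices (shifted). All threads in a
    // given warp now follow the same branch.
    int logical = (tid < half) ? 2 * tid : 2 * (tid - half) + 1;
    if (logical + 1 >= n) return;
    if (tid < half) {
        out[logical] = data4[logical];                   // aligned path
    } else {
        float4 a = data4[logical], b = data4[logical + 1];
        out[logical] = make_float4(a.z, a.w, b.x, b.y);  // shifted path
    }
}
```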
Brook GPU (used by AMD Stream) has a set of “stream” functions that group a stream.
For example, if you have a stream of elements like <1,2,3,4,5,6,7,8>, you can group them as { <1,2>, <2,3>, <3,4> … }.
There are many different ways of grouping, of specifying boundary conditions, and so on…
Check out the stream-operator section in the Brook spec.
You may want to group your input data before memcpy’ing it to the GPU…
OR
You can use a GPU kernel to do the grouping instead of the CPU (if your data size is large).
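For example, a trivial CUDA kernel that emulates Brook-style pairwise grouping on the GPU (in, out, and n are placeholder names):

```cuda
// Each thread emits one overlapping pair <in[i], in[i+1]>, producing
// the { <1,2>, <2,3>, <3,4> ... } grouping described above.
__global__ void groupPairs(const float *in, float2 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + 1 < n)
        out[i] = make_float2(in[i], in[i + 1]);
}
```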
Hi,
I’ve used a float2 texture instead of a “regular” single-float one. The mapping was very trivial and easy (CUDA rocks!!)
Just doing so gave ~15-20% performance boost for my overall code.
It would still be nice to know if I could use float4 instead of float2 as I described in the first post. nVidia??? :)
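In case it's useful, a minimal sketch of what a float2 texture over linear device memory looks like with tex1Dfetch() - no cudaArray needed; names are placeholders, not the poster's actual code:

```cuda
// Fetch float2 elements through a texture reference bound to linear
// device memory; tex1Dfetch() takes an integer index (no filtering).
texture<float2, 1, cudaReadModeElementType> tex2Ref;

__global__ void useFloat2(float2 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex2Ref, i);
}

void bind(const float2 *d_data, int n)
{
    // Host side: bind the texture to plain device memory.
    cudaBindTexture(0, tex2Ref, d_data, n * sizeof(float2));
}
```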
Your “even threads” - if that really is where they must be - could still do their access the more straightforward float4 way if you had two texRefs (one float2, one float4) bound to the same underlying data…
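That two-texRef idea might be sketched like this (all names hypothetical): bind a float2 view and a float4 view to the same linear buffer; even threads fetch through the float4 view, odd threads assemble their shifted value from two float2 fetches.

```cuda
texture<float2, 1, cudaReadModeElementType> tex2;
texture<float4, 1, cudaReadModeElementType> tex4;

void bindBoth(const float *d_data, size_t bytes)
{
    cudaBindTexture(0, tex2, d_data, bytes);  // float2 view of the data
    cudaBindTexture(0, tex4, d_data, bytes);  // float4 view, same memory
}

__global__ void mixedFetch(float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i & 1) == 0) {
        out[i] = tex1Dfetch(tex4, i);             // aligned float4 fetch
    } else {
        // float4 element i spans float2 indices 2i and 2i+1, so the
        // shifted value is the float2 pair straddling the boundary.
        float2 lo = tex1Dfetch(tex2, 2 * i + 1);  // data4[i].z, .w
        float2 hi = tex1Dfetch(tex2, 2 * i + 2);  // data4[i+1].x, .y
        out[i] = make_float4(lo.x, lo.y, hi.x, hi.y);
    }
}
```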