Question about texture reads and coalesced reads

Hi, I have a few questions:

  1. About textures
    At some point in my program, all threads execute the tex2D() function with exactly the same index.
    Is there any lock for reading the same cell, or can all threads access it at the same time? (If this were a normal array, I guess there would be bank conflicts everywhere?)

  2. About a simple conversion from uncoalesced to coalesced memory access
    At this point I have a kernel which accepts one big array like: 1 2 3 4 5 6 7 8 9 10 11 12 13 …
    Let's say that when I execute the program, each thread gets 4 elements from its subarray; these elements are then passed to the called functions as an offset into the original array.
    Something like this: fcall( Arr ), fcall( &Arr[4] ), fcall( &Arr[8] ) …
    When I access this array from fcall, the read/write access is clearly uncoalesced.
    What I would like to do here is make those array reads/writes coalesced from inside those fcalls.

Would something like this be coalesced if I fill the original array like this:
1 2 3 4 0 0 0 0 0 0 0 0 0 0 0 0
0 5 6 7 8 0 0 0 0 0 0 0 0 0 0 0
0 0 9 10 11 12 0 0 0 0 0 0 0 0 0 0

and then pass to each thread:
fcall( Arr ), fcall( &Arr[16+1] ), fcall( &Arr[16+2] ) …?

I realize there is a lot of wasted memory here, but that is not a problem at this point.

Thanks for answers.

PS: Sorry for my bad English and the probably awkward explanation >.<

  1. There is no problem with multiple threads all accessing the same location in a texture; the cache handles this. Bank conflicts only occur in shared memory.

  2. No, your example would not result in coalesced reads, because for a given instruction the threads together are still not reading a contiguous region of memory. The programming guide explains this in detail.

  1. On G200, all threads accessing the same gmem location is coalesced. When doing texture fetches, it should also be streamlined (loaded into the cache once). However, I suspect the gmem reads will be faster than the texture fetches on G200.

  2. I’m not sure what your access pattern is. Did you mean: fcall( Arr ), fcall( &Arr[16+1] ), fcall( &Arr[2*16+2] )? In that case, with a stride of >= 16, it will not be coalesced at all, even on G200. fcall( Arr ), fcall( &Arr[4] ), fcall( &Arr[8] ) would actually be half-coalesced on G200.

DRAM does have banks, but their behavior is a bit more subtle and large-scale than your example.

Thanks to both of you for the answers.

alex_dubinsky: yes, I meant 2* there.

Well, I finished converting the program to coalesced reads/writes and it is working much better now. Meanwhile I also wrote a small program that uses coalesced reads/writes, because getting the indexes right was quite a pain in the *** imo. If anyone else has the same problem, feel free to test/use it. :)

coalesced.cu.txt (3.57 KB)

coalesced_kernel.cu.txt (2.13 KB)