element-adjacent-element reads

hello,

a number of my kernels commence with each thread consuming an array element, together with the neighbouring array element

d1 = array[x + threadIdx.x];
d2 = array[x + threadIdx.x + 1];

y = f(d1, d2);

if the whole warp participates, reading both d1 and d2 should amount to 2 reads, but not more, if i am not mistaken (assuming 4-byte elements and 128-byte cache lines, the warp touches 33 consecutive elements, which span 2 aligned lines)
if multiple warps participate, the theoretical minimum number of reads is the number of warps + 1, if i am not mistaken (N contiguous warps touch 32N + 1 consecutive elements, spanning N + 1 lines)

so, is there an argument to pre-read the data into shared memory first?
can one simply issue a synchronization call between reading d1 and d2, so that much of d2 hits the cache?
is the synchronization call even necessary?
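
for reference, the shared-memory variant i have in mind would be roughly the following (just a sketch: BLOCK, kernel_shared and out are made-up names, and d1 + d2 stands in for the real f(d1, d2)):

#define BLOCK 256                                // hypothetical block size

__global__ void kernel_shared(const float *array, float *out, int x)
{
    __shared__ float s[BLOCK + 1];

    // every thread stages its own element; thread 0 also fetches the extra one
    s[threadIdx.x] = array[x + threadIdx.x];
    if (threadIdx.x == 0)
        s[BLOCK] = array[x + BLOCK];

    __syncthreads();                             // s[] is fully populated past this point

    float d1 = s[threadIdx.x];
    float d2 = s[threadIdx.x + 1];

    out[x + threadIdx.x] = d1 + d2;              // stand-in for y = f(d1, d2)
}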

If that data is read-only, just leverage the texture cache (__ldg if available on your arch). Hitting L1 (the non-coherent/texture cache) can be nearly as efficient as hitting shared, particularly if you’re not getting a lot of reuse from the store to shared. L1 can broadcast values in the same way that shared can. L1 also minimizes transactional overfetch from L2 (32-byte blocks instead of 128-byte).
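
Something along these lines is what I have in mind (just a sketch with made-up names; __ldg needs compute capability 3.5+, and d1 + d2 stands in for your f):

__global__ void kernel_ldg(const float * __restrict__ array, float *out, int x)
{
    // both loads go through the read-only / texture path (LDG.CI);
    // the second load mostly hits the lines the first one pulled into L1
    float d1 = __ldg(&array[x + threadIdx.x]);
    float d2 = __ldg(&array[x + threadIdx.x + 1]);

    out[x + threadIdx.x] = d1 + d2;              // stand-in for y = f(d1, d2)
}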

A huge amount of engineering has gone into making the texture cache efficient for exactly this kind of operation. But this is only my initial take given the limited info you provided. Definitely consider shared if the data reuse factor across warps and threads is high.

Oh, and also consider a warp shuffle. But I think the last thread in the warp needing 1 more value will kill the advantage of that.
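
Roughly like this (again just a sketch; __shfl_down_sync is the CUDA 9+ spelling, older toolkits use __shfl_down without the mask, and d1 + d2 stands in for your f):

__global__ void kernel_shfl(const float * __restrict__ array, float *out, int x)
{
    int lane = threadIdx.x & 31;

    float d1 = __ldg(&array[x + threadIdx.x]);

    // pull the neighbouring thread's d1 instead of loading it again
    float d2 = __shfl_down_sync(0xffffffff, d1, 1);

    // the last lane has no neighbour to shuffle from,
    // so it still has to issue the extra load itself
    if (lane == 31)
        d2 = __ldg(&array[x + threadIdx.x + 1]);

    out[x + threadIdx.x] = d1 + d2;              // stand-in for y = f(d1, d2)
}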

And as far as when you need to call __syncthreads(): within a warp, shared memory accesses are issued in SASS order. I believe the volatile keyword will preserve your CUDA C ordering in the SASS ordering. So, if each warp isn’t stepping on another warp’s region of shared memory, there’s no need to sync; you can just rely on the instructions being issued in order.

yes, the data is read-only

i didn’t provide much information, as i saw it as really a trivial case: the threads of the kernel need 2 data inputs, spaced an element apart, right at the beginning of the kernel, in order to calculate the output of the kernel

come to think of it, i was under the impression that the warps need to order their reads, in order to ‘ride on each others’ backs’
but, after reading your answer, it dawned on me that this may not necessarily be required

a warp may hit the cache on the 1st read (d1), on the 2nd read (d2), on both reads, or on neither
a cache miss would pave the way for other warps to hit the cache on the same reads

still, isn’t some form of synchronization required, from a load/store unit pipeline point of view?
can 2 back-to-back reads in the load/store unit pipeline not still cause a cache miss on what would otherwise have been a cache hit (due to the lack of synchronization)?

I"m pretty sure with just having back to back LDG.CI (__ldg) instructions the texture unit will execute the minimum number of transactions to get the data to the threads. Even if the data isn’t in cache, it would make sense that the memory unit would be aware of what data was recently requested and not ask for it again.

But it should be easy to set up a little test and verify this through nvprof.
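
Something along these lines would do (a sketch; file, kernel and buffer names are made up):

// ldg_test.cu -- back-to-back __ldg loads pulling from the same cache lines
#include <cuda_runtime.h>

__global__ void ldg_test(const float * __restrict__ in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = __ldg(&in[i]);
        float b = __ldg(&in[i + 1]);             // overlaps the lines of the first load
        out[i] = a + b;
    }
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in,  (n + 1) * sizeof(float));   // contents don't matter for counting transactions
    cudaMalloc(&out,  n      * sizeof(float));
    ldg_test<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}

Then something like:

nvcc -arch=sm_35 -o ldg_test ldg_test.cu
nvprof --metrics tex_cache_hit_rate,l2_tex_read_transactions,gld_transactions ./ldg_test

(those metric names are from memory and differ a bit between GPU generations, so nvprof --query-metrics will tell you what your device actually exposes)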

noted, thanks

Ok, I was testing something else and realized I was able to confirm this with the data I had. I was running a synthetic benchmark with 16 LDG.CI instructions in a row, all pulling from the same cache line. The nvprof stats indicated that only one cache line was transferred from L2 to L1 (4 transactions) and that the hit rate was above 90%. This is despite the fact that there’s no way the first load could have collected the data prior to the second load being issued.

I also ran the same test with texture loads and observed the same thing. What I was actually doing was comparing the performance of LDG.CI with TLDS. It seems LDG has almost half the latency with double the throughput, so there’s little reason to be using texture instructions if ldg is available. I think if you’re using vector loads then things become a little more competitive.

yes, scottgray the legend…

Actually, I would have considered user JanetYellen to be the legend on this forum, until I came to my senses and realized that there is no obligation to use real identities here and one can upload any profile picture.

What???

And all this time I was so impressed that the chair of the FRB was also a competent GPU programmer…