I don’t see how. You’ve not indicated the type of dA; let’s assume float. A 128-bit vectorized load would load:
dA[i*TX]
dA[i*TX+1]
dA[i*TX+2]
dA[i*TX+3]
into a single thread. But the only value you are using is dA[i*TX] (and, in the next loop iteration, dA[i*TX+512]). So how would it help to also load dA[i*TX+1], dA[i*TX+2], and dA[i*TX+3]? Your code as posted never uses them, as far as I can see.
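To make that concrete, here is a minimal sketch of what a 128-bit vectorized load looks like, assuming dA is float and assuming a stride of TX (the kernel name, loop bound, and output handling are mine, not from your code):

```cuda
// Hypothetical sketch, not your kernel. Assumes dA is float, 16-byte aligned,
// and TX is a multiple of 4 so that &dA[i*TX] is float4-aligned.
__global__ void vec_load_sketch(const float *dA, float *out, int n, int TX)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        // one 128-bit load brings in four consecutive floats per thread
        float4 v = *reinterpret_cast<const float4 *>(&dA[i * TX]);
        acc += v.x;   // only v.x corresponds to dA[i*TX];
                      // v.y, v.z, v.w (dA[i*TX+1..3]) go unused here
    }
    out[0] = acc;
}
```

The point: the vectorized load is only a win if the code actually consumes all four lanes; otherwise three quarters of the loaded data is wasted.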
Probably not. I can only work with what you show here (once again, reminding myself, I should probably not respond to posts that contain only partial code - that wastes your time and mine).
In order to be a proficient CUDA programmer, I personally believe there are a couple of concepts (two, probably) that you need to understand to write good code. One of those two concepts is the idea of coalesced access. As CUDA programmers, we strive for it. The basic idea is that adjacent threads in a warp should read (or write) adjacent locations in memory.
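As an illustration (generic kernels of my own, not your code), here is the canonical coalesced pattern next to a strided pattern that breaks it:

```cuda
// Coalesced: adjacent threads in a warp touch adjacent float locations.
__global__ void coalesced_copy(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[idx];  // thread 0 -> location 0, thread 1 -> location 1, ...
}

// Not coalesced: adjacent threads touch locations `stride` elements apart,
// so a warp's 32 reads scatter across many memory segments.
__global__ void strided_copy(const float *in, float *out, int n, int stride)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx * stride < n)
        out[idx] = in[idx * stride];
}
```

In the first kernel a warp's 32 reads fall in one or two 128-byte segments; in the second, with a large stride, each read can land in its own segment.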
Is that particular line of code doing that? You need to be able to answer that question in order to have any useful understanding of coalescing. (by the way, uniform access is not coalesced access).
For a coalesced load, in a particular cycle thread 0 is reading (let’s say) location 0, thread 1 is reading location 1, and so on.
Does your load do that?
dA[ i * TX ]
It does not. Thread 0 reads location i*TX, thread 1 reads location i*TX, and so on: every thread in the warp reads the same location, which is uniform access, not coalesced access. Even if we extend this idea “across the for-loop” (which is likely to confuse you if you don’t have the basic idea of what coalescing is), with an eye towards restructuring the code, we see that:
loop iteration:    load location:
      0                  0
      1                512
      2               1024
     ...
(*)
And that applies to all threads in the warp.
Those locations are not adjacent to each other, so you would never be able to arrange coalescing without restructuring the data storage pattern (I already hinted at a transpose in your previous question).
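To sketch what such a restructuring might look like (this is a generic illustration under my own assumptions, since I don’t know your full code): if each thread currently walks a stride-TX column of the original layout, storing the data transposed lets the per-iteration index vary with the thread index, so adjacent threads read adjacent locations:

```cuda
// Hypothetical sketch of the transpose idea, not your actual code.
// Before (not coalesced): thread t's i-th element at dA[t*N + i]
// After  (coalesced)    : thread t's i-th element at dAT[i*T + t],
// where T is the total thread count and dAT is the transposed array.
__global__ void sum_transposed(const float *dAT, float *out, int N)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int T = gridDim.x * blockDim.x;   // total number of threads
    float acc = 0.0f;
    for (int i = 0; i < N; i++)
        acc += dAT[i * T + t];        // adjacent t -> adjacent addresses: coalesced
    out[t] = acc;
}
```

The loop structure is unchanged; only the storage order of the data moved, which is exactly the kind of restructuring I was hinting at.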
For basic CUDA programming concepts presented in an orderly way, you may wish to avail yourself of this resource.
Once again, I can only work with the code you show here. I don’t know what else you may be doing. I’m now going to adhere to the principle I previously stated for this case. I don’t think it makes much sense to discuss an incomplete piece of code. I probably won’t be able to respond further.
(*) Note: coalescing has no bearing on separate iterations of a for-loop. The purpose of that discussion is to look at the loaded data more holistically, to see whether a restructuring of the load pattern, or of the data storage pattern itself, might be useful.