Reading from global memory to registers in a fast way

Inside the kernel, reading from global memory into registers looks like this (with NPAGES = 24 and TX = 512):

    const int tx = threadIdx.x;
    const int bx = blockIdx.x;
    float rA[NPAGES] = {M_S_ZERO};
    const int m_ = m-(NPAGES-1)*TX;

    // Read from global memory to registers
    dA += bx * ldda + tx;
    #pragma unroll
    for(int i = 0; i < NPAGES-1; i++){
        rA[i] = dA[ i * TX ];
    }
    // guarded load for the final (possibly partial) page
    if( tx < m_ ){
        rA[NPAGES-1] = dA[ (NPAGES-1) * TX ];
    }

As I understand it, each thread will separately read data from global memory into its own registers. Is that correct?

Can I do those operations in a better way so that they finish faster? For example, using shared memory instead of registers, and doing it as a single bulk copy from global memory to shared memory.

That’s not true at the warp level, and may not be true device-wide. At the warp level, 32 threads are reading from the same location here:

dA[ i * TX ]

First of all, the LSU/memory controller will identify that the same address is being requested warp-wide and will only issue a single read request to that location. When the read request is satisfied, the GPU will use a broadcast mechanism to deliver the data simultaneously to each thread in the warp (i.e. it will deposit the data simultaneously in the relevant register for each thread in the warp).

Second, in such a situation, while the very first request for that item will likely go to global memory, the retrieval of that data will populate the L2 cache. Thereafter subsequent requests will likely hit in the L2 cache. Again, these subsequent requests are issued warp-wide, and will enjoy the same broadcast mechanism.

Based on what you have shown here, I think you are unlikely to improve on this using ideas like you have suggested. The read activity is already condensed.

The pattern you have shown here (so called “uniform” access) does make it a candidate to consider usage of __constant__ memory for storage of dA. That might be worth investigating. And if you have time to spare, sure, load each item of dA using a single thread, store it in shared memory, and then see if you get an improvement via shared memory. I don’t know a-priori if such transformations are likely to help. To some degree it may be dependent on what else is going on. However, at least for the first for-loop, the multiplication of the index by TX (512) means that the locations that that for-loop cares about are spaced apart by 512 locations. That doesn’t lead to nice load patterns.
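If it helps to see that suggestion concretely, here is a minimal sketch of the shared-memory staging idea, written under the uniform-access assumption above; the kernel name, base, and out are placeholders rather than anything from your code:

    #define NPAGES 24
    #define TX     512

    // Hypothetical sketch: stage the NPAGES strided values through shared memory,
    // assuming every thread in the block needs the same values.
    __global__ void staged_read(const float* __restrict__ base, float* out)
    {
        const int tx = threadIdx.x;
        __shared__ float sA[NPAGES];
        float rA[NPAGES];

        if (tx < NPAGES)
            sA[tx] = base[tx * TX];      // one thread fetches each strided item
        __syncthreads();                 // staged values now visible block-wide

        #pragma unroll
        for (int i = 0; i < NPAGES; i++)
            rA[i] = sA[i];               // every thread copies into its registers

        out[blockIdx.x * blockDim.x + tx] = rA[0] + rA[NPAGES - 1];  // dummy use
    }

Whether that beats the broadcast path described above is something only a benchmark can tell you.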


Thanks for the good details.

Can I use vectorized load and store instructions, LD.E.{64,128} and ST.E.{64,128}, here by using reinterpret_cast to improve it?

The pattern you have shown here (so called “uniform” access) does make it a candidate to consider usage of __constant__ memory for storage of dA.

After reading dA I have to work on the data and write it back. So does using __constant__ make sense here?

Finally, about this part:

the multiplication of the index by TX (512) means that the locations that that for-loop cares about are spaced apart by 512 locations. That doesn’t lead to nice load patterns.

Why is the load pattern not fine? I have divided the original matrix into sub-matrices, and for that reason I am accessing it like that. Is it possible to imagine any other patterns?

I don’t see how. You’ve not indicated the type of dA. Let’s assume float. A 128-bit vectorized load would load:

dA[i*TX]
dA[i*TX+1]
dA[i*TX+2]
dA[i*TX+3]

into a single thread. But the only values you are using are

dA[i*TX]

(and in the next loop iteration)

dA[i*TX+512]

So how would it help to load those values:

dA[i*TX+1]
dA[i*TX+2]
dA[i*TX+3]

?

Your code as posted never uses them, that I can see.
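To make that concrete, a 128-bit load written out looks something like this (illustration only, reusing the names from your snippet, and assuming dA is a float* whose address at that offset is 16-byte aligned):

    // One 128-bit load pulls 4 adjacent floats into a single thread:
    float4 v = *reinterpret_cast<const float4*>(&dA[i * TX]);
    // v.x == dA[i*TX], v.y == dA[i*TX+1], v.z == dA[i*TX+2], v.w == dA[i*TX+3]
    rA[i] = v.x;   // the code as posted would only use this first element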

Probably not. I can only work with what you show here (once again reminding myself that I should probably not respond to posts that contain only partial code; it wastes your time and mine).

In order to be a proficient CUDA programmer, I personally believe there are a few concepts (two, probably) that you need to understand to write good code. One of those two concepts is the idea of coalesced access. As CUDA programmers, we strive for it. The basic idea is that adjacent threads in a warp should read (or write) adjacent locations in memory.

Is that particular line of code doing that? You need to be able to answer that question in order to have any useful understanding of coalescing. (By the way, uniform access is not coalesced access.)

For a coalesced load, in a particular cycle thread 0 is reading (let’s say) location 0, thread 1 is reading location 1, and so on.

Does your load do that?

dA[ i * TX ]

It does not. Thread 0 reads location i*TX, thread 1 reads location i*TX, and so on. Even if we extend this idea “across the for-loop” (which is likely to confuse you if you don’t have the basic idea of what coalescing is), with an eye towards restructuring the code, we see that:

loop iteration:   load location:
0                 0
1                 512
2                 1024
...

(*)
And that applies to all threads in the warp.

Those locations are not adjacent to each other, so you would never be able to arrange coalescing without restructuring the data storage pattern (I already hinted at a transpose in your previous question).
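As a side-by-side illustration of the general idea (hypothetical kernels, not your code):

    // Coalesced: adjacent threads in a warp read adjacent floats, so a warp's
    // 32 loads are serviced from a small number of contiguous cache lines.
    __global__ void coalesced_copy(const float* in, float* out)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        out[idx] = in[idx];
    }

    // Strided: with a large stride (e.g. 512 floats), each thread in a warp
    // touches a different, widely separated cache line, wasting bandwidth.
    __global__ void strided_copy(const float* in, float* out, int stride)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        out[idx] = in[idx * stride];
    }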

For basic CUDA programming concepts presented in an orderly way, you may wish to avail yourself of this resource.

Once again, I can only work with the code you show here. I don’t know what else you may be doing. I’m now going to adhere to the principle I previously stated for this case. I don’t think it makes much sense to discuss an incomplete piece of code. I probably won’t be able to respond further.

(*) Note: coalescing has no bearing on separate iterations of a for-loop. The purpose of that discussion is to look at the data being loaded more holistically, to see whether a restructuring of the load patterns or the data storage pattern itself might be useful.


About this part that you have written:

adjacent threads in a warp should read (or write) adjacent locations in memory.
Does your load do that?

dA[ i * TX ]

It does not. Thread 0 reads location i*TX, thread 1 reads location i*TX, and so on.

We also have the line dA += bx * ldda + tx;, so for thread 1 the location is not i*TX, and it is not the same as the location for thread 0.

Yes, I missed that. So you can probably disregard everything I have said in this thread. I would just delete it all, but that might be more chaotic than leaving it as-is. Good luck!

Thanks.
Just a final question. I am using float. How can I do a 128-bit vectorized load for this loop? Because register locations do not have an index.

    dA += bx * ldda + tx;
    #pragma unroll
    for(int i = 0; i < NPAGES-1; i++){
        rA[i] = dA[ i * TX ];
    }


I have a problem rewriting the read loop.

The original code is:

    dA += bx * ldda + tx;
    #pragma unroll
    for(int i = 0; i < NPAGES-1; i++){
        rA[i] = dA[ i * TX ];
    }

I want to use float4, so the loop should now execute 1/4 as many times. For that reason, should I increase i like i += 4, or limit i to i < (NPAGES-1)/4?

The first line of the code is dA += bx * ldda + tx;, so each thread has its own offset into dA. What should I do here?



I think the code below is not correct.

    // read
    dA += bx * ldda + tx;

    #pragma unroll
    for(int i = 0; i < (NPAGES-1)/4; i++){
        reinterpret_cast<float4*>(rA)[i] = reinterpret_cast<float4*>(dA)[i*TX];
    }
    if( tx < m_ ){
        rA[NPAGES-1] = dA[ (NPAGES-1) * TX ];
    }

Note that on GPUs all loads must be naturally aligned. This is different from x86 CPUs that support unaligned access (at some cost to performance).

Therefore re-interpreting a float* as a float4* and then de-referencing the latter will only work as intended if the pointer has 16-byte alignment.
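As a sketch of what honoring that constraint can look like (the helper names below are hypothetical, not part of CUDA or of the code above):

    #include <cstdint>

    // True if the pointer is 16-byte aligned, i.e. safe to dereference as float4.
    __device__ inline bool is_aligned_16(const void* p)
    {
        return (reinterpret_cast<std::uintptr_t>(p) & 0xF) == 0;
    }

    __device__ float4 load4(const float* src)
    {
        if (is_aligned_16(src)) {
            // Single 128-bit load: legal only because src is 16-byte aligned.
            return *reinterpret_cast<const float4*>(src);
        }
        // Otherwise fall back to four scalar 32-bit loads.
        return make_float4(src[0], src[1], src[2], src[3]);
    }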

Thanks njuffa,

I have a matrix which was allocated by cudaMalloc, and here dA refers to that matrix in global memory. I expected it to be aligned memory.

Based on section 5.3 of the CUDA C Programming Guide, I think it should be 16-byte aligned memory. Am I wrong?

I have another question about registers, which do not have indexing. How can I manage them so that I can de-reference them for a vectorized load?