Uncoalesced global loads

JKS1e6.pdf (1.0 MB)

I ran the Nsight Compute profiler on my CUDA C implementation (results attached). I have been running into this issue: “Uncoalesced global accesses” (at the bottom of the PDF).

I understand this means threads are not accessing consecutive array elements. The code uses indirect access in some places, for example b[ a[threadId] ]. Could this be one of the causes?

Is there a way to determine which section(s) of the code are causing this error? What does the 14-digit string at the end of the error (uncoalesced global access) point to?

It could be.

This blog demonstrates a method for identifying the specific lines of source code that lead to various Nsight Compute report observations. I recommend reviewing all three parts of the blog; for this question, focus on the steps that use the Source page of the report. Be sure to compile your app with -lineinfo.

It is a PC (program counter) address: the address of the (SASS) instruction that the observation is being reported against. Rather than trying to use this info directly, I suggest using the method I already indicated.

Note that for questions on Nsight Compute we have a dedicated forum you can use.

I was able to identify the biggest bottleneck to performance, and it seems to come down to these lines (Nsight Compute report attached):

JKS1e6.pdf (191.8 KB)

    for (int i = 0; i < 3; i++) {
        Localalpha[i] = alpha[ix*3 + i];
        Localbeta[i]  = beta[ix*3 + i];
        Localvx[i]    = vx[index[ix*3 + i] - 1];
        Localvy[i]    = vy[index[ix*3 + i] - 1];
    }

The memory access patterns are not ideal, primarily due to the indirect accesses and the striding. Are there any tricks for indirect or strided accesses that might make this more “GPU friendly” (i.e., improve the memory access patterns)?

I did not look at the PDF.

Avoid indirect addressing where possible; it typically results in irregular access patterns that significantly degrade GPU memory throughput, in addition to increasing the bandwidth requirements. I don’t know the context of this snippet, but maybe you can alleviate the problem by blocking and buffering in shared memory?
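Here is a minimal sketch of the shared-memory idea, assuming a 1D launch with one element per thread and the triplet layout from your snippet. The kernel name, the out computation, and everything beyond the arrays in your code are hypothetical; this only illustrates the staging pattern:

    #define BLOCK 256
    
    // Hypothetical sketch: cooperatively stage this block's alpha/beta
    // triplets into shared memory with coalesced global loads, then read
    // the per-thread triplets from shared memory.
    __global__ void kernel(const float* __restrict__ alpha,
                           const float* __restrict__ beta,
                           const float* __restrict__ vx,
                           const float* __restrict__ vy,
                           const int*   __restrict__ index,
                           float* out, int n)
    {
        __shared__ float sAlpha[3 * BLOCK];
        __shared__ float sBeta [3 * BLOCK];
    
        int ix   = blockIdx.x * blockDim.x + threadIdx.x;
        int base = blockIdx.x * blockDim.x * 3;  // first triplet element handled by this block
    
        // Coalesced staging: consecutive threads read consecutive global addresses.
        for (int k = threadIdx.x; k < 3 * BLOCK && base + k < 3 * n; k += blockDim.x) {
            sAlpha[k] = alpha[base + k];
            sBeta [k] = beta [base + k];
        }
        __syncthreads();
    
        if (ix >= n) return;
    
        float Localalpha[3], Localbeta[3], Localvx[3], Localvy[3];
        for (int i = 0; i < 3; i++) {
            // Stride-3 reads, but now from shared memory (stride 3 is also
            // bank-conflict free, since 3 is coprime with the 32 banks).
            Localalpha[i] = sAlpha[threadIdx.x * 3 + i];
            Localbeta[i]  = sBeta [threadIdx.x * 3 + i];
    
            // The indirect vx/vy gathers are unchanged; they stay irregular
            // in global memory unless the contents of index have structure
            // you can exploit.
            int j = index[ix * 3 + i] - 1;
            Localvx[i] = vx[j];
            Localvy[i] = vy[j];
        }
    
        // ... rest of the computation; placeholder only:
        out[ix] = Localalpha[0] * Localvx[0] + Localbeta[0] * Localvy[0];
    }

Note this only fixes the strided alpha/beta loads. Whether the vx/vy gathers can be improved depends on what index contains; if neighboring threads tend to reference nearby entries, staging a window of vx/vy in shared memory the same way may also help.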

You could experiment with using the float3 type for vx, vy, alpha, beta. My assumption is that there is a reason all of these are written as triplets. This may or may not result in better performance, which is why I use the word “experiment”.
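A sketch of the float3 variant for the directly indexed arrays (the kernel name and the out computation are hypothetical). This assumes the allocations really are laid out as triplets, so that alpha[ix*3 + i] corresponds to alpha3[ix].x/.y/.z; whether it helps depends on how the compiler schedules the 12-byte structure loads, which is part of why it is an experiment:

    // Hypothetical sketch: reinterpret the triplet arrays as float3 so each
    // thread issues one structure load per triplet instead of a loop of
    // strided 4-byte loads. Launch with alpha3 = (const float3*)alpha, etc.
    __global__ void kernel_f3(const float3* __restrict__ alpha3,
                              const float3* __restrict__ beta3,
                              const int*    __restrict__ index,
                              const float*  __restrict__ vx,
                              const float*  __restrict__ vy,
                              float* out, int n)
    {
        int ix = blockIdx.x * blockDim.x + threadIdx.x;
        if (ix >= n) return;
    
        float3 a = alpha3[ix];  // one structure load per thread
        float3 b = beta3[ix];
    
        // The indirect vx/vy gathers are unchanged; float3 only helps the
        // directly indexed arrays. Placeholder computation:
        int j0 = index[ix * 3 + 0] - 1;
        float r = a.x * vx[j0] + b.x * vy[j0];
        int j1 = index[ix * 3 + 1] - 1;
        r += a.y * vx[j1] + b.y * vy[j1];
        int j2 = index[ix * 3 + 2] - 1;
        r += a.z * vx[j2] + b.z * vy[j2];
    
        out[ix] = r;
    }

The same trick could be tried on index itself via int3, since it is also read as triplets.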