How to avoid global memory latency in selection

Dear All,

I am trying to select the elements of an array, w(ndata), that are larger than a given value, tol (nforc elements in total).

My code, which uses 256 threads, is:

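       real*8, device :: w(n), blk(100*256)

       isub=threadidx%x
       ! each thread owns a private 100-element segment of blk
       ipx=(isub-1)*100

       iforc=0
       ! thread isub tests elements isub, isub+256, isub+512, ...
       do i=isub,ndata,256
          if(w(i).gt.tol) then
             iforc=iforc+1
             ! store the index of the selected element
             blk(ipx+iforc)=i
          endif
       enddo

       ! number of elements selected by this thread
       nsav(isub)=iforc
       call syncthreads()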

However, I found that most of the time is spent in the statement

blk(ipx+iforc)=i

It seems the problem is due to global memory latency.

Does anyone have a better solution?

Thanks in advance

Minghui

Hi yangmh,

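It’s not the latency. Rather, it’s that the data is not contiguous across threads in a warp, leading to memory divergence. With ipx=(isub-1)*100, neighbouring threads write to locations 100 elements apart, so the stores cannot be combined into a few wide memory transactions. To fix this, you’ll need to change your algorithm so that blk can be accessed using “i” instead of “ipx+iforc”.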
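
As a rough illustration of indexing the output by “i”, here is a minimal sketch, not a drop-in replacement for the code above. It assumes each thread writes a 0/1 flag at the same index it reads from w, so that neighbouring threads in a warp touch neighbouring addresses. The names select_mod, mark_above_tol and flag are hypothetical and not part of the original code.

       module select_mod
         use cudafor
       contains
         ! Hypothetical kernel: flag(i) is set to 1 where w(i) > tol, else 0.
         ! Reads of w(i) and writes of flag(i) use the same index i, which is
         ! consecutive across the threads of a warp.
         attributes(global) subroutine mark_above_tol(w, flag, ndata, tol)
           implicit none
           integer, value :: ndata
           real(8), value :: tol
           real(8) :: w(ndata)
           integer :: flag(ndata)
           integer :: i

           ! global 1-based index of this thread
           i = (blockidx%x - 1) * blockdim%x + threadidx%x
           ! grid-stride loop, so any launch configuration covers all of w
           do while (i <= ndata)
              if (w(i) > tol) then
                 flag(i) = 1
              else
                 flag(i) = 0
              end if
              i = i + blockdim%x * griddim%x
           end do
         end subroutine mark_above_tol
       end module select_mod

This only marks the selected elements. Gathering their indices into a packed list would still need a separate compaction step (for example a prefix sum over flag), which is not shown here.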

Hope this helps,
Mat

Hi, Mat,

Thanks for your help!

Does that mean that if the array w(ndata) is divided into several parts, and each part is processed in shared memory, the memory divergence could be avoided?

Minghui