Texture Unit in Pascal architecture

I imagine your comment was directed at the OP. However, in case you are referring to what I posted:

I used MIN2 because I was not trying to scrub the entirety of the code presented, and wanted to focus attention on the issue I had observed. I chose not to make any other changes to the code. As with any code, there may be numerous optimizations one could suggest; I was not claiming that my modification was an exhaustive treatment of optimizing the code.

My comment was in fact directed at the OP, and it’s a minor comment as such :-) It makes perfect sense that txbob focused narrowly on the issue at hand, as there seem to be too many issues in flight at the same time already …

MIN2 jumped out at me because its use seems to depart from the design principle of least surprise. It immediately raises the question: is there some special processing that MIN2 does that min() doesn't? That might be the case if, say, the data were floating-point and the default handling of NaNs prescribed by IEEE-754 weren't suitable for the present use case.
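
For illustration (an assumption on my part, since the OP's actual MIN2 definition is not shown in this thread), here is a typical MIN2 macro and the one place it can differ from the IEEE-754 minimum:

#include <cstdio>
#include <cmath>

//A typical definition (assumption; the OP's MIN2 is not shown in this thread)
#define MIN2(a, b) ((a) < (b) ? (a) : (b))

int main(void) {
  printf("%d\n", MIN2(3, 7));        //3: for ints, MIN2 and min() agree
  float x = nanf("");
  printf("%f\n", fminf(x, 1.0f));    //1.0: IEEE-754 fminf() returns the non-NaN operand
  printf("%f\n", MIN2(x, 1.0f));     //1.0: comparison false, macro yields b
  printf("%f\n", MIN2(1.0f, x));     //nan: comparison false, macro yields b
  return 0;
}

For integer data, as here, the two are interchangeable, which is why min() is the less surprising choice.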

Dear njuffa, txbob and BulatZiganshin,
Thank you very much for all your time and effort looking at my (defective :-() code.

So far (I hope to get more performance data later):

njuffa:
Yes, you are right: fml_i[y] (particularly the top end near y = length) is indeed read independently by many blocks.

txbob:
Yip, I think I have now understood, and I now do the shared memory reduction
only once. This was a big performance gain, more than doubling the
kernel's speed.
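
For later readers, the shape of this change (the "before" is my reconstruction; the original slower code is not reproduced in this thread):

//Before (reconstruction): the O(log BLOCK_SIZE) reduction of en[] ran
//inside the chunk loop, i.e. once per blockDim.x elements read
for(; y <= x; thread += blockDim.x, y += blockDim.x) {
  en[threadIdx.x] = min(fml_i[y] + fml_j[thread], en[threadIdx.x]);
  __syncthreads();
  //...full block reduction of en[] repeated here...
}
//After: the loop only accumulates a per-thread minimum; the block
//reduces en[] exactly once, after the loop (see the code posted below)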

Thank you for reporting the missing volatile bug.
(The bug did not show up in my testing.)
According to nvprof, fixing it only cost 2.8% of average kernel time.
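
For anyone who hits the same bug: without the volatile qualifier the compiler is free to keep en[ix] in a register across the warp-synchronous steps at the end of the reduction, so a lane can miss the value its neighbour has just written. A minimal sketch of the failure mode:

__shared__ int en[BLOCK_SIZE];     //missing volatile
//...
en[ix] = min(en[ix], en[ix+2]);    //compiler may hold en[ix] in a register
en[ix] = min(en[ix], en[ix+1]);    //...and not re-read what lane ix+1 just
                                   //wrote, so the final minimum can be stale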

BulatZiganshin:
In the loop which reads global memory (i.e. before the reduction),
I have replaced shared memory with registers, and nvprof says
this saves 1.4% of average kernel time.
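
For later readers, the change is along these lines (the "before" is my reconstruction; only the "after" appears in the code posted below):

//Before (reconstruction): running minimum accumulated directly in shared memory
en[threadIdx.x] = INF;
for(; y <= x; thread += blockDim.x, y += blockDim.x)
  en[threadIdx.x] = min(fml_i[y] + fml_j[thread], en[threadIdx.x]);

//After: accumulate in a register, touch shared memory once before the reduction
int value = INF;
for(; y <= x; thread += blockDim.x, y += blockDim.x)
  value = min(fml_i[y] + fml_j[thread], value);
en[threadIdx.x] = value;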

njuffa:
I have replaced MIN2(x,y) by min(x,y) and nvprof says this saves
1.3% of average kernel time.

I have also simplified my shared memory reduction (following
Mark Harris' slides) and, again according to nvprof, this has saved
1.3% of average kernel time.

Many thanks
Bill

Dear njuffa, txbob and BulatZiganshin,
I thought for completeness I should post the debugged version of the code.
Thank you again for all your help
Bill

//WBL 10 Jan 2018 clean up for https://devtalk.nvidia.com/default/topic/1012969/cuda-programming-and-performance/texture-unit-in-pascal-architecture/
//Was r1.78
#define INF 10000000 /* large sentinel, kept well below INT_MAX so sums cannot overflow */

//BLOCK_SIZE must be a power of two, 32 or greater
//BLOCK_SIZE 128 gave slightly lower performance on GeForce GTX 745 (4GB, compute capability 5.0)
#define BLOCK_SIZE 64

/*
Set the output to the smallest sum of fml_i and fml_j.
Note data in fml_i (especially at the top end) are read repeatedly,
but data in fml_j are read only once.
Each block calculates the smallest sum for one chunk and writes it (one int) to dml.
The outputs do not overlap.

The volume of work starts tiny (one addition) but grows quadratically as i decreases to 1.

typical use:
i 2909 down to 1
turn = 4
length = 2913
fml_i 2914 int
fml_j 4247155 int
dml 2914 int
I.e. the arrays are of fixed size but the part used grows linearly (fml_i, dml)
or quadratically (fml_j) as i decreases to 1.
Likewise the number of blocks grows linearly as i decreases to 1.
*/

//Use __restrict__ to give the compiler the best chance to optimise
__global__ void
kernel(
  const int i, const int turn, const int length,
  const int* __restrict__ fml_i, const int* __restrict__ fml_j,  //In
  int* __restrict__ dml) {                                       //Out

  const int x = blockIdx.x;
  const int j = x + (i + 2*(turn+1)) + 1;
        int y = threadIdx.x;
        int thread = j*(j-1)/2 + threadIdx.x + i + (turn+1) + 1; //index into triangular fml_j
  int value = INF;
  for(; y <= x; thread+=blockDim.x, y+=blockDim.x) {
    value = min(fml_i[y] + fml_j[thread], value); //running minimum stays in a register
  }
  volatile __shared__ int en[BLOCK_SIZE];
  en[threadIdx.x] = value; //must set whole of en

  const int ix = threadIdx.x;
#if BLOCK_SIZE >=1024
  __syncthreads(); if(ix < 512) en[ix] = min(en[ix], en[ix+512]);
#endif
#if BLOCK_SIZE >=512
  __syncthreads(); if(ix < 256) en[ix] = min(en[ix], en[ix+256]);
#endif
#if BLOCK_SIZE >=256
  __syncthreads(); if(ix < 128) en[ix] = min(en[ix], en[ix+128]);
#endif
#if BLOCK_SIZE >=128
  __syncthreads(); if(ix <  64) en[ix] = min(en[ix], en[ix+ 64]);
#endif
#if BLOCK_SIZE >=64
  //the barrier must be reached by every thread in the block,
  //so it must not sit inside if(ix < 32)
  __syncthreads();
#endif
  if(ix < 32) {
#if BLOCK_SIZE >=64
    en[ix] = min(en[ix], en[ix+ 32]);
#endif
    //the remaining steps run within one warp; en is volatile, so each
    //lane re-reads the value its neighbour has just written
    en[ix] = min(en[ix], en[ix+ 16]);
    en[ix] = min(en[ix], en[ix+  8]);
    en[ix] = min(en[ix], en[ix+  4]);
    en[ix] = min(en[ix], en[ix+  2]);
    en[ix] = min(en[ix], en[ix+  1]);
  }

  if(threadIdx.x==0){
    dml[j] = en[0];
  }
}
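
For context, a host-side driver consistent with the comment block above would look roughly like this. This is my sketch, not part of the original code: the function name run, the device pointers d_fml_i, d_fml_j and d_dml, and the loop bounds are all assumptions (the starting i matches the "i 2909 down to 1" example above, since length - turn = 2913 - 4 = 2909).

//Hypothetical driver loop (sketch only, not from the original post).
//The block count keeps j = blockIdx.x + i + 2*(turn+1) + 1 <= length,
//so it grows linearly as i decreases, as noted in the kernel's comments.
void run(const int turn, const int length,
         const int *d_fml_i, const int *d_fml_j, int *d_dml) {
  for(int i = length - turn; i >= 1; i--) {
    const int blocks = length - i - 2*(turn+1);
    if(blocks > 0)
      kernel<<<blocks, BLOCK_SIZE>>>(i, turn, length, d_fml_i, d_fml_j, d_dml);
  }
}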

Just to follow up: I have a technical report, “CUDA RNAfold” RN/18/02, just posted on BioRxiv, which expands on this and other CUDA hacks.

Comments most welcome
Bill