Texture Unit in Pascal architecture

Hi all,

I found out from nvprof that texture unit will still be used even I did not use any texture memory on Pascal Titan X. I found that in Pascal texture cache can be used as L1 unified memory for SM. Will the texture unit also be in use when L1/texture cache is in use?

Thanks,

The Maxwell and Pascal architecture combined the TEX and L1 cache into a single unified cache. All global, local, surface, and texture operations go through this cache.

http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce-GTX-750-Ti-Whitepaper.pdf p.6

“Pairs of processing blocks share four texture filtering units and a texture cache. The compute L1 cache
function has now also been combined with the texture cache, and shared memory is a separate unit
(similar to the approach used on G80, the first CUDA capable GPU), that is shared across all four blocks.”

I do not understand the question “Will the texture unit also be in use when L1/texture cache is in use?”. All local, global, surface, and texture operations go through the texture cache. Local and global operations do not perform LOD, sampling, filtering, and data type conversion.

Thank you for your post, it may save me some coding time.

I have two global arrays.

  1. Small and heavily read
  2. The second big and read only once.
    Both are read in order rather than at random.

My thoughts are there is no point caching the second,
but the compiler will not know this, instead both arrays
will be cached and so the first (small <12kbytes) will often be pushed
out of the cache by the second (<16Mbytes).

I was thinking to put the small heavily used array in a 1D int texture
and using the texture cache. But I read your post to mean on my
GeForce GTX 745 the L1 caches are combined, so there is no texture cache
and this approach will not work.
Have I understood correctly?

Is there another way?

How can I tell if the feared trashing of the caches is happening or not.

As always, any help or guidance would be most welcome
Bill

12 KB can be placed into the shared memory of each SM

another possibility is to use L2-only caching for the large array and L1-enabled caching for the small one:

https://nvlabs.github.io/cub/classcub_1_1_cache_modified_input_iterator.html#details

https://nvlabs.github.io/cub/group___util_io.html#gac5f2805ad56fdd0f2860a5421d76d9b9

well, LOAD_CS looks like a good fit for the large table, and LOAD_CA for the small one. you can experiment in order to find best modifiers. also, note that results may be different for other GPU generations since it’s a part of architecture that is especially frequently changed :)

Slightly simplifying what will happen: Given just these two arrays, elements from the small array will be read into to cache at most twice. According to what you stated, elements from the large array will be read only once, and may kick elements of the small array out of the cache at that time. The next access to an element from the small array will place it in the cache again, and it will be read from there many times. The loss in efficiency compared to the ideal case (large array is kept out of the cache altogether) seems minuscule.

As BulatZiganshin points out, manually caching the small array in shared memory seems like the way to go for ultimate performance.

I think your GTX 745 is compute capability 5.0 (GM107) so I would agree with your assessment. On cc5.0 L1/Tex/RO are unified. A few possibilities:

  • read the 2nd array using uncached loads

https://stackoverflow.com/questions/12553086/cuda-disable-l1-cache-only-for-one-variable

  • put the first array in constant memory if read patterns are uniform (same address across warp)
  • put the first array in shared memory (might be best/most performant option)

The profiler can help with assessing cache behavior. Take a look at the cache related metrics such as hit rate.

njuffa, your analysis is incorrect - outcome depends on the ratio of accesses. if we read A[i] once per 1000 reads of B[j], then cache with 1000 cells will drop the A[i] value before it will be requested again

Although, there are high chances that it will be not the case, and anyway - if there are so many accesses to B that 12KB-large A may be swapped out of 1-4 MB cache of modern GPUs, probably we don’t have enough accesses to A anyway to make any significant speed improvements by better caching it

That’s not how I interpreted the usage pattern stated by the OP. To first order, the single reading of the large array B serves to flush the cache completely, once. Accesses to the small array A thereafter will repopulate the cache with data from A, and any subsequent access (many, according to the OP), will now hit the cache.

Maybe the access pattern in the OP’s code is more complicated than the original description suggests. Maybe there is more data than just A and B. Maybe accesses to A and B are interspersed in interesting ways. Even so, given that A is the most frequently accessed data around, and by itself completely fits into the cache, caching should work very well for this situation.

It will be interesting to see how performance changes from reading both A and B through the cache, vs. caching A in shared memory manually. If the code does not do so yet, I would suggest use of ‘restrict’ and ‘const restrict’ pointers to allow the compiler the maximum freedom in arranging an “optimal” sequence of load instructions.

my bad, i was sure that my interpretation of this English code is only one possible. But English is awfully polyvalue language, we should ban its use for any serious conversations! :D

At this point I am actually not at all sure that my interpretation of the OP’s description is correct. In general, it is better to look at actual code than read high-level descriptions of it.

Many thanks for all your kind thoughts.

In the hope that this will help I am going to try and post the code of
my kernel:

//WBL 3 Jan 2018 clean up for https://devtalk.nvidia.com/default/topic/1012969/cuda-programming-and-performance/texture-unit-in-pascal-architecture/
//Was r1.65

#define INF 10000000 /* (INT_MAX/10) */
#define MIN2(A, B)      ((A) < (B) ? (A) : (B))

//BLOCK_SIZE 128 only slightly lower performance on GeForce GTX 745 (4GB compute capability 5.0)
#define BLOCK_SIZE 64

/*
kernel, excepting data which is set to INF, set the output to the smallest sum of fml_i and fml_j
note data in fml_i (especially at the top end) are repeatedly read
but data in fml_j are only read once.
Each block calculates the smallest sum for once chunk and writes it (one int) to dml.
The outputs do not overlap.

The volume of work starts tiny (one addition) but grows quadratically as i reduces to 1.

typical use:
i 2909 down to 1
turn = 4
length = 2913
fml_i 2914 int
fml_j 4247155 int
dml 2914 int
I.e. arrays are of fixed size but part used grows linearly or quadratically (fml_j)
as i decreases to 1.
Identically the number of blocks grows linearly as i decreases to 1.
*/

//tried __restrict__ and got slightly lower performance on GeForce GTX 745
__global__ void
kernel(
  const int i, const int turn, const int length,
  const int* fml_i, const int* fml_j,  //In
  int* dml) {                          //Out

  const int x = blockIdx.x;
  const int j = x + (i + 2*(turn+1)) + 1;
        int y = threadIdx.x;
        int thread = j*(j-1)/2 + threadIdx.x + i + (turn+1) + 1;
  int decomp = INF;
  __shared__ int en[BLOCK_SIZE];
  for(; y <= x; thread+=blockDim.x, y+=blockDim.x) {
    //assert(x>=0 && x<=length);
    //assert(y>=0 && y<=length);
    //assert(y<=x);

    en[threadIdx.x] = ((fml_i[y] != INF ) &&   (fml_j[thread] != INF))?   fml_i[y] +   fml_j[thread] : INF;

//Use reduction, require power of two block size
#if BLOCK_SIZE > 32
  #define SYNC32 __syncthreads()
#else
  #define SYNC32
#endif
    int ix = threadIdx.x;
    const int ix_stop  = MIN2(x-y+threadIdx.x, blockDim.x - 1);
      //assert(ix_stop >= 0 && ix_stop < blockDim.x);
      //assert(en[ix] > -INF && en[ix] <= INF);
      //assert(en[ix] != 0);   //for testing only
#if BLOCK_SIZE >=1024
  __syncthreads(); if(ix+512 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+512]);
#endif
#if BLOCK_SIZE >=512
  __syncthreads(); if(ix+256 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+256]);
#endif
#if BLOCK_SIZE >=256
  __syncthreads(); if(ix+128 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+128]);
#endif
#if BLOCK_SIZE >=128
  __syncthreads(); if(ix+ 64 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+ 64]);
#endif
#if BLOCK_SIZE >=64
  __syncthreads(); if(ix+ 32 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+ 32]);
#endif
  SYNC32;          if(ix+ 16 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+ 16]);
  SYNC32;          if(ix+  8 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+  8]);
  SYNC32;          if(ix+  4 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+  4]);
  SYNC32;          if(ix+  2 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+  2]);
  SYNC32;          if(ix+  1 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+  1]);
  SYNC32;
  decomp = MIN2(decomp,en[ix]);
  }//endfor

  if(threadIdx.x==0){
    dml[j] = decomp;
  }
}

Thank you everyone.

The same volume of data are read from the small array and the large array
essentially sum = A[i] + B[j]
but i takes many of the same small (<3000) values over and over again,
whereas j (<4500000) is never repeated.

My understanding may be out of date but:

  1. I had bad experience with constant memory early on (CUDA 3.1)
    Each thread in each warp will access an adjacent int in constant memory
    I suspect this will cause it to serialise reads.
    My feeling is constant memory is only being retained for compatibility.
    I seem to remember (ages back) constant memory was actually implemented by
    a tiny (2kb??) read only cache.
  2. Again I may be out of date, but reading data into shared memory means it
    can only be used by the block that read it?

I will re-read your many helpful suggestions.
Many thanks
Bill

I am having an impossible time trying to assess the access pattern for fml_i in my head. Is this some sort of triangular structure? So I wouldn’t want to predict what happens to the cache here. The code certainly looks very different from what I had envisioned based on the original description; I withdraw my earlier comments, they don’t seem to apply at all.

Manually caching fml_i in shared memory seems advisable and trivially possible (unless BLOCKSIZE > 1024, but then a BLOCKSIZE > 1024 would not normally be recommended, since one would want have at least two thread blocks active per SM for good performance).

By benchmarking the original code, as posted, against a version using shared memory for fml_i, we would get a reasonable idea about the amount of destructive interference in the cache, but only indirectly. At this point I am actually quite curious what the result would be, it should help build intuition for similar usage patterns.

BTW, does the comment about restrict being counter-productive in terms of performance still apply? That may be owed to compiler artifacts related to load ordering / batching, or maybe register pressure effects (as pulling out loads to be performed early may increase the live range of variables).

Shared memory is shared by all threads in a thread block, correct.

You are correct, constant memory is built for broadcast access across a warp. If different threads in the warp present different addresses, the hardware will serialize the access and replay the load for as many times as different addresses occur across the warp. This is an obvious performance penalty. Last I did experiments (Kepler architecture), use of constant memory would still be a win if the average number of unique addresses presented is < 2.5 (across the small number of scenarios I examined).

Constant memory is still needed for literal constants, both those introduced by programmers and compiler-generated ones (the latter go to a different constant bank than the former). Constant memory (yet another bank) is also used to pass kernel arguments. So constant memory is not obsolete. With the broadcast access, an access to constant memory is basically the same cost as a register access.

If you implement a function by table lookup, constant memory might still be a good place to consider. I looked at a real-life use case for that not too long ago, where I examined the trade-off between on-the-fly computation and table lookup. Normally, I push in the direction of using computation, as “FLOPs are too cheap to meter” and computation is energetically advantageous compared with memory access. However, in this particular case it turned out (contrary to my expectation) that table access won in terms of performance, because the serialization had only a minor impact.

  1. if i understood the kernel right, thread block doesn’t reuse any values, so forget about caching data into shared memory

  2. so, the old suggestions apply - direct compiler to use non-caching read for the large array and try various forms of caching for the small one

  3. most part of your code is the block reduce algorithm, you may try ready-to-use supeer-optimized implementation from https://nvlabs.github.io/cub/classcub_1_1_block_reduce.html#details

  4. may be, code may further optimized by prefetching data in parallel with computations. the simplest way to implement it is to use CUB’s BlockLoad algo, followed by multi-item BlockReduce

apologies my reply seems to have been dropped:-(
here it is again…

Dear njuffa,
Thank you again.

Yip you are right the large array is essentially a triangular matrix.

Hmm perhaps my comments about the role of kernel input i are misleading.
The kernel is going to be launched a huge number of times. Each time
the triangular matrix gets bigger and i is different on each launch.
If the triangle matrix side is n, the kernel has to do approx n*n/2 operations.

Am I also correct in thinking that shared memory still cannot be used by threads
outside the current block. So I would have to change the way blocks are used to
be able to effectively used shared memory to cache the small array?

Many thanks
Bill

Am I misreading the posted code? It looked to me that there is re-use of fml_i[y], but it seems other posters disagree :-)

Manual caching in shared memory would mean that you copy the entire contents of fml_i into a shared memory array sh_fml_i at the start of the kernel (based on the comments regarding its size, this should be possible). Replace all instances of fml_i with sh_fml_i after that.

When I made my comment here:

https://devtalk.nvidia.com/default/topic/1028130/cuda-programming-and-performance/best-way-to-find-many-minimums/post/5229816/#5229816

I meant that you should do this:

__global__ void
kernel(
  const int i, const int turn, const int length,
  const int* fml_i, const int* fml_j,  //In
  int* dml) {                          //Out

  const int x = blockIdx.x;
  const int j = x + (i + 2*(turn+1)) + 1;
        int y = threadIdx.x;
        int thread = j*(j-1)/2 + threadIdx.x + i + (turn+1) + 1;
  __shared__ int en[BLOCK_SIZE];
  en[threadIdx.x] = INF;
  for(; y <= x; thread+=blockDim.x, y+=blockDim.x) {
    //assert(x>=0 && x<=length);
    //assert(y>=0 && y<=length);
    //assert(y<=x);
    int temp = ((fml_i[y] != INF ) &&   (fml_j[thread] != INF))?   fml_i[y] +   fml_j[thread] : INF;
    en[threadIdx.x] = MIN2(en[threadIdx.x], temp);
    } //endfor
//Use reduction, require power of two block size
#if BLOCK_SIZE > 32
  #define SYNC32 __syncthreads()
#else
  #define SYNC32
  // this is a defective case
#endif
    int ix = threadIdx.x;
    const int ix_stop  = MIN2(x-y+threadIdx.x, blockDim.x - 1);
      //assert(ix_stop >= 0 && ix_stop < blockDim.x);
      //assert(en[ix] > -INF && en[ix] <= INF);
      //assert(en[ix] != 0);   //for testing only
#if BLOCK_SIZE >=1024
  __syncthreads(); if(ix+512 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+512]);
#endif
#if BLOCK_SIZE >=512
  __syncthreads(); if(ix+256 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+256]);
#endif
#if BLOCK_SIZE >=256
  __syncthreads(); if(ix+128 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+128]);
#endif
#if BLOCK_SIZE >=128
  __syncthreads(); if(ix+ 64 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+ 64]);
#endif
#if BLOCK_SIZE >=64
  __syncthreads(); if(ix+ 32 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+ 32]);
#endif
  SYNC32;          if(ix+ 16 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+ 16]);
  SYNC32;          if(ix+  8 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+  8]);
  SYNC32;          if(ix+  4 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+  4]);
  SYNC32;          if(ix+  2 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+  2]);
  SYNC32;          if(ix+  1 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+  1]);
  SYNC32;

  if(threadIdx.x==0){
    dml[j] = en[0];
  }
}

Allow the for loop to stride through memory, performing your reduction along the way.

Only do the shared-memory sweep ONCE, at the end of the processing.

By the way, I’m not suggesting this is an exhaustive treatment of “how to optimize this code”. I’m really just focusing on one issue here, which I previously mentioned, and the code modification above is only intended to clarify that one concept. There may be numerous possible additional optimization suggestions/possibilities.

The code I posted (from OP) also contains a bug in the use of warp sync reduction for BLOCK_SIZE <= 32. In this case, the code will dispense with __syncthreads(), but this should only be done if the shared memory pointer in use at the warp level is marked volatile. The code as posted does not do that, so it may break in this case. I would not use the code as-is when BLOCK_SIZE is 32 or less.

Apart from that known issue, the code is entirely untested by me, and is merely a mechanical transformation of the code supplied by OP to identify a particular concept. The code may have any number of defects in it. Use it at your own risk.

njuffa, the loop is

for(; ...; thread+=blockDim.x, y+=blockDim.x)  
    A[y] + B[thread]...

so, with increment by blockDim.x, each next wave of y/thread values leapfrogs over values used in previous loop cycle.

yes

code provided by txbob is much better, though. i’m not sure about compiler smartness, so it may be better to replace

en[threadIdx.x] = INF;
  for(...) {
    en[threadIdx.x] = MIN2(en[threadIdx.x], temp);
  }

with

int value = INF;
  for(...) {
    value = MIN2(value, temp);
  }
 en[threadIdx.x] = value;

Side remark: Why use MIN2(), when CUDA supports min(), suitably overloaded for a large number of elemental types, and maps that directly to a single hardware instruction for e.g. ‘int’ and ‘float’?