Should I always use some shared to maximize use of L1?

dscerutti · July 10, 2017, 7:49pm

Hello,

I’ve been having some success writing fast kernels, but I am now faced with a situation where I can do everything in registers–no need for any information to go into shared–and I realize that I don’t have a deep understanding of what makes the best use of L1 resources. I can certainly set the kernel to “preferL1” to get 16kb shared and 48kb registers, but does that not leave 16kb of the overall L1 cache space unused? Should I write my code in such a way as to stuff a portion of the information I could otherwise have in registers into shared arrays? There is information that I will need to reference at roughly 1/4 or 1/16th the rate of other information, so if I stuff the less frequently needed data into a 16kb shared partition will that help clear more room for threads to place the data they most urgently need in registers?

Cheers!

Robert_Crovella · July 11, 2017, 12:13am

L1 cache and register space are not the same thing. The separation is 16KB shared and 48KB L1, or vice-versa.

Different GPU architectures have different possible separation choices between shared and L1, but in many cases you cannot select 0KB shared.

Selecting 48KB L1 and 16KB shared does not leave 16KB of L1 cache space unused - since on most architectures (except Volta) it was not possible to select 0KB shared.

Generally speaking, shared is a good optimization (better than registers, better than cache) if you can intelligently think of a way as a programmer to use it. That’s admittedly a fuzzy/circular definition, but its often referred to as a “user-managed cache”.

Normally, you don’t get to explicitly “put things in registers”. The compiler does that for you. If you’re expert enough to know (or inspect) exactly what is in registers, and you can keep your entire working set in registers, then I would see no particular reason to use shared.

But for the less expert of us, using shared is often a win, and on most architectures, using up to 16KB of shared is not going to make the L1 scenario any worse.

Using shared can also have implications for occupancy, this may be a bigger perf factor than any implications for register storage. Using 16KB of shared per threadblock will limit your occupancy if you select only 16KB of shared memory total. But using ~1KB of shared per threadblock is not likely to have any occupancy impact, and could be a win if you can figure out what to put there and why.

dscerutti · July 11, 2017, 5:28am

Yes, I post on this forum fishing for replies from txbob (and njuffa). The good stuff. Haven’t caught a Mark Harris quite yet, but I know they’re marbled with +rep and have a crisp, insightful flavor.

I think I’m gaining an understanding of what “register space” really means. I had read Mark Harris’s description on StackExchange from a long time ago, and I’ll just go with his statement “scalar variables will be stored in registers by the compiler.” So, I’m trying to keep the number of scalar variables that are relevant to the calculation around at any one time small, to keep the compiler from stuffing things off in global for later retrieval, but I’m looking at the register space as something that is managed by the compiler and transiently used to hold data for whatever variables are relevant to the calculation at any particular line of the code. Better?

Occupancy is something that I’m gaining a mastery of as my CUDA coding matures. The solution I’m actually working towards, at this point, will be one with a larger portion of shared, but I will be trying to access it in the cleanest manner I can (every thread in the warp accesses data held by a different bank on every read). There are 1024 threads in the block on this new function (different but related to the one in my other topic that txbob has kindly replied to), BATCHSIZE is 256 not 240, and the warp specialization is cleaner for it.

njuffa · July 11, 2017, 5:50am

These days, Mark Harris shares his wisdom primarily on the Parallel Forall blog: [url]https://devblogs.nvidia.com/parallelforall/[/url]. I rarely see him active in these forums or in Stackoverflow’s CUDA tag anymore.

Thread-local variables go into local memory by default. If you look at the SASS (machine code) generated for debug builds, which is built with all optimizations turned off, I think you’ll find that is where they live. Local memory is simply a separately mapped portion of global memory, with a per-thread mapping.

As an optimization, the compiler may move thread-local into registers. This is pretty much always possible for scalars. Arrays can be moved into register if they are sufficiently small and if all addressing on them is compile-time constant (since the register file is not dynamically indexable like memory). Use of variables in source code is only loosely correlated to register pressure since the compiler takes care of instruction scheduling and register allocation. Various optimizations can increase register pressure: early scheduling of loads can increase life ranges, loop optimizations can create induction variables, common subexpression elimination (CSE) can create the need for new temporary variables, etc, etc. Other variables from source code may simply disappear as computations are re-arranged.

The primary purpose of shared memory is to hold data that is not thread-local, but is shared between threads. Obviously it can also be used as a cache for thread-local data if so desired, provided there is re-use.

As for thread-block sizing: Where possible, it is usually better to use multiple small blocks than large blocks. The smaller granularity often leads to a better utilization of resources. If I recall correctly, Scott LeGrand was one of the people who pushed for GPUs to support very small thread blocks at good occupancy, something that is reality today. While high occupancy is sometimes important for performance, that is not always so. There is a famous paper by Vasily Volkov giving concrete examples how good performance may also be achievable with fairly low occupancy. I can’t find the paper right now, but his slide presentation “Better Performance at Lower Occupancy” can be found in NVIDIA’s GTC archives: [url]Page Not Found | NVIDIA. Volkov’s 2016 PhD thesis “Understanding Latency Hiding on GPUs” is here: [url]https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.html[/url]

dscerutti · July 11, 2017, 6:51am

Good to hear from you, njuffa!

I had tried to use higher numbers of smaller blocks; at least in once case I was getting equivalent performance with 128 threads per block versus 1024 (though it was in another kernel than the one we have been discussing). My understanding of occupancy and performance as a function of block sizes has been on a steep curve lately, so perhaps I should revisit that.

I know I got scared off about a week ago reading that the maximum number of blocks per SM was 8. But, that was way back on C2050, or maybe even earlier. Judging by the rest of the code I inherited, it seems that 16 may now be the limit; at least that’s the most I can specify in the launch_bounds before the compiler throws a warning “I’m ignoring your request.” Is that correct?

Cheers,
Dave

njuffa · July 11, 2017, 7:03am

There is a handy table in this blog post: https://devblogs.nvidia.com/parallelforall/inside-pascal/

Presumably this data should also be in an appendix of the Programming Guide for CUDA 8, but I haven’t checked.

What architectures are you building for when __launch_bounds() limits you 16 threads blocks? I assume that as long as Kepler is in the mix, 16 is the correct limit to enforce. If the compiler still limits you to 16 thread blocks in a __launch_bounds() attribute when you build only for sm_50 or higher, that would seem to be a bug.

dscerutti · July 11, 2017, 7:14am

Yes, indeed Kepler is still in the mix–I’ve dropped support for architectures below that, but it’s good to see that now 32 is the standard.

tera · July 11, 2017, 10:03pm

Note that you can always use preprocessor conditionals based on__CUDA_ARCH__ if you can’t find a single launch_bounds() specification that is optimal amongst all targeted device architectures.

Should I always use some __shared__ to maximize use of L1?

Should I always use some shared to maximize use of L1?