Performance degradation in 7.0. Silly handling of constant memory in SASS vs 6.5

So I tried updating to Cuda 7.0 on ubuntu 14.04 running on some 980GTXs. I had massive performance degradation (at least 4 times slower), I think I’ve narrowed to cause down to the PTX to SASS compiler being silly.

So I’m unsure If I can publicly post the code so I’ll put the parts I think may be relevant.

typedef struct {
    int vals[QRY_LEN];
} query_t;

__global__ void
lookup_kernel(const int * const __restrict__ index, gph_kern_result * __restrict__ candidates, query_t qry);

Now the code basic looks at subsets of the integers of index and qry and looks for minimum bit differences. So the main loop consists of things functionally equivalent to.

biterr = __popc(qry[x] ^ index[y]) + __popc(qry[x + 32] ^ index[y + 1]) + __popc(qry[x + 64] ^ index[y + 2]);
best_biter = min(best_biter, bitter);

I’ll look at sm_50 SASS since I know how to extract that from 6.5 code, though the ptx is probably getting recompiled for the 980’s sm_52.
Now in 6.5 The __popc(qry ^ index[y]) get’s transformed directly into a few instruction. NOTE the IADD is for a previous __popc. R22 is from a previous global load.

126         /*0328*/                   LOP.XOR R28, R22, c[0x0][0x2ac];                /* 0     x4c4704000ab7161c */
 127         /*0330*/                   IADD R29, R29, R24;                             /* 0     x5c10000001871d1d */
 128         /*0338*/                   POPC R28, R28;

So the core loop is 3 instructions per __popc (xor, pop, add / min)

But in 7.0 the SASS compiler seems to think the constant value from qry needs to be loaded a single byte at a time and SHLed, ORed and LOP3.LUTed back together into the original int. Note I just copied and pasted various pieces that show the relevant instructions

76         /*01a8*/                   LDC.U8 R4, c[0x0][R18+0x82];                                   /* 0xef90000008271204 */
   77         /*01b0*/                   LDC.U8 R7, c[0x0][R18+0xd7];                                   /* 0xef9000000d771207 */
   78         /*01b8*/                   LDC.U8 R3, c[0x0][R18+0x102];
100         /*0268*/                   I2I.U32.U8 R11, R11;                                           /* 0x5ce0000000b7020b */
  101         /*0270*/                   LDC.U8 R19, c[0x0][R18+0x157];                                 /* 0xef90000015771213 */
  102         /*0278*/                   I2I.U32.U8 R4, R4;
 120         /*0308*/         {         SHL R11, R11, 0x8;                                             /* 0x3848000000870b0b */
  121         /*0310*/                   I2I.U32.U8 R8, R8;        }                                    /* 0x5ce0000000870208 */
  122         /*0318*/         {         LOP.OR R9, R5, R4;                                             /* 0x5c47020000470509 */
  123         /*0328*/                   I2I.U32.U8 R4, R21;        }
 340         /*09e8*/                   I2I.U32.U8 R16, R31;                                           /* 0x5ce0000001f70210 */
  341         /*09f0*/         {         SHL R31, R10, 0x10;                                            /* 0x3848000001070a1f */
  342         /*09f8*/                   LDC.U8 R30, c[0x0][R18+0x155];        }                        /* 0xef9000001557121e */
  343                                                                                                   /* 0x041fc000822007f0 */
  344         /*0a08*/         {         LOP3.LUT R22, R31, R28, R22, 0xfe;                             /* 0x5be70b0fe1c71f16 */
  345         /*0a10*/                   I2I.U32.U8 R27, R21;        }

So now 3 instructs is like 10 and register usage explodes, which with my 32 register limit (optimal for 6.5) leads to lots of locals, so performance is decimated.

I could almost understand this behavior if qry could be unaligned but it’s passed by value, so I have no idea why it would decide to load int constants a byte at a time.

I obviously sticking with 6.5 until a find a work around somewhere, or 7.0 gets a bug fix. Though I was hoping to use a couple features from 7.0 they’re not critical.

You’re likely to get more traction if you can provide a short sample code that someone else can compile (without having to add anything) and play with. Note that this doesn’t mean posting your whole code, but instead a short example that demonstrates the issue.

If you can create a simple reproducible example that shows this kind of difference, you probably just want to file a bug with nvidia, which can be done at

txbob gave some good advice. While a bug in the compiler is always possible, in recent years the CUDA toolchain has stabilized to the point that when something weird is going on, it is more likely driver error than a due to a compiler bug.

Based on that, I think you would want to double check the builds. Are both builds release builds? The description of a 4x slowdown, increased use of local storage instead of registers, etc. suggests the build in question could be a debug build (compiled with -G for example, or lowered PTXAS optimizations such as -Xptxas -O1).

According to the code snippets provided, qry is an array of elements of type ‘int’. I am not aware of a compiler transformation (i.e. optimization used in release builds) that would break accesses to int into individual byte accesses. Does the source code have alternate code paths by any chance, one of which uses int and the other uchar4?

Without buildable source code that demonstrates the issue it is impossible to provide more than speculation here. Self-contained short repro code would be useful, as txbob pointed out.

It’s a definitely release build (all I do is swap /usr/local/cuda to point to the other cuda version). I was hoping someone had an idea before I spent the time building a test case. I’ll start doing that since it seems necessary.

Alright I’ve think I have a small example. That show the issue

I compiled with

nvcc -ccbin g++ -m64  -Xptxas="-v" -gencode arch=compute_50,code=sm_50

With 7.0 it takes 26.4 ms and 6.5 takes 3.84 ms. The difference in speed is due to the excessive locals in 7.0 and all those locals are most likely from the weird constant loading I mentioned initially.

#include <stdio.h>

const int INDEX_SUBSAMPLE 	= 32;
const int QRY_LEN 		= 256;
const int BLOCKS                = 8;
const int THREADS_PER_BLOCK     = 1024;
const int INDEX_LEN		= 256 * 1024 * 1024 / INDEX_SUBSAMPLE;

typedef struct {
	int vals[QRY_LEN];
} query_t;

__global__ void
__launch_bounds__(THREADS_PER_BLOCK, 2)
lookup_kernel(query_t qry, const int * const __restrict__ index, int * __restrict__ result)
	const int * const __restrict__ query = qry.vals;
	int best_loc 			= 0;
	int best_be             = QRY_LEN * 32 + 1;
	for(int loc = threadIdx.x; loc < INDEX_LEN; loc += blockDim.x * gridDim.x){
		int old_best_be 		= best_be;
		for(int j = 0; j < 16; j += 1) {
			int cur_be = 0;
			for(int i = 0; i < 8; i++)
				cur_be += __popc(query[j + i * INDEX_SUBSAMPLE] ^ index[loc + i]);

			best_be = min(best_be, cur_be);
		if(best_be < old_best_be) {
			best_loc = loc;
	result[blockDim.x * blockIdx.x + threadIdx.x] = best_loc;

	query_t qry = {0};
	int *d_index = NULL;
	cudaMalloc(&d_index, sizeof(int) * INDEX_STORE_LEN);
	int *d_res = NULL;
	cudaMalloc(&d_res, sizeof(int) * CANDIDATES_LEN);
	int *h_res = NULL;
	cudaHostAlloc(&h_res, sizeof(int) * CANDIDATES_LEN, 0);
	cudaEvent_t start, end;

	lookup_kernel<<<BLOCKS, THREADS_PER_BLOCK>>>(qry, d_index, d_res);
	cudaMemcpyAsync(h_res, d_res, sizeof(int) * CANDIDATES_LEN, cudaMemcpyDefault); 

	float ms = -1.0f;
	cudaEventElapsedTime(&ms, start, end);
	printf("Took %f ms\n", ms);

	return 0;

To look at the SASS code you can use

cuobjdump -sass a.out

A quick thought. What happens if you leave off the launch_bounds attribute? I am wondering whether this is a case where the CUDA 7.0 compiler wants to use one additional register compared to CUDA 6.5, but runs up against the hard limit imposed by launch_bounds causing massive spilling and/or recomputation to make it fit. Have you tried using #unroll pragmas on both of the innermost for-loops?

[Later:] Never mind about the #pragma unroll, those loops get fully unrolled by the compiler without such annotation, at least with CUDA 6.5. It seems that left to make its own choices, the compiler will not fit the kernel into 32 registers for sm_50. I have not studied the code closely yet, but it seems trying to squeeze the code down to 32 registers per thread (and potentially triggering spilling) could be avoided at minimal performance loss by removing the launch_bounds attribute and using blocks of 128 threads. I don’t have an sm_50 device to try it but it seems worth giving a whirl.

Now that you have a standalone repro code, I would recommend filing a bug right away, independent of any workarounds we may be able to puzzle out in this thread. The bug reporting form is linked directly from the registered developer web page.

Without the launch bounds cuda 6.5 uses 34 registers and 7.0 uses 167 (and can’t launch with 1K threads per block, not that my hacky code shows that error). Now if I reduce the threads to 64, such that they both can launch 7.0 is faster at 16ms and 6.5 at 19ms. Of course the far greater occupancy of the 6.5 one will still win when I start many of these kernels at once. I suspect the optimizer has a new trick that doesn’t work well under register pressure.

I can verify your CUDA 6.5 GTX 980 running time (in Windows though via nvprof):

Took 3.423360 ms
==7128== Profiling application: ConsoleApplication1.exe
==7128== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
205.78ms  3.4096ms              (8 1 1)      (1024 1 1)        32        0B        0B         -           -  GeForce GTX 980         1         7  lookup_kernel(query_t, int const *, int*) [187]
209.20ms  4.5430us                    -               -         -         -         -  32.768KB  7.2129GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoH]

If you would like to see the sass output let me know. Probably should be the same despite the different OS.

Using finer granularity thread blocks is usually a good idea to maximize occupancy. As a sanity check, if you use the smaller thread blocks plus launch_bounds to squeeze the code into 32 registers per thread, do you get back an occupancy of 1.0 for CUDA 6.5, and get identical performance to the 2x1024 thread configuration? Or does the thread block granularity interfere with the address pattern presented to the memory controller.

I do not have an explanation for your observations regarding the behavior of the CUDA 7.0 toolchain, the compiler behavior in CUDA 7.0 seems bizarre, more like a plain old bug somewhere rather than a misguided heuristic. In checking out the source code, I did not spot anything out of whack, like invoking undefined or implementation defined behavior. Definitely worthy of a bug report.

Thanks I expect the SASS to look like mine since the performance is similar with 6.5.

With 64 threads and 128 blocks (to max out occupancy).

no launch bounds
6.5 1.998816
7.0 3.493888 ms

__launch_bounds__(64, 32) //32 registers per thread
6.5 1.977312 ms
7.0 24.513376 ms

Surprisingly the 7.0 isn’t as bad as I thought, though it still take 75% longer which is pretty bad. In the real app running with 1024 threads offers some additional gains not represented in my example.

I’ll file a bug report once my registration has been processed.

Is running with the 64-thread blocks and without the launch_bounds attribute possible in the context of the real kernel, and thus an acceptable workaround for now? 1.75x slowdown is better than 4x slowdown.

There definitely seems to be something out of whack with the CUDA 7.0 compiler, but I have no idea what that could be, the code from the reduced test case seems quite straightforward. Just to confirm, you are seeing this behavior with CUDA 7.0 final release, not the previously available CUDA 7.0 release candidate?

Approval for the registered developer program usually happens within one business day (relative to US Pacific or Mountain time). Since this is the weekend, and depending on which time zone you are in you will probably hear back sometime Tuesday.

Opps I need more for occupancy

no launch bounds (1024 blocks x 64 threads)
6.5 2.183168
7.0 2.449824 ms

The amount of work per thread reduced so 6.5 became worse than 128 blocks. 7.0 got better, though 6.5 still has a noticeable advantage. I’m not finding a situation where 7.0 makes more efficient use of the resources with this code.

This is Cuda 7.0 final. I’ll just run 6.5 until the newest version matches or exceeds it’s performance for my use case. Of course filling a bug report is the best way to make that happen. 1.75x slow down isn’t worth the trade off for me. It’ll almost double costs and this code is intended to be run continuously for years and the costs are rather high (though still much cheaper than the CPU variant).

I’m also seeing a HUGE performance regression. I have 38 kernels in one module that all went from zero spills (on CUDA 5.5, 6.0, 6.5 and 7.0 RC) to 270+ bytes of spills. The module hasn’t been touched in many months.

A 7x performance drop. Not good at all.

The target is a Maxwell GTX 750 Ti.

I’m also seeing unexplained Nsight debug failures (posted in the Nsight forum).

does this affect only Maxwell hardware or also Kepler ?

would be bad for VS 2013 users (like us) if that is a ‘general’ bug affecting a lot of kernels… because using Cuda Toolkit 6.5 is not possible (at least when using texture objects and maxwell hardware) because of this bug:

To get broader test coverage, maybe nVidia should consider distributing toolkit previews and beta releases as a public beta program.

Nvidia are also distributing beta drivers publicly (all that is required is setting a checkbox in the download form), so why not also extend this to the toolkit.


I’m surprised at the performance drop since I was happily running the 7.0.18 “beta/rc” with few problems.

It looks like a ptxas regression to me.

Just submitted bug report.