CUDA assembler woes

My kernel is taking a lot more registers than I anticipated, so I took a look at the disassembly with nvdisasm. The generated code is less smart than I expected.

Exhibit A:
FMUL R11, RZ, R4;
Multiply a value by zero, yielding a zero. I’ve seen add with zero, too.

The main problem I’m having is that I have a lot of parameters to pass to my kernel. So I moved the parameters to a structure which I pass directly to the kernel. Here’s the resulting disassembly.

MOV R19, c[0x0][0x15c];
STL.64 [R0+0x8], R10;
MOV R22, c[0x0][0x160];
STL.64 [R0+0x10], R14;
MOV R23, c[0x0][0x164];
STL.64 [R0+0x18], R18;
MOV R26, c[0x0][0x168];
STL.64 [R0+0x20], R22;
MOV R27, c[0x0][0x16c];
MOV R30, c[0x0][0x170];
STL.64 [R0+0x28], R26;

followed by 3 local loads for the 3 values that are actually needed (I commented out most of the kernel).

In other words, the compiler takes values from constant memory and moves them to local memory and then loads them back from local memory. I guess that means I’ll have to pass every parameter explicitly.

I lost some confidence in the compiler. I’d like some pointers as to what I can expect the compiler to get right.

Question 1)

Can the compiler resolve static memory accesses in constant memory correctly, as in the following (same index for all threads)?
Does it still work if I use a dynamic index that is the same for all threads in the warp?

__global__ void kernel(int foo[3])
{
    int val = foo[0];
}

Question 2)

Can the compiler fold all constants if I use static iteration counts in several nested loops with monotonically increasing indices for register array accesses (see the sketch below)?

In which cases do I need to unroll loops manually using macros?
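For illustration, the kind of loop nest I mean (simplified sketch, not my actual kernel):

__global__ void kernel(float *out)
{
    float acc[4] = {0.f, 0.f, 0.f, 0.f};        // intended to live in registers

    for (int i = 0; i < 2; ++i)                 // static trip counts
        for (int j = 0; j < 2; ++j)
            acc[i * 2 + j] += (float)(i + j);   // index is a compile-time constant once the loops are unrolled

    out[threadIdx.x] = acc[0] + acc[1] + acc[2] + acc[3];
}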

Question 3)

Does the compiler fail to optimize when it sees computations involving uninitialized registers? I’ve seen some weird behavior in the disassembly.

Thanks.

Did you disassemble an optimized build or a debug build?

Debug builds tend to use local memory for just about anything.

Christian

-g O2

To answer my own question 1).

No, you can’t pass int foo[3] to the kernel that way: the array name decays to a pointer into global or host memory, not an array in constant memory.

What’s the proper way to copy an array to constant memory? It isn’t a constant array, so don’t tell me about __constant__.

  1. you can pass a struct containing foo[3] by value as a kernel argument or as a device function argument (sketch below)

  2. cudaMemcpyToSymbol is the proper way
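Roughly like this (untested sketch; all names are made up):

// (1) Wrap the array in a struct and pass it by value.
struct Params { int foo[3]; };

__global__ void kernel_by_value(Params p)
{
    int val = p.foo[0];                        // read straight from the argument
}

// (2) Copy the data to a __constant__ array before launching.
__constant__ int c_foo[3];

__global__ void kernel_from_constant()
{
    int val = c_foo[0];                        // read from constant memory
}

void launch(const int host_foo[3])
{
    cudaMemcpyToSymbol(c_foo, host_foo, 3 * sizeof(int));
    kernel_from_constant<<<1, 32>>>();
}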

Why don’t you just load all these kernel parameters into shared memory for access by each thread in the block? That way you would have more control over the local access patterns.

Each SM (Maxwell, Pascal) has 96K of shared memory, so that gives you quite a bit of room.
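Rough sketch of what I mean, assuming the parameters arrive as a struct argument (names are made up):

struct Params { float *buf[3]; int n; };       // hypothetical parameter block

__global__ void kernel(Params p)
{
    __shared__ Params sp;                      // staged copy visible to the whole block
    if (threadIdx.x == 0)                      // assumes a 1-D block
        sp = p;
    __syncthreads();

    float *ptr = sp.buf[blockIdx.z];           // all threads read from shared memory
    // ... use ptr and sp.n ...
}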

The compiler folds constants aggressively. Can you show a particular example where you expected constant folding to occur, but it doesn’t happen?

Note that under IEEE-754 semantics, some folding cannot happen with floating-point arithmetic. For example, x*0.0 is not always 0.0: x could be infinity or NaN, in which case the correct result would be NaN, or x could be -0.0 (or negative), in which case the correct result would be -0.0. [See the ISO C99 standard, section F.8.2, for additional examples; the C99 floating-point bindings were inherited by C++11.] Maintaining proper IEEE-754 semantics is a Good Thing ™: it allows for robust numerical algorithms and programmer sanity. If you desire loosey-goosey treatment of floating-point arithmetic, I would suggest looking into CUDA Fortran, as Fortran has traditionally allowed all mathematically equivalent transformations (which may well not be equivalent under IEEE-754 semantics).
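A quick host-side illustration of the cases mentioned above:

#include <cmath>
#include <cstdio>

int main()
{
    printf("%g\n", INFINITY * 0.0);   // prints nan, not 0
    printf("%g\n", NAN * 0.0);        // prints nan
    printf("%g\n", -3.0 * 0.0);       // prints -0, not 0
    return 0;
}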

In my experience there is almost never a need to manually instruct the compiler to unroll loops. If you do want to do that you would use #pragma unroll N directly before the loop where N is the desired unrolling factor (or 1 if you want to prevent the compiler from unrolling).
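For example (sketch):

__device__ float sum16(const float *x)
{
    float s = 0.0f;
    #pragma unroll 4        // request an unrolling factor of 4; '#pragma unroll 1' keeps the loop rolled
    for (int i = 0; i < 16; ++i)
        s += x[i];
    return s;
}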

Why would you want to use uninitialized data? That doesn’t seem to make much sense. Uninitialized data may lead to undefined behavior at the C++ level, which may prompt the compiler to remove all affected computation. Once an instance of undefined behavior is detected, anything can happen (see the (in)famous “nasal demons”).

My kernel needs to receive something like this:
__global__ void my_kernel(float *buf_array[3]) { … }

some_buf_array is indexed by blockIdx.z. This selects the actual buffer to use for the current block in the kernel.

I don’t want this access to be resolved from global memory: it would be slow.

I can’t put some_buf_array in shared memory: the kernel needs to receive the array buffer somehow. Also, I’m using shared memory for other stuff.

I don’t want to use cudaMemcpyToSymbol(): AFAICT this will fail if two CPU threads invoke the same kernel concurrently. This usage must work even if it’s not optimal.

Passing a structure destroys performance. As I wrote before, the compiler actually copies the whole structure to local memory and then resolves the accesses from there. Can anyone confirm that behavior?

That’s all the ways to pass the array that I can think of. It looks like I’m screwed.

Thanks for the replies about constant folding. From the disassembly, it does seem the compiler is mostly able to do it. I can’t explain stuff like val += 0 and val *= 0 though (edit: AFAIK the compiler can prove there are no NaNs involved here, but maybe I’m wrong).

For the uninitialized registers: the first few iterations of my main loop compute values that aren’t all useful (8x8 DCT partially overlapping the block). Hence it shouldn’t matter whether I initialize those registers or not.

I don’t think the CUDA compiler uses range-tracking(*) for numerical data. Very few compilers do, and I know of none that do it for floating-point data. Generally speaking, the required analysis tends to be more costly than is warranted by the achievable performance gains. Note that the case of *=0 requires checking for infinities, negative zero, or NaN, not just NaN. Single-precision FLOPS on GPUs are “too cheap to meter”, and usually not the bottleneck of real-life applications, so I wouldn’t be concerned about an FMUL with zero, unless the profiler indicated it is contributing to a bottleneck.

(*) I am not sure “range tracking” is the correct technical term, it has been several years since I last looked into this.

Do you mean uninitialized variables? There is no way to directly initialize registers from the HLL level. I can’t envision a scenario where the use of uninitialized data would lead to more code being emitted. If there is any impact at all, I would expect it to go in the opposite direction, i.e. code elimination. In general, I don’t think it is a good idea to leave data uninitialized, this has a tendency to bite the programmer in their behind later on as code changes are applied.

That makes sense. Thanks!

edit Yeah, I meant uninitialized variables. I’ll be a good citizen and initialize those variables.

It’s not at all clear to me why you can’t use constant memory. The kernel parameter passing ends up in the same physical resource.

Agree about the physical location. Correct me if I’m wrong, but if two CPU threads try to write to the same constant area because each of them wants to launch the same kernel (with different parameters), there will be a conflict.

In other words, __constant__ introduces a global, which breaks MT.

Yes, that’s an issue. Apparently you don’t want to use a mutex. It might not be that painful, though, with asynchronous streams. Define a wrapper function that you use to call the kernel and its associated cudaMemcpyToSymbol operation:

wrapper()
{
acquire mutex/lock
cudaMemcpyToSymbolAsync(… streamX);
kernel<<<…,streamX>>>(…);
release lock
}

The lock would only be held for as long as it takes to queue up the operations. Since you are always launching this kernel using this wrapper into the same stream, it’s guaranteed that the copy and kernel launch operations won’t get out of order.
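Filled out a bit (untested sketch; the struct, kernel, and stream names are made up):

#include <mutex>

struct Params { float *buf[3]; int n; };        // hypothetical parameter block

__constant__ Params d_params;                   // single __constant__ destination

__global__ void my_kernel() { /* reads d_params */ }

static std::mutex   g_launch_mutex;
static cudaStream_t g_stream;                   // created once with cudaStreamCreate()

void launch_wrapper(const Params &p, dim3 grid, dim3 block)
{
    std::lock_guard<std::mutex> lock(g_launch_mutex);  // held only while queuing the work
    // Copy and launch are issued into the same stream, so they cannot be reordered.
    cudaMemcpyToSymbolAsync(d_params, &p, sizeof(Params), 0,
                            cudaMemcpyHostToDevice, g_stream);
    my_kernel<<<grid, block, 0, g_stream>>>();
}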

OK, that’s a workaround.

It’d be nice if the compiler would be fixed to properly handle structures passed as constant arguments to the kernel.

Thanks for the help.

Can you specify “properly”, consistent with normal C++ semantics? What exactly is a “constant argument”?

Unless I am missing something, the current semantics of passing arguments to a kernel are exactly what one would expect in C++, with the complication that there are now two address spaces (host and device), something that does not exist in C++, where “a pointer is a pointer is a pointer”.

struct foo
{
    float *array[3];
};

__global__ void kernel(foo f)
{
    // This should be efficient. No copying 'f' to local mem to resolve the access to f.array.
    float *ptr = f.array[blockIdx.z];
}

Consider filing an enhancement request (RFE) with NVIDIA. You can do that via the normal bug reporting form, simply prefix the synopsis with “RFE:”.

It is not clear to me why struct arguments to kernels are moved from .param space to .local space prior to first use. I find that this behavior is consistent going back several CUDA versions. I suspect it might have something to do with the requirements of the CUDA ABI for struct arguments in general. Since the ABI is (to my knowledge) not publicly documented, it is impossible for me to check that hypothesis.

As a workaround, I would suggest flattening the struct manually, by passing three individual pointers to the kernel instead of a struct comprising three pointers. Other than for (possibly) kernels of extremely short duration, I would not expect a performance difference in excess of noise level (2%) though.
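I.e., something like this (sketch):

// Flattened variant: three individual pointer arguments instead of a struct.
__global__ void kernel(float *a0, float *a1, float *a2)
{
    float *ptr = (blockIdx.z == 0) ? a0 :
                 (blockIdx.z == 1) ? a1 : a2;
    // ... use ptr ...
}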

Bug filed, #1887093.

It might indeed be an ABI issue. My kernel is a one-liner that invokes a helper function defined with __forceinline__ to pass it a constant, i.e. several kernels call the same helper function. The compiler might “resolve” the inner access to the structure by yanking it to local memory. The compiler should still be able to resolve this access gracefully with proper static analysis, of course.

By curiosity, do you (njuffa) work for Nvidia?

Side note to nvidia: it would be a lot easier to debug these problems if the native assembly was documented. Yeah, it changes all the time, but your own employees have already documented it for their own use, right? The PTX is not good enough as we don’t see the actual register usage.

I was one of the “founding members” of the CUDA team in 2005 and was associated with CUDA in various capacities until 2014, when I left NVIDIA.

Problem confirmed.

Last comment from NVIDIA (Fancy Fan - 03/13/2017 3:20 AM):
Hi there,
Thanks for your reporting!
In fact we have caught the issue internally already. Our development team will fix this as soon as possible.

Thanks to everyone who helped.