Kernel unnecessarily copies the whole const parameter to local space

I just ran into some very confusing behavior when a kernel takes a large struct parameter with an array in it.

First, the sane case:

struct BigStruct { float data[100]; };

__global__ void op_hardcoded(const BigStruct s, int N, float *A)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= N) return;  // guard against out-of-range threads
    float val = A[index];
    for (int i = 0; i < 100; i++) val += s.data[i];
    A[index] = val;
}

When I compile with nvcc -ptx -O3, it fully unrolls the loop, emitting add.f32 over a hundred times. OK, maybe a bit excessive, but so far so good…
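Abridged, the unrolled PTX body looks roughly like this (parameter names shortened and register numbers illustrative, not verbatim compiler output):

ld.param.f32    %f2, [op_hardcoded_param_0+0];
add.f32         %f3, %f1, %f2;
ld.param.f32    %f4, [op_hardcoded_param_0+4];
add.f32         %f5, %f3, %f4;
// ... 98 more ld.param/add pairs, one per array element ...

The important point is that every element is read directly from the kernel parameter space; nothing ever touches local memory.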

But if I change the code so that it cannot be unrolled at compile time:

__global__ void op_dynamic(const BigStruct s, int N, int k, float *A)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= N) return;  // guard against out-of-range threads
    float val = A[index];
    for (int i = 0; i < k; i++) val += s.data[i];
    A[index] = val;
}

Now each thread starts by allocating 400 bytes of local space, copies the whole BigStruct into it (so each thread makes its own private copy), and does the addition by reading from local memory.
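The relevant PTX looks roughly like this (again abridged and with names simplified; the exact shape depends on the nvcc version):

.local .align 8 .b8  __local_depot1[400];      // 400-byte per-thread buffer
// ...
mov.u64         %SPL, __local_depot1;
// compiler-generated copy of the whole 400-byte parameter into local memory:
ld.param.u64    %rd2, [op_dynamic_param_0+0];
st.local.u64    [%SPL+0], %rd2;
ld.param.u64    %rd3, [op_dynamic_param_0+8];
st.local.u64    [%SPL+8], %rd3;
// ... 48 more copy pairs ...
// the loop body then reads back through local memory:
ld.local.f32    %f2, [%rd5];
add.f32         %f3, %f1, %f2;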

This has a disastrous effect on performance: on 1M elements, the first kernel takes ~38 µs, while the second takes ~3660 µs (with k == 100) on my GTX 1080.
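For anyone who wants to reproduce this, a minimal timing harness along these lines shows the gap (the launch configuration and sizes here are a reconstruction, not the exact benchmark I ran; it assumes BigStruct and the kernels from above):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int N = 1 << 20;                      // ~1M elements
    float *d_A;
    cudaMalloc(&d_A, N * sizeof(float));
    cudaMemset(d_A, 0, N * sizeof(float));

    BigStruct s = {};                           // passed by value to the kernel

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    op_dynamic<<<(N + 255) / 256, 256>>>(s, N, 100, d_A);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     // elapsed time in milliseconds
    printf("op_dynamic (k=100): %.1f us\n", ms * 1000.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_A);
    return 0;
}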

Is this a known issue? And is there any way to work around it, preferably without updating to a newer version of CUDA?

(I believe this affects the performance of pointwise operations in PyTorch for certain tensor dimensions, so it would be nice if I could work around it in CUDA 8.0.)

  • My CUDA version is 8.0:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

I would claim that passing large structs in or out of any function (including a kernel) constitutes abuse of function-calling conventions, on any platform. For large data, pass a pointer instead.
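A minimal sketch of what that looks like here (the host-side names are illustrative): the struct goes to device memory once, and every thread reads it through the cache hierarchy instead of making a private local copy.

__global__ void op_dynamic_ptr(const BigStruct *s, int N, int k, float *A)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= N) return;
    float val = A[index];
    for (int i = 0; i < k; i++) val += s->data[i];  // reads served from global memory (cached)
    A[index] = val;
}

// host side (illustrative):
BigStruct h_s = {};
BigStruct *d_s;
cudaMalloc(&d_s, sizeof(BigStruct));
cudaMemcpy(d_s, &h_s, sizeof(BigStruct), cudaMemcpyHostToDevice);
op_dynamic_ptr<<<(N + 255) / 256, 256>>>(d_s, N, k, d_A);

Since all threads read the same data, placing the struct in __constant__ memory via cudaMemcpyToSymbol would be another option that works in CUDA 8.0.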

You could always file an enhancement request with NVIDIA (via the regular bug reporting form, prefix the synopsis with “RFE:”), but my hunch is that the response from NVIDIA’s compiler engineers would be similar to my response above.

Well, the difference is less dramatic, but the same thing happens if “BigStruct” has eight elements instead of 100. (Now it’s only ~5 times slower, instead of ~100 times.)

I think it’s reasonable to consider a 32-byte argument a legitimate use of an ABI.