CUFFT_INVALID_DEVICE when creating a cuFFT plan on HPC

I am testing the following code on my own local machines (both Arch Linux and Ubuntu 16.04, with NVIDIA driver 390 and CUDA 9.1) and on our HPC clusters:

#include <iostream>
#include <cmath>
#include <cufft.h>

int main(){
    // Initializing variables
    int n = 1024;
    cufftHandle plan1d;
    double2 *h_a, *d_a;

    // Allocation / definitions
    h_a = (double2 *)malloc(sizeof(double2)*n);
    for (int i = 0; i < n; ++i){
        h_a[i].x = sin(2*M_PI*i/n);
        h_a[i].y = 0;
    }

    cudaMalloc(&d_a, sizeof(double2)*n);
    cudaMemcpy(d_a, h_a, sizeof(double2)*n, cudaMemcpyHostToDevice);
    cufftResult result = cufftPlan1d(&plan1d, n, CUFFT_Z2Z, 1);

    // ignoring full error checking for readability
    if (result == CUFFT_INVALID_DEVICE){
        std::cout << "Invalid Device Error\n";
    }

    // Executing the FFT
    cufftExecZ2Z(plan1d, d_a, d_a, CUFFT_FORWARD);

    // Executing the inverse FFT
    cufftExecZ2Z(plan1d, d_a, d_a, CUFFT_INVERSE);

    // Copying back
    cudaMemcpy(h_a, d_a, sizeof(double2)*n, cudaMemcpyDeviceToHost);

    // Cleanup
    cufftDestroy(plan1d);
    cudaFree(d_a);
    free(h_a);
    return 0;
}

I compile with nvcc -lcufft
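For debugging on the clusters, I have also been wrapping every CUDA and cuFFT call so the first failing call is reported. This is just a sketch; the CHECK_CUDA / CHECK_CUFFT macro names are my own, not part of any library:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper macros for this example: abort with the
// file and line of the first failing call.
#define CHECK_CUDA(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

#define CHECK_CUFFT(call)                                           \
    do {                                                            \
        cufftResult res = (call);                                   \
        if (res != CUFFT_SUCCESS) {                                 \
            fprintf(stderr, "cuFFT error %d at %s:%d\n",            \
                    (int)res, __FILE__, __LINE__);                  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

int main(){
    int n = 1024;
    double2 *d_a;
    // If the driver/runtime pairing is broken, this first runtime
    // call already fails, before cuFFT is ever involved.
    CHECK_CUDA(cudaMalloc(&d_a, sizeof(double2)*n));

    cufftHandle plan1d;
    CHECK_CUFFT(cufftPlan1d(&plan1d, n, CUFFT_Z2Z, 1));

    CHECK_CUFFT(cufftDestroy(plan1d));
    CHECK_CUDA(cudaFree(d_a));
    return 0;
}
```

If the cudaMalloc check already fails, the problem is in the CUDA installation rather than in cuFFT itself.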

On both of my local machines, the code works just fine; however, running the same code on our HPC clusters returns the CUFFT_INVALID_DEVICE error. Here is the hardware and driver configuration for those clusters.

For one cluster, we have several P100s available and are using NVIDIA driver version 384.90 with CUDA version 8.0.61.
On the second cluster, we are using K80s with NVIDIA driver version 367.44 and CUDA version 8.0.44. As a note, when running with CUDA version 7.5.18 on this hardware, the above code still returns the error, but the error does not seem to actually affect the execution of the code (as far as I can tell).

According to this, those CUDA versions should be compatible with the available driver versions; however, I saw a similar error back when my driver and CUDA installations were misconfigured on my local Ubuntu machine.
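To double-check the driver/runtime pairing on each machine, I can query both versions at runtime. A small host-only sketch (no kernel launches, so it should work even when device code fails):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main(){
    int driverVersion = 0, runtimeVersion = 0;
    // Both queries are host-side and succeed without touching a device.
    cudaDriverGetVersion(&driverVersion);
    cudaRuntimeGetVersion(&runtimeVersion);
    // Versions are encoded as 1000*major + 10*minor (e.g. 8000 = CUDA 8.0).
    printf("Driver supports CUDA:  %d.%d\n",
           driverVersion/1000, (driverVersion%100)/10);
    printf("Runtime (toolkit):     %d.%d\n",
           runtimeVersion/1000, (runtimeVersion%100)/10);
    return 0;
}
```

If the runtime version printed here is newer than what the driver supports, that mismatch would explain errors like the one I am seeing.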

I am completely baffled at how to continue here and can only think of a few things:

There is some difference between the consumer hardware on my local machines (a Titan X (Pascal) and a GTX 970) and the HPC cluster hardware.
There is some driver configuration problem that I have not considered. I tried out what CUDA versions I could, and none of them worked, except for 7.5.18, which returned the same error but did not seem to affect execution.
There is some change to cuFFT after CUDA 7.5.18 that I am not aware of.
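To rule out the first possibility, I could compare what the runtime reports about the devices on each machine. A sketch using device enumeration:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main(){
    int count = 0;
    // If enumeration itself fails, no device is usable at all.
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No usable CUDA device found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // Compute capability shows whether the consumer and HPC
        // cards differ in a way the toolkit might care about.
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

Running this on both my local machines and the clusters would show whether the devices are even visible to the runtime there.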

As a note: this is just a minimal example. I have a larger codebase that does not run because of this error, and that is the problem I am ultimately trying to solve.

Thanks for reading and let me know if you have any ideas on how to proceed!