Issue with calling PyCuda function (LogicError: cuFuncSetBlockShape failed: invalid resource handle)

To start, I’ve gone through all the related threads here, on the PyCuda forums, and on StackOverflow for this error message, and I’ve tried every solution suggested there. I still get the following error:

[b]--> 382 func._set_block_shape(*block)
    383 handlers, arg_buf = _build_arg_buf(args)
    384

LogicError: cuFuncSetBlockShape failed: invalid resource handle[/b]

The error originates at this function call:

114 func(np.int32(nsimul), np.int32(nlayers_mclr1), np.int32(nlayers_mclr2), np.int32(nlayers_mclr3),…)
(the full call and the kernel definition are given below)

The PyCuda function call is as follows:

func = simulCUDA.get_function("simulcuda")

func(np.int32(nsimul), np.int32(nlayers_mclr1), np.int32(nlayers_mclr2), np.int32(nlayers_mclr3),
     np.int32(totallayers),
     cuda.In(x1.astype(np.float32)), cuda.In(y1.astype(np.float32)),
     cuda.In(x2.astype(np.float32)), cuda.In(y2.astype(np.float32)),
     cuda.In(x3.astype(np.float32)), cuda.In(y3.astype(np.float32)),
     np.float32(multiplier), cuda.In(out_ran.astype(np.float32)),
     np.int32(y3.shape[1]), np.int32(y2.shape[1]), np.int32(y1.shape[1]),
     cuda.In((y3@x3).astype(np.float32)), cuda.In((y2@x2).astype(np.float32)), cuda.In((y1@x1).astype(np.float32)),
     cuda.Out(bmps), cuda.Out(bmps_imp), cuda.Out(bmps_fc), cuda.Out(bmps_smth),
     grid=(10, 10, 1), block=(20, 2, 1))

and the function definition:

simulCUDA = SourceModule(
"""

__device__ int rint_arr(float arr)
{
    int final;
    final = rint(arr);
    return final;
}

__global__ void simulcuda(int nsimul,int nlayers_mclr1,int nlayers_mclr2,int nlayers_mclr3,int totallayers,float *x1,float *y1,float *x2,float *y2,float *x3,float *y3,float multiplier, float *random_arr, int y3shape, int y2shape, int y1shape, float *z3x3, float *z2x2, float *z1x1, float **bmps, float **bmps_imp, float **bmps_fc, float **bmps_smth)
{
    //const size_t ttl = totallayers;
    //const size_t nsimult = nsimul;
. . . .

What I’ve tried so far: The goal is to have 4000 threads running in parallel, and every argument is explicitly cast to its type, as suggested by a solution I read elsewhere. It still does not work. I also tried running it with a single thread (block=(1,1,1)), and that produced the same error. The corresponding C++ code, however, compiles fine.

Just to be clear, I’m fairly sure the failure happens right at the invocation of the function, not during run time. I tried uploading the arguments individually through memcpy, and that does not segfault, so I don’t think it’s a memory issue either. I’m really confused: if I pass the variables directly in the function call like this, is that different from how they are allocated and copied with memcpy?
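For reference, here is a quick sanity check of the launch configuration in plain Python (no GPU needed). It is only a sketch of how grid=(10, 10, 1) and block=(20, 2, 1) from the call above multiply out to the intended 4000 threads; the variable names are made up for illustration and are not part of my actual code:

```python
from math import prod

# Launch configuration copied from the func(...) call above.
grid = (10, 10, 1)    # number of blocks along x, y, z
block = (20, 2, 1)    # number of threads per block along x, y, z

# Hypothetical helper names, used only for this check.
threads_per_block = prod(block)   # 20 * 2 * 1
num_blocks = prod(grid)           # 10 * 10 * 1
total_threads = threads_per_block * num_blocks

print(threads_per_block, num_blocks, total_threads)
```

So the configuration itself matches the 4000-thread goal, and 40 threads per block is a small block size, which is why I don’t think the block shape is the problem.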

Reposting the error:

LogicError: cuFuncSetBlockShape failed: invalid resource handle

I’ll post more of the code if required. Thanks in advance!