I created a few Fortran device kernels that take a number of input parameters. Rather than passing each of these variables individually from the host to the device, I experimented with a derived type that contains all of them. In my experiments this seems to add a good bit of overhead. Is this typical? Has anyone else experimented with derived types in Fortran, or with plain structs in CUDA C?
The low-level CUDA launch already builds a struct to pass arguments from host to device, so you may just be duplicating work that is already tuned within the CUDA runtime. Also, most arguments are passed by value, which bypasses CUDA device memory. Perhaps in your implementation the runtime thinks it has to copy the derived type to CUDA device memory and pass a pointer to it, and that extra copy is what is slowing things down.
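To make the by-value versus by-pointer distinction concrete, here is a minimal CUDA C sketch. It is not your code: the `Params` struct, its fields, the two kernel names, and `run_both_variants` are all made up for illustration, and error checking is left out to keep it short. The first kernel takes the struct by value, so it travels with the launch arguments. The second takes a pointer, so the host has to allocate device memory and copy the struct over before the launch, which is the kind of extra work I suspect you are seeing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical parameter bundle; the fields are illustrative only.
struct Params {
    float scale;
    float offset;
    int   n;
};

// Variant 1: the struct is passed by value with the launch arguments.
__global__ void apply_by_value(float *data, Params p) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n) data[i] = data[i] * p.scale + p.offset;
}

// Variant 2: the struct lives in device memory; the kernel gets a pointer.
__global__ void apply_by_pointer(float *data, const Params *p) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p->n) data[i] = data[i] * p->scale + p->offset;
}

// Runs both variants back to back on a zero-filled array and
// returns the first element after both kernels have finished.
float run_both_variants() {
    const int n = 1024;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    Params h_p = {2.0f, 1.0f, n};

    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // By value: no separate allocation or copy for the parameters.
    apply_by_value<<<blocks, threads>>>(d_data, h_p);

    // By pointer: explicit device allocation and host-to-device copy first.
    Params *d_p = nullptr;
    cudaMalloc(&d_p, sizeof(Params));
    cudaMemcpy(d_p, &h_p, sizeof(Params), cudaMemcpyHostToDevice);
    apply_by_pointer<<<blocks, threads>>>(d_data, d_p);

    // This blocking copy also waits for both kernels to complete.
    float first = -1.0f;
    cudaMemcpy(&first, d_data, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_p);
    cudaFree(d_data);
    return first;
}

int main() {
    printf("data[0] = %f\n", run_both_variants());
    return 0;
}
```

Timing the two launches separately (for example with `cudaEvent` timers) should show whether the extra `cudaMalloc`/`cudaMemcpy` accounts for the overhead. In CUDA Fortran, I believe the analogous by-value case is a dummy argument declared with the `value` attribute, but check your compiler's documentation to confirm it is accepted for derived types.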