You only need to use cudaMallocHost if you're allocating pinned (page-locked) memory on the host. This increases the host <-> device transfer speed, but the OS can no longer page that memory out, so it reduces the physical memory available to every other program on the system. In many cases, you're probably better off using a normal malloc() call.
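No, I believe that if you're using cudaMemcpyAsync, you have to use pinned host memory allocated with cudaMallocHost(). As far as I know, depending on the CUDA version, passing pageable memory either returns an error or quietly falls back to a copy that is synchronous with respect to the host, so you lose the overlap you wanted. If you're doing a synchronous transfer with cudaMemcpy, you can just use normal memory allocated with malloc().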
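To make the pinned-memory case concrete, here is a minimal sketch of an asynchronous round trip. The kernel doubleElements and the helper pinnedRoundTrip are names I made up for illustration, and the buffer size and launch configuration are arbitrary; treat it as one way to do it, not the only one.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles each element in place.
__global__ void doubleElements(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: copies n floats to the device and back through a
// stream, using pinned host memory. Returns true if the results check out.
bool pinnedRoundTrip(int n)
{
    size_t bytes = n * sizeof(float);
    float *hostBuf = nullptr;
    float *devBuf = nullptr;
    cudaStream_t stream = nullptr;
    bool ok = false;

    if (cudaMallocHost((void **)&hostBuf, bytes) == cudaSuccess &&  // pinned host buffer
        cudaMalloc((void **)&devBuf, bytes) == cudaSuccess &&
        cudaStreamCreate(&stream) == cudaSuccess) {
        for (int i = 0; i < n; ++i) hostBuf[i] = (float)i;

        // All three operations are queued on the same stream, so they run
        // in order on the device while the host thread is free to continue.
        cudaMemcpyAsync(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice, stream);
        doubleElements<<<(n + 255) / 256, 256, 0, stream>>>(devBuf, n);
        cudaMemcpyAsync(hostBuf, devBuf, bytes, cudaMemcpyDeviceToHost, stream);

        // Wait for the queued work before reading hostBuf on the host.
        ok = (cudaStreamSynchronize(stream) == cudaSuccess);
        for (int i = 0; ok && i < n; ++i) ok = (hostBuf[i] == 2.0f * (float)i);
    }

    if (stream) cudaStreamDestroy(stream);
    if (devBuf) cudaFree(devBuf);
    if (hostBuf) cudaFreeHost(hostBuf);  // pinned memory is freed with cudaFreeHost, not free()
    return ok;
}

int main()
{
    printf("pinned round trip %s\n", pinnedRoundTrip(1024) ? "succeeded" : "failed");
    return 0;
}
```

Note that the host must not touch hostBuf between the first cudaMemcpyAsync and the cudaStreamSynchronize call, since the copies may still be in flight.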
You can't transfer pointers from the host to the device and expect them to work. Host pointers point to host memory, device pointers point to device memory, so a host address copied over to the device can't be dereferenced there.
You need to allocate device memory according to the size of your structs, copy the data over from the host, generate a device pointer to the struct data in device memory, and go from there. When your computations are done, you copy the data back to the host and generate a new pointer on the host which points to the data in host memory.
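Here is a rough sketch of those steps for a struct that contains a pointer, since that is where people usually get bitten. The Vec struct, the addOne kernel and the structRoundTrip helper are all hypothetical names invented for this example; your structs will look different, but the pattern should carry over.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical struct with an embedded pointer, for illustration only.
struct Vec {
    int n;
    float *data;
};

// Hypothetical kernel: adds 1 to every element reachable through the struct.
__global__ void addOne(Vec *v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < v->n) v->data[i] += 1.0f;
}

// Hypothetical helper: returns true if the data comes back as expected.
bool structRoundTrip(int n)
{
    size_t bytes = n * sizeof(float);
    float *hostData = (float *)malloc(bytes);  // ordinary pageable host memory
    if (!hostData) return false;
    for (int i = 0; i < n; ++i) hostData[i] = (float)i;

    float *devData = nullptr;
    Vec *devVec = nullptr;

    // 1. Allocate device memory for the payload and for the struct itself.
    bool ok = cudaMalloc((void **)&devData, bytes) == cudaSuccess &&
              cudaMalloc((void **)&devVec, sizeof(Vec)) == cudaSuccess;

    if (ok) {
        // 2. Copy the payload over from the host.
        ok = cudaMemcpy(devData, hostData, bytes, cudaMemcpyHostToDevice) == cudaSuccess;

        // 3. Build a staging copy of the struct whose pointer field holds the
        //    DEVICE address, then copy that struct to the device.
        Vec staged;
        staged.n = n;
        staged.data = devData;
        ok = ok && cudaMemcpy(devVec, &staged, sizeof(Vec), cudaMemcpyHostToDevice) == cudaSuccess;
    }

    if (ok) {
        addOne<<<(n + 255) / 256, 256>>>(devVec);

        // 4. Copy the results back into host memory; from here on the host
        //    uses hostData, never the device pointer.
        ok = cudaMemcpy(hostData, devData, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    for (int i = 0; ok && i < n; ++i) ok = (hostData[i] == (float)i + 1.0f);

    if (devVec) cudaFree(devVec);
    if (devData) cudaFree(devData);
    free(hostData);
    return ok;
}

int main()
{
    printf("struct round trip %s\n", structRoundTrip(1024) ? "succeeded" : "failed");
    return 0;
}
```

The key point is step 3: the struct that lands on the device must carry the device address in its data field. Copying the host struct as-is would hand the kernel a host pointer it can't dereference.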