As stated in the title, I have a question about variables (parameters) in global memory and kernels.
Say my kernel needs an input, a_gpu, whose value comes from the host. Can I do either of the following to achieve this?
1>>
CPU---->>>>
int a, *a_gpu;
...
a=0;
...
cudaMalloc((void**)&a_gpu, sizeof(int));
cudaMemcpy(a_gpu, &a, sizeof(int), cudaMemcpyHostToDevice);
...
Kernel<<< grid, threads >>> ( );
..
GPU---->>>>
__global__ void Kernel(){
...
//get a_gpu value directly from global memory
int tmp = a_gpu;
...
}
2>>
CPU---->>>>
int a, *a_gpu;
...
a=0;
...
//Do I still need to allocate memory and copy value for the GPU side?
cudaMalloc((void**)&a_gpu, sizeof(int));
cudaMemcpy(a_gpu, &a, sizeof(int), cudaMemcpyHostToDevice);
...
Kernel<<< grid, threads >>> (a_gpu);
...
If I have ten inputs that the kernel needs, do I have to allocate memory for all of them on the host side ahead of time? Please let me know whether I am doing the right thing and help me correct this code.
The CUDA runtime API will copy a_gpu to the device for you if you pass it to your kernel by value. In case #2, all you need on the CPU side is:
int a_gpu;
a_gpu=0;
Kernel<<< grid, threads >>> (a_gpu);
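For reference, the matching kernel side under pass-by-value could look something like this (a minimal sketch; the kernel and parameter names are just placeholders):

__global__ void Kernel(int a_gpu)
{
    // each thread receives its own copy of the argument; no cudaMalloc/cudaMemcpy needed for it
    int tmp = a_gpu;
    // ... use tmp ...
}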
You can do this with any parameters (including structs) that can be passed by value. To pass arrays to a kernel, you still need cudaMalloc and cudaMemcpy.
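As a rough illustration of the difference (all names here are made up for the example), a small struct can go straight into the launch, while an array has to be staged through device memory:

// small POD struct, passed to the kernel by value
struct Params { int n; float scale; };

__global__ void ScaleArray(Params p, float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n)
        data[i] *= p.scale;
}

// host side
float h_data[256];
// ... fill h_data ...
float *d_data;
cudaMalloc((void**)&d_data, sizeof(h_data));
cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);
Params p = { 256, 2.0f };
ScaleArray<<< 1, 256 >>>(p, d_data);   // struct by value, array by device pointer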
Getting output back from a kernel also requires cudaMalloc/cudaMemcpy, even if you only want to return a simple type, like an int or float.
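For example, reading back a single int might look like this (again only a sketch with placeholder names):

__global__ void Compute(int *result)
{
    // let one thread write the value we want back on the host
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *result = 42;
}

// host side
int h_result;
int *d_result;
cudaMalloc((void**)&d_result, sizeof(int));
Compute<<< grid, threads >>>(d_result);
cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_result);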
However, I would not recommend using a plain __device__ variable for a kernel input. All threads are likely to read it simultaneously, the read will not be coalesced, and your kernel's performance will take a big hit. Use a __constant__ variable instead; constant memory is efficient when all threads read the same memory location at once. You can initialize a __constant__ variable from the host with cudaMemcpyToSymbol.
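A minimal sketch of that constant-memory approach (variable and kernel names are placeholders):

// GPU side
__constant__ int c_a;

__global__ void Kernel()
{
    // all threads read the same constant-memory location, which is broadcast efficiently
    int tmp = c_a;
    // ... use tmp ...
}

// host side
int a = 0;
cudaMemcpyToSymbol(c_a, &a, sizeof(int));
Kernel<<< grid, threads >>>();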