Vars in global memory vs. Parameters in kernels

Hi,

As stated in the title, I have a question about variables in global memory versus kernel parameters.

Let’s say my kernel needs an input, a_gpu, whose value comes from the host. Can I use either of the following approaches to get it there?

1>>

CPU side:

    int a;
    int *a_gpu;
    ...
    a = 0;
    ...
    cudaMalloc((void**)&a_gpu, sizeof(int));
    cudaMemcpy(a_gpu, &a, sizeof(int), cudaMemcpyHostToDevice);
    ...
    Kernel<<< grid, threads >>>();
    ...

GPU side:

    __global__ void Kernel() {
        ...
        // read a_gpu's value directly from global memory
        int tmp = *a_gpu;
        ...
    }

2>>

CPU side:

    int a;
    int a_gpu;
    ...
    a = 0;
    ...
    // Do I still need to allocate memory and copy the value for the GPU side?
    cudaMalloc((void**)&a_gpu, sizeof(int));
    cudaMemcpy(a_gpu, &a, sizeof(int), cudaMemcpyHostToDevice);
    ...
    Kernel<<< grid, threads >>>(a_gpu);
    ...

GPU side:

    __global__ void Kernel(int a_gpu) {
        ...
        int tmp = a_gpu;
        ...
    }

If I have ten inputs that the kernel needs, do I have to allocate memory for each of them on the host side ahead of time? Please let me know whether I am doing this right and help me correct this code.

Thx a bunch!!

The CUDA runtime API will copy a_gpu to the device for you if you call your kernel with pass-by-value arguments. In case #2, all you need on the CPU is:

int a_gpu;
a_gpu = 0;
Kernel<<< grid, threads >>>(a_gpu);

You can do this with any parameter (including structs) that you can pass by value. To pass arrays to a kernel, you need to use cudaMalloc and cudaMemcpy.
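For example, here is a minimal sketch of both cases (the kernel and struct names are made up for illustration): a plain struct travels to the kernel by value, while the array has to be allocated and copied explicitly:

#include <cuda_runtime.h>

struct Params { int n; float factor; };        // plain struct, passed to the kernel by value

__global__ void scaleArray(float *data, Params p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n)
        data[i] *= p.factor;                   // p arrived by value; data points into device memory
}

int main()
{
    const int n = 256;
    float h_data[n];
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    // The array needs explicit device memory and an explicit copy.
    float *d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // The struct is simply passed by value; the runtime delivers it to the kernel.
    Params p = { n, 2.0f };
    scaleArray<<< 1, 256 >>>(d_data, p);

    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return 0;
}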

Getting output back from a kernel also requires cudaMalloc/cudaMemcpy, even if you only want to return a simple type, like an int or float.
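As a sketch of that last point (again, the kernel name is invented): to get even a single int back, the kernel writes through a device pointer and the host copies the result down afterwards:

#include <cuda_runtime.h>

__global__ void computeAnswer(int *result)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *result = 42;                          // write the single output value to global memory
}

int main()
{
    int h_result = 0;
    int *d_result;

    cudaMalloc((void**)&d_result, sizeof(int));

    computeAnswer<<< 1, 32 >>>(d_result);

    // Copy the single int back to the host (this call also synchronizes with the kernel).
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    return 0;
}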

Thx.

But I am still a little bit confused.

  1. If I have copied a variable to the device, can I use it directly in the kernel without passing it as a parameter?

  2. Can I write a kernel that takes no parameters?

  1. Yes
  2. Yes
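
Here is a minimal sketch of both points together (the variable and kernel names are made up): a file-scope __device__ variable is filled from the host with cudaMemcpyToSymbol, and a kernel with no parameters reads it directly:

#include <cuda_runtime.h>
#include <stdio.h>

__device__ int g_a;                        // file-scope variable in device global memory
__device__ int g_result;                   // another device variable used for the output

__global__ void useGlobal()                // kernel with no parameters at all
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        g_result = g_a + 1;                // read g_a directly; no kernel argument needed
}

int main()
{
    int a = 41, result = 0;

    // Fill the __device__ variable from the host by symbol.
    cudaMemcpyToSymbol(g_a, &a, sizeof(int));

    useGlobal<<< 1, 32 >>>();

    // Read a device variable back the same way.
    cudaMemcpyFromSymbol(&result, g_result, sizeof(int));
    printf("result = %d\n", result);       // expected: 42
    return 0;
}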

However, I would not recommend using a __device__ variable in place of a kernel parameter. Since all threads are likely to read it simultaneously, the reads will not be coalesced and your kernel's performance will take a big hit. Use a __constant__ variable instead: constant memory is efficient when all threads read the same memory location at once. You can initialize a __constant__ variable from the host with cudaMemcpyToSymbol.
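
For reference, a minimal sketch of the recommended variant (names are again made up): the only real change from the __device__ version above is the __constant__ qualifier; initialization still goes through cudaMemcpyToSymbol, and every thread reads the same constant-memory location.

#include <cuda_runtime.h>

__constant__ int c_a;                      // lives in constant memory, cached and broadcast to the threads

__global__ void useConstant(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = c_a;                      // all threads read the same constant location efficiently
}

int main()
{
    int a = 7;
    // Initialize the __constant__ variable from the host.
    cudaMemcpyToSymbol(c_a, &a, sizeof(int));

    const int n = 256;
    int *d_out;
    cudaMalloc((void**)&d_out, n * sizeof(int));

    useConstant<<< 1, 256 >>>(d_out, n);

    cudaFree(d_out);
    return 0;
}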