Inside a kernel, variables declared as __shared__ clearly go to shared memory. Is it true that variables declared without any such qualifier go directly to global memory?
What if a kernel uses a macro or a global variable defined outside the kernel, at the very beginning of the program? And where do the variables or pointers passed to the kernel as arguments reside?
Which variables are stored in registers, and which in local memory?
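A variable declared inside a kernel without a memory-space qualifier can only go into a register or local memory. Because local memory is just an address space in global memory set aside by the driver for thread-local storage (and therefore slow), the compiler tries to keep as many variables in registers as possible. Variables typically end up in local memory when the kernel needs more registers than are available, or when an array is indexed with values the compiler cannot resolve at compile time.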
Variables defined at global scope must have a __device__ or __constant__ modifier to be visible to an executing kernel. In both cases, the variable is physically located in global memory, but the __constant__ modifier makes the kernel access the variable through the constant cache.
Kernel arguments are stored in different places depending on the architecture. Pre-Fermi, the arguments are placed in shared memory by the driver before the kernel starts. Starting with Fermi, the driver uses constant memory for kernel arguments instead.
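To make the placement rules above concrete, here is a minimal sketch. The names spaces_demo, c_scale, d_offset and BLOCK are made up for illustration, and where the automatic variables actually end up is the compiler's decision, so treat the comments as the usual outcome rather than a guarantee.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 128   // hypothetical block size, expanded by the preprocessor

// Global scope: both variables live in device memory.
__constant__ float c_scale  = 2.0f;  // read through the constant cache
__device__   float d_offset = 1.0f;  // read as ordinary global memory

// in, out and n are kernel arguments: the pointer values and n are what the
// driver passes in, not the arrays the pointers refer to.
__global__ void spaces_demo(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK];                   // shared memory, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // automatic scalar: normally a register

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // in[] itself is in global memory
    __syncthreads();

    if (i < n)
        out[i] = tile[threadIdx.x] * c_scale + d_offset;
}

int main()
{
    const int n = 2 * BLOCK;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    spaces_demo<<<n / BLOCK, BLOCK>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[3] = %f\n", h_out[3]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

If you want to check whether anything was pushed out to local memory, compiling with nvcc -Xptxas -v reports register usage and local memory bytes per kernel.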
Thanks very much for the reply. But what about a macro that I define at the very beginning of the program without the __device__ or __constant__ modifier? For example,
#define ABC 1<<15
I then use the value of ABC directly inside the kernel. In terms of performance, is it better to use the macro directly inside the kernel, or to pass it into the kernel as one of the arguments?
Another question: I’m using a Fermi card (GTX 480). Do you mean that kernel arguments are stored in constant memory? If so, I no longer need to copy the frequently accessed arrays passed as arguments into shared memory, as I did before, because constant memory is fast enough. Is this understanding correct?
Preprocessor macros are expanded before compilation, so this #define would be handled as if you typed the literal directly into the kernel source. I see no reason to pass something like this as an argument to the kernel.
By the way, it would be a good idea to enclose the expression in parentheses and write this as:
#define ABC (1<<15)
Otherwise, you run the risk of C operator precedence doing something you don’t want if you put ABC next to a higher precedence operator than <<.
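Here is a small host-only sketch of that precedence problem, which compiles as ordinary C++ or inside a .cu file. The names ABC_BARE, ABC_PAREN and the two helper functions are made up for the comparison. Because + binds tighter than <<, the unparenthesized macro silently changes the shift amount.

```cpp
// Two spellings of the same constant, named differently so they can be compared.
#define ABC_BARE   1<<15
#define ABC_PAREN  (1<<15)

// Expands to 1<<15 + 1, which parses as 1 << (15 + 1) = 65536.
constexpr int bare_plus_one()  { return ABC_BARE + 1; }

// Expands to (1<<15) + 1 = 32769, which is what was intended.
constexpr int paren_plus_one() { return ABC_PAREN + 1; }

static_assert(bare_plus_one()  == 65536, "shift amount became 16");
static_assert(paren_plus_one() == 32769, "parentheses keep the shift intact");
```

Some compilers will warn about the first form, but the result is still the wrong value unless the macro body is parenthesized.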
Generally speaking, the constant cache is fast and should be fine for kernel arguments. Since you mention arrays, I should point out that it is only the argument values that go into the constant cache, not the memory referred to by any pointer arguments. For example, when you invoke this kernel:
__global__ void test_kernel(int *my_array)
The address of the first element of my_array is put into the constant cache (and then copied into a register when you use it), but not any of the memory that my_array points to. Since you almost certainly created that memory using cudaMalloc(), the array contents are in global memory. In that case, you may want to consider copying parts of my_array into shared memory if you will need to access the elements of my_array more than once.
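As a rough sketch of that last suggestion, the kernel below stages a block's slice of my_array in shared memory before reusing it. The three-point sum, the result and n parameters, and the BLOCK size are assumptions added for illustration, and the kernel assumes it is launched with exactly BLOCK threads per block.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256   // hypothetical block size; launch with blockDim.x == BLOCK

// Each output needs my_array[i-1], my_array[i] and my_array[i+1], so every
// element is read by up to three threads. Staging the block's slice in shared
// memory means each element is fetched from global memory once per block.
__global__ void test_kernel(const int *my_array, int *result, int n)
{
    __shared__ int tile[BLOCK + 2];   // block's elements plus one halo cell per side

    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int t = threadIdx.x + 1;                        // index into tile, past the left halo

    tile[t] = (i < n) ? my_array[i] : 0;

    // The first and last threads of the block also load the halo cells.
    if (threadIdx.x == 0)
        tile[0] = (i > 0 && i - 1 < n) ? my_array[i - 1] : 0;
    if (threadIdx.x == blockDim.x - 1)
        tile[t + 1] = (i + 1 < n) ? my_array[i + 1] : 0;

    __syncthreads();   // all loads must finish before any thread reads the tile

    if (i < n)
        result[i] = tile[t - 1] + tile[t] + tile[t + 1];
}
```

A launch would look like test_kernel<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_array, d_result, n), where d_array and d_result are device pointers from cudaMalloc(). On Fermi the L1/L2 caches also absorb some repeated global reads, so it is worth timing both versions before committing to the extra code.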