global variable in CUDA

Hello All,

I am new to CUDA and to programming as well. Please help me.

I want to define a global variable count and then use it in different CUDA device functions, so I do something like this:

__device__ int count; // Declare global variable here

int main() {
    do {
        core_kernel<<<5, 32>>>(some parameters); // Call the CUDA kernel
        //// What is the value in the count variable here???
    } while (count != 16);
}

__global__ void core_kernel(some parameters) {
    calldevicefunction(); // the __global__ kernel calls the __device__ function
}

__device__ void calldevicefunction() {
    count = count + 1; // use the global variable directly here
}

I assume the device function is executed by several threads. After core_kernel is called, how many times does the function increment the value in count? After a single kernel call, is count incremented just once, or by the total number of threads/blocks?

Also, is it okay to use the count variable this way in calldevicefunction without declaring it or passing it as a function parameter? Do you see any conflict with several threads trying to work on the same variable this way?

Thanks for your help.

I’m not sure where to begin.

Before the kernel call, count will be some arbitrary value: like an uninitialized local variable in C/C++, you've declared it but never assigned anything to it. It will contain whatever junk happens to be at the device global-memory address where count is stored, so initialize it explicitly before relying on it.
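For reference, a minimal sketch of how one might initialize such a variable from the host, assuming the __device__ int count declaration from your post (error checking omitted): __device__ variables cannot be assigned directly from host code, so you copy a value into the symbol instead.

```cuda
#include <cuda_runtime.h>

__device__ int count; // device-side global, lives in GPU global memory

int main() {
    int zero = 0;
    // Copy an initial value into the device symbol before any kernel runs.
    cudaMemcpyToSymbol(count, &zero, sizeof(int));
    return 0;
}
```

The matching cudaMemcpyFromSymbol() is how the host reads the value back after a kernel; the host cannot just read count in a while condition as your main() does.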

After the kernel call (assuming you don't error out), count will be 16, since your code loops until it is 16, but it may be a very, very long wait before it is exactly 16, and count may have been incremented literally trillions of times by then. Each thread attempts to add 1 to count, but count = count + 1 is not a single operation. Within one thread, the sequence of events can be approximately modeled as:

1. read the value of count,
2. increment that value by 1,
3. store the result back to count.

Unfortunately, after any one of those steps count could have been changed by another thread, so a thread may or may not actually add 1 to the value it originally read. Lastly, count is not checked against 16 until after presumably thousands of threads have already added to it.
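To make the lost-update problem concrete, here is a small sketch (not your code; the names and launch configuration are made up for illustration) that increments one counter non-atomically and another with atomicAdd() from the same threads. The atomic total will equal the thread count; the non-atomic one will almost certainly come out far smaller.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int racy_count   = 0; // incremented with plain read/add/store
__device__ int atomic_count = 0; // incremented with atomicAdd

__global__ void increment_both() {
    racy_count = racy_count + 1;   // read, add, store: updates can be lost
    atomicAdd(&atomic_count, 1);   // hardware-serialized read-modify-write
}

int main() {
    increment_both<<<256, 256>>>(); // 65536 threads racing on both counters
    cudaDeviceSynchronize();

    int racy = 0, atomic = 0;
    cudaMemcpyFromSymbol(&racy,   racy_count,   sizeof(int));
    cudaMemcpyFromSymbol(&atomic, atomic_count, sizeof(int));

    // atomic will be 65536; racy is typically much smaller and varies per run.
    printf("racy = %d, atomic = %d\n", racy, atomic);
    return 0;
}
```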

I’m therefore not at all sure what you’re trying to do here, but I suspect you’re heading down the wrong path. Perhaps you could explain.

Note, even if you set count=0 before calling calldevicefunction(), AND you only had one block of, say, 2 threads (which would be useless, but would have a half-decent chance of reaching 16 in 8 loops), count = count + 1 (or even count += 1) is not atomic, so you're going to get strange behavior. See atomicAdd() if you really want to count the number of times an operation has been performed by any thread on the device. However, it appears that you are trying to do in CUDA what you might do in a single-threaded environment like traditional C/C++. May I suggest running through a CUDA tutorial and/or following an online lecture series (Stanford and the University of Illinois both have great ones).
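If counting really is the goal, a hedged sketch of your kernel rewritten around atomicAdd() might look like this (parameters and error checking omitted; the one-block, two-thread launch matches the toy case above):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int count;

__global__ void core_kernel() {
    // Each thread records its operation atomically, so no increments
    // are lost between the read and the store.
    atomicAdd(&count, 1);
}

int main() {
    int h_count = 0;
    cudaMemcpyToSymbol(count, &h_count, sizeof(int)); // count = 0 on device

    core_kernel<<<1, 2>>>(); // one block of 2 threads
    cudaDeviceSynchronize();

    cudaMemcpyFromSymbol(&h_count, count, sizeof(int));
    printf("count = %d\n", h_count); // 2: one atomic increment per thread
    return 0;
}
```

With atomicAdd() the result is exactly one increment per thread per launch, so your do/while loop would at least be checking a deterministic value.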