Device variable question

Hello, I’m brand new to CUDA and GPU programming in general so bear with me.

Here’s what I am trying to do. I have a variable that is used by all my global functions.
The code is too large to post here, but this is the general idea:

I need there to be some variable float fMax that is used by all of my functions. Not only that but it can’t be constant because I will need to change it.

__global__ void foo1(void *A)
{
//The maximum value of A (a 1d array already copied to the GPU) is found. The average is stored in fMax. Basically, this function changes fMax.
}

__global__ void foo2(void *A, void *B)
{
//Operations on A and B (1d arrays already copied to the GPU) are performed. This function uses fMax to perform those operations, but does not need to change its value.
}

So where and how can I declare fMax so that these functions can use it like this? I am under the impression that you can’t pass fMax to a kernel by reference.

My first thought is that I would have to do this:
float fMaxHost = 0.0f;
float *fMax = 0;
cudaMalloc((void **)&fMax, sizeof(float));
cudaMemcpy(fMax, &fMaxHost, sizeof(float), cudaMemcpyHostToDevice);

And change the kernel functions to something like
__global__ void foo1(void *A, float *fMax)
__global__ void foo2(void *A, void *B, float *fMax)

But that seems like overkill for one float. Is there a better way to do this?
I know this seems like an inefficient use of the GPU, but just humor me. :)
Thanks for any help!

If fmax is only needed as an input to the kernel, pass it by value as a float, just like in ordinary C:

__global__ void func1(float fmax, [other arguments as needed]);

If the kernel is going to modify this variable, you will have to pass a pointer to it, just like in ordinary C:

__global__ void func2(float *fmax, [other arguments as needed]);

In this case you will have to allocate space on the device for fmax using cudaMalloc() prior to invoking the kernel, as you are already doing.
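To make the pointer approach concrete, here is a minimal sketch. The kernel bodies are hypothetical stand-ins (the original code was not posted): this foo1 finds the maximum of A with a single thread and stores it through the pointer, and this foo2 only reads it. The helper name runBoth, the array length n, and the division by fMax are all assumptions for illustration. Error checking is omitted for brevity, and the input is assumed to have a nonzero maximum.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for foo1: one thread scans A and writes the
// maximum to *fMax. Deliberately serial to keep the sketch short.
__global__ void foo1(const float *A, int n, float *fMax)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float m = A[0];
        for (int i = 1; i < n; ++i)
            if (A[i] > m) m = A[i];
        *fMax = m;
    }
}

// Hypothetical stand-in for foo2: reads *fMax but never writes it.
__global__ void foo2(const float *A, float *B, int n, const float *fMax)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        B[i] = A[i] / *fMax;
}

// Hypothetical host helper: runs both kernels, fills hostB, returns fMax.
float runBoth(const float *hostA, float *hostB, int n)
{
    float *dA = 0, *dB = 0, *dMax = 0;
    cudaMalloc((void **)&dA, n * sizeof(float));
    cudaMalloc((void **)&dB, n * sizeof(float));
    cudaMalloc((void **)&dMax, sizeof(float));   // the single device float
    cudaMemcpy(dA, hostA, n * sizeof(float), cudaMemcpyHostToDevice);

    foo1<<<1, 1>>>(dA, n, dMax);                 // writes *dMax
    // Launches on the same stream run in order, so foo2 sees foo1's result.
    foo2<<<(n + 255) / 256, 256>>>(dA, dB, n, dMax);

    float hostMax = 0.0f;
    cudaMemcpy(&hostMax, dMax, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hostB, dB, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dMax);
    return hostMax;
}
```

The value never has to travel back to the host between the two launches; it stays in device memory and both kernels reach it through the same pointer.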

Ok. I was hoping there was a simpler way, but thanks!

Hi rocket_psyentist,

If your program uses only one block, then you can use shared memory.
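A rough sketch of what that could look like is below. One caveat: shared memory only lives as long as the block, so it does not carry over from one kernel launch to the next. This sketch therefore assumes the two steps can be fused into a single kernel launched with exactly one block. The names fusedMaxAndScale and runFused, and the division by fMax, are hypothetical; error checking is omitted and a nonzero maximum is assumed.

```cuda
#include <cuda_runtime.h>

// Hypothetical fused kernel. Must be launched with exactly one block,
// because fMax is only visible to threads of the block that wrote it.
__global__ void fusedMaxAndScale(const float *A, float *B, int n)
{
    __shared__ float fMax;

    // Step 1: thread 0 finds the maximum (serial, to keep this short).
    if (threadIdx.x == 0) {
        float m = A[0];
        for (int i = 1; i < n; ++i)
            if (A[i] > m) m = A[i];
        fMax = m;
    }
    __syncthreads();   // every thread waits until fMax has been written

    // Step 2: all threads of the block use fMax without changing it.
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        B[i] = A[i] / fMax;
}

// Hypothetical host helper that copies data in, launches one block,
// and copies the result back into hostB.
void runFused(const float *hostA, float *hostB, int n)
{
    float *dA = 0, *dB = 0;
    cudaMalloc((void **)&dA, n * sizeof(float));
    cudaMalloc((void **)&dB, n * sizeof(float));
    cudaMemcpy(dA, hostA, n * sizeof(float), cudaMemcpyHostToDevice);

    fusedMaxAndScale<<<1, 256>>>(dA, dB, n);     // one block only

    cudaMemcpy(hostB, dB, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
}
```

If foo1 and foo2 have to remain separate kernels, this does not apply and the value needs to live in global memory instead.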

Best regards!