UPDATE AT THE END ==============================

Good morning, all.

I have a wrapper function that declares two cudaMallocManaged variables. These need to be passed by reference to kernel_func_1, which populates them; kernel_func_2 is then called and uses the results in these variables.

The wrapper function is defined as:

```
__declspec(dllexport) void A_Wrapper_Func(void) // Wrapper function in DLL
{
    double *scale_factor_1; // Shared between kernel functions
    size_t *scale_factor_2; // Shared between kernel functions
    cudaError_t cuda_error;

    cudaMallocManaged(&scale_factor_1, sizeof(double)); // Just 1 element, it is an accumulator
    cudaMallocManaged(&scale_factor_2, sizeof(size_t)); // Just 1 element, it is an accumulator

    // Initializes both variables with 0 and prints errors, if any (but no error is returned)
    if (cudaMemset(scale_factor_1, 0, sizeof(double)) != cudaSuccess)
    {
        cuda_error = cudaGetLastError();
        std::cout << cudaGetErrorString(cuda_error) << std::endl;
    }
    if (cudaMemset(scale_factor_2, 0, sizeof(size_t)) != cudaSuccess)
    {
        cuda_error = cudaGetLastError();
        std::cout << cudaGetErrorString(cuda_error) << std::endl;
    }

    // Prints the sizes of the variables before the kernel calls; returns 8 and 8 as expected
    std::cout << "Sizes before kernel: " << sizeof(scale_factor_1[0]) << " " << sizeof(scale_factor_2[0]) << std::endl;

    // Serial part of the computation, populates scale factors 1 and 2
    kernel_func_1 <<< 1, 1 >>> (array_1, scale_factor_1, scale_factor_2, array_length);
    cudaDeviceSynchronize();

    // Parallel part of the computation (scaling array_1 and writing to array_2)
    kernel_func_2 <<< 200, 256 >>> (array_1, array_2, scale_factor_1, scale_factor_2, array_length);
    cudaDeviceSynchronize();

    cudaFree(scale_factor_1);
    cudaFree(scale_factor_2);
}
```

The kernel functions are declared as:

```
__global__ void kernel_func_1(float *array_1, double *scale_fact_1, size_t *scale_fact_2, size_t array_len);
__global__ void kernel_func_2(float *array_1, float *array_2, double *scale_fact_1, size_t *scale_fact_2, size_t array_len);
```

Function 2 just takes one more argument than Function 1: a second array into which array_1, scaled by factors 1 and 2, is written.

From inside these kernel functions I do:

```
printf("%i %i %i %i\n", sizeof(scale_fact_1[0]), sizeof(scale_fact_2[0]), sizeof(double), sizeof(size_t));
```

It prints 8 0 8 0, when the expected output is 8 8 8 8, suggesting that both the size_t variable and sizeof(size_t) are returning 0 inside the kernels.

The code will run a couple of times without complaining (but it produces incorrect results in array_2, which I am printing elsewhere in the program). As I increase the size of the arrays, an error is caught when allocating them (which I am doing and handling in another part of the program), even though there is more than enough free memory on the device.

If I comment out the calls to the kernel functions in the wrapper, I can call it as many times as I want, with array sizes as large as 45% of the free device memory each. As soon as I uncomment either kernel call, the problems arise.

So my questions are:

1 - What is wrong here? Why do these kernel functions report a size_t as 0 bytes? Yes, I’ve rebooted the computer to make sure no memory addresses were messed up.

2 - When I allocate a variable with cudaMallocManaged, it is visible/usable by the device, so there is no need to pass anything else by reference, correct?

3 - Are we stuck with the one-element-array notation to refer to the variable? Is there no way around it?

Thanks a lot.

UPDATE ========================================

I have changed scale_factor_2 to double, as well as the corresponding function parameters, just to see what happens, and the same problem occurs. So it is NOT a size_t problem: the size of scale_factor_2[0], and of double, inside the kernel functions is still 0.

There is something wrong with how this second scaling factor is passed to the functions, but I am unable to figure out what, as it is just another argument.

UPDATE 2 ========================================

The functions were originally declared as:

```
__global__ void kernel_func_1(float *array_1, double *scale_fact_1, size_t *scale_fact_2, size_t array_len);
__global__ void kernel_func_2(float *array_1, float *array_2, double *scale_fact_1, size_t *scale_fact_2, size_t array_len);
```

Then I changed the order of the double and size_t arguments, making them:

```
__global__ void kernel_func_1(float *array_1, size_t *scale_fact_2, double *scale_fact_1, size_t array_len);
__global__ void kernel_func_2(float *array_1, float *array_2, size_t *scale_fact_2, double *scale_fact_1, size_t array_len);
```

Now it is “double *scale_fact_1” that has the problem. That is, if inside a kernel I print sizeof(scale_fact_1[0]), and strangely sizeof(double) too, they come out as 0.

So it seems to be the third argument in function_1 and the fourth argument in function_2, regardless of type.

I honestly have no idea what is happening, as no errors are returned from the CUDA calls or the kernels…