Question about clearing memory

Hello all

I am currently experimenting with CUDA, trying to understand how to pass data to the device and do operations on it. I am using the following code to experiment with and learn from. It compiles and runs reasonably well, but one thing has me annoyed. I start out by giving it fairly small arrays, with lengths under N = 10, and then increase the block size until it gets the correct value of 3 in all entries before computing the sum. However, if I then increase N to a higher value and afterwards decrease it again, the previous value is for some reason still stored in device memory and copied back with the new results, making them incorrect.

So my question is this: am I doing something wrong in my cleanup step? And is there a way to make sure that old entries are reset after each run?

It may not be relevant for this code, but I am planning to expand it to operate on arrays of different sizes, and it is a problem if the old values are never reset to 0 or simply never accessed.

[codebox]#include <stdio.h>
#include <iostream>
#include <cuda.h>

using namespace std;

// Test values
#define Nblocks 1
#define Blocksize 1
#define N 2

// Kernel
__global__ void Vecop(float *A, float *B, float *C, float *D)
{
    int i = threadIdx.x;
    D[i] = A[i] + B[i] + C[i];
}

int main()
{
    // Declare memory pointers
    float *A_h, *B_h, *C_h, *D_h; // Host side
    float *A_d, *B_d, *C_d, *D_d; // Device side

    // Allocate host and device memory.
    A_h = (float *) malloc(sizeof(float) * N);
    B_h = (float *) malloc(sizeof(float) * N);
    C_h = (float *) malloc(sizeof(float) * N);
    D_h = (float *) malloc(sizeof(float) * N);

    cudaMalloc( (void **) &A_d, sizeof(float) * N);
    cudaMalloc( (void **) &B_d, sizeof(float) * N);
    cudaMalloc( (void **) &C_d, sizeof(float) * N);
    cudaMalloc( (void **) &D_d, sizeof(float) * N);

    cout << "Memory Allocated Successfully …\n";

    // Fill up the data arrays with values
    for (int i = 0; i < N; i++)
    {
        A_h[i] = 1.0;
        B_h[i] = 1.0;
        C_h[i] = 1.0;
    }

    // Copy data from host to device
    cout << "Preparing Kernel call …\n";
    cudaMemcpy(A_d, A_h, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(C_d, C_h, sizeof(float) * N, cudaMemcpyHostToDevice);

    cout << "GPU Working …\n";

    // Call the kernel
    Vecop<<< Nblocks, Blocksize >>>(A_d, B_d, C_d, D_d);

    cout << "All done copying back data …\n";

    // Copy data from device to host
    cudaMemcpy(D_h, D_d, sizeof(float) * N, cudaMemcpyDeviceToHost);

    // Calculate the sum
    float S = 0.0;
    for (int i = 0; i < N; i++)
    {
        cout << D_h[i] << endl;
        S = S + D_h[i];
    }

    cout << "The Sum is " << S << endl;

    // Cleanup
    free(A_h); free(B_h); free(C_h); free(D_h);
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d); cudaFree(D_d);

    cout << "All Done!\n";
    return 0;
}[/codebox]
The output looks like this:

Memory Allocated Successfully …

Preparing Kernel call …

GPU Working …

All done copying back data …


3

1 <<<<< This is what I am talking about, the value from a previous run was not reset.

The Sum is 4

All Done!

Thanks in advance for any input.

Yes, the memory is not automatically zeroed out, so you have to account for that in your code if your results depend on it. You can make this call in your setup and/or cleanup code to zero out the results buffer (note the lowercase "set", and that it takes the device pointer directly, not its address):

cudaMemset(D_d, 0, sizeof(float) * N);
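For context, here is a minimal sketch of how that call could fit into the setup from the original post. The names `D_d` and `N` are taken from the code above; error checking is omitted for brevity, and this is just one reasonable placement, not the only one:

```cpp
// Allocate the device result buffer, then zero it before each run.
float *D_d;
cudaMalloc((void **)&D_d, sizeof(float) * N);

// cudaMemset takes the device pointer itself (not its address),
// a byte value, and a byte count. Filling with byte 0 yields 0.0f
// in every element, since all-zero bytes represent 0.0f in IEEE 754.
cudaMemset(D_d, 0, sizeof(float) * N);

// ... launch the kernel and copy D_d back to the host as before ...

cudaFree(D_d);
```

Calling it right after `cudaMalloc` (or at the start of each run) guarantees the buffer never carries values over from an earlier, larger run.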

Many thanks RoBiK

I would not have proceeded if it had in fact been a memory deallocation problem, since I plan to use this on rather large arrays and do not want problems later!

The cudaMemset() function is very convenient in your case. However, you should pay attention if your array is big; in that case, cudaMemset() takes a lot of time.
