Hello all
I am currently experimenting with CUDA, trying to understand how to pass data to the device and do operations on it. I am using the following code to experiment with and learn from. It compiles and runs somewhat okay, but one thing has me annoyed. I start out by giving it fairly small arrays to work with, of length N under 10, and then increase the block size until it gets the correct value of 3 in all entries prior to making the sum. However, if I then increase N to a higher value and decrease it again, then for some reason the previous value still seems to be stored in device memory and is copied back with the new results, making them incorrect.
So my question is this: am I doing something wrong in my cleanup step? And is there a way to make sure that old entries are reset after each run?
It may not be relevant for this code, but I am planning on expanding it to operate on arrays of different sizes, and it is a problem if the old values are never reset to 0 or are simply never accessed.
[codebox]#include <iostream>   // cout, endl
#include <cstdlib>    // malloc, free
#include <cuda_runtime.h>
using namespace std;
//Test Values
#define Nblocks 1
#define Blocksize 1
#define N 2
// Kernel
__global__ void Vecop(float *A, float *B, float *C, float *D)
{
int i = threadIdx.x;
D[i] = A[i] + B[i] + C[i];
}
int main()
{
//Declare memory pointers
float *A_h, *B_h, *C_h, *D_h; // Host side
float *A_d, *B_d, *C_d, *D_d; // Device side
// Declare array sizes and memory locations.
A_h = (float *) (malloc(sizeof(float) * N));
B_h = (float *) (malloc(sizeof(float) * N));
C_h = (float *) (malloc(sizeof(float) * N));
D_h = (float *) (malloc(sizeof(float) * N));
cudaMalloc( (void **) &A_d, sizeof(float) * N);
cudaMalloc( (void **) &B_d, sizeof(float) * N);
cudaMalloc( (void **) &C_d, sizeof(float) * N);
cudaMalloc( (void **) &D_d, sizeof(float) * N);
cout << "Memory Allocated Succesfully …\n";
// Fill up the data arrays with values
for (int i = 0; i < N; i++)
{
A_h[i] = 1.0;
B_h[i] = 1.0;
C_h[i] = 1.0;
}
// Copy data from host to device
cout << "Preparing Kernel call …\n";
cudaMemcpy(A_d, A_h, sizeof(float) * N, cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B_h, sizeof(float) * N, cudaMemcpyHostToDevice);
cudaMemcpy(C_d, C_h, sizeof(float) * N, cudaMemcpyHostToDevice);
cout << "GPU Working …\n";
// Call the kernel
Vecop<<< Nblocks, Blocksize >>>(A_d, B_d, C_d, D_d);
cout << "All done copying back data …\n";
// Copy data from device to host
cudaMemcpy(D_h, D_d, sizeof(float) * N, cudaMemcpyDeviceToHost);
// Calculate the sum
float S = 0.0;
for (int i = 0; i < N ; i++)
{
cout << D_h[i] << endl ;
S = S + D_h[i];
}
cout << "The Sum is " << S << endl;
// Cleanup
free(A_h); free(B_h); free(C_h); free(D_h);
cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);cudaFree(D_d);
cout << "All Done!\n";
return 0;
}[/codebox]
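A note on the listing above: the kernel indexes with threadIdx.x only, so a launch with Blocksize threads can write at most D[0] through D[Blocksize-1]. With Blocksize 1 and N 2, D[1] is never written at all. A minimal sketch of one common way around that follows, assuming a global index built from blockIdx.x and blockDim.x plus a bounds check. The names VecopChecked, n, blockSize and numBlocks are made up for illustration and are not part of the original code; error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant of Vecop: takes the element count n and
// skips threads whose global index falls past the end of the arrays.
__global__ void VecopChecked(const float *A, const float *B, const float *C,
                             float *D, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        D[i] = A[i] + B[i] + C[i];
}

int main()
{
    const int n = 10;                                       // number of elements
    const int blockSize = 4;                                // threads per block
    const int numBlocks = (n + blockSize - 1) / blockSize;  // round up to cover n
    const size_t bytes = n * sizeof(float);

    // One host input array of ones is reused for A, B and C.
    float in_h[n], D_h[n];
    for (int i = 0; i < n; i++)
        in_h[i] = 1.0f;

    float *A_d, *B_d, *C_d, *D_d;
    cudaMalloc((void **)&A_d, bytes);
    cudaMalloc((void **)&B_d, bytes);
    cudaMalloc((void **)&C_d, bytes);
    cudaMalloc((void **)&D_d, bytes);

    cudaMemcpy(A_d, in_h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, in_h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(C_d, in_h, bytes, cudaMemcpyHostToDevice);

    // 3 blocks of 4 threads = 12 threads; the last 2 do nothing.
    VecopChecked<<<numBlocks, blockSize>>>(A_d, B_d, C_d, D_d, n);

    // This blocking copy runs after the kernel on the default stream.
    cudaMemcpy(D_h, D_d, bytes, cudaMemcpyDeviceToHost);

    float S = 0.0f;
    for (int i = 0; i < n; i++)
    {
        printf("%g\n", D_h[i]);
        S += D_h[i];
    }
    printf("The Sum is %g\n", S);

    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d); cudaFree(D_d);
    return 0;
}
```

With every input set to 1.0, each entry should come back as 3 regardless of how n compares to blockSize, since the grid size is derived from n instead of being fixed by a #define.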
The output looks like this:
Memory Allocated Succesfully …
Preparing Kernel call …
GPU Working …
All done copying back data …
3
1 <<<<< This is what I am talking about, value from previous run not reset.
The Sum is 4
All Done!
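On resetting old entries: as far as I understand, cudaMalloc does not initialise the memory it returns, and cudaFree does not clear it, so any entry the kernel never writes can hold whatever bytes were left there earlier. A minimal sketch of zeroing the output buffer with cudaMemset before the launch is below. The CHECK macro, the FillThrees kernel and the sizes are hypothetical and only there to make the example self-contained; the kernel deliberately writes just the first entry so the effect of the reset is visible.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                       \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err)); \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Deliberately writes only the first `written` entries of D.
__global__ void FillThrees(float *D, int written)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < written)
        D[i] = 3.0f;
}

int main()
{
    const int n = 4;        // buffer length
    const int written = 1;  // entries the kernel actually touches
    float D_h[n];
    float *D_d = NULL;

    CHECK(cudaMalloc((void **)&D_d, n * sizeof(float)));

    // Set every byte of the buffer to 0; all-zero bytes read back as 0.0f.
    CHECK(cudaMemset(D_d, 0, n * sizeof(float)));

    FillThrees<<<1, n>>>(D_d, written);
    CHECK(cudaGetLastError());  // reports launch errors such as a bad configuration

    CHECK(cudaMemcpy(D_h, D_d, n * sizeof(float), cudaMemcpyDeviceToHost));

    // Entries the kernel skipped now read 0 instead of leftover data.
    for (int i = 0; i < n; i++)
        printf("%g\n", D_h[i]);

    CHECK(cudaFree(D_d));
    return 0;
}
```

Zeroing the buffer only hides the symptom, though: an entry that reads 0 still means no thread computed it, so the launch configuration needs to cover all N elements either way.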
Thanks in advance for any input.