Memory management in the device: is there any caching in the device's memory?

Hello everyone,

I’m just getting started with NVIDIA’s CUDA platform and have been running some tests. I’m testing a vector sum using the following code (only showing what I think are the relevant parts):

#include <stdio.h>

__global__ void vecAdd(float * A, float * B, float * C) {
    //One thread per element
    int i = threadIdx.x;
    C[i] = A[i]+B[i];
}

int main(int argc, char ** argv) {
    cudaError_t error;
    //Declare local stuff
    float A[] = {2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
    float B[] = {2.0f, 4.0f, 6.0f, 8.0f, 10.0f};
    float C[5];

    //Move stuff to device
    float * Ad, * Bd, * Cd;
    cudaMalloc((void **)&Ad, 5*sizeof(float));
    cudaMemcpy(Ad, A, 5*sizeof(float), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&Bd, 5*sizeof(float));
    cudaMemcpy(Bd, B, 5*sizeof(float), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&Cd, 5*sizeof(float));

    //Do stuff
    vecAdd<<<1,4>>>(Ad, Bd, Cd);

    //Check for errors
    error = cudaGetLastError();
    if (error != cudaSuccess) {
        printf("Error: %s\n", cudaGetErrorString(error));
    }

    //Retrieve stuff
    cudaMemcpy(C, Cd, 5*sizeof(float), cudaMemcpyDeviceToHost);

    printf("C: {%.2f,%.2f,%.2f, %.2f, %.2f}\n", C[0], C[1], C[2], C[3], C[4]);

    return 0;
}

The code compiles fine and runs without errors, but the behavior is strange: if I modify the source to decrease the number of threads, I still get the same results in vector C. That seemed odd, so after trying thread counts both above and below 5 without seeing any change, I even changed the contents of array B. Nothing changed then either. That makes me think a previously computed result is still sitting in device memory and needs to be erased to get the expected behavior, but I can’t find anything supporting this theory in the documentation. Am I missing something?

Some other interesting experiments: I rebooted the computer and compiled the code with only one thread, and it gave a reasonable result, with nan in all values but the first one. Then I changed it to two threads, which gave nan in all values but the first two. Then I tried 3 and everything looked normal, but when I started decreasing the number again, the output wouldn’t change.

Additional system info which might be very helpful:

  • Nvidia Driver: 177.67 x86_64
  • CUDA Toolkit: 2.0 for Ubuntu 7.10 x86_64
  • System: Frugalware Linux x86_64
  • Device: GeForce 8800GT 256MB

Thanks for any help you might give!

David E. Narváez

This is due to the way the hardware is built: each multiprocessor has 8 processing units and executes a batch of 32 threads (a warp) in 4 clock cycles. Therefore there is a finite granularity to how many threads you can specify. I assume it’s multiples of 32 threads, but it might be 16. Actually, I would have expected your code to cause an “unspecified launch failure”, as it will reference unallocated memory when launched with more than 5 threads. IMO you should always have some (template) parameter N and a check idx < N at the beginning of your kernel.
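The bounds check suggested above might look like the following minimal sketch. The kernel name `vecAddChecked`, the parameter `n`, and the block size of 32 are illustrative assumptions, not part of the original post, and the index computation assumes a 1D launch. The return value of the blocking `cudaMemcpy` is checked as well, since errors raised while the kernel runs are typically reported by a later call rather than by the launch itself.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical bounds-checked variant of the vecAdd kernel from the question.
__global__ void vecAddChecked(const float *A, const float *B, float *C, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {              // surplus threads do nothing
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    const int n = 5;
    float A[n] = {2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
    float B[n] = {2.0f, 4.0f, 6.0f, 8.0f, 10.0f};
    float C[n];

    float *Ad, *Bd, *Cd;
    cudaMalloc((void **)&Ad, n * sizeof(float));
    cudaMalloc((void **)&Bd, n * sizeof(float));
    cudaMalloc((void **)&Cd, n * sizeof(float));
    cudaMemcpy(Ad, A, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(Bd, B, n * sizeof(float), cudaMemcpyHostToDevice);

    // Round the launch up to whole blocks; the idx < n check handles the excess.
    const int threadsPerBlock = 32;  // assumed block size
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAddChecked<<<blocks, threadsPerBlock>>>(Ad, Bd, Cd, n);

    // Launch-configuration errors show up here.
    cudaError_t error = cudaGetLastError();
    if (error == cudaSuccess) {
        // The blocking copy waits for the kernel and reports execution errors.
        error = cudaMemcpy(C, Cd, n * sizeof(float), cudaMemcpyDeviceToHost);
    }
    if (error != cudaSuccess) {
        printf("Error: %s\n", cudaGetErrorString(error));
        return 1;
    }

    printf("C: {%.2f, %.2f, %.2f, %.2f, %.2f}\n", C[0], C[1], C[2], C[3], C[4]);

    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
    return 0;
}
```

With this layout the launch can always use a full block of 32 threads: the 27 threads beyond `n` skip the body, so no out-of-bounds access occurs, and every element of C is written.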

Well, it is quite simple.

The memory allocated is the same each time, and cudaMalloc does not clear it. So when you have had 5 threads write 5 values, you have 5 values in memory.
If after that you have only 3 threads write values, the last 2 values remain the same. If you copy all 5 of them back to the host, you will see the same values as before in the last 2 positions.
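One way to make this effect visible is to clear the output buffer before the launch, so that elements no thread wrote are easy to tell apart from stale data. The sketch below assumes the same five-element inputs as the question and deliberately launches only 3 threads; the `threads` constant and the choice of zero as the fill value are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same kernel as in the question: one thread per element, no bounds check.
__global__ void vecAdd(const float *A, const float *B, float *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {
    const int n = 5;
    const int threads = 3;  // deliberately fewer threads than elements
    float A[n] = {2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
    float B[n] = {2.0f, 4.0f, 6.0f, 8.0f, 10.0f};
    float C[n];

    float *Ad, *Bd, *Cd;
    cudaMalloc((void **)&Ad, n * sizeof(float));
    cudaMalloc((void **)&Bd, n * sizeof(float));
    cudaMalloc((void **)&Cd, n * sizeof(float));
    cudaMemcpy(Ad, A, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(Bd, B, n * sizeof(float), cudaMemcpyHostToDevice);

    // cudaMalloc does not initialize memory, so zero the output buffer;
    // anything left at 0.00 afterwards was not written by the kernel.
    cudaMemset(Cd, 0, n * sizeof(float));

    vecAdd<<<1, threads>>>(Ad, Bd, Cd);

    cudaError_t error = cudaMemcpy(C, Cd, n * sizeof(float), cudaMemcpyDeviceToHost);
    if (error != cudaSuccess) {
        printf("Error: %s\n", cudaGetErrorString(error));
        return 1;
    }

    printf("C: {%.2f, %.2f, %.2f, %.2f, %.2f}\n", C[0], C[1], C[2], C[3], C[4]);

    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
    return 0;
}
```

With 3 threads, this should print the first three sums followed by two zeros (`C: {4.00, 7.00, 10.00, 0.00, 0.00}`), rather than whatever the last two positions held from an earlier run.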