global memory lost

Here is a very simple test program that shows some odd behavior. I hope someone can help me figure it out!

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <math.h>
#include <cutil.h>   // CUDA_SAFE_CALL

#define blockSize 512
#define length    1000

// Each thread fills a private array of `length` doubles and sums it.
void __global__ localAdd(double* data, int N){
  int counter=blockIdx.x*blockSize+threadIdx.x;
  double temp[length],s=0.0;   // s has to start at zero before accumulating

  for(int i=0; i<length; i++)
    temp[i]=1.0;
  for(int i=0; i<length; i++)
    s+=temp[i];

  if(counter<N)                // stay inside the data array
    data[counter]=s;
}

int main(){
  unsigned int N=512*512,M1=1<<30,M3=1<<30,M4=1<<29;
  double *data, *temp1, *temp3,*temp4;

  /* Allocating the 2.5 GB here, before the kernel launch, works:
  CUDA_SAFE_CALL(cudaMalloc((void**)&temp1,M1));
  CUDA_SAFE_CALL(cudaMalloc((void**)&temp3,M3));
  CUDA_SAFE_CALL(cudaMalloc((void**)&temp4,M4));
  */

  CUDA_SAFE_CALL(cudaMalloc((void**)&data,sizeof(double)*N));
  localAdd<<<512,blockSize>>>(data,N);

  // Allocating the same 2.5 GB after the launch fails.
  CUDA_SAFE_CALL(cudaMalloc((void**)&temp1,M1));
  CUDA_SAFE_CALL(cudaMalloc((void**)&temp3,M3));
  CUDA_SAFE_CALL(cudaMalloc((void**)&temp4,M4));
  return 0;
}

My GPU has 2817982464 bytes of global memory.

My kernel allocates a large local array for each thread. Before calling this kernel I can allocate 2.5 GB of global memory, but after calling it I seem to have lost at least 128 MB (the same 2.5 GB allocation now fails).

These local arrays are private to each thread, so they should be released when the kernel finishes. It seems to me that this kernel should not consume any global memory once it is DONE. Can anyone try it and see why this happens?

Many thanks!

That kernel increases the per-thread stack size, and the device memory backing that stack is not necessarily released as soon as the kernel completes.
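
One way to check this on your own card is to query free device memory before and after the launch. The following is only a rough sketch: it assumes a toolkit recent enough to provide cudaMemGetInfo and cudaDeviceSynchronize, the kernel is a trimmed copy of the one in the question, and check, freeBytes and estimateLocalBytes are hypothetical helper names. The estimate rests on the assumption that the driver sizes the reservation for every thread that can be resident on the device at once, not for the threads actually launched. The compiler may also optimize the array away, in which case no difference shows up.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK_SIZE 512
#define LENGTH     1000

// Hypothetical helper: abort with a message if a runtime call fails.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Trimmed copy of the kernel from the question, with s initialised.
__global__ void localAdd(double *data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    double temp[LENGTH];          // per-thread array, expected to live in local memory
    double s = 0.0;
    for (int i = 0; i < LENGTH; i++) temp[i] = 1.0;
    for (int i = 0; i < LENGTH; i++) s += temp[i];
    data[idx] = s;
}

// Free device memory as reported by the runtime, in bytes.
static size_t freeBytes()
{
    size_t freeMem = 0, totalMem = 0;
    check(cudaMemGetInfo(&freeMem, &totalMem), "cudaMemGetInfo");
    return freeMem;
}

// ASSUMPTION: reservation = bytes per thread * SM count * max resident threads per SM.
static unsigned long long estimateLocalBytes(unsigned long long bytesPerThread,
                                             int smCount, int maxThreadsPerSm)
{
    return bytesPerThread * (unsigned long long)smCount
                          * (unsigned long long)maxThreadsPerSm;
}

int main()
{
    const int blocks = 512;
    double *data = NULL;
    // The first runtime call also creates the context, so "before" excludes that cost.
    check(cudaMalloc((void **)&data, sizeof(double) * blocks * BLOCK_SIZE), "cudaMalloc");

    size_t before = freeBytes();
    localAdd<<<blocks, BLOCK_SIZE>>>(data);
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
    size_t after = freeBytes();

    cudaDeviceProp prop;
    check(cudaGetDeviceProperties(&prop, 0), "cudaGetDeviceProperties");
    unsigned long long guess = estimateLocalBytes(LENGTH * sizeof(double),
                                                  prop.multiProcessorCount,
                                                  prop.maxThreadsPerMultiProcessor);

    printf("free before launch: %zu bytes\n", before);
    printf("free after launch:  %zu bytes\n", after);
    printf("rough estimate of the reservation: %llu bytes\n", guess);

    check(cudaFree(data), "cudaFree");
    return 0;
}
```

If the two free-memory figures differ by roughly the estimated amount, the missing memory is the per-thread reservation and not a leak in your own allocations.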

How can I clear it right after the kernel completes? Is there a function call that fixes this?

Try cudaThreadExit(), but keep in mind that it resets the whole CUDA context, so every existing device allocation is released as well.
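
For anyone trying this on a newer toolkit: cudaThreadExit() is deprecated there in favour of cudaDeviceReset(), which has the same effect. Below is a minimal sketch of the order of operations, assuming you can copy your results to the host before the reset. The kernel is again a trimmed copy of the one in the question, and check and toMiB are hypothetical helper names.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK_SIZE 512
#define LENGTH     1000

// Hypothetical helper: abort with a message if a runtime call fails.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical helper: bytes to MiB, for readable output.
static double toMiB(size_t bytes)
{
    return bytes / (1024.0 * 1024.0);
}

// Trimmed copy of the kernel from the question, with s initialised.
__global__ void localAdd(double *data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    double temp[LENGTH];
    double s = 0.0;
    for (int i = 0; i < LENGTH; i++) temp[i] = 1.0;
    for (int i = 0; i < LENGTH; i++) s += temp[i];
    data[idx] = s;
}

int main()
{
    const int blocks = 512;
    double *data = NULL;
    check(cudaMalloc((void **)&data, sizeof(double) * blocks * BLOCK_SIZE), "cudaMalloc");

    localAdd<<<blocks, BLOCK_SIZE>>>(data);
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");

    // Copy out whatever you still need: the reset frees ALL device memory.
    double first = 0.0;
    check(cudaMemcpy(&first, data, sizeof(double), cudaMemcpyDeviceToHost), "cudaMemcpy");

    size_t freeBefore = 0, freeAfter = 0, total = 0;
    check(cudaMemGetInfo(&freeBefore, &total), "cudaMemGetInfo");

    // Tear down the context; 'data' is invalid from here on.
    check(cudaDeviceReset(), "cudaDeviceReset");
    data = NULL;

    // The next runtime call implicitly creates a fresh context.
    check(cudaMemGetInfo(&freeAfter, &total), "cudaMemGetInfo");

    printf("free before reset: %.1f MiB\n", toMiB(freeBefore));
    printf("free after reset:  %.1f MiB\n", toMiB(freeAfter));
    printf("first result copied out before the reset: %.1f\n", first);
    return 0;
}
```

Because the reset invalidates every device pointer, this only suits programs that can save their results first and re-allocate afterwards.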

Thanks a lot! It DOES work :)
