A few quick general questions I couldn't find answers to

I am working on a CUDA project that I am trying to debug, and I was wondering if someone could help me with a few minor problems I’ve been having and perhaps give me some tips.

  1. I am trying to time a kernel I wrote using cudaEventRecord. The events work fine and return a seemingly valid result except in certain large cases. I run the kernel in a loop, record the elapsed event time after each iteration, and accumulate the result in another variable. Here is the code:
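cudaEvent_t start, stop;
float time;
// (event creation is not shown in the post)
double totalTime = 0;

for (int i = 0; i < nTimes; i++)
{
     cudaEventRecord(start, 0);
     // (kernel launch is not shown in the post)
     cudaEventRecord(stop, 0);
     // (any synchronization on the stop event is not shown in the post)
     cudaEventElapsedTime(&time, start, stop);
     totalTime = totalTime + time;
}

std::cout << "Execution Time is " << totalTime << " ms" << std::endl;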

Once nTimes gets past a certain number (in my case 184), the value of time is 0 for every iteration of the for loop. With fewer iterations (183 or fewer) I get a valid result. My guess is that a float does not have enough range to hold larger numbers, so it just outputs zero, but I am not sure. Any ideas?

  2. For anyone who has used the Thrust library: I am trying to use a device_vector (the documentation can be found here), and I am declaring it with the C++ new operator. I am not having any problems, but I want to delete the device_vectors after I use them so I am not wasting global memory on the device. However, as you can see in the documentation, there is no destructor listed, which means the default destructor will be used. I am assuming the default destructor will only perform a “shallow delete”, which means memory from this device_vector will still be in use on the host and in device global memory. How do I fully delete these objects?

  3. I also wanted to know if there is a CUDA function I could use to print the amount of global memory on the device that is occupied at any given time. And what is the function to print the total global memory of the device?

Thanks in advance to anyone who can help me out.


 I don't see anything syntactically wrong with your code, so I am wondering if an error is occurring on the device once the loop count goes past 184. As a test, could you put a cudaThreadSynchronize call after the kernel call, then call cudaGetLastError and see if the return value is something other than cudaSuccess (which numerically is 0)? You can also check the return status of the cudaMemcpy and event calls for errors.
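A minimal sketch of that check is below. The kernel `dummyKernel`, the helper `check`, the buffer size, and the launch configuration are hypothetical stand-ins, since the original kernel is not shown. The sketch uses cudaDeviceSynchronize, which replaced the now-deprecated cudaThreadSynchronize in later CUDA releases, and it tests the return status of every runtime call, so a failure is reported at the iteration where it first appears.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel from the question.
__global__ void dummyKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f;
}

// Hypothetical helper: report a failed runtime call and return false.
static bool check(cudaError_t err, const char *what, int iter)
{
    if (err != cudaSuccess) {
        std::printf("iteration %d: %s failed: %s\n",
                    iter, what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    const int n = 1 << 20;     // assumed problem size
    const int nTimes = 200;    // more than the 184 iterations from the question
    float *d_data = NULL;
    cudaEvent_t start, stop;

    if (!check(cudaMalloc(&d_data, n * sizeof(float)), "cudaMalloc", -1)) return 1;
    if (!check(cudaMemset(d_data, 0, n * sizeof(float)), "cudaMemset", -1)) return 1;
    if (!check(cudaEventCreate(&start), "cudaEventCreate(start)", -1)) return 1;
    if (!check(cudaEventCreate(&stop), "cudaEventCreate(stop)", -1)) return 1;

    double totalTime = 0.0;
    for (int i = 0; i < nTimes; i++) {
        float ms = 0.0f;
        if (!check(cudaEventRecord(start, 0), "cudaEventRecord(start)", i)) break;
        dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        if (!check(cudaEventRecord(stop, 0), "cudaEventRecord(stop)", i)) break;
        // Wait for the kernel and the stop event to finish; errors raised
        // while the kernel runs are returned here.
        if (!check(cudaDeviceSynchronize(), "cudaDeviceSynchronize", i)) break;
        // Catches launch errors such as an invalid configuration.
        if (!check(cudaGetLastError(), "kernel launch", i)) break;
        if (!check(cudaEventElapsedTime(&ms, start, stop), "cudaEventElapsedTime", i)) break;
        totalTime += ms;
    }
    std::printf("Execution Time is %f ms\n", totalTime);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

If one of these calls starts failing at a particular iteration, the printed error string should say more about the cause than a silent elapsed time of 0.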

 I am not familiar with Thrust, but for the device memory question you can use the cudaMemGetInfo call; you can see the details of the call in the CUDA API reference manual. It reports the free and total global memory, so the used amount is the difference between the two.
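As a rough illustration of cudaMemGetInfo, the sketch below prints used and total global memory at three points. It also touches on the open Thrust question, under the assumption that device_vector releases its device storage in its destructor, so that delete on a vector created with new returns the memory. The helper `reportMemory` and the vector size are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>

// Hypothetical helper: print used and total global memory with a label.
static void reportMemory(const char *label)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) {
        std::printf("%s: cudaMemGetInfo failed: %s\n", label, cudaGetErrorString(err));
        return;
    }
    const double mb = 1024.0 * 1024.0;
    std::printf("%s: used %.1f MB of %.1f MB\n",
                label, (totalBytes - freeBytes) / mb, totalBytes / mb);
}

int main()
{
    reportMemory("before allocation");

    // 64M floats, about 256 MB of device memory (size chosen for illustration).
    thrust::device_vector<float> *vec =
        new thrust::device_vector<float>(64 * 1024 * 1024);
    reportMemory("after allocation");

    // Assumption: the destructor frees the device storage.
    delete vec;
    reportMemory("after delete");

    return 0;
}
```

The used figure covers the whole device, including the CUDA context and any other processes, so the difference between readings is more informative than the absolute numbers. If the third reading drops back close to the first, the delete released the device memory.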