I have two cudaMalloc statements in my program to allocate the device input and output pointers. My whole program runs in 22 ms, and I figured out that the malloc statements take most of that time. The first malloc takes a full 20 ms and I am clueless as to why this happens.

Does CUDA allocate the memory every time a device pointer is created, or does it fill the already malloced pointers with data in the subsequent calls?

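The first call into the CUDA runtime does some one-time initialization (e.g. creating a context), so the 20 ms you measure is most likely that setup cost rather than the allocation itself. Every cudaMalloc call does allocate memory, but later calls do not pay the initialization cost again.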
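To check whether the time is going into initialization rather than allocation, one option is to force the context to be created before timing the mallocs. The following is a minimal sketch, not a definitive benchmark: the 1 MiB buffer size, the `time_ms` helper, and the names `d_in` and `d_out` are assumptions for illustration, and `cudaFree(0)` is used here only as a commonly seen way to trigger the one-time setup.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: time a callable in milliseconds using a host clock.
template <typename F>
static double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t bytes = 1 << 20;  // assumed buffer size (1 MiB)
    float *d_in = nullptr, *d_out = nullptr;
    cudaError_t err = cudaSuccess;

    // First runtime call: triggers the one-time initialization.
    double init_ms = time_ms([&] { err = cudaFree(0); });
    if (err != cudaSuccess) {
        std::fprintf(stderr, "init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The two allocations, timed separately after initialization.
    double in_ms = time_ms([&] { err = cudaMalloc((void**)&d_in, bytes); });
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc(d_in) failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    double out_ms = time_ms([&] { err = cudaMalloc((void**)&d_out, bytes); });
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc(d_out) failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_in);
        return 1;
    }

    std::printf("initialization: %.3f ms\n", init_ms);
    std::printf("first malloc:   %.3f ms\n", in_ms);
    std::printf("second malloc:  %.3f ms\n", out_ms);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

If the initialization line accounts for most of the total and both mallocs are comparatively cheap, that would be consistent with the explanation above. The actual numbers depend on the GPU, driver, and system.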