Why does cudaMalloc time depend on kernel calls? cudaMalloc takes more time if you call a kernel

I understand there are other issues with cudaMalloc, such as the first call to cudaMalloc taking very long, but why does it take longer depending on how many times I call a kernel?
I have code that makes a single call to cudaMalloc and then calls a kernel. The cudaMalloc call is the first thing I do; only after that do I start calling the kernel. If I call the kernel only once, the cudaMalloc takes 1 second to complete. But if I call the kernel 4 times, the cudaMalloc takes 5 seconds to complete. If I call the kernel 16 times, the cudaMalloc takes 20 seconds to complete. If I call the kernel 32 times, the cudaMalloc takes over 3 minutes to complete.
So why does the time it takes cudaMalloc to complete depend on how many times I call a kernel, when I call the kernel after I have made the cudaMalloc call?
I have the full code posted at the following link:

cudaMalloc has to wait for the card to finish executing the kernel before it can allocate memory.

Tim’s correct. In your source code, your timing loop is not even measuring your kernel time; it’s measuring kernel queuing time.
You don’t even have a timer around the cudaMalloc() call.

Try a cudaThreadSynchronize() call right after your kernelmain() call to make your kernel timings have some validity.
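
To make the difference between queuing time and execution time concrete, here is a minimal sketch. It is not the original poster's code: busyKernel and msSince are hypothetical names made up for illustration, and the sizes and iteration count are arbitrary. It uses cudaDeviceSynchronize(), which replaced cudaThreadSynchronize() in later CUDA releases (the older call is deprecated); on an old toolkit, substitute cudaThreadSynchronize().

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a slow kernel: each thread spins on some arithmetic.
__global__ void busyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k)
            v = v * 1.0000001f + 0.5f;
        data[i] = v;
    }
}

// Hypothetical helper: host-side milliseconds elapsed since t0.
static double msSince(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;

    // Force context creation up front so the one-time startup cost
    // is not charged to the cudaMalloc timed below.
    cudaFree(0);

    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc((void **)&d_data, n * sizeof(float));
    printf("cudaMalloc:              %.3f ms\n", msSince(t0));
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));
    cudaDeviceSynchronize();  // make sure the memset is done before timing the kernel

    t0 = std::chrono::steady_clock::now();
    busyKernel<<<(n + 255) / 256, 256>>>(d_data, n, 10000);
    // The launch returns as soon as the kernel is queued.
    printf("launch only (queuing):   %.3f ms\n", msSince(t0));

    // Block the host until the kernel has actually finished.
    err = cudaDeviceSynchronize();
    printf("launch + execution:      %.3f ms\n", msSince(t0));
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(d_data);
    return 0;
}
```

If the first kernel number is tiny and the second is large, the timer without the synchronize call was only measuring how long it takes to queue the launch, and the real kernel cost was being absorbed by whichever later call happened to block.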

Hmmm, well, after restructuring my code, I have found that it’s the kernel itself that is slow. One execution of the kernel takes 1200 ms by itself. The problem was that I assumed the malloc function and the kernel call I used were the synchronous versions of the functions, not the asynchronous ones. So I was expecting each call not to return until it was done allocating the memory, or until all the threads had terminated. I was not expecting ALL of them to be asynchronous. This threw me off because explicit functions for asynchronous operation exist.