cudaMemset3D vs. own kernel: why is cudaMemset3D slower than a kernel?


I am a bit puzzled. I have to zero a large float array in global memory, allocated via cudaMalloc3D.
The operation is time critical, so I used cudaMemset3D() to do the job, assuming it is optimized.
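For reference, this is roughly the setup I mean (a minimal sketch; the dimensions nx/ny/nz are placeholders, not my actual sizes):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t nx = 256, ny = 256, nz = 256; // placeholder dimensions

    // cudaMalloc3D pads each row to a pitch for aligned access.
    cudaExtent extent = make_cudaExtent(nx * sizeof(float), ny, nz);
    cudaPitchedPtr devPitchedPtr;
    cudaMalloc3D(&devPitchedPtr, extent);

    // Zero the allocation, padding included.
    cudaMemset3D(devPitchedPtr, 0, extent);

    cudaFree(devPitchedPtr.ptr);
    return 0;
}
```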

However, a simple kernel that sets the float elements to 0.0f individually performs the same job about 25% faster, after I slightly tuned the number of blocks and threads.
(It can also take three times longer with the wrong thread/block settings.)
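The kernel I compare against looks roughly like this (a sketch of a per-element zeroing kernel over a pitched allocation; the block shape shown is just one of the tuning knobs I mentioned):

```cuda
#include <cuda_runtime.h>

// One thread per float element; respects the row pitch from cudaMalloc3D.
__global__ void zero3D(cudaPitchedPtr p, size_t nx, size_t ny, size_t nz) {
    size_t x = blockIdx.x * blockDim.x + threadIdx.x;
    size_t y = blockIdx.y * blockDim.y + threadIdx.y;
    size_t z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;

    // Each row is padded to p.pitch bytes; a z-slice is p.pitch * p.ysize bytes.
    char*  slice = (char*)p.ptr + z * p.pitch * p.ysize;
    float* row   = (float*)(slice + y * p.pitch);
    row[x] = 0.0f;
}

// Example launch (block/grid shape is what I tuned):
// dim3 block(32, 8, 1);
// dim3 grid((nx + block.x - 1) / block.x,
//           (ny + block.y - 1) / block.y,
//           nz);
// zero3D<<<grid, block>>>(devPitchedPtr, nx, ny, nz);
```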

If that is so, where can I find THE optimized way to do this?
Or do I have to live with the fact that cudaMemset3D() only has good performance on average?
Any hints?