cudaMemset2D is incredibly slow

I’m trying to use cudaMemset2D to zero some memory I have allocated using cudaMallocPitch, and it seems to be incredibly slow: it takes about 0.5s to fill about 500MB, which works out to roughly 1GB/s. I’m using CUDA 1.1. Has anyone else noticed this?

Are you sure that this memory is allocated in card (device) memory?
If it is in system memory, it would be very slow.

Doesn’t cudaMallocPitch() allocate device memory by definition? I’ve more or less solved my problem anyway by replacing N calls to cudaMemset2D() with 1 call to cudaMemset2D() and N-1 calls to cudaMemcpy().
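For anyone who wants to try the same workaround, here is a minimal sketch of the idea: zero the first buffer with a single cudaMemset2D() call, then fill the remaining N-1 buffers with device-to-device copies from it. The buffer count and dimensions are made-up values for illustration, and the CUDA_CHECK macro is a hypothetical helper, not part of CUDA. It assumes all N buffers have the same width and height; if the pitches happen to differ, it falls back to cudaMemcpy2D(). It is written against a recent CUDA toolkit and has not been tested on CUDA 1.1.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    // Illustrative sizes only (assumptions, not from the original post).
    const int    numBuffers = 8;                     // "N"
    const size_t widthBytes = 1024 * sizeof(float);  // row width in bytes
    const size_t height     = 1024;                  // number of rows

    std::vector<void*>  buffers(numBuffers, nullptr);
    std::vector<size_t> pitches(numBuffers, 0);

    // Allocate N pitched buffers in device memory.
    for (int i = 0; i < numBuffers; ++i) {
        CUDA_CHECK(cudaMallocPitch(&buffers[i], &pitches[i],
                                   widthBytes, height));
    }

    // One call to cudaMemset2D() zeroes the first buffer.
    CUDA_CHECK(cudaMemset2D(buffers[0], pitches[0], 0, widthBytes, height));

    // The other N-1 buffers are filled by copying the zeroed one.
    for (int i = 1; i < numBuffers; ++i) {
        if (pitches[i] == pitches[0]) {
            // Same layout: one flat device-to-device copy, padding included.
            CUDA_CHECK(cudaMemcpy(buffers[i], buffers[0],
                                  pitches[0] * height,
                                  cudaMemcpyDeviceToDevice));
        } else {
            // Different pitches: copy row by row with the 2D variant.
            CUDA_CHECK(cudaMemcpy2D(buffers[i], pitches[i],
                                    buffers[0], pitches[0],
                                    widthBytes, height,
                                    cudaMemcpyDeviceToDevice));
        }
    }

    // Wait for all device work to finish before releasing the memory.
    CUDA_CHECK(cudaDeviceSynchronize());

    for (int i = 0; i < numBuffers; ++i) {
        CUDA_CHECK(cudaFree(buffers[i]));
    }
    return 0;
}
```

Note that the flat cudaMemcpy() also copies the padding bytes at the end of each row, which cudaMemset2D() does not touch. That is harmless here because only the first widthBytes of each row are ever meant to be read.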