cudaMemset bug: is cudaMemset really so slow?

Hey,

I was surprised when I measured the performance of cudaMemset(): it consistently clocks in at ~7-8 GB/s (on the Quadro FX 5600).

(Sample code that measures the performance is attached.)

[codebox]int n = 1 << 27;
if (argc > 1)
    n = atoi(argv[1]);

int size = n * sizeof(int);
int* d_idata;
cutilSafeCall( cudaMalloc( (void**) &d_idata, size));

unsigned int timer;
cutCreateTimer(&timer);
cutResetTimer(timer);

cudaThreadSynchronize();
cutStartTimer(timer);

// warmup (note: these iterations are inside the timed region)
int it_wrmup = 27;
for (int i = 0; i < it_wrmup; i++)
    cutilSafeCall( cudaMemset( d_idata, 0, size));

// check if kernel execution generated an error
cutilCheckMsg("Kernel execution failed");

cudaThreadSynchronize();
cutStopTimer(timer);

// average time per cudaMemset call, in ms
float time = cutGetTimerValue(timer)/it_wrmup;

// Output:
Data-size:134217728 ints(0.500000 GB) Time:74.76 ms Through-put:7.181636 GB/s
[/codebox]
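For anyone who wants to repeat the measurement without the SDK's cutil helpers, here is a rough sketch of the same timing loop using CUDA events. The function name `memset_throughput_gbps` and the single untimed warm-up call are my own assumptions, not part of the attached file, and it reports decimal GB/s (1 GB = 1e9 bytes).

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not from the attached file): average cudaMemset
// throughput in GB/s (1 GB = 1e9 bytes), or -1.0 on any error.
double memset_throughput_gbps(size_t bytes, int iterations)
{
    void* d_buf = nullptr;
    if (cudaMalloc(&d_buf, bytes) != cudaSuccess)
        return -1.0;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // One untimed call so one-time setup costs are not measured.
    cudaMemset(d_buf, 0, bytes);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iterations; ++i)
        cudaMemset(d_buf, 0, bytes);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until all timed work has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaError_t err = cudaGetLastError();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);

    if (err != cudaSuccess || ms <= 0.0f)
        return -1.0;
    // total bytes written / elapsed seconds, scaled to GB/s
    return (double)bytes * iterations / (ms * 1e-3) / 1e9;
}
```

Calling it as `memset_throughput_gbps((size_t)(1 << 27) * sizeof(int), 27)` would mirror the buffer size and iteration count used above.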

7-8 GB/s is awfully slow for such a primitive operation; I would expect performance comparable to cudaMemcpy() (~60 GB/s)!

I guess a simple two-line fill kernel would be faster…

Is this a bug in cudaMemset(), or some other issue?
cudamemset_test.cu (1.2 KB)

I haven’t tested it recently but I’ve certainly experienced the same thing in the past. I’m guessing I was probably using CUDA 2.0 at the time. I wrote a simple kernel to do it instead (as you suggest) and I haven’t really worried about it since.
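A minimal sketch of the kind of fill kernel being discussed, in case it helps. This is not the exact kernel I used back then. The names `fill_kernel` and `fill_device`, the 256-thread block size, and the 1024-block cap are arbitrary assumptions for illustration. Note that it writes whole `int` values, whereas cudaMemset() sets individual bytes.

```cuda
#include <cuda_runtime.h>

// Hypothetical fill kernel: each thread writes elements in a grid-stride
// loop, so any element count is covered by a fixed-size grid.
__global__ void fill_kernel(int* data, int value, size_t n)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] = value;
}

// Hypothetical host wrapper: launches the kernel and returns the launch status.
// The launch is asynchronous; synchronize before timing or reading results.
cudaError_t fill_device(int* d_data, int value, size_t n)
{
    if (n == 0)
        return cudaSuccess;

    const int threads = 256;                          // assumed block size
    size_t needed = (n + threads - 1) / threads;      // blocks to cover n once
    int blocks = (int)(needed < 1024 ? needed : 1024); // assumed cap on grid size

    fill_kernel<<<blocks, threads>>>(d_data, value, n);
    return cudaGetLastError();
}
```

To replace the cudaMemset() call in the test above you would call `fill_device(d_idata, 0, n)` inside the timed loop. Whether it is actually faster will depend on the card and CUDA version, so measure it on your own setup.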