Does cudaMemset internally call a kernel?

I wanted to ask whether cudaMemset(…) actually initialises an array on the device by calling a kernel and doing it in parallel, or whether it uses the PCIe bus to copy elements individually to the device array.

Although the latter makes little sense to do, I could not find the specifics mentioned anywhere.

If anyone knows how cudaMemset(…) is implemented, please shed some light on this.

Thank you.

Yes, typically it calls a kernel. This is because it is far more efficient (i.e. faster) to do it this way than to copy the entire buffer over the PCIe bus. A kernel can access device memory at speeds typically in excess of 100GB/s. PCIe bus speed varies by generation, but is currently somewhere in the 6GB/s to 25GB/s range (or around 50GB/s for Gen5, I guess).

This isn’t documented anywhere that I know of; it’s an implementation detail. You can inspect the behavior yourself with a GPU profiler.
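As a rough way to see the bandwidth gap described above on your own machine, here is a minimal sketch that times cudaMemset on a device buffer against a host-to-device cudaMemcpy of the same size. The helper `time_ms`, the `CHECK` macro, and the 256 MiB buffer size are my own choices for illustration, not anything from the CUDA documentation, and the timings say nothing definitive about how cudaMemset is implemented; they only show how quickly each operation completes.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Hypothetical helper: run `op` between two CUDA events on the default
// stream and return the elapsed time in milliseconds.
template <typename F>
float time_ms(F op) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start));
    op();
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));  // wait until the work has finished
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms;
}

int main() {
    const size_t bytes = 256u << 20;  // 256 MiB, an arbitrary test size
    void* d_buf = nullptr;
    void* h_buf = nullptr;
    CHECK(cudaMalloc(&d_buf, bytes));
    CHECK(cudaMallocHost(&h_buf, bytes));  // pinned host memory for the copy
    std::memset(h_buf, 0, bytes);

    // Warm-up so one-time initialization cost is not included in the timings.
    CHECK(cudaMemset(d_buf, 0, bytes));
    CHECK(cudaDeviceSynchronize());

    float memset_ms = time_ms([&] { CHECK(cudaMemset(d_buf, 0, bytes)); });
    float copy_ms = time_ms([&] {
        CHECK(cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice));
    });

    const double gb = bytes / 1e9;
    std::printf("cudaMemset:     %.3f ms (%.1f GB/s)\n",
                memset_ms, gb / (memset_ms / 1e3));
    std::printf("cudaMemcpy H2D: %.3f ms (%.1f GB/s)\n",
                copy_ms, gb / (copy_ms / 1e3));

    CHECK(cudaFreeHost(h_buf));
    CHECK(cudaFree(d_buf));
    return 0;
}
```

This should build with something like `nvcc -o memset_vs_copy memset_vs_copy.cu` (the file name is just a placeholder). Running the resulting binary under a GPU profiler such as Nsight Systems then lets you look at what activity the cudaMemset call actually generates on your driver and GPU.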
