Is there any corresponding device-side function like cudaMemset2D, or any memset that can handle memory in WORD/DWORD units?


What do you mean exactly? cudaMemset works on arrays of any datatype.

But cudaMemset can only be called by the host. I want it to work in device code; in a kernel function there's no such version.

cudaMemsetAsync can be used within a kernel.
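For context, device-side cudaMemsetAsync belongs to the CUDA device runtime (dynamic parallelism), so the file must be compiled with relocatable device code and linked against cudadevrt. A minimal sketch, assuming `buf` and `n` are the buffer and byte count you want cleared:

```cuda
// Sketch: device-side cudaMemsetAsync via the CUDA device runtime
// (dynamic parallelism). Build with: nvcc -rdc=true file.cu -lcudadevrt
__global__ void parent_kernel(unsigned char *buf, size_t n)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        // Byte-wise fill, same semantics as host cudaMemset; the call is
        // asynchronous with respect to the launching thread.
        cudaMemsetAsync(buf, 0, n);
    }
}
```

Note that, like its host counterpart, it fills bytes, not WORDs or DWORDs.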

But you could simply memset your array manually in the kernel, could you not?

The address range to be memset is computed dynamically inside the kernel; each thread has its own dynamically allocated memory space, so there's no way to pre-memset it with a standalone kernel. It could be done with a nested (child) kernel, but I'd rather not, because the amount of memory to set can differ greatly between threads. Might that cause performance problems? Anyhow, if there's no better method, I'll have to do that.

Use of memset() is a bit of a code smell. I will allow for exceptions, but whatever the question is, memset() is unlikely to be the best answer or even an appropriate answer. In forty years of programming I have used it maybe a dozen times. Initializing a buffer of floating-point data to NaNs, as part of test scaffolding, is one case I specifically recall.

This does not sound like a recipe for a high-performance GPU-accelerated software design. I would suggest looking at the use case afresh.

Why can’t you just do the following in each thread?

int numBytes = /* compute bytes of thread */;
for (int i = 0; i < numBytes; i++) {
    bytes[i] = /* your value */;
}
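If the region is suitably aligned, the same per-thread loop can store 4-byte words instead of bytes, which also addresses the WORD/DWORD part of the original question. A sketch under the assumptions that `bytes` is 4-byte aligned and `numBytes`/`val` are the thread's size and fill byte:

```cuda
// Sketch: per-thread fill using DWORD (4-byte) stores, with a byte
// tail loop for sizes that are not a multiple of 4.
unsigned char val = 0;                            // assumed fill byte
unsigned int pattern = 0x01010101u * val;         // byte replicated into a DWORD
unsigned int *words = reinterpret_cast<unsigned int *>(bytes);
int numWords = numBytes / 4;
for (int i = 0; i < numWords; i++)
    words[i] = pattern;
for (int i = numWords * 4; i < numBytes; i++)     // remaining 0-3 bytes
    bytes[i] = val;
```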
Or, to make it slightly faster, have each warp collaborate.
E.g. a for loop over the regions to be cleared for the 32 threads of a warp. Within it, distribute the memory address and size (with shuffles), and then collaboratively set the region to 0 or any other value. Then you have coalesced memory accesses.
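That warp-collaborative idea might be sketched as follows. The per-lane `ptr`/`len` pair and the fill value are assumed names, and the sketch assumes the warp is full and converged when it calls the function:

```cuda
// Sketch: the 32 threads of a warp clear each lane's region together.
// Each lane owns (ptr, len); the warp iterates over the lanes, and for
// lane l all 32 threads stride through lane l's region, so consecutive
// lanes touch consecutive bytes: coalesced stores.
__device__ void warp_collaborative_memset(unsigned char *ptr, int len,
                                          unsigned char value)
{
    const unsigned full = 0xffffffffu;   // assumes a full, converged warp
    int lane = threadIdx.x % 32;
    for (int l = 0; l < 32; l++) {
        // Broadcast lane l's pointer and length to the whole warp.
        unsigned char *p = (unsigned char *)
            __shfl_sync(full, (unsigned long long)ptr, l);
        int n = __shfl_sync(full, len, l);
        for (int i = lane; i < n; i += 32)
            p[i] = value;
    }
}
```

The byte stores could again be widened to DWORDs once alignment is handled, but even the byte version fixes the main problem: no single thread serially walks a large region on its own.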