Memset?

Is there a corresponding device-side function like cudaMemset2D, or any memset that can handle memory in WORD/DWORD units?

urgent!

What do you mean exactly? cudaMemset works on arrays of any datatype.
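For instance, a minimal host-side sketch (note that cudaMemset sets every byte to the given value, so non-zero fill patterns only make sense byte-wise):

float* d_buf = nullptr;
size_t n = 1024;
cudaMalloc((void**)&d_buf, n * sizeof(float));
cudaMemset(d_buf, 0, n * sizeof(float));   // every byte set to 0 -> all elements become 0.0f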

But cudaMemset can only be called from the host. I want it to work in device code; in a kernel function there is no such version.

cudaMemsetAsync can be used within a kernel.

But you could simply memset your array manually in the kernel, could you not?

The address space to be memset is computed dynamically in the kernel; each thread has its own memory region, which is allocated dynamically, so there is no way to pre-memset it with a standalone kernel. It could be done with a nested kernel, but I don't want to do that because the amount of memory to set can differ greatly between threads. Could that cause a performance problem? Anyhow, if there is no better method, I will have to do that.

Use of memset() is a bit of a code smell. I will allow for exceptions, but whatever the question is, memset() is unlikely to be the best answer or even an appropriate answer. In forty years of programming I have used it maybe a dozen times. Initializing a buffer of floating-point data to NaNs, as part of test scaffolding, is one case I specifically recall.

This does not sound like a recipe for a high-performance GPU-accelerated software design. I would suggest looking at the use case afresh.

Why can’t you just do the following in each thread?

int numBytes = /* compute bytes of this thread */;
for (int i = 0; i < numBytes; i++) {
    bytes[i] = /* your value */;
}

Or, to make it slightly faster, have the threads of each warp collaborate.
E.g. a for loop over the regions to be cleared for the 32 threads of a warp. Within it, distribute the memory address and size (with shuffle), and then collaboratively set the region to 0 or any other value. That way you get coalesced memory accesses.
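A minimal sketch of that warp-cooperative idea (the name warpFillRegions and the int element type are just for illustration; it assumes a 1D block with all 32 lanes of the warp active):

// Every lane has computed its own region (myPtr, mySize) in registers.
// The warp fills the 32 regions one after another so that consecutive
// lanes write consecutive elements (coalesced accesses).
__device__ void warpFillRegions(int* myPtr, int mySize, int value)
{
    const unsigned mask = 0xFFFFFFFFu;   // assumes all 32 lanes are active
    const int lane = threadIdx.x & 31;
    for (int owner = 0; owner < 32; ++owner)
    {
        // Broadcast lane 'owner's pointer and size to the whole warp.
        int* base = (int*)__shfl_sync(mask, (unsigned long long)myPtr, owner);
        int  n    = __shfl_sync(mask, mySize, owner);
        // All 32 lanes cooperatively set the owner's region.
        for (int i = lane; i < n; i += 32)
            base[i] = value;
    }
}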

I guess this can cause high cost on the GPU because the loops will run thousands of times. I have coped with the problem by using a doubling memcpy loop that runs log2(itemCount) times, since each item is filled with the same value. I don't know how the performance of this method is.

template <typename T>
__device__ void setMemory(T* address, int num, const T& value)
{
    int times = __log2f(num);        // number of doubling steps (truncated to int)
    int copyNum = 1;
    T* copyFrom = address;
    *copyFrom = value;               // seed the first element
    for (int i = 0; i < times; i++)
    {
        // Double the initialized region by copying what has already been set.
        memcpy(copyFrom + copyNum, copyFrom, copyNum * sizeof(T));
        copyNum *= 2;
    }
    memcpy(copyFrom + copyNum, copyFrom, (num - copyNum) * sizeof(T)); // the remainder
}
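For reference, a hypothetical call site for the sketch above (perThreadPtrs and perThreadCounts are illustrative names for the per-thread regions and element counts computed elsewhere):

__global__ void initKernel(float** perThreadPtrs, int* perThreadCounts)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread fills its own dynamically computed region with the same value.
    setMemory(perThreadPtrs[tid], perThreadCounts[tid], 1.0f);
}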

I would not use floating point calculations and conversions.

Look at __clz
https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__INT.html#group__CUDA__MATH__INTRINSIC__INT_1gd45548f57952ed410fa5d824984f16d0
Or here (with several methods):
https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
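For example, an integer-only floor(log2) based on __clz might look like this (a sketch; valid for num > 0):

__device__ int ilog2(int num)
{
    // 31 minus the number of leading zero bits gives the index of the
    // highest set bit, i.e. floor(log2(num)) for num > 0.
    return 31 - __clz(num);
}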

Most important for speed is that the threads within a warp create coalesced memory accesses.
So they have to cooperate to access related (e.g. consecutive) addresses.