I am looking for some general tips on how to deal with arrays in a CUDA kernel. The computation I want has the following inputs and outputs:
Inputs: 2-D array d (size 4K, 8-bit integers) and 3-D array a (size 5K, 32-bit integers)
Outputs: 2-D array s (size 1K, 64-bit integers) and 3-D array p (size 4K, 64-bit integers)
In the C++ program which I am converting to run under CUDA, the calculation of s begins with
memset(s, 0, sizeof(s));
Obviously, this should be done only once, not once per thread. So should s live in device memory that is shared by all threads, and perhaps be initialized by a memset-type call from the host?
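To make the question concrete, here is a minimal sketch of the host-side initialization I have in mind. It assumes s holds 1024 64-bit integers stored as one flat array in global device memory. The names S_ELEMS, d_s and alloc_zeroed_s are made up for illustration, and cudaMalloc, cudaMemset and cudaFree are the CUDA runtime API calls.

```cuda
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Assumption: s has 1024 elements of 64-bit integers, stored as a flat array.
// The real 2-D shape is not needed just to zero the buffer.
constexpr size_t S_ELEMS = 1024;

// Hypothetical helper: allocates s in global device memory and zeroes it
// once from the host. Returns nullptr if either runtime call fails.
int64_t *alloc_zeroed_s() {
    int64_t *d_s = nullptr;
    // The byte count must be computed explicitly: sizeof(d_s) would only
    // give the size of the pointer, unlike sizeof(s) on a host array.
    const size_t bytes = S_ELEMS * sizeof(int64_t);

    if (cudaMalloc(reinterpret_cast<void **>(&d_s), bytes) != cudaSuccess) {
        return nullptr;
    }
    if (cudaMemset(d_s, 0, bytes) != cudaSuccess) {
        cudaFree(d_s);
        return nullptr;
    }
    return d_s;
}

int main() {
    int64_t *d_s = alloc_zeroed_s();
    if (d_s == nullptr) {
        std::fprintf(stderr, "could not allocate or zero s on the device\n");
        return 1;
    }

    // ... kernels that accumulate into d_s would be launched here ...

    cudaFree(d_s);
    return 0;
}
```

As far as I understand, memory obtained from cudaMalloc is global memory visible to every block, whereas a `__shared__` array is private to one block, so the sketch uses the former.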