That’s why I added the link here.
The two primary optimization objectives of any CUDA programmer are:
expose enough parallelism: a first-order approximation here is to launch a kernel whose grid has enough threads to saturate the GPU you are running on.
make efficient use of memory
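To make item 1 concrete, a common pattern is a grid-stride loop: it works for any data size, while still letting you choose a grid large enough to saturate the machine. This is a generic sketch, not code from your question:

```cuda
// Grid-stride loop sketch: correct for any n, and the launch configuration
// (not the data size) decides how many threads are resident on the GPU.
__global__ void copy_kernel(const float *in, float *out, size_t n)
{
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x)
        out[i] = in[i];   // coalesced: adjacent threads touch adjacent elements
}

// launch with enough blocks to fill the device, e.g.:
// copy_kernel<<<num_SMs * blocks_per_SM, 256>>>(d_in, d_out, n);
```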
Both principles may be at play here. If we ignore item 1 above, then I would suggest that your method 1 is preferred, because it at least has the opportunity to do a single load operation per input pixel. A store will still be necessary for each slice of the 3D batch that the pixel appears in, but whereas your method 2 would require a load for each store, method 1 would not, necessarily. That could be a more efficient use of memory. When you have a memory-bound problem, which this one certainly will be, it's critical to focus on efficient use of memory. Related to this, make sure you do the best possible job of coalesced access, both for loads and stores. That should not be difficult to arrange.
If we pull in item 1 above, then this raises the next set of questions, around dimensions. Dimensions/data set sizes often matter when we are talking about performance. If we used your method 1, we only have as many threads to launch as there are pixels in the input image. If there are enough pixels/threads, that may be sufficient to saturate your GPU. If not, method 2 may actually be better, because it may have the potential to launch more threads overall (perhaps many more; again, dimensions matter).
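To illustrate method 1, here is a hedged sketch that assumes a specific batching scheme (overlapping K x K tiles taken at stride S, slices ordered row-major over tile origins) since your actual scheme isn't stated. One thread per source pixel, one load, then a store into every slice containing that pixel:

```cuda
// Assumed tiling for illustration only: slice s is the K x K tile whose
// origin is (s % tiles_x * S, s / tiles_x * S) in the source image.
__global__ void build_batch(const float *img, float *batch,
                            int img_w, int img_h,
                            int K, int S, int tiles_x, int tiles_y)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // source column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // source row
    if (x >= img_w || y >= img_h) return;
    float v = img[(size_t)y * img_w + x];           // single coalesced load

    // tile origin tx*S contains column x iff tx*S <= x <= tx*S + K - 1
    int tx0 = max(0, (x - K + S) / S), tx1 = min(tiles_x - 1, x / S);
    int ty0 = max(0, (y - K + S) / S), ty1 = min(tiles_y - 1, y / S);

    for (int ty = ty0; ty <= ty1; ty++)
        for (int tx = tx0; tx <= tx1; tx++) {
            int slice = ty * tiles_x + tx;
            // one store per slice the pixel appears in; adjacent threads in x
            // write adjacent columns of a slice, so stores coalesce too
            batch[((size_t)slice * K + (y - ty * S)) * K + (x - tx * S)] = v;
        }
}
```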
I don’t know of any primitives (which I translate to mean “library calls that would be useful”). On SO, questions asking for library recommendations are explicitly off topic, so it’s possible that people who didn’t answer were simply adhering to site expectations.
CUDA, to a first-order approximation, is C++ plus libraries. Are there C++ “primitives” that meet your needs? If not, there most likely won’t be any in CUDA either. You then would need to look at libraries. I don’t know of any libraries that would be an obvious fit for what you are asking about here. You could certainly use NPP to do image copying, in the general case. I doubt it would be an obvious performance “win”, and it won’t be amenable to the various optimization strategies here.
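For reference, a hedged sketch of what an NPP copy of one ROI looks like (assuming single-channel 32-bit float data; pitches in bytes). Note that one call per slice of the batch would mean many small operations, which is part of why I doubt it would be a win here:

```cuda
#include <nppi.h>

// Copy one w x h ROI from src to dst using NPP (32-bit float, 1 channel).
void copy_slice(const Npp32f *src, int src_pitch_bytes,
                Npp32f       *dst, int dst_pitch_bytes,
                int w, int h)
{
    NppiSize roi = {w, h};
    nppiCopy_32f_C1R(src, src_pitch_bytes, dst, dst_pitch_bytes, roi);
}
```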
A very good suggestion (IMO) was given to you in your cross posting: don’t copy the data. Instead, just create a mapping that takes each image/pixel in the 3D stack back to the original source image. GPUs can often do math at a much higher rate than memory accesses. Therefore, if the mapping can be done mathematically/algorithmically, it may overall be faster not to create the 3D batch at all; simply use the mapping functions to access pixels in the original image.

The effectiveness of this suggestion may vary depending on how many times you will subsequently access each pixel in the 3D batch. If you have to access the 3D batch many times per pixel (which will require a read per pixel, no matter what), it may be more expedient to create the 3D batch. If you only need to access those output pixels once, then a mapping may be much better. Creating the output batch minimally requires a read, a write, and then a read for the final op. If you use the mapping, there is only the read associated with the final op (if you only read the result once), plus the mapping arithmetic cost. That is a 3:1 reduction in overall memory traffic, which could make as much as a 3:1 difference in performance. OTOH, if you access each pixel in the output 3D batch many times, then the bandwidth cost of using the mapping will be approximately the same as the bandwidth cost of using the 3D batch.
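Under the same assumed tiling as above (overlapping K x K tiles at stride S, slices ordered row-major over tile origins; your actual scheme may differ), the mapping is just index arithmetic, so the consumer kernel can read the source image directly instead of a materialized batch:

```cuda
// Hedged sketch: logical batch coordinate (slice, i, j) -> source pixel.
// No batch is ever created; the consumer calls this instead of reading one.
__device__ __forceinline__
float batch_pixel(const float *img, int img_w,
                  int slice, int i, int j,   // (row, col) within the slice
                  int tiles_x, int S)
{
    int ty = slice / tiles_x;   // tile-origin row index
    int tx = slice % tiles_x;   // tile-origin col index
    return img[((size_t)(ty * S + i)) * img_w + (tx * S + j)];
}
```

A few integer divides/multiplies per access is usually cheap compared with the DRAM read they replace, which is why this trades well on a memory-bound problem.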