It seems plausible to me. When I was pondering this, that was my speculation as well. However, it leaves the door open for questions I can’t explain at the moment, so it’s not yet entirely obvious to me what is going on here.
I have already filed an internal request (bug) to have the documentation improved with respect to this, and I can’t say anything definitive until that gets some traction.
This being the week before GTC, everyone is rather busy/distracted at the moment.
Since I have filed a bug, I won’t “forget” about this. If/when I see movement on the bug that allows me to comment here publicly with something more concrete, I will attempt to do so. Nevertheless, if we view this as a documentation issue, the earliest possible scenario to address it formally would be the next CUDA release, and currently I have no indication that any update here would be included in the next CUDA release.
Let’s suppose the width of our cudaArray is w and its height is h. Now let’s suppose we want to transfer data to this array. In the general case, we may wish to transfer data to a specific region within this array, the size of which is called the extent. Let’s say that the origin of that region is (x,y), the width of that region is sw, and the height of that region is sh. Like this:
To formulate a proper cudaMemcpy2DToArray operation to this destination region in the cudaArray dst from a 2D unpitched host memory allocation src, it would look like this:
cudaMemcpy2DToArray(dst, x*sizeof(src[0]), y, src, sw*sizeof(src[0]), sw*sizeof(src[0]), sh, cudaMemcpyHostToDevice);
Regarding “2D unpitched host memory allocation”, that is effectively a linear allocation of memory. 2D refers to the idea that it logically represents a 2D region, consisting of rows and columns. However, it may ordinarily be a 1D allocation, where the “rows” are adjacent to each other. In addition, the assumption for the above case is that the source allocation effectively has a width of sw, i.e. the entire source allocation corresponds to a data set the size and shape of the green region above.
For the general case of a pitched source (host) allocation, the only differences would be the choice of the src pointer, which must point to the beginning of the source region, and the host pitch parameter, which must be the pitch of the source allocation rather than the width of the transferred region.
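To make the roles of the pitch, width, and height arguments concrete, here is a minimal host-only sketch in plain C++ (no CUDA calls). It is only a model: the function names copy2D and copyRegion are made up for illustration, and the destination is assumed to be an ordinary linear w x h buffer, whereas the real layout of a cudaArray is opaque.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-only model of a pitched 2D copy: `height` rows of `widthBytes` bytes.
// Consecutive source rows start `spitch` bytes apart, destination rows start
// `dpitch` bytes apart. (Hypothetical helper, not part of the CUDA runtime.)
void copy2D(unsigned char* dst, std::size_t dpitch,
            const unsigned char* src, std::size_t spitch,
            std::size_t widthBytes, std::size_t height)
{
    assert(widthBytes <= spitch && widthBytes <= dpitch);
    for (std::size_t row = 0; row < height; ++row)
        std::memcpy(dst + row * dpitch, src + row * spitch, widthBytes);
}

// Copy an unpitched sw x sh block of floats into the region with origin
// (x, y) of a w x h destination that is modeled as a linear buffer.
void copyRegion(std::vector<float>& dst, std::size_t w, std::size_t h,
                std::size_t x, std::size_t y,
                const std::vector<float>& src, std::size_t sw, std::size_t sh)
{
    assert(dst.size() == w * h && src.size() == sw * sh);
    assert(x + sw <= w && y + sh <= h);
    unsigned char* d = reinterpret_cast<unsigned char*>(dst.data() + y * w + x);
    const unsigned char* s = reinterpret_cast<const unsigned char*>(src.data());
    // Unpitched source: the pitch equals the row width, sw * sizeof(float).
    copy2D(d, w * sizeof(float), s, sw * sizeof(float), sw * sizeof(float), sh);
}
```

Because the source is unpitched, the same value (sw times the element size) serves as both the source pitch and the transfer width, which is why those two arguments are identical in the call above.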
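As a rough illustration of the pitched case, the sketch below computes the source-side arguments for a sub-region of a larger host allocation. The struct SrcArgs, the function sourceRegionArgs, and the source-region origin (sx, sy) are hypothetical names introduced here; the code is plain C++ and makes no CUDA calls.

```cpp
#include <cassert>
#include <cstddef>

// The four source-side values a 2D copy needs (hypothetical container).
struct SrcArgs {
    const void* src;         // pointer to the first element of the region
    std::size_t spitch;      // bytes between the starts of consecutive rows
    std::size_t widthBytes;  // bytes transferred per row
    std::size_t height;      // number of rows transferred
};

// hostBase: start of a host allocation whose rows are hostPitchBytes apart.
// (sx, sy): origin of the source region, in elements and rows.
// sw x sh : size of the source region, in elements and rows.
template <typename T>
SrcArgs sourceRegionArgs(const T* hostBase, std::size_t hostPitchBytes,
                         std::size_t sx, std::size_t sy,
                         std::size_t sw, std::size_t sh)
{
    // The region must fit inside one row of the allocation.
    assert((sx + sw) * sizeof(T) <= hostPitchBytes);
    const unsigned char* rowStart =
        reinterpret_cast<const unsigned char*>(hostBase) + sy * hostPitchBytes;
    // The pitch stays that of the full allocation; only the width shrinks.
    return { rowStart + sx * sizeof(T), hostPitchBytes, sw * sizeof(T), sh };
}
```

The returned values would take the place of the src, pitch, width, and height arguments in the unpitched call shown earlier; the destination arguments are unaffected.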
From the above description, it is evident that a pitch is involved when using this API as a replacement for cudaMemcpyToArray (which is deprecated).
Going back to the original question then, when refactoring code that is currently in a realization like this:
it will be necessary to know the width associated with the cudaArray dst. The source line pitch parameter (as well as transfer column width) associated with the cudaMemcpy2DToArray operation must be consistent with (i.e. less than or equal to) the width of the cudaArray (we are considering both widths in elements for this comparison statement, although the widths associated with the cudaMemcpy2DToArray operation are expressed in bytes).
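The comparison can be sketched as a small host-side check. This is a minimal sketch with a hypothetical function name (transferFits), assuming the two constraints stated in the runtime API reference for cudaMemcpy2DToArray: the transfer width must not exceed the source pitch, and offset plus width must not exceed the array width. The byte quantities are converted to elements before comparing with the array width.

```cpp
#include <cassert>
#include <cstddef>

// Returns true when a transfer of widthBytes per row, starting at column
// xElems, fits inside a destination row of arrayWidthElems elements and the
// source pitch is at least as large as the transferred row.
// (Hypothetical helper for illustration, not a CUDA runtime function.)
bool transferFits(std::size_t arrayWidthElems, std::size_t elemBytes,
                  std::size_t xElems, std::size_t widthBytes,
                  std::size_t spitchBytes)
{
    assert(elemBytes > 0);
    if (widthBytes % elemBytes != 0) return false;   // would split an element
    if (widthBytes > spitchBytes) return false;      // width must not exceed pitch
    std::size_t widthElems = widthBytes / elemBytes; // compare in elements
    return xElems + widthElems <= arrayWidthElems;
}
```

For example, a full-row float transfer into an array 1024 elements wide passes with a width and pitch of 4096 bytes at offset 0, and fails as soon as the starting column is nonzero.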
If the width of the cudaArray is not known by inspection of the code, it can be ascertained in situ at runtime using the cudaArrayGetInfo API:
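A minimal sketch of what to do with the result: with the real runtime, a call of the form cudaArrayGetInfo(&desc, &extent, &flags, dst) fills in a cudaChannelFormatDesc (per-channel bit counts x, y, z, w) and a cudaExtent (whose width is in elements for array memory). The struct ArrayInfo below is a hypothetical stand-in for those values so the arithmetic can be compiled without CUDA; it is not a runtime type.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the values reported by cudaArrayGetInfo:
// bits per channel (desc.x, desc.y, desc.z, desc.w) and the array width
// in elements (extent.width).
struct ArrayInfo {
    int bitsX, bitsY, bitsZ, bitsW;
    std::size_t widthElems;
};

// Size of one array element in bytes, summed over all channels.
std::size_t bytesPerElement(const ArrayInfo& info)
{
    int bits = info.bitsX + info.bitsY + info.bitsZ + info.bitsW;
    assert(bits > 0 && bits % 8 == 0);
    return static_cast<std::size_t>(bits / 8);
}

// Width of one array row in bytes: the upper bound for the transfer width
// (plus any byte offset) of a 2D copy into the array.
std::size_t rowBytes(const ArrayInfo& info)
{
    return info.widthElems * bytesPerElement(info);
}
```

For a single-channel float array 1024 elements wide, this gives 4 bytes per element and 4096 bytes per row.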
(I had previously posted that as a response in this thread, but I deleted it because it did not provide a general answer, and without knowing your actual cudaArray allocation, I could not be certain we were talking about that case. In your case, the width parameter is nSamples, and your cudaMemcpy2DToArray pitch and width-of-transfer arguments are consistent with that width.)