I’m not suggesting the only issue is the presence of the ampersands and that removing them fixes everything. You stated:
I certainly would not use cudaMalloc3D for that, I certainly would not use pitched pointers or allocations for that, and if you choose to do so anyway, transferring data via cudaMemcpy would be yet another head-scratcher for me. For 1D problems, vectorAdd represents a canonical code design method.
I can certainly massage what you have shown into something that “works”, but before I go any further, let me state quite clearly, to you and to future readers, that I do not recommend this approach.
1. cudaMalloc3D, like cudaMallocPitch, creates a pitched allocation. A pitched allocation inherently has a notion of a 2- or 3-dimensional view of the data or the problem, and for partly historical reasons it will “pad” each row of data that you request with possibly additional bytes of “space” that are not intended to be used, but are present in the allocation. This complicates access to the data.
2. cudaPitchedPtr is a structure definition, not a pointer directly/by itself. The structure consists of 4 items: an actual pointer (to the allocated space), the pitch value assigned at the point of allocation by the CUDA runtime, and the xsize and ysize, which are basically extracted from the extent passed to the allocation function (cudaMalloc3D). The pitched pointer, i.e. the .ptr element of the structure, is the thing you want to use in any cudaMemcpyXXX variant that you may choose, to reference the allocated space/data. No other form or treatment of the pointer is sensible.
3. When we copy pitched allocations from one place to another, it’s important to keep in mind that the pitch value (the number of bytes that makes up a row) is assigned by the allocation function (you don’t get to pick it) and it is not necessarily equal to the number of elements in a row of your data times the bytes per element. It could be any number of bytes equal to or larger than that row-width calculation. I don’t want to go into why. The reasons are mostly historical, although it still finds application today in texturing. A key takeaway is that the byte width called the pitch can be any number.
4. cudaMemcpy, by itself (i.e. not cudaMemcpy2D, not cudaMemcpy3D, etc.), copies contiguous data only/always. So if I wanted to use cudaMemcpy with a pitched allocation, and I wanted to copy more than one row, I’m going to run into at least 2 issues: A. I will inevitably copy bytes that I had no use for. B. Expressing the size of the copy operation would not take the form of number of elements times element size, like you would expect for contiguous data. In fact, the size of the operation cannot legally be expressed purely in terms of element size, because there is no guarantee that the actual row width (called the pitch) is a whole-number multiple of the element size in bytes.
5. To dispel a common notion, cudaMalloc3D does not create a pointer that is readily able to be multiply subscripted or multiply dereferenced. It does not directly create an allocation that can be referenced as A[z][y][x], for example. You only get to do A[x], and in fact even that does not work; you must use the pitched variant that I previously linked to (see also the sketch after this list).
6. cudaMalloc3D is a device allocator. That means it allocates space in device (GPU) memory. It is not a host allocator, and CUDA doesn’t provide any allocators that behave like cudaMalloc3D but for the host. I’ve never seen an example where anyone ever wanted to use a pitched allocation in host memory. Therefore your attempt to copy from one device allocation to another device allocation using the cudaMemcpyHostToDevice token is not sensible.
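To make items 1, 2 and 5 above concrete, here is a minimal sketch (the kernel name, sizes and launch configuration are mine, not taken from your code) of how device code would address element [z][y][x] of a cudaMalloc3D allocation, using the fields of the cudaPitchedPtr structure and pointer arithmetic on the .ptr element:

```
#include <stdio.h>

// Sketch only: addressing a pitched 3D allocation "by hand".
// p.ptr   - raw device pointer to the (padded) allocation
// p.pitch - bytes per row, chosen by the runtime (>= nx * sizeof(double))
// p.ysize - rows per 2D slice (the height passed in the extent)
__global__ void fill(cudaPitchedPtr p, int nx, int ny, int nz)
{
    char *base = (char *)p.ptr;
    for (int z = 0; z < nz; z++)
        for (int y = 0; y < ny; y++) {
            // each row begins at a multiple of the pitch, not of nx * sizeof(double)
            double *row = (double *)(base + z * p.pitch * p.ysize + y * p.pitch);
            for (int x = 0; x < nx; x++)
                row[x] = 1.0;   // the closest you get to A[z][y][x]
        }
}

int main()
{
    int nx = 257, ny = 257, nz = 1;   // made-up sizes
    cudaExtent extent = make_cudaExtent(nx * sizeof(double), ny, nz);
    cudaPitchedPtr p;
    cudaMalloc3D(&p, extent);
    fill<<<1, 1>>>(p, nx, ny, nz);
    cudaDeviceSynchronize();
    printf("pitch = %zu bytes for a requested row width of %zu bytes\n", p.pitch, nx * sizeof(double));
    cudaFree(p.ptr);
    return 0;
}
```

Note that the row stride is p.pitch bytes, not nx * sizeof(double) bytes; that is the complication referred to above.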
For 1D problems, and indeed all problems nowadays, the above complexity is not usually necessary, and is not something I would recommend in the general case. It certainly makes no sense to me whatsoever if someone is describing their data as 1D.
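For contrast, the 1D pattern I have in mind is just the vectorAdd style of allocation and copy. This is only a sketch with placeholder sizes of my own:

```
#include <stdio.h>
#include <stdlib.h>

int main()
{
    const size_t N = 1 << 20;                    // placeholder 1D size
    const size_t bytes = N * sizeof(double);
    double *h_data = (double *)malloc(bytes);    // ordinary contiguous host buffer
    for (size_t i = 0; i < N; i++) h_data[i] = (double)i;
    double *d_data;
    cudaMalloc((void **)&d_data, bytes);         // ordinary contiguous device buffer
    // contiguous data: the copy size is just N * sizeof(double), no pitch involved
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```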
Can the code be made to work? Probably. Because of item 6 above, it’s simply not possible to do a host-to-device transfer using two device-side allocations, so I can’t make that work, and will need to change your example slightly in that regard as well. Furthermore, there is no guarantee that I know of around pitch values from two separate allocations. We could assume that if both allocations have the same width, then they are going to have the same pitch, but I don’t know of anywhere that that guarantee is provided. Before we start to say “Gee, that’s strange,” let’s back up and remember that you’re not supposed to be using cudaMemcpy with pitched allocations. If we used cudaMemcpy2D (or 3D), it can handle dissimilar pitches between source and destination.
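For reference, here is a sketch (mine, not the posted code) of what the cudaMemcpy3D version of such a copy would look like; each side’s pitch travels with its cudaPitchedPtr, so dissimilar pitches are handled for you:

```
#include <stdio.h>

// Sketch: copying one pitched device allocation to another with cudaMemcpy3D
cudaError_t copy_pitched(int Nl, int Nm)
{
    cudaExtent extent = make_cudaExtent((Nl + 1) * sizeof(double), Nm + 1, 1);
    cudaPitchedPtr dev_src, dev_dst;
    cudaMalloc3D(&dev_src, extent);
    cudaMalloc3D(&dev_dst, extent);

    cudaMemcpy3DParms p = {0};
    p.srcPtr = dev_src;                  // source pointer plus its pitch
    p.dstPtr = dev_dst;                  // destination pointer plus its pitch
    p.extent = extent;                   // region to copy; width is in bytes here
    p.kind   = cudaMemcpyDeviceToDevice; // both allocations are on the device
    return cudaMemcpy3D(&p);
}

int main()
{
    cudaError_t err = copy_pitched(256, 256);
    printf("%s\n", cudaGetErrorString(err));
    return 0;
}
```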
So what follows is the minimum number of changes I could make to get it to compile and not throw a runtime error.
```
# cat t251.cu
/*******************************************************/
#include <stdio.h>
cudaError_t Error1(int Nl, int Nm) {
    cudaError_t cudaStatus;
    cudaExtent extent;
    cudaPitchedPtr dev_theta1;
    cudaPitchedPtr dev_theta_xfr;
    size_t width, height, depth, pitch;
    int i, j, k1;
    width = Nl + 1; height = Nm + 1; depth = 1;
    extent = make_cudaExtent(width * sizeof(double), height, depth); // width is in bytes here (it is in elements only when referring to array memory); height and depth are always in elements
    cudaStatus = cudaMalloc3D(&dev_theta1, extent);
    if (cudaStatus != cudaSuccess)
        printf("cudaMalloc3D failed for dev_theta_1\n");
    else
        printf("allocation dev_theta_1 successful\n");
    cudaStatus = cudaMalloc3D(&dev_theta_xfr, extent);
    if (cudaStatus != cudaSuccess)
        printf("cudaMalloc3D failed for dev_theta_xfr\n");
    else
        printf("allocation dev_theta_xfr successful\n");
    cudaStatus = cudaMemcpy(dev_theta1.ptr, dev_theta_xfr.ptr, dev_theta1.pitch * dev_theta1.ysize, cudaMemcpyDeviceToDevice);
    if (cudaStatus != cudaSuccess) {
        printf("For memcpy from dev_theta_xfr to dev_theta1, cudaMemcpy failed!\n");
    }
    else
        printf("memory copy successful\n");
    return cudaStatus;
}
/*******************************************************/
int main(){
    Error1(256, 256);
    return 0;
}
# nvcc -o t251 t251.cu
t251.cu(10): warning #177-D: variable "pitch" was declared but never referenced
size_t width, height, depth, pitch;
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
t251.cu(11): warning #177-D: variable "i" was declared but never referenced
int i, j, k1;
^
t251.cu(11): warning #177-D: variable "j" was declared but never referenced
int i, j, k1;
^
t251.cu(11): warning #177-D: variable "k1" was declared but never referenced
int i, j, k1;
^
# compute-sanitizer ./t251
========= COMPUTE-SANITIZER
allocation dev_theta_1 successful
allocation dev_theta_xfr successful
memory copy successful
========= ERROR SUMMARY: 0 errors
#
```