Error when attempting to use cudaMemcpy()

I am having difficulty using cudaMemcpy(). Below is an abbreviated sample. Essentially, I would be calling the function from main() as Error1(3, 3). The code below:

  1. Initializes a one-dimensional array dev_theta1 (ok)
  2. Initializes a one-dimensional array dev_theta_xfr (ok)
  3. Copies from dev_theta_xfr to dev_theta1 (copy error)

Ultimately I want to pass dev_theta1 to my kernel. I realize dev_theta_xfr only has junk in it; this is just a shortened sample for posting.

/*******************************************************/
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

cudaError_t Error1(int Nl, int Nm) {

cudaError_t cudaStatus;
cudaExtent extent;
cudaPitchedPtr dev_theta1;
cudaPitchedPtr dev_theta_xfr;
size_t width, height, depth, pitch;
int i, j, k1;

width = Nl + 1;  height = Nm + 1;  depth = 1;
extent = make_cudaExtent(width * sizeof(double), height, depth);  //width in elements when referring to array memory.  height, depth always in elements 
cudaStatus = cudaMalloc3D(&dev_theta1, extent);
if (cudaStatus != cudaSuccess)
    printf("cudaMalloc3D failed for dev_theta_1\n");
else
    printf("allocation dev_theta_1 successful\n");


cudaStatus = cudaMalloc3D(&dev_theta_xfr, extent);
if (cudaStatus != cudaSuccess)
    printf("cudaMalloc3D failed for dev_theta_xfr\n");
else
    printf("allocation dev_theta_xfr successful\n");


cudaStatus = cudaMemcpy(&(dev_theta1).ptr, &dev_theta_xfr, dev_theta1.xsize * dev_theta1.ysize, cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
    printf("For memcpy from dev_theta_xfr to dev_theta1, cudaMemcpy failed!\n");
}
else
    printf("memory copy succesful\n");

return cudaStatus;

}
/*******************************************************/


  1. The ampersands in your cudaMemcpy call are almost certainly wrong:

    cudaMemcpy(&(dev_theta1).ptr, &dev_theta_xfr, …
               ^                  ^
    We take the address of a pointer like that in cudaMalloc because we want to modify the pointer value. But in cudaMemcpy we are using the pointer value directly, no need for “address-of” operator.

    Please study a CUDA sample code like vectorAdd to understand the basics of CUDA program development.

  2. cudaMalloc3D creates a “pitched” pointer/allocation. You’ll need to become familiar with what that is. It’s very unusual to use cudaMemcpy to copy to or from a pitched allocation. The usual copy operation would be cudaMemcpy2D for a pitched allocation copy. Based on what you have shown so far, unless you know for certain that you need a pitched allocation, I would not use cudaMalloc3D nor cudaMallocPitch, but instead create an ordinary contiguous allocation with cudaMalloc, like what you would find in the aforementioned vectorAdd sample code.

The variable `dev_theta1` was declared as

cudaPitchedPtr dev_theta1;

so it is not a pointer. That is why I used the ampersand. This is one reason I was hesitant to post to the forum. I am not lazy; I have studied a lot of the material and have seen the vectorAdd() example dozens of times.

Are you sure I should be removing the & symbol? I will do so, but I don’t think Visual Studio will compile the code. Thanks.

P.S. I checked: Visual Studio will not compile the code with the & removed.

I will try to read up on a pitched allocation and also using cudaMalloc instead of cudaMalloc3D. Thanks.

I’m not suggesting that the only issue is the presence of the ampersands, or that removing them fixes everything. You stated that your arrays are one-dimensional:

I certainly would not use cudaMalloc3D for that, I certainly would not use pitched pointers or allocations for that, and if you choose to do so anyway, transferring data via cudaMemcpy would also be another head-scratcher for me.

For 1D problems, vectorAdd represents a canonical code design method.

I can certainly massage what you have shown into something that “works”, but before I go any further, let me recommend quite clearly to you and future readers that I do not recommend this approach.

  1. cudaMalloc3D, like cudaMallocPitch, creates a pitched allocation. A pitched allocation inherently has a notion of a 2 or 3-dimensional view of data or the problem, and for partly historical reasons, will “pad” each row of data that you request with possibly additional bytes of “space”, in each row, that are not intended to be used, but are present in the allocation. This complicates access to the data.
  2. cudaPitchedPtr is a structure definition, not a pointer directly/by itself. The structure consists of 4 items: an actual pointer (to the allocated space), the pitch value assigned at the point of allocation by the CUDA runtime, and the xsize and ysize which are basically extracted from the extent passed to the allocation function (cudaMalloc3D). The pitched pointer, i.e. the .ptr element of the structure, is the thing you want to use in any cudaMemcpyXXX variant that you may choose, to reference the allocated space/data. No other form or treatment of the pointer is sensible.
  3. When we copy pitched allocations from one place to another, it’s important to keep in mind that the pitch value (a number of bytes that makes up the row) is assigned by the allocation function (you don’t get to pick it) and it is not necessarily equal to the number of elements in a row of your data times the bytes per element. It could be any number of bytes equal to or larger than that row-width calculation. I don’t want to go into why. The reasons are mostly historical, although it still finds application today in texturing. A key takeaway is the byte-width called the pitch can be any number.
  4. cudaMemcpy, by itself (i.e. not cudaMemcpy2D, not cudaMemcpy3D, etc) copies contiguous data only/always. So if I wanted to use cudaMemcpy with a pitched allocation, and I wanted to copy more than one row, I’m going to run into at least 2 issues: A. I will inevitably copy bytes that I had no use for. B. Expressing the size of the copy operation would not take the form of number of elements times element size, like you would expect for contiguous data. In fact, legally, the size of the operation cannot be expressed purely based on element size, because there is no guarantee that the actual row width (called the pitch) is a (whole-number) multiple of the element size, in bytes.
  5. To dispel a common notion, cudaMalloc3D does not create a pointer that is readily able to be multiply-subscripted or multiply-dereferenced. It does not directly create an allocation that can be referenced, e.g., as A[z][y][x]. You only get to do A[x], and in fact even that does not work; you must use the pitched variant that I previously linked to.
  6. cudaMalloc3D is a device allocator. That means it allocates space in device (GPU) memory. It is not a host allocator. CUDA doesn’t provide any allocators that behave like cudaMalloc3D, but for the host.
    I’ve never seen an example where anyone ever wanted to use a pitched allocation in host memory. Therefore your attempt to copy from one device allocation to another device allocation using the cudaMemcpyHostToDevice token is not sensible.

For 1D problems, and indeed all problems nowadays, the above complexity is not usually necessary, and is not something I would recommend in the general case. It certainly makes no sense to me whatsoever if someone is describing their data as 1D.

Can the code be made to work? Probably. Because of item 6 above, it’s simply not possible to do a host-to-device transfer using two device-side allocations, so I can’t make that work and will need to change your example slightly in that regard as well. Furthermore, there is no guarantee that I know of relating the pitch values of two separate allocations. We could assume that if both allocations have the same width then they will have the same pitch, but I don’t know of anywhere that guarantee is provided. Before we start to say “Gee, that’s strange,” let’s back up and remember that you’re not supposed to be using cudaMemcpy with pitched allocations. If we used cudaMemcpy2D (or 3D), it can handle dissimilar pitches between source and destination.

So what follows is the minimum number of changes I could make to get it to compile and not throw a runtime error.

# cat t251.cu
/*******************************************************/
#include <stdio.h>

cudaError_t Error1(int Nl, int Nm) {

  cudaError_t cudaStatus;
  cudaExtent extent;
  cudaPitchedPtr dev_theta1;
  cudaPitchedPtr dev_theta_xfr;
  size_t width, height, depth, pitch;
  int i, j, k1;

  width = Nl + 1;  height = Nm + 1;  depth = 1;
  extent = make_cudaExtent(width * sizeof(double), height, depth);  //width in elements when referring to array memory.  height, depth always in elements
  cudaStatus = cudaMalloc3D(&dev_theta1, extent);
  if (cudaStatus != cudaSuccess)
    printf("cudaMalloc3D failed for dev_theta_1\n");
  else
    printf("allocation dev_theta_1 successful\n");


  cudaStatus = cudaMalloc3D(&dev_theta_xfr, extent);
  if (cudaStatus != cudaSuccess)
    printf("cudaMalloc3D failed for dev_theta_xfr\n");
  else
    printf("allocation dev_theta_xfr successful\n");


  cudaStatus = cudaMemcpy(dev_theta1.ptr, dev_theta_xfr.ptr, dev_theta1.pitch * dev_theta1.ysize, cudaMemcpyDeviceToDevice);
  if (cudaStatus != cudaSuccess) {
    printf("For memcpy from dev_theta_xfr to dev_theta1, cudaMemcpy failed!\n");
  }
  else
    printf("memory copy succesful\n");

  return cudaStatus;
}
/*******************************************************/

int main(){

  Error1(256, 256);
  return 0;
}
# nvcc -o t251 t251.cu
t251.cu(10): warning #177-D: variable "pitch" was declared but never referenced
    size_t width, height, depth, pitch;
                                 ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

t251.cu(11): warning #177-D: variable "i" was declared but never referenced
    int i, j, k1;
        ^

t251.cu(11): warning #177-D: variable "j" was declared but never referenced
    int i, j, k1;
           ^

t251.cu(11): warning #177-D: variable "k1" was declared but never referenced
    int i, j, k1;
              ^

# compute-sanitizer ./t251
========= COMPUTE-SANITIZER
allocation dev_theta_1 successful
allocation dev_theta_xfr successful
memory copy succesful
========= ERROR SUMMARY: 0 errors
#

This would be closer to what I would expect for typical usage of pitched allocations:

# cat t252.cu
/*******************************************************/
#include <stdio.h>
template <typename T>
__global__ void k(cudaPitchedPtr d){
  int width = d.xsize/sizeof(T);
  for (int i = threadIdx.x+blockDim.x*blockIdx.x; i < width*d.ysize; i += gridDim.x*blockDim.x){
    T test_val = i%3;
    int my_row = i/width;
    int my_col = i - my_row*width; // modulo
    T my_val = ((T *)(((unsigned char *)d.ptr) + d.pitch*my_row))[my_col];
    if (test_val != my_val) {printf("oops i:%d, test_val:%f, val:%f, my_row:%d, my_col:%d \n", i, test_val, my_val, my_row, my_col); return;}
  }
}
// assumption is that d points to an array of type T, with dimensions Nl (width), Nm (height)
template <typename T>
cudaPitchedPtr Error1(int Nl, int Nm, T *d) {

  cudaError_t cudaStatus;
  cudaExtent extent;
  cudaPitchedPtr dev_theta1;
  size_t width, height, depth;

  width = Nl;  height = Nm;  depth = 1;
  extent = make_cudaExtent(width * sizeof(T), height, depth);  //width in elements when referring to array memory.  height, depth always in elements
  cudaStatus = cudaMalloc3D(&dev_theta1, extent);
  if (cudaStatus != cudaSuccess)
    printf("cudaMalloc3D failed for dev_theta_1\n");
  else
    printf("allocation dev_theta_1 successful\n");

  cudaStatus = cudaMemcpy2D(dev_theta1.ptr, dev_theta1.pitch, d, width*sizeof(T),  width * sizeof(T), dev_theta1.ysize, cudaMemcpyHostToDevice);
  if (cudaStatus != cudaSuccess) {
    printf("cudaMemcpy2D from d to dev_theta1 failed!\n");
  }
  else
    printf("memory copy succesful\n");

  return dev_theta1;
}
/*******************************************************/

int main(){
  const int width = 256;
  const int height = 512;
  double *d = new double[width*height];
  for (int i = 0; i < width*height; i++) d[i] = i%3;
  cudaPitchedPtr p = Error1(width, height, d);
  k<double><<<1,1>>>(p);
  cudaDeviceSynchronize();
  return 0;
}
# nvcc -o t252 t252.cu
# compute-sanitizer ./t252
========= COMPUTE-SANITIZER
allocation dev_theta_1 successful
memory copy succesful
========= ERROR SUMMARY: 0 errors
#