cudaMallocPitch and cudaMemcpy2D

Hi,
I’m currently trying to pass a 2D array to CUDA with cudaMallocPitch and cudaMemcpy2D. The only value I get back looks like a pointer, and I don’t understand why.

This is an example of my code:

double** busdata;
double** linedata;
int i_bus=4;
int j_bus=10;
int i_line=4;
int j_line=6;
// Initialize the host matrices (one separate allocation per row)
try{
	busdata=new double*[i_bus];
	for(int i=0; i<i_bus;i++){
		busdata[i]=new double[j_bus];
	}
	linedata=new double*[i_line];
	for(int i=0;i<i_line;i++){
		linedata[i]=new double[j_line];
	}
}catch(std::bad_alloc& ba){
	std::cerr << "bad_alloc caught: " << ba.what() << '\n';
}

	size_t bus_pitch;
	size_t line_pitch;
	double* device_busdata;
	double* device_linedata;
	// Allocate the pitched device matrices
	cudaStatus =cudaMallocPitch(&device_linedata,&line_pitch,j_line*sizeof(double),i_line);
	if(cudaStatus!=cudaSuccess){
		fprintf(stderr,"cudaMalloc failed.");
		goto Error;
	}
	cudaStatus =cudaMallocPitch(&device_busdata,&bus_pitch,j_bus*sizeof(double),i_bus);
	if(cudaStatus!=cudaSuccess){
		fprintf(stderr,"cudaMalloc failed.");
		goto Error;
	}
	//Copy of the data from the CPU to the GPU
	cudaStatus=cudaMemcpy2D(device_busdata,bus_pitch,busdata,j_bus*sizeof(double),j_bus*sizeof(double),i_bus,cudaMemcpyHostToDevice);
	if(cudaStatus!=cudaSuccess){
		fprintf(stderr,"cudaMemcpy failed.");
		goto Error;
	}
	
	cudaStatus=cudaMemcpy2D(device_linedata,line_pitch,linedata,j_line*sizeof(double),j_line*sizeof(double),i_line,cudaMemcpyHostToDevice);
	if(cudaStatus!=cudaSuccess){
		fprintf(stderr,"cudaMemcpy failed.");
		goto Error;
	}

That’s how I allocate the device memory and send the data to the GPU. But when I try to use it with this kernel:

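__global__ void ComputeN(double* busdata, size_t buspitch, int* npq, double* Pq){
	// NOTE: assumes the launch uses exactly one thread per row
	int tid= blockDim.x*blockIdx.x+threadIdx.x;
	// Each row starts buspitch bytes after the previous one
	double* row= (double*) ((char*) busdata+tid*buspitch);
	double type=round(row[1]);
	if(type==3){
		// atomicAdd returns the old counter value: a unique slot for this thread
		int slot=atomicAdd(npq,1);
		Pq[slot]=type;
	}
}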

I always get bad results like 3.83797400846176e-270. I suppose that is a memory address and not the data I copied.

I know there are a lot of threads on this topic, like this one: https://devtalk.nvidia.com/default/topic/521887/cudamallocpitch-cudamemcpy2d-want-to-check-if-the-copy-of-2d-data-between-host-and-dev-is-work/
But I don’t even know why my values don’t reach the GPU…
I’m using a GTX 860M with CUDA 7.0.

Thank you in advance

Despite the naming, cudaMemcpy2D is not designed to handle double-pointer (**) types (take a look at the documentation for it - none of the parameters are double-pointer types). Therefore this line is not valid:

cudaStatus=cudaMemcpy2D(device_busdata,bus_pitch,busdata,j_bus*sizeof(double),j_bus*sizeof(double),i_bus,cudaMemcpyHostToDevice);

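You cannot pass busdata as the 3rd parameter. The function prototype expects a single pointer (const void* src) to the start of the data there. A double** converts to that type implicitly, so the compiler accepts the call, but the copy then reads the table of row pointers (and whatever memory follows it) instead of the doubles.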

None of the cudaMemcpy-type operations, including cudaMemcpy2D, know how to traverse a double-pointer-allocated array such as the one you have created:

busdata=new double*[i_bus];
for(int i=0; i<i_bus;i++){
	busdata[i]=new double[j_bus];
}
linedata=new double*[i_line];
for(int i=0;i<i_line;i++){
	linedata[i]=new double[j_line];
}

One reason for this is that the above allocation method can create discontiguous allocations for the “lines”. That is, linedata[i] and linedata[i+1] may not point to adjacent regions in memory. There is no cudaMemcpy operation that can handle this discontinuity if it is random or unpredictable, as is the case here.

Instead you need to flatten your array (busdata) into a single contiguous allocation that can be referenced by a single pointer (*).
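
To make the single-pointer requirement concrete, here is a minimal host-only sketch of the addressing that a pitched 2D copy performs. It is a model under assumptions, not the CUDA implementation, and copy2d_model is a hypothetical name. Every source row is located by arithmetic on one base pointer plus a pitch, so there is no step at which a table of row pointers could be followed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Host-only model of the addressing a pitched 2D copy uses: row r is read
// from src + r * spitch and written to dst + r * dpitch, `width` bytes each.
// (copy2d_model is a hypothetical name, not a CUDA API.)
void copy2d_model(void* dst, std::size_t dpitch,
                  const void* src, std::size_t spitch,
                  std::size_t width, std::size_t height) {
    assert(width <= dpitch && width <= spitch);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t r = 0; r < height; ++r) {
        std::memcpy(d + r * dpitch, s + r * spitch, width);
    }
}
```

With a flat source buffer, src + r * spitch always lands on the start of row r. With a double** source, the same arithmetic walks over pointer values instead.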

So if I’m flattening the array like this:

for(int i=0;i<i_line;i++){
		j=0;
		while(j<j_line){
			dataline[j]=linedata[i][j];
			j++;
		}
	}
	double* databus= new(double[i_bus*j_bus]);
	j=0;
	for(int i=0;i<i_bus;i++){
		j=0;
		while(j<j_bus){
			databus[i]=busdata[i][j];
			j++;
		}
	}

Is it supposed to work when I pass databus to cudaMemcpy2D?

I haven’t validated every aspect of your code. You haven’t provided a compilable example that I can copy, paste, compile, and run without having to add or change anything to test it. So I’m not going to say “your code works”. But I can say that your previous method was flawed. Flattening (in some fashion) is necessary. Your most recent posting appears to flatten correctly. However, I may have missed something, and there may be any number of other issues with your code.

You can find CUDA sample codes that demonstrate complete and proper usage of cudaMemcpy2D:

http://docs.nvidia.com/cuda/cuda-samples/index.html#abstract

And there are a great many questions and answers, both here and on Stack Overflow, which demonstrate any number of approaches to “flattening” as well as correct usage of cudaMemcpy2D.

Since this is a pet peeve of mine: cudaMemcpy2D() is appropriately named in that it deals with 2D arrays. The issue is with host code that tries to pass off a collection of non-contiguous row vectors (or column vectors) as a 2D array. Not the same thing.

I am not sure who popularized this storage organization, but I consider it harmful to any code that wants to deal with matrices efficiently. Just consider the common scenario of having to operate on a sub-matrix of a larger matrix. So, if at all possible, use contiguous storage (possibly with row or column padding) for 2D matrices in both host and device code.
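
As an illustration of the sub-matrix scenario, here is a small sketch assuming row-major contiguous storage. MatrixView and submatrix are hypothetical names introduced only for this example. A sub-matrix is just a different base pointer with the parent's row stride, and no data is copied.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical view type: a base pointer plus a row stride (in elements).
struct MatrixView {
    double* data;
    std::size_t rows, cols;
    std::size_t stride;  // elements between the starts of consecutive rows
    double& at(std::size_t r, std::size_t c) const {
        assert(r < rows && c < cols);
        return data[r * stride + c];
    }
};

// A sub-matrix shares the parent's storage; only the base pointer and the
// extents change, the stride stays that of the parent.
MatrixView submatrix(const MatrixView& m, std::size_t r0, std::size_t c0,
                     std::size_t rows, std::size_t cols) {
    assert(r0 + rows <= m.rows && c0 + cols <= m.cols);
    return MatrixView{m.data + r0 * m.stride + c0, rows, cols, m.stride};
}
```

The same pair (base pointer, stride * sizeof(double) as pitch) is what a pitched copy such as cudaMemcpy2D takes as its source description. With collection-of-vectors storage there is no single stride to hand over.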

If for some reason you must use the collection-of-vectors storage scheme on the host, you will need to copy each individual vector with a separate cudaMemcpy*().
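
A rough host-only sketch of that per-row approach follows. std::memcpy stands in for the per-row cudaMemcpy so the example runs without a GPU, and copy_rows_to_pitched is a hypothetical name. The point is only the addressing: each row has its own source pointer, while the destination row address is computed from the pitch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copy `height` separately allocated rows into one pitched buffer.
// std::memcpy stands in for a per-row cudaMemcpy so this runs without a GPU.
void copy_rows_to_pitched(void* dst, std::size_t dpitch,
                          const double* const* rows,
                          std::size_t cols, std::size_t height) {
    const std::size_t width = cols * sizeof(double);
    assert(width <= dpitch);
    auto* d = static_cast<unsigned char*>(dst);
    for (std::size_t r = 0; r < height; ++r) {
        // One transfer per row: rows[r] is the only valid source address
        std::memcpy(d + r * dpitch, rows[r], width);
    }
}
```

On a real device this means one transfer call per row, which is one more reason to prefer contiguous host storage where possible.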

I didn’t say cudaMemcpy2D is inappropriately named. I said “despite the naming”. The simple fact is that many folks conflate a 2D array with a storage format that is doubly-subscripted, and also, in C, with something that is referenced via a double pointer. If the naming leads you to believe that cudaMemcpy2D is designed to handle a doubly-subscripted or a double-pointer referenceable object in C, you frequently will be mistaken*.

Of course it’s evident from the function prototype in the documentation that that is not the case, but nevertheless this type of question is prevalent.

* The exception I am aware of is multiply-subscripted non-variable-dimension (i.e. dimension known at compile time) arrays. In fact, one of the dimensions can be variable, as long as the width dimension is known either via a constant or a constant in a typedef. In that case, the compiler knows the correct indexing rules and will generate them, even if a doubly-subscripted statically or dynamically allocated pointer is passed to cudaMemcpy. This is, without a doubt, a complicated topic. Nevertheless, OP’s code demonstrates improper usage. And none of this discussion contravenes the idea that cudaMemcpy operations require either contiguous allocations or allocations with “regular” discontiguities.
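
A small sketch of that exception, assuming a width of 10 fixed via a typedef (kCols, Row, alloc_rows and rows_are_adjacent are hypothetical names for illustration). The row count is a runtime value, yet the rows sit in one contiguous block, so doubly-subscripted access and a single base pointer describe the same memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCols = 10;  // width fixed at compile time (assumed value)
typedef double Row[kCols];         // the "constant in a typedef" case

// All rows come from a single allocation; `rows` may be a runtime value.
Row* alloc_rows(std::size_t rows) {
    return new Row[rows]();
}

// True when every row starts exactly sizeof(Row) bytes after the previous one.
bool rows_are_adjacent(Row* m, std::size_t rows) {
    assert(rows > 0);
    for (std::size_t r = 0; r + 1 < rows; ++r) {
        auto a = reinterpret_cast<std::uintptr_t>(&m[r][0]);
        auto b = reinterpret_cast<std::uintptr_t>(&m[r + 1][0]);
        if (b - a != sizeof(Row)) return false;
    }
    return true;
}
```

Here m[i][j] still works as a doubly-subscripted expression, but m is a pointer to rows of known width, not a double**.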

I did not mean to imply that you consider cudaMemcpy2D inappropriately named. I am merely saying that anybody who thinks “2D” in the name of this function implies collection-of-vectors storage is wide of the mark, and through no fault of the engineer who decided on the name of this API call (no, it wasn’t me :-)

Maybe someone can pinpoint the (text)book that led to a conflation of 2D arrays and collections of vectors in programmers’ minds.

The attempted use of cudaMemcpy2D with collection-of-vectors storage leads to questions with such frequency that I think it would warrant the addition of a sticky post to this forum. I have answered variants of that question at least half a dozen times, probably more than that.

Thank you guys. The problem was just the way I was flattening my array, which was wrong.
Here is better code:

int j;
	int flat_index=0;
	for(int i=0;i<i_line;i++){
		j=0;
		while(j<j_line){
			dataline[flat_index]=linedata[i][j];
			j++;
			flat_index++;
		}
	}
	flat_index=0;
	double* databus= new(double[i_bus*j_bus]);
	for(int i=0;i<i_bus;i++){
		j=0;
		while(j<j_bus){
			databus[flat_index]=busdata[i][j];
			j++;
			flat_index++;
		}
	}
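
For reference, the two loops above can be folded into one helper. This is only a sketch under assumptions: flatten is a hypothetical name, and it assumes every row holds exactly cols elements. It uses std::vector so the flat buffer is released automatically.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Pack `rows` separately allocated rows of `cols` doubles into one
// contiguous row-major buffer. (flatten is a hypothetical helper name.)
std::vector<double> flatten(const double* const* src,
                            std::size_t rows, std::size_t cols) {
    assert(src != nullptr || rows == 0);
    std::vector<double> flat(rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        // Row i occupies flat[i*cols] .. flat[i*cols + cols - 1]
        std::copy(src[i], src[i] + cols, flat.begin() + i * cols);
    }
    return flat;
}
```

Something like flatten(busdata, i_bus, j_bus) would then give a buffer whose .data() pointer can serve as the source of the 2D copy, with j_bus*sizeof(double) as the source pitch.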