I am trying to build a kernel that uses a 2D array to implement a function. I have one constraint: the array must be accessed in a[i][j] form in the kernel for the sake of code reusability.
Here is what I tried, as a simplified demonstration:
__global__ void mykernel(int **array1, int **array2, int row, int col) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < row && j < col)
        array2[i][j] = array1[i][j] + 1;
}
Main code parts:
int row, col; // assign some value
int **array1, **array2;
cudaMalloc((void**)&array1, row * sizeof(int *));
cudaMalloc((void**)&array2, row * sizeof(int *));
for (int k = 0; k < col; k++) {
    cudaMalloc((void**)&array1[k], col * sizeof(int));
    cudaMalloc((void**)&array2[k], col * sizeof(int));
}
cudaMemcpy2D(array1, col * sizeof(int), array_host, col * sizeof(int), col * sizeof(int), row, cudaMemcpyHostToDevice);
mykernel<<<N, N>>>(array1, array2, row, col);
cudaMemcpy2D(array_host2, col * sizeof(int), array2, col * sizeof(int), col * sizeof(int), row, cudaMemcpyDeviceToHost);
This code does not crash, but it does not work either. Clearly the kernel does not have a valid array, and there is no way to print its contents…
Despite the name, usage of cudaMemcpy2D is not correct here. cudaMemcpy2D is primarily intended for pitched or strided allocations. It does not handle double pointers (**) nor does it facilitate doubly-subscripted arrays.
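For comparison, here is a minimal sketch of the kind of case cudaMemcpy2D is designed for: a single pitched allocation obtained from cudaMallocPitch, copied to and from a contiguous host matrix. The names incrementPitched and incrementOnDevice are hypothetical, chosen only for this illustration, and the 16x16 block size is an arbitrary assumption.

```cuda
#include <cuda_runtime.h>

// Each thread increments one element of a pitched 2D allocation.
// `pitch` is the row stride in bytes, as reported by cudaMallocPitch.
__global__ void incrementPitched(int *data, size_t pitch, int rows, int cols) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (i < rows && j < cols) {
        // Step to the start of row i in bytes, then index within the row.
        int *rowPtr = (int *)((char *)data + i * pitch);
        rowPtr[j] += 1;
    }
}

// Hypothetical helper: `host` is a contiguous, row-major rows x cols matrix.
// Copies it to a pitched device allocation, increments every element,
// and copies the result back. Returns the first error encountered.
cudaError_t incrementOnDevice(int *host, int rows, int cols) {
    int *dev = nullptr;
    size_t pitch = 0;
    size_t hostPitch = cols * sizeof(int);  // host rows are tightly packed
    cudaError_t err = cudaMallocPitch((void **)&dev, &pitch, hostPitch, rows);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy2D(dev, pitch, host, hostPitch, hostPitch, rows,
                       cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(16, 16);
        dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
        incrementPitched<<<grid, block>>>(dev, pitch, rows, cols);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        err = cudaMemcpy2D(host, hostPitch, dev, pitch, hostPitch, rows,
                           cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    return err;
}
```

Note that the device side is one allocation with a row stride, not an array of row pointers, so the kernel cannot use data[i][j] on it directly.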
There are a number of ways to refactor your code to make this easier. If you don’t want to do that (i.e. if you want to retain doubly-subscripted access in the kernel), then you need a “deep-copy” operation, i.e. one that copies data involving pointers to pointers, or pointers to data structures that themselves contain pointers.
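A deep copy of this kind might be sketched roughly as follows. The key point is that the device row pointers are first collected in a host-side table, and that table is then copied to the device, so the host never dereferences a device pointer. The helper names makeDeviceTable and addOneDeepCopy are hypothetical, error checking inside the helper is omitted for brevity, and the host data is assumed to be an array of separately allocated rows.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Kernel keeps the doubly-subscripted a[i][j] access.
__global__ void addOne(int **in, int **out, int rows, int cols) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (i < rows && j < cols) out[i][j] = in[i][j] + 1;
}

// Hypothetical helper: allocates one device buffer per row, fills it from
// hostRows[r] when hostRows is non-null, and returns a device-resident table
// of those row pointers. `hostTable` keeps the same device pointers in host
// memory so the rows can later be copied back and freed.
// Error checking is omitted here for brevity.
static int **makeDeviceTable(int **hostRows, int rows, int cols,
                             std::vector<int *> &hostTable) {
    hostTable.assign(rows, nullptr);
    for (int r = 0; r < rows; ++r) {
        cudaMalloc((void **)&hostTable[r], cols * sizeof(int));
        if (hostRows) {
            cudaMemcpy(hostTable[r], hostRows[r], cols * sizeof(int),
                       cudaMemcpyHostToDevice);
        }
    }
    int **devTable = nullptr;
    cudaMalloc((void **)&devTable, rows * sizeof(int *));
    cudaMemcpy(devTable, hostTable.data(), rows * sizeof(int *),
               cudaMemcpyHostToDevice);
    return devTable;
}

// Hypothetical driver: hostIn and hostOut are arrays of `rows` row pointers.
cudaError_t addOneDeepCopy(int **hostIn, int **hostOut, int rows, int cols) {
    std::vector<int *> inRows, outRows;
    int **dIn = makeDeviceTable(hostIn, rows, cols, inRows);
    int **dOut = makeDeviceTable(nullptr, rows, cols, outRows);

    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    addOne<<<grid, block>>>(dIn, dOut, rows, cols);
    cudaError_t err = cudaGetLastError();

    // Copy each result row back individually, then release everything.
    for (int r = 0; r < rows; ++r) {
        cudaError_t e = cudaMemcpy(hostOut[r], outRows[r], cols * sizeof(int),
                                   cudaMemcpyDeviceToHost);
        if (err == cudaSuccess) err = e;
        cudaFree(inRows[r]);
        cudaFree(outRows[r]);
    }
    cudaFree(dIn);
    cudaFree(dOut);
    return err;
}
```

Each row costs a separate allocation and a separate copy in each direction, which is the price of keeping the array-of-pointers layout.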
Thank you for the reply; it solved the issue. I had actually searched for 2D arrays in CUDA, but many of the replies I found did not work at all… The problem is that the compiler does not give any warning or error in most cases. The code simply does not work…
What I wonder is whether this is about the GPU architecture. Why can’t 2D/3D arrays be used as straightforwardly as on CPUs?
This is all just regular C++, and there is no difference in the way arrays are handled on the CPU and the GPU. Your data structure in this case is an array of separately allocated row (or column) vectors. Just like you cannot copy such a data structure with a simple memcpy() on a CPU, you cannot copy such a data structure with cudaMemcpy2D() between host and device, because cudaMemcpy2D assumes a single contiguous data object, not a two-tiered collection of data objects.
I am not sure who or what (book) popularized storing 2D matrices as collections of row/column vectors, but it is very poor practice as far as I am concerned. The non-contiguous storage is often harmful to performance. In addition, useful abstractions, such as accessing arbitrary sub-matrices of a larger matrix, are complicated unnecessarily. Avoid. This advice also applies to code that never runs on a GPU, i.e. host-only code.
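To illustrate the contiguous alternative, here is a minimal sketch that stores the matrix row-major in one allocation and addresses element (i, j) as base[i * ld + j], where ld is the row stride in elements. With that convention a sub-matrix is just an offset base pointer with the same ld. The names addOneFlat and addOneToSubMatrix are hypothetical and exist only for this illustration.

```cuda
#include <cuda_runtime.h>

// Element (i, j) of the (sub-)matrix lives at m[i * ld + j];
// ld is the row stride of the enclosing matrix, in elements.
__global__ void addOneFlat(int *m, int rows, int cols, int ld) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (i < rows && j < cols) m[i * ld + j] += 1;
}

// Hypothetical helper: `host` is a contiguous row-major rows x cols matrix.
// Increments only the subRows x subCols block whose top-left corner is at
// (r0, c0). The whole matrix moves with a single cudaMemcpy each way.
cudaError_t addOneToSubMatrix(int *host, int rows, int cols,
                              int r0, int c0, int subRows, int subCols) {
    int *dev = nullptr;
    size_t bytes = (size_t)rows * cols * sizeof(int);
    cudaError_t err = cudaMalloc((void **)&dev, bytes);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(16, 16);
        dim3 grid((subCols + block.x - 1) / block.x,
                  (subRows + block.y - 1) / block.y);
        // The sub-matrix is the same buffer, offset to (r0, c0), with ld = cols.
        addOneFlat<<<grid, block>>>(dev + r0 * cols + c0, subRows, subCols, cols);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    return err;
}
```

Passing r0 = 0, c0 = 0, subRows = rows, and subCols = cols processes the full matrix with the same kernel.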