I am learning CUDA at the moment, and I am trying to use a kernel that modifies the elements of a 2D matrix. This is how I allocate the memory for the array on the device and copy the matrix:
int *d_A;
size_t pitch;
cudaMallocPitch((void**)&d_A, &pitch, sizeof(int)*cols, rows);
cudaMemcpy2D(d_A, pitch, A, sizeof(int)*cols, sizeof(int)*cols, rows, cudaMemcpyHostToDevice);
where cols and rows are the number of columns and rows in the matrix A.
However, when I try to modify the elements in d_A I get a segmentation fault, and if I print the elements with this kernel
__global__ void kernel(int *d_A, size_t pitch, int rows, int cols){
    // compute the row
    int r = blockIdx.y*blockDim.y + threadIdx.y;
    // compute the column
    int c = blockIdx.x*blockDim.x + threadIdx.x;
    if((r < rows) && (c < cols)){
        // update the pointer to point to the beginning of the row
        int *Row = (int*)((char*)d_A + r*pitch);
        int elem = Row[c];
        printf("%d ", elem);
    }
}
I do not get the values stored in the original matrix.
Thanks in advance for any help you can give me.
You would probably need to provide complete code, along with the compile command you are using and the device you are running on. Also, you don't appear to be doing any proper CUDA error checking. You should add that to your code any time you are having trouble with CUDA code. There is nothing wrong with the code you've shown; I built a sample app out of it, and here is a fully worked example:
If you provide a fully worked example like that, showing what you are doing, I'm sure someone can help.
Don't forget to add proper CUDA error checking (google: "proper cuda error checking").
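For reference, one common form of such an error-checking macro looks like this (the macro name and message format here are just one convention, not something from this thread; adapt to taste):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Check the most recent CUDA error and abort with a message if one occurred.
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                    msg, cudaGetErrorString(__err), __FILE__, __LINE__); \
            exit(1); \
        } \
    } while (0)

// Usage: call it after each API call or kernel launch, e.g.
//   cudaMallocPitch((void**)&d_A, &pitch, sizeof(int)*cols, rows);
//   cudaCheckErrors("cudaMallocPitch failed");
//   kernel<<<grid, block>>>(d_A, pitch, rows, cols);
//   cudaCheckErrors("kernel launch failed");
```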
#include <cstdio>
#include <iostream>
using namespace std;

#define BLOCK_WIDTH 16

__global__ void kernel(int *d_A, size_t pitch, int rows, int cols){
    // compute the row
    int r = blockIdx.y*blockDim.y + threadIdx.y;
    // compute the column
    int c = blockIdx.x*blockDim.x + threadIdx.x;
    if((r < rows) && (c < cols)){
        // update the pointer to point to the beginning of the row
        //int *Row = (int*)((char*)d_A + r*pitch);
        int *Row = (int*)((char*)d_A);
        int elem = Row[c];
        printf("%d ", elem);
    }
}
void test(int **A, int rows, int cols){
    int *d_A;
    size_t pitch;
    cudaMallocPitch((void**)&d_A, &pitch, sizeof(int)*cols, rows);
    cudaMemcpy2D(d_A, pitch, A, sizeof(int)*cols, sizeof(int)*cols, rows, cudaMemcpyHostToDevice);
    // Define grid and block size
    int Yblocks = rows / BLOCK_WIDTH;
    if(rows % BLOCK_WIDTH) Yblocks++;
    int Xblocks = cols / BLOCK_WIDTH;
    if(cols % BLOCK_WIDTH) Xblocks++;
    // cout << Yblocks << "," << Xblocks << endl;
    dim3 dimGrid(Yblocks, Xblocks, 1);
    dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH, 1);
    // Run kernel
    kernel<<<dimGrid, dimBlock>>>(d_A, pitch, rows, cols);
    cudaMemcpy2D(A, sizeof(int)*cols, d_A, pitch, sizeof(int)*cols, rows, cudaMemcpyDeviceToHost);
    cudaFree(&d_A);
}
int main(){
    int rows = 2;
    int cols = 2;
    int **A;
    A = new int*[rows];
    for(int i = 0; i < rows; ++i){
        A[i] = new int[cols];
        for(int j = 0; j < cols; ++j)
            A[i][j] = i+2;
    }
    test(A, rows, cols);
    for(int i = 0; i < rows; ++i){
        for(int j = 0; j < cols; ++j)
            cout << A[i][j] << " ";
        cout << "\n";
    }
    for(int i = 0; i < rows; ++i) delete[] A[i];
    delete[] A;
    return 0;
}
I compile simply with "nvcc name.cu -o name" and the device is a GTX 550 Ti (compute capability 2.1). I wanted to take a 2D C array as input, instead of a 1D array. I don't know if that is the problem, but when I omit the kernel, cudaMemcpy2D seems to copy A to d_A and back correctly.
Yup, that's right. I now ran cuda-memcheck as you show in your example and obtained:
========= CUDA-MEMCHECK
33661504 33661504 0 0 2 2
3 3
========= Program hit cudaErrorInvalidDevicePointer (error 17) due to "invalid device pointer" on CUDA API call to cudaFree.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2ef613]
========= Host Frame:./malloctest [0x3a2c6]
========= Host Frame:./malloctest [0x2868]
========= Host Frame:./malloctest [0x2960]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21ec5]
========= Host Frame:./malloctest [0x2609]
=========
========= ERROR SUMMARY: 1 error
Well, it seems the memcheck error was due to the cudaFree(&d_A), which should be cudaFree(d_A).
But the program still prints values which are not the original A matrix values.
I guess you cannot pass an array of arrays to any cudaMalloc call; or rather, you can, but then finding where the original values are is a real problem. A plain 1D array is the only way to go for the moment.
That's correct. cudaMemcpy2D expects a single pointer (*) to a contiguous allocation. Your allocation method (new int*[rows] plus a separate new int[cols] per row) does not guarantee that the data are all contiguous, and you are passing a pointer-to-pointer (**) to cudaMemcpy2D, which won't work.
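To illustrate, here is a minimal sketch of the contiguous-allocation approach applied to the code in this thread: the host matrix is one row-major block, so the host "pitch" is simply sizeof(int)*cols, while the device side keeps the pitch returned by cudaMallocPitch. (The kernel body here increments each element rather than reproducing the original printf, just to show a round trip.)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(int *d_A, size_t pitch, int rows, int cols){
    int r = blockIdx.y*blockDim.y + threadIdx.y;
    int c = blockIdx.x*blockDim.x + threadIdx.x;
    if(r < rows && c < cols){
        // pitch is in bytes, so the row start is computed via a char*
        int *Row = (int*)((char*)d_A + r*pitch);
        Row[c] += 1;
    }
}

int main(){
    const int rows = 2, cols = 2;
    int *A = new int[rows*cols];              // one contiguous, row-major block
    for(int i = 0; i < rows*cols; ++i) A[i] = i;

    int *d_A;
    size_t pitch;
    cudaMallocPitch((void**)&d_A, &pitch, sizeof(int)*cols, rows);
    // source pitch is sizeof(int)*cols because the host rows are packed
    cudaMemcpy2D(d_A, pitch, A, sizeof(int)*cols,
                 sizeof(int)*cols, rows, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((cols + 15)/16, (rows + 15)/16);   // x covers cols, y covers rows
    kernel<<<grid, block>>>(d_A, pitch, rows, cols);

    cudaMemcpy2D(A, sizeof(int)*cols, d_A, pitch,
                 sizeof(int)*cols, rows, cudaMemcpyDeviceToHost);
    for(int r = 0; r < rows; ++r){
        for(int c = 0; c < cols; ++c) printf("%d ", A[r*cols + c]);
        printf("\n");
    }
    cudaFree(d_A);                            // d_A, not &d_A
    delete[] A;
    return 0;
}
```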
I'm not sure if this is helpful or not, but I think, at least for me, it's easier to just allocate a 1D array and do the bookkeeping on it yourself. All memory in computers is 1D anyway, and I've found dealing with CUDA in this regard to be more trouble than it's worth, but that might honestly just be because I'm bad at CUDA lol.