Problems with double pointers on CUDA

I am having trouble working with double pointers in CUDA. The following code crashes with SIGSEGV; however, if I compile with -deviceemu, the program runs normally.

```cuda
#include <stdlib.h>
#include <cuda.h>

#define N 4

__global__ void foo( int ** mat )
{
    mat[0][0]++;
}

int main()
{
    int i, j;
    int ** mat_h = NULL;
    int ** mat_d = NULL;

    // Alloc host memory
    mat_h = (int **) malloc( N * sizeof( int * ) );
    for( i = 0; i < N; i++ )
        mat_h[i] = (int *) malloc( N * sizeof( int ) );

    // Fill host matrix
    for( i = 0; i < N; i++ )
        for( j = 0; j < N; j++ )
            mat_h[i][j] = j * i;

    // Alloc device memory
    cudaMalloc( (void**) &mat_d, N * sizeof( int * ) );
    for( i = 0; i < N; i++ )
        cudaMalloc( (void**) &mat_d[i], N * sizeof( int ) );   // crashes here: mat_d points to device memory

    // Copy memory from host to device
    for( i = 0; i < N; i++ )
        cudaMemcpy( mat_d[i], mat_h[i], N * sizeof( int ), cudaMemcpyHostToDevice );

    foo<<< 1, 1 >>>( mat_d );

    return 0;
}
```


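Your problem is that you allocate an array of N pointers on the device side, and then you attempt to cudaMalloc into the addresses inside that device allocation.
cudaMalloc( (void**) &mat_d[i], ... ) makes the host write a pointer directly into device memory, and reading mat_d[i] in the cudaMemcpy loop makes the host read device memory; neither is allowed. With -deviceemu, "device" memory is ordinary host memory, which is why the emulated build runs.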
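The rule can be seen with a single int. This is a minimal sketch, assuming a plain CUDA runtime build with nvcc; roundtrip is a hypothetical helper name. The host only ever reaches the device address through cudaMemcpy, never through *d.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: round-trips one int through device memory using only
// cudaMemcpy. Returns the value read back, or -1 on a CUDA error.
static int roundtrip( int value )
{
    int * d = NULL;
    int back = -1;
    cudaError_t err = cudaMalloc( (void**) &d, sizeof( int ) );

    // *d = value;   // WRONG: host dereference of a device address (what crashes in the question)
    if( err == cudaSuccess )
        err = cudaMemcpy( d, &value, sizeof( int ), cudaMemcpyHostToDevice );
    if( err == cudaSuccess )
        err = cudaMemcpy( &back, d, sizeof( int ), cudaMemcpyDeviceToHost );

    cudaFree( d );   // cudaFree( NULL ) is a no-op if the allocation failed
    return err == cudaSuccess ? back : -1;
}

int main( void )
{
    printf( "%d\n", roundtrip( 42 ) );
    return 0;
}
```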

I have a C++ template that helps with sorting that out, but I also suspect that it may be the cause of a problem I'm having, so I can't recommend using it until I've figured out why my code segfaults.
Let me know if you want to look at it.

I'm also going to look at … to see whether those can solve similar problems; they might be of interest to you.

FYI, I think you've posted in the wrong forum. This question is not Linux-specific and belongs in the programming section.

Letharion, thanks for your response!

It seems I can no longer move this topic to the programming section; sorry for the initial mistake.

I took a look at the cudatemplate documentation, but I wanted something in C that solves my problem.

Does someone have a code example that solves this problem (using C only)?

A non-ANSI solution is to use int * mat_d[N]; instead of int ** mat_d;
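An array declared as int * mat_d[N] lives in host memory, so passing mat_d straight to the kernel would still hand it a host address. One way to make the idea work, which is an assumption about the intent rather than something the reply states, is to wrap the array in a struct and pass it by value, since kernel arguments are copied to the device at launch. RowPtrs and run_struct_demo are hypothetical names.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 4

// Hypothetical wrapper: the array of device row pointers is a struct member,
// so the whole array is copied into the kernel's parameters at launch.
typedef struct
{
    int * row[N];
} RowPtrs;

__global__ void foo( RowPtrs mat )
{
    mat.row[0][0]++;
}

// Returns mat[0][0] after the kernel runs, or -1 on a CUDA error.
static int run_struct_demo( void )
{
    RowPtrs mat_d = { { NULL } };   // host variable whose elements are device addresses
    int row_h[N];
    int result = -1;
    cudaError_t err = cudaSuccess;

    for( int i = 0; i < N && err == cudaSuccess; i++ )
    {
        for( int j = 0; j < N; j++ )
            row_h[j] = j * i;       // same values as mat_h[i][j] in the question
        // mat_d.row[i] is host memory, so cudaMalloc may write the address there.
        err = cudaMalloc( (void**) &mat_d.row[i], N * sizeof( int ) );
        if( err == cudaSuccess )
            err = cudaMemcpy( mat_d.row[i], row_h, N * sizeof( int ), cudaMemcpyHostToDevice );
    }

    if( err == cudaSuccess )
    {
        foo<<< 1, 1 >>>( mat_d );   // struct passed by value
        err = cudaGetLastError();
    }
    if( err == cudaSuccess )
        err = cudaMemcpy( &result, mat_d.row[0], sizeof( int ), cudaMemcpyDeviceToHost );

    for( int i = 0; i < N; i++ )
        cudaFree( mat_d.row[i] );
    return err == cudaSuccess ? result : -1;
}

int main( void )
{
    printf( "mat[0][0] = %d\n", run_struct_demo() );
    return 0;
}
```

Kernel arguments have a fixed size limit, so this by-value approach only suits a small, compile-time N; for larger N, copy the pointers into a device-side array instead.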

It's quite possible to solve this yourself using C only; it just requires some thinking.
First, allocate two arrays of N pointers, where N is your number of rows: one on the host side (A) and one on the device side (B).
Then cudaMalloc a device buffer into each element of the host-side array A.
Lastly, copy the contents of A to B with cudaMemcpy and pass B to the kernel.
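The A/B steps can be sketched as follows, assuming the same N and kernel foo as in the question; run_ab_demo is a hypothetical helper name. The host keeps A, so it can later copy rows back or free them without ever dereferencing B.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 4

// Same kernel as in the question: mat is a device array of device row pointers.
__global__ void foo( int ** mat )
{
    mat[0][0]++;
}

// Returns mat[0][0] after the kernel runs, or -1 on a CUDA error.
static int run_ab_demo( void )
{
    int * A[N] = { NULL };   // host array (A): holds device row addresses
    int ** B = NULL;         // device array (B): N pointers
    int row_h[N];
    int result = -1;
    cudaError_t err = cudaMalloc( (void**) &B, N * sizeof( int * ) );

    for( int i = 0; i < N && err == cudaSuccess; i++ )
    {
        for( int j = 0; j < N; j++ )
            row_h[j] = j * i;    // same values as mat_h[i][j]
        // &A[i] is host memory, so cudaMalloc may write the device address there.
        err = cudaMalloc( (void**) &A[i], N * sizeof( int ) );
        if( err == cudaSuccess )
            err = cudaMemcpy( A[i], row_h, N * sizeof( int ), cudaMemcpyHostToDevice );
    }

    // Copy the N device addresses from A to B, then pass B to the kernel.
    if( err == cudaSuccess )
        err = cudaMemcpy( B, A, N * sizeof( int * ), cudaMemcpyHostToDevice );
    if( err == cudaSuccess )
    {
        foo<<< 1, 1 >>>( B );
        err = cudaGetLastError();
    }

    // Read back through A[0]; B is never dereferenced on the host.
    if( err == cudaSuccess )
        err = cudaMemcpy( &result, A[0], sizeof( int ), cudaMemcpyDeviceToHost );

    for( int i = 0; i < N; i++ )
        cudaFree( A[i] );        // cudaFree( NULL ) is a no-op
    cudaFree( B );
    return err == cudaSuccess ? result : -1;
}

int main( void )
{
    printf( "mat[0][0] = %d\n", run_ab_demo() );
    return 0;
}
```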

I hope that helps :)

Yes, that helped. My real case is similar to what was shown, except that I have an array of pointers: I must first allocate the array of pointers and then allocate a struct for each pointer in it. However, each pointer in the array holds a device address, and this generates a segfault on the host. So I have to think of a way to store the pointers temporarily. Thanks!!!
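For an array of pointers to structs, the same idea applies: keep the device pointers in a temporary host array, fill each struct with cudaMemcpy, and copy the pointer array to the device once. This is a sketch under assumptions; the struct Item, the kernel touch, and run_struct_ptr_demo are hypothetical stand-ins for the real ones.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 4

// Hypothetical element type standing in for the poster's struct.
typedef struct
{
    int   key;
    float value;
} Item;

// Each thread updates the struct that its pointer refers to.
__global__ void touch( Item ** items, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < n )
        items[i]->key += 100;
}

// Returns the sum of keys read back from the device, or -1 on a CUDA error.
static int run_struct_ptr_demo( void )
{
    Item * tmp[N] = { NULL };   // host-side temporary storage for the device pointers
    Item ** items_d = NULL;     // device array of N pointers
    int sum = 0;
    cudaError_t err = cudaSuccess;

    for( int i = 0; i < N && err == cudaSuccess; i++ )
    {
        Item h = { i, 0.5f * i };   // initial host value for struct i
        err = cudaMalloc( (void**) &tmp[i], sizeof( Item ) );
        if( err == cudaSuccess )
            err = cudaMemcpy( tmp[i], &h, sizeof( Item ), cudaMemcpyHostToDevice );
    }
    if( err == cudaSuccess )
        err = cudaMalloc( (void**) &items_d, N * sizeof( Item * ) );
    if( err == cudaSuccess )
        err = cudaMemcpy( items_d, tmp, N * sizeof( Item * ), cudaMemcpyHostToDevice );
    if( err == cudaSuccess )
    {
        touch<<< 1, N >>>( items_d, N );
        err = cudaGetLastError();
    }

    // Read each struct back through the host copy of its pointer.
    for( int i = 0; i < N && err == cudaSuccess; i++ )
    {
        Item h;
        err = cudaMemcpy( &h, tmp[i], sizeof( Item ), cudaMemcpyDeviceToHost );
        if( err == cudaSuccess )
            sum += h.key;
    }

    for( int i = 0; i < N; i++ )
        cudaFree( tmp[i] );
    cudaFree( items_d );
    return err == cudaSuccess ? sum : -1;
}

int main( void )
{
    printf( "sum of keys = %d\n", run_struct_ptr_demo() );
    return 0;
}
```

Because tmp stays on the host, it also holds the addresses needed for the final cudaFree calls, so the device pointer array never has to be read back.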