It seems that CUDA doesn’t support pointer-to-pointer

Hi, it seems that CUDA doesn’t support pointers to pointers, like:
int** a;
cudaMalloc(&a, sizeof(int*) * N);
for (int i = 0; i < N; ++i)
    cudaMalloc(&a[i], size); // memory violation here

The pseudocode above fails on the device; we have to use a contiguous 1D array or cudaMalloc2D instead.

Any advice? Thanks a lot!

Has anybody met this issue? Eagerly waiting for a reply…

I didn’t say anything :P

Anyway, I think CUDA does not support pointers to pointers this way, so I suppose you will have to replace the pointers with indexes or something similar. You have to keep in mind that after cudaMalloc( &a, … ), a is “pointing” to device memory, not host memory. So as soon as you dereference a on the host, you are doing something that makes no sense: *a is not memory in host space.

If you just need an array of device memory chunks, the first malloc must be a regular host one. But keep in mind that you will not be able to dereference the array itself on the device; you can only use it for things like:

cudaDeviceFunction<<<…>>>( a[1], a[2], … );

that is, passing the individual elements, not the array itself as a parameter.
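To illustrate the distinction, here is a minimal sketch (myKernel, grid sizes, and size are placeholders, not from the thread):

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(int *p, int *q) { /* ... */ }

int main() {
    const int N = 4;
    const size_t size = 256 * sizeof(int);

    int *a[N];                      // plain host array of device pointers
    for (int i = 0; i < N; ++i)
        cudaMalloc(&a[i], size);    // each element points into device memory

    // fine: the elements are device pointers passed by value
    myKernel<<<1, 256>>>(a[1], a[2]);

    // NOT fine: the kernel would receive (and dereference) a *host* address
    // myKernel<<<1, 256>>>(a);

    for (int i = 0; i < N; ++i)
        cudaFree(a[i]);
    return 0;
}
```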

You cannot dereference a device pointer in host code. In the same way, you cannot have cudaMalloc write to a location pointed to by a device pointer.

You need to explicitly copy to device memory with something like this:

int** a;
cudaMalloc(&a, sizeof(int*) * N);

int* ha[N];
for (int i = 0; i < N; ++i)
    cudaMalloc(&ha[i], size);

cudaMemcpy(a, ha, sizeof(a), cudaMemcpyHostToDevice);

Thank you very much! I will try.

I have tried this (using floats), but in the kernel, the compiler appears confused about how to dereference the pointer.

Can you tell me what’s wrong below, or post the kernel code corresponding to your host code?


// one row of the array per thread block...
__global__ void kernelTestP2P(float **g_array, float *g_Out)
{
    unsigned int bid = blockIdx.x;
    unsigned int tid = threadIdx.x;

    extern __shared__ float shmem[];

    // get the pointer to this thread block's row
    __shared__ float *g_row;    // or should this be:  float *g_row; ....
    if (tid == 0)               // ... and eliminate this line???  neither seems to work...
        g_row = g_array[bid];

    shmem[tid] = g_row[tid];
    // ...
}


Can you provide an example of the kernel code?

What about using cudaMalloc2D? Although I haven’t personally been able to do much with it, it seems like the thing that you’re looking for.
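For reference, in later toolkits the pitched-allocation call is cudaMallocPitch; a rough sketch of how it replaces the pointer-to-pointer layout (B rows of N floats, dimensions borrowed from later in the thread):

```cuda
#include <cuda_runtime.h>

int main() {
    const int B = 14, N = 32;
    float *devPtr;
    size_t pitch;   // row stride in bytes, chosen by the runtime for alignment

    cudaMallocPitch((void**)&devPtr, &pitch, N * sizeof(float), B);

    // inside a kernel, row i then starts at:
    //   float *row = (float*)((char*)devPtr + i * pitch);

    cudaFree(devPtr);
    return 0;
}
```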

I think g_row is a variable in shared memory in your code, so when accessing g_row[tid], the compiler probably generates a shared-memory access.

You should have a look at your .ptx file. When loading something from global memory into shared memory, there should be an ld.global instruction for the source read. If your .ptx file instead shows something like ld.shared for that access,

then you know where your problem is ;-)
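Following that diagnosis, a sketch of the kernel with g_row kept as a plain per-thread local instead of a __shared__ pointer (untested; shmem is assumed to be the dynamically sized shared-memory buffer):

```cuda
__global__ void kernelTestP2P(float **g_array, float *g_Out)
{
    unsigned int bid = blockIdx.x;
    unsigned int tid = threadIdx.x;

    extern __shared__ float shmem[];

    // a local (register) pointer: every thread reads the row pointer itself,
    // so no __shared__ qualifier and no tid == 0 guard is needed
    float *g_row = g_array[bid];

    shmem[tid] = g_row[tid];
    __syncthreads();
    // ... use shmem, write results to g_Out ...
}
```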

The code I wrote earlier had a typo; here’s the right code for the copy to work correctly:
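A sketch of what the corrected copy presumably looks like, assuming the typo was passing sizeof(a) (the size of a single pointer) instead of the size of the whole pointer table:

```cuda
int** a;
cudaMalloc(&a, sizeof(int*) * N);

int* ha[N];
for (int i = 0; i < N; ++i)
    cudaMalloc(&ha[i], size);

// copy the full table of N device pointers, not just sizeof(a) bytes
cudaMemcpy(a, ha, sizeof(int*) * N, cudaMemcpyHostToDevice);
```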

Wow, I can’t believe I’m facing a problem that was solved 4 years ago!!!

This saves me from further head-banging LOL.

Now, is there a newer, easier way to do this, since 4 years have passed?

This page helped me a lot, so I’ll just add a simple example to show clearly how we can use pointers to pointers in CUDA.
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <fstream>
#include <cuda.h>
#include <math_functions.h>

static void HandleError( cudaError_t err, const char *file, int line ) {
    if (err != cudaSuccess) {
        printf( "%s in %s at line %d\n", cudaGetErrorString( err ), file, line );
        exit( EXIT_FAILURE );
    }
}
#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))

float Randomizer() { return (float)rand() / (float)RAND_MAX; }

__global__ void kernelTestP2P(float **in, float **out)
{
    unsigned int bid = blockIdx.x;
    unsigned int tid = threadIdx.x;

    // get the pointer to this thread block's row
    extern __shared__ float shmem[];

    shmem[tid] = in[bid][tid];
    __syncthreads();

    out[bid][tid] = shmem[tid];
}

int main() {
    int B = 14, N = 32;
    srand( unsigned( time(NULL) ) );

    float **in_dd, **in_hd, **in_hh;

    // allocate 1st part, B count of float*
    HANDLE_ERROR(cudaMalloc(&in_dd, sizeof(float*) * B));
    in_hd = (float**)malloc(sizeof(float*) * B);
    in_hh = (float**)malloc(sizeof(float*) * B);

    // allocate 2nd part, N count of float
    for (int i = 0; i < B; ++i) {
        HANDLE_ERROR(cudaMalloc(&in_hd[i], sizeof(float) * N));
        in_hh[i] = (float*)malloc(N * sizeof(float)); // allocate or connect the input data to it
        for (int j = 0; j < N; ++j) in_hh[i][j] = Randomizer();
        HANDLE_ERROR(cudaMemcpy(in_hd[i], in_hh[i], sizeof(float) * N, cudaMemcpyHostToDevice));
    }
    // copy the float* part from hd to dd
    HANDLE_ERROR(cudaMemcpy(in_dd, in_hd, sizeof(float*) * B, cudaMemcpyHostToDevice));

    float **out_dd, **out_hd, **out_hh;

    out_hd = (float**)malloc(sizeof(float*) * B);
    out_hh = (float**)malloc(sizeof(float*) * B);
    HANDLE_ERROR(cudaMalloc(&out_dd, sizeof(float*) * B));

    for (int i = 0; i < B; i++) {
        HANDLE_ERROR(cudaMalloc(&out_hd[i], sizeof(float) * N));
        out_hh[i] = (float*)malloc(N * sizeof(float)); // just allocate, NO input
    }
    HANDLE_ERROR(cudaMemcpy(out_dd, out_hd, sizeof(float*) * B, cudaMemcpyHostToDevice));

    size_t shMemSize = N * sizeof(float);
    dim3 GridDim = dim3(B, 1, 1), BlockDim = dim3(N, 1, 1);

    kernelTestP2P<<<GridDim, BlockDim, shMemSize>>>(in_dd, out_dd);

    for (int i = 0; i < B; i++) {
        HANDLE_ERROR(cudaMemcpy(out_hh[i], out_hd[i], sizeof(float) * N, cudaMemcpyDeviceToHost));
    }

    std::ofstream f_out("output.txt");
    for (int i = 0; i < B; i++) {
        for (int j = 0; j < N; j++) f_out << out_hh[i][j] << " ";
        f_out << "\n";
    }
    return 0;
}