Correct Use of CUDA shared memory for Non-contiguous data access

I have a basic question related to Two-dimensional thread access. Actually these functions are embedded within MPI functions. To make it clear, I just describe the CUDA related parts.
I want to copy the non-contiguous data into contiguous buffer and the use of cuda memcopy can be illustrated as:

void pack_cuda(float *dstbuf, IOV *srciov, int num_iov)
      int i;
      float *ptr;
      ptr = buf;
      for (i = 0; i < num_iov; i++) {
        cudaMemcpy(ptr, srciov[i].bufaddr, srciov[i].len, cudaMemcpyDefault);
      ptr = (char *)ptr + srciov[i].len;
  • srciov stores the start memory address and length of each non-contiguous data in an array of structure.
  • dstbuf will store the packed contiguous data after the completion of the function.

Now, I want to implement it using CUDA kernels.

__global__ void pack_cuda(float *dstbuf, IOV *srciov, int num_iov)
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      int j = blockIdx.y * blockDim.y + threadIdx.y;
      int k;
      extern __shared__ size_t tmpdbuflen[16*3]; //suppose num_iov is 16
      if ( j == 0 ){
        if ( i < 16 ){
         tmpdbuflen[i] = (srciov[i].len);   //store length to calculate presum
         tmpdbuflen[i+16] = tmpdbuflen[i];  //store length
         tmpdbuflen[i+32] = ((srciov+i)->bufaddr) - (srciov->bufaddr); //store addr difference

        for ( k = 0; k < i; k++)
          tmpdbuflen[i] += srciov[k].len;


      if ( i < 16 && j < srciov[i].len ){  //wondering whether this is correct use
        dst[tmpdbuflen[i] + j] = *(src + tmpdbuflen[i+32] + j);


Kernel invocation part:

dim3 dimblock(16, 16);  //the length of each non-contiguous data is less than 16 
dim3 dimgrid(1,1);
const unsigned int shm_size = sizeof(size_t) * 16 * 3;
pack_cuda<<<dimgrid, dimblock, shm_size, 0>>>(dstbuf, srciov, num_iov);

However, it seems that I cannot pack all needed datas into dst buffer (especially combined with the MPI_Send and MPI_Recv calls).
Sometimes only j = 0 and 1 (with corresponding various i) get packed.

I think the one possible problem is the usage of shared memory. I only use column 0 threads (threadIdx.y == 0) to copy information onto the shared memory. Then all threads (no restriction on threadIdx.y) will access and read information in shared memory.
How to modify the code enable such function?

One additional question is, what changes are needed if this kernel pack_cuda function will be called multiple times on host?

I’d appreciate it if anyone can figure out my problems.