thrust::copy host_vector to device_vector segfaults

// Estimator.h
#include <bitset>
#include <Eigen/Core>
#include <thrust/device_vector.h>

struct SeamTemplate
{
  Eigen::Matrix3f rot;
  std::bitset<BALL_AREA> seam;
};

class Estimator
{
  //...
};

// Estimator.cu
#include "Estimator.h"

#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <iostream>

void
Estimator::loadTemplates(std::istream& is)
{
  thrust::host_vector<SeamTemplate> tmp;
  thrust::host_vector<SeamTemplate>::size_type size = 0;

  is.read((char *) &size, sizeof(size));
  tmp.resize(size);
  // templates is a device_vector and a private member of the Estimator class
  templates.resize(size);

  is.read((char *) &tmp[0], tmp.size() * sizeof(SeamTemplate));
  std::cout << "loaded" << std::endl;
  // runs fine up to here

  thrust::copy(tmp.begin(), tmp.end(), templates.begin());
  // segfaults here, before the second "loaded" is printed
  std::cout << "loaded" << std::endl;
}

Any idea how I can read this vector from a binary file?

I used to save and load the templates without CUDA using:

// templates is std::vector<SeamTemplate> here
// Estimator.cpp
void
Estimator::saveTemplates(std::ostream& os)
{
  std::vector<SeamTemplate>::size_type size = templates.size();
  os.write((char*)&size, sizeof(size));
  os.write((char*)&templates[0], templates.size() * sizeof(SeamTemplate));
}

void
Estimator::loadTemplates(std::istream& is)
{
  std::vector<SeamTemplate>::size_type size = 0;
  is.read((char*)&size, sizeof(size));
  templates.resize(size);
  is.read((char*)&templates[0], templates.size() * sizeof(SeamTemplate));
}

You can’t use std::bitset in a thrust::device_vector (or in CUDA device code). You should be getting warnings.
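
If you need the bitset behavior on the device, a plain array of 32-bit words with the bit operations written out works in both host and device code. Here is a minimal sketch (the DeviceBitset name and its members are just illustrative, assuming BALL_AREA is a compile-time constant):

// Sketch of a device-friendly std::bitset replacement; not part of any library.
struct DeviceBitset
{
  // one bit per position, packed into 32-bit words
  unsigned int words[(BALL_AREA + 31) / 32];

  __host__ __device__ bool test(int pos) const
  {
    return (words[pos / 32] >> (pos % 32)) & 1u;
  }

  __host__ __device__ void set(int pos, bool value = true)
  {
    if (value)
      words[pos / 32] |= 1u << (pos % 32);
    else
      words[pos / 32] &= ~(1u << (pos % 32));
  }
};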

Other than that (and I didn’t use Eigen either), your general approach seems to work for me without any segfault:

$ cat t1455.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <bitset>
#include <iostream>
#include <fstream>

//const int BALL_AREA=8;

struct SeamTemplate
{
  float3 rot;
  //std::bitset<BALL_AREA> seam;
};

class Estimator
{
  public:
  thrust::device_vector<SeamTemplate> templates;
  void saveTemplates(std::ostream&);
  void loadTemplates(std::istream&);
};


void
Estimator::saveTemplates(std::ostream& os)
{
  std::vector<SeamTemplate> my_templates(templates.size());
  thrust::copy(templates.begin(), templates.end(), my_templates.begin());
  std::vector<SeamTemplate>::size_type size = my_templates.size();
  os.write((char*)&size, sizeof(size));
  os.write((char*)&my_templates[0], my_templates.size() * sizeof(SeamTemplate));
}

void
Estimator::loadTemplates(std::istream& is)
{
  thrust::host_vector<SeamTemplate> tmp;
  thrust::host_vector<SeamTemplate>::size_type size = 0;

  is.read((char *) &size, sizeof(size));
  tmp.resize(size);
  // templates is device_vector and private member of Estimator class
  templates.resize(size);

  is.read((char *) &tmp[0], tmp.size() * sizeof(SeamTemplate));
  std::cout << "loaded" << std::endl;
  // running well here

  thrust::copy(tmp.begin(), tmp.end(), templates.begin());
  // segfaults before printing "loaded"
  std::cout << "loaded" << std::endl;
}

int main(){
  std::filebuf fb;
  fb.open ("test",std::ios::out);
  std::ostream os(&fb);
  Estimator e;
  e.templates.resize(10);
  e.saveTemplates(os);
  fb.close();
  if (fb.open ("test",std::ios::in))
  {
    std::istream is(&fb);
    e.loadTemplates(is);
    fb.close();
  }
  std::cout << e.templates.size() << std::endl;
  return 0;
}
$ nvcc -o t1455 t1455.cu
$ ./t1455
loaded
loaded
10
$

Thank you!! I got it to work by replacing my bitset with an int array. Eigen was usable in the kernel after following https://eigen.tuxfamily.org/dox-devel/TopicCUDA.html and I achieved a nice performance improvement. However, I am now trying to cast the int array to a long long array inside my kernel to optimize it further, and I am stuck again.

// Working kernel code
__global__ void
estimateKernel(SeamTemplate *templates, int *target,
               int templateSize, int *scores)
{
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  if(tid >= templateSize)
    return;

  int *proposal = templates[tid].seam;
  int score = 0;

  for(int i = 0;i < 313;i++){
    score += __popc(proposal[i] & target[i]);
  }

  scores[tid] = score;
}

// Not working. It causes
// CUDA Error: misaligned address
// cuda-memcheck:
// ========= Invalid __global__ read of size 8
// =========     at 0x00000128 in estimateKernel(SeamTemplate*, int*, int, int*)
// for all threads
__global__ void
estimateKernel(SeamTemplate *templates, int *target,
               int templateSize, int *scores)
{
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  if(tid >= templateSize)
    return;

  long long *proposal = reinterpret_cast<long long *>(templates[tid].seam);
  long long *targetll = reinterpret_cast<long long *>(target);

  int score = 0;
  for(int i = 0;i < 156;i++){
    score += __popcll(proposal[i] & targetll[i]);
  }

  scores[tid] = score;
}

Any idea how I should cast my int array to a long long array? Any help is appreciated. Thank you. Also, how do I nest kernels? My GPU is CUDA compute capability 6.1.

What does your SeamTemplate struct definition look like? (I can’t read your mind.) Why don’t you just declare the seam member to be long long to begin with?

Nesting kernels is called CUDA dynamic parallelism. There are many questions on the web about it, CUDA sample codes that use it, and a whole section in the programming guide that covers it.
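
As a minimal illustration of the idea (the kernel names here are invented for the example, not taken from your code):

// Minimal CUDA dynamic parallelism sketch; parentKernel/childKernel are
// illustrative names. Build with -arch=sm_35 or higher, -rdc=true, and
// link against cudadevrt.
#include <cstdio>

__global__ void childKernel(int parent)
{
  printf("child thread %d launched by parent thread %d\n", threadIdx.x, parent);
}

__global__ void parentKernel()
{
  // each parent thread launches a small child grid from device code
  childKernel<<<1, 4>>>(threadIdx.x);
}

int main()
{
  parentKernel<<<1, 2>>>();
  cudaDeviceSynchronize();
  return 0;
}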

I cannot change it because of other parts of my code. It would be great if I could just read two 32-bit ints as one 64-bit long long, since I only care about the bit values. If that is not possible, I can reimplement everything, but I would like to avoid that and just change a few lines.

//Estimator.h
...
struct SeamTemplate
{
  Eigen::Matrix3f rot;
  int seam[313];
};
...

I have tried nesting kernels, but it gives:

CMakeFiles/estimate_rotation.dir/estimate_rotation_generated_main.cu.o: In function `__sti____cudaRegisterAll()':
/tmp/tmpxft_0000675d_00000000-5_main.cudafe1.stub.c:42: undefined reference to `__cudaRegisterLinkedBinary_39_tmpxft_0000675d_00000000_6_main_cpp1_ii_cd69331a'
CMakeFiles/estimate_rotation.dir/estimate_rotation_generated_Estimator.cu.o: In function `__sti____cudaRegisterAll()':
/tmp/tmpxft_00006732_00000000-5_Estimator.cudafe1.stub.c:182: undefined reference to `__cudaRegisterLinkedBinary_44_tmpxft_00006732_00000000_6_Estimator_cpp1_ii_05dd6786'
collect2: error: ld returned 1 exit status

//CMakeLists.txt
cmake_minimum_required(VERSION 2.8)

project(estimate_rotation)

find_package(CUDA REQUIRED)
find_package(PCL 1.8.1 REQUIRED)
find_package(OpenCV 4.1.1 REQUIRED)

include_directories(${PCL_INCLUDE_DIRS})
include_directories(${OpenCV_INCLUDE_DIRS} )
link_directories(${PCL_LIBRARY_DIRS})
add_definitions(${PCL_DEFINITIONS})

set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -arch=sm_35 -rdc=true -lcudadevrt")

cuda_add_executable (estimate_rotation main.cu Estimator.cu)
target_link_libraries (estimate_rotation ${PCL_LIBRARIES} ${OpenCV_LIBS})

I just have main.cu, Estimator.cu, and Estimator.h; both main.cu and Estimator.cu include Estimator.h. Any idea how I should link properly?

It was working before I added the nvcc flags. I had to add them to call nested kernels.

If you want to cast seam to a pointer to a 64-bit type, I think this should allow you to do that:

struct __align__(8) SeamTemplate    // note change
{
  int seam[313];                    // note change in order
  Eigen::Matrix3f rot;
};
...
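
A compile-time check along these lines can confirm that the reordered struct really gives you the alignment the cast needs (the static_assert is an illustrative addition, not something CUDA requires):

// With __align__(8), sizeof(SeamTemplate) is padded to a multiple of 8, so
// every element of an array of SeamTemplate starts on an 8-byte boundary;
// since seam is now the first member, the long long cast is aligned.
static_assert(alignof(SeamTemplate) >= 8,
              "SeamTemplate must be 8-byte aligned for the long long cast");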

Refer to this section in the programming guide:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses

Of course it won’t be valid to try to access the entire length of an int array of length 313 with a long long pointer, but I don’t know if that matters in your code. The last element (at int index 312) would not be accessible that way.
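
If that last element does matter, one way to handle it is to split the loop: process 156 full 64-bit words and then the leftover int (index 312) with a 32-bit popcount. Just a sketch using the same names as your kernel:

  long long *proposal = reinterpret_cast<long long *>(templates[tid].seam);
  long long *targetll = reinterpret_cast<long long *>(target);

  int score = 0;
  // 312 ints viewed as 156 long longs
  for(int i = 0;i < 156;i++){
    score += __popcll(proposal[i] & targetll[i]);
  }
  // leftover trailing int that does not form a full 64-bit word
  score += __popc(templates[tid].seam[312] & target[312]);

  scores[tid] = score;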

I spend as little time as possible wrestling with CMake. CMake went through a change in how it handles CUDA somewhere around version 3.8, so the version you are using matters. (find_package(CUDA) is part of the old regime, so I guess you are using CMake < 3.8.) CDP (CUDA Dynamic Parallelism) codes require specific compiling and linking steps:

  1. Compile for a cc3.5 or higher architecture: -arch=sm_35
  2. Compile and link with relocatable device code: -rdc=true
  3. Link against the CUDA device runtime library: -lcudadevrt
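
For example, an equivalent plain nvcc invocation for your two source files would look something like this (sm_61 to match your cc 6.1 device; just an illustration, not a tested build line):

$ nvcc -arch=sm_61 -rdc=true main.cu Estimator.cu -lcudadevrt -o estimate_rotation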

For basic Makefile usage, I refer you to any of the CUDA sample projects that use CDP (just look for cdp in the project name) and study the associated Makefile.

Perhaps someone else will be able to tell you how to get CMake to bend to your will. Google may also be your friend.

https://gist.github.com/srivathsanmurali/c4da1f2b3d531c385ffcce3f799c25a6

Casting to long long worked without any issue after changing my struct. I got a 40% performance improvement, thanks to you ^^