Thank you!! I got it to work by replacing my bitset with an int array. Eigen was usable in the kernel after following https://eigen.tuxfamily.org/dox-devel/TopicCUDA.html and I achieved a nice performance improvement. However, I am now trying to cast the int array to a long long array inside my kernel to optimize it further, and I am stuck again.
// Working kernel code
__global__ void
estimateKernel(SeamTemplate *templates, int *target,
               int templateSize, int *scores)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= templateSize)
        return;
    int *proposal = templates[tid].seam;
    int score = 0;
    for (int i = 0; i < 313; i++) {
        score += __popc(proposal[i] & target[i]);
    }
    scores[tid] = score;
}
// Not working. It causes
// CUDA Error: misaligned address
// cuda-memcheck:
// ========= Invalid __global__ read of size 8
// ========= at 0x00000128 in estimateKernel(SeamTemplate*, int*, int, int*)
// for all threads
__global__ void
estimateKernel(SeamTemplate *templates, int *target,
               int templateSize, int *scores)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= templateSize)
        return;
    long long *proposal = reinterpret_cast<long long *>(templates[tid].seam);
    long long *targetll = reinterpret_cast<long long *>(target);
    int score = 0;
    for (int i = 0; i < 156; i++) {
        score += __popcll(proposal[i] & targetll[i]);
    }
    scores[tid] = score;
}
Any idea how I should cast my int array to a long long array? Any help is appreciated. Thank you. Also, how do I nest kernels? My device is compute capability 6.1.
What does your SeamTemplate struct definition look like? (I can’t read your mind.) Why don’t you just declare the seam member to be long long to begin with?
Nesting kernels is called CUDA dynamic parallelism. There are many questions on the web about it, CUDA sample codes, and a whole section in the programming guide that covers it.
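As a minimal illustration (names here are made up, not from your code), dynamic parallelism just means a `__global__` function launching another `__global__` function; it requires the CDP build flags covered elsewhere in this thread:

```
// Minimal CUDA dynamic parallelism sketch (illustrative names).
// Requires building with relocatable device code and linking cudadevrt.
__global__ void childKernel(int *data, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        data[i] *= 2;
}

__global__ void parentKernel(int *data, int n)
{
    // A kernel launched from device code: this is "nesting kernels".
    if (threadIdx.x == 0 && blockIdx.x == 0)
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
}
```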
I cannot change it because of other parts of my code. It would be great if I could just read two 32-bit ints as one 64-bit long long, since I only care about the bit values. If it’s not possible, I can reimplement everything, but I would like to avoid that and just change a few lines.
Of course it won’t be valid to access the entire length of an int array of length 313 through a long long pointer, but I don’t know whether that matters in your code. The last element (at int index 312) would not be legally accessible that way, since 313 ints form only 156 complete 64-bit words. Separately, the "misaligned address" error means the cast pointer is not 8-byte aligned: a long long load requires 8-byte alignment, so this only works if templates[tid].seam (and target) actually start on an 8-byte boundary.
I spend as little time as possible wrestling with CMake. CMake changed how it handles CUDA somewhere around version 3.8, so the version you are using matters. (find_package(CUDA) is part of the old regime, so I guess you are using CMake < 3.8.) CDP (CUDA Dynamic Parallelism) codes require specific compile and link steps:
Compile for a cc 3.5 or higher architecture: -arch=sm_35
Compile and link with relocatable device code: -rdc=true
Link against the CUDA device runtime library: -lcudadevrt
For basic Makefile usage, I refer you to any of the CUDA sample projects that use CDP (just look for "cdp" in the project name) and study the associated Makefile.
Perhaps someone else will be able to tell you how to get CMake to bend to your will. Google may also be your friend.
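For reference, a bare nvcc invocation combining those three requirements might look like this (file names are placeholders; sm_61 matches the compute capability 6.1 device mentioned above and satisfies the cc 3.5+ requirement):

```shell
# CDP build: cc 3.5+ target, relocatable device code, device runtime library
nvcc -arch=sm_61 -rdc=true -o cdp_app cdp_app.cu -lcudadevrt
```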