solution for misaligned address read?!

To sum it up, I need a way to read an int from an address that is not aligned to an int. Is there any way to do that very fast?
My actual use case is outlined below.

A major part of my application is comparing sequences to find the length for which they are identical. For example, “werE3” and “werF3” will return 3. These sequences are part of a larger sequence and their starting indices are given, so the function looks something like:

int getCommonLength(const char *ptr, int index1, int index2)
{
    static const int max_match = 128;
    for (int i = 0; i < max_match; ++i)
        if (ptr[index1 + i] != ptr[index2 + i])
            return i;
    return max_match;   // identical up to the limit
}

This code is too slow. Even with texture fetches, unrolling, etc., it is still too slow: there is a lot of warp divergence and a lot of comparisons, since the matches can be long.
I was hoping to read 4 bytes (an int) or more at a time, and thus compare 4 bytes at a time. The problem is that index1 and index2 are not guaranteed to be 4-byte aligned…

Any ideas?

Classic approach, as used in tuned memcmp() implementations on RISC platforms with no support for misaligned accesses:

  1. Proceed with byte-wise comparison until ptr1 is aligned.

  2. Depending on (ptr2 % 4), now select one of four different code paths. All four paths can simply load straight from ptr1; only the code path for (ptr2 % 4) == 0 can also read straight from ptr2.

  3. For loads from ptr2 (suitably aligned down), the other three code paths need to use a 32-bit variable as a buffer for merging bytes from the previous aligned load with bytes from the current aligned load. The byte-merge pattern (1,3 or 2,2 or 3,1) is fixed for each code path; a sketch of this merge load follows below.

  4. Watch out for end cases as you approach the comparison length limit.
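To make the byte-merge concrete, here is a minimal sketch (my illustration, not code from this thread) of loading a 32-bit word from an unaligned address using only aligned loads. It assumes little-endian byte order (as on x86 and NVIDIA GPUs) and a buffer padded out to a word boundary; the helper name load_unaligned32 is invented:

#include <stdint.h>

// Sketch only: read 32 bits from a possibly unaligned address by
// merging two aligned loads with shifts (little-endian assumed).
// Caveat: may read up to 3 bytes past the requested word, so the
// buffer must be padded out to a word boundary.
__host__ __device__ unsigned load_unaligned32(const char *p)
{
    uintptr_t addr    = (uintptr_t)p;
    const unsigned *w = (const unsigned *)(addr & ~(uintptr_t)3); // align down
    unsigned shift    = (unsigned)(addr & 3) * 8;  // misalignment in bits
    unsigned lo = w[0];                            // aligned load, low bytes
    if (shift == 0)
        return lo;                                 // already aligned
    unsigned hi = w[1];                            // aligned load, high bytes
    return (lo >> shift) | (hi << (32 - shift));   // merge the two halves
}

In a loop over ptr2, the previous iteration's high word becomes the next iteration's low word, so each iteration needs only one new aligned load, exactly as described in step 3 above.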

Unfortunately, fast memcmp() implementations for RISC processors were typically written in assembly language; otherwise you might be able to reuse open-source code from the C library implementations of one of those platforms.

Operating on character strings that can be anywhere in memory is not the best match for GPU processing. You might want to go one level up in the design process and investigate whether there are alternative data representations that would make an efficient GPU implementation easier.

Thanks!
This approach seems ideal for a serial implementation. The four code paths sound bad for warp divergence, but seem feasible, as they only differ in the bit-shift constants (and can perhaps be collapsed into a single path, as sketched below).
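In fact, on CUDA the four paths can probably be collapsed into one divergence-free path, because the merge can be parameterized by the runtime misalignment. A minimal sketch (my own, not from this thread) using the __byte_perm() intrinsic; the helper name load_misaligned32 is invented, and the same end-of-buffer padding caveat applies:

#include <stdint.h>

// Sketch only: single-path misaligned 32-bit load via __byte_perm().
// __byte_perm(x, y, s) selects result bytes from the 8-byte pair
// {x bytes 0-3, y bytes 4-7}; selector 0x3210 reproduces x, and
// adding k * 0x1111 slides the 4-byte window up by k bytes.
__device__ unsigned load_misaligned32(const char *p)
{
    uintptr_t addr    = (uintptr_t)p;
    const unsigned *w = (const unsigned *)(addr & ~(uintptr_t)3); // align down
    unsigned k        = (unsigned)(addr & 3);                     // 0..3
    return __byte_perm(w[0], w[1], 0x3210 + k * 0x1111);
}

Every lane executes the same instruction sequence regardless of its misalignment, so this variant adds no warp divergence.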

Another possibility: think parallel. Let a thread block tile (i.e., a warp) handle this.

$ cat t1404.cu
#include <cooperative_groups.h>
#include <iostream>

using namespace cooperative_groups;

template <typename T>
__device__ int getCommonLength(T g, const char *ptr, int index1, int index2)
{
    int lane = g.thread_rank();          // this thread's position in the tile
    int i = 0;
    bool done = false;
    static const int max_match = 128;
    int return_val = max_match;
    while ((i < max_match)&&(!done)){
      // each of the 32 threads compares one byte per iteration
      const char b1 = ptr[index1 + lane + i];
      const char b2 = ptr[index2 + lane + i];
      // ballot: bit k is set if lane k saw a mismatch
      unsigned bcmp = g.ballot(b2!=b1);
      // __ffs returns the 1-based position of the lowest set bit (0 if none)
      unsigned fs = __ffs(bcmp);
      if (fs > 0) {done = true;  return_val = i + fs - 1;}
      i += g.size();                     // advance by the tile width (32)
      }
    return return_val;
}

__global__ void k(const char *ptr, int index1, int index2, int *match_length){
  auto tile = tiled_partition<32>(this_thread_block());
  int my_match_length = getCommonLength(tile, ptr, index1, index2);
  if (tile.thread_rank()==0) *match_length = my_match_length;
}

int main(){
  const int dsz = 256;
  char *d_data, *h_data;
  int *d_ml, h_ml;
  h_data = new char[dsz];
  cudaMalloc(&d_data, dsz);
  cudaMalloc(&d_ml, sizeof(int));
  // create two identical 66-byte runs starting at offsets 1 and 100
  for (int i = 0; i < 66; i++) {h_data[i+1] = i; h_data[i+100] = i;}
  h_data[67] = 1;                      // byte just past the first run
  cudaMemcpy(d_data, h_data, dsz, cudaMemcpyHostToDevice);
  k<<<1,32>>>(d_data, 1, 100, d_ml);   // runs line up: expect 66
  cudaMemcpy(&h_ml, d_ml, sizeof(int), cudaMemcpyDeviceToHost);
  std::cout << "match length: " << h_ml << std::endl;
  k<<<1,32>>>(d_data, 2, 100, d_ml);   // shifted start: immediate mismatch, expect 0
  cudaMemcpy(&h_ml, d_ml, sizeof(int), cudaMemcpyDeviceToHost);
  std::cout << "match length: " << h_ml << std::endl;
  return 0;
}
$ nvcc -std=c++11 t1404.cu -o t1404
$ cuda-memcheck ./t1404
========= CUDA-MEMCHECK
match length: 66
match length: 0
========= ERROR SUMMARY: 0 errors
$

This compares 32 bytes at a time, greatly reduces the loop trip count, and has no divergence in getCommonLength whatsoever.

https://devblogs.nvidia.com/cooperative-groups/

I cannot post the code as this is from a commercial application, but I am sharing the results for others.

And the (surprising) results, from fastest to slowest are:

  1. Combine 2 aligned integer reads into a single “misaligned” integer and compare integers (my original implementation). Fastest! (An illustrative sketch appears after this list.)

  2. First run on bytes until one of the indices is aligned, then continue using one aligned integer read and one combined read (as suggested by njuffa above). Slower, probably due to the extra work performed on the bytes; it only beats approach 1 for very long sequences.

  3. The parallel approach suggested by txbob is slower still. While the memory reads are aligned, they are byte-sized. More importantly, I believe __ballot and __ffs are instructions with lower throughput than general compute. I would also add that in my data the addresses are semi-local (the indices are likely to be 1-32KB apart), which makes caching more effective. This version might be faster for completely random addressing over large distances.

  4. Running a naive loop on the bytes, as in my original getCommonLength, is the slowest.
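For readers who want a starting point, approach 1 could look roughly like the following (an illustration, not my actual code), reusing the load_misaligned32 helper sketched above; little-endian byte order is assumed:

// Sketch only: compare 4 bytes per iteration via merged aligned loads.
__device__ int getCommonLength4(const char *ptr, int index1, int index2)
{
    const int max_match = 128;
    for (int i = 0; i < max_match; i += 4) {
        unsigned a = load_misaligned32(ptr + index1 + i);
        unsigned b = load_misaligned32(ptr + index2 + i);
        unsigned diff = a ^ b;          // nonzero bits mark differing bytes
        if (diff)
            // lowest set bit / 8 gives the first differing byte (little-endian)
            return i + (__ffs(diff) - 1) / 8;
    }
    return max_match;
}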

Not exactly the results that I expected…
Thank you for your help!