parallel calculate character frequency in a string

xiaoxi · June 30, 2011, 6:32am

Anybody have idea about how to find the frequency of each char in a string?
For example: In aabbcd, I need to figure out frequency of characters: a: 2, b: 2, c: 1, d: 1

My implementation is as below, but it doesn’t work, and there are multiple threads waiting for writing distribution[index], it is not so efficient.
Does anybody have better idea? Thanks a lot

global
void countblock(unsigned char *d_buf, unsigned int length, unsigned *output )
{
unsigned tid = threadIdx.x +
blockIdx.x * blockDim.x +
blockIdx.y * blockDim.x * gridDim.x;

unsigned char index = d_buf[tid];

atomicAdd(&distribution[index], 1);


__syncthreads();
if(tid < 257)
output[tid] = distribution[tid];

__syncthreads();

}

brano · June 30, 2011, 8:37am

Anybody have idea about how to find the frequency of each char in a string?

For example: In aabbcd, I need to figure out frequency of characters: a: 2, b: 2, c: 1, d: 1

My implementation is as below, but it doesn’t work, and there are multiple threads waiting for writing distribution[index], it is not so efficient.

Does anybody have better idea? Thanks a lot

global

void countblock(unsigned char *d_buf, unsigned int length, unsigned *output )

{
unsigned tid = threadIdx.x +

                      blockIdx.x * blockDim.x + 

                      blockIdx.y * blockDim.x * gridDim.x;

unsigned char index = d_buf[tid];

atomicAdd(&distribution[index], 1);

__syncthreads();

if(tid < 257)

output[tid] = distribution[tid];

__syncthreads();
}

Hi,

you can take a look at this:

http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/histogram256/doc/histogram.pdf

http://csce.uark.edu/~jgauch/library/papers/Shams.2007.pdf

I guess you have 3 different approaches:

Atomic on global
Atomic on shared
Sort → Run Length Coding ( two kernels and no use of atomic)

If you know anything about your data-set you could improve the algorithm.

Is it possible for the string to contain all possible characters? or only alphabetic?

One thing to notice is that the ASCII ‘a’ is not 0, so in your example you need to allocate the memory needed otherwise you will hit outside the array.

In your example “aabbcd” you will get [2,2,1,1] in position [97,98,99,100].

Gert-Jan · June 30, 2011, 10:48am

As brano already said, your problem looks a lot like a histogram problem. So you can also take a look at our paper about GPU histogramming available via: [url=“http://dx.doi.org/10.1145/1964179.1964181”]http://dx.doi.org/10.1145/1964179.1964181[/url]. You can find an example implementation on our website: [url=“PARsE | Research | Algorithms and tools”]http://parse.ele.tue.nl/research/tools[/url].

JaredHoberock · July 1, 2011, 5:42am

Sort the string, and then do a keyed reduction. Here’s some example code using thrust [1]:

#include <thrust/device_vector.h>

#include <thrust/sort.h>

#include <thrust/reduce.h>

#include <thrust/iterator/constant_iterator.h>

#include <string>

#include <iostream>

int main()

{

  using namespace thrust;

  std::string h_input = "aabbcd";

// move the input string to the device

  device_vector<char> d_input(h_input.begin(), h_input.end());

// sort the input to bring similar keys together

  sort(d_input.begin(), d_input.end());

// allocate storage for the resulting histogram

  device_vector<char> d_keys(d_input.size());

  device_vector<int>  d_count(d_input.size());

// count the frequency of each character key

  pair<device_vector<char>::iterator, device_vector<int>::iterator> new_end = 

    reduce_by_key(d_input.begin(), d_input.end(),

                  thrust::make_constant_iterator(1),

                  d_keys.begin(),

                  d_count.begin());

// eliminate extra unneeded storage in histogram

  d_keys.erase(new_end.first, d_keys.end());

  d_count.erase(new_end.second, d_count.end());

// print the result

  std::cout << "character frequency:" << std::endl;

  for(int i = 0; i < d_keys.size(); ++i)

  {

    std::cout << d_keys[i] << ": " << d_count[i] << std::endl;

  }

}

Here’s the output:

yucky:~ jared$ nvcc character_frequency.cu -run

character frequency:

a: 2

b: 2

c: 1

d: 1

[1] Google Code Archive - Long-term storage for Google Code Project Hosting.

xiaoxi · July 9, 2011, 10:36pm

Thank you all, your suggestions are helpful :)