I don't have extensive CUDA experience, but it seems that this kernel code suffers from several issues:
The code probably spends most of its time doing memory transfers, which are very expensive. You should definitely make sure transfers happen only when necessary. In particular, the output data should be copied back only once, after you have finished looking at all the input.
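A minimal host-side sketch of that idea (all names here are assumptions, including the hypothetical `myKernel`): the input is uploaded once, the kernel runs over the whole buffer, and the 256 counters are downloaded a single time at the end, never inside a loop.

```cuda
// Sketch only: upload once, launch once over all data, download once.
void countOnDevice(const unsigned char *h_data, int n, unsigned int h_counts[256])
{
    unsigned char *d_data;
    unsigned int  *d_counts;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_counts, 256 * sizeof(unsigned int));

    cudaMemcpy(d_data, h_data, n, cudaMemcpyHostToDevice);    // one upload
    cudaMemset(d_counts, 0, 256 * sizeof(unsigned int));      // zero the counters

    myKernel<<<(n + 255) / 256, 256>>>(d_data, d_counts, n);  // hypothetical kernel

    cudaMemcpy(h_counts, d_counts, 256 * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);                       // one download, at the end
    cudaFree(d_data);
    cudaFree(d_counts);
}
```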
You should make sure all memory reads/writes are coalesced: in your kernel, when you write `g_odata[…] += 2`, the results land in unpredictable parts of memory, and certainly not at consecutive addresses. The profiler is a great help in evaluating that.
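Assuming `g_odata` is a 256-entry histogram (a guess based on the question), a minimal fix for the scattered updates looks like this: consecutive threads read consecutive input bytes, so the loads are coalesced, and `atomicAdd` prevents concurrent increments to the same counter from being lost.

```cuda
// Sketch: coalesced reads plus atomic counter updates.
__global__ void countChars(const unsigned char *g_idata, unsigned int *g_odata, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&g_odata[g_idata[i]], 1u);  // reads coalesced; writes atomic
}
```

The writes to `g_odata` are still scattered, but the atomics make them correct; the shared-memory variant further below in this thread reduces their cost.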
The `while (inicio < fim)` loop may be a bad idea (not 100% sure here): the compiler cannot predict exactly when it will end, and threads finishing at different times hurt parallel efficiency, so the parallelization may not be so great.
For memory transfers there may be an issue with your data type: chars occupy only 1 byte, and IIRC memory accesses are performed in groups of at least 4 bytes (the size of a float). Definitely check the CUDA documentation on that.
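One common workaround (a sketch under the assumption that the input buffer is 4-byte aligned and its length is a multiple of 4) is to reinterpret the char array as `uchar4`, so each thread pulls four characters in a single 4-byte load:

```cuda
// Sketch: one 4-byte load yields four characters per thread.
__global__ void countChars4(const uchar4 *g_idata4, unsigned int *g_odata, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        uchar4 c = g_idata4[i];        // single 4-byte transaction
        atomicAdd(&g_odata[c.x], 1u);
        atomicAdd(&g_odata[c.y], 1u);
        atomicAdd(&g_odata[c.z], 1u);
        atomicAdd(&g_odata[c.w], 1u);
    }
}
```

If the length is not a multiple of 4, the last few bytes would need a scalar tail loop.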
There will be no chance to speed up your kernel function. The reason is simple: your problem/algorithm requires only one arithmetic operation (an addition) per two memory accesses (one for the character, one for the counter), one of which is not cacheable. GPU speed-ups only make sense for compute-intensive jobs, not for such simple counting work.
You will get some performance improvement if you let the threads work on your data in the following way:
`T_i : g_idata[j*len + i], j = 0 to total_len/num_threads`
Note: you should cast the char to unsigned int and read it into a local variable (a register), then do the statistics. Since a 256-entry segment of `g_odata` is used per thread for the counters, you should allocate shared memory (which is fast, on-chip) to keep your statistics. Please post your results after the improvement.
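The suggestions above could be sketched like this (a guess at the intended kernel; 256 counters per thread won't fit in shared memory, so this version keeps one 256-entry histogram per block, which is the usual compromise). Each thread strides through the input so that in every iteration a warp touches consecutive addresses:

```cuda
// Sketch: strided, coalesced reads; per-block counters in shared memory;
// one atomic per bucket per block into the global result at the end.
__global__ void countStrided(const unsigned char *g_idata, unsigned int *g_odata,
                             int total_len)
{
    __shared__ unsigned int s_count[256];
    for (int k = threadIdx.x; k < 256; k += blockDim.x)
        s_count[k] = 0;                          // zero the block-local histogram
    __syncthreads();

    for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < total_len;
         j += gridDim.x * blockDim.x) {
        unsigned int c = (unsigned int)g_idata[j];  // cast into a register first
        atomicAdd(&s_count[c], 1u);
    }
    __syncthreads();

    for (int k = threadIdx.x; k < 256; k += blockDim.x)
        atomicAdd(&g_odata[k], s_count[k]);      // merge into global counters
}
```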