I don't have extensive CUDA experience, but it seems that this kernel code suffers from several issues:
The code probably spends most of its time doing memory transfers, which are very expensive. You should definitely make sure transfers happen only when necessary. In particular, the output data should be copied back only once, after you have finished looking at all the input.
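A minimal host-side sketch of that idea (all names here are assumptions, including the hypothetical `myKernel`): the input is uploaded once, the kernel runs over the whole buffer, and the 256 counters are downloaded a single time at the end, never inside a loop.

```cuda
// Sketch only: upload once, launch once over all data, download once.
void countOnDevice(const unsigned char *h_data, int n, unsigned int h_counts[256])
{
    unsigned char *d_data;
    unsigned int  *d_counts;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_counts, 256 * sizeof(unsigned int));

    cudaMemcpy(d_data, h_data, n, cudaMemcpyHostToDevice);    // one upload
    cudaMemset(d_counts, 0, 256 * sizeof(unsigned int));      // zero the counters

    myKernel<<<(n + 255) / 256, 256>>>(d_data, d_counts, n);  // hypothetical kernel

    cudaMemcpy(h_counts, d_counts, 256 * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);                       // one download, at the end
    cudaFree(d_data);
    cudaFree(d_counts);
}
```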
You should make sure all memory reads/writes are coalesced: in your kernel, when you write `g_odata[…] += 2`, the results land in unpredictable parts of memory, and certainly not at consecutive addresses. The profiler is a great help in evaluating that.
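Assuming `g_odata` is a 256-entry histogram (a guess based on the question), a minimal fix for the scattered updates looks like this: consecutive threads read consecutive input bytes, so the loads are coalesced, and `atomicAdd` prevents concurrent increments to the same counter from being lost.

```cuda
// Sketch: coalesced reads plus atomic counter updates.
__global__ void countChars(const unsigned char *g_idata, unsigned int *g_odata, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&g_odata[g_idata[i]], 1u);  // reads coalesced; writes atomic
}
```

The writes to `g_odata` are still scattered, but the atomics make them correct; the shared-memory variant further below in this thread reduces their cost.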
The `while (inicio < fim)` loop may be a bad idea (not 100% sure here): the compiler cannot predict exactly when it will end, and threads finishing at different times hurt parallel efficiency, so the parallelization may not be so great.
For memory transfers there may be an issue with your data type: chars occupy only 1 byte, and IIRC memory accesses are performed in groups of at least 4 bytes (the size of a float). Definitely check the CUDA documentation on that.
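One common workaround (a sketch under the assumption that the input buffer is 4-byte aligned and its length is a multiple of 4) is to reinterpret the char array as `uchar4`, so each thread pulls four characters in a single 4-byte load:

```cuda
// Sketch: one 4-byte load yields four characters per thread.
__global__ void countChars4(const uchar4 *g_idata4, unsigned int *g_odata, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        uchar4 c = g_idata4[i];        // single 4-byte transaction
        atomicAdd(&g_odata[c.x], 1u);
        atomicAdd(&g_odata[c.y], 1u);
        atomicAdd(&g_odata[c.z], 1u);
        atomicAdd(&g_odata[c.w], 1u);
    }
}
```

If the length is not a multiple of 4, the last few bytes would need a scalar tail loop.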
There will be no chance to speed up your kernel function. The reason is simple: your problem/algorithm requires only one arithmetic operation (an addition) per two memory accesses (one for the character, one for the counter), one of which is not cacheable. GPU speed-ups only make sense for compute-intensive jobs, not for such simple counting work.
You will get some performance improvement if you let the threads work on your data in the following way:
`T_i : g_idata[j*len + i], j = 0 to total_len/num_threads`
Note: you should cast the char to unsigned int and read it into a local variable (a register), then do the statistics. Since a 256-entry segment of `g_odata` is used per thread for the counters, you should allocate shared memory (which is fast, on-chip) to keep your statistics. Please post your results after the improvement.
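The suggestions above could be sketched like this (a guess at the intended kernel; 256 counters per thread won't fit in shared memory, so this version keeps one 256-entry histogram per block, which is the usual compromise). Each thread strides through the input so that in every iteration a warp touches consecutive addresses:

```cuda
// Sketch: strided, coalesced reads; per-block counters in shared memory;
// one atomic per bucket per block into the global result at the end.
__global__ void countStrided(const unsigned char *g_idata, unsigned int *g_odata,
                             int total_len)
{
    __shared__ unsigned int s_count[256];
    for (int k = threadIdx.x; k < 256; k += blockDim.x)
        s_count[k] = 0;                          // zero the block-local histogram
    __syncthreads();

    for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < total_len;
         j += gridDim.x * blockDim.x) {
        unsigned int c = (unsigned int)g_idata[j];  // cast into a register first
        atomicAdd(&s_count[c], 1u);
    }
    __syncthreads();

    for (int k = threadIdx.x; k < 256; k += blockDim.x)
        atomicAdd(&g_odata[k], s_count[k]);      // merge into global counters
}
```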