First approach to CUDA: does my code need optimization, or is it just slow?


I’m new to the world of CUDA and I want to try it. I have watched a few tutorials on programming that exploits parallelism.

As a first program I tried to port to CUDA something I had already done in PHP.

I have a very long string of characters (nearly 100k chars) and I need to find out whether it contains a set of other strings, all of the same length.
To do this I tried to use CUDA and take advantage of multithreading: one thread for each character sequence I need to find.

The base code is the following:

#define N 1000
#define DIM 50

// a = text of ~70k chars, b = all search terms (each DIM chars long),
// c = result (match position per term), d = length of a
__global__ void helloKern( char* a, char *b, int* c, int d){
	int tid = threadIdx.x;

	for(int k=0; k<=d-DIM; k++){
		bool uguali = true;                   // "uguali" = "equal"
		for(int i=0; i<DIM; i++){
			if(a[k+i] != b[(tid*DIM)+i]){ // each thread checks its own search term
				uguali = false;
				break;
			}
		}
		if(uguali)
			c[tid] = k;
	}
}

int main( void ){
	char *frase = "lorem i...... etc. 70k chars";
	int trovati[N];
	char *dev_frase;
	char *dev_ricerca;
	int *dev_trovati;
	char *ricerca = "find1find2find3...find50"; // every search term is 50 chars
	// CUDA malloc and copy
	cudaMalloc((void **)&dev_ricerca, N*DIM*sizeof(char));
	cudaMalloc((void **)&dev_frase, strlen(frase)*sizeof(char));
	cudaMalloc((void **)&dev_trovati, N*sizeof(int));
	cudaMemcpy( dev_ricerca, ricerca, N*DIM*sizeof(char), cudaMemcpyHostToDevice);
	cudaMemcpy( dev_frase, frase, strlen(frase)*sizeof(char), cudaMemcpyHostToDevice);
	// Launch 1000 threads, one for every search word
	helloKern <<< 1, N >>> (dev_frase, dev_ricerca, dev_trovati, strlen(frase));
	cudaMemcpy( trovati, dev_trovati, N*sizeof(int), cudaMemcpyDeviceToHost);
	// Then print the results
	return 0;
}

I tried compiling and it works, but I noticed that while the memcpys are very fast (microseconds), the kernel itself is quite slow: on a text of 70k characters, searching 1000 words of 50 characters each in parallel takes about 20 seconds.
I was expecting less time, especially considering that an equivalent PHP regex runs about 1000 times faster (on an ordinary CPU).

You have for loops and if statements, which produce branching. Also, your memory accesses are not coalesced, and you are not using shared memory.

It’s always a good idea to do some tutorials first, and to understand the architecture and some of the basic SDK samples, before trying to throw real-world algorithms onto the GPU.

A lot of people start porting their code right away, and then they are disappointed when the results don’t match the expectations.

You probably want to learn more about these topics:

  • cached read access using textures (1D linear memory, 2D cudaArrays, __ldg() on Compute 3.5)
  • coalesced reads from global memory
  • shared memory and its limitations (bank conflicts)

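To illustrate the coalescing point, here is a minimal sketch (my own example, not taken from the question’s code) contrasting a coalesced access pattern with a strided one like the `b[tid*DIM + i]` read above:

```cpp
// Coalesced: thread i of a warp reads element i, so the warp's 32 loads
// hit consecutive addresses and combine into a few memory transactions.
__global__ void readCoalesced(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: thread i reads element i*stride -- the same pattern as
// b[tid*DIM + i] in the question. Neighboring threads touch addresses
// far apart, so the warp's loads cannot be combined.
__global__ void readStrided(const float *in, float *out, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];
}
```

One common fix is to transpose the search-term buffer so that character i of all N terms is stored contiguously; then neighboring threads read neighboring bytes (`b[i*N + tid]`) and the access coalesces.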
Probably you also want to put your search strings (“ricerca”) into constant memory, assuming they never exceed 64KB in total size. Otherwise, linear memory bound to a texture reference can be used.

Oh wait… every thread compares against a different part of the search buffer (50 characters per thread). That is not really an efficient way of accessing constant memory - constant memory is fastest when all threads read the same location. But your dev_frase is accessed at the same position a[k] by all threads simultaneously, so that is the array that would benefit most from being in constant memory. However, it is longer than 64KB (currently 70k chars), so I guess it should be accessed through a texture reference instead. Texture size limits will be an issue here; very long texts would have to be chopped into smaller segments.
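As a sketch of the constant-memory route (the chunk size and names here are my own, not from the thread), the text could be processed in slices of at most 64KB, copied into `__constant__` memory before each launch; since all threads read `a[k]` together, constant memory broadcasts that access to the whole warp in one go:

```cpp
// Constant memory is limited to 64KB per device; CHUNK_SIZE is an
// assumed value leaving headroom below that limit.
#define CHUNK_SIZE 65000

__constant__ char const_text[CHUNK_SIZE];   // hypothetical name

// Host side: copy one slice of the 70k-char text per kernel launch.
// cudaMemcpyToSymbol(const_text, frase + offset, chunk_len);
```

Note that consecutive chunks would need to overlap by DIM-1 characters, otherwise a match straddling a chunk boundary would be missed.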

You’re only launching 1 thread block.

If your GPU has e.g. 15 multiprocessors (GeForce GTX 780 Ti), you will only ever get 1/15th of its peak capacity: only 1/15th of the hardware ever gets to compute anything.
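A minimal sketch of a multi-block launch (the block size of 128 is my own choice): compute a global thread index, guard against the last partially-filled block, and spread the N threads over enough blocks that every multiprocessor gets work:

```cpp
#define THREADS_PER_BLOCK 128

__global__ void helloKernMulti(char *a, char *b, int *c, int d) {
    // Global thread index across all blocks, not just threadIdx.x.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= N) return;   // the last block may be only partially used
    // ... same comparison loop as in the question, using this tid ...
}

// Host side: round up so all N search terms are covered.
int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
helloKernMulti<<<blocks, THREADS_PER_BLOCK>>>(dev_frase, dev_ricerca,
                                              dev_trovati, d);
```

With N = 1000 this launches 8 blocks of 128 threads, which the hardware can distribute across several multiprocessors instead of just one.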