Sorta serial code to GPU


I’m moving a portion of my CPU code to the GPU in order to save the host->device overhead. The code is “simple”; it’s basically serial, since the result for item X depends on items X - sliding_window … X - 1, which were calculated before it.

for ( int iTime = iTimeBegin; iTime < iTimeEnd; iTime++ )
{
	// The +/- 20 is the sliding window and can get up to 100.
	int iTimePlusWindow  = __min( iTime + 20, iTimeEnd - 1 );
	int iTimeMinusWindow = __max( iTime - 20, 0 );
	bool is_best = true;
	float dCurrentCor = pCorTemp[ iTime ];

	for ( int iSliceTime = iTimeMinusWindow; iSliceTime <= iTimePlusWindow; iSliceTime++ )
	{
		if ( m_ppCor_Temp_ca_best[ i_rCre ][ iSliceTime ] > dCurrentCor )
		{
			is_best = false;
		}
	}

	if ( is_best )
	{
		for ( int iSliceTime = iTimeMinusWindow; iSliceTime < iTimePlusWindow; iSliceTime++ )
		{
			// Here I update the same array I use for comparison.
			m_ppCor_Temp_ca_best[ i_rCre ][ iSliceTime ] = pCorTemp[ iSliceTime ];
			m_ppST_Temp_ca_best[ i_rCre ][ iSliceTime ]  = pST_Temp_ca[ iSliceTime ];
		}
	}
}

Is there some sort of parallel algorithm similar to what I do here?

Any assistance would be greatly appreciated… :)



One thing I can think of is that perhaps you can use CUDA to calculate many of these sliding window streams at the same time.

Thanks for the response.

Actually I can’t, because every item depends on the previously calculated items in the array.

So for item 100 I need to iterate over items 80 to 120, while items 80 to 99 have already been calculated (item 80 by looking at items 60 to 100, and item 99 by looking at items 79 to 119).
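To make that dependency concrete, here is a minimal, self-contained C++ sketch of the loop from the first post for a single row. The function name `updateBest` and the `std::vector` parameters are hypothetical stand-ins (`best` for `m_ppCor_Temp_ca_best[ i_rCre ]`, `cur` for `pCorTemp`), the time range is assumed to start at 0, and the second output array is left out for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical single-row version of the posted loop.
// Returns how many time steps were accepted as "best".
int updateBest(std::vector<float>& best, const std::vector<float>& cur, int window)
{
    assert(best.size() == cur.size());
    const int n = static_cast<int>(cur.size());
    int accepted = 0;
    for (int t = 0; t < n; ++t) {
        const int lo = std::max(t - window, 0);
        const int hi = std::min(t + window, n - 1);

        // Compare against values that earlier iterations may have written.
        bool isBest = true;
        for (int s = lo; s <= hi; ++s) {
            if (best[s] > cur[t]) { isBest = false; break; }
        }

        if (isBest) {
            // Same bounds as the original: index hi itself is not overwritten.
            for (int s = lo; s < hi; ++s) best[s] = cur[s];
            ++accepted;
        }
    }
    return accepted;
}
```

With `best` initialised to zeros, `cur = {1, 5, 2, 0, 3}` and a window of 1, step 2 (value 2) is rejected only because step 1 has already written 5 into its window. That is why the steps of one row cannot simply be handed to different threads.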


What I mean is that if you have streams a-z, you can look at item 1 in all streams in parallel, then go to item 2 in all streams in parallel etc.

// Each thread walks its own stream serially; item 0 is assumed to be initialised already.
for (int i = 1; i < 120; i++)
	streams[threadIdx.x][i] = calculateFromLast(streams[threadIdx.x][i - 1]);

Does that make sense? Should get you some parallelism …

I’m not ruling out that the calculation of a single stream can be parallelised. I just don’t know enough to say …

Best of luck!


Thanks for the suggestion. I’m not sure whether this is what you’ve suggested, but I do have an additional loop above the code posted in the first post. I think I’ll try to change the outer loop to run on different threads, while each thread runs the serial code. Maybe that will give me the boost I’m after.

Now I need to write, debug and figure out how to fit all this data into shared memory… :)

CUDA is real fun :)

Thanks again.
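A rough sketch of that plan, written as plain C++ so it can be checked on the host first. It assumes the outer loop iterates over independent rows ("streams") and that the rows are stored back to back in one flat buffer. The names `sweepOneStream` and `sweepAllStreams` and the buffer layout are hypothetical, not taken from the original code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Body that one GPU thread would execute: a serial sweep over a single stream.
// best and cur are flat row-major buffers holding numStreams * n floats.
void sweepOneStream(float* best, const float* cur, int n, int window, int streamIdx)
{
    float* b = best + static_cast<std::size_t>(streamIdx) * n;
    const float* c = cur + static_cast<std::size_t>(streamIdx) * n;
    for (int t = 0; t < n; ++t) {
        const int lo = std::max(t - window, 0);
        const int hi = std::min(t + window, n - 1);
        bool isBest = true;
        for (int s = lo; s <= hi && isBest; ++s)
            isBest = !(b[s] > c[t]);
        if (isBest)
            for (int s = lo; s < hi; ++s)   // same bounds as the posted loop
                b[s] = c[s];
    }
}

// Host-side stand-in for a kernel launch. In CUDA, streamIdx would come from
// blockIdx.x * blockDim.x + threadIdx.x instead of a loop counter.
void sweepAllStreams(std::vector<float>& best, const std::vector<float>& cur,
                     int numStreams, int n, int window)
{
    for (int streamIdx = 0; streamIdx < numStreams; ++streamIdx)
        sweepOneStream(best.data(), cur.data(), n, window, streamIdx);
}
```

Since no stream reads or writes another stream's slice, the iterations of the outer loop can run in any order, which is the property that allows one thread per stream. Whether this particular layout is a good fit for global or shared memory on the device is a separate question that the sketch does not address.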


Yep! It sounds like we’re on the same page; an outer loop fits precisely what I was trying to describe :)

And yes, enjoy :) Hard work, much satisfaction