Deoptimizing Optimization

Hey all.

I’m busy trying to optimize a function in my code and have run into an interesting, unintuitive result.

I am trying to find whether the score of any ‘window’ of my sequence exceeds a threshold. Fortunately the score calculation itself is nicely parallelizable, but the sliding-window part is not. This is on an SM_13 card, a GTX 260.

First, I am wondering whether there is a better way to do the following than dropping down to a single thread, and secondly, why the marked lines SLOW DOWN my loop instead of speeding it up. There is obviously something I am not getting, and I am not adept at reading PTX code. The following snippet has been slightly edited for clarity.

Sequences can be up to 1000 elements long, though usually only around 300. The window is typically 100 elements. Anything shorter than the window size is already filtered out.

// Calculate scores here

// Each thread flags whether its element of the first window scores, then a
// parallel reduction sums the flags to give the initial window score.
uint val = ((threadIdx.x < WINDOWSIZE) && (seq[threadIdx.x] > 0)) ? 1 : 0;
reduction_add<uint>(temp_result, val);
if (threadIdx.x == 0) score = temp_result[0];
__syncthreads();

// Sliding the window is sequential, so only thread 0 does it.
if (threadIdx.x == 0)
{
    for (int i = min(CUDATHREADS, WINDOWSIZE); i < words; i++)
    {
        if (seq[i] > 0)               // element entering the window
            score++;
        if (seq[i - WINDOWSIZE] > 0)  // element leaving the window
            score--;

        maxscore = max(maxscore, score);

        // The following 2 lines actually SLOW execution by 0.3 secs
        if (score >= HEURISTICTHRESHOLD)  // !
            break;                        // !
    }
    maxscore = max(maxscore, score);
}
__syncthreads();

For clarity: I add up the first WINDOWSIZE elements using a parallel reduction, but since every window shift depends on the score of the previous window, it is difficult to parallelize. There are some SM_20 features that might help, but I do not have such a card… yet.
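For illustration, here is a rough, untested sketch of how the window sums could in principle be computed in parallel using an inclusive prefix sum: the window ending at position i has sum P[i] - P[i - WINDOWSIZE]. The kernel name and the result pointer are made up, seq is assumed to hold ints, and it assumes the sequence fits in a single power-of-two-sized block (words <= blockDim.x); longer sequences would need a multi-pass scan.

// Sketch only: all window sums at once via an inclusive prefix sum over the
// per-element flags, then window_sum(i) = P[i] - P[i - windowsize].
__global__ void window_max_kernel(const int *seq, int words,
                                  int windowsize, int *result)
{
    extern __shared__ int prefix[];   // one entry per thread
    int tid = threadIdx.x;

    // Flag = 1 if this element scores, 0 otherwise (or past the end).
    prefix[tid] = (tid < words && seq[tid] > 0) ? 1 : 0;
    __syncthreads();

    // Hillis-Steele inclusive scan in shared memory.
    for (int offset = 1; offset < blockDim.x; offset <<= 1)
    {
        int addend = (tid >= offset) ? prefix[tid - offset] : 0;
        __syncthreads();
        prefix[tid] += addend;
        __syncthreads();
    }

    // Sum of the window ending at tid: P[tid] - P[tid - windowsize].
    int windowsum = 0;
    if (tid < words && tid >= windowsize - 1)
    {
        int before = (tid >= windowsize) ? prefix[tid - windowsize] : 0;
        windowsum = prefix[tid] - before;
    }
    __syncthreads();

    // Max-reduce the window sums (reusing the same shared memory).
    prefix[tid] = windowsum;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            prefix[tid] = max(prefix[tid], prefix[tid + stride]);
        __syncthreads();
    }

    if (tid == 0)
        *result = prefix[0];   // caller compares against HEURISTICTHRESHOLD
}

It would be launched with one block per sequence and blockDim.x * sizeof(int) bytes of dynamic shared memory, e.g. window_max_kernel<<<1, 512, 512 * sizeof(int)>>>(d_seq, words, WINDOWSIZE, d_result). Again, only a sketch of the idea, not something I have run.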

Since that if … break statement represents an additional branch evaluation, it probably isn’t surprising that it slows things down. The other if statements in the loop can be handled by conditional (predicated) execution, which carries hardly any penalty at all.

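To make the distinction concrete, the loop body can be written so that everything except the break is straight-line arithmetic. A rough, untested sketch using the variables from the post above (the comparisons evaluate to 0 or 1 in C/CUDA):

for (int i = min(CUDATHREADS, WINDOWSIZE); i < words; i++)
{
    // These compile to plain arithmetic the hardware can predicate,
    // so there is no data-dependent branch here.
    score += (seq[i] > 0);
    score -= (seq[i - WINDOWSIZE] > 0);
    maxscore = max(maxscore, score);

    // This, by contrast, is a real data-dependent branch that changes
    // the loop's trip count, so it cannot be predicated away.
    if (score >= HEURISTICTHRESHOLD)
        break;
}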