Irregular errors at larger Block sizes

I have no idea anymore. I've been trying to find my error for a week now.

My project deals with genetic sequences, i.e. strings over ‘A’ ‘C’ ‘T’ ‘G’. The first stage of processing decomposes each string into its component words: mark whether AAA is present at the first memory address, whether AAC is present at the second, and so on. It used to count the number of occurrences of each word, but atomic operations are not supported on chars, and a presence flag suits my needs either way. The {A,C,T,G} → {0,1,2,3} conversion happens on the CPU.
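
For reference, that CPU-side conversion is just a byte remap; a minimal sketch (the function name is mine, not from my actual code):

[codebox]// Hypothetical sketch of the CPU-side {A,C,T,G} -> {0,1,2,3} step.
void encodeSequence(unsigned char* seq, unsigned int len)
{
    for(unsigned int i = 0; i < len; i++)
    {
        switch(seq[i])
        {
            case 'A': seq[i] = 0; break;
            case 'C': seq[i] = 1; break;
            case 'T': seq[i] = 2; break;
            case 'G': seq[i] = 3; break;
        }
    }
}[/codebox]

With the letters encoded as base-4 digits, each word maps directly to an output address: for WORDSIZE 3, ACT gives 0*16 + 1*4 + 2 = 6.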

WORDSIZE is the length of the words I’m looking for. In this case I’m looking for words up to length 10.

JOBSIZE is equal to my block count (the grid size), and it's the number of strings I attempt to process at once.

The maximum string length I have in my dataset is 764.

I have a GTX260 with 27 multiprocessors. For this reason I try to keep JOBSIZE a multiple of 27, up to 216 (27*8).

The output data has size 4^WORDSIZE * JOBSIZE bytes: at WORDSIZE 10 that is 4^10 = 2^20 bytes (1 MB) per sequence, so at most 216 MB of allocated output for jobs of size 216, well below the addressable limit.

At WORDSIZE of 7+ I start getting errors; below that, not so much. The errors seem to be concentrated in the last sequence in the list, but not always. They also differ from run to run and show up in different places, giving my app different output every run. When I had irregular errors in the past it was usually due to synchronization issues and race conditions, but I don't believe that is the problem this time.

At WORDSIZE of 10 I managed to stabilize it by making my JOBSIZE 54 and my THREADCOUNT 32. If I increase either beyond this, the errors reappear. Naturally these low numbers are undesirable due to terrible occupancy, which is especially a problem since this is a memory-bandwidth-limited application.

This function currently takes only 1.5% of my application's GPU time, so speed isn't too important, but since its results get reused throughout, their accuracy is.

If anyone can tell me what it is I am doing wrong it would be appreciated.

[codebox]template <typename T>
__device__ void setMemory(T* data, uint len, T val)
{
    // Grid-wide stride: every block participates in clearing the whole buffer.
    int threadpos = threadIdx.x + blockDim.x*blockIdx.x;
    int increment = gridDim.x*blockDim.x;
    while(1)
    {
        if(threadpos >= len)
        {
            break;
        }
        data[threadpos] = val;
        threadpos += increment;
    }
    __syncthreads();
}

template <typename T>
__device__ void copyMemoryt(T* dst, T* src, uint len)
{
    // Block-local stride: each block copies its own sequence into shared memory.
    int threadpos = threadIdx.x;
    while(1)
    {
        if(threadpos >= len)
        {
            break;
        }
        dst[threadpos] = src[threadpos];
        threadpos += blockDim.x;
    }
    __syncthreads();
}

__device__ void singlewordcount(uchar* seq, uint seqlength, uchar* count)
{
    uint words = seqlength - WORDSIZE + 1;
    uint threadpos = threadIdx.x;
    uint address;
    while(1)
    {
        if(threadpos >= words)
            return;
        // Treat the word's letters as base-4 digits to form the output address.
        address = 0;
        for(int j = 0; j < WORDSIZE; j++)
        {
            address += (seq[threadpos+j] << ((WORDSIZE-j-1) << 1));
        }
        //atomicAdd(&count[address],1);
        count[address] = 1;
        threadpos += blockDim.x;
    }
}

__global__ void cWordCounts(uchar* odata, uchar* seqbuffer, uint* posbuffer, uint* lengthbuffer, uint num)
{
    const uint wordcountSize = 1 << (WORDSIZE << 1);  // 4^WORDSIZE
    __shared__ uchar seq[800];  // max sequence length in the dataset is 764
    setMemory(odata, wordcountSize*num, (uchar)0);
    uint blockpos = blockIdx.x;
    while(1)
    {
        if(blockpos >= num)
            break;
        copyMemoryt<uchar>(seq, &seqbuffer[posbuffer[blockpos]], lengthbuffer[blockpos]);
        //singlewordcount(&seqbuffer[posbuffer[blockpos]], lengthbuffer[blockpos], &odata[blockpos*wordcountSize]);
        singlewordcount(seq, lengthbuffer[blockpos], &odata[blockpos*wordcountSize]);
        blockpos += gridDim.x;
    }
}

__host__ void wordCounts_gold(uchar* odata, uchar* seqbuffer, uint* posbuffer, uint* lengthbuffer, uint num)
{
    unsigned long long wordcountSize = 1 << (WORDSIZE << 1); // 4^WORDSIZE
#pragma omp parallel for
    for(int i = 0; i < num; i++)
    {
        unsigned long long seqlength = lengthbuffer[i];
        unsigned long long words = seqlength - WORDSIZE + 1;
        unsigned long long address;
        for(unsigned long long k = 0; k < words; k++)
        {
            address = 0;
            for(unsigned long long j = 0; j < WORDSIZE; j++)
            {
                address += seqbuffer[posbuffer[i]+k+j] * (1 << ((WORDSIZE-j-1) << 1));
            }
            odata[address + i*wordcountSize] = 1;
        }
    }
}[/codebox]
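
For context, the kernel launch (not shown above) has the shape below; treat it as a sketch of my described configuration rather than the exact host code, and the d_* pointer names as illustrative:

[codebox]// Hypothetical launch: JOBSIZE blocks, THREADCOUNT threads per block.
// E.g. the "stable" configuration described above is 54 blocks of 32 threads.
cWordCounts<<<JOBSIZE, THREADCOUNT>>>(d_odata, d_seqbuffer, d_posbuffer, d_lengthbuffer, JOBSIZE);[/codebox]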

Something seems to have changed… either because of the cleanup of the code to post it, or because my GPU is now running hot.

The irregular errors seem to have been severely reduced at JOBSIZE of 216 now. There's still a handful of errors every 10 runs or so, but nothing significant, and it's acceptable enough to work with. I still can't increase my thread count, however, without introducing a lot more errors.

Looks to me like you need to change setMemory() so that each block sets exactly the memory it later uses itself. Otherwise you would need inter-block synchronization, which does not exist.
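
Something like this (untested sketch; setMemoryBlock is my name for it, and the indexing mirrors your copyMemoryt):

[codebox]// Each block clears only the slice of odata it writes afterwards,
// striding by blockDim.x within the block like copyMemoryt() does.
template <typename T>
__device__ void setMemoryBlock(T* data, uint len, T val)
{
    uint threadpos = threadIdx.x;
    while(threadpos < len)
    {
        data[threadpos] = val;
        threadpos += blockDim.x;
    }
    __syncthreads();  // reached by every thread; the loop is not a divergent branch around it
}

// ...then in cWordCounts, drop the grid-wide setMemory() call and put this
// at the top of the per-block loop body, before copyMemoryt():
//     setMemoryBlock<uchar>(&odata[blockpos*wordcountSize], wordcountSize, (uchar)0);
[/codebox]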

Thank you. A REALLY stupid mistake, especially since I had already intentionally avoided that issue with my copyMemory implementation.

All word counts are now correct, irrespective of the number of threads or blocks I launch.

Still having some random errors LATER in the application, but since I know the information I feed it is correct, it's easier to track down. I'm experimenting with inserting __syncthreads() and looking for anywhere else I may have failed to assume block independence.

EDIT: __syncthreads() inside branches is bad bad bad. Everything works now.
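
For the record, the pattern that bit me looks like this (illustrative snippet, not my actual code; activeThreads is a made-up name):

[codebox]// BAD: if the condition is not uniform across the block, some threads
// never reach the barrier, and the block hangs or misbehaves.
if(threadIdx.x < activeThreads)
{
    // ... work ...
    __syncthreads();
}

// Safe: keep the barrier where every thread in the block executes it.
if(threadIdx.x < activeThreads)
{
    // ... work ...
}
__syncthreads();[/codebox]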
