I have no idea anymore. Been trying to find my error for a week now.
My project deals with genetic sequences, i.e. strings over {‘A’, ‘C’, ‘T’, ‘G’}. The first stage of processing decomposes each string into its component words: mark if an AAA is present at the first memory address, mark if an AAC is present at the second, and so on. It used to count the number of occurrences of each word, but atomic operations are not supported on chars, and a presence flag suits my needs either way. The {A,C,T,G} → {0,1,2,3} mapping happens on the CPU.
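For illustration, the host-side mapping step could look something like this minimal sketch (the function and variable names here are my own, not from my actual code):

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical sketch of the CPU-side {A,C,T,G} -> {0,1,2,3} encoding.
// Builds a 256-entry lookup table and translates each base to a 2-bit symbol.
static void encodeSequence(const char* in, unsigned char* out, size_t len)
{
    unsigned char lut[256];
    std::memset(lut, 0, sizeof(lut));
    lut['A'] = 0; lut['C'] = 1; lut['T'] = 2; lut['G'] = 3;
    for (size_t i = 0; i < len; ++i)
        out[i] = lut[(unsigned char)in[i]];
}
```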
WORDSIZE is the length of the words I’m looking for. In this case I’m looking for words up to length 10.
JOBSIZE is equal to my block size; it’s the number of strings I attempt to process at once.
The maximum string length I have in my dataset is 764.
I have a GTX 260 with 27 multiprocessors, so I try to keep JOBSIZE a multiple of 27, up to 216 (27×8).
The output buffer is 4^WORDSIZE × JOBSIZE bytes, so the maximum allocated output memory for jobs of size 216 is 216 MB, well below the addressable limit.
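As a sanity check on that arithmetic (one presence byte per possible word per sequence; the function name is mine, not from my code):

```cpp
#include <cstddef>

// Output buffer size: 4^wordsize possible words, one uchar flag each,
// times jobsize sequences. For wordsize=10, jobsize=216 this is
// 1048576 * 216 bytes = 216 MiB.
size_t outputBytes(unsigned wordsize, unsigned jobsize)
{
    size_t wordcountSize = (size_t)1 << (2 * wordsize); // 4^wordsize
    return wordcountSize * jobsize;
}
```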
At WORDSIZE of 7 and above I start getting errors; below that, not so much. The errors seem to be focused on the last sequence in the list, but not always. They also differ with every run and appear in different places, giving my app different output on every run. When I had irregular errors in the past it was usually due to synchronization issues and race conditions, but I do not believe that is the problem this time.
At WORDSIZE of 10 I managed to stabilize it by making my JOBSIZE 54 and my THREADCOUNT 32. If I increase either beyond this, the errors appear. Naturally these low numbers are undesirable due to terrible occupancy, which is especially a problem since this is a memory-bandwidth-limited application.
This function currently takes only 1.5% of my application’s GPU time, so speed isn’t too important, but since its results get reused throughout the application, their accuracy is.
If anyone can tell me what it is I am doing wrong it would be appreciated.
[codebox]template<class T>
__device__ void setMemory(T* data, uint len, T val)
{
    // Grid-stride fill of data[0..len) with val.
    int threadpos = threadIdx.x + blockDim.x * blockIdx.x;
    int increment = gridDim.x * blockDim.x;
    while (threadpos < len)
    {
        data[threadpos] = val;
        threadpos += increment;
    }
    __syncthreads();
}

template<class T>
__device__ void copyMemoryt(T* dst, T* src, uint len)
{
    // Block-stride copy of src[0..len) into dst (shared memory).
    int threadpos = threadIdx.x;
    while (threadpos < len)
    {
        dst[threadpos] = src[threadpos];
        threadpos += blockDim.x;
    }
    __syncthreads();
}

__device__ void singlewordcount(uchar* seq, uint seqlength, uchar* count)
{
    // Flag every WORDSIZE-length word occurring in seq.
    uint words = seqlength - WORDSIZE + 1;
    uint threadpos = threadIdx.x;
    uint address;
    while (threadpos < words)
    {
        // Pack WORDSIZE base-4 symbols (2 bits each) into one array index.
        address = 0;
        for (int j = 0; j < WORDSIZE; j++)
        {
            address += (seq[threadpos + j] << ((WORDSIZE - j - 1) << 1));
        }
        //atomicAdd(&count[address], 1);
        count[address] = 1;
        threadpos += blockDim.x;
    }
}

__global__ void cWordCounts(uchar* odata, uchar* seqbuffer, uint* posbuffer, uint* lengthbuffer, uint num)
{
    const uint wordcountSize = 1 << (WORDSIZE << 1); // 4^WORDSIZE
    __shared__ uchar seq[800]; // max sequence length is 764
    setMemory(odata, wordcountSize * num, (uchar)0);
    uint blockpos = blockIdx.x;
    while (blockpos < num)
    {
        // Stage the sequence in shared memory, then flag its words.
        copyMemoryt<uchar>(seq, &seqbuffer[posbuffer[blockpos]], lengthbuffer[blockpos]);
        //singlewordcount(&seqbuffer[posbuffer[blockpos]], lengthbuffer[blockpos], &odata[blockpos*wordcountSize]);
        singlewordcount(seq, lengthbuffer[blockpos], &odata[blockpos * wordcountSize]);
        blockpos += gridDim.x;
    }
}

__host__ void wordCounts_gold(uchar* odata, uchar* seqbuffer, uint* posbuffer, uint* lengthbuffer, uint num)
{
    // CPU reference implementation.
    unsigned long long wordcountSize = 1 << (WORDSIZE << 1); // 4^WORDSIZE
    #pragma omp parallel for
    for (int i = 0; i < num; i++)
    {
        unsigned long long seqlength = lengthbuffer[i];
        unsigned long long words = seqlength - WORDSIZE + 1;
        unsigned long long address;
        for (unsigned long long k = 0; k < words; k++)
        {
            address = 0;
            for (unsigned long long j = 0; j < WORDSIZE; j++)
            {
                address += seqbuffer[posbuffer[i] + k + j] * (1 << ((WORDSIZE - j - 1) << 1));
            }
            odata[address + i * wordcountSize] = 1;
        }
    }
}[/codebox]