I get strange intermittent behaviour from a GTX 295.
The first indication was every so often it produced different answers from a Tesla.
Device 0 produces the same answers as the Tesla, while Device 1 sometimes produces
answers that differ by 1 (plus or minus), and once by 2.
Currently I am running identical calculations on device 0 and device 1 of the 295 GTX
and comparing the results they give.
I think the problem may lie in the code at the end of the outer loop which adds
the values of “fitness” produced by different threads together and returns the sum
to the host Linux PC. This is because if each thread writes its own partial sum to
global memory and they are added together on the host the problem is never seen.
Similarly if all the calculations are done by one thread (so there is no reduction phase)
again the problem is never seen.
I had been using Harris' log(n) reduction code, but the following code is simpler
(if less efficient) and shows the problem just as my copy of Harris' code did.
With a block size of one or 32 no errors are seen.
With a block size of 4 or 256 every so often Device 1 returns an answer 1 bigger
or 1 smaller than the answer returned by Device 0.
(offs = 14; was added to see if I could provoke the error into being more frequent.
It made no difference.)
[codebox]extern __shared__ unsigned int shared_array[];
…
static __global__ void evalpopKernel(
int *d_Output,
unsigned char* Pop,
const int prog0,
unsigned int N,
int gen, int ntrain, const unsigned int npar2,
const int order, const int LEN)
{
const int npar = 1<<npar2; //number of parallel fitness cases
const int ntrainhalf = ntrain>>1;
const unsigned int first_pid = (MUL(blockDim.x, blockIdx.x) + threadIdx.x)>>npar2;
const unsigned int threadN = MUL(blockDim.x, gridDim.x) >> npar2;
for(unsigned int prog = first_pid; prog < N; prog += threadN){
int fitness = 0;
for(unsigned int index = threadIdx.x & (npar-1); index < ntrain; index+=npar){
...lots of access to shared memory...
fitness += stuff...
}//endfor testcases
__syncthreads(); //wait until everyone has finished with shared memory
int* shared = (int*) shared_array;
const int offs = 14;
shared[offs+threadIdx.x] = fitness;
__syncthreads();
if(threadIdx.x==0) { //assume only one prog per block
int all = 0;
for(int i = 0; i<npar ;i++) all += shared[offs+i];
d_Output[prog] = all;
}
__syncthreads(); //don't let other threads reuse shared memory until all are ready
}
}
…
evalpopKernel<<<grid_size, block_size, block_size*15*sizeof(unsigned int)>>>(d_Output, Pop, 0, 262144, gen, 256, npar2, 37, 1024);
[/codebox]
npar2 is set to log2(block_size), e.g. for block_size=32 npar2 will be 5.
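That relationship can be checked with a trivial helper (assuming block_size is a power of two, as it is here; the function name is mine):

```c
/* npar2 = log2(block_size) for power-of-two block sizes,
   so that 1 << npar2 recovers npar == block_size. */
static unsigned int log2u(unsigned int block_size)
{
    unsigned int npar2 = 0;
    while ((1u << npar2) < block_size)
        npar2++;
    return npar2;
}
```

For example log2u(32) gives 5 and log2u(256) gives 8, matching the block sizes tried above.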
grid_size is set so that grid_size*block_size = 262144 (capped at the 65535 maximum).
With a block_size of 32, the kernel takes up to 57 seconds.
The max temps observed are 67C (device 0) and 70 C (device 1).
Cf. http://forums.nvidia.com/lofiversion/index.php?t99422.html
Has anybody seen any thing similar?
On GTX 295? Or other GPU?
Could this be a hardware failure?
Any suggestions for checking the card itself?
Any help or comments would be most welcome.
Bill
Dr. W. B. Langdon,
Department of Computer Science,
King's College London,
Strand, London, WC2R 2LS, UK
http://www.dcs.kcl.ac.uk/staff/W.Langdon/
FOGA 2011 http://www.sigevo.org/foga-2011/
CIGPU 2010 http://www.cs.ucl.ac.uk/external/W.Langdon/cigpu
A Field Guide to Genetic Programming
http://www.gp-field-guide.org.uk/
RNAnet http://bioinformatics.essex.ac.uk/users/wlangdon/rnanet
GP EM http://www.springer.com/10710
GP Bibliography http://www.cs.bham.ac.uk/~wbl/biblio/