Kernel runs successfully a few times, then crashes

I’ve been working on a bug in my program for a long time, and I still can’t figure it out; it just gets more and more bizarre.

I’m writing a program that runs the CUDA SDK Mersenne Twister on the device, then passes the pointer to another kernel that we wrote which uses the random numbers generated by the Mersenne Twister.

The program is able to iterate 3-4 times, then crashes, telling me either ‘no CUDA-capable device available’ or ‘invalid device symbol.’ The error is reported when I call the function that seeds the RNG (which uses cudaMemcpyToSymbol), but I suspect the error originates somewhere else.
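
In case it helps narrow this down, here is a minimal sketch of the kind of error checking I could wrap around each runtime call (the CUDA_CHECK macro is just something I made up for this post; cudaGetLastError and cudaThreadSynchronize are standard runtime API calls):

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file/line info if a CUDA call fails.
#define CUDA_CHECK(call)                                           \
	do {                                                           \
		cudaError_t err = (call);                                  \
		if (err != cudaSuccess) {                                  \
			fprintf(stderr, "CUDA error '%s' at %s:%d\n",          \
			        cudaGetErrorString(err), __FILE__, __LINE__); \
			exit(EXIT_FAILURE);                                    \
		}                                                          \
	} while (0)

// After a kernel launch, check both the launch and the execution:
//   myKernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());       // launch-configuration errors
//   CUDA_CHECK(cudaThreadSynchronize());  // errors during kernel execution

int main(void)
{
	float* d = 0;
	CUDA_CHECK(cudaMalloc((void**)&d, 1024 * sizeof(float)));
	CUDA_CHECK(cudaFree(d));
	printf("all CUDA calls succeeded\n");
	return 0;
}
[/codebox]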

At first I thought this must be a memory leak of some kind, but I’ve been over the code a thousand times and can’t find any instance of my allocating memory that I don’t free. I even tried page-locking the host memory that I copy the results of the program back into.

Also, I tried running just one iteration (launching the kernel only once), and it works fine no matter how many times I try it in a row. (I relaunch the program from the terminal each time it returns.)

I think maybe I need to do something to start a new GPU context each time the program iterates. I tried, but I can’t seem to do it right. Could anyone give me ideas or pointers as to what my problem might be, and maybe give a tip or two about context management?
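
For reference, here is roughly what I tried at the end of each iteration (just a sketch, not our real code; my understanding is that cudaThreadExit() destroys the current host thread’s context and the next runtime call implicitly creates a fresh one):

[codebox]#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
	for (int k = 0; k < 3; k++) {
		float* d = 0;
		cudaMalloc((void**)&d, 1024 * sizeof(float)); // implicitly (re)creates a context
		/* ... kernel launches and copies would go here ... */
		cudaFree(d);
		cudaThreadExit(); // tear down this thread's context before the next pass
		printf("iteration %d done\n", k);
	}
	return 0;
}
[/codebox]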

Do you use multiple GPUs in your code?

If you post your code, I will try it out on my machine.

I had a similar problem, but it came from multiple-GPU usage.

-gshi

The plan is to have it run on multiple GPUs eventually, but it crashes on its own without my multi-GPU code ever being used. I’m posting a test driver I wrote that runs the kernel multiple times in a simple for() loop, the way our program eventually will (albeit with the same initial parameters each time). It runs successfully and prints out the result a few times, then crashes.

We run it on a system using CentOS and Tesla GPUs.

[codebox]//trial driver for the ported calc-flux routine

#include <stdio.h>

#include <stdlib.h>

#include <math.h>

#include <time.h>

#include <cuda.h>

//Needed for RNG

//NOTE: MersenneTwister.h and MersenneTwister_kernel.cu are found in the CUDA SDK

#include "MersenneTwister.h"

#include "MersenneTwister_kernel.cu"

#include "cutil_inline.h"

#include "cutil.h"

__constant__ int M = 		1000;		//number of particles

__constant__ double Vo =	0.0002;		//Random velocity of particles

__constant__ double Vc_o = 	0.0;		//Drift velocity of particles

extern __shared__ double array[];

__global__ void calcFlux (double cg, double dcdx, long rank, double* result, float* rand_d, long rand_n){

/* This routine takes a given concentration and concentration gradient

 and calculates the flux by creating particles and letting them move

 randomly.  This routine has its own computational domain, the box is

 assumed to be centered at the origin and of length 1, and the buffer

 extends on either side of the box.  This routine is originally based on

 work by Victor. */



/* This is the slowest part of the code.  Maybe someday write this in 

 assembly to save time?  */



/* Separate functions should not be made for the random number

 generator and the velocity function; doing so significantly slows down the

 code. I've tried lots of different tweaks and this seems to be as fast as I

 can get it to go. Finding a fast way to compute random numbers would help

 the most. */



/*****************************************************************/

/*****************************************************************/
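
/* Carve the dynamically allocated shared memory ("array") into named

 regions: a handful of scalars, the per-step flux record jxt[nSteps],

 four M-length particle/velocity arrays, and per-thread scratch space

 (jtemp, subjxt). These offsets must stay in sync with the smBytes

 computation in main(). */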

double* Length = (double*) array;

double* V = (double*) &Length[1];

double* jxt = (double*) &V[1];

double* x_even = (double*) &jxt[nSteps];

double* x_odd = (double*) &x_even[M];

double* vx_even = (double*) &x_odd[M];

double* vx_odd = (double*) &vx_even[M];

double  xtemp; 

double* dNL = (double*) &vx_odd[M];

double* dNR = (double*) &dNL[1];

double* temp = (double*) &dNR[1];

double* rem_dNL = (double*) &temp[1];

double* rem_dNR = (double*) &rem_dNL[1];

double* Vc = (double*) &rem_dNR[1];

double* BL = (double*) &Vc[1];

double* BR = (double*) &BL[1];

double* xL = (double*) &BR[1];

double* vL = (double*) &xL[1];

double* xR = (double*) &vL[1];

double* vR = (double*) &xR[1];

double new_x;

int* N = (int*) &vR[1];

int t;

int i;

int NL;

int NR;

int* bottom_dNL = (int*) &N[1];

int* bottom_dNR = (int*) &bottom_dNL[1];

int* xcount_even = (int*) &bottom_dNR[1];

int* xcount_odd = (int*) &xcount_even[1];

int u;

int v;

int w;

int A;

int B;

double* jtemp = (double*) &xcount_odd[1];

double vtemp;

double* subjxt= (double*) &jtemp[120];



*N = M / 2;

/* This sets the random initial string; it must be inside calcFlux, otherwise

 all processors will give the same numbers */

/* The random string is generated using clock, which varies more quickly

 than time, and the rank of the processor. The multiplicative factors were

 picked without any special purpose */



int ind = (rand_n/blockDim.x)*threadIdx.x;	



/* This is the defining length scale of the micro box, Vo should be less

 than the maximum of this */

*Length = (double) *N / cg;



/* This normalizes the velocity so it is measured relative to the box size. */         

*V = Vo / *Length;  

*Vc = Vc_o / *Length; 



/* This calculates the number of particles in each buffer, as a decimal,

 and their remainder */

*BL = *V + *Vc;

*BR = *V - *Vc;

*dNL = cg* (*BL)*(*Length) - 0.5 * (*Length) * (*Length) * dcdx * (*BL+ (*BL)*(*BL));

*dNR = cg* (*BR)*(*Length) + 0.5 * (*Length) * (*Length) * dcdx * (*BR- (*BR)*(*BR));

*bottom_dNL = floor(*dNL);

*bottom_dNR = floor(*dNR);

*rem_dNL = *dNL - (double) *bottom_dNL;

*rem_dNR = *dNR - (double) *bottom_dNR;



/* This takes half of the particles and places them randomly within the box */
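
/* Partition the *N particles across the 120 worker threads: u = *N / 120

 particles per thread, with w = *N % 120 left over. The first w threads

 take v = u + 1 particles each; threads w..119 take u each. [A, B) is

 this thread's contiguous index range. */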



u = *N / 120;

v = u + 1;

w = *N % 120;

if (threadIdx.x < w)

{

	A = v * threadIdx.x;

	B = v * threadIdx.x + v;

}

else if (threadIdx.x < 120)

{

	A = w + u * threadIdx.x;

	B = w + u * threadIdx.x + u;

}

	for (i = A; i < B; i++) {

		x_odd[i] = rand_d[ind++] - 0.5;

	}

		

*xcount_odd = *N - 1;



for (t = 0; t < nSteps; t++) {
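
	/* Particle positions ping-pong between the x_odd and x_even arrays:

	 even steps read x_odd and write x_even; odd steps do the reverse. */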

	jtemp[threadIdx.x] = 0;

	

	

	if ( t % 2 == 0)

	{

		*xcount_even = -1;

		/*  Move the particles in the box */

		u =(*xcount_odd + 1) / 120;

		v = u + 1;

		w = (*xcount_odd + 1) % 120;

		

		if (threadIdx.x < w){

			A = v * threadIdx.x;

			B = v * threadIdx.x + v;

			}

		else if (threadIdx.x < 120){

			A = w + u * threadIdx.x;

			B = w + u * threadIdx.x + u;

			}

		for (i = A; i < B; i++) {

	 		/* Randomly assigns positive or negative velocity */

			vtemp = *Vc+*V*(2.0*( (int) (2.0*rand_d[ind++])) - 1.0);

			//random_count++;

			xtemp = x_odd[i] + vtemp;

			/* If the particle leaves the box, then flux is only proportional 

			 to the amount of time spent in the box */          

			if (xtemp > 0.5) {

				jtemp[threadIdx.x] += 0.5 - x_odd[i];

			}

			else if  (xtemp < -0.5){   

				jtemp[threadIdx.x] += -0.5 - x_odd[i];

			}

			/* If the particle remains in the box, then whole flux recorded */

			else {

				atomicAdd(xcount_even, 1);

				x_even[*xcount_even]= xtemp;

				//*xcount_even++;

				jtemp[threadIdx.x] += vtemp;

			}

		} //End for

		

		/*  Fill left buffer and take a step  */

		if ( rand_d[ind++] < *rem_dNL){

			NL = *bottom_dNL + 1;

		}

		else  {

			NL =  *bottom_dNL;

		}

		

		u = NL / 120;

		v = u + 1;

		w = NL % 120;

		if (threadIdx.x < w){

			A = v * threadIdx.x;

			B = v * threadIdx.x + v;

			}

		else if (threadIdx.x < 120){

			A = w + u * threadIdx.x;

			B = w + u * threadIdx.x + u;

			}

		for (i = A; i < B; i++) {

			new_x = -rand_d[ind++]*(*BL) + *Vc+*V*(2.0*( (int) (2.0*rand_d[ind++])) - 1.0);

			if ( new_x > 0)  {

				atomicAdd(xcount_even, 1);

				x_even[*xcount_even] = new_x - 0.5;

				jtemp[threadIdx.x] += new_x;

				//*xcount_even++;

			}

		} // End for

		

		/* Fill the right buffer and take a step  */

		if (rand_d[ind++] < *rem_dNR){

			NR = *bottom_dNR + 1;

		}

		else {

			NR = *bottom_dNR;

		}

		u = NR / 120;

		v = u + 1;

		w = NR % 120;

		

		if (threadIdx.x < w){

			A = v * threadIdx.x;

			B = v * threadIdx.x + v;

			}

		else if (threadIdx.x < 120){

			A = w + u * threadIdx.x;

			B = w + u * threadIdx.x + u;

			}

		for (i = A; i < B; i++) {

			new_x = rand_d[ind++]*(*BR) + *Vc+*V*(2.0*( (int) (2.0*rand_d[ind++])) - 1.0);

			if ( new_x < 0) {

				atomicAdd(xcount_even, 1);

				x_even[*xcount_even] = new_x + 0.5;

				jtemp[threadIdx.x] += new_x;

				//*xcount_even++;

			}

		} //End for

		

		/* Calculates the flux at each microscopic time step */

		__syncthreads();

		jxt[t] = 0;

		if (threadIdx.x == 0){

			for (i=0 ; i < 120; i++){

				jxt[t] += jtemp[i];

			}

		}

	}//End even

	

	if ( t % 2 == 1)

	{

		*xcount_odd = -1;

		/*  Move the particles in the box */

		u = (*xcount_even + 1) / 120;

		v = u + 1;

		w = (*xcount_even + 1) % 120;

		if (threadIdx.x < w){

			A = v * threadIdx.x;

			B = v * threadIdx.x + v;

			}

		else if (threadIdx.x < 120){

			A = w + u * threadIdx.x;

			B = w + u * threadIdx.x + u;

			}

	

		for (i = A; i < B; i++) {

			/* Randomly assigns positive or negative velocity */

			vtemp = *Vc+*V*(2.0*( (int) (2.0*rand_d[ind++])) - 1.0);

			xtemp = x_even[i] + vtemp;

			/* If the particle leaves the box, then flux is only proportional 

			 to the amount of time spent in the box */          

			if (xtemp > 0.5) {

				jtemp[threadIdx.x] += 0.5 - x_even[i];

			}

			else if  (xtemp < -0.5){   

				jtemp[threadIdx.x] += -0.5 - x_even[i];

			}

			/* If the particle remains in the box, then whole flux recorded */

			else {

				atomicAdd(xcount_odd, 1);

				x_odd[*xcount_odd] = xtemp;

				//*xcount_odd++;

				jtemp[threadIdx.x] += vtemp;

			}

		} //End for

		

		/*  Fill left buffer and take a step  */

		if ( rand_d[ind++] < *rem_dNL){

			NL = *bottom_dNL + 1;

		}

		else  {

			NL =  *bottom_dNL;

		}

		

		u = NL / 120;

		v = u + 1;

		w = NL % 120;

		if (threadIdx.x < w){

			A = v * threadIdx.x;

			B = v * threadIdx.x + v;

			}

		else if (threadIdx.x < 120){

			A = w + u * threadIdx.x;

			B = w + u * threadIdx.x + u;

			}

		for (i = A; i < B; i++) {

			new_x = -rand_d[ind++]*(*BL) + *Vc+*V*(2.0*( (int) (2.0*rand_d[ind++])) - 1.0);

			if ( new_x > 0)  {

				atomicAdd(xcount_odd, 1);

				x_odd[*xcount_odd] = new_x - 0.5;

				jtemp[threadIdx.x] += new_x;

				//*xcount_odd++;

			}

		} // End for

		

		/* Fill the right buffer and take a step  */

		if (rand_d[ind++] < *rem_dNR){

			NR = *bottom_dNR + 1;

		}

		else {

			NR = *bottom_dNR;

		}

		

		u = NR / 120;

		v = u + 1;

		w = NR % 120; 			

		

		if (threadIdx.x < w){

			A = v * threadIdx.x;

			B = v * threadIdx.x + v;

			}

		else if (threadIdx.x < 120){

			A = w + u * threadIdx.x;

			B = w + u * threadIdx.x + u;

			}

		for (i = A; i < B; i++) {

			new_x = rand_d[ind++]*(*BR) + *Vc+*V*(2.0*( (int) (2.0*rand_d[ind++])) - 1.0);

			if ( new_x < 0) {

				atomicAdd(xcount_odd, 1);

				x_odd[*xcount_odd] = new_x + 0.5;

				jtemp[threadIdx.x] += new_x;

				//*xcount_odd++;

			}

		} //End for

		

		__syncthreads();

		/* Calculates the flux at each microscopic time step */

		jxt[t] = 0;

		if (threadIdx.x == 0){

			for (i=0 ; i < 120; i++){

				jxt[t] += jtemp[i];

			}

		}

		

	}//End odd

	

	__syncthreads();

} /* End of loop over time steps */



/* Only record the flux for the later part of the run; this is the reason

 why nSteps must be a multiple of 10 */

*temp = 0.0;

u = (9 * (nSteps/10)) / 120;

v = u + 1;

w = (9 * (nSteps/10)) % 120;

if (threadIdx.x < w){

	A = v * threadIdx.x + (nSteps/10);

	B = v * threadIdx.x + v + (nSteps/10);

}

else if (threadIdx.x < 120){

		A = w + u * threadIdx.x + (nSteps/10);

		B = w + u * threadIdx.x + u + (nSteps/10);

	}

subjxt[threadIdx.x] = 0;

for (t = A; t < B; t++) {

	subjxt[threadIdx.x] += jxt[t];

}

if (threadIdx.x == 0) {

	for (i = 0; i < 120; i++) {

		*temp += subjxt[i];

	}

/* Makes flux independent of number of repetitions */

	*temp /= 0.9*(double)nSteps;

}



*result = *temp;

} /* End of calcFlux */

//this is the test-driver (runs on the host)

int main(void)

{

printf("__Multiscale Particle Problem - Microproblem Test 1.6__\n");

printf("   By Vishal Soni and E David Ballard -- Summer 2009\n");

printf("            Based on code by Alex Lang\n\n");



printf("*Generating Random Numbers*\n");

printf("Allocating Memory...\n");

float* rand_d;

const int PER_RNG = 40000;

const int Nblocks = 32;     //THIS NEEDS TO DIVIDE EVENLY INTO # OF CORES

const int threads_per_block = 256;

for(int k = 0; k < 12; k++) {

long rand_n = Nblocks * threads_per_block * PER_RNG;





cudaMalloc((void **)&rand_d, rand_n * sizeof(float));



printf("Loading Mersenne Twister Config...\n");

loadMTGPU("MersenneTwister.dat");



printf("Seeding Random Number Generator...\n");

const unsigned int DEV_NUMBER = 1;
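
// seedMTGPU() (from the SDK) copies the seeded twister state into device

// constant memory with cudaMemcpyToSymbol -- this is the call that reports

// the error described above.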

if(k == 0) seedMTGPU((DEV_NUMBER + 2)*47 * DEV_NUMBER);



printf("Starting Kernel...     ");	

RandomGPU<<<Nblocks, threads_per_block>>>(rand_d, PER_RNG);



printf("Success!\n");	



printf("*Running Microscale Random-Walkers Simulation*\n");	



const int NB = 1; //number of blocks per card

double *a_h, *a_d;



printf("Initializing...\n");

size_t size = sizeof(double) * NB;

a_h = (double*)malloc(size);

cudaMalloc((void**)&a_d,size);



int block_size = 240;

int n_blocks = NB;



//Stand-in variables necessary for test

double cg 		= -2.165503000;

double dcdx 		=  1.49384400000;

long rank		= 1;



printf("Starting GPU on-device process...   ");

size_t smBytes = (150 + M * 4 + nSteps) * sizeof(double);



calcFlux<<<n_blocks,block_size, smBytes>>>(cg, dcdx, rank, a_d, rand_d, rand_n);

printf("Complete.\nCopying result to host...\n");

cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);

printf("Printing result...\n");

printf("Flux: %d\n", a_h);

free(a_h); cudaFree(a_d); cudaFree(rand_d);

printf("Done.    %d\n",k);

}

}

[/codebox]

It’s pretty long, even in this simplified form. Sorry it took me a while to post it.