CURAND Crashing Problem

LucasCampos · February 25, 2011, 4:32pm

Hello there. I’m trying to use CURAND in a program of mine, but whenever I call curandGenerateNormal, I get a segmentation fault. Any ideas?

int main() {

	float *random; 

	curandGenerator_t gen;

...

nrand=N*2; //N is defined elsewhere. I often use N around 1000

size_t memSizeRandom=nrand*sizeof(float);

...

cudaMalloc( (void **) &random, memSizeRandom);

...

	CURAND_CALL(curandCreateGenerator(&gen,CURAND_RNG_PSEUDO_DEFAULT)); 

	CURAND_CALL(curandSetPseudoRandomGeneratorSeed(gen,(rand()/RAND_MAX)));

...

			cout << "Call Random" << endl;

			curandGenerateNormal(gen, random, nrand,0.0,raizdatemp); //raizdatemp is defined elsewhere

...

}

The code compiles fine with nvcc -lcurand, but when I run the program, I get:

lucas@sonic1:~$ nvcc -lcurand sample.cu 

lucas@sonic1:~$ ./a.out 

Particles: 512

Pinnings: 16

Box: 512

Sigma: 1.5

Eps: 1.

dt: 0.0001

Pinning's force: 1

Pinning's radius: 1

Ok para cudaMalloc

Ok para cudaMemCpy

Ok para Generate1

9

Call Random

Segmentation fault

lucas@sonic1:~$

I’m running 32-bit Ubuntu 10.10 with CUDA Toolkit 3.2 and developer driver 260.19.26, on a GTX 465.

LucasCampos · February 25, 2011, 5:41pm

I made a test with a smaller system. These are the results:

lucas@sonic1:~$ nvcc -lcurand sample.cu 

lucas@sonic1:~$ ./a.out 

Particles: 512

Pinnings: 16

Box: 512

Sigma: 1.5

Eps: 1.

dt: 0.0001

Pinning's force: 1

Pinning's radius: 1

Ok para cudaMalloc

Ok para cudaMemCpy

Ok para nblocks

9

Call Random

Segmentation fault

lucas@sonic1:~$ ./a.out 

Particles: 4

Pinnings: 1

Box: 8

Sigma: .7

Eps: .7

dt: 0.0001

Pinning's force: 1

Pinning's radius: 1

Ok para cudaMalloc

Ok para cudaMemCpy

Ok para nblocks

9

1.00005

Several "Call Random"

8

1.00005

Several "Call Random" 

7

1.00005

Several "Call Random" 

6

1.00005

Several "Call Random" 

5

1.00005

Several "Call Random" 

4

1.00005

Several "Call Random" 

3

1.00005

Several "Call Random" 

2

1.00005

Several "Call Random" 

1

1.00005

Several "Call Random" 

0

1.00005

,

It’s interesting that the smaller system ran OK, but the larger one did not.

NathanW · February 26, 2011, 12:08am

Is it possible for you to post the code you are running? I don’t see anything obviously wrong in the snippet you posted (other than that you should check the return values of cudaMalloc() and curandGenerateNormal()).

LucasCampos · February 27, 2011, 4:03pm

OK, here it is

#include <iostream>

#include <fstream>

#include <cuda.h>

#include <curand.h>

#include <cuda_runtime_api.h>

#include <curand_kernel.h>

using namespace std;

#define CUDA_CALL(x) do { if((x) != cudaSuccess) { \

printf("Error at %s:%d\n",__FILE__,__LINE__);\

return EXIT_FAILURE;}} while(0)

#define CURAND_CALL(x) do { if((x) != CURAND_STATUS_SUCCESS) { \

printf("Error at %s:%d\n",__FILE__,__LINE__);\

return EXIT_FAILURE;}} while(0)

__global__ void forca (float2 *pos, float2 *acc, float box, int N, float sigma2, float eps){

	float2 del;

	float r2;

	int i=blockIdx.x*blockDim.x+threadIdx.x;

		for (int j=0; j < N; j++) {

			if ((i != j) && (i < N) && (j<N)){

				del.x=pos[i].x-pos[j].x;

				del.y=pos[i].y-pos[j].y;

				if (del.x > box/2) {

					del.x -= box;

				}

				else if (del.x < -box/2) {

					del.x += box;

				}

				if (del.y > box/2) {

					del.y -= box;

				}

				else if (del.y < -box/2) {

					del.y += box;

				}

				r2=del.x*del.x+del.y*del.y;

				acc[i].x+=24*eps*(2*pow((sigma2/r2),6)-pow((sigma2/r2),3))*del.x/r2	;

				acc[i].y+=24*eps*(2*pow((sigma2/r2),6)-pow((sigma2/r2),3))*del.y/r2;

			}

		}

}

__global__ void forcapinnin(float2 *pos, float2 *pospinn, float2 *acc, float forcamax, float raiopinn, int N, int npinnings, float box){

	

	int i=blockIdx.x*blockDim.x+threadIdx.x;

	float2 del;

	float r2,r;

	if (i < N) {	

		for (int j=0; j<npinnings; j++){

			

				del.x=pos[i].x-pospinn[j].x;

				del.y=pos[i].y-pospinn[j].y;

				if (del.x > box/2) {

					del.x -= box;

				}

				else if (del.x < -box/2) {

					del.x += box;

				}

				if (del.y > box/2) {

					del.y -= box;

				}

				else if (del.y < -box/2) {

					del.y += box;

				}

			r2=del.x*del.x+del.y*del.y;

			r=sqrt(r2);

			

			if (( r < raiopinn) && ( r !=0)){

				acc[i].x+=del.x*forcamax*(r-raiopinn)/(r*raiopinn);

				acc[i].y+=del.y*forcamax*(r-raiopinn)/(r*raiopinn);

			}

		}

	}

}

			

__global__ void mover (float2 *pos, float2 *acc, float box, float dt, int N,float *randomgauss,float *randomuni) {

	int i=threadIdx.x+blockDim.x*blockIdx.x;

	if (i <N) {

		// Read the random numbers only after the bounds check; 3.14159 approximates pi.

		float dr=randomgauss[i],dtet=randomuni[i]*2*3.14159f;

		pos[i].x+=cos(dtet)*dr+acc[i].x*dt;

		pos[i].y+=sin(dtet)*dr+acc[i].y*dt;

		

		acc[i].x=0;

		acc[i].y=0;

		if (pos[i].x > box) {

			pos[i].x -= box;

		}

		else if (pos[i].x < 0) {

			pos[i].x += box;

		}

		if (pos[i].y > box) {

			pos[i].y -= box;

		}

		else if (pos[i].y < 0) {

			pos[i].y += box;

		}

	}

} 

void checkCUDAError(const char *msg)

{

	cudaError_t err = cudaGetLastError();

	if( cudaSuccess != err) 

	{

 	fprintf(stderr, "Cuda error: %s: %s.\n", msg, 

 	cudaGetErrorString( err) );

		system("pause");

 	exit(EXIT_FAILURE);

	} 	

}

int main(void){

	float2 *pos, *acc,*pospinn;

	float2 *pos_d, *acc_d,*pospinn_d;

	int N,i,counter=0;

	float box,dt, t=0.0,eps,sigma2,forcamax,raiopinn,sigma;

	size_t nrand,npinnings;

	float *randomgauss,*randomuni;

	curandGenerator_t gen;

	srand(45645645);

	std::cout << "Particles: ";

	std::cin >> N;

	std::cout << "Pinnings: ";

	std::cin >> npinnings;

	std::cout << "Box: ";

	std::cin >> box;

	std::cout << "sigma: ";

	std::cin >> sigma;

	std::cout << "Eps: ";

	std::cin >> eps;

	std::cout << "dt: ";

	std::cin >> dt;

	std::cout << "Pinning's force: ";

	std::cin >> forcamax;

	std::cout << "Pinning's radius: ";

	std::cin >> raiopinn;

	nrand=N;

	sigma2=sigma*sigma;

	

	size_t memSize=N*sizeof(float2);

	size_t memSizeRandom=nrand*sizeof(float);

	size_t memSizePinn=npinnings*sizeof(float2);

	pos=(float2 *) malloc (memSize);

	acc=(float2 *) malloc (memSize);

	pospinn=(float2 *) malloc (memSizePinn);

	cudaMalloc( (void **) &pos_d, memSize );

checkCUDAError("Malloc pos_d");

	cudaMalloc( (void **) &acc_d, memSize );

checkCUDAError("Malloc acc_d");

	cudaMalloc( (void **) &randomgauss, memSizeRandom);

checkCUDAError("Malloc randomgauss");

	cudaMalloc( (void **) &randomuni, memSizeRandom);

checkCUDAError("Malloc randomuni");

	cudaMalloc( (void **) &pospinn_d, memSizePinn);

checkCUDAError("Malloc pospinn_d");

cout << "Ok para cudaMalloc" << endl;

	CURAND_CALL(curandCreateGenerator(&gen,CURAND_RNG_PSEUDO_DEFAULT));

	CURAND_CALL(curandSetPseudoRandomGeneratorSeed(gen,rand()));

	for (i=0; i<N; i++){

		pos[i].x=(rand()/(float)RAND_MAX)*box;

		pos[i].y=(rand()/(float)RAND_MAX)*box;

		pospinn[i].x=(rand()/(float)RAND_MAX)*box;

		pospinn[i].y=(rand()/(float)RAND_MAX)*box;

		acc[i].x=0;

		acc[i].y=0;

	}

	cudaMemcpy( pos_d, pos, memSize, cudaMemcpyHostToDevice );

	checkCUDAError("Memcpy pos");

	cudaMemcpy( acc_d, acc, memSize, cudaMemcpyHostToDevice );

	checkCUDAError("Memcpy acc");

	cudaMemcpy( pospinn_d, pospinn, memSizePinn, cudaMemcpyHostToDevice ); // copy all npinnings float2 entries

	checkCUDAError("Memcpy pospinn");

	cout << "Ok para cudaMemCpy" << endl;

	ofstream trans,inicial,final,pinnings;

	trans.open ("trans.dat");

	inicial.open("inicial.dat");

	pinnings.open("pinnings.dat");

	for (i=0;i<N;i++) {

				inicial << pos[i].x;

				inicial << " ";

				inicial << pos[i].y;

				inicial << " 	" <<endl;

				}

	inicial.close();

	for (i=0;i<N;i++) {

				pinnings << pospinn[i].x;

				pinnings << " ";

				pinnings << pospinn[i].y;

				pinnings << " 	" <<endl;

				}

	pinnings.close();

	int nblocksforca=N/256 + (N%256 == 0? 0:1);

	int nblocksmove=N/256 + (N%256 == 0? 0:1);

	cout << "Ok para nblocks" << endl;

	float temp=1000.0,dtemp=-1.,raizdatemp;

	

	while (temp > 0.001) {

		temp+=dtemp;

		raizdatemp=sqrt(2*temp)*dt;

		t=0.0;

		cout << temp << endl;

		while (t<= 1.0){

			t+=dt;

			counter+=1;

//			cout << "Call randomuni" << endl;

			curandGenerateUniform(gen, randomuni, nrand);

	

//			cout << "Finda randomuni" << endl;			

//			cout << "Call randomgauss" << endl;

			curandGenerateNormal(gen, randomgauss, nrand,0.0,raizdatemp);

	

//			cout << "Finda randomgauss" << endl;

			forca <<< nblocksforca, 256 >>> (pos_d, acc_d, box, N, sigma2, eps);

			checkCUDAError("forca");

			forcapinnin <<< nblocksforca, 256 >>> (pos_d, pospinn_d, acc_d, forcamax, raiopinn, N, npinnings, box);

			checkCUDAError("forcapinn");

			mover <<< nblocksmove, 256 >>> (pos_d, acc_d, box, dt, N, randomgauss,randomuni);

			checkCUDAError("mover");

			if (counter % 10000 == 0) {

				cout << t << endl;

				cudaMemcpy( pos, pos_d, memSize, cudaMemcpyDeviceToHost );

				checkCUDAError("Memcpy pos2");

				for (i=0;i<N;i++) {

					trans << pos[i].x;

					trans << " ";

					trans << pos[i].y;

					trans << " 	" <<endl;

					}

				trans << endl;

			}

		}

	}

	trans.close();

	final.open("final.dat");

	

	cudaMemcpy( pos, pos_d, memSize, cudaMemcpyDeviceToHost );	

	

	for (i=0;i<N;i++) {

				final << pos[i].x;

				final << " ";

				final << pos[i].y;

				final << " 	" <<endl;

				}

	final.close();

	CURAND_CALL(curandDestroyGenerator(gen));

	return 0;

}

sample.cu (6.98 KB)

NathanW · February 28, 2011, 7:54pm

Thanks for posting the code.

I think you might have some corrupted memory.

I compiled and ran the program on my Ubuntu desktop with a C2050, reproduced the segfault with the parameters in your first post. Then I ran it in valgrind and saw:

…

==2249== Invalid write of size 4

==2249== at 0x401E7F: main (sample.cu:194)

==2249== Address 0x67e7440 is 0 bytes after a block of size 128 alloc'd

==2249== at 0x4C274A8: malloc (vg_replace_malloc.c:236)

==2249== by 0x401C69: main (sample.cu:171)

==2249==

==2249== Invalid write of size 4

==2249== at 0x401EBA: main (sample.cu:195)

==2249== Address 0x67e7444 is 4 bytes after a block of size 128 alloc'd

==2249== at 0x4C274A8: malloc (vg_replace_malloc.c:236)

==2249== by 0x401C69: main (sample.cu:171)

Looking at lines 194 and 195, it does indeed look like there is a problem there. The index i runs from 0 to N-1, but pospinn is allocated to hold only npinnings elements (the 128-byte block valgrind reports is 16 float2 values). With your parameters, N is 512 but npinnings is only 16, so the loop writes well past the end of the array.
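
A minimal sketch of one way to fix this, assuming each array is filled by its own loop with its own bound. `Float2` is a stand-in for CUDA's `float2` so the snippet builds without the toolkit, and `initPositions` is a hypothetical helper, not a function from sample.cu. The `static_assert` also shows why valgrind reports a 128-byte block: 16 pinnings at 8 bytes each.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Stand-in for CUDA's float2 so this host-only sketch builds without the toolkit.
struct Float2 { float x, y; };

// 16 pinnings * 8 bytes = 128 bytes, the block size in the valgrind report.
static_assert(16 * sizeof(Float2) == 128, "Float2 is expected to be 8 bytes");

// Hypothetical helper: fill `count` random points in [0, box] x [0, box].
// The loop bound is the size of the array being written, never another array's size.
std::vector<Float2> initPositions(std::size_t count, float box) {
    std::vector<Float2> pts(count);
    for (std::size_t i = 0; i < count; ++i) {
        pts[i].x = (std::rand() / (float)RAND_MAX) * box;
        pts[i].y = (std::rand() / (float)RAND_MAX) * box;
    }
    return pts;
}
```

With the parameters from the first post, this would be called as `initPositions(512, 512.0f)` for the particles and `initPositions(16, 512.0f)` for the pinnings, so neither loop can run past the end of the other array.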

LucasCampos · March 1, 2011, 1:51am

That solved it. It was a silly mistake, but a hard one to spot. I’m used to Fortran, which checks these bounds for you. :)

Is there a way to get a less ‘mystical’ error when reading or writing out-of-bounds memory? This error was noticed because the program crashed some twenty lines later.

Thank you very much, Nathan.

seibert · March 1, 2011, 2:54am

valgrind, the program used by NathanW to discover the problem, is an excellent tool to become familiar with. I highly recommend people run it on their C and C++ programs periodically to find lurking memory bugs like this. I believe cuda-memcheck can find similar bugs in your device code.
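
For host code, one way to get a clearer failure than a delayed segfault is a container that checks indices itself. The sketch below assumes the host arrays are held in `std::vector` and accessed through `at()`, which throws `std::out_of_range` on a bad index. `Float2` and `writeChecked` are hypothetical names used only for illustration, and the per-access check is something the original code does not have.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Stand-in for CUDA's float2 so this host-only sketch builds without the toolkit.
struct Float2 { float x, y; };

// Hypothetical helper: write to v[i] with a bounds check.
// Returns true on success and false if i is out of range,
// instead of silently corrupting neighbouring heap memory.
bool writeChecked(std::vector<Float2> &v, std::size_t i, Float2 value) {
    try {
        v.at(i) = value;   // at() throws std::out_of_range when i >= v.size()
        return true;
    } catch (const std::out_of_range &) {
        return false;
    }
}
```

In a real program it is probably more useful to let the exception propagate (or print the index and abort) so the failure shows up at the faulty line rather than twenty lines later. The data can still be passed to cudaMemcpy through `v.data()`.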
