cudaMalloc corrupts OpenGL graphics display

I am new to CUDA and am relearning C++, so please forgive me for any non-technical noob jargon.

I am working on a model that simulates plant (biomass) growth using simple differential equations. One of these diffuses the biomass density and nutrient density, which is what I’ve shown below. My problem is getting the graphics (OpenGL) and cudaMalloc to get along. If I declare all of the constants used by a kernel inside that kernel, everything works but is inefficient and slow. If I instead declare the constants on the host and use cudaMalloc and cudaMemcpy to move them onto the device, the math still works (the kernel receives what I expect), but the output display becomes corrupted. I’ve posted some code snippets and .png images to illustrate what I’m doing and what the outcome is.

This code works fine (but slow):

__global__ void
d_diffuse_tex(float *grid, float *ftemp, int w, int h, float dt, float hh, unsigned int TextureFlag, unsigned int nSpecies)
{
	int x = blockIdx.x*blockDim.x + threadIdx.x;
	int y = blockIdx.y*blockDim.y + threadIdx.y;

	if (TextureFlag == 1) /*---------- Diffuse biomass ----------*/
	{
		// Diffusion coefficients for biomass, one per species (1-10)
		float alpha[10] = { 0.005479f, 0.005479f, 0.005479f, 0.005479f, 0.005479f,
		                    0.005479f, 0.005479f, 0.005479f, 0.005479f, 0.005479f };

		if (nSpecies == 1)
		{
			grid[y*w + x] = DiffuseBio(grid[y*w + x], dt, hh, alpha[0], 0, x, y);	// Diffuse species 1
		}
		// Other conditions for nSpecies == 2, 3, etc.
		// A similar set of conditions and calls diffuses nutrient (TextureFlag == 2).
		// DiffuseBio is a helper that lets me scale to more species without repeating code; not important here.
	}
}

Here I define the diffusion coefficients for each species inside the kernel, which is slow and inefficient, but it produces correct results like the one shown in “Correct_Graphics.png” (attached).

This second version is nearly identical, but I declare the variables on the host, populate the arrays, and then pass them into the kernel. I have verified that mathematically the model still works: the kernel receives exactly the constants I expect. I even got it to produce the correct graphics once or twice, with about a 20% speedup over the first method. Now, however, the graphics are corrupted and I cannot figure out why. By commenting out parts of the code, I found that merely executing the line

CUDA_SAFE_CALL( cudaMalloc( (void**) &d_Db, size ) );

corrupts the graphics.

Main .cpp file:

float *h_Db = NULL;	// Diffusion coefficient array for biomass (host)
float *d_Db = NULL;	// Diffusion coefficient array for biomass (device)
float *h_Dn = NULL;	// Diffusion coefficient array for nutrient (host)
float *d_Dn = NULL;	// Diffusion coefficient array for nutrient (device)

int
main( int argc, char** argv )
{
	int size = 10 * sizeof(float);

	h_Db = new float[10];
	h_Db[0] = 0.005479f;
	h_Db[1] = 0.005479f;
	h_Db[2] = 0.005479f;
	h_Db[3] = 0.005479f;
	h_Db[4] = 0.005479f;
	h_Db[5] = 0.005479f;
	h_Db[6] = 0.005479f;
	h_Db[7] = 0.005479f;
	h_Db[8] = 0.005479f;
	h_Db[9] = 0.005479f;
	CUDA_SAFE_CALL( cudaMalloc( (void**) &d_Db, size ) );
	CUDA_SAFE_CALL( cudaMemcpy( d_Db, h_Db, size, cudaMemcpyHostToDevice ) );

	h_Dn = new float[10];
	h_Dn[0] = 0.027397f;
	h_Dn[1] = 0.027397f;
	h_Dn[2] = 0.027397f;
	h_Dn[3] = 0.027397f;
	h_Dn[4] = 0.027397f;
	h_Dn[5] = 0.027397f;
	h_Dn[6] = 0.027397f;
	h_Dn[7] = 0.027397f;
	h_Dn[8] = 0.027397f;
	h_Dn[9] = 0.027397f;
	CUDA_SAFE_CALL( cudaMalloc( (void**) &d_Dn, size ) );
	CUDA_SAFE_CALL( cudaMemcpy( d_Dn, h_Dn, size, cudaMemcpyHostToDevice ) );

	... // Other stuff
}
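(As an aside, I’ve read that small fixed tables like these can live in __constant__ memory instead of a cudaMalloc’d buffer, which would avoid the separate device allocation entirely. A rough, untested sketch of what I mean; the names here are mine, not from my actual code:)

```cuda
#include <cuda_runtime.h>

// Sketch only: small fixed coefficient tables placed in constant memory.
// Kernels can then read c_Db/c_Dn directly, with no extra kernel parameter
// and no cudaMalloc for the coefficients.
__constant__ float c_Db[10];   // biomass diffusion coefficients
__constant__ float c_Dn[10];   // nutrient diffusion coefficients

void uploadCoefficients()
{
    float h_Db[10], h_Dn[10];
    for (int i = 0; i < 10; ++i) {
        h_Db[i] = 0.005479f;   // biomass coefficient, same for all species
        h_Dn[i] = 0.027397f;   // nutrient coefficient, same for all species
    }
    // Copy the host arrays into constant memory on the device.
    cudaMemcpyToSymbol(c_Db, h_Db, sizeof(h_Db));
    cudaMemcpyToSymbol(c_Dn, h_Dn, sizeof(h_Dn));
}
```

I haven’t tried this myself yet, so treat it as an idea rather than a fix.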

Kernel:

__global__ void
d_diffuse_tex(float *grid, float *ftemp, int w, int h, float dt, float hh, unsigned int TextureFlag, unsigned int nSpecies, float *alpha)
{
	int x = blockIdx.x*blockDim.x + threadIdx.x;
	int y = blockIdx.y*blockDim.y + threadIdx.y;

	if (TextureFlag == 1) /*---------- Diffuse biomass ----------*/
	{
		if (nSpecies == 1)
		{
			grid[y*w + x] = DiffuseBio(grid[y*w + x], dt, hh, alpha[0], 0, x, y);	// Diffuse species 1
		}
		// Other conditions for nSpecies == 2, 3, etc.
		// A similar set of conditions and calls diffuses nutrient (TextureFlag == 2).
	}
}

As I mentioned, I’ve checked that what gets passed into the kernel is correct, so the tangled web of function calls that eventually leads to this kernel isn’t worth posting here. The graphics for this code snippet are shown in “Messed_Up_Graphics.png” (attached).

Has anybody else seen this? Does anybody have a solution or a suggestion to help orient me in the direction of a solution?

Thank you!
Attachments: Messed_Up_Graphics.png, Correct_Graphics.png

Your index calculations seem to be weird again… check those first ;)

You should have a look at this topic:

http://forums.nvidia.com/index.php?showtopic=203116

I also had another topic elsewhere with 2dx2d
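(For instance, if the launch grid gets rounded up to a multiple of the block size, a guard like this keeps out-of-range threads from writing past the buffer. Just a sketch against the kernel signature posted above:)

```cuda
__global__ void
d_diffuse_tex(float *grid, float *ftemp, int w, int h, float dt, float hh,
              unsigned int TextureFlag, unsigned int nSpecies)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;

    // If the launch configuration rounds the grid up past the texture size,
    // threads beyond the edge would otherwise write out of bounds, which can
    // scribble over unrelated allocations (possibly including GL buffers).
    if (x >= w || y >= h)
        return;

    // ... rest of kernel ...
}
```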

This looks very useful for what I’m trying to do; I’ll definitely refer to it a few times until I get the hang of it. For right now I’m hardcoding some of this for speed, i.e. the code looks like this:

__device__ float DiffuseBio(float grid, float dt, float hh, float alpha, int sp, int x, int y)
{
	// 5-point Laplacian stencil read from the biomass texture
	float ftemp = tex2D(tex_Biomass, x-1, sp+y) + tex2D(tex_Biomass, x+1, sp+y) +
	              tex2D(tex_Biomass, x, sp+y-1) + tex2D(tex_Biomass, x, sp+y+1) -
	              4.0f * tex2D(tex_Biomass, x, sp+y);
	grid += dt * (alpha/(hh*hh)) * ftemp;

	return grid;
}

__global__ void
d_diffuse_tex(float *grid, float *ftemp, int w, int h, float dt, float hh, unsigned int TextureFlag, unsigned int nSpecies, float *alpha)
{
	int x = blockIdx.x*blockDim.x + threadIdx.x;
	int y = blockIdx.y*blockDim.y + threadIdx.y;

	if (nSpecies == 1)
	{
		grid[y*w + x] = DiffuseBio(grid[y*w + x], dt, hh, alpha[0], 0, x, y);			// Diffuse species 1
	}
	else if (nSpecies == 2)
	{
		grid[y*w + x] = DiffuseBio(grid[y*w + x], dt, hh, alpha[0], 0, x, y);			// Diffuse species 1
		grid[(h+y)*w + x] = DiffuseBio(grid[(h+y)*w + x], dt, hh, alpha[1], 1, x, y);		// Diffuse species 2
	}
	else if (nSpecies == 3)
	{
		grid[y*w + x] = DiffuseBio(grid[y*w + x], dt, hh, alpha[0], 0, x, y);			// Diffuse species 1
		grid[(h+y)*w + x] = DiffuseBio(grid[(h+y)*w + x], dt, hh, alpha[1], 1, x, y);		// Diffuse species 2
		grid[(2*h+y)*w + x] = DiffuseBio(grid[(2*h+y)*w + x], dt, hh, alpha[2], 2, x, y);	// Diffuse species 3
	}
	... // More conditions up to nSpecies == 10
}

I’m trying to get one thing working at a time, and this elementary indexing gets the job done well enough that I can focus on other aspects of the code. When I get more comfortable with everything I’ll be more clever with my indexing.

I don’t think this specific problem is due to indexing, though. I index the same way whether the graphics are correct or not, and merely calling cudaMalloc for the d_Db (alpha) array corrupts the graphics, even if I never use the allocated memory.

From what I’ve read in similar forum posts, allocating more memory than the GPU has can cause the computer’s graphics to hang or otherwise misbehave, and a reset fixes it, but my symptoms don’t match this exactly. Only the OpenGL display window is corrupted, and only while the code is running. I also have a hard time believing we are so close to the memory limit that allocating an additional array of 10 floats on the GPU counts as “allocating too much.”
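If it really were a memory problem, I figure I could rule it out by querying free device memory and checking for pending errors around the suspect allocation, something like this (untested):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Diagnostic sketch: report free/total device memory and any pending
// CUDA error, e.g. before and after the suspect cudaMalloc.
void reportDeviceMemory(const char *label)
{
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);
    printf("%s: %zu MB free of %zu MB\n", label, freeB >> 20, totalB >> 20);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("%s: pending CUDA error: %s\n", label, cudaGetErrorString(err));
}

// Usage:
//   reportDeviceMemory("before cudaMalloc");
//   CUDA_SAFE_CALL( cudaMalloc( (void**) &d_Db, size ) );
//   reportDeviceMemory("after cudaMalloc");
```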

Perhaps OpenGL and CUDA share the same memory, so if your indices go out of bounds it might corrupt the graphics? Just a thought… I don’t know if this can actually happen.

This could probably be tested by creating a simple/empty OpenGL project together with CUDA: allocate the OpenGL buffers and such, then allocate all remaining memory with CUDA, and then try to write to even more memory to force a buffer overflow or underflow and see if that corrupts the graphics.

Another random thought: perhaps the CUDA context is lost and needs to be re-acquired.

I have no experience with OpenGL + CUDA, so I don’t know what I am writing about ;) I’m just guessing :)

To be honest I have no experience with OpenGL, and I'm relatively new to CUDA, so the two together are entirely new to me. I like your suggestion and I’ll try the empty OpenGL project idea.

Thank you for the direction and all the help so far.

~Josh

Hi,

You could take a look at the SDK examples (boxFilter and postProcess in OpenGL).

I guess you could probably just plug your code into boxFilter (replace the kernels).