memory fragmentation (perhaps?)

vargant51 · February 9, 2009, 6:38am

Hi,

edit: my /previous/ issue, not my current one was memory corruption. I actually think I got rid of that…that was a different error message entirely. Something along the lines of glibc detected such-and-such* !prev, and then a map of allocated addresses and a stacktrace. Sorry about the confusing title.

I am researching quantum chemistry and plasma physics using a robust, perturbative, renormalized QFT formulated using MaxEnt, Spacetime Algebra/Bivector General Relativity, and active stochastic processes (see e.g. doi:10.1016/j.physletb.2004.03.036, doi:10.1016/j.physletb.2007.10.060, also check out publications by D. Hestenes and E.T. Jaynes). The result will be open-sourced as soon as I can debug it, so feel free to ask for code snippets, etc.

Right now, I’ve got a nasty problem. Using the runtime API, I need to allocate 2 float arrays, each 1286464, asynchronously on a 512mb version of the 8600GT (yes, I know, I’m cheap). Then I need to call my initialization kernel on an input file stored as a uchar*, which has only about 26 registers—I launch 2 streams, each 2 blocks of size (2,2,2). These sizes are subject to as much change as they need given the hardware. I do, however, want to test support for asynchronous transfers and launches if I can, because I made an object-oriented wrapper and am going to write template to abstract the initialization kernel (right now it’s just a dummy that acts as “cudaMemset”). Ideally, more powerful hardware would allow applications to perform multiple simulations without interference.

I don’t know exactly what needs to be fixed. Here is what seems to be the problem code, lines 36-76:

__host__ cystGPUMsim::cystGPUMsim(struct simParams *params, const char* buff){

			dp=(cudaPitchedPtr*) malloc(sizeof(cudaPitchedPtr)*(nstreams));

			width=params->width;

			height=params->height;

			depth=params->depth;

			dT=params->dT;

			dx=params->dx;

			n=params->n;

			grid=params->grid;	

			block=params->block;

			max=params->max;

			nstreams=params->nstreams;

	

			

			ca_extent.width  = width/nstreams;

			ca_extent.height = height;

			ca_extent.depth  = depth;

			

			unsigned char* dbuff;

			cudaMalloc((void**) &dbuff, sizeof(buff));

			cudaMemcpy(dbuff, buff, sizeof(buff), cudaMemcpyHostToDevice);

			stream=(cudaStream_t*) malloc(sizeof(cudaStream_t)*nstreams);

			for (int s = 0; s < nstreams; ++s){

				dp[s]=make_cudaPitchedPtr( NULL, width/nstreams*sizeof(float), height, depth*4 );

				cudaMalloc3D(&(dp[s]), ca_extent);

			}

				for(int i=0; i<int(width/nstreams); i+=grid.x*block.x){

					for(int j=0; j<int(height); i+=grid.y*block.y){

						for(int k=0; k<int(depth*4); i+=grid.z*block.z){

							for (int s = 0; s < nstreams; ++s){

							initialize<<<grid, block, 0, stream[s]>>>((float***) dp[s].ptr, dbuff, i,j,k);

							}

							cudaThreadSynchronize();

							checkCUDAError(" ");

						

					}

				}

			}

			cudaFree(dbuff);

	}

Compile-time yields no relevant warnings. What I see on runtime is the following error:

and the gdb stack says:

#0 0xb7f73b13 in ?? () from /cuda/lib/libcudart.so.2

#1 0xb7f431de in ?? () from /cuda/lib/libcudart.so.2

#2 0xb7f3f550 in ?? () from /cuda/lib/libcudart.so.2

#3 0xb7f73bfc in ?? () from /cuda/lib/libcudart.so.2

#4 0xb7fa1fcf in ?? () from /lib/ld-linux.so.2

#5 0xb7ceb084 in exit () from /lib/tls/i686/cmov/libc.so.6

#6 0x0805ce1a in checkCUDAError (msg=0x80604ab " ") at cyst.cu:30

#7 0x0805d34b in cystGPUMsim (this=0x8167a58, params=0x8167a08,
buff=0x80604ad "k") at cyst.cu:70
#8 0x0805d554 in demo () at demo.cu:40

#9 0x0805db92 in main () at main.cpp:22.

I would use deviceemu to debug. But right now, if I compile it with device emulation, I receive the following warnings:

Then, when I run it, this odd segmentation fault occurs:

I think this has something to do with my host system’s being single-threaded; I am pretty sure the CUDA Programming Guide said you need at least enough threads on the host to run the device code.

I am running Ubuntu 8.04 32-bit Desktop Ed. with the Cuda 2.1 Beta Toolkit and SDK (my installation doesn’t seem to cause any problems), drivers 180.06 beta, and all of the most recent package updates (=> up-to-date GCC, etc.). I have a Pentium D 805, 1GB of RAM (DDR2, don’t know what speed), and a BFG Tech GeForce 8600 GT OC w/ 512 GB GDDR3 (quite a mouthful). Only nvcc settings are -G, -g, and -arch SM_11, which is the correct target (I checked. I had an atomicAdd that I got rid of, if you must know).

Any thoughts? Appreciate your comments.

tmurray · February 9, 2009, 7:10am

Can you post full source?

vargant51 · February 9, 2009, 7:44am

sure, sure. here goes:

cyst.h:

/*   Copyright (c) 2009 Kevin Daley  

	 

	The following code is part of cYst.

	cYst is free software: you can redistribute it and/or modify

	it under the terms of the GNU General Public License as published by

	the Free Software Foundation, either version 3 of the License, or

	(at your option) any later version.

	cYst is distributed in the hope that it will be useful,

	but WITHOUT ANY WARRANTY; without even the implied warranty of

	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

	GNU General Public License for more details.

	You should have received a copy of the GNU General Public License

	along with cYst.  If not, see <http://www.gnu.org/licenses/>.

*/

#include <cuda_gl_interop.h>

#include <cstdio>

#include <cstdlib>

#include <cmath>

 __global__ void initialize(float*** hd, unsigned char* file, int i, int j, int k);

extern int resx, resy;

namespace cyst{

struct simParams{

	float dT;

	float n;

	float4 dx; 

	dim3 grid; 

	dim3 block;

	float max;

	unsigned int width, height, depth;

	int nstreams;

};

class cystGPUMsim{

protected:

	cudaPitchedPtr *dp; 

	float dT;

	float4 dx;

	float n; 

	dim3 grid; 

	dim3 block;

	cudaStream_t *stream;

	int nstreams;

	float max;

	unsigned int width, height, depth;

	cudaExtent ca_extent;

public:

	__host__ explicit cystGPUMsim(struct simParams *params, const char* buff);

	__host__ virtual ~cystGPUMsim();

	__host__ void callGPUsim(int frames);

	__host__ float* retrieveGPUsim();

};

}

cyst.cu

/*  Copyright (c) 2009 Kevin Daley 

	The following code is part of cYst.

	cYst is free software: you can redistribute it and/or modify

	it under the terms of the GNU General Public License as published by

	the Free Software Foundation, either version 3 of the License, or

	(at your option) any later version.

	cYst is distributed in the hope that it will be useful,

	but WITHOUT ANY WARRANTY; without even the implied warranty of

	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

	GNU General Public License for more details.

	You should have received a copy of the GNU General Public License

	along with cYst.  If not, see <http://www.gnu.org/licenses/>.

*/

#include "cyst.cu"

__global__ void initialize(float*** hd, unsigned char* file, int i, int j, int k){

	uint3 pg=make_uint3(threadIdx.x+blockIdx.x*blockDim.x+i,threadIdx.x+blockIdx

.y*blockDim.y+j,threadIdx.z+blockIdx.z*blockDim.z+k);

	hd[pg.x][pg.y][pg.z]=3;

}

void demo(){

cudaSetDevice(0);

struct simParams *parms= new simParams;

parms->dT=10;

parms->n=.5;

parms->dx=make_float4(0.1,0.1,0.1,0.1);

dim3 grid (2,1);

dim3 block (2,2,2);

parms->width=128;

parms->height=128;

parms->depth=64;

parms->max=50.0f;

parms->nstreams=2;

cudaSetDevice(0);

	parms->grid=grid;

	parms->block=block;

	cyst::cystGPUMsim* sim=new cyst::cystGPUMsim(parms, "k");		

	sim->callGPUsim(1);

	

	float* H=sim->retrieveGPUsim();

	delete sim;

	FILE* f=fopen("out.dat","rw+");

	for(int i=0; i<256; i++){

		for(int j=0; j<256; j++){

			for(int k=0; k<256; k++){	

				fprintf(f, "%f %f %f %f %f %f %f %f\n", 

i*parms->dx.x, j*parms->dx.y,k*parms->dx.z, k*parms->dx.w, 

H[(i+j*256+k*256*1024)+3], H[(i+j*256+k*256*1024)], 

H[(i+j*256+k*256*1024)+1], 

H[(i+j*256+k*256*1024)+2]);

	}}}

	fclose(f);

	exit(0);

}

and demo.cu

/*  Copyright (c) 2009 Kevin Daley 

	The following code is part of cYst.

	cYst is free software: you can redistribute it and/or modify

	it under the terms of the GNU General Public License as published by

	the Free Software Foundation, either version 3 of the License, or

	(at your option) any later version.

	cYst is distributed in the hope that it will be useful,

	but WITHOUT ANY WARRANTY; without even the implied warranty of

	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

	GNU General Public License for more details.

	You should have received a copy of the GNU General Public License

	along with cYst.  If not, see <http://www.gnu.org/licenses/>.

*/

#include "cyst.cu"

__global__ void initialize(float*** hd, unsigned char* file, int i, int j, int k){

	uint3 pg=make_uint3(threadIdx.x+blockIdx.x*blockDim.x+i,threadIdx.x+blockIdx

.y*blockDim.y+j,threadIdx.z+blockIdx.z*blockDim.z+k);

	hd[pg.x][pg.y][pg.z]=3;

}

void demo(){

cudaSetDevice(0);

struct simParams *parms= new simParams;

parms->dT=10;

parms->n=.5;

parms->dx=make_float4(0.1,0.1,0.1,0.1);

dim3 grid (2,1);

dim3 block (2,2,2);

parms->width=128;

parms->height=128;

parms->depth=64;

parms->max=50.0f;

parms->nstreams=2;

cudaSetDevice(0);

	parms->grid=grid;

	parms->block=block;

	cyst::cystGPUMsim* sim=new cyst::cystGPUMsim(parms, "k");		

	sim->callGPUsim(1);

	

	float* H=sim->retrieveGPUsim();

	delete sim;

	FILE* f=fopen("out.dat","rw+");

	for(int i=0; i<256; i++){

		for(int j=0; j<256; j++){

			for(int k=0; k<256; k++){	

				fprintf(f, "%f %f %f %f %f %f %f %f\n", 

i*parms->dx.x, j*parms->dx.y,k*parms->dx.z, k*parms->dx.w, 

H[(i+j*256+k*256*1024)+3], H[(i+j*256+k*256*1024)], 

H[(i+j*256+k*256*1024)+1], 

H[(i+j*256+k*256*1024)+2]);

	}}}

	fclose(f);

	exit(0);

}

and kernel.cu:

/*  Copyright (c) 2009 Kevin Daley 

	The following code is part of cYst.

	cYst is free software: you can redistribute it and/or modify

	it under the terms of the GNU General Public License as published by

	the Free Software Foundation, either version 3 of the License, or

	(at your option) any later version.

	cYst is distributed in the hope that it will be useful,

	but WITHOUT ANY WARRANTY; without even the implied warranty of

	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

	GNU General Public License for more details.

	You should have received a copy of the GNU General Public License

	along with cYst.  If not, see <http://www.gnu.org/licenses/>.

*/

//planck units

#define const_pi 3.1415927f

#define kC 1.0f

#define kB 1.0f

#define lightspeed  1.0f

#define finestructure 1.0f

#include <sm_11_atomic_functions.h>

__global__ void kernelGPU(float ***H, float dT, float4 dx, float n, float max, int i, int j, int k){

		float mg=0;

		uint3 pl=make_uint3(threadIdx.x, threadIdx.y, threadIdx.z);

		uint3 pg=make_uint3(threadIdx.x+blockIdx.x*blockDim.x+i,threadIdx.x+blockIdx

.y*blockDim.y+j,threadIdx.z+blockIdx.z*blockDim.z+k);

		float x=0.0f, y=0.0f, z=0.0f;

		__shared__ float shared[2][2][2],  E[2][2][2], E2[2][2][2];

		

		E[pl.x][pl.y][pl.z]=H[pg.x][pg.y][pg.z];

		E2[pl.x][pl.y][pl.z]=0.0f;

		for(float i=0; i<max && i<E[pl.x][pl.y][pl.z]; i+=n){

						if((threadIdx.z+1)%4==0){

							shared[pl.x][pl.y][pl.z]=i;

							mg=sqrtf(pow(shared[pl.x][pl.y][pl.z-3],2.0f)+pow(shared[pl.x][pl.y][pl.z-2],2.0f)+pow(shared[pl.x][pl.y][pl.z-1],2.0f));

							x+=shared[pl.x][pl.y][pl.z]*(cosf(mg)-1/mg*shared[pl.x][pl.y][pl.z]*sinf(mg));

						}

						else if((threadIdx.z+1)%3==0){

							shared[pl.x][pl.y][pl.z]=i;

							mg=sqrtf(pow(shared[pl.x][pl.y][pl.z+1],2.0f)+pow(shared[pl.x][pl.y][pl.z-2],2.0f)+pow(shared[pl.x][pl.y][pl.z-1],2.0f));

							x+=shared[pl.x][pl.y][pl.z]*(cosf(mg)-1/mg*shared[pl.x][pl.y][pl.z]*sinf(mg));								

						}

						else if((threadIdx.z+1)%2==0){

							shared[pl.x][pl.y][pl.z]=i;

							mg=sqrtf(pow(shared[pl.x][pl.y][pl.z+1],2.0f)+pow(shared[pl.x][pl.y][pl.z+2],2.0f)+pow(shared[pl.x][pl.y][pl.z-1],2.0f));

							x+=shared[pl.x][pl.y][pl.z]*(cosf(mg)-1/mg*shared[pl.x][pl.y][pl.z]*sinf(mg));	

						}

						else{

							shared[pl.x][pl.y][pl.z]=i;

							mg=sqrtf(pow(shared[pl.x][pl.y][pl.z+1],2.0f)+pow(shared[pl.x][pl.y][pl.z+2],2.0f)+pow(shared[pl.x][pl.y][pl.z+3],2.0f));

							x+=shared[pl.x][pl.y][pl.z]*(cosf(mg)-1/mg*shared[pl.x][pl.y][pl.z]*sinf(mg));	

						}

						shared[pl.x][pl.y][pl.z]=x;

						__syncthreads();

							

							

						

						__syncthreads();

						if((threadIdx.z+1)%4==0){

							z=shared[pl.x][pl.y][pl.z-1];

														

						

						

							

							y=shared[pl.x][pl.y][pl.z-2];

							

						

						

							

							x=shared[pl.x][pl.y][pl.z-3];

							

			

						}

						__syncthreads();

						if((threadIdx.z+1)%4==0 && int(fdividef(x,dx.x))>=0 && int(fdividef(x,dx.x))){

							E2[pl.x+int(fdividef(x,dx.x))][pl.y+int(fdividef(y,dx.y))][pl.z+4*int(fdividef(z,dx.z))]+=shared[pl.x][pl.y][pl.z];

							

						}

						else if((threadIdx.z+1)%4==0){

							H[int(fdividef(x,dx.x))+pg.x][int(fdividef(y,dx.y))+pg.y][(int(fdividef(z,dx.z))*4+pg.z)]+=i;

						}

						

						E2[pl.x][pl.y][pl.z]-=i;

						__syncthreads();

					}

				

			H[pg.x][pg.y][pg.z]=E[pl.x][pl.y][pl.z]+E2[pl.x][pl.y][pl.z];

	}

Finally, main.cpp:

/*  Copyright (c) 2009 Kevin Daley 

	The following code is part of cYst.

	cYst is free software: you can redistribute it and/or modify

	it under the terms of the GNU General Public License as published by

	the Free Software Foundation, either version 3 of the License, or

	(at your option) any later version.

	cYst is distributed in the hope that it will be useful,

	but WITHOUT ANY WARRANTY; without even the implied warranty of

	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

	GNU General Public License for more details.

	You should have received a copy of the GNU General Public License

	along with cYst.  If not, see <http://www.gnu.org/licenses/>.

*/

extern void demo();

int main(int argc, char** argv){

	demo();

}

There you go.

vargant51 · March 19, 2009, 5:22pm

bump

Topic		Replies	Views
[SOLVED]memory leak issues CUDA Programming and Performance	12	3748	October 12, 2015
using cudaMalloc and cudaFree within a loop unspecified launch failure! CUDA Programming and Performance	21	37885	April 23, 2009
Random segmentation fault Legacy PGI Compilers	12	1412	December 30, 2020
Problem in Emulation CUDA Programming and Performance	1	870	September 4, 2009
Memory Fragmentation: 3d memory not completely freed CUDA Programming and Performance	1	1017	November 29, 2014
Bad performance or bad coding? CUDA Programming and Performance	21	941	October 12, 2021
memory fragmentation? CUDA Programming and Performance	2	4316	April 15, 2009
Segmentation error in cudaMalloc - concurrent kernel execution CUDA Programming and Performance	8	1289	January 30, 2012
Memory Corruption on a Fermi-Class GPU Error only on Fermis, program works on non-Fermis. CUDA Programming and Performance	18	7282	July 22, 2011
Unable to Allocate on Memory and Stack Overflow CUDA Programming and Performance	11	2643	September 6, 2011

memory fragmentation (perhaps?)

Related topics