memory fragmentation (perhaps?)

Hi,

edit: my /previous/ issue, not my current one was memory corruption. I actually think I got rid of that…that was a different error message entirely. Something along the lines of glibc detected such-and-such* !prev, and then a map of allocated addresses and a stacktrace. Sorry about the confusing title.

I am researching quantum chemistry and plasma physics using a robust, perturbative, renormalized QFT formulated using MaxEnt, Spacetime Algebra/Bivector General Relativity, and active stochastic processes (see e.g. doi:10.1016/j.physletb.2004.03.036, doi:10.1016/j.physletb.2007.10.060, also check out publications by D. Hestenes and E.T. Jaynes). The result will be open-sourced as soon as I can debug it, so feel free to ask for code snippets, etc.

Right now, I’ve got a nasty problem. Using the runtime API, I need to allocate 2 float arrays, each 1286464, asynchronously on a 512mb version of the 8600GT (yes, I know, I’m cheap). Then I need to call my initialization kernel on an input file stored as a uchar*, which has only about 26 registers—I launch 2 streams, each 2 blocks of size (2,2,2). These sizes are subject to as much change as they need given the hardware. I do, however, want to test support for asynchronous transfers and launches if I can, because I made an object-oriented wrapper and am going to write template to abstract the initialization kernel (right now it’s just a dummy that acts as “cudaMemset”). Ideally, more powerful hardware would allow applications to perform multiple simulations without interference.

I don’t know exactly what needs to be fixed. Here is what seems to be the problem code, lines 36-76:

__host__ cystGPUMsim::cystGPUMsim(struct simParams *params, const char* buff){

			dp=(cudaPitchedPtr*) malloc(sizeof(cudaPitchedPtr)*(nstreams));

			width=params->width;

			height=params->height;

			depth=params->depth;

			dT=params->dT;

			dx=params->dx;

			n=params->n;

			grid=params->grid;	

			block=params->block;

			max=params->max;

			nstreams=params->nstreams;

	

			

			ca_extent.width  = width/nstreams;

			ca_extent.height = height;

			ca_extent.depth  = depth;

			

			unsigned char* dbuff;

			cudaMalloc((void**) &dbuff, sizeof(buff));

			cudaMemcpy(dbuff, buff, sizeof(buff), cudaMemcpyHostToDevice);

			stream=(cudaStream_t*) malloc(sizeof(cudaStream_t)*nstreams);

			for (int s = 0; s < nstreams; ++s){

				dp[s]=make_cudaPitchedPtr( NULL, width/nstreams*sizeof(float), height, depth*4 );

				cudaMalloc3D(&(dp[s]), ca_extent);

			}

				for(int i=0; i<int(width/nstreams); i+=grid.x*block.x){

					for(int j=0; j<int(height); i+=grid.y*block.y){

						for(int k=0; k<int(depth*4); i+=grid.z*block.z){

							for (int s = 0; s < nstreams; ++s){

							initialize<<<grid, block, 0, stream[s]>>>((float***) dp[s].ptr, dbuff, i,j,k);

							}

							cudaThreadSynchronize();

							checkCUDAError(" ");

						

					}

				}

			}

			cudaFree(dbuff);

	}

Compile-time yields no relevant warnings. What I see on runtime is the following error:

and the gdb stack says:

I would use deviceemu to debug. But right now, if I compile it with device emulation, I receive the following warnings:

Then, when I run it, this odd segmentation fault occurs:

I think this has something to do with my host system’s being single-threaded; I am pretty sure the CUDA Programming Guide said you need at least enough threads on the host to run the device code.

I am running Ubuntu 8.04 32-bit Desktop Ed. with the Cuda 2.1 Beta Toolkit and SDK (my installation doesn’t seem to cause any problems), drivers 180.06 beta, and all of the most recent package updates (=> up-to-date GCC, etc.). I have a Pentium D 805, 1GB of RAM (DDR2, don’t know what speed), and a BFG Tech GeForce 8600 GT OC w/ 512 GB GDDR3 (quite a mouthful). Only nvcc settings are -G, -g, and -arch SM_11, which is the correct target (I checked. I had an atomicAdd that I got rid of, if you must know).

Any thoughts? Appreciate your comments.

Can you post full source?

sure, sure. here goes:

cyst.h:

/*   Copyright (c) 2009 Kevin Daley  

	 

	The following code is part of cYst.

	cYst is free software: you can redistribute it and/or modify

	it under the terms of the GNU General Public License as published by

	the Free Software Foundation, either version 3 of the License, or

	(at your option) any later version.

	cYst is distributed in the hope that it will be useful,

	but WITHOUT ANY WARRANTY; without even the implied warranty of

	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

	GNU General Public License for more details.

	You should have received a copy of the GNU General Public License

	along with cYst.  If not, see <http://www.gnu.org/licenses/>.

*/

#include <cuda_gl_interop.h>

#include <cstdio>

#include <cstdlib>

#include <cmath>

 __global__ void initialize(float*** hd, unsigned char* file, int i, int j, int k);

extern int resx, resy;

namespace cyst{

struct simParams{

	float dT;

	float n;

	float4 dx; 

	dim3 grid; 

	dim3 block;

	float max;

	unsigned int width, height, depth;

	int nstreams;

};

class cystGPUMsim{

protected:

	cudaPitchedPtr *dp; 

	float dT;

	float4 dx;

	float n; 

	dim3 grid; 

	dim3 block;

	cudaStream_t *stream;

	int nstreams;

	float max;

	unsigned int width, height, depth;

	cudaExtent ca_extent;

public:

	__host__ explicit cystGPUMsim(struct simParams *params, const char* buff);

	__host__ virtual ~cystGPUMsim();

	__host__ void callGPUsim(int frames);

	__host__ float* retrieveGPUsim();

};

}

cyst.cu

/*  Copyright (c) 2009 Kevin Daley 

	The following code is part of cYst.

	cYst is free software: you can redistribute it and/or modify

	it under the terms of the GNU General Public License as published by

	the Free Software Foundation, either version 3 of the License, or

	(at your option) any later version.

	cYst is distributed in the hope that it will be useful,

	but WITHOUT ANY WARRANTY; without even the implied warranty of

	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

	GNU General Public License for more details.

	You should have received a copy of the GNU General Public License

	along with cYst.  If not, see <http://www.gnu.org/licenses/>.

*/

#include "cyst.cu"

__global__ void initialize(float*** hd, unsigned char* file, int i, int j, int k){

	uint3 pg=make_uint3(threadIdx.x+blockIdx.x*blockDim.x+i,threadIdx.x+blockIdx

.y*blockDim.y+j,threadIdx.z+blockIdx.z*blockDim.z+k);

	hd[pg.x][pg.y][pg.z]=3;

}

void demo(){

cudaSetDevice(0);

struct simParams *parms= new simParams;

parms->dT=10;

parms->n=.5;

parms->dx=make_float4(0.1,0.1,0.1,0.1);

dim3 grid (2,1);

dim3 block (2,2,2);

parms->width=128;

parms->height=128;

parms->depth=64;

parms->max=50.0f;

parms->nstreams=2;

cudaSetDevice(0);

	parms->grid=grid;

	parms->block=block;

	cyst::cystGPUMsim* sim=new cyst::cystGPUMsim(parms, "k");		

	sim->callGPUsim(1);

	

	float* H=sim->retrieveGPUsim();

	delete sim;

	FILE* f=fopen("out.dat","rw+");

	for(int i=0; i<256; i++){

		for(int j=0; j<256; j++){

			for(int k=0; k<256; k++){	

				fprintf(f, "%f %f %f %f %f %f %f %f\n", 

i*parms->dx.x, j*parms->dx.y,k*parms->dx.z, k*parms->dx.w, 

H[(i+j*256+k*256*1024)+3], H[(i+j*256+k*256*1024)], 

H[(i+j*256+k*256*1024)+1], 

H[(i+j*256+k*256*1024)+2]);

	}}}

	fclose(f);

	exit(0);

}

and demo.cu

/*  Copyright (c) 2009 Kevin Daley 

	The following code is part of cYst.

	cYst is free software: you can redistribute it and/or modify

	it under the terms of the GNU General Public License as published by

	the Free Software Foundation, either version 3 of the License, or

	(at your option) any later version.

	cYst is distributed in the hope that it will be useful,

	but WITHOUT ANY WARRANTY; without even the implied warranty of

	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

	GNU General Public License for more details.

	You should have received a copy of the GNU General Public License

	along with cYst.  If not, see <http://www.gnu.org/licenses/>.

*/

#include "cyst.cu"

__global__ void initialize(float*** hd, unsigned char* file, int i, int j, int k){

	uint3 pg=make_uint3(threadIdx.x+blockIdx.x*blockDim.x+i,threadIdx.x+blockIdx

.y*blockDim.y+j,threadIdx.z+blockIdx.z*blockDim.z+k);

	hd[pg.x][pg.y][pg.z]=3;

}

void demo(){

cudaSetDevice(0);

struct simParams *parms= new simParams;

parms->dT=10;

parms->n=.5;

parms->dx=make_float4(0.1,0.1,0.1,0.1);

dim3 grid (2,1);

dim3 block (2,2,2);

parms->width=128;

parms->height=128;

parms->depth=64;

parms->max=50.0f;

parms->nstreams=2;

cudaSetDevice(0);

	parms->grid=grid;

	parms->block=block;

	cyst::cystGPUMsim* sim=new cyst::cystGPUMsim(parms, "k");		

	sim->callGPUsim(1);

	

	float* H=sim->retrieveGPUsim();

	delete sim;

	FILE* f=fopen("out.dat","rw+");

	for(int i=0; i<256; i++){

		for(int j=0; j<256; j++){

			for(int k=0; k<256; k++){	

				fprintf(f, "%f %f %f %f %f %f %f %f\n", 

i*parms->dx.x, j*parms->dx.y,k*parms->dx.z, k*parms->dx.w, 

H[(i+j*256+k*256*1024)+3], H[(i+j*256+k*256*1024)], 

H[(i+j*256+k*256*1024)+1], 

H[(i+j*256+k*256*1024)+2]);

	}}}

	fclose(f);

	exit(0);

}

and kernel.cu:

/*  Copyright (c) 2009 Kevin Daley 

	The following code is part of cYst.

	cYst is free software: you can redistribute it and/or modify

	it under the terms of the GNU General Public License as published by

	the Free Software Foundation, either version 3 of the License, or

	(at your option) any later version.

	cYst is distributed in the hope that it will be useful,

	but WITHOUT ANY WARRANTY; without even the implied warranty of

	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

	GNU General Public License for more details.

	You should have received a copy of the GNU General Public License

	along with cYst.  If not, see <http://www.gnu.org/licenses/>.

*/

//planck units

#define const_pi 3.1415927f

#define kC 1.0f

#define kB 1.0f

#define lightspeed  1.0f

#define finestructure 1.0f

#include <sm_11_atomic_functions.h>

__global__ void kernelGPU(float ***H, float dT, float4 dx, float n, float max, int i, int j, int k){

		float mg=0;

		uint3 pl=make_uint3(threadIdx.x, threadIdx.y, threadIdx.z);

		uint3 pg=make_uint3(threadIdx.x+blockIdx.x*blockDim.x+i,threadIdx.x+blockIdx

.y*blockDim.y+j,threadIdx.z+blockIdx.z*blockDim.z+k);

		float x=0.0f, y=0.0f, z=0.0f;

		__shared__ float shared[2][2][2],  E[2][2][2], E2[2][2][2];

		

		E[pl.x][pl.y][pl.z]=H[pg.x][pg.y][pg.z];

		E2[pl.x][pl.y][pl.z]=0.0f;

		for(float i=0; i<max && i<E[pl.x][pl.y][pl.z]; i+=n){

						if((threadIdx.z+1)%4==0){

							shared[pl.x][pl.y][pl.z]=i;

							mg=sqrtf(pow(shared[pl.x][pl.y][pl.z-3],2.0f)+pow(shared[pl.x][pl.y][pl.z-2],2.0f)+pow(shared[pl.x][pl.y][pl.z-1],2.0f));

							x+=shared[pl.x][pl.y][pl.z]*(cosf(mg)-1/mg*shared[pl.x][pl.y][pl.z]*sinf(mg));

						}

						else if((threadIdx.z+1)%3==0){

							shared[pl.x][pl.y][pl.z]=i;

							mg=sqrtf(pow(shared[pl.x][pl.y][pl.z+1],2.0f)+pow(shared[pl.x][pl.y][pl.z-2],2.0f)+pow(shared[pl.x][pl.y][pl.z-1],2.0f));

							x+=shared[pl.x][pl.y][pl.z]*(cosf(mg)-1/mg*shared[pl.x][pl.y][pl.z]*sinf(mg));								

						}

						else if((threadIdx.z+1)%2==0){

							shared[pl.x][pl.y][pl.z]=i;

							mg=sqrtf(pow(shared[pl.x][pl.y][pl.z+1],2.0f)+pow(shared[pl.x][pl.y][pl.z+2],2.0f)+pow(shared[pl.x][pl.y][pl.z-1],2.0f));

							x+=shared[pl.x][pl.y][pl.z]*(cosf(mg)-1/mg*shared[pl.x][pl.y][pl.z]*sinf(mg));	

						}

						else{

							shared[pl.x][pl.y][pl.z]=i;

							mg=sqrtf(pow(shared[pl.x][pl.y][pl.z+1],2.0f)+pow(shared[pl.x][pl.y][pl.z+2],2.0f)+pow(shared[pl.x][pl.y][pl.z+3],2.0f));

							x+=shared[pl.x][pl.y][pl.z]*(cosf(mg)-1/mg*shared[pl.x][pl.y][pl.z]*sinf(mg));	

						}

						shared[pl.x][pl.y][pl.z]=x;

						__syncthreads();

							

							

						

						__syncthreads();

						if((threadIdx.z+1)%4==0){

							z=shared[pl.x][pl.y][pl.z-1];

														

						

						

							

							y=shared[pl.x][pl.y][pl.z-2];

							

						

						

							

							x=shared[pl.x][pl.y][pl.z-3];

							

			

						}

						__syncthreads();

						if((threadIdx.z+1)%4==0 && int(fdividef(x,dx.x))>=0 && int(fdividef(x,dx.x))){

							E2[pl.x+int(fdividef(x,dx.x))][pl.y+int(fdividef(y,dx.y))][pl.z+4*int(fdividef(z,dx.z))]+=shared[pl.x][pl.y][pl.z];

							

						}

						else if((threadIdx.z+1)%4==0){

							H[int(fdividef(x,dx.x))+pg.x][int(fdividef(y,dx.y))+pg.y][(int(fdividef(z,dx.z))*4+pg.z)]+=i;

						}

						

						E2[pl.x][pl.y][pl.z]-=i;

						__syncthreads();

					}

				

			H[pg.x][pg.y][pg.z]=E[pl.x][pl.y][pl.z]+E2[pl.x][pl.y][pl.z];

	}

Finally, main.cpp:

/*  Copyright (c) 2009 Kevin Daley 

	The following code is part of cYst.

	cYst is free software: you can redistribute it and/or modify

	it under the terms of the GNU General Public License as published by

	the Free Software Foundation, either version 3 of the License, or

	(at your option) any later version.

	cYst is distributed in the hope that it will be useful,

	but WITHOUT ANY WARRANTY; without even the implied warranty of

	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

	GNU General Public License for more details.

	You should have received a copy of the GNU General Public License

	along with cYst.  If not, see <http://www.gnu.org/licenses/>.

*/

extern void demo();

int main(int argc, char** argv){

	demo();

}

There you go.

bump