free(): invalid next size (normal): cudaMalloc problem

Hi everyone. I have spent a lot of time trying to track down the problem described below and am looking for help.

I am writing a program that uses a standalone library with its own make utility, configured specifically for that library, so I have to use it unless I want to spend more time putting the build together than writing my code. This means that I cannot compile everything with nvcc and have to split the build: half of my code is compiled by the custom make utility, and the other half, the CUDA part, is compiled by nvcc into a static library. The static library contains a wrapper class and the kernel that I want to run.

Everything compiles and links fine, but at runtime I get this glibc error:

*** glibc detected *** /home/mfastovets/ProjOut/lib/RAVL/linux64/bin/SimpleStereoMatching: free(): invalid next size (normal): 0x00000000017a3050 ***

Using ddd, I was able to trace this problem to the cudaMalloc call inside my static library: the error is raised at the end of the cudaMalloc execution. I have tried everything I can think of to resolve this, but perhaps I am missing something simple. Here are the important parts of my code relating to this error:

float *imgR, *imgL, *scoreGrid;

imgR = new float[(rowRange.Max().V()-rowRange.Min().V())*


imgL = new float[(rowRange.Max().V()-rowRange.Min().V())*


scoreGrid = new float[(rowRange.Max().V()-rowRange.Min().V())*(lColRange.Max().V()-lColRange.Min().V())*(rColRange.Max().V()-rColRange.Min().V())];

int index = 0;
for(Array2dIterC<ByteT> iml(pair.LeftRectifiedImage()); iml; iml++)
{
	imgL[index] = iml.Data();
	index++;
}

index = 0;
for(Array2dIterC<ByteT> imr(pair.RightRectifiedImage()); imr; imr++)
{
	imgR[index] = imr.Data();
	index++;
}

for(int idx=0; idx < (rowRange.Max().V()-rowRange.Min().V())*(lColRange.Max().V()-lColRange.Min().V())*(rColRange.Max().V()-rColRange.Min().V()); idx++)
{
	scoreGrid[idx] = 0;
}

//scoreGrid, imgOne, imgTwo, need filling with the elements for their respective objects

MatchKernel kern;
kern.launch_kernel(scoreGrid, imgL, imgR, (int)(rowRange.Max().V()-rowRange.Min().V()), (int)(lColRange.Max().V()-lColRange.Min().V()), (int)(rColRange.Max().V()-rColRange.Min().V()));

And the wrapper function:

// Wrapper for the __global__ call that sets up the kernel call

void MatchKernel::launch_kernel(float *scores, float * imgOne, float * imgTwo, int rows, int colOne, int colTwo)
{
	float * scores_d;
	float * one_d;
	float * two_d;

	cudaMalloc((void**)&scores_d, sizeof(float)*rows*colOne*colTwo);
	cudaMalloc((void**)&one_d, sizeof(float)*rows*colOne);
	cudaMalloc((void**)&two_d, sizeof(float)*rows*colTwo);

	cudaMemcpy(one_d, imgOne, sizeof(float)*rows*colOne, cudaMemcpyHostToDevice);
	cudaMemcpy(two_d, imgTwo, sizeof(float)*rows*colTwo, cudaMemcpyHostToDevice);
	cudaMemcpy(scores_d, scores, sizeof(float)*rows*colOne*colTwo, cudaMemcpyHostToDevice);

	dim3 block(colTwo, 0, 0);
	dim3 grid(colOne, rows, 0);

	computeScores<<< grid, block>>>(scores_d, one_d, two_d); //kernel call

	cudaMemcpy(scores, scores_d,sizeof(float)*rows*colOne*colTwo, cudaMemcpyDeviceToHost);
}


Any help you can provide would be MUCH appreciated. Cheers!

I have discovered that this problem seems to only happen when the function is called from a statically linked library. If I compile with a simple test main, things go fine. I still have no idea what could possibly be wrong.
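One thing that may be worth ruling out on the host side: glibc reports "free(): invalid next size" when the bookkeeping data just past a heap block has been overwritten, and it only notices at some later malloc or free, which could be why the abort surfaces inside cudaMalloc. The host buffers in the first post are sized with Max - Min, while the copy loops walk every pixel of the image. If those ranges are inclusive (an assumption about the RAVL range type, not something confirmed in this thread), each buffer would be one row and one column short. Below is a minimal C++ sketch of a bounds-checked fill; InclusiveRange and fill_checked are hypothetical names made up for illustration and are not part of RAVL or CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an inclusive index range [min, max].
struct InclusiveRange {
    int min;
    int max;
    // Number of indices covered, counting both end points.
    std::size_t size() const {
        return static_cast<std::size_t>(max - min + 1);
    }
};

// Copies every element of [first, last) into `out` as float.
// out.at() throws std::out_of_range on an overrun instead of
// silently writing past the end of the heap block.
// Returns the number of elements written.
template <typename PixelIter>
std::size_t fill_checked(PixelIter first, PixelIter last, std::vector<float>& out) {
    std::size_t index = 0;
    for (; first != last; ++first) {
        out.at(index) = static_cast<float>(*first);
        ++index;
    }
    return index;
}
```

If a fill like this throws on the real images, the buffer sizes are the problem rather than cudaMalloc. Running the original binary under valgrind should point at the first out-of-bounds write as well.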

Similar problem here:

I get: free(): invalid next size (fast)

This happens after the whole program has already executed and memory is being released.

Any thoughts?

Is the application linked against CUDART as well?

Yes, it is. I used cudart because I need to have access to cuMemGetInfo.

Er, no, that’s not my question. Your library uses CUDART; does the application use CUDART as well? Linking against libcuda should be fine.

I don’t know if this answers your question (sorry): my application uses both runtime API calls and driver API calls (specifically, cuMemGetInfo).

Note: the “free(): invalid next size (fast)” error does not always appear. For small grid sizes everything works fine. It appears once the Y dimension of the grid goes beyond a certain number.
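The launch configuration may be worth checking in both cases. In the wrapper from the first post, dim3 block(colTwo, 0, 0) and dim3 grid(colOne, rows, 0) contain zero components, whereas every dim3 component has to be at least 1 for a launch to be valid. Each grid dimension also has a device-specific upper limit. An invalid configuration makes the launch fail rather than corrupt the host heap by itself, but it is cheap to rule out. Here is a rough host-only C++ sketch of such a check. Dim3, LaunchLimits and valid_launch are hypothetical names for illustration; in real code the limits would come from cudaGetDeviceProperties (the maxGridSize, maxThreadsDim and maxThreadsPerBlock fields).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for CUDA's dim3 so the check compiles without nvcc.
struct Dim3 {
    unsigned x, y, z;
};

// Per-device limits, supplied by the caller (illustrative, not queried here).
struct LaunchLimits {
    unsigned maxGrid[3];          // upper bound per grid dimension
    unsigned maxBlock[3];         // upper bound per block dimension
    unsigned maxThreadsPerBlock;  // upper bound on x*y*z of a block
};

// True if every component lies in [1, limit].
bool in_range(const Dim3& d, const unsigned (&limit)[3]) {
    return d.x >= 1 && d.x <= limit[0] &&
           d.y >= 1 && d.y <= limit[1] &&
           d.z >= 1 && d.z <= limit[2];
}

// True if the grid/block pair is a launchable configuration under `lim`.
bool valid_launch(const Dim3& grid, const Dim3& block, const LaunchLimits& lim) {
    if (!in_range(grid, lim.maxGrid) || !in_range(block, lim.maxBlock)) {
        return false;
    }
    // 64-bit product so that large block dimensions cannot wrap around.
    const std::uint64_t threads =
        static_cast<std::uint64_t>(block.x) * block.y * block.z;
    return threads <= lim.maxThreadsPerBlock;
}
```

Independently of a check like this, calling cudaGetLastError() right after the kernel launch and testing the return value of each cudaMalloc and cudaMemcpy would show whether the launch or a copy is failing.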

cudaMalloc is part of the cudart library, which I link against. So that can’t be the problem, I don’t think.