Heap Corruption help needed!

Hello there,

While coding a Monte Carlo simulation I ran into some trouble with heap corruption.

This is the message I get:

Unhandled exception at 0x000000018000ec51 in intCUDA.exe: 0xC0000005: Access violation reading location 0xffffffffffffffff.

Apparently I’m touching memory I don’t own, I suppose.

This is the line that raises the error:

CUDA_SAFE_CALL(cudaMemcpy(suma, sumad, sizeof(float), cudaMemcpyDeviceToHost));

It is a copy back from the device to the host.

Since this is not vital to my program, I commented it out to check whether it was really the problem.

Now I get the same error on the next line, which I suppose means the previous line was not the real problem.

This is the line that I get the error on now:

CUDA_SAFE_CALL(cudaFree(vd));

And this is the error message it throws:

Unhandled exception at 0x000000018002a053 in intCUDA.exe: 0xC0000005: Access violation reading location 0xffffffffffffffff.

When I comment all those lines out, I get this error regarding the heap corruption:

[b][i]Windows has triggered a breakpoint in intCUDA.exe.

This may be due to a corruption of the heap, and indicates a bug in intCUDA.exe or any of the DLLs it has loaded.

The output window may have more diagnostic information[/i]

[/b]

Can somebody help me out here please? Anyone who knows what’s happening?

By the way, I’m using 64-bit CUDA 2.0 on Vista x64 with a GF8600M GT. Right now, however, I’m running my code in the emulator, and I get different unexpected behavior every time I run or recompile the code.

This is the code that’s giving me trouble. The other parts of the code seem to work well:

This file is intCUDA.h

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cutil.h>

#define BLOCKS 5
#define THREADS 10
#define SIZE (BLOCKS*THREADS)
#define DIM 5

#define random // Method to run
//#define integrar
#define debug
#define dimensional[/codebox]

And this is intCUDA.cu

[codebox]
#include "intCUDA.h"
#include "drand48.h"
#include "random.h"
#include <math.h>

__device__ void dDimensional(float * v, float * rd, int index)
{
	float producto = 1;
	int i = index*5;
	int dim = i+DIM;

	for(; i < dim; i++)
	{
		producto = producto * calculo(v[i]);
		printf("index = %i, i = %i, producto = %f, v[i] = %f \n",index, i, producto, v[i]);
	}

	rd[index] = producto;
}

__global__ void montecarlo(float * vd, float * sumad, float * rd)
{
	float div = SIZE;
	__shared__ float sumas;
	int index = blockIdx.x*blockDim.x+threadIdx.x;

#ifndef random
	if(!index)
		//printf("Filling vector with midpoints...\n\n");
	vd[index] = llenarVectorMidPoint(index); // Writing to global memory
#endif

#ifdef integrar
	if(!index)
		//printf("Integrating...\n\n");
	vd[index] = calculo(vd[index])*(__fdividef(1.0,div)); // Writing to global memory
#else
	#ifndef dimensional
		if(!index)
			//printf("Computing Buffon...\n\n");
		vd[index] = calculoBuffon(vd[index]);
	#else
		if(!index)
			printf("Computing dDimensional...\n\n");
		printf("index = %i \n",index);
		dDimensional(vd, rd, index); // Naive implementation
	#endif
#endif

#ifndef dimensional
	sumas += vd[index];
#else
	sumas += rd[index];
#endif

	//printf("b(%i) t(%i) vd[i]:%f  shared: %.6f  \n",blockIdx.x,threadIdx.x,vd[index], sumas);
	__syncthreads();

	if(threadIdx.x == THREADS-1){
		*sumad = sumas;
		//printf("b(%i) t(%i)  shared: %f   sumad: %f\n",blockIdx.x,threadIdx.x,sumas,*sumad);
	}
}

void montecarlo(float * res, float * r, float * suma, const unsigned long long int size)
{
	// pointer to the input on the device
	float * vd = 0;
	float * rd = 0;

	// space on the device for the input
	CUDA_SAFE_CALL(cudaMalloc((void**)&vd, size*sizeof(float)));

	// Copy the input vector to the device
#ifdef random
	CUDA_SAFE_CALL(cudaMemcpy(vd, res, sizeof(float)*size, cudaMemcpyHostToDevice));
	#ifdef dimensional
		CUDA_SAFE_CALL(cudaMalloc((void**)&rd, sizeof(r)));
	#endif
#endif

	// pointer to the sum on the device
	float * sumad;
	CUDA_SAFE_CALL(cudaMalloc((void**)&sumad, sizeof(float)));

	// Launch the computation kernel
	montecarlo<<<BLOCKS, THREADS>>>(vd, sumad, rd);

	// Copy the result vector back to the host
#ifdef dimensional
	CUDA_SAFE_CALL(cudaMemcpy(r, rd, size*sizeof(float), cudaMemcpyDeviceToHost));
#else
	CUDA_SAFE_CALL(cudaMemcpy(r, vd, size*sizeof(float), cudaMemcpyDeviceToHost));
#endif

	// Copy the sum back to the host
	//CUDA_SAFE_CALL(cudaMemcpy(suma, sumad, sizeof(float), cudaMemcpyDeviceToHost));

	// Free memory
	//CUDA_SAFE_CALL(cudaFree(vd));
	// CUDA_SAFE_CALL(cudaFree(rd));
	// CUDA_SAFE_CALL(cudaFree(sumad));
}

int main(int argc, char ** argv)
{
	CUT_DEVICE_INIT(argc, argv);

	float * res = 0;
	float * r = (float *)malloc(SIZE*sizeof(float)); // pointer to the MC result
	float * suma = (float *)malloc(sizeof(float)); // pointer to the sum
	unsigned int hTimer;
	unsigned long long int size;

#ifdef random
	#ifdef dimensional
		size = DIM*SIZE;
		printf("Filling dimensional vector...\n\n");
	#else
		size = SIZE;
		printf("Filling vector with pseudo-random numbers...\n\n");
	#endif
	res = (float *)malloc(size*sizeof(float)); // pointer to the RNG result
	startTimer(&hTimer);
	rng(res, size);
	stopTimer(&hTimer);
	imprimir(res,size);
#endif

	startTimer(&hTimer);
	montecarlo(res, r, suma, size);
	stopTimer(&hTimer);
	imprimir(r,SIZE, *suma);

	free(r);
	r = 0;
	free(res);
	res = 0;
	free(suma);
	suma = 0;

	CUT_EXIT(argc, argv);
	return 0;
}
[/codebox]

This is the output:

[codebox]'intCUDA.exe': Loaded 'C:\Program Files (x86)\NVIDIA Corporation\NVIDIA CUDA SDK\projects\intCUDA\x64\EmuDebug\intCUDA.exe', Symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\ntdll.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\kernel32.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\CUDA\bin\cudart.dll', Binary was not built with debug information.
'intCUDA.exe': Loaded 'C:\Windows\System32\cutil64D.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\nvcuda.dll', Binary was not built with debug information.
'intCUDA.exe': Loaded 'C:\Windows\System32\user32.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\gdi32.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\advapi32.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\rpcrt4.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\opengl32.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\msvcrt.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\glu32.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\ddraw.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\dciman32.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\setupapi.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\oleaut32.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\ole32.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\dwmapi.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\imm32.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\msctf.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\lpk.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\usp10.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\nvapi64.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\shlwapi.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\System32\shell32.dll', No symbols loaded.
'intCUDA.exe': Loaded 'C:\Windows\winsxs\amd64_microsoft.windows.common-controls_6595b64144ccf1df_6.0.6001.18000_none_152e7382f3bd50c6\comctl32.dll', No symbols loaded.
The thread 'Win64 Thread' (0xfdc) has exited with code 0 (0x0).
The thread 'Win64 Thread' (0x1614) has exited with code 0 (0x0).
The thread 'Win64 Thread' (0x1738) has exited with code 0 (0x0).
The thread 'Win64 Thread' (0x1444) has exited with code 0 (0x0).
The thread 'Win64 Thread' (0x1334) has exited with code 0 (0x0).
The thread 'Win64 Thread' (0x13e0) has exited with code 0 (0x0).
The thread 'Win64 Thread' (0x187c) has exited with code 0 (0x0).
The thread 'Win64 Thread' (0xbe0) has exited with code 0 (0x0).
The thread 'Win64 Thread' (0x1118) has exited with code 0 (0x0).
The thread 'Win64 Thread' (0x1594) has exited with code 0 (0x0).
The thread 'Win64 Thread' (0x1958) has exited with code 0 (0x0).
The thread 'Win64 Thread' (0x8dc) has exited with code 0 (0x0).
The thread 'Win64 Thread' (0x1254) has exited with code 0 (0x0).
The thread 'Win64 Thread' (0x17e0) has exited with code 0 (0x0).
The thread 'Win64 Thread' (0xc08) has exited with code 0 (0x0).
The thread 'Win64 Thread' (0x1480) has exited with code 0 (0x0).
The thread 'Win64 Thread' (0x1590) has exited with code 0 (0x0).
The thread 'Win64 Thread' (0x1290) has exited with code 0 (0x0).
The thread 'Win64 Thread' (0x294) has exited with code 0 (0x0).
The thread 'Win64 Thread' (0x1a6c) has exited with code 0 (0x0).
HEAP[intCUDA.exe]: Heap block at 0000000000162170 modified at 0000000000162248 past requested size of c8
Windows has triggered a breakpoint in intCUDA.exe.

This may be due to a corruption of the heap, and indicates a bug in intCUDA.exe or any of the DLLs it has loaded.

The output window may have more diagnostic information
[/codebox]

Problem solved, guys!

I was confused about the memory transfers between device and host. In the dimensional case the input array and the output array have different sizes, whereas in every other case they are the same size.

So basically SIZE was 50, and that is the size of the output. For the input I needed DIM times that amount, since I use DIM random numbers to calculate each number of the result.

(By the way, a SIZE of 50 hurts performance badly because I’m not following the multiple-of-the-warp-size guideline. I was doing this for testing purposes only, and 50 results printed on screen are much easier to read than more than that. Afterwards I realized I could have used 1 block of 32 threads, which would follow the guidelines, but I wanted to be sure I could beat the 8-blocks-at-a-time hardware constraint, so that I could have more blocks to process and still be able to read the results on screen.)
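The warp-multiple aside can be put into numbers with a small sketch. It is plain C so it runs without a GPU, and the helper names `round_up_to_warp` and `blocks_for` are made up for illustration; they are not part of CUDA or cutil.

```c
#include <assert.h>

#define WARP_SIZE 32  /* threads per warp on the hardware discussed here */

/* Round a per-block thread count up to the next multiple of the warp size. */
static unsigned round_up_to_warp(unsigned threads)
{
    return (threads + WARP_SIZE - 1) / WARP_SIZE * WARP_SIZE;
}

/* Number of blocks needed to cover n elements with the given block size. */
static unsigned blocks_for(unsigned n, unsigned threads_per_block)
{
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}
```

With these helpers, 10 threads per block rounds up to 32, and covering 50 results with 32-thread blocks takes 2 blocks, so the kernel would also need an `index < SIZE` check for the 14 spare threads.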

The pointers res and vd are the input to my Monte Carlo function, on the host and the device respectively. Their size is DIM*SIZE elements.

The pointers r and rd are both output. Their size must be only SIZE elements.
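To keep the two sizes apart, it can help to compute them in one place. This is only a sketch in plain C: the macros repeat the values from intCUDA.h, while `input_count`, `output_count` and `float_bytes` are hypothetical helpers I made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Same values as intCUDA.h in this thread. */
#define BLOCKS  5
#define THREADS 10
#define SIZE    (BLOCKS * THREADS)  /* one result per thread */
#define DIM     5                   /* random numbers consumed per result */

/* Hypothetical helper: element count of the input buffers (res, vd). */
static size_t input_count(void)  { return (size_t)DIM * SIZE; }

/* Hypothetical helper: element count of the output buffers (r, rd). */
static size_t output_count(void) { return (size_t)SIZE; }

/* Hypothetical helper: byte count to hand to malloc, cudaMalloc or cudaMemcpy. */
static size_t float_bytes(size_t count)
{
    assert(count <= (size_t)-1 / sizeof(float));  /* guard against overflow */
    return count * sizeof(float);
}
```

Every allocation and copy for res and vd would then use `float_bytes(input_count())`, and every one for r and rd would use `float_bytes(output_count())`, so the two sizes cannot be swapped by accident.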

This is totally incorrect: r is a pointer, so sizeof(r) is the size of the pointer itself (8 bytes on a 64-bit build), not the size of the buffer it points to. It should have been SIZE*sizeof(float).
CUDA_SAFE_CALL(cudaMalloc((void**)&rd, sizeof(r)));

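Then here I was trying to copy size*sizeof(float) bytes back from the device. That obviously corrupts the heap: rd was allocated with only sizeof(r) bytes, and r holds only SIZE floats (200 bytes, which matches the requested size of c8 in the heap message), while size is DIM*SIZE in the dimensional case. The copy has to use SIZE specifically.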

CUDA_SAFE_CALL(cudaMemcpy(r, rd, size*sizeof(float), cudaMemcpyDeviceToHost));
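The two mistakes above, taking sizeof of a pointer and copying more than the destination holds, can be reproduced on the host alone. The following is a minimal sketch in plain C, not the code from my project: `FloatBuffer`, `pointer_bytes` and `checked_copy` are hypothetical names, and `memcpy` stands in for `cudaMemcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor: a pointer plus the element count it was allocated
   with, because sizeof on the pointer cannot recover that count. */
typedef struct {
    float *data;
    size_t count;
} FloatBuffer;

/* sizeof applied to a pointer yields the size of the pointer itself. */
static size_t pointer_bytes(const float *p) { return sizeof(p); }

/* Copy src into dst only if it fits; returns 0 on success, -1 otherwise. */
static int checked_copy(FloatBuffer dst, FloatBuffer src)
{
    assert(dst.data != NULL && src.data != NULL);
    if (src.count > dst.count)
        return -1;  /* would write past the end of dst */
    memcpy(dst.data, src.data, src.count * sizeof(float));
    return 0;
}
```

Carrying the element count next to the pointer means a copy of DIM*SIZE floats into a SIZE-float buffer is rejected up front instead of silently overrunning the heap block.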

And also:

CUDA_SAFE_CALL(cudaFree(rd));

I was always freeing rd, even though rd is only allocated under a condition. So it could potentially crash if the #ifdef combination picked by the preprocessor was not the one that allocates it.
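One way to keep a conditional allocation and its release in step is sketched below, again in plain C with `malloc` and `free` standing in for `cudaMalloc` and `cudaFree`. The `Buffers` struct and the `acquire` and `release` functions are hypothetical, and the run-time `dimensional` flag stands in for the `#ifdef dimensional` switch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the device buffers used in montecarlo(). */
typedef struct {
    float *vd;  /* input, always allocated */
    float *rd;  /* output, only allocated in the dimensional case */
} Buffers;

/* Allocate the buffers; returns 0 on success, -1 on failure.
   Both pointers start as NULL so release() is safe whichever branch ran. */
static int acquire(Buffers *b, size_t in_count, size_t out_count, int dimensional)
{
    assert(b != NULL);
    b->vd = NULL;
    b->rd = NULL;
    b->vd = (float *)malloc(in_count * sizeof(float));
    if (b->vd == NULL)
        return -1;
    if (dimensional) {
        b->rd = (float *)malloc(out_count * sizeof(float));
        if (b->rd == NULL) {
            free(b->vd);
            b->vd = NULL;
            return -1;
        }
    }
    return 0;
}

/* Release whatever acquire() allocated; free(NULL) is a no-op. */
static void release(Buffers *b)
{
    assert(b != NULL);
    free(b->rd);
    free(b->vd);
    b->rd = NULL;
    b->vd = NULL;
}
```

Because both pointers are reset to NULL before any branch runs, the release path does not need to know which preprocessor combination was compiled in.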

So basically the code was a bit messy and the sizes were mixed up, which is why I kept corrupting the heap when copying the results back.

That’s it

Hope it helps others

xthO