strange error in summation memory problems

Hi all,

I’ve been working on a kernel for a few days already and I am always getting the same compiler error message which I don’t really understand External Image It is:

First-chance exception at 0x7c812a5b in HoloPipelineApp.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0012dd88…

First-chance exception at 0x7c812a5b in HoloPipelineApp.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0012dd90…

First-chance exception at 0x7c812a5b in HoloPipelineApp.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0012dd88…

When I compile my code, I get an Advisory warning, that some memory is not clearly located:

… line 228: Advisory: Cannot tell what pointer points to, assuming global memory space

The kernel works on a texture and each thread should actually calculate one pixel of the result data. I would be very happy for any suggestions as how to find and/or fix the error :) The important lines have been commented in color. I am trying to use as much global memory (device) as possible to increase the performance. d_resultData contains about 250’000 float2 elements. By the way, doesn’t CUDA have operators for the vector types, e.g. to add a float3 in a straightforward manner:

float3 f3A = make_float3(0.0f,0.0f,0.0f);

float3 f3B = make_float3(0.2f,0.4f,0.8f);

float3 f3Result = f3A + f3B;

futhermore, operators like length(TYPE) would be nice External Image

Anyway…here is my code, which is called from my framework:

#ifndef HOLO_CUDA_EYE_EVALUATOR_H

#define HOLO_CUDA_EYE_EVALUATOR_H

// define CUDA block size

#define BLOCK_SIZE 16

//! includes, system

#include <stdlib.h>

#include <stdio.h>

#include <string.h>

#include <math.h>

//! includes, CUDA

#include <cuda.h>

#include <cutil.h>

// forward declarations

extern “C” global void evaluatePointSources_kernel(float2* d_resultData, int iNumPointSources, int iWidth, int iHeight);

extern “C” device float3 interpolatePosition(float x, float y);

extern “C” device float3 interpolateDirection(float3 f3PositionSource, float x, float y);

extern “C” device float f3Length(float3 vec);

texture<float2, 2, cudaReadModeElementType> sourceTexture;

// these for corners are used for billinear interpolation

// in the simulated pixel shader

device float3 cornerLL; // lower left

device float3 cornerLR; // lower right

device float3 cornerUL; // upper left

device float3 cornerUR; // upper right

// these four directions are used for billinear interpolation

// in the simulated pixel shader

device float3 directionLL;

device float3 directionLR;

device float3 directionUL;

device float3 directionUR;

// the point positions and (amplitude,phase,k) data in global memory

device float3* d_pointPositions;

device float3* d_f3APhaseKs;

// the host part

extern “C” host bool evaluatePointSources(int iNumPointSources,

										  int iWidth,

										  int iHeight,

										  float* targetData, 

										  float* sourceData,

										  float3* pointPositions,

										  float3* f3APhaseKs,

										  float3 aPosf3[4],

										  float3 aDirf3[4])

{

// initialize memory on the device for the source texture

// the CUDA channel description for the source array

cudaChannelFormatDesc sourceChannelDesc =  cudaCreateChannelDesc(32,32,0,0,cudaChannelFormatKindFloat);

// the CUDA array for the source texture

cudaArray* sourceArray;

CUDA_SAFE_CALL(cudaMallocArray(&sourceArray, &sourceChannelDesc, iWidth, iHeight));

CUT_CHECK_ERROR("cudaMallocArray failed");

CUDA_SAFE_CALL(cudaMemcpyToArray(sourceArray, 0, 0, sourceData, iWidth*iHeight*2*sizeof(float), cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpyToArray");

// set texture parameters for source texture

sourceTexture.addressMode[0] = cudaAddressModeClamp;

sourceTexture.addressMode[1] = cudaAddressModeClamp;

sourceTexture.filterMode = cudaFilterModeLinear;

sourceTexture.normalized = false;

// bind the source array to the source texture

CUDA_SAFE_CALL(cudaBindTextureToArray(sourceTexture, sourceArray, sourceChannelDesc));

CUT_CHECK_ERROR("cudaBindTextureToArray failed");

// copy the point locations and (amplitudue,phase,k) triples to the device

// obtain the correct pointer for the point positions

float3* devPtr = NULL;

CUDA_SAFE_CALL(cudaGetSymbolAddress((void**)&devPtr, d_pointPositions));

CUT_CHECK_ERROR("cudaGetSymbolAddress failed");

CUDA_SAFE_CALL(cudaMalloc((void**)&devPtr, iNumPointSources*sizeof(float3)));

CUT_CHECK_ERROR("cudaMalloc failed");

CUDA_SAFE_CALL(cudaMemcpyToSymbol(d_pointPositions, pointPositions, iNumPointSources, 0, cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpy failed");



// obtain the correct pointer for the (Amplitude,Phase,K) triples

devPtr = NULL;

CUDA_SAFE_CALL(cudaGetSymbolAddress((void**)&devPtr, d_f3APhaseKs));

CUT_CHECK_ERROR("cudaGetSymbolAddress failed");

CUDA_SAFE_CALL(cudaMalloc((void**)&devPtr, iNumPointSources*sizeof(float3)));

CUT_CHECK_ERROR("cudaMalloc failed");

CUDA_SAFE_CALL(cudaMemcpyToSymbol(d_f3APhaseKs, f3APhaseKs, iNumPointSources, 0, cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpy failed");



// allocate memory for the result data

float2* d_resultData = NULL;

CUDA_SAFE_CALL(cudaMalloc((void**)&d_resultData, iNumPointSources*sizeof(float2)));

CUT_CHECK_ERROR("cudaMalloc failed");

// copy the corner positions and directions to the (global) device memory

	

CUDA_SAFE_CALL(cudaMemcpyToSymbol(cornerLL,&aPosf3[0],sizeof(float3),0,cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpyToSymbol failed");

CUDA_SAFE_CALL(cudaMemcpyToSymbol(cornerLR,&aPosf3[1],sizeof(float3),0,cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpyToSymbol failed");

CUDA_SAFE_CALL(cudaMemcpyToSymbol(cornerUL,&aPosf3[2],sizeof(float3),0,cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpyToSymbol failed");

CUDA_SAFE_CALL(cudaMemcpyToSymbol(cornerUR,&aPosf3[3],sizeof(float3),0,cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpyToSymbol failed");

CUDA_SAFE_CALL(cudaMemcpyToSymbol(directionLL,&aDirf3[0],sizeof(float3),0,cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpyToSymbol failed");

CUDA_SAFE_CALL(cudaMemcpyToSymbol(directionLR,&aDirf3[1],sizeof(float3),0,cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpyToSymbol failed");

CUDA_SAFE_CALL(cudaMemcpyToSymbol(directionUL,&aDirf3[2],sizeof(float3),0,cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpyToSymbol failed");

CUDA_SAFE_CALL(cudaMemcpyToSymbol(directionUR,&aDirf3[3],sizeof(float3),0,cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpyToSymbol failed");



// calculate the grid and block size

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);

dim3 dimGrid(iWidth / dimBlock.x, iHeight / dimBlock.y);

// call the CUDA kernel with the appropriate grid and block size

evaluatePointSources_kernel<<<dimGrid, dimBlock>>>(d_resultData, iNumPointSources, iWidth, iHeight);

//copy result data back

CUDA_SAFE_CALL(cudaMemcpy(targetData, d_resultData, iNumPointSources*sizeof(float2), cudaMemcpyDeviceToHost));

CUT_CHECK_ERROR("cudaMemcpy failed");



// free memory on the device

CUDA_SAFE_CALL(cudaFreeArray(sourceArray));

CUT_CHECK_ERROR("cudaFreeArray failed");



CUDA_SAFE_CALL(cudaFree(d_pointPositions));

CUT_CHECK_ERROR("cudaFree failed");

CUDA_SAFE_CALL(cudaFree(d_f3APhaseKs));

CUT_CHECK_ERROR("cudaFree failed");

CUDA_SAFE_CALL(cudaFree(d_resultData));

CUT_CHECK_ERROR("cudaFree failed");

// if all went well return TRUE

return true;

}

// the device part

extern “C” global void evaluatePointSources_kernel(float2* d_resultData, int iNumPointSources, int iWidth, int iHeight)

{

// CUDA block index

int bX = blockIdx.x;

int bY = blockIdx.y;

// CUDA thread index

int tX = threadIdx.x;

int tY = threadIdx.y;

// calculate total coordinate of current thread which acutally corresponds to a single pixel

// being evaluated in the current thread

int iX = bX*blockDim.x + tX;

int iY = bY*blockDim.y + tY;

float2 resTexel = tex2D(sourceTexture, iX, iY);



// for all point sources...

for(int i=0; i<iNumPointSources; i++)

{

	

	// ...evaluate the current texel

	float3 positionSource = d_pointPositions[i];

	

	float3 f3APhaseK = d_f3APhaseKs[i]; // this memory location seems to be unknown, but strangely, only if the line commented by purple color is uncommented...

	

	

	//float3 interpVPosition = interpolatePosition(iX/iWidth, iY/iHeight);

	float3 interpVDirection = interpolateDirection(positionSource, iX/iWidth, iY/iHeight);

	

	

	float fR = f3Length(interpVDirection);

	

	float fSin, fCos;

	

	__sincosf(fR*f3APhaseK.z + f3APhaseK.y, &fSin, &fCos);

	

	

	float fAmp = f3APhaseK.x / fR;

	

	float2 curTexel = make_float2(fAmp*fCos,fAmp*fSin);

	

	resTexel = make_float2(resTexel.x+curTexel.x, resTexel.y+curTexel.y); // This line makes the program hang, without it, it runs fine, but of course then the whole loop maybe even doesn't get evaluated, since no result is written or used, so the compiler simply strips this code then I guess.		

}

// write the resulting color value to the summation memory

//__syncthreads(); // no change if uncommented

d_resultData[iY*iWidth+iX] = resTexel;

}

/*

  • This function replaces the fragment interpolation which is usually done

  • in the rendering pipeline between vertex and fragment processing.

  • It is a bilinear interpolation of the vertex position.

*/

extern “C” device float3 interpolatePosition(float x, float y)

{

float3 position = make_float3(cornerLL.x*(1-x)*(1-y) + cornerLR.x*x*(1-y) + cornerUL.x*(1-x)*y + cornerUR.x*x*y,

							  cornerLL.y*(1-x)*(1-y) + cornerLR.y*x*(1-y) + cornerUL.y*(1-x)*y + cornerUR.y*x*y,

							  cornerLL.z*(1-x)*(1-y) + cornerLR.z*x*(1-y) + cornerUL.z*(1-x)*y + cornerUR.z*x*y );

return position;

}

/*

  • This function replaces the fragment interpolation which is usually done

  • in the rendering pipeline between vertex and fragment processing.

  • It is a bilinear interpolation of the vertex direction relative to the point of evaluation \c f3PositionSource

*/

extern “C” device float3 interpolateDirection(float3 f3PositionSource, float x, float y)

{

float3 newDirectionLL = make_float3(directionLL.x - f3PositionSource.x,

									directionLL.y - f3PositionSource.y,

									directionLL.z - f3PositionSource.z);

float3 newDirectionLR = make_float3(directionLR.x - f3PositionSource.x,

									directionLR.y - f3PositionSource.y,

									directionLR.z - f3PositionSource.z);

float3 newDirectionUL = make_float3(directionUL.x - f3PositionSource.x,

									directionUL.y - f3PositionSource.y,

									directionUL.z - f3PositionSource.z);

float3 newDirectionUR = make_float3(directionUR.x - f3PositionSource.x,

									directionUR.y - f3PositionSource.y,

									directionUR.z - f3PositionSource.z);

float3 direction = make_float3(directionLL.x*(1-x)*(1-y) + directionLR.x*x*(1-y) + directionUL.x*(1-x)*y + directionUR.x*x*y,

							   directionLL.y*(1-x)*(1-y) + directionLR.y*x*(1-y) + directionUL.y*(1-x)*y + directionUR.y*x*y,

							   directionLL.z*(1-x)*(1-y) + directionLR.z*x*(1-y) + directionUL.z*(1-x)*y + directionUR.z*x*y );

return direction;

}

/*

  • This small helper function computes the length of a float3 vector.

*/

extern “C” device float f3Length(float3 vec)

{

float length = vec.x*vec.x + vec.y*vec.y + vec.z*vec.z;

return sqrtf(length);

}

#endif

I have seen this error you have mentioned come up many times. But it does not hang my system.

If I remember correctly, some other post mentions that this error comes up when the device sends back one of the error codes. Since VS2005 does not understand this error code it raises the above exception.

You might want to use the error checking macros used in the samples.

Thanks for the advice; I found that the memory releases at the end of the host function were wrong; cudaGetSymbolAddress must be used first to get the appropriate pointer to be released.

I use CUDA_SAFE_CALL and CUT_CHECK_ERROR from cutil.h, what else would you suggest to use?

So now my kernel runs without hanging or crashes, but the result I get is not the right one :(

Did anybody ever try to simulate the vertex and fragment shader functionality of the programmable rendering pipeline?
E.g for a simple texture to be rendered only four vertices are needed for the corners to set up the two triangles on which the texture is rendered. Suppose now, one would like to have the interpolated vertex position from the fragment shader (which is usually automatically done when the vertex shader passes the vertex positions to the fragment shader in the programmable rendering pipeline) in the CUDA kernel. What I did is that I computed the four corners on the CPU the same way as my vertex shader does. Then I pass these four values to the CUDA kernel and I wrote a (device) billinear interpolation routine to mimick this behaviour, but it seems to be wrong…am I missing something from the programmable rendering pipeline?

Here is my billinear interpolation routine; the four corners are set in the host portion of the code by cudaMemcpyToSymbol (see first post):

// these for corners are used for billinear interpolation
// in the simulated pixel shader
device float3 cornerLL; // lower left
device float3 cornerLR; // lower right
device float3 cornerUL; // upper left
device float3 cornerUR; // upper right

/*

  • This function replaces the fragment interpolation which is usually done

  • in the rendering pipeline between vertex and fragment processing.

  • It is a bilinear interpolation of the vertex position. The coordinates must

  • be normalized, e.g. in [0,1]
    /
    extern “C” device float3 interpolatePosition(float x, float y)
    {
    float3 position = make_float3(cornerLL.x
    (1-x)(1-y) + cornerLR.xx*(1-y) + cornerUL.x*(1-x)y + cornerUR.xxy,
    cornerLL.y
    (1-x)(1-y) + cornerLR.yx*(1-y) + cornerUL.y*(1-x)y + cornerUR.yxy,
    cornerLL.z
    (1-x)(1-y) + cornerLR.zx*(1-y) + cornerUL.z*(1-x)y + cornerUR.zx*y );

    return position;

}

When you do a “cudaMalloc” you are passing “&devptr” which is an address in the CPU space and NOT the GPU space. You need to copy yhis pointer to the GPU space which you are NOT doing.

So, first get the symbol address in a variable called “symbol”.

Then do a cudaMalloc and obtain the “devptr”.

Now, copy the “devptr” value to that symbol using “cudaMemcpy”.

Just do "cudaMemcpy(symbol, &devptr, sizeof(void *), cudaMemcpyHostToDevice);

This will make sure that “d_pointpositions” has the right pointer in it when the GPU kernel runs.

Also, I think your usage of “cudaMemcpyToSymbol” is also wrong. When you copy an array to a symbol, I would guess that the symbold has to be an “Array” and not a pointer. So, avoid that.

Instead, say, “cudaMemcpy(devptr, pointPositions, whatever-size, cudaMemcpyHostToDevice)”;

This will make sure that whatever pointer that “d_pointpositions” points to has the right data.

I think you have similar code fragments throughout your code. You got to fix all of them.

Hope this helps

Thanks a lot for the advice; this makes of course sense :) Now I’ve changed all sections accordingly. In the meantime, I also tried a version where data gets passed to the kernel (e.g. w/o device). But in both versions, the output from my kernel seems to be none. Even if I explicitly set a result value (colored) in the kernel. I know this for sure because first I did not cudaMemset my result array and then the output was random “noise”, but with cudaMemset it is black, which corresponds perfectly to the zero value I set…Now, does anybody know or have a clue why my kernel seems to be doing nothing??? :blink:

The kernel is supposed to execute a thread for each texel; and for each texel the global data is accessed to do some calculations; at the end the result is written to a global array. The only parameters passed in the kernel signature are the number of elements in the global data arrays and the texture width and height; e.g.

number of elements = width*height to be precise.

Here is all the code (host, device, device helper functions):

// define CUDA block size

define BLOCK_SIZE 24

//! includes, system

//include <stdafx.h>

include <stdlib.h>

include <stdio.h>

include <string.h>

include <math.h>

//! includes, CUDA

//include <cuda_runtime.h>

//include <cuda_runtime_api.h>

//include <cuda.h>

//include <cufft.h>

include <cutil.h>

// forward declarations

extern “C” global void evaluatePointSources_kernel(int iNumPointSources, int iWidth, int iHeight);

extern “C” device float3 interpolatePosition(float x, float y);

extern “C” device float3 interpolateDirection(float3 f3PositionSource, float x, float y);

extern “C” device float f3Length(float3 vec);

texture<float2, 2, cudaReadModeElementType> sourceTexture;

// these for corners are used for billinear interpolation

// in the simulated pixel shader

device float3 cornerLL; // lower left

device float3 cornerLR; // lower right

device float3 cornerUL; // upper left

device float3 cornerUR; // upper right

// these four directions are used for billinear interpolation

// in the simulated pixel shader

device float3 directionLL;

device float3 directionLR;

device float3 directionUL;

device float3 directionUR;

// the point positions and (amplitude,phase,k) data in global memory

device float3* d_pointPositions;

device float3* d_f3APhaseKs;

device float2* d_resultData;

// the host part

extern “C” host void evaluatePointSources(int iNumPointSources,

										  int iWidth,

										  int iHeight,

										  float* targetData, 

										  float* sourceData,

										  float3* pointPositions,

										  float3* f3APhaseKs,

										  float3 aPosf3[4],

										  float3 aDirf3[4])

{

printf("HoloCUDAEyeEvaluator::evaluatePointSources: Entered the function.\n");

// initialize memory on the device for the source texture

// the CUDA channel description for the source array

cudaChannelFormatDesc sourceChannelDesc =  cudaCreateChannelDesc(32,32,0,0,cudaChannelFormatKindFloat);

// the CUDA array for the source texture

cudaArray* sourceArray;

CUDA_SAFE_CALL(cudaMallocArray(&sourceArray, &sourceChannelDesc, iWidth, iHeight));

CUT_CHECK_ERROR("cudaMallocArray failed");

CUDA_SAFE_CALL(cudaMemcpyToArray(sourceArray, 0, 0, sourceData, iWidth*iHeight*2*sizeof(float), cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpyToArray");

// set texture parameters for source texture

sourceTexture.addressMode[0] = cudaAddressModeClamp;

sourceTexture.addressMode[1] = cudaAddressModeClamp;

sourceTexture.filterMode = cudaFilterModeLinear;

sourceTexture.normalized = false;

// bind the source array to the source texture

CUDA_SAFE_CALL(cudaBindTextureToArray(sourceTexture, sourceArray, sourceChannelDesc));

CUT_CHECK_ERROR("cudaBindTextureToArray failed");

printf("HoloCUDAEyeEvaluator::evaluatePointSources: Source array was bound.\n");

// copy the point locations and (amplitudue,phase,k) triples to the device:

// obtain the device pointer for the point positions

void* symbolPtr = NULL;

CUDA_SAFE_CALL(cudaGetSymbolAddress((void**)&symbolPtr, d_pointPositions));

CUT_CHECK_ERROR("cudaGetSymbolAddress failed");

// allocate memory

CUDA_SAFE_CALL(cudaMalloc((void**)&d_pointPositions, iNumPointSources*sizeof(float3)));

CUT_CHECK_ERROR("cudaMalloc failed");

// now copy the CPU pointer to the device

CUDA_SAFE_CALL(cudaMemcpy(symbolPtr, &d_pointPositions, sizeof(void*), cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpy failed");

// finally copy the input data to the allocated device memory

CUDA_SAFE_CALL(cudaMemcpy(d_pointPositions, pointPositions, iNumPointSources*sizeof(float3), cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpy failed");

printf("HoloCUDAEyeEvaluator::evaluatePointSources: Point positions were copied to the device.\n");

// obtain the device pointer for the (Amplitude,Phase,K) triples

symbolPtr = NULL;

CUDA_SAFE_CALL(cudaGetSymbolAddress((void**)&symbolPtr, d_f3APhaseKs));

CUT_CHECK_ERROR("cudaGetSymbolAddress failed");

// allocate memory

CUDA_SAFE_CALL(cudaMalloc((void**)&d_f3APhaseKs, iNumPointSources*sizeof(float3)));

CUT_CHECK_ERROR("cudaMalloc failed");

// now copy the CPU pointer to the device

CUDA_SAFE_CALL(cudaMemcpy(symbolPtr, &d_f3APhaseKs, sizeof(void*), cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpy failed");

// finally copy the input data to the allocated device memory

CUDA_SAFE_CALL(cudaMemcpy(d_f3APhaseKs, f3APhaseKs, iNumPointSources*sizeof(float3), cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpy failed");

printf("HoloCUDAEyeEvaluator::evaluatePointSources: Point amplitudes and phases were copied to the device.\n");

// allocate memory for the result data

symbolPtr = NULL;

CUDA_SAFE_CALL(cudaGetSymbolAddress((void**)&symbolPtr, d_resultData));

CUT_CHECK_ERROR("cudaGetSymbolAddress failed");

// allocate memory

CUDA_SAFE_CALL(cudaMalloc((void**)&d_resultData, iNumPointSources*sizeof(float2)));

CUT_CHECK_ERROR("cudaMalloc failed");

// now copy the CPU pointer to the device

CUDA_SAFE_CALL(cudaMemcpy(symbolPtr, &d_resultData, sizeof(void*), cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpy failed");

// initialize result data to zero

CUDA_SAFE_CALL(cudaMemset(d_resultData, 0, iNumPointSources*sizeof(float2)));	

            CUT_CHECK_ERROR("cudaMemset failed");

printf("HoloCUDAEyeEvaluator::evaluatePointSources: Created device memory for the result data.\n");

// copy the corner positions and directions to the (global) device memory



// lower left corner

printf("HoloCUDAEyeEvaluator::evaluatePointSources: cornerLL is located at (%f,%f,%f)\n", aPosf3[0].x, aPosf3[0].y, aPosf3[0].z);

CUDA_SAFE_CALL(cudaMemcpyToSymbol(cornerLL, &aPosf3[0], sizeof(float3), 0, cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpyToSymbol failed");

float3 cornerLLTest;

CUDA_SAFE_CALL(cudaMemcpyFromSymbol(&cornerLLTest, cornerLL, sizeof(float3), 0, cudaMemcpyDeviceToHost));

CUT_CHECK_ERROR("cudaMemcpyFromSymbol failed");

printf("HoloCUDAEyeEvaluator::evaluatePointSources: cornerLL on the device is located at (%f,%f,%f)\n", cornerLLTest.x, cornerLLTest.y, cornerLLTest.z);

// lower right corner

printf("HoloCUDAEyeEvaluator::evaluatePointSources: cornerLR is located at (%f,%f,%f)\n", aPosf3[1].x, aPosf3[1].y, aPosf3[1].z);

CUDA_SAFE_CALL(cudaMemcpyToSymbol(cornerLR, &aPosf3[1], sizeof(float3), 0, cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpyToSymbol failed");

float3 cornerLRTest;

CUDA_SAFE_CALL(cudaMemcpyFromSymbol(&cornerLRTest, cornerLR, sizeof(float3), 0, cudaMemcpyDeviceToHost));

CUT_CHECK_ERROR("cudaMemcpyFromSymbol failed");

printf("HoloCUDAEyeEvaluator::evaluatePointSources: cornerLR on the device is located at (%f,%f,%f)\n", cornerLRTest.x, cornerLRTest.y, cornerLRTest.z);



// upper left corner

printf("HoloCUDAEyeEvaluator::evaluatePointSources: cornerUL is located at (%f,%f,%f)\n", aPosf3[2].x, aPosf3[2].y, aPosf3[2].z);

CUDA_SAFE_CALL(cudaMemcpyToSymbol(cornerUL, &aPosf3[2], sizeof(float3), 0, cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpyToSymbol failed");

float3 cornerULTest;

CUDA_SAFE_CALL(cudaMemcpyFromSymbol(&cornerULTest, cornerUL, sizeof(float3), 0, cudaMemcpyDeviceToHost));

CUT_CHECK_ERROR("cudaMemcpyFromSymbol failed");

printf("HoloCUDAEyeEvaluator::evaluatePointSources: cornerUL on the device is located at (%f,%f,%f)\n", cornerULTest.x, cornerULTest.y, cornerULTest.z);



// upper right corner

printf("HoloCUDAEyeEvaluator::evaluatePointSources: cornerUR is located at (%f,%f,%f)\n", aPosf3[3].x, aPosf3[3].y, aPosf3[3].z);

CUDA_SAFE_CALL(cudaMemcpyToSymbol(cornerUR, &aPosf3[3], sizeof(float3), 0, cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpyToSymbol failed");

float3 cornerURTest;

CUDA_SAFE_CALL(cudaMemcpyFromSymbol(&cornerURTest, cornerUR, sizeof(float3), 0, cudaMemcpyDeviceToHost));

CUT_CHECK_ERROR("cudaMemcpyFromSymbol failed");

printf("HoloCUDAEyeEvaluator::evaluatePointSources: cornerUR on the device is located at (%f,%f,%f)\n", cornerURTest.x, cornerURTest.y, cornerURTest.z);



// lower left direction

printf("HoloCUDAEyeEvaluator::evaluatePointSources: directionLL points to (%f,%f,%f)\n", aDirf3[0].x, aDirf3[0].y, aDirf3[0].z);

CUDA_SAFE_CALL(cudaMemcpyToSymbol(directionLL, &aDirf3[0], sizeof(float3), 0, cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpyToSymbol failed");

float3 directionLLTest;

CUDA_SAFE_CALL(cudaMemcpyFromSymbol(&directionLLTest, directionLL, sizeof(float3), 0, cudaMemcpyDeviceToHost));

CUT_CHECK_ERROR("cudaMemcpyFromSymbol failed");

printf("HoloCUDAEyeEvaluator::evaluatePointSources: directionLL on the device is located at (%f,%f,%f)\n", directionLLTest.x, directionLLTest.y, directionLLTest.z);



// lower right direction

printf("HoloCUDAEyeEvaluator::evaluatePointSources: directionLR points to (%f,%f,%f)\n", aDirf3[1].x, aDirf3[1].y, aDirf3[1].z);

CUDA_SAFE_CALL(cudaMemcpyToSymbol(directionLR, &aDirf3[1], sizeof(float3), 0, cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpyToSymbol failed");

float3 directionLRTest;

CUDA_SAFE_CALL(cudaMemcpyFromSymbol(&directionLRTest, directionLR, sizeof(float3), 0, cudaMemcpyDeviceToHost));

CUT_CHECK_ERROR("cudaMemcpyFromSymbol failed");

printf("HoloCUDAEyeEvaluator::evaluatePointSources: directionLR on the device is located at (%f,%f,%f)\n", directionLRTest.x, directionLRTest.y, directionLRTest.z);



// upper left direction

printf("HoloCUDAEyeEvaluator::evaluatePointSources: directionUL points to (%f,%f,%f)\n", aDirf3[2].x, aDirf3[2].y, aDirf3[2].z);

CUDA_SAFE_CALL(cudaMemcpyToSymbol(directionUL, &aDirf3[2], sizeof(float3), 0, cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpyToSymbol failed");

float3 directionULTest;

CUDA_SAFE_CALL(cudaMemcpyFromSymbol(&directionULTest, directionUL, sizeof(float3), 0, cudaMemcpyDeviceToHost));

CUT_CHECK_ERROR("cudaMemcpyFromSymbol failed");

printf("HoloCUDAEyeEvaluator::evaluatePointSources: directionUL on the device is located at (%f,%f,%f)\n", directionULTest.x, directionULTest.y, directionULTest.z);



// upper right direction

printf("HoloCUDAEyeEvaluator::evaluatePointSources: directionUR points to (%f,%f,%f)\n", aDirf3[3].x, aDirf3[3].y, aDirf3[3].z);

CUDA_SAFE_CALL(cudaMemcpyToSymbol(directionUR, &aDirf3[3], sizeof(float3), 0, cudaMemcpyHostToDevice));

CUT_CHECK_ERROR("cudaMemcpyToSymbol failed");

float3 directionURTest;

CUDA_SAFE_CALL(cudaMemcpyFromSymbol(&directionURTest, directionUR, sizeof(float3), 0, cudaMemcpyDeviceToHost));

CUT_CHECK_ERROR("cudaMemcpyFromSymbol failed");

printf("HoloCUDAEyeEvaluator::evaluatePointSources: directionUR on the device is located at (%f,%f,%f)\n", directionURTest.x, directionURTest.y, directionURTest.z);

	

// calculate the grid and block size

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);

dim3 dimGrid(iWidth / dimBlock.x, iHeight / dimBlock.y);

// call the CUDA kernel with the appropriate grid and block size

// usa a CUDA timer

unsigned int timer = 0;

float elapsedTimeInMs = 0.0f;

CUDA_SAFE_CALL(cudaThreadSynchronize());

CUT_SAFE_CALL(cutCreateTimer(&timer));

printf("HoloCUDAEyeEvaluator::evaluatePointSources: Calling CUDA kernel...\n");

CUT_SAFE_CALL(cutStartTimer(timer));

evaluatePointSources_kernel<<<dimGrid, dimBlock>>>(iNumPointSources, iWidth, iHeight);

CUT_CHECK_ERROR("evaluatePointSources_kernel failed");

CUDA_SAFE_CALL(cudaThreadSynchronize());

CUT_SAFE_CALL(cutStopTimer(timer));

elapsedTimeInMs = CUT_SAFE_CALL(cutGetTimerValue(timer));

printf("HoloCUDAEyeEvaluator::evaluatePointSources: ...kernel execution took %f milliseconds.\n", elapsedTimeInMs);



//copy result data back

CUDA_SAFE_CALL(cudaMemcpy(targetData, d_resultData, iNumPointSources*sizeof(float2), cudaMemcpyDeviceToHost));

CUT_CHECK_ERROR("cudaMemcpy failed");



// free the texture

CUDA_SAFE_CALL(cudaUnbindTexture(sourceTexture));

CUT_CHECK_ERROR("cudaUnbindTexture failed");

CUDA_SAFE_CALL(cudaFreeArray(sourceArray));

CUT_CHECK_ERROR("cudaFreeArray failed");



// free the point source data

CUDA_SAFE_CALL(cudaFree(d_pointPositions));

CUT_CHECK_ERROR("cudaFree failed");

CUDA_SAFE_CALL(cudaFree(d_f3APhaseKs));

CUT_CHECK_ERROR("cudaFree failed");



// free the result data

CUDA_SAFE_CALL(cudaFree(d_resultData));

CUT_CHECK_ERROR("cudaFree failed");

printf("HoloCUDAEyeEvaluator::evaluatePointSources: all memory was released\n");

}

// the device part

extern “C” global void evaluatePointSources_kernel(int iNumPointSources, int iWidth, int iHeight)

{

// CUDA block index

int bX = blockIdx.x;

int bY = blockIdx.y;

// CUDA thread index

int tX = threadIdx.x;

int tY = threadIdx.y;

// calculate total coordinate of current thread which acutally corresponds to a single pixel

// being evaluated in the current thread

int iX = bX*blockDim.x + tX;

int iY = bY*blockDim.y + tY;

// obtain data for the current texel

float2 resTexel = tex2D(sourceTexture, (float)iX, (float)iY);

float3 positionSource = d_pointPositions[iY*iWidth+iX];

// calculate the current texel value:

// for all point sources...

for(int i=0; i<iNumPointSources; i++)

{

	// ...evaluate the current texel



	float3 f3APhaseK = d_f3APhaseKs[i];

	

	//float3 interpVPosition = interpolatePosition((float)iX/iWidth, (float)iY/iHeight);

	float3 interpVDirection = interpolateDirection(positionSource, (float)iX/(float)iWidth, (float)iY/(float)iHeight);

	

	float fR = f3Length(interpVDirection);

	float fSin, fCos;

	sincosf(fR*f3APhaseK.z + f3APhaseK.y, &fSin, &fCos);

	

	float fAmp = f3APhaseK.x / fR;

	

	float2 curTexel = make_float2(resTexel.x + fAmp*fCos, resTexel.y + fAmp*fSin);

	

	resTexel = curTexel;

	

}

// write the resulting color value to the global summation texture

__syncthreads();

d_resultData[iY*iWidth+iX] = resTexel;

//d_resultData[iY*iWidth+iX] = make_float2(1000.0f, 1000.0f);	

}

/*

  • This function replaces the fragment interpolation which is usually done

  • in the rendering pipeline between vertex and fragment processing.

  • It is a bilinear interpolation of the vertex position.

*/

extern “C” device float3 interpolatePosition(float x, float y)

{

float3 position = make_float3(cornerLL.x*(1-x)*(1-y) + cornerLR.x*x*(1-y) + cornerUL.x*(1-x)*y + cornerUR.x*x*y,

							  cornerLL.y*(1-x)*(1-y) + cornerLR.y*x*(1-y) + cornerUL.y*(1-x)*y + cornerUR.y*x*y,

							  cornerLL.z*(1-x)*(1-y) + cornerLR.z*x*(1-y) + cornerUL.z*(1-x)*y + cornerUR.z*x*y );

return position;

}

/*

  • This function replaces the fragment interpolation which is usually done

  • in the rendering pipeline between vertex and fragment processing.

  • It is a bilinear interpolation of the vertex direction relative to the point of evaluation \c f3PositionSource

*/

extern “C” device float3 interpolateDirection(float3 f3PositionSource, float x, float y)

{

float3 newDirectionLL = make_float3(directionLL.x - f3PositionSource.x,

									directionLL.y - f3PositionSource.y,

									directionLL.z - f3PositionSource.z);

float3 newDirectionLR = make_float3(directionLR.x - f3PositionSource.x,

									directionLR.y - f3PositionSource.y,

									directionLR.z - f3PositionSource.z);

float3 newDirectionUL = make_float3(directionUL.x - f3PositionSource.x,

									directionUL.y - f3PositionSource.y,

									directionUL.z - f3PositionSource.z);

float3 newDirectionUR = make_float3(directionUR.x - f3PositionSource.x,

									directionUR.y - f3PositionSource.y,

									directionUR.z - f3PositionSource.z);

// special solution where reference values are located at (0,0), (1,0), (0,1), (1,1)

float3 direction = make_float3(newDirectionLL.x*(1-x)*(1-y) + newDirectionLR.x*x*(1-y) + newDirectionUL.x*(1-x)*y + newDirectionUR.x*x*y,

							   newDirectionLL.y*(1-x)*(1-y) + newDirectionLR.y*x*(1-y) + newDirectionUL.y*(1-x)*y + newDirectionUR.y*x*y,

							   newDirectionLL.z*(1-x)*(1-y) + newDirectionLR.z*x*(1-y) + newDirectionUL.z*(1-x)*y + newDirectionUR.z*x*y );

							   

return direction;

}

/*

  • This small helper function computes the length of a float3 vector.

*/

extern “C” device float f3Length(float3 vec)

{

float length = vec.x*vec.x + vec.y*vec.y + vec.z*vec.z;

return sqrtf(length);

}

endif

You are passing “d_resultData” as a pointer to “cudaMalloc”. This just does NOT make sense. You need to pass a “void **” as you did in your previous code (devptr). The pointer you pass must be a CPU pointer (declared as void devptr). d_resultData is a device pointer. you cannot pass it to “cudaMalloc”. After that you need to copy the value of that pointer (which is of size sizeof(void)) to the “d_resultData” (your code is doing it). Please see my previous post and follow it precisely. You are confusing device pointers , symbols and host pointers thoroughly.

I have one piece of advice for you.

When you build big apps, please do it step by step. Make sure that things work the way you think it works. It is very very important .

And, please do post only your problematic code. Your post is too big – which is a deterrent for people to answer your questions. You may not find replies for big posts - usually.

All right, Good Luck! Sorry about sounding preachy. Hope you take it in the right spirit.

Best Regards,

Sarnath

Hi Sarnath, that’s perfectly ok…I know that my posting was way too long, but I thought that for experienced CUDA programmer that should be ok, since there is a lot of repeated code sections. Anyway, I was already told several times to split my code into different files and functions in order to make it readable… :-)

So I understand by now what I did wrong. The device pointer is really just a pointer, and one can’t allocate memory for it directly. So considering performance reasons I also pass the array data as input argument in the signature, same as all the NVidia CUDA examples do in the SDK. I wanted to improve the performance but since in the end the memory gets allocated the same way (location) whether one uses a lobal pointer or a pointer as an input argument doesn’t matter. So now my code actually works, the “only problem” is now that the performance is really bad. Not only that but also there seems to be some sort of memory problems or overflow in my kernel:

/*

  • This kernel evaluates all the point sources. Each thread calculates

  • one texel of the resulting texture.

*/

extern “C” global void evaluatePointSources_kernel(float2* d_resultData,

												   float3* d_pointPositions,

												   float2* d_f2APhases,

												   int iWidth,

												   int iHeight,

												   float fK,

												   float3* d_aDirf3 )

{

// calculate total coordinate of current thread which acutally corresponds to a single pixel

// being evaluated in the current thread

int iX = blockIdx.x*blockDim.x + threadIdx.x;

int iY = blockIdx.y*blockDim.y + threadIdx.y;

// obtain data for the current texel

float2 resTexel = make_float2(0.0f,0.0f);

// normalized texel location for billinear interpolation

float x = (float)iX/(float)iWidth;

float y = (float)iY/(float)iHeight;



// calculate the current texel value:

// for all point sources...

//#pragma unroll 1 // prevents from ever unrolling this loop

for(int i=0; i<(iWidth*iHeight); i++)

{

	// ...evaluate the current texel

	float2 f2APhase = d_f2APhases[i];

	float3 positionSource = d_pointPositions[i];

	float3 newDirectionLL = make_float3(d_aDirf3[0].x - positionSource.x,

										d_aDirf3[0].y - positionSource.y,

										d_aDirf3[0].z - positionSource.z);

	float3 newDirectionLR = make_float3(d_aDirf3[1].x - positionSource.x,

										d_aDirf3[1].y - positionSource.y,

										d_aDirf3[1].z - positionSource.z);

	float3 newDirectionUL = make_float3(d_aDirf3[2].x - positionSource.x,

										d_aDirf3[2].y - positionSource.y,

										d_aDirf3[2].z - positionSource.z);

	float3 newDirectionUR = make_float3(d_aDirf3[3].x - positionSource.x,

										d_aDirf3[3].y - positionSource.y,

										d_aDirf3[3].z - positionSource.z);

	

	// billinear interpolation

	float3 interpVDirection = make_float3(newDirectionLL.x*(1-x)*(1-y) + newDirectionLR.x*x*(1-y) + newDirectionUL.x*(1-x)*y + newDirectionUR.x*x*y,

										  newDirectionLL.y*(1-x)*(1-y) + newDirectionLR.y*x*(1-y) + newDirectionUL.y*(1-x)*y + newDirectionUR.y*x*y,

										  newDirectionLL.z*(1-x)*(1-y) + newDirectionLR.z*x*(1-y) + newDirectionUL.z*(1-x)*y + newDirectionUR.z*x*y );

	

	float fR = sqrtf(interpVDirection.x*interpVDirection.x + interpVDirection.y*interpVDirection.y + interpVDirection.z*interpVDirection.z);

	float fSin, fCos;

	sincosf(fR*fK + f2APhase.y, &fSin, &fCos);

	

	float fAmp = f2APhase.x / fR;

	

	resTexel = make_float2(resTexel.x + fAmp*fCos, resTexel.y + fAmp*fSin);

	

}

// write the resulting color value to the summation array

d_resultData[iY*iWidth+iX] = resTexel;

}

Now each thread corresponds to one texel being evaluated. In my case this is a 512x512 texture. But my kernel only runs fine on a 128x128 texture size, for larger textures I end up with strange memory locations and pointers such as 00000000 or CCCCCCCC, so there must be something like an overflow or the like.

So again: In each thread one texel is evaluated, but for this texel, I have to traverse an array which is as long as the texture dims (e.g. width*height)…could this be the problem? Is there some kind of loop limit? I could understand if the kernel takes some time, but the result should be correct which it is not. From my point of view, if a loop fits into the local thread memory whith all its local loop variables, why should should it matter how often the loop is evaluated? I know that there is some memory issue, since I can iterate through the kernel loop only like 1024 times in case of 512x512 input data. But for the 128x128 texture the loop runs fine, but obviously 128x128=16384 loop iterations which is way larger than 1024. So the loop count is coupled to the memory usage, e.g. input to the kernel…Now does anybody have an idea as how to tackle this problem? I would be very happy if somebody could help…I am pretty sure that I am doing something wrong here but I just can’t see it… External Media

Right now I am doing some precalculations and will use texturing units to pass my data since I did some good experience with this approach in Cg shader code.

So I switched to textures and my kernel is quite fast by now. Still the problem remains: My kernel runs perfectly for 128x128 data size (iWidth,iHeight) but for larger values the result is black, e.g. most probably there are some issues I don’t understand, because the .cubin file shows:

name = evaluatePointSources_kernel

	lmem = 28

	smem = 36

	reg = 20

I would be extremely happy if somebody would have a suggestion as where the problem might be :) Or how can I check for memory issues considering uncoalesced access or overflow for example?

Here is the current kernel:

extern "C" __global__ void evaluatePointSources_kernel(float2* d_resultData,

                int iWidth,

                int iHeight,

                float fK,

                float3* d_aDirf3 )

{

	// calculate total coordinate of current thread which acutally corresponds to a single pixel

	// being evaluated in the current thread

	unsigned int iX = blockIdx.x*blockDim.x + threadIdx.x;

	unsigned int iY = blockIdx.y*blockDim.y + threadIdx.y;

	// obtain data for the current texel

	float2 resTexel = make_float2(0.0f,0.0f);

	float4 direction = tex2D(interpDirectionsTexture, (float)iX, (float)iY);

	

	// calculate the current texel value:

	// for all point sources...

	float fAmp, fR, fSin, fCos;

	float2 f2APhase;

	float3 newDirection;

	float4 positionSource;

	//#pragma unroll 1 // prevents from ever unrolling this loop

	for(int j=0; j<iHeight; j++)

	{

  for(int i=0; i<iWidth; i++)

  {

 	// ...evaluate the current texel

 	//__syncthreads();

  	f2APhase = tex2D(amplitudePhaseTexture, (float)i, (float)j);

  	positionSource = tex2D(pointPositionsTexture, (float)i, (float)j);

 	newDirection = make_float3(direction.x-positionSource.x, direction.y-positionSource.y, direction.z-positionSource.z);

 	fR = sqrtf(newDirection.x*newDirection.x + newDirection.y*newDirection.y + newDirection.z*newDirection.z);

 	sincosf(fR*fK + f2APhase.y, &fSin, &fCos);

  	

  	fAmp = f2APhase.x / fR;

  	

  	resTexel = make_float2(resTexel.x + fAmp*fCos, resTexel.y + fAmp*fSin);

 }	

	}

	// write the resulting color value to the summation array

	//__syncthreads();

	d_resultData[iY*iWidth+iX] = resTexel;

	

}

I’ve optimized my code and most importantly the grid and block dimensionsions using the occupancy calculator. Now I am able to calculate textures of size 256x256 (700ms, all textures have the same size) but I get the same strange memory behaviour for a size of 512x512: The first call to my kernel seems to be fine but in the following ones my d_resultData pointer suddenly points to 00000000…am I really runnging out of memory? How can I find out? :huh: I would be happy for any suggestions since I am beginning to External Image here …

The .cubin file shows:

name = evaluatePointSources_kernel

	lmem = 0

	smem = 32

	reg = 15

	bar = 0

Here is my current kernel:

/*

 * This kernel evaluates all the point sources. Each thread calculates

 * the influence of all point sources on one texel of the resulting texture.

 *

 */

extern "C" __global__ void evaluatePointSources_kernel(float2* d_resultData,

                int iWidth,

                int iHeight,

                float fK)

{

	// calculate total coordinate of current thread which acutally corresponds to a single pixel

	// being evaluated in the current thread

	unsigned int iX = blockIdx.x*blockDim.x + threadIdx.x;

	unsigned int iY = blockIdx.y*blockDim.y + threadIdx.y;

	// obtain data for the current texel

	float2 resTexel = make_float2(0.0f,0.0f);

	float4 direction = tex2D(interpDirectionsTexture, (float)iX, (float)iY);

	

	// for all point sources...

	float fAmp, fR, fSin, fCos;

	float2 f2APhase;

	float4 positionSource;

	//#pragma unroll 1 // prevents from ever unrolling this loop

	for(int j=0; j<iHeight; j++)

	{

  for(int i=0; i<iWidth; i++)

  {

 	// ...evaluate the influence on the current texel

 	positionSource = tex2D(pointPositionsTexture, (float)i, (float)j);

  	f2APhase = tex2D(amplitudePhaseTexture, (float)i, (float)j);

  	

  	direction = make_float4(direction.x-positionSource.x, direction.y-positionSource.y, direction.z-positionSource.z, 0.0f);

  	fR = f4Length(direction);

 	__sincosf(fR*fK + f2APhase.y, &fSin, &fCos);

  	

  	fAmp = __fdividef(f2APhase.x,fR);

  	

  	resTexel = make_float2(resTexel.x + fAmp*fCos, resTexel.y + fAmp*fSin);

 }	

	}

	

	// write the resulting texel color value to the summation array

	d_resultData[iY*iWidth+iX] = resTexel;

	

}

I haven’t been following this tread from the beginning. Just reading your latest post, I have a few questiond.
What grid/thread dimensions are you running? You may be exhausting register resources.
You say your result data pointer is NULL, did you check for error messages returned by the cudaMalloc that allocates it?
How long is your kernel running? You may be hitting the 5s limitation.

I know you are troubleshooting what you have now, but there might be more options for optimizing your kernel. John Stone calculates potentials on a 3D grind in this paper [url=“TCB Publications - Abstract”]http://www.ks.uiuc.edu/Publications/Papers...tbcode=STON2007[/url] . Some of the optimization techniques he uses might work out in your 2D grid.

Hello Mr. Anderson,

the block dimensions are (16,16) or (24,16) depending upon the occupancy calculator and minor midifications to the kernel. Then I divide the texture dimensions by the block dimensions, so the grid dimensions are (texWidth/16, texHeight/16) or (texWidth/24,texHeight/16), respectively.

I check the cudaMalloc with a CUDA_SAFE_CALL and a CUT_CHECK_ERROR, so I am pretty sure there are no memory errors.

However, my kernel needs about 9000ms for a 512x512 data set…now what is the 5s rule? :huh:

Thanks a lot for your help, especially the paper. I will read it right away :)

242615 = 5760, so your register usage should be fine.

It seems like you might be hitting the 5s limitation. In windows with the GPU on the primary display or in Linux with X-windows, any single kernel can only execute for 5 seconds. The only way around this is to run on the linux console without X or run on a 2nd display adapter in windows.

Hi all,

I have a question concerning the causes behind cudaerror_enum.
I use several arrays of shared data in the kernel, but as long as i don’t use this data i can have blocks of size up to 512 threads.
But as soon as I try to use the shared data, the kernel failed (I use cutilCheckMsg(“Kernel execution failed”);

So is it normal that the kernel don’t actually load the data, if it notices that i don’t use it ?

I use a Quadro FX 4600 and CUDA 2.1

I read in this thread that the parameters given to the kernel are also loaded in shared memory. What are all others parameters and data that we need to take into account when we need to calculate the size of the shared memory ? (to make sure that we are not overloading it) If someone from nvidia could answer that latter question, that would be really nice of him !

Tom

I have the same problem

It’s hard to debug your program since you don’t post the program in entirety. (If you do that, you use “upload” as an attachment.) For CUDA memory issues, I just wrote a wrapper for CUDA memory API (with macros for cudaMalloc, Free, etc) to help debug those calls. See http://code.google.com/p/cuda-memory-debug/. It might help you catch buffer overruns, wrong use of pointers, etc.