Kernel works in device emulation but not on the device (involves copying a struct with a dynamically allocated array)

I’ve got some code that isn’t working as expected. It may be that I’ve been staring at this for far too long, but I cannot get it working right. NB: this is my first CUDA application, so any help is appreciated. Please let me know if more details are required; I’m still just a newbie :).

    [*]when compiled with the -deviceemu flag

      [*]works as expected when run in the VS2008 debugger

      [*]works as expected when run from command line

    [*]when compiled without the -deviceemu flag

      [*]crashes when run in the VS2008 debugger (crash happens when kernel is executed)

      [*]fills h_sum_out with 0.0000 values when run from command line

#define BLOCK_SIZE 200

struct svm_node
{
	int index;
	float value;
};

struct svm_problem
{
	unsigned int rowCount;        // number of rows
	float* y;                     // label for this vector (used only in a classifier and can be ignored atm)
	unsigned int* columnCount;    // number of columns in each row
	unsigned long* columnOffsets; // starting offset of each row's columns
	svm_node* x;                  // a flattened 2D variable-length array
};

The data structures above represent a collection of sparse N-dimensional vectors. The code below works on only a single svm_problem at a time. The program flow is as follows:

[list=1]
[*]Create an svm_problem "problem" on the host and fill it with some data:
[list=1]
[*]malloc for the members y, columnCount and columnOffsets
[*]Set the number of columns/dimensions of every vector in problem to something like 10
[*]Calculate the offset of each row in the 1D array problem.x
[*]malloc for the 1D array problem.x
[*]Fill problem.x with vectors
[/list]
[*]Create an svm_problem d_problem_host on the host whose members point to device memory (marshalling code):
[list=1]
[*]cudaMalloc for the d_problem_host members
[*]cudaMemcpy the problem members to the d_problem_host members
[/list]
[*]Create an svm_problem d_problem which is stored on the device:
[list=1]
[*]cudaMalloc for d_problem
[*]cudaMemcpy d_problem_host to d_problem
[/list]
[*]Prepare d_sum_out and h_sum_out
[*]Execute the kernel
[/list]

The kernel svm_dot_kernel calculates the square (the dot product with itself) of the xNodes parameter using the dot function. It isn’t specific to an svm_problem, and therefore it takes an array of svm_nodes instead of an svm_problem. It’s worth mentioning that svm_dot_kernel worked before I put the svm_nodes into an svm_problem.

// outputs a single float to the "float* sum_out" parameter
__device__ void device_dot(svm_node *px, svm_node *py,
		const unsigned int indexI, const unsigned int indexJ, float* sum_out,
		unsigned long* integerOffsets, unsigned int* columnCount) {
	float sum = 0;

	unsigned int i = integerOffsets[indexI];
	unsigned int j = integerOffsets[indexJ];

	while(i < (integerOffsets[indexI] + columnCount[indexI])
			&& j < (integerOffsets[indexJ] + columnCount[indexJ]))
	{
		if(px[i].index != -1 && py[j].index != -1)
		{
			sum += px[i].value * py[j].value;
			++i;
			++j;
		}
		else
		{
			if(px[i].index > py[j].index)
				++j;
			else
				++i;
		}
	}

	*sum_out = sum;
}

__global__
void svm_dot_kernel(svm_node* xNodes, float* sum_out, unsigned long* integerOffsets,
		unsigned int* columnCount, unsigned int rowCount) {
	unsigned int i = threadIdx.x;
	unsigned int j = blockIdx.x * BLOCK_SIZE;
	unsigned int index = j + i;

	if(index < rowCount)
	{
		float value = 0.0;
		device_dot(xNodes, xNodes, index, index, &value, integerOffsets, columnCount);
		sum_out[index] = value;
	}
}

void
runTest( int argc, char** argv)
{
	// initialize a problem
	svm_problem problem = {0};
	problem.rowCount = 500;
	problem.y = (float*) malloc(sizeof(float)*problem.rowCount);
	problem.columnCount = (unsigned int*) malloc(sizeof(unsigned int)*problem.rowCount);
	problem.columnOffsets = (unsigned long*) malloc(sizeof(unsigned long)*problem.rowCount);

	unsigned long size = 0;
	unsigned int memsize;

	// use the command-line specified CUDA device, otherwise use the device with the highest Gflops/s
	if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )
		cutilDeviceInit(argc, argv);
	else
		cudaSetDevice( cutGetMaxGflopsDeviceId() );

	unsigned int timer = 0;
	cutilCheckError( cutCreateTimer( &timer));
	cutilCheckError( cutStartTimer( timer));

	// Create a list of column counts to test; in reality this will be
	// random and created by parsing a file
	for(unsigned int i = 0; i < problem.rowCount; i++)
	{
		problem.columnCount[i] = 40;
	}

	// Calculate the offset for each row
	for(unsigned int i = 0; i < problem.rowCount; i++)
	{
		problem.columnOffsets[i] = size;
		size += problem.columnCount[i];
	}

	memsize = sizeof(svm_node)*size;
	problem.x = (svm_node*) malloc(memsize);

	// Fill problem.x with data
	for(unsigned int i = 0; i < problem.rowCount; i++)
	{
		for(unsigned int j = 0; j < problem.columnCount[i]; j++)
		{
			svm_node item;
			item.index = j;
			item.value = 0.5f;
			problem.x[problem.columnOffsets[i]+j] = item;
		}
	}

	// allocate memory for d_problem - marshalling
	svm_problem* d_problem;
	svm_problem d_problem_host; // used for marshalling
	d_problem_host.rowCount = problem.rowCount;

	cutilSafeCall( cudaMalloc((void**) &d_problem, sizeof(svm_problem)) );
	cutilSafeCall( cudaMalloc((void**) &d_problem_host.y, sizeof(float)*problem.rowCount) );
	cutilSafeCall( cudaMalloc((void**) &d_problem_host.columnCount, sizeof(unsigned int)*problem.rowCount) );
	cutilSafeCall( cudaMalloc((void**) &d_problem_host.columnOffsets, sizeof(unsigned long)*problem.rowCount) );
	cutilSafeCall( cudaMalloc((void**) &d_problem_host.x, memsize) );

	// copy problem to the device
	cutilSafeCall( cudaMemcpy( d_problem_host.y, problem.y, sizeof(float)*problem.rowCount, cudaMemcpyHostToDevice) );
	cutilSafeCall( cudaMemcpy( d_problem_host.columnCount, problem.columnCount, sizeof(unsigned int)*problem.rowCount, cudaMemcpyHostToDevice) );
	cutilSafeCall( cudaMemcpy( d_problem_host.columnOffsets, problem.columnOffsets, sizeof(unsigned long)*problem.rowCount, cudaMemcpyHostToDevice) );
	cutilSafeCall( cudaMemcpy( d_problem_host.x, problem.x, memsize, cudaMemcpyHostToDevice) );
	cutilSafeCall( cudaMemcpy( d_problem, &d_problem_host, sizeof(svm_problem), cudaMemcpyHostToDevice) );

	float* d_sum_out;
	cutilSafeCall( cudaMalloc( (void**) &d_sum_out, sizeof(float)*problem.rowCount) );

	// Fill problem.x with some other data to test whether d_problem points to the same values as problem
	for(unsigned int i = 0; i < problem.rowCount; i++)
	{
		for(unsigned int j = 0; j < problem.columnCount[i]; j++)
		{
			svm_node item;
			item.index = j;
			item.value = 1.5f;
			problem.x[problem.columnOffsets[i]+j] = item;
		}
	}

	free(problem.x);

	// kernel parameters
	unsigned int numBlocks = problem.rowCount/BLOCK_SIZE + 1;
	dim3 blocks(numBlocks, 1, 1);
	dim3 threads(BLOCK_SIZE, 1, 1);

	// execute kernel
	svm_dot_kernel<<< blocks, threads >>>( d_problem->x, d_sum_out, d_problem->columnOffsets, d_problem->columnCount, d_problem->rowCount);
	cutilCheckMsg("Kernel execution failed");

	float* h_sum_out = (float*) malloc(sizeof(float)*problem.rowCount);
	cutilSafeCall( cudaMemcpy( h_sum_out, d_sum_out, sizeof(float)*problem.rowCount, cudaMemcpyDeviceToHost) );

	cutilCheckError( cutStopTimer( timer));
	printf( "Processing time: %f (ms)\n", cutGetTimerValue( timer));
	cutilCheckError( cutDeleteTimer( timer));

	for(unsigned int i = 0; i < problem.rowCount; i++)
	{
		printf("%f \n", h_sum_out[i]);
	}

	free(problem.y);
	free(problem.columnCount);
	free(problem.columnOffsets);
	free(h_sum_out);
	cutilSafeCall( cudaFree(d_sum_out));
}

What do you guys think?

I faced a similar problem too: I was not allocating memory properly for the structures I was using in my program. You may want to look at this thread:

[link] http://forums.nvidia.com/index.php?showtop…&pid=590259 [/link]

The arguments to your kernel launch in runTest() are wrong:

svm_dot_kernel<<< blocks, threads >>>( d_problem->x, d_sum_out, d_problem->columnOffsets, d_problem->columnCount, d_problem->rowCount);

modify it as

svm_dot_kernel<<< blocks, threads >>>( d_problem_host.x, d_sum_out, d_problem_host.columnOffsets,
		d_problem_host.columnCount, d_problem_host.rowCount);

Note that d_problem holds an address in device memory, so host code CANNOT dereference it: evaluating d_problem->x on the host reads device memory. d_problem_host, by contrast, is a host struct whose members hold device pointers, which is exactly what the kernel needs.

Second, in the device function device_dot you use

if(px[i].index != -1 && py[j].index != -1)
{
	sum += px[i].value * py[j].value;
	++i;
	++j;
}
else
{
	if(px[i].index > py[j].index)
		++j;
	else
		++i;
}

What does “px[i].index != -1” mean? Why not use

if(px[i].index > py[j].index)
{
	++j;
}
else if(px[i].index < py[j].index)
{
	++i;
}
else
{
	sum += px[i].value * py[j].value;
	++i;
	++j;
}

Thanks a lot for the thorough explanation. It works now!

As for the index != -1 part: I just copied a function from another program where index == -1 represents a null, i.e. “end of row”, instead of relying on the i and j counters.