cudaMalloc struct problems - unspecified launch failure

Hi there, I have a problem I can’t seem to solve with a struct on CUDA.

I don’t put any data in the struct beforehand; I create it in the kernel.

The problem is that when I try to do some computations with the struct in the kernel, I get a

“Cuda error: Kernel Synchronization: unspecified launch failure.”

Here is my struct:

(I defined VALUESIZE = 100)

struct  prefixArray {

	int value[VALUESIZE];		//An array of hash values

	int size;		//Hold the total size of the value array

	int index[VALUESIZE];		//Store the pattern number

};

main - allocating space on the device and passing the struct to the kernel:

struct prefixArray *d_PREFIX;

	cudaMalloc( (void **) &d_PREFIX, sizeof( struct prefixArray ) * shiftsize );

	checkCUDAError("Malloc");

        (...)

        wuKernel<<< dimGrid, dimBlock, sharedMemSize >>>( ........ d_PREFIX, ......);

and inside the kernel:

__global__ void wuKernel(....... struct prefixArray *d_PREFIX, ...... ) {

        (...)

        d_PREFIX[hash].value[d_PREFIX[hash].size]=prefixhash;

	d_PREFIX[hash].index[d_PREFIX[hash].size]=j;

	d_PREFIX[hash].size++;

        (...)

}

I was thinking that maybe I have to copy some dummy data into the struct for it to work.

Any ideas? Let’s hope it’s a syntax error.

(compute capability 1.3)

Oh! I just realised I posted this in the wrong section!

I’m sorry, can an admin relocate this post to: “CUDA Programming and Development”?

Can you post a self-contained example that reproduces your problem? Just from the snippets you posted I can’t tell anything.

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

#include <math.h>

#include <assert.h>

#include <cutil_inline.h>

#include <cuda.h>

#include <cuPrintf.cu>

#include "functions.h"

#define CUPRINTF cuPrintf

#define VALUESIZE 100

// Global Variables

int results = 0;

int verbose = 0;

int memory_usage = 0;

int create_data = 0;

struct  prefixArray {

	int value[VALUESIZE];		//An array of hash values

	int size;			//Hold the total size of the value array

	int index[VALUESIZE];		//Store the pattern number

};

__global__ void wuKernel( unsigned char **d_pattern, unsigned char *d_text, int *d_matches, int m, int n, int blocksize, long shiftsize, int *d_SHIFT, struct prefixArray *d_PREFIX, int p_size ) {

	int d_verbose = 1;

	 

	int idx = blockDim.x * blockIdx.x + threadIdx.x; //thread index

	int start = idx * n / ( blockDim.x * blocksize );

	int stop = start + n / ( blockDim.x * blocksize );

	

	int startThread = threadIdx.x * ( m / blockDim.x );

	

	int stopThread = startThread + ( m / blockDim.x );

	

	__syncthreads();

	int i, j, q, B = 3;

	

	unsigned int hash;

	

	//The original paper uses Hbits = 5 in the example code given

	unsigned short m_nBitsInShift = 4;

		

	size_t shiftlen, prefixhash;

	//default to m - B + 1 for shift;

	for ( i = 0; i < shiftsize; i++ )

		d_SHIFT[i] = m - B + 1;

	

	//Preprocessing phase 		

	//for each pattern

	for ( j = 0; j < p_size; ++j ) {

		//add each 3-character subpattern (similar to q-grams)

		for ( q = m; q >= B; --q ) {

			

			hash  = d_pattern[j][q - 2 - 1]; // bring in offsets of X in pattern j

			hash <<= m_nBitsInShift;

			hash += d_pattern[j][q - 1 - 1];

			hash <<= m_nBitsInShift;

			hash += d_pattern[j][q     - 1];

			CUPRINTF("hash = %i %i %i %i\n", hash, d_pattern[j][q - 2 - 1], d_pattern[j][q - 2], d_pattern[j][q - 1]);

			shiftlen = m - q;

			d_SHIFT[hash] = MIN2( d_SHIFT[hash], shiftlen );

			//calculate the hash of the prefixes for each pattern

			if ( shiftlen == 0 ) {

				prefixhash = d_pattern[j][0];

				prefixhash <<= m_nBitsInShift;

				prefixhash += d_pattern[j][1];

			

			//Error while trying to access d_PREFIX (uncomment one of the 3 lines)

			//	d_PREFIX[hash].value[d_PREFIX[hash].size] = prefixhash;

			//	d_PREFIX[hash].index[d_PREFIX[hash].size] = j;

			//	d_PREFIX[hash].size++;

			}

		}

	}

}

int main( int argc, char** argv)

{	

	

	int m, n, alphabet, p_size;

	unsigned int i;

       	       	 

	char *text_filename = ( char * ) malloc ( sizeof( char ) * 100 );

	char *pattern_filename = ( char * ) malloc ( sizeof( char ) * 100 );

	// Define grid and block size 

	int numThreadsPerBlock, numBlocks;

       	int sharedMemSize = 16000;

	

	// Scan command line arguments    	

		m = atoi(argv[1]);

		n = atoi(argv[2]);

		p_size = atoi(argv[3]);

		alphabet = atoi(argv[4]);

		numBlocks = atoi(argv[5]);

		numThreadsPerBlock = atoi(argv[6]);

	

	// Data paths

	text_filename = "data/text";

	pattern_filename = "data/pattern";

    	

	// HOST malloc

    	unsigned char *text = ( unsigned char * ) malloc ( n * sizeof ( unsigned char ) );

	unsigned char **pattern = ( unsigned char ** ) malloc ( m * sizeof( unsigned char * ));

	if( pattern == NULL )

		fail("Failed to allocate array!\n");

	for( i = 0; i < m; i++ ) {

		pattern[i] = ( unsigned char * ) malloc ( p_size * sizeof( unsigned char ));

		if( pattern[i] == NULL )

			fail("Failed to allocate array!\n");

	}

	int *cuda_matches = ( int * ) malloc ( sizeof( int ) * m );

	int *results_array = ( int * ) malloc ( sizeof( int ) * ( ( n - m + 1) / m ) );

	load_files( pattern, text, m, n, pattern_filename, text_filename, p_size );

	// Device Pointers

	unsigned char **d_pattern;

   	unsigned char *h_temp[p_size], *d_text;  

	int *d_matches;

	// Create 2D Array

	cudaMalloc( (void **)&d_pattern, m * sizeof( unsigned char * ) );

	for( i = 0; i < p_size; i++ )

	       cudaMalloc( (void **)&h_temp[i], m * sizeof( unsigned char ) );	

	

  	cudaMemcpy(d_pattern, h_temp, m * sizeof( unsigned char * ), cudaMemcpyHostToDevice);

  		

	for( i = 0; i < p_size; i++ )

		cudaMemcpy(h_temp[i], pattern[i], m * sizeof( unsigned char ), cudaMemcpyHostToDevice);

	checkCUDAError("Memcpy");

	

	cudaMalloc( (void **) &d_text, n * sizeof ( unsigned char ) );

	checkCUDAError("Malloc");

	

	cudaMalloc( (void **) &d_matches, m * sizeof ( int ) );

	checkCUDAError("Malloc");

	cudaMemcpy( d_text, text, n * sizeof ( unsigned char ), cudaMemcpyHostToDevice );

cudaMemcpy( d_matches, cuda_matches, m * sizeof( int ), cudaMemcpyHostToDevice );	

	

	printf(" * Copying host to device array...\n");

	// Threads init

	dim3 dimGrid(numBlocks);

	dim3 dimBlock(numThreadsPerBlock);

	

	// WU CUDA Run

	long shiftsize = 32768;

      	

	int *d_SHIFT;

	cudaMalloc( (void **) &d_SHIFT, sizeof( int ) * shiftsize );

	checkCUDAError("Malloc");

	struct prefixArray *d_PREFIX;

	cudaMalloc( (void **) &d_PREFIX, sizeof( struct prefixArray ) * shiftsize );

	checkCUDAError("Malloc");

	// Kernel Launch

	cudaPrintfInit();

	

	wuKernel<<< dimGrid, dimBlock, sharedMemSize >>>( d_pattern, d_text, d_matches, m, n, numBlocks, shiftsize, d_SHIFT, d_PREFIX, p_size );

	checkCUDAError("Kernel Invocation");

	// Block until the device has completed

	cudaThreadSynchronize();

	checkCUDAError("Kernel Synchronization");

	// Copy Device mem to Host mem

	for( int i = 0; i < p_size; i++)

	   	cudaMemcpy( pattern[i], h_temp[i], m * sizeof( unsigned char ), cudaMemcpyDeviceToHost);

	

	//Print GPU results

	cudaPrintfDisplay(stdout, true);

	// de-allocation

	cudaPrintfEnd();

//freeing device memory

	cudaThreadExit();

			

// freeing host memory

	

	return 0;

}

I tried narrowing it down (it is 700 lines :p)

It’s an algorithm for multiple pattern matching. It’s a port of the CPU version to CUDA.

The CPU version runs first (not shown in the snippet) and then the GPU version.

Assume that everything else loads fine.

The problem is in the kernel, where I put a comment marking the spot. Uncommenting any of those lines results in the error mentioned above, so obviously there is a problem with the struct.

I also tried creating a struct in main and cudaMemcpy’ing it to the struct on the GPU, but nothing changed.

This example still isn’t self-contained, so I can’t test it.

You seem to assume that memory returned by cudaMalloc() is initialized to zero, which it isn’t.

Are you sure your test patterns are never > 127, which would result in an access outside of d_PREFIX?
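
If in doubt, you could verify that on the host before launching the kernel. A rough sketch (the helper name is mine, and it assumes the same 4-bit-shift hash construction and the pattern, m, p_size and shiftsize variables from your posted main()):

#include <stdio.h>

// Hypothetical host-side check: rebuild the kernel's hash for every 3-character
// window of every pattern and report any value that would index past the tables.
void check_hash_range( unsigned char **pattern, int m, int p_size, long shiftsize )
{
    for ( int j = 0; j < p_size; ++j ) {
        for ( int q = m; q >= 3; --q ) {
            unsigned int h = pattern[j][q - 3];
            h <<= 4;
            h += pattern[j][q - 2];
            h <<= 4;
            h += pattern[j][q - 1];
            if ( h >= (unsigned long) shiftsize )
                printf( "pattern %d, offset %d: hash %u >= table size %ld\n", j, q, h, shiftsize );
        }
    }
}

Anything printed there would also overrun d_SHIFT, since both tables are indexed with the same hash.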

There are calls to functions in 2 external files, so I couldn’t post those as well.

The memory allocation of the 2D array on the GPU is taken from sarnath’s example.
When the results come back from the GPU to the CPU, I print them and the array data are as they should be. (If I make a change to the array on the GPU, I can see it in the copied array on the CPU.)

About the pattern size: the same code runs on the CPU too, it’s identical, and the values there are correct. I tried running it with fixed values but it still gives the same error.

Thanks for your time :)
Still haven’t found a solution. Gonna take a breather and try again tomorrow (it’s 8:42 here).

I mean you obviously assume that d_PREFIX[hash].size starts from zero, which is not the case. You can have random data in the memory returned by cudaMalloc(), which will then lead to an out-of-bounds access when it is used in the expression d_PREFIX[hash].value[d_PREFIX[hash].size].

Ooooooh! Now I get it! I’ll try passing in some zero values in a loop.

Thanks! :D

You were absolutely right! A million thanks for noting this simple solution :)

(The same problem doesn’t show up in the CPU program, but to be sure I will initialize it there too.)

malloc() on the CPU doesn’t zero memory either - although you quite often find the memory cleared because the OS has to zero pages it hands out for security reasons. calloc() does return zeroed memory.

(I’m not sure about the security implications for GPUs - I sometimes wonder when I see screen content from a different user for a short moment…).
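
On the device side, the closest equivalent to calloc() is probably to follow the cudaMalloc() with a cudaMemset() — a minimal sketch, reusing the d_PREFIX allocation and checkCUDAError() from the code above:

struct prefixArray *d_PREFIX;
cudaMalloc( (void **) &d_PREFIX, sizeof( struct prefixArray ) * shiftsize );
checkCUDAError("Malloc");

// zero the whole allocation so every d_PREFIX[hash].size starts at 0
cudaMemset( d_PREFIX, 0, sizeof( struct prefixArray ) * shiftsize );
checkCUDAError("Memset");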

Another problem has emerged. It seems that the above is not enough for the d_PREFIX struct.

I am now running my program with multiple threads/blocks.

I have 3 warnings when I compile:

"Warning: Cannot tell what pointer points to, assuming global memory space"

and they all refer to lines that access d_PREFIX.

When I have 1 block and 32 threads (max), everything runs smoothly.

But when I have 33+ threads, my results are different in every run.

The same with 2+ blocks, any number of threads.

So I came to the conclusion that once again I have an out-of-bounds memory access, courtesy of my beloved d_PREFIX struct.

I noted the warnings in the code.

Code follows:

Kernel

__global__ void wuKernel( unsigned char **d_pattern, unsigned char *d_text, int *d_matches, int m, int n, int blocksize, int *d_SHIFT, struct prefixArray *d_PREFIX, int p_size, int shiftsize ) {

	 

	int idx = (blockDim.x * blockIdx.x) + threadIdx.x; //thread index

	int start = idx * n / ( blockDim.x * blocksize );

	int stop = start + n / ( blockDim.x * blocksize );	

	//int startThread = threadIdx.x * ( m / blockDim.x );

	//int stopThread = startThread + ( m / blockDim.x );

	

	int i, j, q, B = 3;	

	

	__syncthreads();

	//Safety precaution: The first thread must start at position m - 1, not 0!!

	if ( start < m - 1 )

		start = m - 1;	

	

	unsigned int hash=0, hash1=0, hash2=0;

	unsigned short m_nBitsInShift = 4;

	size_t shift = 0, shiftlen = 0, prefixhash = 0;

	int index = m - 1;

	

	//Preprocessing phase 		

	//for each pattern

	for ( j = 0; j < p_size; ++j ) {

		//add each 3-character subpattern (similar to q-grams)

		for ( q = m; q >= B; --q ) {

			

			hash  = d_pattern[j][q - 2 - 1]; // bring in offsets of X in pattern j

			hash <<= m_nBitsInShift;

			hash += d_pattern[j][q - 1 - 1];

			hash <<= m_nBitsInShift;

			hash += d_pattern[j][q     - 1];

			shiftlen = m - q;

			

			d_SHIFT[hash] = MIN2( d_SHIFT[hash], shiftlen );

			

			//calculate the hash of the prefixes for each pattern

			if ( shiftlen == 0 ) {

				prefixhash = d_pattern[j][0];

				prefixhash <<= m_nBitsInShift;

				prefixhash += d_pattern[j][1];

			

/////warning line///////////////d_PREFIX[hash].value[d_PREFIX[hash].size] = prefixhash;

				d_PREFIX[hash].index[d_PREFIX[hash].size] = j;

				d_PREFIX[hash].size++;

			}

		}

	} 

	__syncthreads();

	

	index = start;

		

	while ( index < stop ) {

		

		hash1 = d_text[index - 2];

		hash1 <<= m_nBitsInShift;

		hash1 += d_text[index - 1];

		hash1 <<= m_nBitsInShift;

		hash1 += d_text[index];

		shift = d_SHIFT[ hash1 ];

		

		__syncthreads();

		

		if ( shift > 0 ) 

			index += shift;

			

		else {	

			

			//when shift = 0 we have a potential match			

			hash2 = d_text[index - m + 1];

			hash2 <<= m_nBitsInShift;

			hash2 += d_text[index - m + 2];

			

			//Compare the prefix of each of the patterns to hash2

			for ( i = 0; i < d_PREFIX[hash1].size; i++ ) {

				

				//if prefix matches, compare target substring with pattern

				if ( hash2 == d_PREFIX[hash1].value[i] ) {

				

					for ( j = 2; j < m; j++ ) {

								

						if( d_pattern[d_PREFIX[hash1].index[i]][j] != d_text[index - m + 1 + j] ){

												

/////////warning line///////////////////////////break; //actually the warning probably refers to the if statement above, but this is the line the compiler flagged.

						}

					}

				

					if( j == m )

						atomicAdd( &d_matches[idx], 1);	

				}

			}

		index++;

		__syncthreads();	

		}

		

	__syncthreads();

	

	}

	

__syncthreads();

	

CUPRINTF("Wu GPU matches : %d start = %i stop = %i\n", d_matches[idx], start, stop);     	

}

On the host I create the PREFIX struct for the CPU and d_PREFIX for the device.

I fill the PREFIX struct with zeroes and then copy PREFIX to d_PREFIX.

I thought that would fix the error/warnings but it doesn’t.

Code follows:

Host code:

struct  prefixArray {

	int value[VALUESIZE];		//An array of hash values

	int size;			//Hold the total size of the value array

	int index[VALUESIZE];		//Store the pattern number

};

....

....

int B=3;

struct prefixArray *PREFIX;

	

PREFIX = (struct prefixArray *) malloc ( sizeof( struct prefixArray ) * shiftsize );

//Passing zero values to the host struct

        for ( i = 0; i < shiftsize; i++ ) {

		PREFIX[i].size = 0;

		*PREFIX[i].value = 0;			

		*PREFIX[i].index = 0;	

	}

struct prefixArray *d_PREFIX;

cudaMalloc( (void **) &d_PREFIX, sizeof( struct prefixArray ) * shiftsize );

//passing the zero values to the device struct

cudaMemcpy( d_PREFIX, PREFIX, sizeof( struct prefixArray ) * shiftsize, cudaMemcpyHostToDevice );

wuKernel<<< dimGrid, dimBlock, sharedMemSize >>>( .... );
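
(For reference, the zeroing loop above can also be collapsed into a single call, which clears every element of value[] and index[] as well, not only the first one — a sketch with the same shiftsize:)

// either allocate zeroed memory in the first place ...
struct prefixArray *PREFIX = (struct prefixArray *) calloc( shiftsize, sizeof( struct prefixArray ) );

// ... or keep the existing malloc and clear the whole buffer afterwards
memset( PREFIX, 0, sizeof( struct prefixArray ) * shiftsize );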

Shameless bump.
Anyone? Also, can an admin relocate the thread to “CUDA Programming and Development”?
I made a mistake posting this thread in the wrong section!

You are probably running into a race condition in the code that handles hash collisions: every thread updates d_PREFIX[hash].size at the same time. Change it to

//calculate the hash of the prefixes for each pattern

                        if ( shiftlen == 0 ) {

                                prefixhash = d_pattern[j][0];

                                prefixhash <<= m_nBitsInShift;

                                prefixhash += d_pattern[j][1];

                                int pos = atomicAdd(&(d_PREFIX[hash].size), 1);

                                d_PREFIX[hash].value[pos] = prefixhash;

                                d_PREFIX[hash].index[pos] = j;

                        }

tera: thanks for the reply.

The hash code must run on every thread (so that each thread ends up with the same prefix table at the end of the loop). So an atomicAdd will give undesirable data (every thread’s increment is counted, so size grows far beyond what it should be), thus breaking the algorithm :(

(I tried doing the preprocessing part of the code on the CPU and then passing the results to the GPU so the threads can do their job, and that runs smoothly. The thing is that I MUST have the preprocessing phase on the device so I can compare CPU and GPU speeds on exactly the same algorithm. I’m gonna move the data to shared memory after I make sure it works in global memory.)

I did try running the code you suggested, though, and the error in the results skyrocketed from 140 / 70 to 13000 / 70.

In that case move the initialization of d_PREFIX into a separate kernel that you run with a single thread per block and a single block. There is no point in running the same calculation on the same data over and over again, particularly if that also corrupts your data structure.

Later you may think about how to use the parallelism provided by CUDA to speed up the initialization.
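
In case it helps, roughly what that could look like with the names from your posted code — treat it as a sketch of the idea rather than a tested drop-in; d_PREFIX still needs to be zeroed before the launch, and the preprocessing loop would then be removed from wuKernel itself:

// Hypothetical preprocessing kernel: same loop as before, but meant to be
// launched with a single block and a single thread, so the plain increment
// of d_PREFIX[hash].size is safe.
__global__ void wuPreprocessKernel( unsigned char **d_pattern, int m, long shiftsize,
                                    int *d_SHIFT, struct prefixArray *d_PREFIX, int p_size )
{
	int i, j, q, B = 3;
	unsigned short m_nBitsInShift = 4;
	unsigned int hash;
	size_t shiftlen, prefixhash;

	//default to m - B + 1 for shift
	for ( i = 0; i < shiftsize; i++ )
		d_SHIFT[i] = m - B + 1;

	//for each pattern, add each 3-character subpattern
	for ( j = 0; j < p_size; ++j ) {
		for ( q = m; q >= B; --q ) {
			hash  = d_pattern[j][q - 2 - 1];
			hash <<= m_nBitsInShift;
			hash += d_pattern[j][q - 1 - 1];
			hash <<= m_nBitsInShift;
			hash += d_pattern[j][q - 1];
			shiftlen = m - q;
			d_SHIFT[hash] = MIN2( d_SHIFT[hash], shiftlen );

			if ( shiftlen == 0 ) {
				prefixhash = d_pattern[j][0];
				prefixhash <<= m_nBitsInShift;
				prefixhash += d_pattern[j][1];
				d_PREFIX[hash].value[d_PREFIX[hash].size] = prefixhash;
				d_PREFIX[hash].index[d_PREFIX[hash].size] = j;
				d_PREFIX[hash].size++;
			}
		}
	}
}

// Host side: run the preprocessing once, then launch the search kernel as before
wuPreprocessKernel<<< 1, 1 >>>( d_pattern, m, shiftsize, d_SHIFT, d_PREFIX, p_size );
cudaThreadSynchronize();
checkCUDAError("Preprocess kernel");

wuKernel<<< dimGrid, dimBlock, sharedMemSize >>>( .... );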

You’re right. I’m gonna try it. Thanks! :)

It actually works! Finally! :D Thanks again tera! I was kinda disappointed at first because I’ve been working on this code for several months (my first CUDA project).

There is another bug, but it’s minor. With certain numbers of blocks and threads it gives wrong results, but only on the scale of, say, 2 to 5 errors out of 70 results. I think it’s a __syncthreads problem from here on. (And it’s linked again to the remaining PREFIX access:

if( d_pattern[d_PREFIX[hash1].index[i]][j] != d_text[index - m + 1 + j] ){

												

        break;

    }

From here on it’s easy. Going to add shared memory!