Memory access violation when reading a text file and sending its content to the GPU

Hello everyone,

I’m quite new to C++ and have just started CUDA programming for my master’s thesis, but all I have encountered so far is a thousand different problems.

For days now I have struggled to get a program to read a text file containing a list of words (patterns) and then send it to the GPU for further processing (pattern matching against some other text).

For this I have found a text file with “bad words”, and I want to try running a kernel thread for each line of this file.

(The goal of the program as a whole is a tiny prototype of an IDS that scans network packets for known signatures, like Snort does; however, that seems a few years away as of now.)

I have tried many guides, but none of them seem to manage to send a dynamic character array (or a list of character arrays) to the GPU.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <string.h>  // strcpy, strlen
#include <list>
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

struct Dict{
    char *list;
    int offsets[480]; // Can this be made dynamic? There are 480 patterns in badwords.txt
    int max_len;
    int num_patterns;

    void genDictionary(const char * filename){
        max_len = 0;
        num_patterns = 0;
        char * cstr;
        string temp, line;
        int i = 0;
        int curr_offset = 0;
        ifstream myfile (filename);
        if (myfile.is_open())
        {
            while ( myfile.good() )
            {
                getline (myfile, line);
                cstr = new char[line.size()+1];
                strcpy(cstr, line.c_str());
                temp += cstr;
                curr_offset += line.size();
                offsets[i] = curr_offset;
                num_patterns++;
                if (max_len < line.size()) max_len = line.size();
                i++;
            }
            myfile.close();
        }
        list = new char[temp.size()+1];
        strcpy(list, temp.c_str());
    }
};

__global__ void runKernel( char** g_list, int *g_offset)
{
    printf("Hello thread %d\n", threadIdx.x);
    // Do something with the input
}

void runEngine(){
    Dict dictionary;
    dictionary.genDictionary("badwords.txt");
    int num_threads = dictionary.num_patterns;
    int mem_size_list = sizeof(char) * strlen(dictionary.list);
    int mem_size_offsets = sizeof(int) * sizeof(dictionary.offsets);

    // allocate device memory for list and copy host memory to device
    char** d_list;
    cudaMalloc( (void**) &d_list, mem_size_list);
    cudaMemcpy( d_list, dictionary.list, mem_size_list, cudaMemcpyHostToDevice);

    // allocate device memory for offset and copy host memory to device
    int* d_offset;
    cudaMalloc( (void**) &d_offset, mem_size_offsets);
    cudaMemcpy( d_offset, dictionary.offsets, mem_size_offsets, cudaMemcpyHostToDevice);

    runKernel<<<1,1>>>(d_list, d_offset);  // start 1 thread just for testing
}

int main(){
    runEngine();
    cudaDeviceReset();
    return 0;
}

Can anyone spot the error, or point me to some guides that handle lists of character arrays?

Appreciate all efforts made!

I have attached my whole project, including badwords.txt (VS 2010 project).

Windows 7 64bit (program 32bit)

CUDA 4.1

SDK 4.1

Compiled as 2.0 compute.
Test.rar (3.05 MB)

Hello

As I understand your solution, you are trying to create one long character array that contains all patterns, and then use an offset array to determine where each pattern starts. In principle, this should be a good approach.

I noticed some errors in the memory allocation part of the code. First, since you have merged all patterns into one long string, the device pointer should be of type char* and not char**, which would point to an array of pointers.

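Second, cudaMalloc and cudaMemcpy (like malloc) take their size in bytes, i.e. the number of elements multiplied by the size of the element type. Note that sizeof(dictionary.offsets) is already the size of the whole array in bytes, so multiplying it by sizeof(int) once more makes cudaMemcpy read far past the end of the host array. With the number of patterns as the element count, the allocation becomes

cudaMalloc( (void**) &d_offset, dictionary.num_patterns*sizeof(int));

The same applies to cudaMemcpy.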
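To make the byte arithmetic concrete, here is a small host-side sketch. The helper name bytes_for and the struct OffsetsDemo are hypothetical and only mirror the 480-entry array from the question; the point is that sizeof applied to a fixed-size array already returns bytes, which is the unit cudaMalloc and cudaMemcpy expect.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: size in bytes of `count` elements of type T.
// This is the kind of value cudaMalloc and cudaMemcpy take as their size argument.
template <typename T>
std::size_t bytes_for(std::size_t count) {
    // Guard against overflow of the multiplication.
    assert(count <= static_cast<std::size_t>(-1) / sizeof(T));
    return count * sizeof(T);
}

// Stand-in for the fixed-size offsets array in the question.
struct OffsetsDemo {
    int offsets[480];
};

// sizeof on the array yields bytes (480 * sizeof(int)), not an element count,
// so it must not be multiplied by sizeof(int) a second time.
inline std::size_t offsets_bytes() {
    return sizeof(OffsetsDemo::offsets);
}
```

With these, bytes_for<int>(480) and offsets_bytes() give the same number, and either one could be used directly as the size for the offsets copy.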

Besides these errors, if you allocate memory dynamically with new, you should remember to release it when you have finished using it (delete[] for arrays allocated with new[]).

If you want to use dynamic memory allocation for the offsets variable, you either need to let your program count the number of lines in advance before starting to read them, or you could allocate a suitable size, and when that size is too small, you allocate a larger array and copy all data to the new array.
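The "allocate a larger array and copy" strategy is what std::vector does internally, so one possible sketch of a dynamically sized dictionary looks like this. The names FlatDict, read_patterns and parse_patterns are hypothetical, and unlike the code in the question this version stores the start of each pattern plus one final end offset, and skips blank lines.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical replacement for the fixed-size Dict: one flat character
// buffer plus a growable offset table.
struct FlatDict {
    std::string chars;         // all patterns back to back, no separators
    std::vector<int> offsets;  // offsets[k] = start of pattern k; last entry = chars.size()
    int max_len = 0;           // length of the longest pattern

    int num_patterns() const { return static_cast<int>(offsets.size()) - 1; }
};

// Reads one pattern per line; the vector grows on demand.
inline FlatDict read_patterns(std::istream& in) {
    FlatDict d;
    d.offsets.push_back(0);
    std::string line;
    while (std::getline(in, line)) {  // loop ends cleanly at end of file
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate Windows line endings
        if (line.empty()) continue;   // skip blank lines
        d.chars += line;
        d.offsets.push_back(static_cast<int>(d.chars.size()));
        if (static_cast<int>(line.size()) > d.max_len) d.max_len = static_cast<int>(line.size());
    }
    assert(d.num_patterns() >= 0);
    return d;
}

// Convenience wrapper for in-memory text.
inline FlatDict parse_patterns(const std::string& text) {
    std::istringstream in(text);
    return read_patterns(in);
}
```

For the file in the question, an std::ifstream opened on badwords.txt could be passed to read_patterns. The buffers to copy to the GPU would then be d.chars.data() with d.chars.size() bytes and d.offsets.data() with d.offsets.size() * sizeof(int) bytes.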

Regards

Anders

Thank you for the reply!

I have got the char array into the kernel now, and the offsets as well (at least it looks good so far, judging by what I have printf’ed).

Now to figure out how to handle output!

I would still love links to char array examples with two or more dimensions, if anyone knows any.

In principle, you can do it like the following, assuming that you have your data in the two-dimensional array cpuarray. For simplicity, I made the array contain 10 strings.

char* array[10];      // host-side array holding the 10 device pointers
char** cudaarray;     // device-side copy of that pointer array
size_t strlength;
for(int k=0;k<10;k++) {
  strlength=strlen(cpuarray[k])+1;   // +1 so that the terminating '\0' is copied too
  cudaMalloc( (void**) &array[k], strlength*sizeof(char));
  cudaMemcpy( array[k], cpuarray[k], strlength*sizeof(char), cudaMemcpyHostToDevice);
}
cudaMalloc( (void**) &cudaarray, 10*sizeof(char*));
cudaMemcpy( cudaarray, array, 10*sizeof(char*), cudaMemcpyHostToDevice);

However, unless it is really necessary, I would not recommend using arrays of pointers on the GPU. First, it requires a lot of small memory allocations and memory copies, which is likely slower than one large one. Besides, I have personally had problems using pointers to pointers on the GPU (see the related topic on The Official NVIDIA Forums), and I would say that having all data in one single array is a more reliable solution.
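A single flat array can still be addressed almost like a list of strings. The sketch below assumes an offsets table with one more entry than there are patterns (start of each pattern plus a final end offset); PatternView and pattern_at are hypothetical names.

```cpp
#include <cassert>

// Hypothetical view of one pattern inside a flat buffer.
struct PatternView {
    const char* data;  // first character of the pattern (not NUL-terminated)
    int length;        // number of characters in the pattern
};

// offsets must hold num_patterns + 1 entries: offsets[k] is where pattern k
// starts and offsets[k + 1] is where it ends.
inline PatternView pattern_at(const char* flat, const int* offsets,
                              int num_patterns, int k) {
    assert(k >= 0 && k < num_patterns);
    PatternView v = { flat + offsets[k], offsets[k + 1] - offsets[k] };
    return v;
}
```

The function only uses pointer arithmetic, so the same body should carry over to device code when it is declared __device__, with k being the thread index.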

Regards

Anders

Thank you for the good advice ;)

The goal is to make it as fast as possible, so one single large memcopy would be ideal, as you say (and as I’ve read in my CUDA books). An array of strings would just have been easier to handle code-wise in the kernel, but I guess I’ll have to manage ;)

What I’m struggling with now is how to get all the kernel threads to write their results to the same output character array, if that is even possible.

If kernel thread x finds the pattern in the input string, it should just add a “notification” of it at the current end of g_output and continue parsing the input until it is done.

But this would create a race condition on g_output, would it not?

Is there some sort of in-kernel atomicAdd for combining character arrays?

And how would one actually go about combining the current g_output with this new notification? Is there a CUDA equivalent of standard functions such as strcat?

Example: g_output = (g_output + “notification”);?

My code currently looks something like this. It does not work because of the strcat, and frankly I’m not sure the rest is correct either, as I’m still fresh on this C code:

__global__ void runKernel(char* g_pattern_list, int *g_pattern_offset, int pattern_size, char* g_packet_input, int packet_size, char * g_output)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int index = 0;
    for(int i=0; i<packet_size; i++){
        for(int j=g_pattern_offset[tid]; j< g_pattern_offset[tid+1]; j++){
            if(g_pattern_list[j] != g_packet_input[i+index]) break; //Break if not the same
            else
                index++;
            if(index == g_pattern_offset[tid+1]) //found the whole pattern
                strcat(g_output, "found pattern at: "+g_pattern_offset[tid]);
        }
    }
}
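For comparison, the per-thread scan this kernel is aiming for can be sketched on the host as a plain function. This is only one naive way to do it, not a tuned implementation; count_matches is a hypothetical name, and pattern_start and pattern_end stand for g_pattern_offset[tid] and g_pattern_offset[tid+1]. Compared with the kernel above, the match counter restarts for every start position, it is compared against the pattern length rather than an offset, and the outer loop stops early enough that the packet is never read past its end.

```cpp
#include <cassert>

// Hypothetical per-thread work: count how often one pattern occurs in the packet.
// The pattern occupies pattern_list[pattern_start .. pattern_end - 1].
inline int count_matches(const char* pattern_list, int pattern_start, int pattern_end,
                         const char* packet, int packet_size) {
    assert(pattern_start <= pattern_end);
    const int pattern_len = pattern_end - pattern_start;
    if (pattern_len == 0) return 0;
    int matches = 0;
    // Only try start positions where the whole pattern still fits in the packet.
    for (int i = 0; i + pattern_len <= packet_size; ++i) {
        int matched = 0;  // restarts at zero for every start position
        while (matched < pattern_len &&
               pattern_list[pattern_start + matched] == packet[i + matched]) {
            ++matched;
        }
        if (matched == pattern_len) ++matches;  // the whole pattern was found at position i
    }
    return matches;
}
```

In a kernel, each thread would run this loop for its own pattern and report each hit instead of only counting it.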

One way you could try is to use atomicAdd on an integer. Keep an offset value that marks where the next data should be inserted. Once you have your data and know how long it is, call atomicAdd on the offset to reserve space for it; atomicAdd returns the old value, which is where your data starts. You should then be able to safely insert your data into the array at that position without any race conditions.
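The reservation idea can be sketched on the host with std::atomic, used here as a stand-in for the device-side counter. In a kernel, the counter would be an int in device memory and fetch_add would be replaced by atomicAdd(&next, len), which likewise returns the previous value. OutputBuffer and append_record are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstring>

// Hypothetical output buffer shared by all workers.
struct OutputBuffer {
    char* data;
    int capacity;
    std::atomic<int> next;  // offset at which the next record will be written

    OutputBuffer(char* d, int c) : data(d), capacity(c), next(0) {}
};

// Reserves `len` bytes and copies `msg` into them. Returns the start offset of
// the record, or -1 if it does not fit. fetch_add hands every caller a distinct
// previous value, so the reserved ranges never overlap.
inline int append_record(OutputBuffer& out, const char* msg, int len) {
    assert(len >= 0);
    const int start = out.next.fetch_add(len);
    if (start + len > out.capacity) return -1;  // space was reserved, but the buffer is full
    std::memcpy(out.data + start, msg, static_cast<std::size_t>(len));
    return start;
}
```

Note that records end up in the order in which the reservations happened, not in thread order, so each record needs to say which pattern it belongs to.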

Regarding strcat, it can be implemented using a normal for-loop.
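A minimal sketch of such a loop, written as ordinary functions that track the write position explicitly (append_text and append_number are hypothetical names). The second helper is there because an expression like "found pattern at: "+g_pattern_offset[tid] does pointer arithmetic on the string literal in C and C++ rather than appending the number.

```cpp
#include <cassert>

// Copies the NUL-terminated `src` to dst[pos...] and returns the new end
// position. Stops early instead of writing at or beyond `capacity`.
inline int append_text(char* dst, int pos, int capacity, const char* src) {
    assert(pos >= 0 && pos <= capacity);
    for (int i = 0; src[i] != '\0' && pos < capacity; ++i) {
        dst[pos++] = src[i];
    }
    return pos;
}

// Appends a non-negative integer in decimal and returns the new end position.
inline int append_number(char* dst, int pos, int capacity, int value) {
    assert(value >= 0);
    char digits[12];  // enough for any 32-bit int
    int n = 0;
    do {              // produces the digits in reverse order
        digits[n++] = static_cast<char>('0' + value % 10);
        value /= 10;
    } while (value > 0);
    while (n > 0 && pos < capacity) dst[pos++] = digits[--n];
    return pos;
}
```

Neither function writes a terminating '\0'; the caller sets dst[pos] = '\0' once the record is complete, which is why the capacity passed in should leave one byte spare.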

If your intention is to write fast code, I recommend that you study how GPU warps work. In principle, your goal is to keep all threads within a warp executing exactly the same code all the time. Since your code appears to contain several if statements and similar branches, this is probably something you need to consider for optimal performance.

Regards
Anders