Kernel produces only sometimes correct results (hierarchical clustering, bachelor thesis)

For my bachelor thesis I am implementing a hierarchical document clustering algorithm on the GPU.

I have a CPU-only implementation for testing and validation purposes, and a CPU/GPU collaboration framework.

Document Clustering Background

On pages 9-14 hierarchical document clustering is explained. It is in German, but those pages mainly contain formulas, calculations, tables and trees. (The algorithms there do not represent the ones in my code!)

Document Clustering (german)

or something I found on the net

document clustering (english)

Development Background:

In the CPU/GPU framework, the CPU preprocesses the documents and creates a term-document matrix (TDM) with term frequency (TF) values. This TDM is copied to the device; a kernel calculates the inverse document frequency for each term and updates the TDM, and the final results are normalized column-wise (per document) in another kernel.
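To make the weighting step concrete, here is a small host-side sketch of that pipeline. The exact TF-IDF variant used in the thesis is not shown here, so this assumes the common tf · log(N/df) scheme; the function name is a hypothetical placeholder:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Host-side sketch: TDM (terms x docs, row-major, TF values) -> TF-IDF ->
// column-normalized. Assumes tf * log(N / df) weighting (not necessarily
// the thesis variant). After normalization, the dot product of two columns
// is the cosine similarity of the two documents.
void tfIdfNormalize(std::vector<float>& tdm, int terms, int docs)
{
    for (int t = 0; t < terms; ++t) {
        int df = 0;                               // document frequency of term t
        for (int d = 0; d < docs; ++d)
            if (tdm[t * docs + d] > 0.0f) ++df;
        float idf = (df > 0) ? std::log((float)docs / df) : 0.0f;
        for (int d = 0; d < docs; ++d)
            tdm[t * docs + d] *= idf;             // tf -> tf-idf
    }
    for (int d = 0; d < docs; ++d) {              // normalize each column (document)
        float norm = 0.0f;
        for (int t = 0; t < terms; ++t)
            norm += tdm[t * docs + d] * tdm[t * docs + d];
        norm = std::sqrt(norm);
        if (norm > 0.0f)
            for (int t = 0; t < terms; ++t)
                tdm[t * docs + d] /= norm;
    }
}
```

On the GPU the df/idf pass maps naturally to one thread (or block) per term, and the normalization to one block per document column.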

A new kernel computes the pairwise correlations, using the dot product as the similarity value, into a float array.

(Usually this is a document-document similarity triangular matrix (DDM_0), which I linearized to save memory.)
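The concrete linearization scheme is not shown above; a common mapping for the strictly upper-triangular part (assumed here for illustration, function name hypothetical) is:

```cpp
#include <cassert>

// Hypothetical linearization of the strictly upper-triangular part of an
// n x n similarity matrix (row-major, diagonal excluded): the pair
// (row, col) with row < col maps to one index in a flat float array of
// length n*(n-1)/2.
unsigned int linearIndex(unsigned int row, unsigned int col, unsigned int n)
{
    // row*n - row*(row+1)/2 elements lie in the rows above,
    // (col - row - 1) is the offset inside the current row
    return row * n - (row * (row + 1)) / 2 + (col - row - 1);
}
```

For n = 4 documents this enumerates the 6 pairs (0,1), (0,2), (0,3), (1,2), (1,3), (2,3) as indices 0..5.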



CPU: create CPU cluster objects (nodes with two children) for each file pair from the DDM_0

CPU: allocate an empty reference index int array (refArr) on the device for building the cluster tree

Up to this point everything works perfectly, and these GPU steps are extremely fast compared to their CPU counterparts, even for very small document counts (fewer than 100), due to heavy use of reduction algorithms and no atomic operations.

Clustering loop: wrong results are produced in the red line

In a CPU loop, two to three kernels are invoked.

Loop start (iteration i)


//if the input data is too big to solve in one reduction, invoke the kernel twice

Kernel: get the maximum value from DDM_(i-1)

CPU: allocate a new, smaller values array on the device = DDM_i

//let's reuse our reference array!

CPU: set the memory of refArr to 0

Kernel: calculate the new cluster similarity values with respect to the given maximum index, output in DDM_i,

output the neighbours' linear indices in refArr

CPU: free DDM_(i-1) device pointer

CPU: replace DDM_(i-1) with DDM_i device pointer

Update CPU cluster tree:

CPU: copy DDM_i from device to host

CPU: copy refArr from device to host

CPU: update the cluster tree with the cluster references and values

loop end
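Seen from the host side, one pass of the loop above might be sketched like this (kernel names, launch configuration and helper variables are hypothetical placeholders, not my actual code):

```cuda
// Host-side sketch of one clustering iteration (CUDA runtime API).
// maxReduceKernel, updateKernel, grid, block, maxIdx_d, curCount,
// refCount, ddmHost, refHost and updateClusterTree are placeholders.
void clusterIteration(float*& ddmPrev, int* refArr_d)
{
    // Kernel: argmax over DDM_(i-1); if the input is too big for one
    // reduction, a second pass over the partial results is launched
    maxReduceKernel<<<grid, block>>>(ddmPrev, maxIdx_d);
    cudaDeviceSynchronize();

    // CPU: allocate the smaller DDM_i and clear the reused refArr
    float* ddmCur;
    cudaMalloc(&ddmCur, curCount * sizeof(float));
    cudaMemset(refArr_d, 0, refCount * sizeof(int));

    // Kernel: merge step - new similarities into DDM_i,
    // neighbour linear indices into refArr
    updateKernel<<<grid, block>>>(ddmPrev, maxIdx_d, ddmCur, refArr_d);
    cudaDeviceSynchronize();

    // CPU: swap buffers, mirror the results on the host, update the tree
    cudaFree(ddmPrev);
    ddmPrev = ddmCur;
    cudaMemcpy(ddmHost, ddmCur, curCount * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(refHost, refArr_d, refCount * sizeof(int), cudaMemcpyDeviceToHost);
    updateClusterTree(refHost, ddmHost);   // CPU-side tree update
}
```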


It is seldom that one of the first iterations produces wrong results, and the failure rate increases with bigger document counts!

With a document count of less than 20 the error is not reproducible and the results are always correct (or the errors happen so seldom that they have not appeared yet).


Each kernel invocation and each cudaFree call is followed by a cudaThreadSynchronize() and a cuCtxSynchronize().

Both syncs are used since I use Java: I always use the runtime API, and the driver API is used only for the kernel calls. I don't know if that is necessary, it's just to be sure.

Result Trees 40 documents:

Results Log 40 documents loop:

see appendix.

Iteration 18 is the last correct one in the wrong-result log.

With a compare tool you can see that only the reference array is wrong in iteration 19, of course leading to completely wrong results in the following iterations.


Before I go into too much detail about my clustering update kernel, there is the question:

Why is my final cluster tree sometimes correct and sometimes wrong? Or:

why is my reference array (refArr) sometimes correct and sometimes wrong?

I mean, if my algorithm were wrong, I would expect wrong results every time.


I have now started my program 30 times; 9 times the results were correct.

I think sometimes the wrong results also repeated themselves.
wrong.txt (8.97 KB)
correct.txt (8.93 KB)

Most likely because the results of your code depend on the specific timing, which is different for each run (a 'race condition'). An example could be two threads writing to the same variable without proper synchronization.

Does the kernel in question use shared memory? __syncthreads()? Could different threads write to the same address in global memory (or one thread write to an address that is read by different threads)?
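As an illustration, a minimal (hypothetical) kernel with exactly this kind of hazard:

```cuda
// Every thread writes its own value to the same global address.
// Which value ends up in *out depends on warp/block scheduling, so the
// result can differ from run to run - a classic race condition.
__global__ void raceExample(int* out)
{
    *out = threadIdx.x;
}
```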

If these general hints don’t help you to find the problem, post some code here.

1. Shared memory usage:

one cluster struct (int row, int column, float value) is used and is initialized by the thread with threadIdx.x == 0

2. Syncthreads:

only one synchronisation is used, right after step 1

Important Note

Points 1 and 2 are not really necessary, since each thread can load the max cluster itself.

3. Possible race conditions:

If a thread's cluster is in relation with the maximum cluster, this thread writes to its linear index address.

So in most cases two threads write to the same global memory output address, which is never read again in that kernel; but the algorithm should guarantee that they calculate the same value.

Detailed explanation of my cluster algorithm, with an iteration calculated by hand:

cluster algorithm pdf, example iteration

Holy ****:

This is ridiculous: the error happens up to the __syncthreads(), but there is no conditional statement, so all threads should reach it! I have been searching all day long for days and would never have found that by myself.

Damn, something told me not to use structs, especially such a simple one.

Code creating error:

//first lines of the kernel

__shared__ Cluster clusterMax;

const unsigned int blockId = blockIdx.y * gridDim.x + blockIdx.x;

//initialize our max cluster struct
if(threadIdx.x == 0)
	clusterMax.set(inIndicesRow_g[inMaxIdx_s], inIndicesCol_g[inMaxIdx_s], inValues_g[inMaxIdx_s]);
__syncthreads();

const unsigned int threadId = blockId * blockDim.x + threadIdx.x;

bool copy = false;
Cluster cluster = Cluster();
Cluster clusterComp;

//early exit for out-of-range threads and the maximum cluster itself
if(threadId >= inCount_s
	|| threadId == inMaxIdx_s)
	return;

//... perform clustering

Fixed Code:

I ran this "fixed" code 30 times now and the results were all perfect!

//first lines of the kernel

const unsigned int blockId = blockIdx.y * gridDim.x + blockIdx.x;
const unsigned int threadId = blockId * blockDim.x + threadIdx.x;

//early exit for out-of-range threads and the maximum cluster itself
if(threadId >= inCount_s
	|| threadId == inMaxIdx_s)
	return;

bool copy = false;
Cluster cluster = Cluster();
Cluster clusterComp;

//each thread now loads the max cluster itself, no shared memory, no sync
Cluster clusterMax = Cluster(inIndicesRow_g[inMaxIdx_s], inIndicesCol_g[inMaxIdx_s], inValues_g[inMaxIdx_s]);

//... perform clustering

Cluster struct


/**
 * Simple cluster struct containing a row and column integer field
 * and a float value field.
 *
 * Simple: represents an element of a two-dimensional float matrix.
 */
struct __align__(16) Cluster
{
	float value;
	unsigned int row;
	unsigned int column;

	__device__ Cluster()
	{
		this->set(0, 0, -1.0f);
	}

	__device__ Cluster(unsigned int row,
		unsigned int column,
		float value)
	{
		this->set(row, column, value);
	}

	__device__ void set(unsigned int row,
		unsigned int column,
		float value)
	{
		this->row = row;
		this->column = column;
		this->value = value;
	}
};

Great, thanks @tera! I don't know why the first code should be wrong, but it works now!

Especially in my maximum index reduction kernels I always use a shared struct, which is not initialized with any value and only has two 4-byte fields.



How can the value in my shared cluster memory variable change? (I never write to it again!) Because not all values are wrong in my output linear reference index array, as can be seen in the txt files in the opening post.

For further investigation, a version with shared memory but without a struct would be necessary, to narrow the cause of the fault down to the struct.
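Such a struct-free variant of the block-wide maximum reduction might look like this (a sketch only; the names, the -1.0f padding value and the power-of-two block size are assumptions, not my thesis code):

```cuda
// Block-wide argmax reduction keeping value and index in two separate
// shared arrays instead of one shared struct. BLOCK = blockDim.x is
// assumed to be a power of two; each block writes one partial result.
template<unsigned int BLOCK>
__global__ void maxReduce(const float* inValues_g, unsigned int count,
                          float* outValues_g, unsigned int* outIndices_g)
{
    __shared__ float sVal[BLOCK];
    __shared__ unsigned int sIdx[BLOCK];

    const unsigned int tid = threadIdx.x;
    const unsigned int gid = blockIdx.x * BLOCK + tid;

    // pad out-of-range threads with a value below all similarities
    sVal[tid] = (gid < count) ? inValues_g[gid] : -1.0f;
    sIdx[tid] = gid;
    __syncthreads();

    // tree reduction: every thread reaches every __syncthreads()
    for (unsigned int s = BLOCK / 2; s > 0; s >>= 1) {
        if (tid < s && sVal[tid + s] > sVal[tid]) {
            sVal[tid] = sVal[tid + s];
            sIdx[tid] = sIdx[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) {
        outValues_g[blockIdx.x] = sVal[0];
        outIndices_g[blockIdx.x] = sIdx[0];
    }
}
```

If more than one block is needed, a second launch over the per-block partial results yields the global maximum, matching the "invoke the kernel twice" note in the opening post.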