For my bachelor thesis I create a document hierarchical clustering algorithm on the gpu.
I have a CPU only for testing and validating purpose, and a CPU GPU collaboration framework.
Document Clustering Background
On page 9-14 document hierarchical clustering is explained, its in german, but those sites are mainly filled with
formulas and calculations and tables, trees. (the algorithms dont represent the ones in my code!)
or something i found on the net
In the CPU_GPU Framework, the CPU preprocesses the documents and creates a term-document matrix (TDM) with term frequency (TF) values. This TDM is copied to device and the inverse document frequency is calculated in a kernel for each term and the TDM is updated and the final results are column (document) wise normalized in a kernel.
A new kernel creates pairwise correlations with the dot product as similarity value in a float array.
(usually it is a document-document-similarity triangular matrix (DDM_0), which i linearized to save memory)
CPU: create cpu cluster objects (nodes with two children) for each file pair from the (DDM_0)
CPU: allocate a empty references index int array (refArr) on the device for building the cluster tree
Till here everything works perfect and these steps on the gpu are extremely fast due to excessive use of reduction algorithm and no atomic operations, compared to their cpu counterparts, even for very small number of documents. (smaller then 100).
Clustering Loop: wrong results are produced in red line
In a CPU loop two till three kernels are invoked.
Loop start (iterator = i)
//if input data is to big to solve in one reduction, invoke the kernel twice
Kernel: get maximum value from the DDM_(i-1)
CPU: allocate new, smaller values array on the device = DDM_i
//lets reuse our reference array!
CPU: set memory from refArr to 0
Kernel: calculate new cluster similarity values regarding the given maximum index, output in DDM_i,
output the neighbours linear indices in refArr
CPU: free DDM_(i-1) device pointer
CPU: replace DDM_(i-1) with DDM_i device pointer
update cpu cluster tree
CPU: copy DDM_i from device to host
CPU: copy refArr from device to host
CPU: update cluster tree with the cluster references and values
Its seldom that one of the first iterations produce wrong results, and with bigger document count the fault rate increases!
With document count less then 20, the error is not reproducable and the results are always correct. (or they happen so seldom that it did not appeared yet)
Each Kernel invokation, cudaFree call is followed by a cudaThreadSynchronize(), cuCtxSynchronize()
Both syncs are used since I use Java and i always use the runtime api, and only the driver api is used for kernel calls, dont know if its necessary, just to be sure.
Result Trees 40 documents:
Results Log 40 documents loop:
iteration: 18 is the last correct one in the wrong result log
with a compare tool you can see that only the reference array is wrong in iteration 19, of cause leading to complete wrong results in the following iterations
Before I go into to much detail of my clustering update kernel there is the question.
Why is my final cluster tree sometimes correct!, and sometimes wrong, or
why is my references array (refArr) sometimes correct, and sometimes wrong
I mean, if my algorithm would be wrong, i expect wrong results always.
I started my programm now 30 times where 9 times my results were correct.