Why doesn't my map-reduce k-means work as it should?

I have written a version of k-means with map/reduce logic that doesn't seem to work as it should. It is not finished yet, so don't worry if it doesn't look like k-means.
The reduce phase is written with Thrust and works just as it should (just like the CPU version), but the map stage produces different results on the GPU and the CPU.
I’m attaching the code file for you to see and help me.

my_k-means_cuda.mapgpu.cu (10 KB)

Many thanks in advance.
Apostolis Glenis

Why do you call mapFuncGpu() with 256/NUMBER_OF_ELEMENTS blocks? That doesn't seem to make sense to me.
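The usual pattern is one thread per data point, with the block count rounded up, roughly like this (a sketch only; pass whatever arguments your kernel actually takes):

const int threadsPerBlock = 256;
// round up so that all NUMBER_OF_ELEMENTS points are covered
const int numBlocks = (NUMBER_OF_ELEMENTS + threadsPerBlock - 1) / threadsPerBlock;
mapFuncGpu<<<numBlocks, threadsPerBlock>>>(/* ...same arguments as before... */);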

In general, I find your code very hard to read. Use proper indentation, and try to have as much commonality between the CPU and GPU versions as possible. There is no need to name the same (automatic) variables differently in the host and device functions. And you can even use the same distance() function on host and device if you write it like this:

__host__ __device__
float distance(float x1, float x2, float y1, float y2)
{
    return sqrtf((x2-x1)*(x2-x1) + (y2-y1)*(y2-y1));
}

I tried to clean up the code to make it more readable, but the output is still the same. Now the map functions are almost identical but still produce different results.

I'm also attaching the data files for anyone who would be kind enough to test them with a reference k-means implementation, to determine at least which version of the output is correct.

Thank you for your time

Apostolis Glenis

a.txt (190 Bytes)

b.txt (192 Bytes)

my_k-means_cuda.mapgpu.both.cu (11.7 KB)

You fixed the wrong number of blocks in the call, but introduced two new mistakes in mapFuncGpu() while cleaning up:

- j should be calculated as j = blockIdx.x * blockDim.x + threadIdx.x.
- currDistance should be a float variable, not an int.

Also, the __syncthreads() at the end serves no purpose; just leave it out.
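Put together, the kernel body should look roughly like this (a sketch only: parameter names and the centroid count are guesses, not your actual code, and it reuses the distance() function from above):

#include <cfloat>   // FLT_MAX

__global__ void mapFuncGpu(const float *x, const float *y,
                           const float *cx, const float *cy,
                           int *map, int numCentroids, int numPoints)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (j >= numPoints)
        return;

    float bestDistance = FLT_MAX;                    // float, not int
    int   bestCluster  = 0;
    for (int c = 0; c < numCentroids; c++)
    {
        float currDistance = distance(x[j], cx[c], y[j], cy[c]);
        if (currDistance < bestDistance)
        {
            bestDistance = currDistance;
            bestCluster  = c;
        }
    }
    map[j] = bestCluster;
    // no __syncthreads() needed - the threads don't exchange any data here
}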

Can you post the code here? Also, are you sure the CPU implementation works right? Do you have visual results?

I can't compile the kernel you've uploaded because of missing include files, but how different is your result?

The only real difference I can see is that you use double-precision sqrt on the CPU, while on the GPU you use single-precision sqrtf, which has a 3 ulp error (see the docs). Depending on the data, your points may end up in different clusters. To minimize the error you could use an equivalent metric, like abs(a.x-b.x) + abs(a.y-b.y).
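For example, a sketch of such a metric, written so that host and device evaluate it the same way (same parameter order as the distance() function above):

__host__ __device__
float distanceL1(float x1, float x2, float y1, float y2)
{
    // no square root involved, so the 3 ulp sqrtf error goes away entirely
    return fabsf(x2 - x1) + fabsf(y2 - y1);
}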

Btw, to me it seems that you totally miss the point of __syncthreads(), though in the code above it does not affect the result.

I just confirmed it with Matlab.

The CPU approach gives the correct result in the map phase:

0 0 1 0 1 1 1 0 1 1 1 1 0 1 0 0

while the GPU map function gives something like

0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

I really can't understand why this is happening.

I'd recommend debugging one particular point and seeing exactly what happens. Either use the Nsight debugger, or modify the code to output every intermediate step your kernel does.

I have cleaned up the code even further. I tested the reduction code with the output of the CPU map function as input, and it works exactly as it should. I also added a printf in the GPU version of the map function, and it seems that the GPU version also maps the data correctly; I don't know why it is not printed correctly on the screen.

I’m attaching the latest version of my code.

The only extra thing you need to build it is to put the thrust lib folder in the same directory as the .cu file.

http://code.google.com/p/thrust/

my_k-means_cuda.map_gpu_correct.cu (9.46 KB)

I don't get what you think is wrong with this code.
Both the CPU and GPU versions work as expected, and return 0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1.
After the first mapFuncGPU you compare the result with an uninitialized prev_begin (or whatever the Thrust library initializes it to by default), so the iterations continue (btw, the result of this call is the same as what you get with Matlab).
Then you run reduceFuncGPU, which rearranges your "data" points so that on the next iteration you get 0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1.
Then you do one more iteration which doesn't change anything, and the loop finally exits.
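To make the exit condition meaningful, the previous assignments have to hold a real result from the last iteration before you compare against them. A minimal host-side sketch of the idea (variable names are placeholders, not the ones in your file):

#include <vector>
#include <algorithm>

std::vector<int> map_result(NUMBER_OF_ELEMENTS);        // assignments copied back from the GPU
std::vector<int> prev_result(NUMBER_OF_ELEMENTS, -1);   // -1 = "no previous iteration yet"

bool converged = false;
while (!converged)
{
    // ... run the map phase, copy the assignments into map_result,
    //     then run the reduce phase and update the centroids ...

    converged = std::equal(map_result.begin(), map_result.end(), prev_result.begin());
    prev_result = map_result;   // remember this iteration's assignments
}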

And btw, debugging by putting printf all around the code is not what I mean by debugging; rather, use a real debugger where you can see the values of ALL the data you work with.

Thank you,

I completely forgot about the rearrangement after reduce.

About the debugging: I'm still kind of a newbie in CUDA and I am not very accustomed to cuda-gdb yet.

Apostolis Glenis

Hello again,
I'm trying to make my code run faster but I haven't gotten it to work yet.
My first idea was to use a tuple with a zip_iterator to avoid doing sort_by_key and reduce_by_key twice. For some weird reason the straightforward approach of using the zip_iterator instead of cendroids_x doesn't work.
My second idea was to make a struct to house both the x and y coordinates, so that I won't need two separate arrays. That doesn't compile as it should.
I posted my question on the Thrust mailing list but nobody answered, so I'm posting it again here.
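For reference, this is roughly the pattern I'm trying to get working: sort once, then sum both coordinates per cluster in a single reduce_by_key over a zip_iterator (variable names here are placeholders, not the ones in my file):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

// adds two (x, y) tuples component-wise so one reduce_by_key can sum
// both coordinates at once
struct add_xy
{
    __host__ __device__
    thrust::tuple<float, float> operator()(const thrust::tuple<float, float> &a,
                                           const thrust::tuple<float, float> &b) const
    {
        return thrust::make_tuple(thrust::get<0>(a) + thrust::get<0>(b),
                                  thrust::get<1>(a) + thrust::get<1>(b));
    }
};

// keys : cluster index per point (output of the map phase)
// x, y : point coordinates; the per-cluster sums come out in sum_x, sum_y
void reduce_clusters(thrust::device_vector<int>   &keys,
                     thrust::device_vector<float> &x,
                     thrust::device_vector<float> &y,
                     thrust::device_vector<int>   &out_keys,
                     thrust::device_vector<float> &sum_x,
                     thrust::device_vector<float> &sum_y)
{
    // sort the (x, y) pairs by cluster index - only once
    thrust::sort_by_key(keys.begin(), keys.end(),
                        thrust::make_zip_iterator(thrust::make_tuple(x.begin(), y.begin())));

    // one pass that sums both coordinates per cluster
    thrust::reduce_by_key(keys.begin(), keys.end(),
                          thrust::make_zip_iterator(thrust::make_tuple(x.begin(), y.begin())),
                          out_keys.begin(),
                          thrust::make_zip_iterator(thrust::make_tuple(sum_x.begin(), sum_y.begin())),
                          thrust::equal_to<int>(),
                          add_xy());
}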

my_k-means_cuda.map_gpu_reduce_gpu_op_check2.cu (9.3 KB)

This one only needs to do one reduce_by_key and one sort_by_key instead of two.

my_k-means_cuda.map_gpu_reduce_gpu_op_struct.cu (7.81 KB)

This one doesn't compile.

Thanks in advance
Apostolis

I'm sorry, but I have no idea what Thrust is about and how it is supposed to work.

I'm not a fan of libraries like that in general, since to me it seems that it would take you a lot less effort to write your own version of reduceFuncGPU without using Thrust.
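For instance, a naive Thrust-free version of the reduce step could look roughly like this (placeholder names, and it assumes atomicAdd on floats, i.e. compute capability 2.0 or later):

// accumulate per-cluster coordinate sums and counts with atomics;
// the new centroids are then sum/count
__global__ void reduceFuncGpuNaive(const float *x, const float *y, const int *map,
                                   float *sum_x, float *sum_y, int *count,
                                   int numPoints)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPoints)
        return;

    int c = map[i];              // cluster this point was assigned to
    atomicAdd(&sum_x[c], x[i]);
    atomicAdd(&sum_y[c], y[i]);
    atomicAdd(&count[c], 1);
}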

The project might lead to my university diploma thesis, so the idea is to implement the reduce function using Thrust, CUDPP, and the native CUDA API, and to evaluate performance overhead, programming effort, and so on.

So I will have to write it both with and without Thrust.

Thanks for your interest anyway

Apostolis

Edit: I have finally gotten the zip iterator to work. What I was doing wrong was that I forgot to also zip up the data.

I’m attaching the final version

my_k-means_cuda.map_gpu_reduce_gpu_op_zip.cu (9.1 KB)