npp GraphCut bug..? Problems running nppiGraphcut_32s8u.

Hi there,
Been trying to use nppiGraphcut_32s8u() for a while now.
Most of the time, it works just fine. However, it seems that on certain inputs it fails and returns NPP_CUDA_KERNEL_EXECUTION_ERROR.
I am unable to point out the reason nor predict on which of my real inputs nppiGraphcut_32s8u() will fail.
However, I found some simple artificial inputs for which it also fails with the same error code. Of course, I’m not sure its for the same reasons.
For example, try running nppiGraphcut_32s8u() with all the capacities other than the terminals set to 0, and the terminals capacities all set to 1 (or -1 or any other constant).
Is this an expected behavior?

Any help will be much appreciated.

Thanks,
David

Bump

Hi David,

thanks for your report! The only critically important thing for the graph cut to work should be that the graph is valid, meaning edges at the boundaries pointing out are all set to 0 (e.g. left_transposed(y,0) == 0 / right_transposed( y,width-1) == 0 / top(x,0) == 0 / bottom(x,height-1) == 0).

The case you describe should be valid and i have been trying to reproduce the issues you are seeing but wasn’t successful so far so. So hoping you can detail a bit more or upload the repro case. Here’s my repro attempt by modifying the SDK sample binarySegmentation to get 0 edge capacities / 1 source cap case (also tried the 1 sink cap case) from your report :

loadMiddleburyMRFData(sFilename, dataCostArray, hCue, vCue, width, height, nLabels);

		NPP_ASSERT(nLabels == 2);

		// mod start

		memset(hCue, 0, width*height*sizeof(int)); // This will lead to edge capacities == 0 everywhere

		memset(vCue, 0, width*height*sizeof(int)); 

		for( int i=0; i<width*height; ++i ) { // This leads to excessflow == 1 everywhere

			dataCostArray[i] = 1;

			dataCostArray[i+width*height] = 0;

		}

                // mod end

		std::cout << "Dataset: " << sFilename << std::endl;

		std::cout << "Size: " << width << "x" << height << std::endl;

This still works fine for me - anything i missed? To rule out system related issues can you also post your OS?

Thanks,

-Timo
binarySegmentation.cpp (7.69 KB)

I have the same problem with NPP graphcut also (NPP_CUDA_KERNEL_EXECUTION_ERROR). The dataset I use seems to be correct (there are check in the code). Do you see any problems?

int main(int argc, char** argv) {

    int* terminals = load("terminals.txt");

int* left_tr = load("left_tr.txt");

    for (int y = 0; y < height; ++y)

        assert(left_tr[y] == 0);

int* right_tr = load("right_tr.txt");

    for (int y = 0; y < height; ++y)

        assert(right_tr[(width - 1) * height + y] == 0);

int* top = load("top.txt");

    for (int x = 0; x < width; ++x)

        assert(top[x] == 0);

int* bottom = load("bottom.txt");

    for (int x = 0; x < width; ++x)

        assert(bottom[(height - 1) * width + x] == 0);

int step, step_tr, step_labels;

Npp32s* d_terminals = nppiMalloc_32s_C1(width, height, &step);

    cout << cudaMemcpy2D(d_terminals, step, terminals, width * sizeof(int), width * sizeof(int), height, cudaMemcpyHostToDevice) << endl;

Npp32s* d_left_tr = nppiMalloc_32s_C1(height, width, &step_tr);

    cout << cudaMemcpy2D(d_left_tr, step_tr, left_tr, height * sizeof(int), height * sizeof(int), width, cudaMemcpyHostToDevice) << endl;

Npp32s* d_right_tr = nppiMalloc_32s_C1(height, width, &step_tr);

    cout << cudaMemcpy2D(d_right_tr, step_tr, right_tr, height * sizeof(int), height * sizeof(int), width, cudaMemcpyHostToDevice) << endl;

Npp32s* d_top = nppiMalloc_32s_C1(width, height, &step);

    cout << cudaMemcpy2D(d_top, step, top, width * sizeof(int), width * sizeof(int), height, cudaMemcpyHostToDevice) << endl;

Npp32s* d_bottom = nppiMalloc_32s_C1(width, height, &step);

    cout << cudaMemcpy2D(d_bottom, step, bottom, width * sizeof(int), width * sizeof(int), height, cudaMemcpyHostToDevice) << endl;

Npp8u* d_labels = nppiMalloc_8u_C1(width, height, &step_labels);

NppiSize size;

    size.width = width;

    size.height = height;

Npp8u* d_buf;

    int buf_size;

    cout << nppiGraphcutGetSize(size, &buf_size) << endl;

    cout << cudaMalloc(&d_buf, buf_size) << endl;

// It prints -3 = NPP_CUDA_KERNEL_EXECUTION_ERROR

    cout << "gc: " << nppiGraphcut_32s8u(d_terminals, d_left_tr, d_right_tr, d_top, d_bottom,

                                         step_tr, step, size, d_labels, step_labels, d_buf) << endl;

    cout << cudaGetLastError() << endl;

    cout << cudaDeviceSynchronize() << endl;

nppiFree(d_buf);

    nppiFree(d_terminals);

    nppiFree(d_left_tr);

    nppiFree(d_right_tr);

    nppiFree(d_top);

    nppiFree(d_bottom);

    nppiFree(d_labels);

delete[] terminals;

    delete[] left_tr;

    delete[] right_tr;

    delete[] top;

    delete[] bottom;

return 0;

}

NppGraphCutReproCase.zip (64.7 KB)

Hi alsp,

thanks very much for your report. Your code is valid and indeed exposes a bug with specific graphs in the current NPP library. The fixed version of the library will become available with the next CUDA release.

The good news is that the result computed in this case is still correct although the return result indicates otherwise. The error is triggered in the last kernel call which is launched although no work is left (the bug is due to a missing check for a 0 grid size which is an invalid launch configuration). Since this is in the last stage and no work is left the result was already final when the error happens and hence can be ignored.

Sorry for the inconvinience,
-Timo

Hi alsp,

thanks very much for your report. Your code is valid and indeed exposes a bug with specific graphs in the current NPP library. The fixed version of the library will become available with the next CUDA release.

The good news is that the result computed in this case is still correct although the return result indicates otherwise. The error is triggered in the last kernel call which is launched although no work is left (the bug is due to a missing check for a 0 grid size which is an invalid launch configuration). Since this is in the last stage and no work is left the result was already final when the error happens and hence can be ignored.

Sorry for the inconvinience,
-Timo

Okay. Thanks for the answer.

Okay. Thanks for the answer.

Hi all,

is there something special we should know about how the data are represented in the arrays. For some reason I can’t make it work.
This is how I access the data in the label array for instance:

int step_labels_norm = step_labels/sizeof(Npp8u);
unsigned char h = d_labels[i*step_labels_norm + j];

i is the row index and j the column index.
i goes from 0 to Height-1
j goes from 0 to Width-1
I suppose that there are only two values for the binary segmentation : 0 and 255, right ?

In the same way I don’t understand how the data are represented in the above example : terminals.txt, left_tr.txt, top.txt…

I am trying to build an application that use this GC segmentation but without success so far. The seed location are entered via the user and from that I am building very simple data models (gaussian), which I use afterward to compute my excess flow (e.g d_terminals).
Concerning the smoothness term I use a simple relation : 255 * (2.0f/(1.0f + expf(d2))) + 10 where d2 is the euclidean distance between two pixels forming an edge in the image.
The example I found on Internet are all working with the middlebury data where a Gaussian Mixture model is used and in this case the function is working fine.

In my code there is no error returned. The program just can’t get out of the function nppiGraphcut_32s8u(). I was also cautious about the boundary edges in the 4 capacity arrays.

I provide also the data files representing my data concerning the sponge image(see attached files). The image should be inverted (in CUDA the image origin is on the bottom left corner and in Windows its the top left corner)

Thanks in advance for any information that can help me.

Greg

Arrays.zip (502 KB)

Hi Greg,

you are correct about the memory layout for the computed labels. Note, that in the above example the data is only available on the device (GPU) after nppiGraphcut_32s8u. If you want to access the result on the host (CPU) the data needs to be copied back to host.

The layout for each of the arrays following your notation is:

int nStep_norm = nStep / sizeof(Npp32s);
int nTransposedStep_noem = nTransposedStep / sizeof(Npp32s);

terminal[j * nStep_norm+i] where terminal = source_capacity - sink_capcity
left_tr[i * nTransposedStep_norm+j] (NOTE: horizontal edge capacities are stored in transposed form)
right_tr[i * nTransposedStep_norm+j]
top[j * nStep_norm+i]
bottom[j * nStep_norm+i]

(See also p110f in the NPP_Library documentation)

I will test with the graph you have provided to investigate potential issues and report my findings shortly.

-Timo

Greg,

i just tried to use the graph you provided via the .dat files. It seems that i am trying to read them in a noncompatible way or something might have gone wrong when you saved them. Here is what i did:

FILE* file = fopen("terminal.dat", "rb");

Npp32s* data = (Npp32s*) malloc(640*480*sizeof(Npp32s));

fread( data, sizeof(Npp32s), 640*480, file);

The terminal capacities are all positive meaning that the solution would be trivial. Further the edge capacities are non zero at the outwardpointing boundary vertices. Since you said you took care of this i guess that i am looking at bogus data. Can you verify the graph that you attached earlier is correct and/or tell me how to read it correctly?

Thanks,

-Timo

Hi Timo,

Thanks a lot for your help. It is greatly appreciated. I ordered my data in a more natural way. For terminal.dat for instance there are 480 rows of 640 columns.
And for the transposed capacities, there are 640 rows of 480 columns.
It is stored like an image and provide me an easier way to verify my data using a text editor. So you need to do a double for statement to read them correctly.

Concerning the d_labels array I am reading it in a kernel, so no problem on this side.

In “terminal.dat” there are plenty of negative values. They are mainly located to the center of the image where the sponge is.
The edge capacity are 0 on the first rows and last rows of the 4 arrays (I know it’s not right but I did that on purpose to avoid any error).
Normally it should be :
top.dat => first row filled with 0
bottom.dat => last row filled with 0
left_tr.dat => first row filled with 0 ( it corresponds to the first column in the non-transposed version)
right_tr.dat => last row filled with 0 (it corresponds to the last column in the non-transposed version

In my notation “i” was the index of the row. Is it the same for you ? It seems more of the index of the columns.

Also I tried to implement a naive version of graphcut segmentation. For that I used your presentation you gave in San Jose (2009 - Graph Cuts with CUDA) which I find
is the most comprehensive document I’ve seen from the point of view of someone who want to implement it. However, I am not exactly sure how do you get the segmented
parts in the image once the pushrelabel algorithm is terminated (e.g. graph saturated). You are talking about the inverse Breadth First Search (BFS) algorithm.
Is it something difficult to implement or do you simply test every node on their remaining capacity (if there are negative they correspond to the sink).

You are also talking about a global relabeling algorithm. For what I understand it is to speed up the process of convergence. But I am not sure where it should be located in the code.
Do you have to do it every few iterations of the push relabel or just at the end of it ?

Thanks again :smile:

Greg

Greg,

yes, sorry, j = row, i = col

Can you please post a code snippet how to read your .dat files corretly? I am still not seeing the values you are describing. Thx.

There is also a more in depth discussion available on the implementation in the GPU Computing Gems Emerald Edition. There is one breadth first search that is to find the labels which is done only once at then end. I call it inverse because it looks at residual edge capacity in the inverse direction of the search. It is more elaborate than thresholding the residual terminal capacities yes. There is another that is used to compute the distance of any vertex to the closest vertex that has residual capacity to sink. That is the global relabeling strategy to improve convergence speed of the algorithm. You might also want to check the orginal paper of Goldberg & Tarjan or a textbook like the one by cormen et al for more details about the algorithm itself.

-Timo

Here is a function you can use to read terminal.dat and the two non-transposed capacities files:

void ReadIntFile(char* name, unsigned int iWidth, unsigned int iHeight, unsigned int stride, int* data)

{

	int val;

	unsigned int stride_norm = stride/sizeof(int);	

	FILE *file;

	file = fopen(name,"r"); 

	if (file == NULL) 

		perror ("Error opening file");

	else

	{

		for (unsigned int i=0;i<iHeight;++i)

		{

			for(unsigned int j=0;j<iWidth;++j)

			{

				fscanf(file,"%d", &val);

				data[i*stride_norm + j] = val;

			}

		}

		

		fclose(file);

	}	

}

And here is the code for the transposed arrays:

void ReadIntFile_tr(char* name, unsigned int iWidth, unsigned int iHeight, unsigned int stride_tr, int* data)

{

	int val;

	unsigned int stride_tr_norm = stride_tr/sizeof(int);

	FILE *file;

	file = fopen(name,"r"); 

	if (file == NULL) 

		perror ("Error opening file");

	else

	{

		for(unsigned int j=0;j<iWidth;++j)		

		{

			for (unsigned int i=0;i<iHeight;++i)

			{

				fscanf(file,"%d", &val);

				data[j*stride_tr_norm + i] = val;

			}

		}

		

		fclose(file);

	}	

}

Thanks for the document references. I’ll take a look at them.

Greg

Hi Greg,

thanks for the code. I just verified that the result of the NPP graphcut matches the result computed on the CPU with Boykov’s code. So there shouldn’t be an issue with your graphs per se. However, i might know where the problem observed by you might come from. Am i guessing correctly that you based your code on the example posted by alsp? If yes, the Graphcut call in the code

cout << "gc: " << nppiGraphcut_32s8u(d_terminals, d_left_tr, d_right_tr, d_top, d_bottom, step_tr, step, size, d_labels, step_labels, d_buf) << endl;

should actually be

cout << "gc: " << nppiGraphcut_32s8u(d_terminals, d_left_tr, d_right_tr, d_top, d_bottom, step, step_tr, size, d_labels, step_labels, d_buf) << endl;

The step and step_tr arguments got mixed up. This resulted in invalid strides and bogus memory accesses and ultimately led to the problem you observed. I will add a check inside the NPP library to detect invalid step arguments for the next version.

On a side note, there will be some changes and additions to the Graphcut primitve in the soon to come new NPP release. For the graph you have posted I e.g. saw a ~3.5x performance improvement (Win7, GTX470). Make sure you are a registered CUDA developer if you want to get access to the beta as it becomes available.

Hi Timo,

It is working now. Thanks. It’s great :) Indeed the arguments were inverted. It’s frustrating sometimes to see an error like this. I don’t think I will have found it.
A x3.5 improvement sounds really interesting. I will definitely check it out.

Now concerning the inverted BFS, if I understand correctly you initialize it by looking for nodes with negative excess flow (connected to the sink, I call them the N-Nodes)
and then you do an iterative process to find the nodes (with positive or nul excess flow) that are connected to the N-Nodes (edge capacity not null). Is that correct ?

Thanks again for your time.

Gregory

Hi Greg,

thats great news. Glad that it is working correctly now :-)

Yes, thats about right. For the final labeling the initialisation is label 0 for nodes with residual terminal cap < 0 and 255 otherwise. Then the iteration goes by looking at the neighbors of the N-nodes that have residual capacities going from the neighbor to N-node and marking those as 0 until convergence.

-Timo

Thanks :thumbup: It is clear for me now.

Have a good day.

Greg

Hi Timo and Greg, I need your help to solve a problem I’m thinking of.

I currently need a graph cut algorithm (ideally push-relabel) and it seems like nppiGraphcut can do the trick with the 4neighbor and 8neighbor as described in grabcut algorithm :

http://www.nvidia.com/content/GTC/documents/1060_GTC09.pdf

I want to know if npp’s graph cut can help me solve for the logical graph cut outputs for {S,T} (i.e. source set and sink set) needed according to the height (distance) solved on the following graph input.

IN matlab, I have the following going into the BK graph cut ‘mxMaxFlow.mex’ algorithm:

[output, flow]=mxMaxFlow2(nNodes,int32(eFrom2),int32(eTo2),weights(:,1),weights(:,2),sourceWeights,sinkWeights);

  • nNodes is the total number of nodes (not including source/sink), usually in the order of 3000 or more
  • eFrom2 are the ‘from’ nodes of the arc
  • eTo2 are the ‘to’ nodes of the arc
  • weights(:,1) = 1e15 or infinite, for the forward capacities between non-st nodes (i.e. n2->n3->n4…etc)
  • weights(:,2) = 0 or zero, for the reverse capacities between non-st nodes (i.e. n2<-n3<-n4…etc)
    -sourceWeights are the capacities/weights of the source branching to all nodes (excluding target/sink node)
  • sourceWeights are the capacities/weights of the all nodes towards sink (excluding source node)

Can npp graphcut help me solve for the output (i.e. logical 1’s for source set, and 0’s for sink set based on height/distance of each non-st node)? And can it solve for the max flow / min cut?

Your help will be super appreciated!

Thanks!