No results are written to output buffer - why?

Ok, I’ve got a Geforce GTX 275 now. Code doesn’t work though (results are crap), but no errors are signaled by CUDA.

Driver v186.18

CUDA v2.2

Is there anything obviously wrong about this:

#define BLOCKSIZE 512

__global__ static
void pj_inv_pre_kernel (double* lam, double* phi, double* x, double* y, double to_meter, double x0, double y0, double ra)
{
	const unsigned int xIdx = blockIdx.x * blockDim.x + threadIdx.x;
	x [xIdx] = (lam [xIdx] * to_meter - x0) * ra;
	y [xIdx] = (phi [xIdx] * to_meter - y0) * ra;
}

void pj_inv_pre_gpu (double* lam, double* phi, double* x, double* y, double to_meter, double x0, double y0, double ra, unsigned int nPoints)
{
	pj_inv_pre_kernel<<<(nPoints + BLOCKSIZE - 1) / BLOCKSIZE, BLOCKSIZE>>> (lam, phi, x, y, to_meter, x0, y0, ra);
}

void pj_inv_stream (CPJParams* P)
{
	pj_inv_pre_gpu (coords.xIn (), coords.yIn (), coords.xOut (), coords.yOut (), P->to_meter, P->x0, P->y0, P->ra, coords.Length ());
}

I have implemented a helper class that does most of the work of managing the CPU- and GPU-side buffers. Please ask if you need more information or code.

I have to admit that I haven't really understood how to determine the optimal grid and thread parameters for processing an array of arbitrary length.
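For reference, the usual pattern for an arbitrary-length array is the rounded-up grid size already used above, combined with a bounds check inside the kernel so that the surplus threads in the last block do not write past the end of the buffers. A minimal sketch of that idea (the extra nPoints parameter and the guard are my additions, not part of the posted code):

__global__ static
void pj_inv_pre_kernel (double* lam, double* phi, double* x, double* y, double to_meter, double x0, double y0, double ra, unsigned int nPoints)
{
	const unsigned int xIdx = blockIdx.x * blockDim.x + threadIdx.x;
	if (xIdx < nPoints) {	// guard against the partial last block
		x [xIdx] = (lam [xIdx] * to_meter - x0) * ra;
		y [xIdx] = (phi [xIdx] * to_meter - y0) * ra;
	}
}

// launch: one thread per element, grid rounded up to cover all nPoints elements
pj_inv_pre_kernel<<<(nPoints + BLOCKSIZE - 1) / BLOCKSIZE, BLOCKSIZE>>> (lam, phi, x, y, to_meter, x0, y0, ra, nPoints);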

Edit:

Changing the code as follows doesn’t help:

#define BLOCKDIM 16

void pj_inv_pre_gpu (double* lam, double* phi, double* x, double* y, double to_meter, double x0, double y0, double ra, unsigned int nPoints)
{
	dim3 dimBlock (BLOCKDIM, BLOCKDIM);
	dim3 dimGrid ((nPoints + BLOCKDIM - 1) / BLOCKDIM, (nPoints + BLOCKDIM - 1) / BLOCKDIM);
	pj_inv_pre_kernel<<<dimGrid, dimBlock>>> (lam, phi, x, y, to_meter, x0, y0, ra);
}

It looks like no results are being written to the output buffers.

Actually this stuff looks pretty straightforward, judging from the user guide and the SDK examples. What the heck am I doing wrong?

I am currently evaluating ATI's Stream Computing and NVidia's CUDA - not for fun, but for my company. We offer mostly business-to-business services and will use one or the other company's products in our servers for various purposes in the future.

You may call me dumb because I cannot figure out my problem myself, but I have invested a good share of time in reading the documentation and SDK samples, and I have figured out how to implement my test case with ATI's Stream Computing, so I should be able to get this going with a little help and a good pointer or two.

I think that it cannot be in NVidia’s interest to lose a potential customer because he doesn’t get any support in getting over initial hurdles in using CUDA; so I thought that someone on the official NVidia forums would and could help me with that. Now maybe I just don’t get a reply because it’s the weekend, so I will happily wait for the next week to hopefully get some help here.

If not, it will be ATI - if only for the simple reason that I have found CUDA too inaccessible and the disadvantage of not getting any help in case of problems too grave.

Turn the code that interacts with the GPU (from device selection and the initial memcpy to checking the results on the copied back buffer) into a compilable repro case and post it. You haven’t posted enough meaningful code for anyone to help you. More than likely, something’s wrong with your kernel launch and it’s never running on the GPU in the first place. Plus, stop overcomplicating things with these ridiculous helper classes–it will just make it harder to find errors later on (e.g., passing a host pointer to the device).

Also, stop with the sense of entitlement. Nobody here posts on the forums for a living (I just do it when waiting on something else). At the end of the day, whether forum poster karx11erx buys ATI or Intel or NVIDIA is not going to keep me from sleeping.

To, um, “expand” on tmurray’s comments:

You answered your own question here. A forum post on Friday morning was not answered for 24 hours, probably because the usual forum readers were distracted or doing something else. No one is paid to read these forums (apparently not even tmurray), but people post here because they want to be helpful and to learn from others. Sometimes it takes a while to get to a solution because a forum is a poor communication window to understand what you are doing unless you are already very close to the answer yourself. If one of us could stand next to your terminal, this would be much easier to solve. (I assume this is why people pay consultants $150/hour. :) )

Also, I re-read the thread, and no one called you dumb or was even close to doing so. I wouldn’t blame you for being frustrated, though: getting into the “CUDA mindset” is tricky at first.

This is a reasonable statement, though it sounds too much like an ambiguous threat. (I must admit, my response was pretty much the same as tmurray when I got to this point.) Time is money, and you have to make the cost-benefit analysis and decide when the learning curve is not worth additional effort. Forum readers will continue to try to help, but if you need responses faster, you’ll either need to hire a consultant (see the employment forum) or decide that ATI Stream Computing meets your needs and move on. All are fair choices.

Ahead of everything, let me state that I am not a novice developer who just finished university. I have been a professional for 20 years, with another 10 years of coding experience on top of that.

I wrote that I would post more code if needed, but I didn't want to spam the thread with pages of code (and be attacked for doing that). It might well have been that there was already a mistake I hadn't noticed in what I had posted. I can post the entire (compilable) project in an archive here, but I guess that's not what you want either.

Maybe.

You haven’t even seen these classes, but you have the cheek to call them overcomplicated and ridiculous. I have derived this class (one) from an official NVIDIA SDK sample. Obviously your sample coder was an unskilled, ridiculous, overcomplicated noob then? I hope you feel like a fool now.

NVIDIA should have an interest in hooking potential customers to their company, instead of having trollish employees like you turn them off and cast a bad light on a company that already has a reputation for being somewhat snotty and arrogant (read the internet if you have been living under a rock).

I have to say that I would rather go for ATI than have to deal any further with somebody who behaves towards me the way you do.

That’s why I said that I’d happily wait until next (i.e. this) week. Would even have waited until Friday. Btw, my post was up last Friday already, which is Thursday evening in the U.S. I don’t know what time zones you’re on, but I suppose that at least tmurray is from the U.S., which means he has posted his above reply on Sunday. So much for the “weekend”.

This is the official NVIDIA forum. There are certainly NVIDIA employees around. If this is the wrong place to look for help with a problem like mine, a single-line post with a link to the proper information (even if it was some support guy I'd have to pay) would have done.

Prophylactic statement, wasn’t that clear?

“A threat”. Oh my. Should I laugh or should I cry? If I say something like that after having really tried in some company where people have their minds halfway straight, they’d simply do their best to gain a customer - and not kick me in the teeth like tmurray does. A reply (one that really helps me) this week would be fast enough, that should be clear from what I wrote.

I had needed some help with my ATI Stream Computing implementation too, but the reception and support I got on the ATI forums were worlds apart from what I got here. I have also looked for help on other forums where professional people are around (i.e. no fan forums), and they are worlds apart from what a person like tmurray presented here, too. Go figure.

It says a thing or two about this place that while I have to wait for days for an answer regarding a technical problem, people jump at me the very moment they see a chance to feel offended and behave like a troglodyte (tmurray, not you).

karx11erx, I assure you that we do care about your experience and do want you to succeed with CUDA. NVIDIA, as a company, does not regard customer frustration (regardless of how it may be expressed) as something benign or annoying. We have invested a tremendous amount of research, time, and money into our GPU computing resources, and are very interested in seeing customers succeed. More importantly, NVIDIA does not condone any communication to a customer in the tone that Tim’s appears to carry. I sincerely hope that Tim will take the time to clarify or retract his comments.

Separately, I have asked the CUDA team to engage this thread (sadly, I have a 2lb brain that I keep trying to wedge 5lbs of knowledge into, and “CUDA Smarts” is somewhere in that leftover 3 lbs :( ). I will check it from time to time to make sure your concerns are addressed.

Regarding more official support, I’m looking into the right guidance to give you about that as well.

Please accept my apologies for any hurt feelings here – we want everybody to succeed with NVIDIA products and tools. We’re not always successful, but that doesn’t mean we should stop trying.

I think both of us inferred something about the tone of each other’s posts that wasn’t intended to be there. karx11erx, I’m sorry for any misunderstanding there has been.

Getting back to figuring out what's going on in your app, you need to post a full repro case with as few helper functions (including ones derived from the SDK) as possible. The reason is that if you've got some configuration problem (which is always a possibility), helper functions have a really bad habit of obscuring exactly what's going on in the application. This is especially true of a lot of the SDK helper functions (I complain about cutil on a regular basis), which may not behave identically in terms of error handling from debug to release.

Something like

/* kernel */
__global__ void kernel(...)
{
	...
}

int main(int argc, char** argv)
{
	// initialize host side array, fill with data
	// initialize GPU buffers
	// memcpy to GPU
	// call kernel
	// memcpy from GPU to CPU
	// confirm correctness of results
}

would let us solve this very quickly.
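One thing worth adding to such a repro (a general suggestion on my part, not something required above): check the runtime's return codes explicitly after the launch and the copies, since a kernel that never runs otherwise fails silently. A rough sketch using the CUDA 2.x runtime API, with placeholder names (my_kernel, d_buf, h_buf, nbytes) that are purely illustrative:

// placeholder names for illustration only
my_kernel<<<grid, block>>>(d_buf);

cudaError_t err = cudaGetLastError();		// catches launch/configuration errors
if (err != cudaSuccess)
	printf("kernel launch failed: %s\n", cudaGetErrorString(err));

err = cudaThreadSynchronize();			// catches errors raised during kernel execution
if (err != cudaSuccess)
	printf("kernel execution failed: %s\n", cudaGetErrorString(err));

err = cudaMemcpy(h_buf, d_buf, nbytes, cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
	printf("memcpy failed: %s\n", cudaGetErrorString(err));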

The ratio between the performance of single- and double-precision versions of the same code will depend on the application. Even though the ratio of SP to DP units is 8:1 in the current generation, the performance ratio is always lower, since not all instructions in the program are DP. For example, even for very arithmetically intensive code like dense matrix-matrix multiplication, the ratio is ~5 (400 GFlops/s SP, 85 GFlops/s DP). For other codes, which have more non-DP instructions, SP and DP performance is even closer. Also, DP codes consume more bandwidth, so that is likely to become a limiting factor as well.

Regarding your question about DP instructions. The current generation hardware does a basic set of DP operations in hardware. The rest (transcendentals, etc.) are built up from this set (meaning that an operation, such as sin, will take a sequence of DP instructions).

Paulius

karx,

as a quick test, can you compile and run the attached code? It's a tiny test of DP operation. The results should be incorrect if SP arithmetic is used (make sure to compile with the '-arch=sm_13' compiler flag).

Paulius

Edit: OK, seems uploading .cu file failed, so I’m pasting it:

#include <stdio.h>

__global__ void test_kernel( double* output, double *input, int n )
{
	int idx = blockIdx.x*blockDim.x + threadIdx.x;
	output[idx] = input[idx]*pow(2.0,-10);
}

int main()
{
	double VALUE = pow(2.0,-20);
	int n = 1024;
	int nbytes = n*sizeof(double);

	double *h_data = (double*)malloc( nbytes );
	double *d_data = 0;
	cudaMalloc( (void**)&d_data, nbytes );

	if( 0==d_data || 0==h_data )
	{
		printf("couldn't allocate memory\n");
		return -1;
	}

	for(int i=0; i<n; i++)
		h_data[i] = VALUE;

	cudaMemcpy( d_data, h_data, nbytes, cudaMemcpyHostToDevice );

	dim3 block(256,1);
	dim3 grid(n/block.x, 1);
	test_kernel<<<grid,block>>>( d_data, d_data, n );

	cudaMemcpy( h_data, d_data, nbytes, cudaMemcpyDeviceToHost );

	double sum=0.0;
	for(int i=0; i<n; i++)
		sum += h_data[i];

	double diff = sum - n*VALUE*pow(2.0,-10);
	if( diff < 0 )
		diff = -diff;

	// note this test assumes that results should match exactly, since all numbers are powers of 2
	// testing within epsilon would be needed in the general case
	if( diff != 0 )
		printf("error in result\n" );
	else
		printf("correct result\n");

	if( h_data )
		free( h_data );
	if( d_data )
		cudaFree( d_data );

	return 0;
}
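For reference, when building this test from the command line rather than from a Visual Studio project, the double-precision flag Paulius mentions would be passed along these lines (the file name dp_test.cu is just a placeholder):

nvcc -arch=sm_13 dp_test.cu -o dp_test

Without -arch=sm_13 the compiler targets compute capability 1.0 and demotes double-precision arithmetic to single precision, which is exactly what this test is designed to detect.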

Thank you all for your replies so far.

Bingo! That was the entire problem!

I know I had been told about this here before, but I swap code between my workplace and home machines so I can work from home, and apparently I failed to properly update my Visual Studio solution file at home.

It’s the small things that make people happy … :rolleyes:

Thank you very much.

Well, with my double precision test case my ATI Radeon HD 4870 1GB (Sapphire, default clocks) is almost twice as fast as my NVIDIA Geforce GTX 275 (MSI Twin Frozer OC), and that is with trigonometric and transcendental functions implemented by myself for the ATI card, while using the hardware functions on the NVIDIA hardware.

The compute 1.3 devices don’t have hardware trigonometric and transcendental instructions. The CUDA runtime library implements them in software out of standard double precision floating point arithmetic primitives.

So what? I did the same for the ATI test.