No results are written to output buffer - why?

I am using double buffering to pass data to CUDA kernels and receive results from them, but I don’t get any results back. Why is that?

Code snippets (GpuArray is from the smokeParticles demo, I used it to make accessing CUDA easier):

GpuArray<double2> xyStream [2];

GpuArray<double> zStream [2];

double2* xyBuf = NULL;

double* zBuf = NULL;

int nDestXY, nDestZ;

//------------------------------------------------------------------------------

int WriteBuffers (long pointCount, int nStride, double *x, double *y, double *z)

{

for (int i = 0, j = 0; i < pointCount; i++, j += nStride) {

   xyBuf [i].x = x [j];

   xyBuf [i].y = y [j];

   }

if (!z)

   memset (zBuf, 0, pointCount * sizeof (double));

else {

   if (nStride < 2)

	  memcpy (zBuf, z, sizeof (double) * pointCount);

   for (int i = 0, j = 0; i < pointCount; i++, j += nStride)

	  zBuf [i] = z [j];

   }

return 0;

}

//------------------------------------------------------------------------------

int CreateBuffers (long pointCount, int nStride, double *x, double *y, double *z)

{

xyBuf = new double2 [pointCount];

zBuf = new double [pointCount];

nDestXY = 0;

nDestZ = 0;

WriteBuffers (pointCount, nStride, x, y, z);

return 0;

}

//------------------------------------------------------------------------------

void WriteStreams (long pointCount)

{

memcpy (xyStream [!nDestXY].getHostPtr (), xyBuf, sizeof (double2) * pointCount);

memcpy (zStream [!nDestZ].getHostPtr (), zBuf, sizeof (double) * pointCount);

xyStream [!nDestXY].copy (GpuArray<double2>::HOST_TO_DEVICE, 0, pointCount);

zStream [!nDestZ].copy (GpuArray<double>::HOST_TO_DEVICE, 0, pointCount);

}

//------------------------------------------------------------------------------

int CreateStreams (long pointCount)

{

nDestXY = 0;

nDestZ = 0;

xyStream [0].alloc (pointCount, false, false, false);

xyStream [1].alloc (pointCount, false, false, false);

zStream [0].alloc (pointCount, false, false, false);

zStream [1].alloc (pointCount, false, false, false);

WriteStreams (pointCount);

return 0;

}

//------------------------------------------------------------------------------

void ReadStreams (long pointCount)

{

xyStream [nDestXY].copy (GpuArray<double2>::DEVICE_TO_HOST, 0, pointCount);

zStream [nDestZ].copy (GpuArray<double>::DEVICE_TO_HOST, 0, pointCount);

memcpy (xyBuf, xyStream [nDestXY].getHostPtr (), sizeof (double2) * pointCount);

memcpy (zBuf, zStream [nDestZ].getHostPtr (), sizeof (double) * pointCount);

}

//------------------------------------------------------------------------------

int pj_transform (CPJParams *srcdefn, CPJParams *dstdefn, long pointCount, long nStride, double *x, double *y, double *z, bool bUseStreams)

{

CreateBuffers (pointCount, nStride, x, y, z);

CreateStreams (pointCount);

...
__global__ static

void pj_inv_pre_stream (double2* lp, double2* xy, double to_meter, double x0, double y0, double ra) 

{

   const unsigned int xIdx = threadIdx.x;

xy [xIdx].x = (lp [xIdx].x * to_meter - x0) * ra;  

xy [xIdx].y = (lp [xIdx].y * to_meter - y0) * ra;

}

void pj_inv_pre (double2* lp, double2* xy, double to_meter, double x0, double y0, double ra, unsigned int nPoints) 

{

dim3 grid (1, 1, 1);

dim3 threads (nPoints, 1, 1);

pj_inv_pre_stream<<<grid, threads>>> (lp, xy, to_meter, x0, y0, ra);

}

nDestXY = !nDestXY; // switch buffers

pj_inv_pre (xyStream [!nDestXY].getDevicePtr (), xyStream [nDestXY].getDevicePtr (), P->to_meter, P->x0, P->y0, P->ra, nPoints);

ReadStreams ((long) nPoints);

The data in the buffer referenced by xyStream [0].getHostPtr() always is the initial data, and the data in xyStream [1].getHostPtr(), which should contain the computation results after having called pj_inv_pre() is always undefined. I have no clue why.

(The equivalent ATI Stream Computing code works w/o problems btw.)

I really need to know what I am doing wrong so that the destination device buffer isn’t written, and I can’t figure it. There must be somebody here who can help me, right?

From a quick scan of your code, if nPoints is greater than 512, the kernel will not launch.
The limit may be even lower depending on how many registers are used by the kernel.

You are also using doubles, are you compiling with the right flag ( -arch sm_13) and is your GPU double precision capable?

With a single block, your code will be very inefficient, you should read some of the basic CUDA material.

Thanks. I have read that material and found it hard to understand. Imo it lacks simple explanations and examples. Why do I have to bother about blocks at all? I thought CUDA would just throw as much data at the GPU as possible and advance through the data until all is processed? If I have to determine some block size, I would have to do it in a way that works for all CUDA capable hardware, which probably have different block sizes or counts, right?

I am using a Quadro FX 570.

Take a look at the articles on Dr. Dobbs:

http://www.ddj.com/hpc-high-performance-computing/207402986

You basically design a kernel to process a fixed amount of data ( number of threads) and then find out how many of these you need to cover the problem ( number of blocks). This will work and scale on all the CUDA capable hardware.

That stuff is still too complicated. My application doesn’t require any communication between threads. Each thread only has to process a double2 element of a huge array. So I thought that kernel<<<1, elemCount>>> would walk through an array of elemCount elements, processing as many elements as there are threads (let’s say 512), and then process the next 512 elements until it has processed all. Is that wrong?

All threads within a block are run on the same SM (to guarantee that you can use very fast shared memory between all threads in a block). GT200, for example, has 30 SMs. Obviously optimizing for the number of SMs of whatever hardware you have is not safe or sensible unless you do a whole lot of safety/sanity checks, and there is dedicated hardware to very quickly generate and remove blocks for the SMs.

So, in your case, you’d subdivide your work into some number of threads–probably 256 to 512–and launch as many blocks as you need to cover your elements. This is not the traditional streaming model because you have the fast shared memory–it’s actually a lot more flexible.

Ok. I think I know what to do now. Thanks so far.

Well, I have gotten further, but the stuff still doesn’t work.

The following code swaps input and output streams, calls the GPU code and reads the computation results back to the host. coords holds the input and output streams (device memory) and host buffer, P holds parameters.

coords.SwapXY ();

pj_inv_pre_gpu (coords.xIn (), coords.yIn (), coords.xOut (), coords.yOut (), P->to_meter, P->x0, P->y0, P->ra, coords.Length ());

coords.GetStreams ();

This is the GPU code (BLOCKSIZE is 512, nPoints is 12152):

__global__ static

void pj_inv_pre_kernel (double* lam, double* phi, double* x, double* y, double to_meter, double x0, double y0, double ra) 

{

	const unsigned int xIdx = blockIdx.x * blockDim.x + threadIdx.x;

x [xIdx] = (lam [xIdx] * to_meter - x0) * ra;  

y [xIdx] = (phi [xIdx] * to_meter - y0) * ra;

}

void pj_inv_pre_gpu (double* lam, double* phi, double* x, double* y, double to_meter, double x0, double y0, double ra, unsigned int nPoints) 

{

pj_inv_pre_kernel<<<(nPoints + BLOCKSIZE - 1) / BLOCKSIZE, BLOCKSIZE>>> (lam, phi, x, y, to_meter, x0, y0, ra);

}

I have been compiling with -arch sm_13.

This stuff works fine on ATI hardware with ATIs Stream Computing, so I am at the end of my wisdom here again. cutilCheckMsg says something about an “invalid device function”, but why?

(Quadro FX 570, driver v182.08, CUDA 2.1, C2D E8500, 4 GB RAM, WinXP pro SP 3)

That kernel will segfault because you are launching more threads than there data elements and not putting any bounds on when you read from memory. Pass unsigned int length and add if (idx < length) in your kernel.

The device memory is always sized so that length (buffer) % BLOCKSIZE == 0. Example: nPoints = 100, BLOCKSIZE = 512 -> resulting buffer size will be 512. I am doing that to avoid having to check the index against the buffer size all the time:

#ifdef CUDA

#	define	STREAMTYPE	_Tgpu

#else

#	define	STREAMTYPE	::brook::Stream<_Tgpu>

#endif

template <class _Tcpu, class _Tgpu>

class CGpuArray {

   private:

	  STREAMTYPE*				m_gpuBuffer [2];

	  _Tcpu*					 m_cpuBuffer;

	  int						m_nOutStream;

	  int						m_bDoubleBuffer;

	  unsigned int			   m_nLength;

	  unsigned int			   m_nSlack;

	  unsigned int			   m_nStride;

	  unsigned int			   m_blocks [3];

#ifdef CUDA

	  unsigned int			   m_threads [3];

#endif

public:

	  // set the buffer parameters

	  void Setup (const unsigned int length, const unsigned int stride,  const unsigned int blockSize = 16, const int bDoubleBuffer = 1) {

		 Destroy ();

		 m_nLength = length;

		 m_blocks [0] = blockSize;

		 m_nSlack = (m_blocks [0] - length % m_blocks [0]) % m_blocks [0];

		 m_blocks [1] = (sizeof (_Tcpu) / sizeof (_Tgpu)) * (m_nLength + m_nSlack) / m_blocks [0];

		 m_blocks [2] = 1;

#ifdef CUDA

		 m_threads [0] = blockSize;

		 m_threads [1] = (length + blockSize - 1) / blockSize;

		 m_threads [2] = 1;

#endif

		 m_nStride = stride;

		 m_bDoubleBuffer = bDoubleBuffer;

		 m_nOutStream = 1;

		 }

	  // create the streams

	  bool CreateStreams (void) {

		 for (int b = 0; b <= m_bDoubleBuffer; b++) { 

#ifdef CUDA

			cutilSafeCall (cudaMalloc ((void **) &m_gpuBuffer [b], m_blocks [0] * m_blocks [1] * sizeof (_Tgpu)));

			if (!m_gpuBuffer [b])

#else

			if (!(m_gpuBuffer [b] = (m_blocks [0]> 1) ? new ::brook::Stream<_Tgpu> (2, m_blocks) : new ::brook::Stream<_Tgpu> (1, m_blocks + 1))) 

#endif

			   {

			   Destroy ();

			   return false;

			   }

			}

		 SetStream ();

		 return true;

		 }

	  // create and initialize the local CPU memory

	  bool Set (_Tcpu* data) {

		 if (data && (m_nStride < 2) && !m_nSlack)

			m_cpuBuffer = data;

		 else {

			if (!(m_cpuBuffer = new _Tcpu [m_nLength + m_nSlack]))

			   return false;

			if (!data)

			   memset (m_cpuBuffer, 0, sizeof (_Tcpu) * (m_nLength + m_nSlack));

			else {

			   if (m_nStride < 2)

				  memcpy (m_cpuBuffer, data, sizeof (_Tcpu) * m_nLength);

			   else {

				  for (unsigned int i = 0, j = 0; i <m_nLength; i++, j += m_nStride)

					 m_cpuBuffer [i] = data [j];

				  }

			   }

			memset (m_cpuBuffer + m_nLength, 0, m_nSlack * sizeof (_Tcpu));

			}

		 SetStream ();

		 return true;

		 }...

oh, now you’ve edited to say what GPU you’re using.

FX 570 doesn’t support double precision; only Compute 1.3 devices support DP, and FX 570 is Compute 1.1.

No, now you have read that I am using a Quadro FX 570. I had posted it further up already.

So the FX 570 doesn’t support double? I thought NVidia targets the professional market with the Quadro line? I don’t know whether I should cry or laugh, but … well, I feel laughter irresistably bubbling up inside of me …

The FX 570 came out almost a year before there was double precision hardware available. Anyway, Quadro is primarily for visualization, not CUDA development–being able to use every possible CUDA feature is not as important to that audience as other considerations.

Afaik there is no double precision GPU yet. They all emulate double precision with single precision arithmetic. All you could say is that the FX 570 doesn’t support DPFP emulation.

Well, I hope I will not have too hard a fight getting some useable CUDA capable NVidia hardware. The company I work for has little clue that you can do more than just play on them which makes them very suspicious to that kind of order … :wacko: … and I really wonder what they’re gonna say this time, as I already had problems getting a Radeon HD 4870 from them (my workplace machine came preconfigured with the FX 570, which wouldn’t have been my choice if I had been asked … I believe that size matters :rolleyes: ).

Anyway, thanks for your help.

Oh, one more question: Which NVidia gfx card would be comparable to a Radeon HD 4870 1 GB?

No, GT200 has actual double precision units in it–GTX 260, GTX 275, or 285 would be the obvious consumer cards for initial development. 260 or 275 is probably about equivalent in price to the 4870 1GB.

That’s interesting information. I take that I would have double precision built-in trigonometric and transcendental functions?

I am looking for an NVidia card equivalent to a 4870 in performance, not price. GT 275?

Correct. The CUDA Programming Guide lists the precision of these functions in table C-2. It’s generally 1 or 2 ulps.

Another thing to keep in mind is that there is one double precision floating point unit for every 8 single precision floating point units on the GTX 200-series cards, which means that double precision arithmetic is about that same factor slower than single precision.

Well, I can speed up my application 30 times with a Radeon HD 4870, where I had to implement all trig. and transc. functions in double prec. myself. Actually naming a comparably priced NVidia card was the right idea. I want to see where I get more bang for the buck. ATI gets slowed down by about factor 4 when using double compared to single precision. I really want to know how an equivalent NVidia card fares here, so I will see to get a GT 275.

Karx,

Your question was answerd by Seibert.

Possibly 8 times slower but u get fully compliant double-precision arithmetic. (I dont have practical experience with doubles though)