Can you GUESS this without experimenting? Latencies

The following are two code snippets that basically copy a structure from device global memory to shared memory. The size of the structure is 28 bytes (7 floats, the same size as 7 ints).
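For reference, a minimal declaration consistent with that description; the field layout and the DPInput declaration are assumptions, since neither appears in the thread:

struct bno_input
{
    float f[7];                          // 7 floats = 28 bytes; the real field names are not shown here
};

__device__ struct bno_input *DPInput;    // global-memory array of per-block inputs (assumed declaration)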

One block of 512 threads runs this code, on 8800 GTX hardware.

I want you to pick the snippet that would run faster and explain why you think so.

Greatly appreciate your time and inputs.

void binomial_start(void)
{
    __shared__ struct bno_input option;

    option = DPInput[blockIdx.x];
    return;
}

In the following snippet, I have used pointers. I have made sure that the compiler resolves the pointers correctly to the global and shared memory spaces. No advisory warnings were issued. I have cross-checked the PTX code too.

void binomial_start(void)
{
    int i;
    __shared__ struct bno_input option;
    int *dst, *src;

    dst = (int *)&option;
    src = (int *)(DPInput + blockIdx.x);

    for (i = threadIdx.x; i < (sizeof(struct bno_input) / sizeof(int)); i += blockDim.x)
    {
        dst[i] = src[i];
    }
    __syncthreads();
    return;
}

Normally I would guess the second one, because it looks so much more complicated that it must intuitively be faster!

At a glance, I suppose the second one issues 7 float load instructions, and, depending on how the compiler works, the first snippet might issue only a single 32-byte read. In that case the first one would be very fast, except that it's likely not coalesced.

I’m going to guess the first one, simply because the memory access pattern looks more natural.

Change the pointer types to int2* in the second snippet and you’ll likely see a further small improvement in performance.
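A sketch of that change to the second snippet's copy loop, under the assumption that struct bno_input is padded to a multiple of 8 bytes (at 28 bytes it is not, so a 32-byte padded layout is hypothetical):

int2 *dst2, *src2;

dst2 = (int2 *)&option;
src2 = (int2 *)(DPInput + blockIdx.x);

// Each thread now moves 8 bytes per iteration instead of 4.
for (i = threadIdx.x; i < (sizeof(struct bno_input) / sizeof(int2)); i += blockDim.x)
{
    dst2[i] = src2[i];
}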

If you're not sure why one version of the code is faster, carefully read the section on gmem coalescing and then examine the access patterns of your two codes.

Paulius

Also, check how many times the same data is read unnecessarily from gmem by the first code.

With only one block running it is going to be extremely difficult to tell which piece of code runs faster, since the driver and launch overheads will be so large compared to the execution time. I would say it doesn’t matter one bit.

Why are you having all threads in the block read the same value into the same shared memory location? That is completely and utterly wasteful. Just do “if (threadIdx.x == 0) load data” with __syncthreads() around it to avoid all the extra loads.

May be true. But still, we are accessing global memory here, so it is bound to be slow. And I measure time using CPU ticks, which is super-granular, so one can still see the time difference between the two accesses.

And, you are right about the redundant access. All 512 threads reading the same locations… Truly redundant.

Thanks for all your time.

I will publish the result after the new year.

There really is no mystery here. The first code is terrible from the memory system's point of view, since its reads are not coalesced and the same data is read 512 times needlessly. If you're still not comfortable with coalescing, check out the slides titled “07 - G8x hardware, performance optimization, and precision” from here:

http://www.isi.edu/~ddavis/GPU/Course/Slides/

Slides 25 through 28 explain coalescing visually as well as list experimental results.

Paulius

True! It seems obvious which one will be slow. However, for smaller block counts the opposite is true. I can't guess why and have no explanation for what I see. I have attached bar charts. The code is also present inside the Excel sheet.

Chirality,

Thanks for your answer. But I think (IMHO) you need some CUDA enlightenment here. The first code looks simple: it simply translates to 7 load instructions. BUT all 512 threads execute all 7 load instructions, resulting in far too much redundant global memory access. This comes with a big penalty. Check out “Performance Guidelines” in the CUDA manual.

Yeah, I did not read your code closely enough and shouldn't have bothered with a terrible guess.

Oh! Not a problem! But still, I don't understand why smaller block counts behave differently from bigger ones.

Can someone explain that?

It’s not clear what you’re timing. Do the following:

  1. minimize and clean up your kernel code, taking out the computations (double-check the PTX to make sure that values are still read from gmem into smem). At the very least, put a __syncthreads() after the read in the first kernel (see the sketch after this list) - currently some threads will be reading from “option” while others will be writing to it due to the hazard you don't take care of. This lack of a sync is probably where the difference comes from, since the second code does sync, making sure to avoid hazards.

  2. post (1) and the segment of code that launches and times the kernel. I'm not sure how you're getting CPU clock ticks - is that from a single launch? The differences are small enough to be noise. For this, why not use the CUDA event API to time GPU execution at CPU-clock precision, rather than use the timer on the CPU side?
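A minimal sketch of item 1, with the __syncthreads() added to the first kernel (the __global__ qualifier is an assumption, since the posted snippets omit it):

__global__ void binomial_start(void)
{
    __shared__ struct bno_input option;

    option = DPInput[blockIdx.x];   // every thread writes the same shared struct
    __syncthreads();                // make sure the struct is fully written before any thread reads it
}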

Paulius

I thought about the __syncthreads() for the first one, but I figured it is not necessary. Once a thread has crossed the “option = input_options[blockIdx.x]”, it does NOT matter whether the other threads are still loading it; the values are already present there. So __syncthreads() would just be a waste of time.

#ifndef __PERFORMANCE_COUNTER_H__
#define __PERFORMANCE_COUNTER_H__

#include <windows.h>

class HPTimer
{
private:
    LARGE_INTEGER tFreq, tStart, tEnd;

public:
    HPTimer(void)
    {
        QueryPerformanceFrequency(&tFreq);
    }
    void start(void)
    {
        QueryPerformanceCounter(&tStart);
    }
    void stop(void)
    {
        QueryPerformanceCounter(&tEnd);
    }
    long TimeInTicks(void)
    {
        return (long)(tEnd.QuadPart - tStart.QuadPart);
    }
    double TimeInSeconds(void)
    {
        return (double)(tEnd.QuadPart - tStart.QuadPart) / (tFreq.QuadPart);
    }
};

#endif

I use an object of HPTimer called “profiler” to measure time. I call “profiler.start()” before launching the kernel, call “profiler.stop()” after “cudaThreadSynchronize()”, and then print the time using the profiler.TimeInTicks() function.
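A minimal sketch of that usage; the header file name and the launch configuration are placeholders, and binomial_start is assumed to be a __global__ kernel:

#include <cstdio>
#include "performance_counter.h"        // the HPTimer header above (file name assumed)

void profile_kernel(int numBlocks)
{
    HPTimer profiler;

    profiler.start();
    binomial_start<<<numBlocks, 512>>>();   // kernel under test
    cudaThreadSynchronize();                // make sure the kernel has finished before stopping the timer
    profiler.stop();

    printf("%ld ticks (%f s)\n", profiler.TimeInTicks(), profiler.TimeInSeconds());
}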

I run the experiment several times and take an average value for each block size.

This method of timing is NOT precise; I think context switches on the CPU can affect the measured time. But since I am averaging over many experiments, the results that I have posted should be fairly correct. Also, I ignore abnormally large values if I find any, as they must be caused by context switches. 99% of the time, I got almost the same result when I repeated the experiments.

Let me know if you still want me to do a “GPU” time measurement.

The only problem I have with GPU time measurement is that the include file “cutil.h” does NOT come with the toolkit; it comes with the SDK, and it also needs the libraries that come with it. And I never understood what CUT_SAFE_CALL etc. are - they don't find a mention in the manual. So I figured I am better off using documented ways of doing things. The QueryPerf… functions are a Microsoft-recommended way of high-performance profiling. I am sorry if I have over-talked on the “documentation” part. Maybe it's just my ignorance.

Thank you

Best Regards,

Sarnath

Do the GPU time measurements with cudaEventRecord and related friends, which are in the 1.1 toolkit; they measure the actual execution time on the GPU. The timer in cutil.h is just a different interface to QueryPerformanceCounter.
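For reference, a minimal sketch of that approach; numBlocks and the launch configuration are placeholders, and binomial_start is assumed to be a __global__ kernel:

#include <cstdio>

void time_kernel(int numBlocks)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                 // recorded on the default stream, before the launch
    binomial_start<<<numBlocks, 512>>>();      // the launch being timed
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                // wait for the kernel and the stop event to finish

    cudaEventElapsedTime(&ms, start, stop);    // elapsed GPU time in milliseconds
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}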

On a side note, do try

if (threadIdx.x == 0)
    option = DPInput[blockIdx.x];
__syncthreads();

In my kernels with similar access patterns, I find it MUCH faster than having every thread read the same value.

Another thing to try would be to put DPInput into constant memory. The size of constant mem would limit you to data for only ~2000 blocks, but it will probably read much faster than global mem. Also, constant mem reads best when all threads in a warp read the same value, so you can probably get decent performance without a syncthreads, like this:

void binomial_start(void)
{
    struct bno_input option;

    option = DPInput[blockIdx.x];
    return;
}

Note that I left option in registers, since having all threads write to the shared memory location is asking for weird race conditions and will slow things down considerably by serializing access to shared memory. Depending on how option is used, this may adversely increase your register count, though I would guess it won’t change much.
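A hedged sketch of the constant-memory setup that snippet assumes; the array size of 2000 follows the estimate above, and h_input / numBlocks are placeholders for the application's host data and block count:

// Device-side declaration replacing the global-memory DPInput; with a 28-byte
// struct, ~2000 entries fit comfortably in the 64 KB constant space.
__constant__ struct bno_input DPInput[2000];

// Host side, before launching the kernel: copy the per-block inputs into constant memory.
void upload_inputs(const struct bno_input *h_input, int numBlocks)
{
    cudaMemcpyToSymbol(DPInput, h_input, numBlocks * sizeof(struct bno_input));
}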

I very strongly suggest using CUDA events for timing execution on the GPU. If you do end up using CPU timers, make sure to call cudaThreadSynchronize() not only prior to stopping the timer, but also before starting it as well, so you don’t end up including previous CUDA calls.

To add to MisterAnderson42's comment on reading the same data by all threads in a block (both suggestions are good; the constant memory approach is usually faster), you can also consider reading the option from a texture bound to linear memory. In some of my code, fetching the same data by all threads was faster than having one thread read into smem and sync.

Paulius
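A minimal sketch of that texture variant, under the assumptions that the option array is reachable through a host-side device pointer d_DPInput and is read word by word; the names, the __global__ qualifier, and numBlocks are illustrative:

texture<int, 1, cudaReadModeElementType> optionTex;   // file-scope texture reference over 32-bit words

__global__ void binomial_start_tex(void)
{
    struct bno_input option;                           // per-thread copy, as in the constant-memory version
    int *dst = (int *)&option;
    int base = blockIdx.x * (int)(sizeof(struct bno_input) / sizeof(int));
    for (int i = 0; i < (int)(sizeof(struct bno_input) / sizeof(int)); ++i)
        dst[i] = tex1Dfetch(optionTex, base + i);      // all threads fetch the same words via the texture cache
    /* ... use option ... */
}

// Host side: bind the texture to the existing linear global-memory array before launching.
void bind_option_texture(struct bno_input *d_DPInput, int numBlocks)
{
    cudaBindTexture(0, optionTex, d_DPInput, numBlocks * sizeof(struct bno_input));
}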