CUDA C Programming Best Practices Guide released: optimization guidelines

We’ve just released the CUDA C Programming Best Practices Guide. This guide is designed to help developers programming for the CUDA architecture in C with CUDA extensions implement high-performance parallel algorithms and understand best practices for GPU computing. Chapters on the following topics, and more, are included in the guide:

    Introduction to Parallel Computing with CUDA

    Performance Metrics

    Memory Optimizations

    Execution Configuration Optimizations

    Instruction Optimizations

    Control Flow


    Numerical Accuracy and Precision

    Performance Optimization Strategies

This will be included with the 2.3 toolkit, but we decided to release it now because it’s definitely worthwhile reading for any CUDA C developer (a lot of collected internal wisdom on proven optimization strategies, for example). Feel free to post any questions or comments in this thread.

Thanks a bunch for this, it is much more accessible and concise than the other documentation.

The email that announced this to registered developers had some confidentiality language in it, but given that tmurray posted the link to the PDF, I believe I am ok to comment (the email actually invited comments and suggestions for improvement in the announcements forum, but this is a better thread).

In general, this is excellent. Terrific job.

Unfortunately, this document contains less information than the various tutorial slides NVIDIA employees have posted here and there; see for instance

One thing I particularly dislike is that the first code example in the document contrasts the driver API and the runtime API (which, by the way, are now called the low-level C interface and the high-level C++ interface in the programming guide and the reference manual, version 2.2). This document still does not lower the CUDA entry barrier. If I were starting with CUDA now, I’d ignore all the text in the document, fast-forward to the first actual code example, copy and paste it into an editor, and get my hands dirty. Pretty much orthogonal to the approach taken in the “best practices guide”.

This is why we took the simple, stupid axpy kernel and turned it into standalone code with a Makefile (and a VS solution) when designing our CUDA section (check out the “minimalistic CUDA tutorial”). The SDK is nice, no doubt about it, but it almost certainly discourages newbies. There is no documentation on where to start, and the SDK build system (in my opinion) discourages proper CUDA-vs-CPU comparison practice (and is a pain in the butt when integrating CUDA into a moderately complex existing code base). Code up a hack that gives the same result, call it “gold”, time it and publish 1000x speedups :) (Now that I’m at it, I might as well continue tmurray’s rant against cutil.)

Now here’s some constructive criticism: launch a book. “CUDA-GEMS” is a tentative title, following the successful GPU Gems series. Volume 1 would be a collection of the current SDK whitepapers and examples, presented in a way that does not artificially steepen the learning curve like the current SDK does. That probably requires code duplication, but it would be well worth it. If you don’t go for a book, publish something like the axpy example, the reduction whitepaper and something on scan prominently on the CUDA web page. This, if backed up with some single-file code, should lower the learning curve.
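For reference, a minimal standalone axpy program of the kind described above might look like the following sketch (not the actual tutorial code; initialization of the input arrays is elided):

```cuda
#include <stdio.h>

/* y = a*x + y, one element per thread */
__global__ void axpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1024;
    float *x, *y;

    cudaMalloc((void **)&x, n * sizeof(float));
    cudaMalloc((void **)&y, n * sizeof(float));
    /* ... fill x and y via cudaMemcpy from host arrays ... */

    axpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaThreadSynchronize();   /* wait for the kernel to finish */

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```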

Sorry for the rant,


Dom, don’t worry. We are certainly aware that the current documentation can be discouraging to new users, and we’re working on various things to correct this.

Given that I’m occasionally teaching CUDA, I have to worry :)

In my experience, a very simple “what should I read to get started” guide, posted very prominently on the CUDA web page, would do the trick. You have all the material ready! Currently I refer people to the slides Mark used at his USW workshop (link in my previous post), which in turn references whitepapers in the SDK. The 2.2 “Quickstart” document is a good start actually, but it essentially just steps through verifying that installing the toolkit and building the SDK worked well. Add a chapter to that document on how to continue now that the most primitive SDK example runs fine, and I’ll shut up :)

Keep up the good work! All I am complaining about is that the current state of the CUDA documentation is not optimal for self-studying. If you know where to look for conference tutorials, you are fine.

There isn’t anything on fixed-point arithmetic:
Is the throughput for a fixed-point integer mul similar to an integer mul with 24-bit operands?
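For context, 24-bit multiplies are exposed through the __mul24()/__umul24() intrinsics, while a fixed-point multiply additionally needs a renormalizing shift. A sketch, assuming a 16.16 fixed-point format:

```c
#include <stdint.h>

/* 16.16 fixed-point multiply: take the full 64-bit product, then
   renormalize by shifting.  On the device this would be a __device__
   function; note that __mul24() only sees the low 24 bits of each
   operand, so it cannot handle full-range 16.16 values. */
static int32_t fxmul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 16);
}
```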

“Filtering: Valid only if the texture reference returns floating-point data”

This is yet another missed opportunity of pointing out that float4 is also a floating-point datatype - possibly the most efficient one in this context.
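For instance, with the texture-reference API a float4 texture supports linear filtering directly. A sketch; the host-side binding and filter-mode setup are elided:

```cuda
/* filterMode would be set to cudaFilterModeLinear on the host */
texture<float4, 1, cudaReadModeElementType> tex;

__global__ void sample(float4 *out, int n)
{
    int i = threadIdx.x;
    /* tex1D() interpolates between neighbouring float4 texels */
    out[i] = tex1D(tex, (float)i / n);
}
```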

Section “1.1.1 Differences Between Host and Device” ignores that the CPU is not scalar but has a 4 element vector unit - which some smartass will always be keen to point out, diverting the discussion …

Looking good so far :)
I’m no expert on the driver API, but I believe that on page 9


should be replaced by:



Low-level and High-level C++ refer to the different types of functionality available in the Runtime API, which can be used in a strictly C setting (low-level functions only), or a mixed C/C++ setting (low and high level functions can all be used). The Driver API provides only C entry points. In reality, the high-level C++ API is just a bunch of convenience wrappers for templating some of the low-level Runtime API functions.
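To make the distinction concrete, here is the same allocation in both APIs (a sketch; the driver API additionally needs explicit context setup, elided here):

```cuda
/* Driver API: C entry points, explicit initialization and context */
CUdeviceptr dptr;
cuInit(0);
/* ... create a context with cuCtxCreate() ... */
cuMemAlloc(&dptr, 1024 * sizeof(float));

/* Runtime API: a context is created implicitly on the first call */
float *d;
cudaMalloc((void **)&d, 1024 * sizeof(float));
```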

Would the following slides cover the need you describe? These are the slides used in the basic CUDA webinar.
cuda_basics.pdf (1.52 MB)

I like “walkthruogh 2”.

This is the best and most concise “newbie presentation” I’ve seen to date. Excellent!

Some minor suggestions for improvement:

  • any particular reason why integers are used instead of floats in the example? I am still under the impression that most people compute on FP data :). Well, it makes memset (in example 1) a little fuzzier to use in walkthrough 1, so that’s probably why.

  • page 16: these are maximum dimensions, I believe the reader will benefit from learning, here already, that say a 1D grid of 1D blocks is perfectly fine. You actually say this (“up to…”) on page 20.

  • page 17: maybe add the restriction that kernels can’t allocate device memory

  • page 22 is just excellent, there is simply no way to convey more information!

  • typo Walkthruogh on page 23

  • slide 31: I always was under the impression that while events are indeed clock-cycle accurate, cudaEventElapsedTime() only gives a maximum granularity of ~0.5ms because it returns a float. I believe all walkthroughs in this presentation would not give meaningful timings when “benchmarked” this way.

  • slide 32: one CPU thread can control several GPUs at the price of context switching (cudaSetDevice() called repeatedly)

  • slide 34: lifetime: kernel

  • slide 38: occasionally you write “threadblock” and occasionally “thread-block” :)

  • slide n-1 (before the marketing starts :) ): Further reading: Check for what I usually suggest. Maybe add one slide with pointers to whitepapers (which are nowadays even included in the SDK), the reduction thingy, the scan example, simpleTextures, you name it. The “art” is to prioritise these whitepapers or, broadly speaking, providing a commented index of the SDK apps. But that’s probably beyond the scope of these slides

Enough nitpicking, these slides are excellent, no doubt about that, and I am 100% sure that making them available prominently on the CUDA web page would help a great deal.


Good feedback, thanks. I’ve made a few suggested additions. I’ll check on the resolution returned by cudaEventElapsedTime. You’re right, all FD computation is at least SP FP. I used integers purely for convenience: it’s much easier to print values concisely to the console and check arithmetic results when time is very limited (the walkthroughs were used for the hands-on portion of training sessions, where everyone would code from scratch, including me on the projector, instead of looking at the finished code). I think these do get posted somewhere on the webinar portion of the website; I’ll inquire about making them more prominent. The intent was to get someone coding CUDA from scratch in a short amount of time.


Read the cudaEventElapsedTime description in the reference manual: resolution is ~0.5 us (microseconds). Not quite clock-period granularity, but still pretty good.
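For reference, the usual event-based timing pattern looks roughly like this (a sketch using the runtime API; the kernel launch itself is elided):

```cuda
cudaEvent_t start, stop;
float ms;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
/* ... launch the kernel(s) to be timed here ... */
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   /* block until the stop event has completed */

cudaEventElapsedTime(&ms, start, stop);   /* result in milliseconds */

cudaEventDestroy(start);
cudaEventDestroy(stop);
```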



Great basic tutorial. I think it’s evident from the forums that many new users need it.

A few suggestions though:

-. Personally I think a PDF/HTML is far better than PPT.

-. An explanation of emulation mode and the difference between emulation and release builds is something many new users fail to understand.

-. In the samples you put cudaMemcpy after the kernel invocation; many people fail to understand that cudaMemcpy will implicitly call cudaThreadSynchronize, and therefore you see code that calls kernels and doesn’t synchronize correctly. Maybe a description of implicit and explicit synchronization should be added as well. Page 29 talks about it, but there is no code sample showing how/where to use it.

-. Doubles vs floats: -arch sm_XX is also something new users don’t take into account.

-. More about why a kernel would fail and how to see what’s causing it. People run kernels (which fail because of too many resources or access violations) and think that after ten minutes of coding they’ve achieved a x1000 performance boost. Users should understand how to check for errors. Page 30 addresses this a bit; I think it can be extended, as this is one of the most common pitfalls for new users.

-. Some more info maybe on kernel resources: register pressure and how to see the kernel resource usage: --ptxas-options="-v -mem"

-. Differences between shared memory and global memory: people think that to boost the application they simply need to use shared memory instead of global memory. Sometimes people fail to understand that it’s not just a matter of choosing which memory to use; you need to understand how to load data, synchronize it and use shared memory wisely in order to gain performance.

-. I would also suggest people get familiar with threading issues on the CPU before coding for the GPU. People who don’t understand CPU threads, synchronization issues, data dependency et al. will never be able to use GPUs correctly.

-. Maybe add some “nVidia methodology” for how to find the bottlenecks, debug (for example on Windows without a debugger), reduce resource pressure and stuff like that. I know I’d like to hear what nVidia thinks :)

-. Maybe mention the dead-code optimizer. People sometimes don’t understand that the kernel was optimized out and think that the kernel gave a x1000 boost.

I understand that some of those issues might add some more pages, but I think that those (along with what the document already addresses and what Dominik wrote) are the most common issues and misunderstandings new users are facing.
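On the error-checking point above: a common pattern is to wrap every runtime call in a checking macro and to check kernel launches explicitly. A sketch; the macro name CHECK is made up here, not an NVIDIA API:

```cuda
#include <stdio.h>
#include <stdlib.h>

/* CHECK is a hypothetical helper name, not part of the CUDA runtime */
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err = (call);                                \
        if (err != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                    cudaGetErrorString(err));                    \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

/* Kernel launches don't return an error directly, so:           */
/*   kernel<<<grid, block>>>(args);                              */
/*   CHECK(cudaGetLastError());        -- catches launch errors  */
/*   CHECK(cudaThreadSynchronize());   -- catches runtime errors */
```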

my 1 cent,


Thanks for the comments. I think most of your requests fall into the optimization category (the slides above are intended as a minimal basic starter). There are separate presentations (and webinars) on CUDA optimization techniques. And, of course, as the forum-thread title suggests, there’s the best practices guide that covers these issues in more detail.


This certainly looks good and covers important subjects for users who are taking CUDA to the next level.

Thanks for this, nice job!

Thanks for the document, it was worth reading. Although everything is in the programming guide and reference documentation, this one focuses on obtaining fast results :-)

That is great work man…
Thank you for sharing this with us…


I tried executing the program below on my Windows XP PC in emulation mode (as I don’t have a CUDA-enabled GPU), but it’s not giving the correct output. Can anybody tell me how to get the expected output?

the cuda source code is:

#include <stdio.h>

/* vector elements */
#define N 4

/* definition of a function executed on the GPU */
__global__ void vecAdd(float *A, float *B, float *C)
{
	int i = threadIdx.x;
	C[i] = A[i] + B[i];
}

/* prints contents */
void vecPrint(const char *str, const float *A, int size)
{
	if (str)
		printf("%s", str);
	for (int i = 0; i < size; i++)
		printf("\t%+f", A[i]);
	printf("\n");
}

/* the entry point */
int main(int argc, char *argv[])
{
	float A[N], B[N], C[N];
	float *devA, *devB, *devC;

	/* device memory allocations */
	cudaMalloc((void**)&devA, N*sizeof(float));
	cudaMalloc((void**)&devB, N*sizeof(float));
	cudaMalloc((void**)&devC, N*sizeof(float));

	/* setting initial values */
	A[0] = 0.0f;  A[1] = 1.0f;  A[2] = 2.0f;  A[3] = 3.0f;
	B[0] = 4.0f;  B[1] = 5.0f;  B[2] = 6.0f;  B[3] = 7.0f;
	C[0] = 0.0f;  C[1] = 0.0f;  C[2] = 0.0f;  C[3] = 0.0f;

	vecPrint("  A   = ", A, N);
	vecPrint("  B   = ", B, N);

	/* copy data from host to device */
	cudaMemcpy(devA, A, N*sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(devB, B, N*sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(devC, C, N*sizeof(float), cudaMemcpyHostToDevice);

	/* execution on the GPU */
	vecAdd<<<1, N>>>(devA, devB, devC);

	/* copy data from device to host */
	cudaMemcpy(C, devC, N*sizeof(float), cudaMemcpyDeviceToHost);

	vecPrint("A + B = ", C, N);

	/* freeing device memory */
	cudaFree(devA);
	cudaFree(devB);
	cudaFree(devC);

	return 0;
}
I compiled and ran it with nvcc:

D:\CUDA programs>nvcc





D:\CUDA programs>a.exe

A = +0.000000 +1.000000 +2.000000 +3.000000

B = +4.000000 +5.000000 +6.000000 +7.000000

A + B = +0.000000 +0.000000 +0.000000 +0.000000

Instead of giving 4, 6, 8, 10, it gave all zeros.

Please email me the solution.
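For what it’s worth, an all-zero result usually means the CUDA calls are failing silently (for example, no device was found because the code was not actually built with -deviceemu). Checking return codes after each call would show this; a sketch of the relevant part:

```cuda
vecAdd<<<1, N>>>(devA, devB, devC);

cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));

err = cudaMemcpy(C, devC, N * sizeof(float), cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
    printf("memcpy failed: %s\n", cudaGetErrorString(err));
```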