CUDA SUCKS!!! Why can't <<<blocks, threads>>> be determined automatically?

I do not think the CUDA framework is well designed or mature enough, which makes development cost very high and the platform hard to learn and use.

One example: only with CUDA SDK 6.0 did it start to support Unified Memory. Before that, the programmer still had to spend time and energy deciding how to allocate memory and writing specific code to move variables between CPU and GPU memory. This kind of low-level job should be handled by the CUDA runtime itself, not pushed onto the developer. A mature framework should let the developer focus on program logic, not memory plumbing. I am not sure how many people actually use Unified Memory.

Another issue is even sillier: the developer still has to specify the <<<blocks, threads per block>>> values explicitly, as in the vector-add example VecAdd<<<1, N>>>(A, B, C). You have to state the vector length explicitly, and then CUDA distributes the work across N threads, one mini-add each. WHY can the CUDA runtime NOT determine the vector length by itself??? IS IT SO HARD??? Why does CUDA ask the developer to say how much compute each function needs??? Why can the runtime itself not determine and automatically optimize how many resources (grid/block/thread) to use???
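For what it's worth, nothing stops you from wrapping the launch so the caller never types <<<...>>> at all. A hedged sketch (the wrapper name launchVecAdd and the 256-thread default are my own choices, not anything CUDA mandates), assuming a 1-D problem:

```cuda
#include <cuda_runtime.h>

__global__ void VecAdd(const float* A, const float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];   // guard: the grid may overshoot N
}

// Hypothetical wrapper: the caller passes only N, and the launch
// configuration is derived from it by ceiling division.
void launchVecAdd(const float* A, const float* B, float* C, int N) {
    const int threadsPerBlock = 256;   // a common, reasonable default
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocks, threadsPerBlock>>>(A, B, C, N);
}
```

Later CUDA versions (6.5 onward) also ship occupancy helpers such as cudaOccupancyMaxPotentialBlockSize that suggest a block size automatically, so the ingredients for this kind of wrapper exist in the toolkit itself.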

This is only a simple example. If you do a complex M×N-dimensional matrix calculation, it is very tedious to work out how many blocks or threads to use…

EVERYTHING in CUDA must be assigned by the developer. Is CUDA an infant, or are the CUDA designers infants?

I just want to know: who are these CUDA designers??? Do you know what should be encapsulated by the CUDA runtime itself and what should be exposed to developers???

BTW, why does it use three '<' and '>'? I have to press the '<' and '>' keys six times. Could you use a shorter and less ugly expression?

Sorry that CUDA is too hard for you.
I understand that basic math is very difficult and boring.
You should stick with something easy and exciting like Visual Basic.

Usually people do not have such problems, so you can always use something else for your development.
So please do that and leave CUDA development for others who do not find it so incredibly complicated.

Depending on your use case, if you find that CUDA itself offers too much detailed programmer control for your tastes, you may want to consider a middleware product that can use CUDA under the hood, such as Matlab, ArrayFire, or PyCUDA. The preceding is not meant to be a complete list or an endorsement, merely a sampling of such products. There are many more CUDA-enabled frameworks (e.g. for machine learning and image processing), so you might want to look around to find one that best fits your needs.

It is indeed hard for a third-rate company to design an easy-to-use framework. If you NVIDIA guys are good at math and know optimization theory, why do you expose those block and thread parameters to the developer??? I HAVE TO LOOK DOWN ON YOU!!!

The original poster does raise some valid points. Launching a kernel with just one linear dimension should be possible, and the CUDA API could figure out the rest; it could then assume all elements along the linear dimension are parallelizable. Just like a hard disk has a linear dimension for the sector number, and figures out the rest itself: cylinder, head, etc.

A bigger worry is backwards compatibility for PTX. PTX seems to be split into two flavors: sm_20, and sm_30 and above, probably because of the new compiler/switch to LLVM. It would be good to streamline PTX so that every PTX version from sm_10 up to sm_53 works on sm_xx and above. Currently this is not the case, so the compiler needs more work. Otherwise developers end up having to compile a PTX kernel for each architecture. There are already 12 different architectures, which is getting a bit much, and it would be 24 in total if 32-bit and 64-bit versions must both be included.

Also it would be nice if nvcc.exe displayed version information when executed without arguments, instead of requiring the --version parameter. Just a little remark; not displaying version information by default feels amateurish ;) :)

Another big issue with the six-dimensionality of the current CUDA API is the 16 instructions and several registers needed to compute a linear index. I hope this could be sped up, perhaps by a special instruction?

I feel like what makes CUDA actually hard is that hardware ignorance is much more punishing than with CPU programming.

Modern CPUs are so good, they let most people become “bad” programmers. CUDA’s hardware is so young in comparison that if you’re a “bad” programmer, it’s obvious. Gaining an understanding of the GPU architecture is tough, man.

Understanding how CUDA works vs managing my own memory and grids is pretty lol-worthy, I gotta say.

I need to continue the bashing!

What sucks even more is that CUDA uses a 3-level index: grid → block → thread. They expose these low-level things to the developer!!! It is like cutting a big hole in your trousers so everyone can see your underpants!!! Why should anyone care about your "underpants"??? WHY does CUDA make the developer use a formula like (x + y·Dx + z·Dx·Dy) to calculate the thread index??? Why can CUDA not handle that by itself???

Can the CUDA designers make the framework more transparent??? It should be a "black box" for developers. Do those CUDA designers have enough sense to build a lightweight framework??? I highly doubt it!!!

I didn’t even know it was possible to have a troll on this board. I’m so happy no part of the internet is safe from trolling.

The last part may’ve sounded sarcastic but I swear, I’m laughing right now.

On the other hand I've got to say that it's way easier to reach high utilization levels on a GPU after X hours of hard work. Programming a multicore CPU with multithreading and hand-coded SSE instructions is actually way harder than CUDA (IMO), because in most cases you can't just write some OpenMP pragmas and hope that the compiler takes care of the rest of the vectorization for you.
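To illustrate the "write some OpenMP pragmas and hope" style being contrasted here, a minimal sketch (compiled with -fopenmp; without that flag the pragma is simply ignored and the loop runs serially, with identical results):

```cpp
#include <vector>

// Element-wise vector add. With OpenMP enabled the iterations are
// split across CPU threads; the loop body is unchanged either way.
std::vector<float> vecAdd(const std::vector<float>& a,
                          const std::vector<float>& b) {
    std::vector<float> c(a.size());
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(a.size()); ++i)
        c[i] = a[i] + b[i];
    return c;
}
```

The catch, as the post says, is that this only gets you thread-level parallelism; whether the inner loop additionally vectorizes into SSE instructions is up to the compiler, and reaching peak CPU throughput usually still means hand-written intrinsics.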

The initial threshold to get good at CUDA is definitely high, but once you're over that, it's by far more efficient than multi-core CPU programming.

Great Point!

I must say that your (Jimmy P) CUDA code is particularly easy to follow and I probably learned more from examining your sum reduction code than I learned from the GPU programming course I took years ago.

Not to mention that your sum and min/max reduction code is significantly faster than the equivalent calls in Thrust (at least on Kepler GTX 780 Ti, GTX Titan, and Tesla K20–K40).

I agree with this more than my own point, actually.

Very well-said.

CPU programming definitely has a lower threshold to entry. And what really helps is that CPUs are conducive to introductory algorithms and data structures. You can use an if-statement without worrying.

Granted, I've never done any SSE work in my C/C++ days, but bear this in mind: C++11 got std::thread and lambdas. Multithreading is now ridiculously easy to implement in C++ (not perfect, mind you). Maximizing a CPU has always been hard.
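To back up the "ridiculously easy" claim, a minimal std::thread-plus-lambda sketch that splits a reduction across two threads (compiled with -pthread; the function name parallelSum is my own):

```cpp
#include <thread>
#include <vector>
#include <numeric>

// Sum a vector on two threads: each lambda reduces one half into a
// separate accumulator, and join() makes the results visible here.
long parallelSum(const std::vector<int>& v) {
    std::size_t mid = v.size() / 2;
    long lo = 0, hi = 0;
    std::thread t1([&] { lo = std::accumulate(v.begin(), v.begin() + mid, 0L); });
    std::thread t2([&] { hi = std::accumulate(v.begin() + mid, v.end(), 0L); });
    t1.join();
    t2.join();
    return lo + hi;
}
```

Compare this with pre-C++11 pthread boilerplate: no function-pointer-plus-void* argument packing, and captures replace manual argument structs.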

Also, seeing CudaaduC's post above, links to source code? I wanna learn how to do good CUDA too!

But really though, CUDA’s kind of my new jam.

This is JP's sum reduction:

The person who has the most influence on how I do CUDA is definitely Volkov:

I highly recommend checking out his talks.

Let me give one example of what smart design is:

When you save a file under Windows or Linux, it never asks you which cylinder/sector/track of the hard disk to use. It encapsulates all these low-level things, and users do not need to know these concepts at all.

The first thing in designing a programming framework is to consider where the boundary is. In other words: what is the framework's job, and what is the developer's job?

There is an abnormal value system among techies: only complex and difficult things are advanced; easy and simple things must be cheap. Mystifying things shows off one's capability. When you give a presentation, if you do not include some concepts that other people do not know and find hard to understand, then the presentation must be too shallow.

The above phenomenon is very common: C++ developers look down on Java and C# developers; computer vision guys look down on computer graphics guys, because graphics uses very limited math and always hacks in parameters that cannot be explained mathematically, especially in image rendering algorithms.

My judgment: to see how smart a person is, look at their capacity to simplify a problem, because making things complex is toooooo easy.

It’s already been pointed out to you that there are many higher-level frameworks for programming GPUs that do not expose the programmer to the grid/block/thread hierarchy. CUDA is the lowest level framework currently available to program NVIDIA GPUs for compute tasks. Although it is not assembly language, it is arguably only one step above assembly language. Therefore, it stands to reason that it would have the lowest level of abstraction of underlying hardware concepts. Many low-level concepts are exposed to the programmer.

If you don’t like that, you should take a look at other strategies like OpenACC, ArrayFire, Matlab, etc. None of those expose grid/block/thread hierarchy to the programmer.

Tools are generally suited for a particular type of task, and may not be suited for other task types. If you told an end user or applications programmer that the way to access a hard disk is via a transactional interface (fopen/fread), they would probably be perfectly happy. If you told an operating system programmer that the only way to access a hard disk is via a transactional interface, they might be concerned. Is that the fastest way to handle all cases? What if I want to create hidden structures on the disk? How can I make my hard disk have the same logical abstraction as a network resource? What if I want to format a disk? Mark a bad sector? Create a high-performance RAID array? None of these questions are relevant for an end user or applications programmer, but at a lower level, they may be important.
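The "transactional interface" an applications programmer sees can be sketched in a few lines of stdio; no cylinders or heads appear anywhere (the helper name roundTrip is my own):

```cpp
#include <cstdio>
#include <cstring>

// Write bytes through the stdio transactional interface and read
// them back. Where the bytes physically land on the disk is decided
// entirely by the OS and the drive, not by this code.
bool roundTrip(const char* msg) {
    std::FILE* f = std::tmpfile();   // anonymous scratch file
    if (!f) return false;
    std::fwrite(msg, 1, std::strlen(msg), f);
    std::rewind(f);                  // seek back to the start
    char buf[64] = {0};
    std::fread(buf, 1, sizeof(buf) - 1, f);
    std::fclose(f);
    return std::strcmp(buf, msg) == 0;
}
```

That abstraction is perfect for the end user, and exactly what an OS or RAID developer cannot live with, which is the point being made.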

Criticizing CUDA because it offers exposure to low-level concepts that you don’t need/don’t want ignores that there may be a legitimate purpose for those. If there is a legitimate purpose, then to hide it would be a problem. If CUDA were the only way to access GPUs for compute purposes, then the ecosystem would be considerably weaker (as it was in circa 2007-2008). But today, there are many approaches to tapping into the compute power of GPUs, some of which include fairly high levels of abstraction. You should consider taking a look at some of those.

My judgment: simplicity is not always a desirable goal. Simplicity vs. complexity as a goal depends on the task at hand. Simplicity is desirable for things I don't care about. Complexity is desirable for things that I wish to have a great deal of control over. I'm certain you can find complexity in Microsoft Word. It's just not related to accessing data on the disk. It's more likely to be found in the ways that you can format the written word on the page. Microsoft Word gives you exquisite control over that.

And before you run off and say CUDA should do both, I suggest that one-size-fits-all tools are rarely successful. Do one thing well. By building a low-level framework, and doing it well, with an appropriate level of complexity, CUDA enables higher level frameworks that abstract and simplify various aspects for various needs. The proof is in the pudding: the strategy appears to be working, as there are quite a few other programming languages and frameworks that can now use CUDA as a lower level framework. For a resource-constrained company, doing one thing well is a far more tractable proposition than trying to serve every need from the outset.

Sometimes I wish there were a feature on this site for awarding a 50 point bonus to an answer :-) A few additional quick thoughts.

(1) While CUDA was consciously designed with a C-style low-level abstraction that can serve as the basis for many other frameworks with higher level of abstraction, an important part of CUDA (and the CUDA design philosophy) from day one have also been domain-oriented libraries that hide hardware details (almost) completely. I would claim that this is a substantial part of the reason for CUDA’s success compared to competing approaches.

(2) For every CUDA programmer clamoring for higher levels of abstraction there is another one who desires less in the way of abstraction and more detailed control, from access to native assembly language to scheduling customizations. Usually equally-frequent demands from both extreme ends of a spectrum are indicative of a workable middle-of-the-road approach with wide appeal.

(3) As the past 30 years of parallel programming technology have demonstrated, parallel programming is inherently more complex than sequential programming. Prior to the impending death of Moore's Law, relatively few software engineers were compelled to face that reality. In parallel programming there is often a hierarchy of mechanisms to consider; for example, on CPUs one might typically have SIMD vectorization (often still requiring the use of intrinsics), OpenMP (for thread-based parallelism), and MPI (for inter-node parallelism) in combination. In that context, the complexity of CUDA seems par for the course. And as someone who wrote traditional explicitly-parallel SIMD code for some years, I would claim the implicit parallelism of CUDA's SIMT model makes programming significantly easier.

I thought about it some more… currently CUDA graphics cards are like SSE or SIMD machines, which is not really what I would want, but it's what we have right now. The fact that more expensive cards have multiple processors is interesting. Perhaps using blocks only could be interesting, and threads can be seen more as an SSE-like feature… So there could still be some advantage to this six-dimensionality thing.

But that has little to do with "parallelism" in its true meaning… SSE and SIMD are more of a trick to apply one instruction to a whole bunch of data. It's a limited, almost domain-specific application of a restricted computational model =D.

In some sense it would be weird to call the SSE units of CPUs "parallel units" or a parallel feature… though sometimes that might make sense, sometimes not. The same goes for a CPU processing bits in parallel: would you call a 32-bit AND a parallel architecture?

These "CUDA cores" are in that sense a bit misleading… another form of misdirection by NVIDIA, to some degree. If they were truly parallel compute units, this six-dimensionality would not be needed.
One last argument for six dimensionality would be "multiple-chip designs and such".
However, those chips are not designed in six dimensions.
So the mere fact that this API has six dimensions is weird and a sign of something strange going on, to say it somewhat more nicely ;)

Anyway… there is something else that's bugging me a bit: the event timer in CUDA being limited to milliseconds. This seems very amateurish. Why not microseconds or nanoseconds? Memory works at nanosecond scale, so again… weird.

And of course the launch time of CUDA kernels probably makes CUDA unsuited for small parallel computation tasks… only the bigger problems/data sizes seem worth solving on it… so many limitations in CUDA currently.

(Though I haven't yet had much experience with CUDA kernel launch times… the event timer being limited to milliseconds kind of prevents measuring that accurately, funnily enough… perhaps a synchronize plus the CPU's high-performance timer could shed some more light on it ;))

The cudaEvent timer is not limited to milliseconds. cudaEventElapsedTime happens to return a float quantity, scaled such that 1.0f = 1.0 millisecond. That does not mean the resolution is limited to milliseconds. The stated resolution is around 0.5 microseconds:
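For reference, the usual event-timing pattern looks like this; a hedged sketch (the function name timeRegionMs is my own). The key point is that the returned float is *denominated* in milliseconds but carries sub-millisecond precision:

```cuda
#include <cuda_runtime.h>

// Time a stretch of GPU work with CUDA events. The result is a float
// in milliseconds; a value like 0.042f means roughly 42 microseconds,
// well below one millisecond.
float timeRegionMs() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // ... launch kernels to be timed here ...
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // block until 'stop' has occurred

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```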

You’ve referred a few times now to 6 dimensions. I guess you mean 3 grid dimensions and 3 block dimensions. In a cc3.0 + device, there is no particular requirement to use 6 dimensions. The first grid dimension is constrained to 2^31 - 1 blocks, and the block dimension works just fine one dimensionally at 1024 threads per block (max). In either case, going to 2 or 3 dimensions is largely a matter of convenience for the programmer, and definitely not a requirement. Many codes are actually simpler looking when the full 2- or 3- dimensional support is used, but it’s by no means mandatory.

The division between threads and blocks arises directly out of the hardware design of the chip. Since this hardware design must be exploited correctly to achieve full performance, it makes sense (at some low level of abstraction) to expose this difference to the programmer.

The API does indeed have a division between grid and block. This arises out of the hardware design of the chip (broken into SMs) and does not indicate anything “strange” going on – other than exposing that to the programmer. Beyond these 2 “dimensions”, the afforded ability to go to 6 dimensions (if desired) is largely at the discretion of the programmer, not mandatory, and in fact helpful for some codes.

I don’t agree with your comparison of SIMD quantities to processing 32-bits “in parallel”. The vast majority of SIMD use cases involve processing quantities in parallel, that are not all part of the same word-space. Most of the time, a 32-bit quantity is not a concatenation of smaller quantities from separate spaces (e.g. separate pixels, separate matrix elements, etc.) In virtually all scientific use cases, the concatenation of quantities into a SIMD vector word does involve separate spaces (e.g. separate pixels, separate matrix elements, etc.) and the number of times people use a SIMD calculation to compute a 128x32 or 256x32 bit single-word quantity is nearly zero, by comparison to other forms of scientific usage. Most of the extant literature, including that from intel, discusses SIMD vectorization as a key methodology to extracting parallelism inherent in a problem (at least for SIMD-capable processors):

“Simply put, Single Instruction Multiple Data, or SIMD, does multiple data computing in one instruction and is a kind of parallel computing technology…”