CUDA needs an iPad/iPod/iPhone approach to mass success, software-hardware integration, and a consumer market restart

First, I am referring to the market situation of a little while ago, with some emphasis on the desktop and workstation market. I wonder how it will evolve in the future.
In general, CUDA needs the Apple iPad/iPod/iPhone approach: everything is done properly and simply.
A separate IDE for CUDA applications, with the same functionality on Windows and Linux, is a must. There is no need to stick with the Microsoft compiler; it only causes version mismatches and problems integrating CUDA with MSVS, the Intel compiler, etc. A separate IDE does not need huge functionality for relatively small kernels. It should produce a DLL in which CUDA kernels are exposed as DLL functions, so the host program links against the CUDA DLL and the user's kernel DLLs. The host program could be written in any language and with any IDE.
It calls CUDA kernels just like normal functions.
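To illustrate the calling convention proposed above, here is a minimal sketch in plain C. The function name `scale_kernel_call` and its parameters are hypothetical. The body is only a CPU stand-in for what, in the proposed design, would be a function exported from a user kernel DLL that uploads the data, launches the kernel, and downloads the result.

```c
#include <stddef.h>

/* Hypothetical export of a user kernel DLL (name and signature are
   assumptions). In the proposed design the body would copy `in` to the
   GPU, launch the kernel and copy the result back into `out`.
   The CPU loop below is a stand-in so the sketch compiles anywhere. */
int scale_kernel_call(const float *in, float *out, size_t n, float factor)
{
    if (in == NULL || out == NULL)
        return -1;                 /* reject bad arguments instead of crashing */

    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * factor;   /* one iteration stands in for one GPU thread */

    return 0;                      /* 0 means success */
}
```

A host program written in any language with a C foreign-function interface could then call `scale_kernel_call(in, out, n, 2.0f)` and check the return code, without knowing anything about the CUDA toolchain.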
This CUDA IDE should integrate a debugger, so there is ONE developer download: you just install it and it works. The IDE should integrate a software emulation mode with automatic error checking (not the buggy memcheck) that executes the generated PTX instructions on the CPU, so that you can program anywhere. It works without problems and tells me the line of the error and the memory state. The emulation mode also has a nice feature: you can select the GPU model to emulate, so that you can catch a lot of errors with the number of blocks and threads and with memory on different GPUs without installing the actual GPU.
There is no need for different memory models, CUDA arrays, etc.; just normal read-write memory and read-only texture memory accessed through the texture units. Texture fetching should also be possible for blocks of arbitrary size, maybe with some speed penalty. The current situation is stupid: I test a program and it works, then a user runs it on large data and it fails because the texture is too large.
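The "select a GPU model" idea can be sketched as a table of per-model launch limits that an emulator checks before running a kernel. This is a rough sketch in plain C, not an existing tool: the type `gpu_limits`, the table `GPU_MODELS`, and the function `check_launch` are hypothetical names. The limit values are the ones I recall from the compute capability table in the CUDA C Programming Guide, and they should be verified there before being relied on.

```c
#include <stddef.h>

/* Launch limits of one emulated GPU model (hypothetical structure). */
typedef struct {
    const char *name;                /* label only */
    unsigned max_threads_per_block;
    unsigned max_grid_dim_x;
    size_t   max_shared_per_block;   /* bytes */
} gpu_limits;

/* Assumed values, to be checked against the CUDA C Programming Guide. */
const gpu_limits GPU_MODELS[] = {
    { "compute 1.x",  512u,      65535u, 16u * 1024u },
    { "compute 2.x", 1024u,      65535u, 48u * 1024u },
    { "compute 3.0", 1024u, 2147483647u, 48u * 1024u },
};

enum { LAUNCH_OK = 0, ERR_THREADS = 1, ERR_BLOCKS = 2, ERR_SHARED = 3 };

/* Validate a 1D launch configuration against the selected model. */
int check_launch(const gpu_limits *gpu, unsigned blocks, unsigned threads,
                 size_t shared_bytes)
{
    if (threads == 0 || threads > gpu->max_threads_per_block)
        return ERR_THREADS;
    if (blocks == 0 || blocks > gpu->max_grid_dim_x)
        return ERR_BLOCKS;
    if (shared_bytes > gpu->max_shared_per_block)
        return ERR_SHARED;
    return LAUNCH_OK;
}
```

With such a table, a launch of 1024 threads per block would be reported as an error for the first entry and accepted for the other two, which is the kind of mistake that otherwise only shows up on a user's older card.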
float4 a[9000];
The second case assumes that a is read-only, because it uses the read-only texture cache, and nothing more. And a could be any size; if the size is large, additional clocks are added to point the texture units at the necessary memory region.
So I use cudaMalloc in both cases and afterwards use either a normal fetch or a texture fetch.
No C++, because it is only an additional source of errors, at least for now; just C with function pointers, maybe a few syntax features like operators and templates, but with a 99.9% reliable compiler. I should be 99.9% sure that an error is on my side.
There should be a precompiled DLL for each CUDA version and the main operating systems with the necessary library functions like prefix sum, sort, etc., so I call
prefixsumfloat(a,b,1000); sortfloat(a,b,1000); compactfloat(a,b,1000);
and it works.
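For reference, here is what those three calls could mean, written as a CPU sketch in plain C. The post only gives the call shape `f(a, b, 1000)`, so everything else is an assumption: `a` is taken as the input, `b` as the output, the prefix sum is taken to be exclusive, the sort ascending, and the compaction is taken to drop zero elements and return the number kept. A GPU library would keep the same interface and replace the loops.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Exclusive prefix sum (assumed variant): b[i] = a[0] + ... + a[i-1]. */
void prefixsumfloat(const float *a, float *b, size_t n)
{
    float running = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float v = a[i];      /* read first so that a == b also works */
        b[i] = running;
        running += v;
    }
}

/* Three-way comparison of two floats for qsort. */
static int cmp_float(const void *x, const void *y)
{
    float fx = *(const float *)x;
    float fy = *(const float *)y;
    return (fx > fy) - (fx < fy);
}

/* Ascending sort of a into b; a itself is left untouched. */
void sortfloat(const float *a, float *b, size_t n)
{
    if (n == 0)
        return;
    memcpy(b, a, n * sizeof *a);
    qsort(b, n, sizeof *b, cmp_float);
}

/* Compaction (assumed predicate: keep non-zero elements).
   Returns how many elements were written to b. */
size_t compactfloat(const float *a, float *b, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; ++i)
        if (a[i] != 0.0f)
            b[kept++] = a[i];
    return kept;
}
```

The point of the sketch is the interface: three plain C entry points with flat arrays and a length, which can be exported from a precompiled library and called from any host language.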
I should not have to include headers of some libraries with a lot of useless functions which, for some reason, cannot be compiled on my system.
With version 4.1 I see two things on my system. Sometimes, after my program runs, I cannot run other CUDA programs, because the system does not see the CUDA device anymore. Also, sometimes my Win64 machine simply rebooted at the start of my CUDA program. These were some stack errors caused by wrong kernel parameters or something similar; I had not fully recompiled the program after changing structures. I could not wait forever. Pinned memory, paged memory, concurrent copy and execute, which only works sometimes: I do not need all this, and neither do 90-95% of developers, I think. I just need to put data on the GPU and get it back, but this should work reliably. I suspect the number of features hurts the reliability of CUDA. It also makes CUDA behavior too OS dependent, so you need to test a CUDA program on each OS. And wrong parameters to functions should not mess up my computer or the user’s computer.
CUDA adds two new dimensions of errors, multithreading errors and system errors, which is too much for most developers to handle with very limited debugging tools. I do not understand why Nsight was not released for Linux; how are you supposed to debug a problematic program full of errors with different causes? I find I spend too much time dealing with various elusive issues instead of on the actual code and algorithm.
For now I see CUDA in server applications built for a concrete GPU, which run on one configuration for a long time after a period of testing, or in a workstation with some 3D editor that uses CUDA features, where the consumer buys the whole system. Given the present situation and direction of development, I suggest better integration between CUDA developer teams and hardware resellers, so that CUDA programs are always sold with tested hardware. Programs could also simply be sold with a particular GPU model. Developers would know the target GPU ahead of time and work on it. The pricing model could also be varied.
Another idea for the consumer market is a total restart of CUDA with the next generation of GPUs: really powerful and easy to program. It would carry a big label, CUDAextreme or something, so that developers abandon all old GPUs and consumers know that if they want CUDA they should buy a new one, not old cheap stuff. Today every small GPU is labeled CUDA-capable, but with a small number of SMs it is not really powerful; this only confuses developers and consumers.

Lev, CUDA 5.0 covers a lot of the feature requests made here. The 5.0 release highlights are listed here

CUDA 5.0 already supports separate compilation to create static libraries and adds Nsight Eclipse Edition, a full-fledged CUDA IDE that supports CUDA syntax highlighting, autocomplete, integrated build, and debugger and profiler support. There is also a single CUDA download package now that installs everything required to get started with CUDA. On Windows, the Nsight IDE is not yet integrated into the 5.0 package, but that should be expected to be fixed in the next release. The combined CUDA installer also cross-checks host compiler dependencies to avoid versioning issues with the host compiler.

The GPU emulation mode isn’t coming back :-) You should expect more powerful features in the IDEs and in cuda-gdb that let you remotely debug targets from your host system.

With every CUDA release the goal is to make CUDA simpler and easier to use, so we agree on the approach to mass success and appreciate your desire to help improve CUDA.

That’s great, I will use it when the Windows version is released. By the way, I checked that a texture cache load instruction was introduced in GK110, as I suggested here, but not on desktop GPUs, where it is most needed. Such an instruction could have been added long ago with some performance loss, so that only a faster version of it would need to be introduced now.