I did look at the implementations of the functions in question. There is nothing obvious that would explain the size limitation you’re experiencing.
Some of our oldest implementations used 1D textures for data input to the kernels, which have a size limit, but not the ones you’re having trouble with.
In order for me to even attempt to reproduce this, I would need to know what HW and OS you’re running this on, and how the crash manifests itself, e.g. are there error return codes? Internal exceptions? Can you step through this in a debugger and nail down which function invocation causes the problem?
Even then, it may be difficult for me to reproduce if it is some kind of interaction between your kernels, e.g. if you don’t allocate memory of sufficient size, kernels may overwrite memory and corrupt general state. Obviously, if you could post the code for a simple stand-alone repro of the issue, that would improve the chances of us getting this figured out and fixed.
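If it helps with narrowing things down, here is a rough sketch of the per-call status checking I have in mind for a repro. It is plain host-side C++ and deliberately does not use any real library calls: `Status`, `allocateScratch`, `processImage` and `runPipeline` are all hypothetical stand-ins that I made up for illustration, so you would substitute your actual calls and their actual return type. The idea is only that every invocation gets its return code checked, and the first one that fails is recorded together with its line number.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical status type, standing in for whatever your calls return.
enum Status { kOk = 0, kSizeError = -1 };

// Remembers the first failing call so it can be quoted in a bug report.
struct FailureLog {
    std::string call;  // source text of the failing expression
    int line = 0;      // line where the failing call was made
    int code = 0;      // status code it returned
    bool failed() const { return !call.empty(); }
};

// Runs `expr` only if nothing has failed yet; records the first failure.
#define CHECKED(failure, expr)                                        \
    do {                                                              \
        if (!(failure).failed()) {                                    \
            int code_ = static_cast<int>(expr);                       \
            if (code_ != kOk) {                                       \
                (failure).call = #expr;                               \
                (failure).line = __LINE__;                            \
                (failure).code = code_;                               \
                std::fprintf(stderr, "%s returned %d at line %d\n",   \
                             #expr, code_, __LINE__);                 \
            }                                                         \
        }                                                             \
    } while (0)

// Hypothetical step 1: allocate a scratch buffer of the requested size.
Status allocateScratch(std::vector<unsigned char>& buf, std::size_t bytes) {
    buf.resize(bytes);
    return kOk;
}

// Hypothetical step 2: fails (instead of overwriting memory) if the
// scratch buffer is smaller than what the step needs.
Status processImage(const std::vector<unsigned char>& scratch,
                    std::size_t neededBytes) {
    return scratch.size() >= neededBytes ? kOk : kSizeError;
}

// Runs both steps and reports which one, if any, failed first.
FailureLog runPipeline(std::size_t scratchBytes, std::size_t neededBytes) {
    FailureLog failure;
    std::vector<unsigned char> scratch;
    CHECKED(failure, allocateScratch(scratch, scratchBytes));
    CHECKED(failure, processImage(scratch, neededBytes));
    return failure;
}
```

With these stand-ins, `runPipeline(16, 1024)` comes back with `failed()` set and `call` naming the `processImage` invocation, while `runPipeline(1024, 1024)` comes back clean. A log like that from your real code, plus the image sizes involved, is the kind of detail that would make this much easier to track down.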
PS: I forgot to properly answer your original question, which was whether we could add a StdDev primitive to the 4.0 final release. Sadly, no. Our 4.0 release branch has been in feature freeze for a while and I cannot add new functionality at this point. I did put your request on our 4.1 list, and there’s a decent chance that we’ll get to implementing it for that release.