Thank you again, @MatColgrove, for your prompt and helpful reply.
It appears that malloc()+memset() does work, while calloc() is not supported.
I noticed that dynamically allocating the thread-private buffer does work with nvc, but it causes a significant performance hit: my program runs 5x to 10x slower than with a statically sized local array, say float buf[10] = {0.f};
You can see this significant speed difference by checking out the latest version of the code that I used in another thread.
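For context, the two variants being compared look roughly like the sketch below. This is a simplified illustration, not the actual umcx code: the function name run_with_private_buffer, the constant BUF_LEN, and the loop body are hypothetical, and the plain OpenMP parallel-for pragma is only an assumption standing in for whatever offloading directive the real code uses. Only the USE_MALLOC macro and the float buf[10] = {0.f} declaration come from the description above.

```cpp
#include <cstdlib>   // std::malloc, std::free
#include <cstring>   // std::memset

// Hypothetical buffer length, matching the float buf[10] example.
constexpr int BUF_LEN = 10;

// Hypothetical kernel: every loop iteration gets its own scratch buffer,
// writes one entry, and adds the buffer contents to a reduction variable.
float run_with_private_buffer(int nphoton) {
    float total = 0.f;

    // Assumption: a host-side OpenMP loop; the pragma is ignored if OpenMP is off.
    #pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < nphoton; i++) {
#ifdef USE_MALLOC
        // Dynamic variant: heap allocation plus explicit zeroing per iteration.
        float* buf = static_cast<float*>(std::malloc(BUF_LEN * sizeof(float)));
        if (buf == nullptr) continue;  // skip this iteration if allocation fails
        std::memset(buf, 0, BUF_LEN * sizeof(float));
#else
        // Static variant: fixed-size local array, zero-initialized.
        float buf[BUF_LEN] = {0.f};
#endif
        buf[i % BUF_LEN] += 1.f;       // stand-in for the real per-thread work
        for (int j = 0; j < BUF_LEN; j++) {
            total += buf[j];
        }
#ifdef USE_MALLOC
        std::free(buf);                // release the per-iteration buffer
#endif
    }
    return total;
}
```

Both variants compute the same result; the only difference is where the buffer lives, which is what the -DUSE_MALLOC build flag toggles.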
Here is how I measured the performance hit.
If you download the latest code from GitHub (fangq/umcx: micro mcx) and compile it using make nvc, it uses the static array. Then run the benchmark:
git clone https://github.com/fangq/umcx.git
cd umcx/src
make clean
make nvc # use static private array
../bin/umcx --bench cube60 -n 1e7
You can see that the above simulation runs relatively fast (it is about 3x slower than my tests in an earlier benchmark because the template is not working in nvc).
However, if you recompile it using malloc/memset:
make clean
make nvc USERCXXFLAGS=-DUSE_MALLOC
../bin/umcx --bench cube60 -n 1e7
you can see that the simulation now runs 5x to 10x slower.
Could you suggest an alternative way to create a thread-private buffer that does not have such a high overhead?