Sharing a single counter (variable) across multiple thread blocks

Here is what I have done so far: each thread block element performs a calculation and, depending on the outcome, sometimes there are no results of interest, so the calculation results can be disregarded and do not need to be processed by the CPU and saved into storage.

To improve the routine's performance, I wanted to find a way to signal from within the CUDA kernel back to the CPU code that at least one interesting solution was found, so the results then need to be parsed by the CPU. Otherwise, the results can be skipped and the next calculation cycle can be executed. Here is the high-level pseudo-code:

__global__ void subKernel(long long unsigned int *varCounter)
{
  // do some calculations for the given thread block
  if (varResult > 0)
    (*varCounter)++; // signal to the CPU that a positive result was found in at least one thread block
}

int main()
{
  // initial setup
  for (int i = 0; i < varCycleCount; i++)
  {
    varCounter = 0;
    subKernel<<<(parameters)>>> (varCounter);
    if (varCounter > 0)
      // process results, save into storage
  }
}

Needless to say, when I try to execute something like this, the kernel fails to launch with the message: “an illegal memory access was encountered”, pointing at the line where varCounter is incremented.

varCounter is allocated in unified (managed) memory, shared between host and device:

long long unsigned int *varCounter = NULL;
cudaMallocManaged(&varCounter, sizeof(long long unsigned int));

but that does not seem to have helped. I am passing it as a pointer into the kernel and then dereferencing it to increment the value.

I was looking for similar topics/problems, but most solutions revolve around creating an array and examining each and every position outside the kernel, which relies on the CPU and slows down execution substantially. I am looking for a way to signal something back from the GPU to the CPU.

Thanks

M

Do you perform error checking on the call to cudaMallocManaged()? What GPU are you using? Have you worked through the introductory blog post and example code for unified memory use here:

[url]https://devblogs.nvidia.com/parallelforall/unified-memory-cuda-beginners/[/url]
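
For reference, a minimal error-checking pattern might look like this (a sketch; CHECK_CUDA is just an example name, not anything from the toolkit):

#include <cstdio>
#include <cstdlib>

#define CHECK_CUDA(call) do {                                 \
    cudaError_t err = (call);                                 \
    if (err != cudaSuccess) {                                 \
        fprintf(stderr, "CUDA error: %s at %s:%d\n",          \
                cudaGetErrorString(err), __FILE__, __LINE__); \
        exit(EXIT_FAILURE);                                   \
    }                                                         \
} while (0)

// usage:
CHECK_CUDA(cudaMallocManaged(&varCounter, sizeof(long long unsigned int)));
CHECK_CUDA(cudaGetLastError());        // after a kernel launch
CHECK_CUDA(cudaDeviceSynchronize());   // catches errors from kernel execution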

Error checking is done on both the malloc (no error there, it is just a single variable) and the call to subKernel (this is where the error is signaled).

I am working on a GTX 1080 at this time. I did find the atomicAdd option, but whenever I add it to the code, execution stalls, irrespective of the compilation options (I am already setting the target architecture to compute 6.1 compatible).

When you dump out the device properties, what is the value of the property concurrentManagedAccess? I am reasonably sure this must report as ‘1’ or ‘true’ for your code to work.

This might be a dumb question, but how do I “dump out device properties”?

If your claim is correct, it should require minimal effort to convert this from pseudo-code to a real, complete code example that is not much longer than what you have shown here and demonstrates the issue. If you can do that, you can probably get some help.

If you choose to do so, please indicate platform (windows/linux) and CUDA version, as well as compile command line. All of that matters for decoding expected managed memory behavior.

Use the secret powers of documentation :-)

[url]https://docs.nvidia.com/cuda/cuda-runtime-api/index.html[/url] (CUDA Runtime API :: CUDA Toolkit Documentation)
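
For example, a minimal query might look like this (sketch, error checking omitted):

#include <cstdio>

int main()
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0); // query device 0
  printf("concurrentManagedAccess = %d\n", prop.concurrentManagedAccess);
  return 0;
}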

The debugging is simple (IMO): I disable the one line in which varCounter is incremented, execute the code under full mem-check, and everything runs correctly. No issues at all. Next, I uncomment the line in which varCounter is incremented, compile, run again, and the indicated error takes place.

mem-check does not really add any clarity beyond the “illegal memory access” information. Given that it is a single variable shared across all threads, and not an array where a position needs to be calculated, I can only conclude there is some race condition where multiple threads attempt to write to the memory location at the same time. That kind of makes sense, but then what is the way out?

Windows 10, with Visual C++ 2017 (15.4.5), with CUDA (9.1/9.0 - both tested and NOT making any difference)

I will share the compile parameters when my PC unlocks - seems like I need a hard reset to get it out of the loop ;(

aha.

You should be doing:

*varCounter = 0;

not

varCounter = 0;

in your host code.

and after the kernel, you should be doing this:

subKernel<<<(parameters)>>> (varCounter);
cudaDeviceSynchronize();
if (*varCounter > 0)
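
Putting both fixes together, the per-cycle host loop would look something like this (a sketch; varCycleCount, numBlocks, and threadsPerBlock are placeholders standing in for your actual parameters):

long long unsigned int *varCounter = NULL;
cudaMallocManaged(&varCounter, sizeof(long long unsigned int));

for (int i = 0; i < varCycleCount; i++)
{
  *varCounter = 0;                                 // dereference: reset the value, not the pointer
  subKernel<<<numBlocks, threadsPerBlock>>>(varCounter);
  cudaDeviceSynchronize();                         // required before host access to managed memory
  if (*varCounter > 0)
  {
    // process results, save into storage
  }
}
cudaFree(varCounter);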

Here are promised compilation parameters

C:\Users\User\source\repos\delimiter\delimiter>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.1\bin\nvcc.exe" -gencode=arch=compute_60,code=\"sm_60,compute_60\" --use-local-env --cl-version 2017 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.11.25503\bin\HostX86\x64" -x cu  -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.1\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.1\include"     --keep-dir x64\Release -maxrregcount=0  --machine 64 --compile -cudart static     -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /FS /Zi  /MD " -o x64\Release\main.cu.obj "C:\Users\User\source\repos\delimiter\delimiter\main.cu"

As mentioned, I am forcing compute capability to 6.0 at this time. No compilation errors.

Fair enough, changes were made accordingly. I will come back once the code has run for some time to confirm whether it fixed the issue :)

Thanks! I did not connect the dots - cudaGetDeviceProperties is something I know about and already use; for some reason it just did not click :)

Having checked it, concurrentManagedAccess is set to 0 at this time. It does not seem to affect operation with the fixes made per your suggestions - below I can see that I have a non-zero number of saved candidate solutions, so the suggested fixes dealt with the problem correctly. It was dumb of me NOT to have noticed the issue beforehand :(

.\delimiter.exe all-delimiters 32 15 2 3 512 2048 1 out 0x00
GPU Device 0: "GeForce GTX 1080" with compute capability 6.1

Device 0 GeForce GTX 1080 with Compute 6.1 capabilities will be used; concurrent memory access is set to: 0

GPU memory usage: used = 1.3943 GB, free = 6.6057 GB, total = 8.0000 GB

CUDA kernel launch (initialize delimiters) with 2048 blocks and 512 threads per block, thread block multiplier 1:
. completed

GPU memory usage: used = 1.4141 GB, free = 6.5859 GB, total = 8.0000 GB

CUDA kernel launch (delimiter calculations) with 2048 blocks and 512 threads per block, thread block multiplier 1, target iteration cycles 4096

Examined:      1072693248 sequences, cycle:   1023, saved:         72, last candidate: 11111111111111111111111111111100
Examined:      2146435072 sequences, cycle:   2047, saved:        127, last candidate: 11111111111111111111111111111110
Examined:      3220176896 sequences, cycle:   3071, saved:        171, last candidate: 11111111111111111111111111111101
Examined:      4293918720 sequences, cycle:   4095, saved:        252, last candidate: 11111111111111111111111111111111
Examined:      4294967296 sequences, cycle:   last, saved:        252, last candidate: 11111111111111111111111111111111

===================================
Last Examined Delimiter:
 --- hex: 0xffffffff
 --- bin: 11111111111111111111111111111111
Next Start Delimiter:
 --- hex: 0x00000000
 --- bin: 00000000000000000000000000000000
Execution Statistics:
 --- blocks: 2048
 --- threads: 512
 --- thread block size: 1048576
 --- elapsed: 10 seconds
 --- delimiters examined: 4294967296
 --- delimiters / second: 429496729
 --- delimiters saved: 252
 --- delimiters saved to files:
    >>> out.DEL.32.HAM.15.DC.2.RUN.3.TH.512.BK.2048.CPUTH.1.txt
    >>> out.DEL.32.HAM.15.DC.2.RUN.3.TH.512.BK.2048.CPUTH.2.txt
    >>> out.DEL.32.HAM.15.DC.2.RUN.3.TH.512.BK.2048.CPUTH.3.txt
    >>> out.DEL.32.HAM.15.DC.2.RUN.3.TH.512.BK.2048.CPUTH.4.txt
===================================

I guess the topic can be closed :) Much appreciated!

M.

Concurrent managed access would be expected to be zero: CUDA 9.1 on Windows with a Pascal GPU means you will be operating in the pre-Pascal UM regime:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements

One implication of this is that concurrent managed access is not possible in this regime, therefore the property will be zero. Furthermore, in this regime, host-code access to managed memory after a kernel call must be preceded by a cudaDeviceSynchronize() call, which is why I inserted that:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-gpu-exclusive

As an aside, multi-threaded, unprotected access to a single memory location may lead to unpredictable behavior:

(*varCounter)++;

If you are expecting that variable to track the actual number of times device code attempted to increment it (i.e. to actually keep count), that is unreliable. If you are merely expecting it to end up non-zero when one or more threads touch it, that is probably safe (barring overflow). But if you do not actually need a count and only use it as a flag that one or more threads indicated a “candidate find”, then a plain store is safe and predictable:

*varCounter = 1;
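
In kernel code, with varResult standing in for your per-thread outcome as in your pseudo-code, the safe flag pattern looks like this (sketch):

__global__ void subKernel(long long unsigned int *varCounter)
{
  // ... per-thread calculation producing varResult ...
  if (varResult > 0)
  {
    *varCounter = 1;                // flag only: safe, every writer stores the same value
    // atomicAdd(varCounter, 1ULL); // alternative, if an exact count were ever needed
  }
}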

I understand the number might NOT be accurate, and that is fine. I started off with the notion of a simple boolean flag set to true whenever a result needs to be examined by the CPU and saved to storage. I guess I will convert the long long unsigned int back to a bool and follow your suggestion - that is ultimately what I set out to do anyway :)
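
Something along these lines (a sketch; varFound is just my working name for the managed flag, and numBlocks/threadsPerBlock are placeholders):

__global__ void subKernel(bool *varFound)
{
  // ... per-thread calculation producing varResult ...
  if (varResult > 0)
    *varFound = true;   // flag only; all writers store the same value
}

// host side, per cycle:
*varFound = false;
subKernel<<<numBlocks, threadsPerBlock>>>(varFound);
cudaDeviceSynchronize();
if (*varFound)
{
  // process results, save into storage
}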

Converted it into a simple boolean and it works just fine as well:

.\delimiter.exe all-delimiters 36 17 2 3 512 2048 32 out 0x00
GPU Device 0: "GeForce GTX 1080" with compute capability 6.1

Device 0 GeForce GTX 1080 with Compute 6.1 capabilities will be used; concurrent memory access is set to: 0

GPU memory usage: used = 1.3943 GB, free = 6.6057 GB, total = 8.0000 GB

CUDA kernel launch (initialize delimiters) with 2048 blocks and 512 threads per block, thread block multiplier 32:
................................ completed

GPU memory usage: used = 1.9884 GB, free = 6.0116 GB, total = 8.0000 GB

CUDA kernel launch (delimiter calculations) with 2048 blocks and 512 threads per block, thread block multiplier 32, target iteration cycles 2048

Examined:     17146314752 sequences, cycle:    511, saved:         43, last candidate: 111111111111111111111111111111111100
Examined:     34326183936 sequences, cycle:   1023, saved:         70, last candidate: 111111111111111111111111111111111110
Examined:     51506053120 sequences, cycle:   1535, saved:         96, last candidate: 111111111111111111111111111111111101
Examined:     68685922304 sequences, cycle:   2047, saved:        127, last candidate: 111111111111111111111111111111111111
Examined:     68719476736 sequences, cycle:   last, saved:        127, last candidate: 111111111111111111111111111111111111

===================================
Last Examined Delimiter:
 --- hex: 0xffffffff0f
 --- bin: 111111111111111111111111111111111111
Next Start Delimiter:
 --- hex: 0x0000000010
 --- bin: 000000000000000000000000000000000000
Execution Statistics:
 --- blocks: 2048
 --- threads: 512
 --- thread block size: 33554432
 --- elapsed: 138 seconds
 --- delimiters examined: 68719476736
 --- delimiters / second: 497967222
 --- delimiters saved: 127
 --- delimiters saved to files:
    >>> out.DEL.36.HAM.17.DC.2.RUN.3.TH.512.BK.2048.CPUTH.1.txt
    >>> out.DEL.36.HAM.17.DC.2.RUN.3.TH.512.BK.2048.CPUTH.2.txt
    >>> out.DEL.36.HAM.17.DC.2.RUN.3.TH.512.BK.2048.CPUTH.3.txt
    >>> out.DEL.36.HAM.17.DC.2.RUN.3.TH.512.BK.2048.CPUTH.4.txt
===================================