Sharing a single counter (variable) across multiple thread blocks

Here is what I have done so far: each thread block element performs a calculation and, depending on the outcome, sometimes there are no results of interest, so the calculation results can be disregarded and do not need to be processed by the CPU and saved into storage.

To improve the routine's performance, I wanted to find a way to signal from within the CUDA kernel back to the CPU code that at least one interesting solution was found, so the results then need to be parsed by the CPU. Otherwise, the results can be skipped and the next calculation cycle can be executed. Here is the high-level pseudo-code:

__global__ void subKernel(long long unsigned int *varCounter)
{
  // do some calculations for the given thread block
  if (varResult > 0)
    (*varCounter)++; // signal to the CPU that a positive result was found in at least one thread block
}

int main()
{
  // initial setup
  for (int i = 0; i < varCycleCount; i++)
  {
    varCounter = 0;
    subKernel<<<(parameters)>>> (varCounter);
    if (varCounter > 0)
      // process results, save into storage
  }
}

Needless to say, when I try to execute something like this, the kernel fails to launch with the message: “an illegal memory access was encountered”, pointing at the line where varCounter is incremented.

varCounter is allocated in unified (managed) memory, shared between host and device:

long long unsigned int *varCounter = NULL;
cudaMallocManaged(&varCounter, sizeof(long long unsigned int));

but that does not seem to have helped. I am passing it as a pointer into the kernel and then dereferencing it to increment the value.

I was looking for similar topics/problems, but most solutions revolve around creating an array and examining each and every position outside the kernel, which relies on the CPU and slows down execution substantially. I am looking for a way to signal something back from the GPU to the CPU.

Thanks

M

Do you perform error checking on the call to cudaMallocManaged()? What GPU are you using? Have you worked through the introductory blog post and example code for unified memory use here:

[url]https://devblogs.nvidia.com/parallelforall/unified-memory-cuda-beginners/[/url]
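
For reference, a minimal error-checking pattern might look like this (a sketch; CHECK_CUDA is just an example name, not anything from the toolkit):

#include <cstdio>
#include <cstdlib>

#define CHECK_CUDA(call) do {                                 \
    cudaError_t err = (call);                                 \
    if (err != cudaSuccess) {                                 \
        fprintf(stderr, "CUDA error: %s at %s:%d\n",          \
                cudaGetErrorString(err), __FILE__, __LINE__); \
        exit(EXIT_FAILURE);                                   \
    }                                                         \
} while (0)

// usage:
CHECK_CUDA(cudaMallocManaged(&varCounter, sizeof(long long unsigned int)));
CHECK_CUDA(cudaGetLastError());        // after a kernel launch
CHECK_CUDA(cudaDeviceSynchronize());   // catches errors from kernel execution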

Error checking is done on both the malloc (no error there, it is just a single variable) and the call to subKernel (this is where the error is signaled).

I am working on a GTX 1080 at this time. I did find the atomicAdd option, but whenever I add it to the code, execution stalls, irrespective of the compilation options (I am already setting the target architecture to compute 6.1 compatible).

When you dump out the device properties, what is the value of the property concurrentManagedAccess? I am reasonably sure this must report as ‘1’ or ‘true’ for your code to work.

This might be a dumb question, but how do I “dump out device properties”?

If your claim is correct, it should require minimal effort to convert this from pseudo-code to a real, complete code example that is not much longer than what you have shown here and demonstrates the issue. If you can do that, you can probably get some help.

If you choose to do so, please indicate platform (windows/linux) and CUDA version, as well as compile command line. All of that matters for decoding expected managed memory behavior.

Use the secret powers of documentation :-)

[url]https://docs.nvidia.com/cuda/cuda-runtime-api/index.html[/url] (CUDA Runtime API :: CUDA Toolkit Documentation)
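
For example, a minimal query might look like this (sketch, error checking omitted):

#include <cstdio>

int main()
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0); // query device 0
  printf("concurrentManagedAccess = %d\n", prop.concurrentManagedAccess);
  return 0;
}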

The debugging is simple (IMO): I disable the one line in which varCounter is incremented, execute the code under full mem-check, and everything runs correctly. No issues at all. Next, I uncomment the line in which varCounter is incremented, compile, run again, and the indicated error takes place.

mem-check does not really add any clarity beyond the “illegal memory access” information. Given that it is a single variable shared across all threads, and not an array where a position needs to be calculated, I can only conclude there is some race condition where multiple threads attempt to write to the memory location at the same time. That kind of makes sense, but then what is the way out?

Windows 10, with Visual C++ 2017 (15.4.5), with CUDA (9.1/9.0 - both tested and NOT making any difference)

I will share the compile parameters when my PC unlocks - seems like I need a hard reset to get it out of the loop ;(

aha.

You should be doing:

*varCounter = 0;

not

varCounter = 0;

in your host code.

and after the kernel, you should be doing this:

subKernel<<<(parameters)>>> (varCounter);
cudaDeviceSynchronize();
if (*varCounter > 0)
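
Putting both fixes together, the per-cycle host loop would look something like this (a sketch; varCycleCount, numBlocks, and threadsPerBlock are placeholders standing in for your actual parameters):

long long unsigned int *varCounter = NULL;
cudaMallocManaged(&varCounter, sizeof(long long unsigned int));

for (int i = 0; i < varCycleCount; i++)
{
  *varCounter = 0;                                 // dereference: reset the value, not the pointer
  subKernel<<<numBlocks, threadsPerBlock>>>(varCounter);
  cudaDeviceSynchronize();                         // required before host access to managed memory
  if (*varCounter > 0)
  {
    // process results, save into storage
  }
}
cudaFree(varCounter);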

Here are promised compilation parameters

C:\Users\User\source\repos\delimiter\delimiter>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.1\bin\nvcc.exe" -gencode=arch=compute_60,code=\"sm_60,compute_60\" --use-local-env --cl-version 2017 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.11.25503\bin\HostX86\x64" -x cu  -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.1\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.1\include"     --keep-dir x64\Release -maxrregcount=0  --machine 64 --compile -cudart static     -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /FS /Zi  /MD " -o x64\Release\main.cu.obj "C:\Users\User\source\repos\delimiter\delimiter\main.cu"

As mentioned, I am forcing compute capability to 6.0 at this time. No compilation errors.

Fair enough, changes were made accordingly. I will come back once the code has run for some time to confirm whether it fixed the issue :)

Thanks! I did not connect the dots - cudaGetDeviceProperties is something I know about and already use; for some reason it just did not click :)

Having checked it, concurrentManagedAccess is set to 0 at this time. It does not seem to affect operation with the fixes made per your suggestions - below I can see that I have a non-zero number of saved candidate solutions, so the suggested fixes dealt with the problem correctly. It was dumb of me NOT to have noticed the issue beforehand :(

.\delimiter.exe all-delimiters 32 15 2 3 512 2048 1 out 0x00
GPU Device 0: "GeForce GTX 1080" with compute capability 6.1

Device 0 GeForce GTX 1080 with Compute 6.1 capabilities will be used; concurrent memory access is set to: 0

GPU memory usage: used = 1.3943 GB, free = 6.6057 GB, total = 8.0000 GB

CUDA kernel launch (initialize delimiters) with 2048 blocks and 512 threads per block, thread block multiplier 1:
. completed

GPU memory usage: used = 1.4141 GB, free = 6.5859 GB, total = 8.0000 GB

CUDA kernel launch (delimiter calculations) with 2048 blocks and 512 threads per block, thread block multiplier 1, target iteration cycles 4096

Examined:      1072693248 sequences, cycle:   1023, saved:         72, last candidate: 11111111111111111111111111111100
Examined:      2146435072 sequences, cycle:   2047, saved:        127, last candidate: 11111111111111111111111111111110
Examined:      3220176896 sequences, cycle:   3071, saved:        171, last candidate: 11111111111111111111111111111101
Examined:      4293918720 sequences, cycle:   4095, saved:        252, last candidate: 11111111111111111111111111111111
Examined:      4294967296 sequences, cycle:   last, saved:        252, last candidate: 11111111111111111111111111111111

===================================
Last Examined Delimiter:
 --- hex: 0xffffffff
 --- bin: 11111111111111111111111111111111
Next Start Delimiter:
 --- hex: 0x00000000
 --- bin: 00000000000000000000000000000000
Execution Statistics:
 --- blocks: 2048
 --- threads: 512
 --- thread block size: 1048576
 --- elapsed: 10 seconds
 --- delimiters examined: 4294967296
 --- delimiters / second: 429496729
 --- delimiters saved: 252
 --- delimiters saved to files:
    >>> out.DEL.32.HAM.15.DC.2.RUN.3.TH.512.BK.2048.CPUTH.1.txt
    >>> out.DEL.32.HAM.15.DC.2.RUN.3.TH.512.BK.2048.CPUTH.2.txt
    >>> out.DEL.32.HAM.15.DC.2.RUN.3.TH.512.BK.2048.CPUTH.3.txt
    >>> out.DEL.32.HAM.15.DC.2.RUN.3.TH.512.BK.2048.CPUTH.4.txt
===================================

I guess the topic can be closed :) Much appreciated!

M.

Concurrent managed access would be expected to be zero: CUDA 9.1 on Windows with a Pascal GPU means you will be operating in the pre-Pascal UM regime:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements

One implication of this is that concurrent managed access is not possible in this regime, therefore the property will be zero. Furthermore, in this regime, host-code access to managed memory after a kernel call must be preceded by a cudaDeviceSynchronize() call, which is why I inserted that:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-gpu-exclusive

As an aside, multi-threaded, unprotected access to a single memory location may lead to unpredictable behavior:

(*varCounter)++;

If you are expecting that variable to track the actual number of times device code attempted to increment it (i.e. to actually keep count), that is unreliable. If you are merely expecting it to end up non-zero when one or more threads touch it, that is probably safe (barring overflow). But if you do not actually need a count and only use it as a flag that one or more threads indicated a “candidate find”, then a plain store is safe and predictable:

*varCounter = 1;
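
In kernel code, with varResult standing in for your per-thread outcome as in your pseudo-code, the safe flag pattern looks like this (sketch):

__global__ void subKernel(long long unsigned int *varCounter)
{
  // ... per-thread calculation producing varResult ...
  if (varResult > 0)
  {
    *varCounter = 1;                // flag only: safe, every writer stores the same value
    // atomicAdd(varCounter, 1ULL); // alternative, if an exact count were ever needed
  }
}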

I understand the number might NOT be accurate, and that is fine. I started off with the notion of a simple boolean flag set to true whenever a result needs to be examined by the CPU and saved to storage. I guess I will convert the long long unsigned int back to a bool and follow your suggestion - that is ultimately what I set out to do anyway :)
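
Something along these lines (a sketch; varFound is just my working name for the managed flag, and numBlocks/threadsPerBlock are placeholders):

__global__ void subKernel(bool *varFound)
{
  // ... per-thread calculation producing varResult ...
  if (varResult > 0)
    *varFound = true;   // flag only; all writers store the same value
}

// host side, per cycle:
*varFound = false;
subKernel<<<numBlocks, threadsPerBlock>>>(varFound);
cudaDeviceSynchronize();
if (*varFound)
{
  // process results, save into storage
}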

Converted it into a simple boolean and it works just fine as well:

.\delimiter.exe all-delimiters 36 17 2 3 512 2048 32 out 0x00
GPU Device 0: "GeForce GTX 1080" with compute capability 6.1

Device 0 GeForce GTX 1080 with Compute 6.1 capabilities will be used; concurrent memory access is set to: 0

GPU memory usage: used = 1.3943 GB, free = 6.6057 GB, total = 8.0000 GB

CUDA kernel launch (initialize delimiters) with 2048 blocks and 512 threads per block, thread block multiplier 32:
................................ completed

GPU memory usage: used = 1.9884 GB, free = 6.0116 GB, total = 8.0000 GB

CUDA kernel launch (delimiter calculations) with 2048 blocks and 512 threads per block, thread block multiplier 32, target iteration cycles 2048

Examined:     17146314752 sequences, cycle:    511, saved:         43, last candidate: 111111111111111111111111111111111100
Examined:     34326183936 sequences, cycle:   1023, saved:         70, last candidate: 111111111111111111111111111111111110
Examined:     51506053120 sequences, cycle:   1535, saved:         96, last candidate: 111111111111111111111111111111111101
Examined:     68685922304 sequences, cycle:   2047, saved:        127, last candidate: 111111111111111111111111111111111111
Examined:     68719476736 sequences, cycle:   last, saved:        127, last candidate: 111111111111111111111111111111111111

===================================
Last Examined Delimiter:
 --- hex: 0xffffffff0f
 --- bin: 111111111111111111111111111111111111
Next Start Delimiter:
 --- hex: 0x0000000010
 --- bin: 000000000000000000000000000000000000
Execution Statistics:
 --- blocks: 2048
 --- threads: 512
 --- thread block size: 33554432
 --- elapsed: 138 seconds
 --- delimiters examined: 68719476736
 --- delimiters / second: 497967222
 --- delimiters saved: 127
 --- delimiters saved to files:
    >>> out.DEL.36.HAM.17.DC.2.RUN.3.TH.512.BK.2048.CPUTH.1.txt
    >>> out.DEL.36.HAM.17.DC.2.RUN.3.TH.512.BK.2048.CPUTH.2.txt
    >>> out.DEL.36.HAM.17.DC.2.RUN.3.TH.512.BK.2048.CPUTH.3.txt
    >>> out.DEL.36.HAM.17.DC.2.RUN.3.TH.512.BK.2048.CPUTH.4.txt
===================================