[Solved] How to update host memory variable from device during OpenCL kernel execution

I would like to print a progress bar for my OpenCL code during the kernel execution. My CUDA equivalent of this code was able to achieve this using pinned memory, I was trying to implement the same using CL_MEM_ALLOC_HOST_PTR and clEnqueueMapBuffer, but the result is quite strange.

here is a snippet of the relevant code. This code works as expected on AMD GPUs using amdgpu-pro driver; a progress bar was printed and updated; but it failed to work on NVIDIA OpenCL.

void host_function(){  
  
  
     cl_uint *progress=NULL;  
     cl_mem *gprogress;  
  
  
     gprogress=(cl_mem *)malloc(1*sizeof(cl_mem));  
       
     // define a host_ptr buffer, alloc in the pinned memory  
  
  
     OCL_ASSERT(((gprogress[0]=clCreateBuffer(mcxcontext,(CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR), sizeof(cl_uint),NULL,&status),status)));  
  
  
     // initialize the pinned memory buffer  
     progress = (cl_uint *)clEnqueueMapBuffer(mcxqueue[0], gprogress[0], CL_TRUE, CL_MAP_WRITE, 0, sizeof(cl_uint), 0, NULL, NULL, NULL);  
     *progress=0;  
     clEnqueueUnmapMemObject(mcxqueue[0], gprogress[0], progress, 0, NULL, NULL);  
  
  
     OCL_ASSERT((clSetKernelArg(mcxkernel[i],10, sizeof(cl_mem), (void*)(gprogress))));  
  
  
     // launch kernel  
     OCL_ASSERT((clEnqueueNDRangeKernel(mcxqueue[devid],mcxkernel[devid],1,NULL,&gpu[devid].autothread,&gpu[devid].autoblock, 0, NULL, NULL)));  
  
  
     if((param.debuglevel & MCX_DEBUG_PROGRESS)){  
             // after launching the kernel, check progress by reading gprogress[0]  
  
  
             progress = (cl_uint *)clEnqueueMapBuffer(mcxqueue[0], gprogress[0], CL_FALSE, CL_MAP_READ, 0, sizeof(cl_uint), 0, NULL, NULL, NULL);  
             do{  
                 ndone = *progress;  
                 MCX_FPRINTF(cfg->flog,"progress=%d\n",ndone);  
             }while (ndone < maxcount);  
             clEnqueueUnmapMemObject(mcxqueue[0], gprogress[0], progress, 0, NULL, NULL);  
     }  
       
     OCL_ASSERT((clFinish(mcxqueue[devid])));  
}

inside the kernel, I incremented gprogress[0]. I was hoping that do/while loop could read out the updated value to progress, and print out during kernel execution.

However, what I see is that it keeps printing progress=0, until the kernel is completed, it jumps to the final number.

can someone tell me if this is the correct way to implement a progress bar in OpenCL? how can I make it work?

thanks

a sample output log is attached below. I have CUDA 9 with a Titan V GPU. I run this on Ubuntu 14.04.

fangq@taote:~/space/git/Project/github/mcxcl/example/benchmark$ ./run_benchmark1.sh -D P
...
==============================================================================
- code name: [Vanilla MCXCL] compiled with OpenCL version [1]
- compiled with: [RNG] Logistic-Lattice [Seed Length] 5
initializing streams ...	init complete : 0 ms
Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SIMPLIFY_BRANCH -DMCX_VECTOR_INDEX -DMCX_SRC_PENCIL  -DUSE_ATOMIC
build program complete : 25 ms
- [device 0(1): Graphics Device] threadph=152 oddphotons=385280 np=100000000.0 nthread=655360 nblock=64 repetition=1
set kernel arguments complete : 25 ms
lauching mcx_main_loop for time window [0.0ns 5.0ns] ...
simulation run# 1 ... 
Progress: [>                                                                                                                                                                                               ]   0%
progress=0
progress=0
progress=0
progress=0
progress=0
<repeated many times>
progress=0
progress=0
progress=0
progress=0
progress=0
progress=230
Progress: [=========================================================================================] 100%
kernel complete:  	4539 ms
retrieving flux ... 	transfer complete:        4539 ms
normalizing raw data ...	normalization factor alpha=2.000000
saving data to file ... 216000 1	saving data complete : 4542 ms

simulated 100000000 photons (100000000) with 1 devices (repeat x1)
MCX simulation speed: 22153.30 photon/ms
total simulated energy: 100000000.00	absorbed: 17.69290%
(loss due to initial specular reflection is excluded in the total)

wondering if anyone can comment on this issue? I still haven’t found a solution. my cuda version works fine for this feature.

I’m not an OpenCL expert, but this strikes me as odd:

CL_MEM_READ_ONLY

When I look up the definition for that:

https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clCreateBuffer.html

I find:

"CL_MEM_READ_ONLY
This flag specifies that the memory object is a read-only memory object when used inside a kernel.

Writing to a buffer or image object created with CL_MEM_READ_ONLY inside a kernel is undefined."

If I were attempting this, I can’t imagine why I would specify that. I’m not suggesting this is the issue, or that I’ve tested it, or anything like that. I haven’t tested it. It may be irrelevant. I don’t have a complete code to work with at the moment. Just speculating.

If you want to provide a self-contained test case, of less than 100 lines of code, posted right here in the forum and not via an external link, with command-line compilation instructions, I will take a look as time permits.

To be clear, I have no idea if this is workable at all with NVIDIA OpenCL.

@Robert, I had tried several combinations of the buffer flags, see

https://github.com/fangq/mcxcl/commit/ad29b2a5c8fea2933ada6c33a47d0e654aa485f5#diff-8562c9c7f2925a6eaf599db309acade0L30

and the one used in the above example code was the only one that worked on the AMD GPUs (never worked on nvidia GPUs). Just tested, still works on AMD GPUs (as well as Intel GPU).

I just changed it to “CL_MEM_READ_WRITE|CL_MEM_ALLOC_HOST_PTR”, it does not seem to change anything - the progress bar is only printed after the kernel is completed (i.e. jumping from 0% to 100%).

The host memory, progress, was mapped to the pinned gpu buffer gprogress, using clEnqueueMapBuffer(CL_MAP_WRITE) before the kernel to reset the value, and then use clEnqueueMapBuffer(CL_MAP_READ) to read during the execution. I got these from several sample scripts online.

Here are the debug logs for running the sample code on 3 GPUs and 1 CPU (all Linux host - Ubuntu 18.04).

Among the 4 devices, the Intel GPU was the most responsive one (i.e. the updates to the pinned memory was updated in the host timely); the AMD GPU reported sparse updates, still somewhat useful. The Intel CPU only did 1 update in the middle, despite the sufficiently long run-time. NVIDIA GPUs are consistently printing the progress only after the kernel returns.

on NVIDIA GPU (Titan V)

fangq@taote:~/space/git/Project/github/mcxcl/example/benchmark$ ./run_benchmark1.sh -n 1e8 -D P -G 1
==============================================================================
=                       Monte Carlo eXtreme (MCX) -- OpenCL                  =
...
==============================================================================
- variant name: [Detective MCXCL] compiled with OpenCL version [1]
- compiled with: [RNG] xoroshiro128+ [Seed Length] 4
initializing streams ...	init complete : 0 ms
Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SRC_PENCIL  -DUSE_ATOMIC -DMCX_SAVE_DETECTORS -DINTERNAL_SOURCE
build program complete : 21 ms
- [device 0(1): TITAN V] threadph=305 oddphoton=57600 np=100000000.0 nthread=327680 nblock=64 sharedbuf=1536
set kernel arguments complete : 21 ms
lauching mcx_main_loop for time window [0.0ns 5.0ns] ...
simulation run# 1 ... 
Progress: [>                                                                                                                                                                                                                                 ]   0%progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=0
progress=459
Progress: [==================================================================================================================================================================================================================================] 100%
kernel complete:  	2932 ms
retrieving flux ... 	detected 299304 photons, total: 299304	transfer complete:        2937 ms
normalizing raw data ...	source 1, normalization factor alpha=2.000000
saving data to file ... 216000 1	saving data complete : 2940 ms

simulated 100000000 photons (100000000) with 1 devices (repeat x1)
MCX simulation speed: 34352.46 photon/ms
total simulated energy: 100000000.00	absorbed: 17.69119%
(loss due to initial specular reflection is excluded in the total)

on AMD GPU (vega10)

fangq@pangu:~/space/git/Project/github/mcxcl/example/benchmark$ ./run_benchmark1.sh -D P 
==============================================================================
=                       Monte Carlo eXtreme (MCX) -- OpenCL                  =
...
==============================================================================
- variant name: [Detective MCXCL] compiled with OpenCL version [1]
- compiled with: [RNG] xoroshiro128+ [Seed Length] 4
initializing streams ...	init complete : 0 ms
Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SRC_PENCIL  -DUSE_ATOMIC -DMCX_SAVE_DETECTORS -DINTERNAL_SOURCE
build program complete : 880 ms
- [device 0(1): Vega 10 XT [Radeon RX Vega 64]] threadph=610 oddphoton=57600 np=100000000.0 nthread=163840 nblock=64 sharedbuf=1536
set kernel arguments complete : 880 ms
lauching mcx_main_loop for time window [0.0ns 5.0ns] ...
simulation run# 1 ... 
Progress: [>                                                                                                                                                                                                                                 ]   0%progress=0
progress=18
Progress: [====>                                                                                                                                                                                                                             ]   2%progress=18
progress=18
progress=18
progress=166
Progress: [==========================================>                                                                                                                                                                                       ]  18%progress=166
progress=166
progress=306
Progress: [=============================================================================>                                                                                                                                                    ]  34%progress=306
progress=306
progress=306
progress=306
progress=306
progress=306
progress=306
progress=306
progress=317
Progress: [================================================================================>                                                                                                                                                 ]  35%progress=329
Progress: [===================================================================================>                                                                                                                                              ]  37%progress=329
progress=329
progress=434
Progress: [==============================================================================================================>                                                                                                                   ]  49%progress=499
Progress: [==============================================================================================================================>                                                                                                   ]  56%progress=499
progress=499
progress=499
progress=611
Progress: [===========================================================================================================================================================>                                                                      ]  69%progress=611
progress=611
progress=611
progress=611
progress=611
progress=611
progress=611
progress=611
progress=611
progress=627
Progress: [===============================================================================================================================================================>                                                                  ]  70%progress=660
Progress: [=======================================================================================================================================================================>                                                          ]  74%progress=723
Progress: [=======================================================================================================================================================================================>                                          ]  81%progress=723
progress=806
Progress: [=============================================================================================================================================================================================================>                    ]  91%progress=816
Progress: [===============================================================================================================================================================================================================>                  ]  92%progress=816
progress=816
progress=916
Progress: [==================================================================================================================================================================================================================================] 100%
kernel complete:  	5392 ms
retrieving flux ... 	detected 299965 photons, total: 299965	transfer complete:        5407 ms
normalizing raw data ...	source 1, normalization factor alpha=2.000000
saving data to file ... 216000 1	saving data complete : 5419 ms

simulated 100000000 photons (100000000) with 1 devices (repeat x1)
MCX simulation speed: 22163.12 photon/ms
total simulated energy: 100000000.00	absorbed: 17.69396%
(loss due to initial specular reflection is excluded in the total)

on Intel GPU (HD 620)

fangq@taote:~/space/git/Project/github/mcxcl/example/benchmark$ ./run_benchmark1.sh -n 1e7 -D P -G 3
==============================================================================
=                       Monte Carlo eXtreme (MCX) -- OpenCL                  =
...
==============================================================================
- variant name: [Detective MCXCL] compiled with OpenCL version [1]
- compiled with: [RNG] xoroshiro128+ [Seed Length] 4
initializing streams ...	init complete : 0 ms
Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SRC_PENCIL  -DUSE_ATOMIC -DMCX_SAVE_DETECTORS -DINTERNAL_SOURCE
build program complete : 376 ms
- [device 0(3): Intel(R) Gen9 HD Graphics NEO] threadph=930 oddphoton=640 np=10000000.0 nthread=10752 nblock=64 sharedbuf=1536
set kernel arguments complete : 376 ms
lauching mcx_main_loop for time window [0.0ns 5.0ns] ...
simulation run# 1 ... 
Progress: [>                                                                                                                                                                                                                                 ]   0%progress=0
progress=97
Progress: [================>                                                                                                                                                                                                                 ]   7%progress=169
Progress: [============================>                                                                                                                                                                                                     ]  12%progress=257
Progress: [==========================================>                                                                                                                                                                                       ]  19%progress=353
Progress: [==========================================================>                                                                                                                                                                       ]  26%progress=466
Progress: [=============================================================================>                                                                                                                                                    ]  34%progress=466
progress=466
progress=466
progress=466
progress=466
progress=466
progress=466
progress=466
progress=466
progress=466
progress=466
progress=466
progress=466
progress=466
progress=466
progress=466
progress=466
progress=486
Progress: [=================================================================================>                                                                                                                                                ]  36%progress=571
Progress: [===============================================================================================>                                                                                                                                  ]  42%progress=703
Progress: [=====================================================================================================================>                                                                                                            ]  52%progress=789
Progress: [===================================================================================================================================>                                                                                              ]  58%progress=882
Progress: [===================================================================================================================================================>                                                                              ]  65%progress=931
Progress: [===========================================================================================================================================================>                                                                      ]  69%progress=931
progress=931
progress=931
progress=931
progress=931
progress=931
progress=931
progress=931
progress=931
progress=931
progress=931
progress=931
progress=931
progress=931
progress=931
progress=965
Progress: [=================================================================================================================================================================>                                                                ]  71%progress=1054
Progress: [===============================================================================================================================================================================>                                                  ]  78%progress=1128
Progress: [============================================================================================================================================================================================>                                     ]  83%progress=1219
Progress: [===========================================================================================================================================================================================================>                      ]  90%progress=1313
Progress: [===========================================================================================================================================================================================================================>      ]  97%progress=1385
Progress: [==================================================================================================================================================================================================================================] 100%
kernel complete:  	5383 ms
retrieving flux ... 	detected 26383 photons, total: 26383	transfer complete:        5387 ms
normalizing raw data ...	source 1, normalization factor alpha=20.000000
saving data to file ... 216000 1	saving data complete : 5388 ms

simulated 10000000 photons (10000000) with 1 devices (repeat x1)
MCX simulation speed: 1997.20 photon/ms
total simulated energy: 10000000.00	absorbed: 17.68671%
(loss due to initial specular reflection is excluded in the total)

on Intel CPU (i7-7700k)

fangq@taote:~/space/git/Project/github/mcxcl/example/benchmark$ ./run_benchmark1.sh -n 1e6 -D P -G 4
==============================================================================
=                       Monte Carlo eXtreme (MCX) -- OpenCL                  =
...
==============================================================================
- variant name: [Detective MCXCL] compiled with OpenCL version [1]
- compiled with: [RNG] xoroshiro128+ [Seed Length] 4
initializing streams ...	init complete : 0 ms
Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SRC_PENCIL  -DUSE_ATOMIC -DMCX_SAVE_DETECTORS -DINTERNAL_SOURCE
Kernel build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <mcx_main_loop> was not vectorized
Done.
build program complete : 379 ms
- [device 0(4): Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz] threadph=1953 oddphoton=64 np=1000000.0 nthread=512 nblock=64 sharedbuf=1536
set kernel arguments complete : 381 ms
lauching mcx_main_loop for time window [0.0ns 5.0ns] ...
simulation run# 1 ... 
Progress: [>                                                                                                                                                                                                                                 ]   0%progress=0
progress=1953
Progress: [===========================================================================================================================================================>                                                                      ]  68%progress=1953
progress=1953
progress=1953
progress=1953
progress=1953
progress=1953
progress=1953
progress=1953
progress=1953
progress=1953
progress=1953
progress=1953
progress=1953
progress=1953
progress=1953
progress=1953
progress=1953
progress=1953
progress=1953
progress=1953
progress=1953
progress=1953
progress=1953
progress=2930
Progress: [==================================================================================================================================================================================================================================] 100%
kernel complete:  	4913 ms
retrieving flux ... 	detected 3070 photons, total: 3070	transfer complete:        4916 ms
normalizing raw data ...	source 1, normalization factor alpha=200.000000
saving data to file ... 216000 1	saving data complete : 4917 ms

simulated 1000000 photons (1000000) with 1 devices (repeat x1)
MCX simulation speed: 220.65 photon/ms
total simulated energy: 1000000.00	absorbed: 17.73232%
(loss due to initial specular reflection is excluded in the total)

This code seems to work for me:

$ cat t7.cpp
#include <CL/opencl.h>
#include <stdio.h>
#include <stdlib.h>

#define TILE_WIDTH 16
#define DS 16384

const char source[] =
"__kernel void matrix_multiply(__global float *A, __global float *B,"
" __global float *C, volatile  __global int *p, int width)"
"{"
"     __local float Ashare[TILE_WIDTH][TILE_WIDTH];"
"     __local float Bshare[TILE_WIDTH][TILE_WIDTH];"
"   int bx = get_group_id(0);"
"   int by = get_group_id(1);"
"   int tx = get_local_id(0);"
"   int ty = get_local_id(1);"
"   int row = by * TILE_WIDTH + ty;"
"   int col = bx * TILE_WIDTH + tx;"
"   float result = 0;"
"   for (int m = 0; m < width / TILE_WIDTH; m++) {"
"     Ashare[ty][tx] = A[(row * width) + (m * TILE_WIDTH) + tx];"
"     Bshare[ty][tx] = B[(((m * TILE_WIDTH) + ty) * width) + col];"
"     barrier(CLK_LOCAL_MEM_FENCE); "
"       for (int k = 0; k < TILE_WIDTH; k++) {"
"         result += Ashare[ty][k] * Bshare[k][tx];"
"       }"
"     barrier(CLK_LOCAL_MEM_FENCE); "
"   }"
"   atomic_add(p, 1);"
"   mem_fence(CLK_GLOBAL_MEM_FENCE);"
"   C[(row * width) + col] = result;"
" };"

;

int main(int argc, char *argv[])
{
  cl_platform_id platform;
  cl_device_id device;
  cl_context context;
  cl_command_queue queue1, queue2;
  cl_program program;
  cl_mem mem1, mem2, mem3, mem4;
  cl_kernel kernel;
  cl_int err;

  err = clGetPlatformIDs(1, &platform, NULL);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
  queue1 = clCreateCommandQueue(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, NULL);

  const char *sources[1] = {source};
  program = clCreateProgramWithSource(context, 1, sources, NULL, &err);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clBuildProgram(program, 1, &device, "-D TILE_WIDTH=16", NULL, NULL);
  if (err == CL_BUILD_PROGRAM_FAILURE) {
    // Determine the size of the log
    size_t log_size;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);

// Allocate memory for the log
    char *log = (char *) malloc(log_size);

    // Get the log
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);

    // Print the log
    printf("%s\n", log);
  }

  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  mem1 = clCreateBuffer(context, CL_MEM_READ_WRITE, DS*DS*sizeof(float), NULL, NULL);
  mem2 = clCreateBuffer(context, CL_MEM_READ_WRITE, DS*DS*sizeof(float), NULL, NULL);
  mem3 = clCreateBuffer(context, CL_MEM_READ_WRITE, DS*DS*sizeof(float), NULL, NULL);
  cl_mem (*clCreateBufferNV)(cl_context,cl_mem_flags, cl_mem_flags_NV, size_t, void*, cl_int*) = (cl_mem (*)(cl_context,cl_mem_flags, cl_mem_flags_NV, size_t, void*, cl_int*)) clGetExtensionFunctionAddressForPlatform(platform, "clCreateBufferNV");
  if (clCreateBufferNV == NULL) {printf("invalid function pointer request\n"); return -1;}

  mem4 = clCreateBufferNV(context, CL_MEM_READ_WRITE , CL_MEM_LOCATION_HOST_NV, (size_t)4, NULL, &err);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}

  float *hdata = new float[DS*DS];
  for (int i = 0; i < DS*DS; i++) hdata[i] = 1;
  kernel = clCreateKernel(program, "matrix_multiply", &err);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  const size_t gwork_size[2] = {DS,DS};
  const size_t lwork_size[2] = {TILE_WIDTH,TILE_WIDTH};
  int msize = DS;
  void *progress = clEnqueueMapBuffer(queue1, mem4, CL_TRUE, CL_MAP_READ|CL_MAP_WRITE, 0, 4, 0, NULL, NULL, &err);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  volatile int *iprogress = (volatile int *)progress;
  iprogress[0] = 0;
  err = clSetKernelArg(kernel, 0, sizeof(mem1), &mem1);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clSetKernelArg(kernel, 1, sizeof(mem2), &mem2);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clSetKernelArg(kernel, 2, sizeof(mem3), &mem3);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clSetKernelArg(kernel, 3, sizeof(mem4), &mem4);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clSetKernelArg(kernel, 4, sizeof(msize), &msize);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clEnqueueWriteBuffer(queue1, mem1, CL_TRUE, 0, DS*DS*sizeof(float), hdata, 0, NULL, NULL);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clEnqueueWriteBuffer(queue1, mem2, CL_TRUE, 0, DS*DS*sizeof(float), hdata, 0, NULL, NULL);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clEnqueueNDRangeKernel(queue1, kernel, 2, NULL, gwork_size, lwork_size, 0, NULL, NULL);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  for (int i = 0; i < 1000000; i++) printf("%d,%d\n", i, iprogress[0]);
  err = clEnqueueReadBuffer(queue1, mem3, CL_TRUE, 0, DS*DS*sizeof(float), hdata, 0, NULL, NULL);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  for (int i = 0; i < DS*DS; i++)
    if (hdata[i] != DS) {printf("error at %d, was %f, should be %f\n", i, hdata[i], (float)DS); return 1;}
  printf("success!\n");
  return 0;
}
$ nvcc -o t7 t7.cpp -lOpenCL
$ ./t7
0,0
1,0
2,0
3,0
4,0
5,0
6,0
7,0
8,0
9,0
10,0
11,0
12,0
13,0
14,0
15,0
16,0
17,0
18,0
19,0
20,0
21,0
22,0
23,0
24,0
25,0
26,0
27,0
28,0
29,0
30,0
31,0
32,0
33,0
34,0
35,0
36,0
37,0
38,0
39,0
40,0
41,0
42,0
43,0
44,0
45,0
46,0
47,0
48,0
49,0
50,0
51,0
52,0
53,0
54,0
55,0
56,0
57,0
58,0
59,0
60,0
61,0
62,0
63,0
64,0
65,0
66,0
67,0
68,0
69,0
70,0
71,0
72,0
73,0
74,0
75,0
76,0
77,0
78,0
79,0
80,0
81,0
82,0
83,0
84,0
85,0
86,0
87,0
88,0
89,0
90,0
91,0
92,0
93,0
94,0
95,0
96,0
97,0
98,0
99,0
100,0
101,0
102,0
103,0
104,0
105,0
106,0
107,0
108,0
109,0
110,0
111,0
112,0
113,0
114,0
115,0
116,0
117,0
118,0
119,0
120,0
121,0
122,0
123,0
124,0
125,0
126,0
127,0
128,0
129,0
130,0
131,0
132,0
133,0
134,0
135,0
136,0
137,0
138,0
139,0
140,0
141,0
142,0
143,0
144,0
145,0
146,0
147,0
148,0
149,0
150,0
151,0
152,0
153,0
154,0
155,0
156,0
157,0
158,0
159,0
160,0
161,0
162,0
163,0
164,0
165,0
166,0
167,0
168,0
169,0
170,0
171,0
172,0
173,0
174,0
175,0
176,0
177,0
178,0
179,0
180,0
181,0
182,0
183,0
184,0
185,0
186,0
187,0
188,0
189,0
190,0
191,0
192,0
193,0
194,0
195,0
196,0
197,0
198,0
199,0
200,0
201,0
202,0
203,0
204,0
205,0
206,0
207,0
208,0
209,0
210,0
211,0
212,0
213,0
214,0
215,0
216,0
217,0
218,0
219,0
220,0
221,0
222,0
223,0
224,0
225,0
226,0
227,0
228,0
229,0
230,0
231,0
232,0
233,0
234,0
235,0
236,0
237,0
238,0
239,0
240,0
241,0
242,0
243,0
244,0
245,0
246,0
247,0
248,0
249,0
250,0
251,0
252,0
253,0
254,0
255,0
256,0
257,0
258,0
259,0
260,0
261,0
262,0
263,0
264,0
265,0
266,0
267,0
268,0
269,0
270,0
271,0
272,0
273,0
274,0
275,0
276,0
277,0
278,0
279,0
280,0
281,0
282,0
283,0
284,0
285,0
286,0
287,0
288,0
289,0
290,0
291,0
292,0
293,0
294,0
295,0
296,0
297,0
298,0
299,0
300,0
301,0
302,0
303,0
304,0
305,0
306,0
307,0
308,0
309,0
310,0
311,0
312,0
313,0
314,0
315,0
316,0
317,0
318,0
319,0
320,0
321,0
322,0
323,0
324,0
325,0
326,0
327,0
328,0
329,0
330,0
331,0
332,0
333,0
334,0
335,0
336,0
337,0
338,0
339,0
340,0
341,0
342,0
343,0
344,0
345,0
346,0
347,0
348,0
349,32
350,160
351,256
352,384
353,512
354,640
355,736
356,864
357,1056
358,1376
359,1472
360,1728
361,1856
362,1984
363,2080
364,2208
365,2304
366,2432
367,2528
368,2656
369,2752
370,2880
371,3008
372,3456
373,3584
374,3776
375,3904
376,4032
377,4128
378,4256
379,4384
380,4480
381,4736
382,4864
383,4992
384,5120
385,5248
386,5344
387,5472
388,5824
389,5952
390,6080
391,6304
392,6432
393,6560
394,6656
395,6880
396,7008
397,7136
398,7232
399,7360
400,7456
401,7584
402,7712
403,8096
404,8224
405,8384
406,8608
407,8736
... (output truncated)
999990,268435456
999991,268435456
999992,268435456
999993,268435456
999994,268435456
999995,268435456
999996,268435456
999997,268435456
999998,268435456
999999,268435456
success!
$

The final numerical output of the zero-copy variable iprogress[0] matches what I would expect for the matrix dimension. (16384*16384 = 268435456)

The nv extension function, as well as some hint of the rationale for it, is described here:

http://on-demand.gputechconf.com/gtc/2018/presentation/s8837-opencl-nvidia-recent-improvements-future-plans.pdf

CUDA 10.1, Tesla V100, CentOS 7, 418.67

Here is a somewhat improved version:

$ cat t7.cpp
#include <CL/opencl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define TILE_WIDTH 16
#define DS 16384
const int sleep_delay_us = 100000;
const char source[] =
"__kernel void matrix_multiply(__global float *A, __global float *B,"
" __global float *C, volatile  __global int *p, int width)"
"{"
"     __local float Ashare[TILE_WIDTH][TILE_WIDTH];"
"     __local float Bshare[TILE_WIDTH][TILE_WIDTH];"
"   int bx = get_group_id(0);"
"   int by = get_group_id(1);"
"   int tx = get_local_id(0);"
"   int ty = get_local_id(1);"
"   int row = by * TILE_WIDTH + ty;"
"   int col = bx * TILE_WIDTH + tx;"
"   float result = 0;"
"   for (int m = 0; m < width / TILE_WIDTH; m++) {"
"     Ashare[ty][tx] = A[(row * width) + (m * TILE_WIDTH) + tx];"
"     Bshare[ty][tx] = B[(((m * TILE_WIDTH) + ty) * width) + col];"
"     barrier(CLK_LOCAL_MEM_FENCE); "
"       for (int k = 0; k < TILE_WIDTH; k++) {"
"         result += Ashare[ty][k] * Bshare[k][tx];"
"       }"
"     barrier(CLK_LOCAL_MEM_FENCE); "
"   }"
"   atomic_add(p, 1);"
"   mem_fence(CLK_GLOBAL_MEM_FENCE);"
"   C[(row * width) + col] = result;"
" };"

;

int main(int argc, char *argv[])
{
  cl_platform_id platform;
  cl_device_id device;
  cl_context context;
  cl_command_queue queue1, queue2;
  cl_program program;
  cl_mem mem1, mem2, mem3, mem4;
  cl_kernel kernel;
  cl_int err;

  err = clGetPlatformIDs(1, &platform, NULL);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  queue1 = clCreateCommandQueue(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  const char *sources[1] = {source};
  program = clCreateProgramWithSource(context, 1, sources, NULL, &err);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clBuildProgram(program, 1, &device, "-DTILE_WIDTH=16", NULL, NULL);
  if (err == CL_BUILD_PROGRAM_FAILURE) {
    // Determine the size of the log
    size_t log_size;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);


    // Allocate memory for the log
    char *log = (char *) malloc(log_size);

    // Get the log
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);

    // Print the log
    printf("%s\n", log);
  }

  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  mem1 = clCreateBuffer(context, CL_MEM_READ_WRITE, DS*DS*sizeof(float), NULL, &err);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  mem2 = clCreateBuffer(context, CL_MEM_READ_WRITE, DS*DS*sizeof(float), NULL, &err);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  mem3 = clCreateBuffer(context, CL_MEM_READ_WRITE, DS*DS*sizeof(float), NULL, &err);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  cl_mem (*clCreateBufferNV)(cl_context,cl_mem_flags, cl_mem_flags_NV, size_t, void*, cl_int*) = (cl_mem (*)(cl_context,cl_mem_flags, cl_mem_flags_NV, size_t, void*, cl_int*)) clGetExtensionFunctionAddressForPlatform(platform, "clCreateBufferNV");
  if (clCreateBufferNV == NULL) {printf("invalid function pointer request\n"); return -1;}
  mem4 = clCreateBufferNV(context, CL_MEM_READ_WRITE , CL_MEM_LOCATION_HOST_NV, (size_t)4, NULL, &err);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  float *hdata = new float[DS*DS];
  for (int i = 0; i < DS*DS; i++) hdata[i] = 1;
  kernel = clCreateKernel(program, "matrix_multiply", &err);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  const size_t gwork_size[2] = {DS,DS};
  const size_t lwork_size[2] = {TILE_WIDTH,TILE_WIDTH};
  int msize = DS;
  void *progress = clEnqueueMapBuffer(queue1, mem4, CL_TRUE, CL_MAP_READ|CL_MAP_WRITE, 0, 4, 0, NULL, NULL, &err);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  volatile int *iprogress = (int *)progress;
  iprogress[0] = 0;
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clSetKernelArg(kernel, 0, sizeof(mem1), &mem1);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clSetKernelArg(kernel, 1, sizeof(mem2), &mem2);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clSetKernelArg(kernel, 2, sizeof(mem3), &mem3);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clSetKernelArg(kernel, 3, sizeof(mem4), &mem4);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clSetKernelArg(kernel, 4, sizeof(msize), &msize);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clEnqueueWriteBuffer(queue1, mem1, CL_TRUE, 0, DS*DS*sizeof(float), hdata, 0, NULL, NULL);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clEnqueueWriteBuffer(queue1, mem2, CL_TRUE, 0, DS*DS*sizeof(float), hdata, 0, NULL, NULL);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clEnqueueNDRangeKernel(queue1, kernel, 2, NULL, gwork_size, lwork_size, 0, NULL, NULL);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  int my_progress = iprogress[0];
  while (my_progress < DS*DS){
    printf("progress: %f%\n", (100.0 * my_progress)/(DS*DS));
    usleep(sleep_delay_us);
    my_progress = iprogress[0];
    }
  err = clEnqueueReadBuffer(queue1, mem3, CL_TRUE, 0, DS*DS*sizeof(float), hdata, 0, NULL, NULL);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  for (int i = 0; i < DS*DS; i++)
    if (hdata[i] != DS) {printf("error at %d, was %f, should be %f\n", i, hdata[i], (float)DS); return 1;}
  printf("success!\n");
  return 0;
}
$ nvcc -o t7 t7.cpp -lOpenCL
$ ./t7
progress: 0.000000%
progress: 1.019764%
progress: 2.118003%
progress: 3.229284%
progress: 4.349995%
progress: 5.467999%
progress: 6.588578%
progress: 7.711983%
progress: 8.835554%
progress: 9.959292%
progress: 11.082935%
progress: 12.206590%
progress: 13.329780%
progress: 14.453077%
progress: 15.575981%
progress: 16.696775%
progress: 17.820740%
progress: 18.945205%
progress: 20.066702%
progress: 21.191645%
progress: 22.313595%
progress: 23.437488%
progress: 24.560642%
progress: 25.682950%
progress: 26.804721%
progress: 27.926970%
progress: 29.048944%
progress: 30.170643%
progress: 31.293011%
progress: 32.416499%
progress: 33.539093%
progress: 34.661484%
progress: 35.788572%
progress: 36.909640%
progress: 38.032901%
progress: 39.154053%
progress: 40.273190%
progress: 41.396427%
progress: 42.517865%
progress: 43.637753%
progress: 44.760513%
progress: 45.883942%
progress: 47.005963%
progress: 48.128045%
progress: 49.250865%
progress: 50.372541%
progress: 51.494408%
progress: 52.615702%
progress: 53.737748%
progress: 54.862452%
progress: 55.986857%
progress: 57.110488%
progress: 58.234537%
progress: 59.357834%
progress: 60.481644%
progress: 61.605811%
progress: 62.731361%
progress: 63.854420%
progress: 64.976835%
progress: 66.099250%
progress: 67.222333%
progress: 68.347764%
progress: 69.469392%
progress: 70.593882%
progress: 71.717751%
progress: 72.839820%
progress: 73.964739%
progress: 75.089264%
progress: 76.217365%
progress: 77.340806%
progress: 78.462887%
progress: 79.586887%
progress: 80.709970%
progress: 81.833220%
progress: 82.957995%
progress: 84.081352%
progress: 85.204720%
progress: 86.326563%
progress: 87.448740%
progress: 88.572216%
progress: 89.695275%
progress: 90.819287%
progress: 91.939759%
progress: 93.063176%
progress: 94.187069%
progress: 95.312214%
progress: 96.434307%
progress: 97.557795%
progress: 98.681629%
progress: 99.804878%
success!
$

fixed! thank you Robert! the progress bar is now printed as expected. looks like moving pinned-mem from the device side to host side made the difference

https://github.com/fangq/mcxcl/commit/2894ea10032777bb01fba164d2c41ad484f0766c

One follow up question - is there a way I can do a non-blocking (or asynchronously?) test on whether the kernel has completed?

I noticed that sometimes a premature kernel completion (due to error or input data) can cause my do-while loop (shown below) to hang

https://github.com/fangq/mcxcl/blob/mcx20197/src/mcx_host.cpp#L686-L693

it would be nice to check if the kernel was actually completed inside the loop and break after. I think I previously use events for my code, but not sure if something similar in opencl.

thanks

opencl has events. that’s what comes to mind for me anyway.

https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clEnqueueNDRangeKernel.html
https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clGetEventInfo.html

hi Robert, I tested the clCreateBufferNV approach you suggested, and it worked well on all Linux hosts, but failed on one windows host with an NVIDIA 1050Ti GPU.

The error I got was

“Error 30: Invalid Value” at the below line

https://github.com/fangq/mcxcl/blob/mcx20197/src/mcx_host.cpp#L475

the NV_PIN macro is the same as CL_MEM_LOCATION_HOST_NV, defined here

https://github.com/fangq/mcxcl/blob/mcx20197/src/mcx_host.hpp#L47

The CUDA version of the software runs fine on this GPU. I have CUDA 8 installed, but it is not supposed to be necessary to run both codes. The opencl code runs fine on the CPU (Ryzen 1700X).

any thing you might suggest to fix this error? thanks again

Update driver to latest for that GPU.

And I don’t expect these sorts of concurrency scenarios to work well on a GPU in WDDM mode.

ok, now I made more observations to this issue.

first, the Error#30 invalid value to the clCreateBufferNV function is definitely related to driver versions. For the windows machine, upgrading driver to 431.xx stopped the error.

I also noticed such error appears on my Linux machines, for those running on driver 38x, 39x, they all got this error, those with 415, 418 drivers are ok.

But I still don’t fully understand why this is happening. if the driver does not support clCreateBufferNV, then the clGetExtensionFunctionAddress call should have returned a NULL. Why it returns a valid function, but it does not support this flag? is there a way for me to tell if the returned “clCreateBufferNV” function supports the HOST_PIN flag?

On a separate note, I found another similar problem - my CUDA code used to print the progress bar just fine on both Windows and Linux, at least when we tested/demoed the windows version 1 year ago, but now, it hangs both of my tested Windows machines - one is on a driver 417.xx, the other is 431.x. Here is my code, very similar to the OpenCL version

https://github.com/fangq/mcx/blob/v2019.4/src/mcx_core.cu#L2080-L2090

is this behavior expected? any fixes? the hanging only happens on Windows so far.

PS: there was a previous report of this issue too:

https://devtalk.nvidia.com/default/topic/1056441/cuda-programming-and-performance/clcreatebuffernv-from-the-opencl-extension-cl_nv_create_buffer-returning-unknown-error-code/

Such a thing is certainly possible. It’s an NV specific extension after all. The initial formulation of the extension may not have had that capability. Future versions did. I would assume the way to figure out if it is supported ultimately is to run the function. If you get the invalid error, it’s not supported. In that case the solution is to update to a newer driver.

I generally don’t recommend that people attempt these things on windows WDDM. I mentioned this already. One fix is to run in Windows TCC mode. In my experience, it’s very tedious to get windows WDDM concurrency scenarios to work correctly due to command batching. I won’t be able to sort it out for you, and yes its possible due to variation in command batching, that the difficulties could shift from one driver to the next. There are many questions on the forums that discuss the issue and give possible workaround suggestions.