<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
      <title>Tagged with opencl - NVIDIA Developer Forums</title>
      <link>http://forums.developer.nvidia.com/devforum/discussions/tagged/opencl/feed.rss</link>
      <pubDate>Wed, 16 May 2012 17:33:11 -0400</pubDate>
      <description>Tagged with opencl - NVIDIA Developer Forums</description>
      <language>en-CA</language>
      <atom:link href="/devforum/discussions/tagged/opencl/feed.rss" rel="self" type="application/rss+xml" />
   <item>
      <title>Non-blocking writing/reading not working</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7721/non-blocking-writingreading-not-working</link>
      <pubDate>Tue, 01 May 2012 11:40:20 -0400</pubDate>
      <dc:creator>edodas</dc:creator>
      <guid isPermaLink="false">7721@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br />Is there any reason that non-blocking data transfer wouldn't work? I am using a Quadro FX580 and when I set the blocking parameter to CL_FALSE in clEnqueueWriteImage or clEnqueueReadImage, I still have the impression that the data writing/reading is blocking.]]></description>
   </item>
      <item>
      <title>OpenCL examples unavailable</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8151/opencl-examples-unavailable</link>
      <pubDate>Mon, 14 May 2012 05:23:15 -0400</pubDate>
      <dc:creator>Xor</dc:creator>
      <guid isPermaLink="false">8151@/devforum/discussions</guid>
      <description><![CDATA[Hi there,<br /><br />I'm trying to further optimize a separable convolution filter, but I'm unable to download the sample code from the following part of the NVIDIA site: <a href="http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html" target="_blank" rel="nofollow">http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html</a><br /><br />When I click download, it says "file not found".<br /><br />Does anyone know where these samples can be found?<br /><br />Thanks in advance.]]></description>
   </item>
      <item>
      <title>GTX680 OpenCL in Ubuntu 11.10, clBuildProgram returns -30</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8071/gtx680-opencl-in-ubuntu-11-10-clbuildprogram-returns-30</link>
      <pubDate>Fri, 11 May 2012 09:05:02 -0400</pubDate>
      <dc:creator>yashiz</dc:creator>
      <guid isPermaLink="false">8071@/devforum/discussions</guid>
      <description><![CDATA[I've tested both driver 295.49 and driver 302.07.<br /><br />clBuildProgram returns -30 (CL_INVALID_VALUE) when building some of my kernel files, but not all of them, although they are all built with the same options:<br /><br />m_ciErrNum = clBuildProgram(m_program, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL);<br /><br />One of the problem kernels is in the attachment.<br /><br />When I remove TabulateCDF1Dv, the build passes without -30.<br />When I keep TabulateCDF1Dv and remove both read_imagef and write_imagef inside it, the build passes without -30.<br />When I keep TabulateCDF1Dv and remove only write_imagef inside it, the build returns -30.<br /><br />I think the problem may be reading or writing a 2D texture with height 1, but the kernel TabulateCDF2D runs well.<br /><br />This code runs on the GTX460 and GTX580 with the latest driver, so maybe it is a GTX680 driver bug?<br /><br />It is quite weird...<br /><br />Thanks for your help.<br /><br />Here is the code (in case you cannot get the attachment):<br /><br />__kernel __attribute__((reqd_work_group_size(WORKGROUP_SIZE, 1, 1)))<br />void TabulateCDF1Dv(__read_only image2d_t CDF1D, sampler_t normSampler, __write_only image2d_t CDF1DTable, sampler_t pixSampler, int lenth)<br />{<br />       uint tid = get_global_id(0);<br />       float cdfValue = (tid+0.0f)/(lenth+0.0f);<br /><br />       float index = 0.5f;<br />       float step = 0.5f;<br /><br />       for(int i=0; i&lt;8; i++)<br />       {<br /><br />          float4 tex1DRefValue = read_imagef(CDF1D, normSampler, (float2)(index,0.0f));<br />          float refValue = tex1DRefValue.x;<br /><br />          step *= 0.5f;<br /><br />          float diff = (cdfValue - refValue);<br /><br />          if(diff &lt; pMAXERROR &amp;&amp; diff &gt; nMAXERROR)<br />          {<br />                 break;<br />          }<br />	  if(diff &lt; nMAXERROR)<br />          {<br />		index = index - step;<br />          }<br />          if(diff &gt; pMAXERROR)<br />          {<br />                index = index + step;<br />          }<br />
      }<br /><br />       write_imagef(CDF1DTable, (int2)(tid,0), (float4)(index));<br />}<br /><br />__kernel __attribute__((reqd_work_group_size(WORKGROUP_SIZE, 1, 1)))<br />void TabulateCDF2D(__read_only image2d_t CDF2D, sampler_t normSampler,__read_only image2d_t CDF1DTable, __write_only image2d_t CDF2DTable, sampler_t pixSampler, int lenth)<br />{<br />       uint tid = get_global_id(0);<br />       uint indexU = get_group_id(0);<br />       uint indexV = get_local_id(0);<br /><br />       float4 tex1DRefValue = read_imagef(CDF1DTable, pixSampler, (int2)(indexU,0));<br /><br />       float cdfValueV = (indexV+1.0f)/(lenth+1.0f);<br /><br />       float indexU0 = tex1DRefValue.x;<br />       float indexV0 = 0.5f;<br />       float step = 0.5;<br /><br /><br />       for(int i = 0; i&lt;8; i++)<br />       {<br />         float4 tex2DRefValue = read_imagef(CDF2D, normSampler, (float2)(indexV0, indexU0));<br /><br />         float refValue = tex2DRefValue.x;<br /><br />         step *= 0.5f;<br /><br />         float diff = (cdfValueV - refValue);<br /><br />          if(diff &lt; pMAXERROR &amp;&amp; diff &gt; nMAXERROR)<br />          {<br />                 break;<br />          }<br />	  if(diff &lt; nMAXERROR)<br />          {<br />		indexV0 = indexV0 - step;<br />          }<br />          if(diff &gt; pMAXERROR)<br />          {<br />                indexV0 = indexV0 + step;<br />          }<br />       }<br /><br />       write_imagef(CDF2DTable, (int2)(indexV,indexU), (float4)(indexV0));<br />}]]></description>
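      <!--
      The bisection loop in TabulateCDF1Dv above can be checked on the host,
      independently of the driver. The following is a minimal sketch in plain C,
      not the poster's code: cdf_lookup_fn is a hypothetical stand-in for
      read_imagef on the normalized CDF texture, and because pMAXERROR and
      nMAXERROR are not defined in the post, the sketch assumes
      pMAXERROR = tol and nMAXERROR = -tol.

```c
#include <assert.h>

/* Hypothetical stand-in for read_imagef on a normalized 1D CDF texture. */
typedef float (*cdf_lookup_fn)(float normalized_index);

/* Bisect for the normalized index whose CDF value is within tol of target.
   Mirrors the kernel loop: start at 0.5 and halve the step each iteration.
   Assumption: pMAXERROR = tol and nMAXERROR = -tol. */
float bisect_cdf(cdf_lookup_fn cdf, float target, float tol, int iterations)
{
    assert(tol > 0.0f && iterations > 0);
    float index = 0.5f;
    float step = 0.5f;
    for (int i = 0; i < iterations; i++) {
        float diff = target - cdf(index);
        step *= 0.5f;
        if (diff < tol && diff > -tol)
            break;              /* close enough: stop refining */
        if (diff < -tol)
            index -= step;      /* CDF too large: move left */
        if (diff > tol)
            index += step;      /* CDF too small: move right */
    }
    return index;
}

/* Identity CDF (uniform distribution), used only for the usage example. */
float uniform_cdf(float x)
{
    return x;
}
```

      With the identity CDF, bisect_cdf(uniform_cdf, 0.25f, 0.001f, 8) moves
      from 0.5 to 0.25 in the first iteration and stops in the second, so it
      returns 0.25f. Comparing such host results with the kernel output on a
      working card may help separate algorithm issues from build issues.
      -->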
   </item>
      <item>
      <title>OpenCL callbacks scheduling</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8031/opencl-callbacks-scheduling</link>
      <pubDate>Thu, 10 May 2012 09:59:48 -0400</pubDate>
      <dc:creator>rjmarques</dc:creator>
      <guid isPermaLink="false">8031@/devforum/discussions</guid>
      <description><![CDATA[Greetings,<br /><br />I am seeing a huge overhead when using callbacks on Linux. After I enqueue the necessary read operation, I set a callback for the appropriate treatment of the result. The read takes less than a millisecond to complete; however, the callback is only issued about 19 milliseconds later. Is this a driver issue?<br /><br />The graphics card is a Tesla C2050.<br />The driver version is 295.41.<br />And the GCC version is 4.4.3.<br /><br />Thanks,<br />Ricardo Marques]]></description>
   </item>
      <item>
      <title>Is Nvidia working on OpenCL 1.2 beta drivers?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3011/is-nvidia-working-on-opencl-1-2-beta-drivers</link>
      <pubDate>Tue, 03 Jan 2012 13:54:43 -0500</pubDate>
      <dc:creator>oscarbg</dc:creator>
      <guid isPermaLink="false">3011@/devforum/discussions</guid>
      <description><![CDATA[AMD already has OpenCL 1.2 beta drivers. When can registered developers expect to get a build?]]></description>
   </item>
      <item>
      <title>Driver issue at Linux 11.10</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5411/driver-issue-at-linux-11-10</link>
      <pubDate>Mon, 05 Mar 2012 04:39:28 -0500</pubDate>
      <dc:creator>sajis</dc:creator>
      <guid isPermaLink="false">5411@/devforum/discussions</guid>
      <description><![CDATA[Hello forum,<br /><br />I have installed Ubuntu Linux 11.10 on an ASUS G74 and installed the most recent NVIDIA driver instead of the default one. The machine has a GeForce GTX 560M on board.<br /><br />Now the system hangs whenever I try to run any of the CUDA examples. The very same samples run fine on Windows on the same machine. I want to test OpenCL apps on Linux, but the toolkit does not come with any OpenCL examples for Linux.<br /><br />Any suggestions for dealing with this issue?<br /><br /><br />Regards,<br />Sajjad]]></description>
   </item>
      <item>
      <title>Out of order command execution non-working on Linux?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7526/out-of-order-command-execution-non-working-on-linux</link>
      <pubDate>Wed, 25 Apr 2012 11:23:13 -0400</pubDate>
      <dc:creator>thorfdbg</dc:creator>
      <guid isPermaLink="false">7526@/devforum/discussions</guid>
      <description><![CDATA[Could it be that out-of-order execution in command queues simply does not work with the current OpenCL SDK? I'm creating a command queue with the flag CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, but when checking in the profiler, I see that the memory operations (buffer copies) and the GPU computation still do not overlap; they are executed sequentially.]]></description>
   </item>
      <item>
      <title>Low memcpy performance in OpenCL, what to do about it?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6036/low-memcpy-performance-in-opencl-what-to-do-about-it</link>
      <pubDate>Fri, 16 Mar 2012 19:41:24 -0400</pubDate>
      <dc:creator>thorfdbg</dc:creator>
      <guid isPermaLink="false">6036@/devforum/discussions</guid>
      <description><![CDATA[Folks,<br /><br />the nvvp profiler shows that my current OpenCL application has low memcpy performance on Linux: only ~400MB/sec host to device and about 800MB/sec device to host. The profiler makes suggestions for CUDA, which I'm not using (this is OpenCL). The manual states that I should allocate pinned memory.<br /><br />I tried the following approaches:<br /><br />a) Allocating the buffers with CL_MEM_ALLOC_HOST_PTR and mapping the buffers to host memory. Result: no improvement.<br /><br />b) Pinning memory myself with the mlock() Linux system call. Result: no improvement.<br /><br />Memcpy performance remains at a crawl in both setups, at exactly the same speed. This brings me to the rather paradoxical situation that even though my kernel is fast (~10ms for an operation), the memcpy to the GPU and back takes an enormous amount of time (~70ms) and makes GPU usage rather unattractive: I can get about the same speed using the SSE2 vector instructions of the CPU.<br /><br />Any help or hints?]]></description>
   </item>
      <item>
      <title>Crash with the new LLVM compiler</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6606/crash-with-the-new-llvm-compiler</link>
      <pubDate>Mon, 02 Apr 2012 06:41:37 -0400</pubDate>
      <dc:creator>Tofic</dc:creator>
      <guid isPermaLink="false">6606@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br /><br />  OpenCL 1.1, drivers 296.10, GTX 580, 64-bit compiler, Windows 7 64-bit.<br /><br />  The compiler crashes 100% of the time when trying to compile this construction (which compiles fine with the old compiler):<br /><strong>const struct BBox bbox = { (float4)(-.5f,-.5f,-.5f,0), (float4)(.5f,.5f,.5f,0) };<br /><br />	....</strong><br /><br />  Error: <em>OpenCL error 'Invalid binary': compilation error<br />	 ptxas application ptx input, line 13; error : Module-scoped variables in .local state space are not allowed with ABI</em><br /><br />       or<br /><br /><em>UNREACHABLE executed.</em><br /><br /><br />  Workaround: remove the "const" modifier. The problem started with the new LLVM compiler.<br /><br />Best wishes,<br />Anton]]></description>
   </item>
      <item>
      <title>openCL equivalent for &quot;cudaMallocPitch&quot;..?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7446/opencl-equivalent-for-cudamallocpitch-</link>
      <pubDate>Tue, 24 Apr 2012 03:01:59 -0400</pubDate>
      <dc:creator>HariniS</dc:creator>
      <guid isPermaLink="false">7446@/devforum/discussions</guid>
      <description><![CDATA[Hi everyone,<br />I am very new to OpenCL and started converting one of my CUDA codes to OpenCL, but got stuck on this part of the code. My PC has an AMD processor with an ATI graphics card which doesn't support OpenCL, but other codes run fine here by falling back to the CPU. Could someone please give me a suggestion so that I can get rid of this problem? The CUDA code is<br /><br />size_t pitch = 0;	<br />	cudaError error = cudaMallocPitch( (void**)&amp;gpu_data, (size_t*)&amp;pitch, instances-&gt;cols * sizeof(float), instances-&gt;rows);	<br /><br />for( int i = 0; i &lt; instances-&gt;rows; i++ ){	<br />		///error = cudaMemcpy((void*)(gpu_data + (pitch/sizeof(float))*i), (void*)(instances-&gt;data + (instances-&gt;cols*i)), instances-&gt;cols * sizeof(float) ,cudaMemcpyHostToDevice);	<br /><br /><br />My converted OpenCL code is<br /><br />    gpu_data = clCreateBuffer(context, CL_MEM_READ_WRITE, ((instances-&gt;cols)*(instances-&gt;rows))*sizeof(float), NULL, &amp;ret);<br /><br />for( int i = 0; i &lt; instances-&gt;rows; i++ ){	<br />   ret = clEnqueueWriteBuffer(command_queue, gpu_data , CL_TRUE, 0, ((instances-&gt;cols)*(instances-&gt;rows))*sizeof(float),(void*)(instances-&gt;data + (instances-&gt;cols*i)) , 0, NULL, NULL);<br />if(ret != CL_SUCCESS)<br />	break;<br />}<br /><br /><br />Sometimes it works fine, but sometimes it gets stuck. Whenever it gets stuck, it is in the reading part, i.e.<br /><br />ret = clEnqueueReadBuffer(command_queue, gpu_data, CL_TRUE, 0,sizeof( float ) * instances-&gt;cols* 1 , instances-&gt;data, 0, NULL, NULL);<br /><br />where "gpu_data" is device memory of type cl_mem and "instances" is a "matrix".<br />In both cases it gives <br />Unhandled exception at 0x10001098 in CL_kmeans.exe: 0xC000001D: Illegal Instruction.<br />When Break is pressed,<br />No symbols are loaded for any call stack frame. 
The source code cannot be displayed.<br /><br />and in the call stack,<br /><br />&gt;	OCL8CA9.tmp.dll!10001098() 	<br /> 	[Frames below may be incorrect and/or missing, no symbols loaded for OCL8CA9.tmp.dll]	<br /> 	amdocl.dll!5c39de16() 	<br /><br />is displayed. Could someone please help me resolve this?<br />Thanks in advance.]]></description>
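      <!--
      The per-row addressing that cudaMallocPitch implies can be sketched on
      the host in plain C. This is an illustration under stated assumptions,
      not the poster's code: row_offset_bytes and copy_rows_pitched are
      hypothetical helper names, and memcpy stands in for the device copy.
      Row i of a pitched buffer starts at byte offset i * pitch, and each
      per-row copy moves cols * sizeof(float) bytes. In OpenCL, the same
      offset and size pair would be what a per-row clEnqueueWriteBuffer call
      receives in its offset and size arguments; for a tightly packed
      clCreateBuffer allocation, the pitch is simply cols * sizeof(float).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte offset of a row in a buffer whose rows are pitch_bytes apart. */
size_t row_offset_bytes(size_t row, size_t pitch_bytes)
{
    return row * pitch_bytes;
}

/* Copy a tightly packed rows x cols float matrix into a pitched destination,
   one row at a time. pitch_bytes must be at least cols * sizeof(float). */
void copy_rows_pitched(unsigned char *dst, size_t pitch_bytes,
                       const float *src, size_t rows, size_t cols)
{
    size_t row_bytes = cols * sizeof(float);
    assert(pitch_bytes >= row_bytes);
    for (size_t i = 0; i < rows; i++) {
        memcpy(dst + row_offset_bytes(i, pitch_bytes),  /* start of row i */
               src + i * cols,                          /* packed source row */
               row_bytes);                              /* one row only */
    }
}
```

      Each iteration copies exactly one row to its own offset, in contrast to
      a loop that passes offset 0 and the full matrix size on every iteration.
      -->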
   </item>
      <item>
      <title>OpenCL and DirectCompute</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/181/opencl-and-directcompute</link>
      <pubDate>Mon, 29 Aug 2011 17:52:21 -0400</pubDate>
      <dc:creator>Nadeem Mohammad</dc:creator>
      <guid isPermaLink="false">181@/devforum/discussions</guid>
      <description><![CDATA[NVIDIA continues to offer some of the most advanced support for OpenCL and DirectCompute.<br />When discussing these, please use the correct tags below.]]></description>
   </item>
      <item>
      <title>When will there be OpenCL 1.2 drivers?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7401/when-will-there-be-opencl-1-2-drivers</link>
      <pubDate>Sun, 22 Apr 2012 21:08:13 -0400</pubDate>
      <dc:creator>ekpyron</dc:creator>
      <guid isPermaLink="false">7401@/devforum/discussions</guid>
      <description><![CDATA[Is there any information about a likely release date for OpenCL 1.2 drivers (and/or when a conformance candidate will be available to registered developers)?<br />The specification was published several months ago and competitors have already published beta drivers, so I'd appreciate at least some information about when to expect support from NVIDIA.<br />OpenCL 1.2 includes major improvements and adds some key features to the API, so I eagerly await supporting drivers.]]></description>
   </item>
      <item>
      <title>Simultaneous data transfer</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7326/simultaneous-data-transfert</link>
      <pubDate>Thu, 19 Apr 2012 11:31:20 -0400</pubDate>
      <dc:creator>edodas</dc:creator>
      <guid isPermaLink="false">7326@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br /><br />I am new to OpenCL and I have a question about data transfer from the host to a device (a GPU in my case):<br />is it possible to write (or read) simultaneously to (from) two different memory objects?<br />(using the same command queue)]]></description>
   </item>
      <item>
      <title>Know of any multi-GPU image segmentation algorithms?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7261/know-of-any-multi-gpu-image-segmentation-algorithms</link>
      <pubDate>Wed, 18 Apr 2012 09:53:03 -0400</pubDate>
      <dc:creator>chippies</dc:creator>
      <guid isPermaLink="false">7261@/devforum/discussions</guid>
      <description><![CDATA[Hi all, has anyone published research on image segmentation algorithms that run efficiently on multiple GPUs and handle gigabytes of image data (3D volumes)? If so, can you please post links, titles or at least the names of institutions that are doing this research?<br /><br />I'm asking because it might be a good research direction and I want to know what others have done, but Google, IEEE Xplore, ScienceDirect, SpringerLink and EBSCOhost haven't turned up any hits.<br /><br />Is there a tag for image processing?]]></description>
   </item>
      <item>
      <title>OpenCL issue with Driver</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5741/opencl-issue-with-driver</link>
      <pubDate>Sat, 10 Mar 2012 01:59:43 -0500</pubDate>
      <dc:creator>christophheindl</dc:creator>
      <guid isPermaLink="false">5741@/devforum/discussions</guid>
      <description><![CDATA[Lately, our users have complained that ReconstructMe won't work with the latest 295.73 driver. When compiling the following stripped-down OpenCL code with -cl-nv-arch sm_12 (or on a device which does not support sm_13)<br /><br />float4 p0 = points[i];<br />float4 p1 = points[i+1];<br />float4 p2 = points[i+2];<br /><br />float4 n_tmp = -normalize(cross(p1 - p0, p2 - p0));<br /><br />we get a lot of error messages using the latest driver:<br /><br />Instruction 'mov' requires SM 1.3 or higher, or map_f64_to_f32 directiv<br />Instruction 'mov' requires SM 1.3 or higher, or map_f64_to_f32 directiv<br />Instruction 'mov' requires SM 1.3 or higher, or map_f64_to_f32 directiv<br />Instruction 'cvt' requires SM 1.3 or higher, or map_f64_to_f32 directiv<br />Instruction 'abs' requires SM 1.3 or higher, or map_f64_to_f32 directiv<br />Instruction 'setp' requires SM 1.3 or higher, or map_f64_to_f32 directi<br />Double is not supported. Demoting to float<br /><br />I suspect it is due to the translation of the normalize call, since the cross product seems to work fine. Since we are not using any double-precision arithmetic, I think it is a driver-related bug.<br /><br />With 285.62 everything works fine. Here are the reports:<br /><br /><a href="https://groups.google.com/d/topic/reconstructme/TgDG9f4AjJI/discussion" target="_blank" rel="nofollow">https://groups.google.com/d/topic/reconstructme/TgDG9f4AjJI/discussion</a>]]></description>
   </item>
      <item>
      <title>Sharing numerous small buffer objects between OpenGL and OpenCL-- driver issue or user issue?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6466/sharing-numerous-small-buffer-objects-between-opengl-and-opencl-driver-issue-or-user-issue</link>
      <pubDate>Tue, 27 Mar 2012 20:43:12 -0400</pubDate>
      <dc:creator>escorelle</dc:creator>
      <guid isPermaLink="false">6466@/devforum/discussions</guid>
      <description><![CDATA[I was writing some test code for an application which included allocating a lot of small buffers and sharing them with OpenCL.  I noticed that after about 3000 shares, I would get errors for the next few hundred, and then everything would be fine afterwards.  I'm going to put a work around in place, but was wondering what was going on.  The following code (context initialization omitted) shows the issue:<br /><br /><br />	std::vector&lt; unsigned &gt; vbos;<br /><br />	for( int i = 0; i &lt; 5000; ++i )<br />	{<br />		unsigned vbo = 0;<br />		glGenBuffers( 1, &amp;vbo );<br />		vbos.push_back( vbo );<br /><br />		glBindBuffer( GL_ARRAY_BUFFER, vbo );<br />		glBufferData( GL_ARRAY_BUFFER, 355 * sizeof( float ) * 4, 0, GL_STATIC_DRAW );<br />		glBindBuffer( GL_ARRAY_BUFFER, 0 );<br /><br />		cl_mem sharedObject = 0; <br /><br />		cl_int clError = 0;<br />		sharedObject = clCreateFromGLBuffer( context_, CL_MEM_READ_WRITE, vbo, &amp;clError );<br /><br />		if( clError != CL_SUCCESS )<br />		{<br />			std::cout &lt;&lt; "Error creating GL Interop Buffer." &lt;&lt; std::endl;<br />		}<br /><br />		if( sharedObject )<br />		{<br />			clReleaseMemObject( sharedObject );<br />		}<br /><br />	    clFinish( queue_ );<br />	}<br /><br />	glDeleteBuffers( vbos.size(), &amp;vbos[0] );<br /><br />Card: nVidia GeForce GTX 260M<br />Driver: 286.16]]></description>
   </item>
      <item>
      <title>Instrumented driver for profiling on Linux</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5256/instrumented-driver-for-profiling-on-linux</link>
      <pubDate>Wed, 29 Feb 2012 05:58:11 -0500</pubDate>
      <dc:creator>thorfdbg</dc:creator>
      <guid isPermaLink="false">5256@/devforum/discussions</guid>
      <description><![CDATA[Dear NVIDIA team,<br /><br />while looking into OpenCL development, I'm missing a method for profiling my kernel code, which runs slower than expected. The nvvp profiler on Linux is of less help than expected, as it cannot collect all the data necessary for a full analysis [I get an error saying "CUPTI_ERROR_PARAMETER_SIZE_NOT_SUFFICIENT"]. I believe this might be because I'm not using an instrumented X driver for my 560GT graphics card. However, the latest available driver I could find, NVPerfKit-Linux-x86_64-195.36.31, does not support the Fermi-based chips. Where would I find newer instrumented drivers and/or a fully functional nvvp profiler?]]></description>
   </item>
      <item>
      <title>clEnqueueReadBuffer always blocks with newer drivers</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5296/clenqueuereadbuffer-always-blocks-with-newer-drivers</link>
      <pubDate>Thu, 01 Mar 2012 02:27:30 -0500</pubDate>
      <dc:creator>stefanvehoff</dc:creator>
      <guid isPermaLink="false">5296@/devforum/discussions</guid>
      <description><![CDATA[Dear NVIDIA, dear OpenCL developers,<br /><br />we have been using OpenCL on Windows systems equipped with NVIDIA cards for quite some time now and so far have not had any problems. However, over the last few days I tried to replace our calls to "clFinish", which occur at the end of our GPU computations, with something that does not occupy one full CPU core until the GPU is finished (sleeps in combination with event status polling).<br /><br />While doing so I realized that the calls to "clEnqueueReadBuffer", which are used to copy the results of the computation back to the host, always block the CPU. This is expected, of course, when the "blocking_read" parameter is set to true, but this is not the case in our code. This is why we introduced the calls to "clFinish" in the first place.<br /><br />So yesterday I did a simple check: I added a time measurement to our code which reports the time the CPU needs to execute all of the GPU commands. These range from copying data to the GPU through all the calculations on the GPU to the "clEnqueueReadBuffer" commands at the end, but do NOT include the "clFinish" command. Running this code on my machine (i7 975, GTX 560, 295.73 driver) I get values of around 14 ms, and slightly larger values on the machine of a colleague (i7 975, GTX 295, 285-something driver). Running the same code on yet another machine with an older driver (i7 975, GTX 295, 266-something) I get results of around 1 ms, and this is also how I remember it, i.e. the CPU rushing through the GPU commands without caring whether they are finished or not.<br /><br />Am I doing something wrong or is this really a driver issue/bug? Has anyone experienced something similar, or is somebody willing to test the behavior of "clEnqueueReadBuffer" on their system?<br /><br />Thanks for your help!]]></description>
   </item>
      <item>
      <title>[volumeRender] Why are unequally sized volumes rendered as cubes (i.e., scaled)?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4791/volumerender-why-are-unequally-sized-volumes-rendered-as-cubes-i-e-scaled</link>
      <pubDate>Thu, 16 Feb 2012 09:12:51 -0500</pubDate>
      <dc:creator>ivma</dc:creator>
      <guid isPermaLink="false">4791@/devforum/discussions</guid>
      <description><![CDATA[Hi!<br />I am trying out the volume renderer from the NVIDIA GPU Computing SDK 4.1/4.0 and I was wondering why it renders the Bucky.raw volume (256x256x256) correctly but scales unequally sized volumes such as the lobster (120x120x34)[1].<br /><br />Here are some resulting images (from the bottom, 120x120, and from the side, where the volume resolution in the Z direction is only 34 voxels):<br /><img src="http://img593.imageshack.us/img593/6965/lobsterbottom.png" alt="Bottom view" /><br /><img src="http://img853.imageshack.us/img853/2559/lobsterside.png" alt="Side view" /><br /><br />Does anybody have a clue why that is and possibly how to fix it?<br /><br />PS: I have tried it on a couple of other data sets as well, with the same effect.<br /><br />Greetings,<br />ivma<br /><br />[1] <a href="http://www.cg.tuwien.ac.at/courses/Visualisierung/data/lobster.zip">lobster.zip</a>]]></description>
   </item>
      <item>
      <title>Bug in cl_nv_d3d11_sharing: Cannot map DirectX buffer for interop</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5791/bug-in-cl_nv_d3d11_sharing-cannot-map-directx-buffer-for-interop</link>
      <pubDate>Mon, 12 Mar 2012 03:25:21 -0400</pubDate>
      <dc:creator>mchajdas</dc:creator>
      <guid isPermaLink="false">5791@/devforum/discussions</guid>
      <description><![CDATA[There seems to be a bug in cl_nv_d3d11_sharing which prevents more than 526 buffers being mapped for interop, even if there is plenty of memory free. In the attached application, we create, map and release 1 MiB sized buffers and on both a 560 Ti with 1 GiB of memory and a 480 with 1.5 GiB it fails at around buffer 526 (depending on the buffer size.) We have reproduced the issue also on a 3 GiB Tesla C2050. What's going on here? As we release each and every buffer, we should be able to map them until the end of days, shouldn't we?]]></description>
   </item>
      <item>
      <title>Include, link and use both OpenCL and CUDA in the same application at the same time, is it possible?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5571/include-link-and-use-both-opencl-and-cuda-in-the-same-application-at-the-same-time-is-it-possible</link>
      <pubDate>Wed, 07 Mar 2012 12:53:05 -0500</pubDate>
      <dc:creator>ggrocca</dc:creator>
      <guid isPermaLink="false">5571@/devforum/discussions</guid>
      <description><![CDATA[Maybe it's a stupid question, but I'm optimizing an app that currently uses a lot of OpenCL code. After a lot of work, the most needed thing now is optimization of some sparse matrix operations (currently done on the CPU with Eigen). It seems that the most complete library in this field is cuSPARSE, which is obviously a CUDA library.<br /><br />Has anybody tried something like this? Are there obvious problems that I am not foreseeing?<br /><br />I wanted to know if anyone has insights to share in this regard before I start coding.<br /><br />Thank you very much in advance!]]></description>
   </item>
      <item>
      <title>Different OpenCL results with different drivers</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5681/different-opencl-results-with-different-drivers</link>
      <pubDate>Thu, 08 Mar 2012 18:41:25 -0500</pubDate>
      <dc:creator>Dan Mackley</dc:creator>
      <guid isPermaLink="false">5681@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br />Until today, I have been using the libOpenCL.so that came with the 280 driver to build a C program that does a just-in-time compile of OpenCL code and then runs it and uses the results.  The application runs on multiple, different platforms with possibly different nVidia hardware and driver versions.<br />Today I loaded the 295 driver, and the program gives different results -- different enough that they're unacceptable, numerically.<br />For various reasons, I do not want to keep multiple versions of our application around, built with and keyed to the different nVidia drivers (we have multiple customers with different hardware configs)... I'd prefer to stick with just 32- and 64-bit versions.<br />Is this the way things work -- does the libOpenCL that's used when building a C program need to match the runtime system's libOpenCL?<br />(By the way, both libOpenCL's are numbered as version 1.0.0)]]></description>
   </item>
      <item>
      <title>Automating profile collection for nvvp</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5641/automating-profile-collection-for-nvvp</link>
      <pubDate>Thu, 08 Mar 2012 08:42:05 -0500</pubDate>
      <dc:creator>bmerry</dc:creator>
      <guid isPermaLink="false">5641@/devforum/discussions</guid>
      <description><![CDATA[I've found nvvp in CUDA 4.1 to be quite handy, but since running my program to do the analysis takes a long time, I'd like to automate it as part of the nightly build. I've looked at the COMPUTE_PROFILE_* environment variables, but they're a very low-level interface for selecting a specific set of counters, and to use them it seems you need to know which counters are compatible with each other. What I'd ideally like is a way to take a project I've saved in nvvp and, from the command line, request that it rerun everything and store all the results back in a project file that I can open in nvvp and have all the analyses available.<br /><br />I don't mind getting my hands dirty with some scripting to do it, but I don't really know where to start, since I don't know which counters nvvp wants on each run or how to integrate the per-run profiles into one easy-to-open file.]]></description>
   </item>
      <item>
      <title>I want to know the meaning of envreg0~31 in PTX code.</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5131/i-want-to-know-the-meaning-of-envreg031-in-ptx-code-</link>
      <pubDate>Sun, 26 Feb 2012 01:31:59 -0500</pubDate>
      <dc:creator>komb</dc:creator>
      <guid isPermaLink="false">5131@/devforum/discussions</guid>
      <description><![CDATA[I am a GPGPU programmer studying OpenCL.<br />I have a GeForce GT520M.<br /><br />I have a question about PTX code.<br />I generated PTX code for a matrix multiplication.<br />It uses special registers %envreg0 through %envreg6.<br />I guess that envreg0 and envreg1 are the group IDs for x and y in two dimensions,<br />but I can't work out the meanings of the other registers.<br />I can't find any documents on that. The meaning is not described in the PTX spec.<br />Please let me know the meaning of the special register envreg. Are there any documents?]]></description>
   </item>
      <item>
      <title>Geforce GTS250 OpenCL compiler bug?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5521/geforce-gts250-opencl-compiler-bug</link>
      <pubDate>Tue, 06 Mar 2012 12:03:22 -0500</pubDate>
      <dc:creator>hakuliu</dc:creator>
      <guid isPermaLink="false">5521@/devforum/discussions</guid>
      <description><![CDATA[Hi, I'm working on a project using OpenCL and encountered this bug while using the GTS250.<br />I'm using OpenCL from Java via the JavaCL library, <a href="http://code.google.com/p/javacl/">http://code.google.com/p/javacl/</a><br />However, I'm quite sure that the bug is not in JavaCL...<br /><br />I made a reproducible unit test for this case. The kernel is <a href="http://pastebin.com/RekDVG8r">as follows</a> and the unit test (which is in Java) looks like <a href="http://pastebin.com/K5TnZHua">this</a>. There are a bunch of utility classes in there that are not explained, but I'm sure it should be pretty easy to understand...<br /><br />I ran this test, and it fails during the compilation process with <a href="http://pastebin.com/YYedvQrb">this stack trace.</a><br /><br />Peculiarly, I have Windows and Linux installed on this computer, and it seems to work on Linux but not on Windows (with the same hardware, probably different drivers...). Could it be that the Windows OpenCL driver for this card has some sort of bug?]]></description>
   </item>
      <item>
      <title>for - break issue</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4836/for-break-issue</link>
      <pubDate>Sat, 18 Feb 2012 06:39:20 -0500</pubDate>
      <dc:creator>thefonzy0</dc:creator>
      <guid isPermaLink="false">4836@/devforum/discussions</guid>
      <description><![CDATA[Hi all. Please, can anyone help me with my puzzle?<br />I write a CG Vertex/Fragment program for 3D real time medical data visualization. Until I changed my PC and operating system (from Windows 7 32 bit to Windows 7 64 bit), the program worked correctly, but now I have strange display errors. I have isolated and identified the portion of code that does not work, trial and error (since there is no a debugger for CG ... this thing is really absurd!!). This is my code:<br /><code> 		    <br />	float3  Position;<br />	float3  Position1;<br />        int vidx1 ;<br />	int vidx2;<br />	float3 vecV1;<br />	float3 vecV2;<br />	float3 vecStart;<br />	float3 vecDir;<br />	float  denom;<br />	float  lambda ;<br />	int e;<br />	bool Test0, Test1, Test2,Test3,Test4,Test5;<br />	float   dPlane = dPlaneStart + Vin.y * dPlaneIncr;<br />	for (e = 0; e &lt; 4; ++e) {<br />		vidx1 = nSequence[int(frontIdx * 8 + v1[Vin.x*4+e])];<br />		vidx2 = nSequence[int(frontIdx * 8 + v2[Vin.x*4+e])];<br />		vecV1=vecVertices [vidx1];<br />		vecV2=vecVertices [vidx2];<br />		vecDir=vecV2-vecV1;<br />		vecStart=vecV1+vecTranslate;<br />		denom= dot(vecDir,vecWiew);<br />		lambda = (denom!=0.0)?(dPlane - dot(vecStart,vecWiew))/denom : -1.0;<br />&lt;strong&gt;		if ((lambda &gt;=0.0) &amp;&amp; (lambda&lt;=1.0)){<br />			Position= vecStart + lambda * vecDir;<br />			break;<br />	 	 }<br />&lt;/strong&gt;	   }<br />	  Test0=(Vin.x==0)&amp;&amp; Position.x==-1;  // OK<br />	  Test1=(Vin.x==1);  // OK<br />	  Test2=(Vin.x==2);  // OK<br />	  Test3=(Vin.x==3);  // OK<br />	  Test4=(Vin.x==4);  // OK<br />	  Test5=(Vin.x==5);  // OK<br />	  ... 
</code><br /><br />The portion of code that does not work is that of the if block: if I split that :<br /><code>           <br />	if ((lambda &gt;=0.0) &amp;&amp; (lambda&lt;=1.0))<br />		 Position= vecStart + lambda * vecDir;<br />	if ((lambda &gt;=0.0) &amp;&amp; (lambda&lt;=1.0))<br />		break;<br /></code><br />All appear to work.<br /><br />Am I doing something wrong?<br /><br />regards,<br />TheFonzy]]></description>
   </item>
      <item>
      <title>Floating Point Number Errors</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5126/floating-point-number-errors</link>
      <pubDate>Sun, 26 Feb 2012 01:27:01 -0500</pubDate>
      <dc:creator>komb</dc:creator>
      <guid isPermaLink="false">5126@/devforum/discussions</guid>
      <description><![CDATA[I have a question.<br />I installed the SDK and am studying OpenCL.<br />The SDK includes OpenCL examples, for example DCT8X8.<br />It checks the floating-point values by comparing the result computed by the GPU with the result computed by the CPU.<br />But the floating-point values are not the same. There is a small difference.<br />Why is there a difference?<br /><br />I checked the difference.<br />Every part of the IEEE 754 representation is the same except the mantissa, which is slightly different.<br />Please let me know why there is a difference.]]></description>
   </item>
      <item>
      <title>clSetKernelArg in loop CL_OUT_OF_RESOURCES</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4986/clsetkernelarg-in-loop-cl_out_of_ressources</link>
      <pubDate>Wed, 22 Feb 2012 05:47:26 -0500</pubDate>
      <dc:creator>ratcrever3</dc:creator>
      <guid isPermaLink="false">4986@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br /><br />I have to pass kernel arguments in a loop.<br /><br />for (...)<br />{<br />clSetKernelArg(....)<br />...<br />//running the kernel<br /><br />clWaitForEvents(...)<br />}<br /><br />After a while, clWaitForEvents returns CL_OUT_OF_RESOURCES on Windows.<br />It works just fine on Mac and Linux.<br /><br />I suspect a driver problem, even with the latest one!<br /><br />Has anyone else had this kind of problem?<br /><br />Sincerely]]></description>
   </item>
      <item>
      <title>OpenCL: Using Constant Memory Causes CL_INVALID_KERNEL_ARGS from clEnqueueReadBuffer</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4891/opencl-using-constant-memory-causes-cl_invalid_kernel_args-from-clenqueuereadbuffer</link>
      <pubDate>Mon, 20 Feb 2012 03:37:01 -0500</pubDate>
      <dc:creator>homemade-jam</dc:creator>
      <guid isPermaLink="false">4891@/devforum/discussions</guid>
      <description><![CDATA[I am reposting a question I asked here: <a href="http://www.khronos.org/message_boards/viewtopic.php?f=28&amp;t=4758&amp;sid=7ffe1878e45cb6600e61e1d09fe24eba">http://www.khronos.org/message_boards/viewtopic.php?f=28&amp;t=4758&amp;sid=7ffe1878e45cb6600e61e1d09fe24eba</a><br /><br />I recently switched my code to use events to indicate dependencies rather than flushing the command queue each time. This works fine 90% of the time, except in this particular case.<br /><br />I have a kernel that works fine under the Intel and AMD SDKs for my CPU and used to work fine on NVIDIA too. It uses constant memory. All my other kernels (the 90%) use textures and global memory, and these work fine. Both kernels are identical except for the __constant in the args. The constant memory kernel causes an error to be returned from the clEnqueueReadBuffer command.<br /><br />The documentation doesn't mention CL_INVALID_KERNEL_ARGS as a possible result of clEnqueueReadBuffer. Has anyone experienced this problem before? I can't see how it has occurred.]]></description>
   </item>
      <item>
      <title>Killing the watchdog timer under macOSX lion</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4866/killing-the-watchdog-timer-under-macosx-lion</link>
      <pubDate>Sun, 19 Feb 2012 05:34:42 -0500</pubDate>
      <dc:creator>ratcrever3</dc:creator>
      <guid isPermaLink="false">4866@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br /><br />I am running into the watchdog timer problem during the execution of a kernel, and I did not find any solution on the forum.<br />My kernel traverses a half-edge structure and fetches neighborhoods.<br />Even when I create small groups, the kernel is killed...<br />The kernel executes successfully on Linux and Windows once the watchdog timer has been disabled.<br />My only solution is to deactivate it...<br /><br />My question is the following:<br />How can I deactivate the watchdog timer?<br /><br />Best regards<br />]]></description>
   </item>
      <item>
      <title>OpenCL app crashed kernel mode display driver, System can&#039;t boot!</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4846/opencl-app-crashed-kernel-mode-display-driver-system-cant-boot</link>
      <pubDate>Sat, 18 Feb 2012 13:36:24 -0500</pubDate>
      <dc:creator>pixelpusher</dc:creator>
      <guid isPermaLink="false">4846@/devforum/discussions</guid>
      <description><![CDATA[I have come across a very horrible problem.<br />I was running an OpenCL + OpenGL raytracing program in VS 2010.<br />When I simply tried to run the program, it crashed the <em>user</em> mode driver (the screen just goes blank and comes right back), but when I tried to debug the program it crashed the <em>kernel</em> mode driver and the system had to do a hard restart. When the computer tried to reboot, instead of loading the BIOS, it just beeped 3 times. :(<br /><a href="http://downloadmirror.intel.com/19451/eng/DX58SO2_ProductGuide01_English.pdf">This</a> page for my motherboard says that it is a memory error, but I checked both RAM sticks and they are fine. The motherboard also shows EB or E8 (can't tell) on the LCD display, but that has to do with boot device selection?<br />Now I cannot boot my system!<br />How could crashing the display driver cause this kind of issue? Has anyone ever experienced this sort of thing before?<br />I would appreciate any thoughts you might have; I am getting quite desperate.<br />Thanks]]></description>
   </item>
      <item>
      <title>Code runs on weak GT540M but not on fast GF 560Ti ??</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1876/code-runs-on-weak-gt540m-but-not-on-fast-gf-560ti-</link>
      <pubDate>Sun, 27 Nov 2011 07:46:51 -0500</pubDate>
      <dc:creator>StefanDS</dc:creator>
      <guid isPermaLink="false">1876@/devforum/discussions</guid>
      <description><![CDATA[Hi all,<br /><br />I'm writing some OpenCL kernels to do stream compaction, which works fine on a Quadro FX 4600 (old powerful professional card) and an GT 540M (1-year old low-range laptop chipset). But when I try to run on a GTX 560 Ti (1-year old high/mid range desktop card), a clCreateFromGLTexture2D call returns CL_OUT_OF_RESOURCES.<br /><br />How is this possible? The Quadro I could understand, I believe the drivers of professional cards manage resources differently, but the GTX 560 is in every way a better card then the GT540M, and they are both consumer cards. The 560 also has more recent drivers, I would expect it to run better, not worse.<br /><br />Does anyone has an idea on what could cause this? I'll attach the output of oclDeviceQuery for both cards (except the imageformat stuff with is identical):<br /><br /> ---------------------------------<br /> Device GeForce GT 540M<br /> ---------------------------------<br />  CL_DEVICE_NAME: 			GeForce GT 540M<br />  CL_DEVICE_VENDOR: 			NVIDIA Corporation<br />  CL_DRIVER_VERSION: 			270.51<br />  CL_DEVICE_VERSION: 			OpenCL 1.0 CUDA<br />  CL_DEVICE_TYPE:			CL_DEVICE_TYPE_GPU<br />  CL_DEVICE_MAX_COMPUTE_UNITS:		2<br />  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:	3<br />  CL_DEVICE_MAX_WORK_ITEM_SIZES:	1024 / 1024 / 64 <br />  CL_DEVICE_MAX_WORK_GROUP_SIZE:	1024<br />  CL_DEVICE_MAX_CLOCK_FREQUENCY:	1344 MHz<br />  CL_DEVICE_ADDRESS_BITS:		32<br />  CL_DEVICE_MAX_MEM_ALLOC_SIZE:		248 MByte<br />  CL_DEVICE_GLOBAL_MEM_SIZE:		993 MByte<br />  CL_DEVICE_ERROR_CORRECTION_SUPPORT:	no<br />  CL_DEVICE_LOCAL_MEM_TYPE:		local<br />  CL_DEVICE_LOCAL_MEM_SIZE:		48 KByte<br />  CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:	64 KByte<br />  CL_DEVICE_QUEUE_PROPERTIES:		CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE<br />  CL_DEVICE_QUEUE_PROPERTIES:		CL_QUEUE_PROFILING_ENABLE<br />  CL_DEVICE_IMAGE_SUPPORT:		1<br />  CL_DEVICE_MAX_READ_IMAGE_ARGS:	128<br />  CL_DEVICE_MAX_WRITE_IMAGE_ARGS:	8<br />  
CL_DEVICE_SINGLE_FP_CONFIG:		denorms INF-quietNaNs round-to-nearest round-to-zero round-to-inf fma <br /><br />  CL_DEVICE_IMAGE 			2D_MAX_WIDTH	 4096<br />					2D_MAX_HEIGHT	 32768<br />					3D_MAX_WIDTH	 2048<br />					3D_MAX_HEIGHT	 2048<br />					3D_MAX_DEPTH	 2048<br /><br />  CL_DEVICE_EXTENSIONS:			cl_khr_byte_addressable_store<br />					cl_khr_icd<br />					cl_khr_gl_sharing<br />					cl_nv_d3d9_sharing<br />					cl_nv_d3d10_sharing<br />					cl_khr_d3d10_sharing<br />					cl_nv_d3d11_sharing<br />					cl_nv_compiler_options<br />					cl_nv_device_attribute_query<br />					cl_nv_pragma_unroll<br />					cl_khr_global_int32_base_atomics<br />					cl_khr_global_int32_extended_atomics<br />					cl_khr_local_int32_base_atomics<br />					cl_khr_local_int32_extended_atomics<br />					cl_khr_fp64<br /><br /><br />  CL_DEVICE_COMPUTE_CAPABILITY_NV:	2.1<br />  NUMBER OF MULTIPROCESSORS:		2<br />  NUMBER OF CUDA CORES:			96<br />  CL_DEVICE_REGISTERS_PER_BLOCK_NV:	32768<br />  CL_DEVICE_WARP_SIZE_NV:		32<br />  CL_DEVICE_GPU_OVERLAP_NV:		CL_TRUE<br />  CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV:	CL_TRUE<br />  CL_DEVICE_INTEGRATED_MEMORY_NV:	CL_FALSE<br />  CL_DEVICE_PREFERRED_VECTOR_WIDTH_	CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1<br /><br /> ---------------------------------<br /> Device GeForce GTX 560 Ti<br /> ---------------------------------<br />  CL_DEVICE_NAME: 			GeForce GTX 560 Ti<br />  CL_DEVICE_VENDOR: 			NVIDIA Corporation<br />  CL_DRIVER_VERSION: 			285.62<br />  CL_DEVICE_VERSION: 			OpenCL 1.1 CUDA<br />  CL_DEVICE_OPENCL_C_VERSION: 		OpenCL C 1.1 <br />  CL_DEVICE_TYPE:			CL_DEVICE_TYPE_GPU<br />  CL_DEVICE_MAX_COMPUTE_UNITS:		8<br />  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:	3<br />  CL_DEVICE_MAX_WORK_ITEM_SIZES:	1024 / 1024 / 64 <br />  CL_DEVICE_MAX_WORK_GROUP_SIZE:	1024<br />  CL_DEVICE_MAX_CLOCK_FREQUENCY:	1660 MHz<br />  CL_DEVICE_ADDRESS_BITS:		32<br />  CL_DEVICE_MAX_MEM_ALLOC_SIZE:		256 MByte<br />  CL_DEVICE_GLOBAL_MEM_SIZE:		1024 MByte<br />  
CL_DEVICE_ERROR_CORRECTION_SUPPORT:	no<br />  CL_DEVICE_LOCAL_MEM_TYPE:		local<br />  CL_DEVICE_LOCAL_MEM_SIZE:		48 KByte<br />  CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:	64 KByte<br />  CL_DEVICE_QUEUE_PROPERTIES:		CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE<br />  CL_DEVICE_QUEUE_PROPERTIES:		CL_QUEUE_PROFILING_ENABLE<br />  CL_DEVICE_IMAGE_SUPPORT:		1<br />  CL_DEVICE_MAX_READ_IMAGE_ARGS:	128<br />  CL_DEVICE_MAX_WRITE_IMAGE_ARGS:	8<br />  CL_DEVICE_SINGLE_FP_CONFIG:		denorms INF-quietNaNs round-to-nearest round-to-zero round-to-inf fma <br /><br />  CL_DEVICE_IMAGE 			2D_MAX_WIDTH	 32768<br />					2D_MAX_HEIGHT	 32768<br />					3D_MAX_WIDTH	 2048<br />					3D_MAX_HEIGHT	 2048<br />					3D_MAX_DEPTH	 2048<br /><br />  CL_DEVICE_EXTENSIONS:			cl_khr_byte_addressable_store<br />					cl_khr_icd<br />					cl_khr_gl_sharing<br />					cl_nv_d3d9_sharing<br />					cl_nv_d3d10_sharing<br />					cl_khr_d3d10_sharing<br />					cl_nv_d3d11_sharing<br />					cl_nv_compiler_options<br />					cl_nv_device_attribute_query<br />					cl_nv_pragma_unroll<br />					cl_khr_global_int32_base_atomics<br />					cl_khr_global_int32_extended_atomics<br />					cl_khr_local_int32_base_atomics<br />					cl_khr_local_int32_extended_atomics<br />					cl_khr_fp64<br /><br /><br />  CL_DEVICE_COMPUTE_CAPABILITY_NV:	2.1<br />  NUMBER OF MULTIPROCESSORS:		8<br />  NUMBER OF CUDA CORES:			384<br />  CL_DEVICE_REGISTERS_PER_BLOCK_NV:	32768<br />  CL_DEVICE_WARP_SIZE_NV:		32<br />  CL_DEVICE_GPU_OVERLAP_NV:		CL_TRUE<br />  CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV:	CL_TRUE<br />  CL_DEVICE_INTEGRATED_MEMORY_NV:	CL_FALSE<br />  CL_DEVICE_PREFERRED_VECTOR_WIDTH_	CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1<br /><br />]]></description>
   </item>
      <item>
      <title>nsight OpenCL profiling issue</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4316/nsight-opencl-profiling-issue</link>
      <pubDate>Sun, 05 Feb 2012 10:49:26 -0500</pubDate>
      <dc:creator>ivanmalin</dc:creator>
      <guid isPermaLink="false">4316@/devforum/discussions</guid>
      <description><![CDATA[Hi! I'm a beginner in OpenCL and would like to ask a question.<br />I'm trying to get profiling information about my OpenCL app, without success.<br />I'm using MS VS 2010 and Parallel Nsight 2.1 under Windows 7 x64 with a GTX 550 Ti card.<br />I start a new analysis activity, set its type to "Profile", set the experiment configuration to "Memory", and launch. But after it finishes, I get no information except for "Session Summary", and in the report the events collection status is No Events Captured.<br />How can I get all the counters, statistics, etc.?<br /><br />P.S. Also, when I use the NVIDIA Visual Profiler from the 4.1 Toolkit, I get an error saying "Unable to collect metric and event values" CUPTI_ERROR_PARAMETER_SIZE_NOT_SUFFICIENT]]></description>
   </item>
      <item>
      <title>Visual Profiler not working under Mac OS X</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4396/visual-profiler-not-working-under-mac-os-x</link>
      <pubDate>Tue, 07 Feb 2012 11:13:24 -0500</pubDate>
      <dc:creator>juanrgar</dc:creator>
      <guid isPermaLink="false">4396@/devforum/discussions</guid>
      <description><![CDATA[I have an OpenCL program that runs without problems on a MacBook Pro with Snow Leopard installed, using Apple's OpenCL implementation. I want to profile that program with Visual Profiler. I have installed the NVIDIA developer drivers and then the CUDA Toolkit, with matching versions. I have tried versions 4.1 and 4.0. When I run my program under Visual Profiler 4.1 I get a "No Timeline" message, and no analysis can be performed. When running under Visual Profiler 4.0 I get the following error message:<br /><br />Application : "/Users/juanrgar/Work/SSII/ldpc-gpu/jrgarcia/src/OpenCL/serial-schedule/ver0/ldpc".<br />Profiler data file '/Users/juanrgar/Work/SSII/ldpc-gpu/jrgarcia/src/OpenCL/serial-schedule/ver0/temp_compute_profiler_0_0.csv' for application run 0 not found.<br /><br />I think the problem is that I am using Apple's OpenCL implementation, so the platform is identified as "Apple" instead of "NVIDIA OpenCL" or something like that, and Visual Profiler thinks it is a plain C program. I have installed the CUDA Toolkit in Windows too, and I get no errors with version 4.1 of Visual Profiler, although in Windows I compile and link against NVIDIA's OpenCL implementation.<br /><br />I am not sure whether that is the problem I am running into, but if it is, I would like to know how to solve it. I have searched Google and found nothing.<br /><br />Thank you very much.]]></description>
   </item>
      <item>
      <title>NV Controller stop responding during OpenCL kernel execution.</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1531/nv-controller-stop-responding-during-opencl-kernel-execution-</link>
      <pubDate>Tue, 15 Nov 2011 06:09:06 -0500</pubDate>
      <dc:creator>Ffelagund</dc:creator>
      <guid isPermaLink="false">1531@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br /><br />We are starting with OpenCL and are writing some sample applications to learn how it works under the hood. In one execution, the device driver stopped responding and restarted (not the machine, only the device driver).<br />I've attached the VS2008 solution that makes the application crash, as well as a screen capture taken at the moment of the problem (sorry, it is in Spanish).<br /><br />We are using an NV 525M GT with driver version 285.62 in a laptop with 64-bit Windows 7.<br /><br />Any clue to find out why this happens will be really appreciated :)<br /><br />Regards,<br />Jacobo.]]></description>
   </item>
      <item>
      <title>opencl application is portable cuda for intel and amd?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/2076/opencl-application-is-portable-cuda-for-intel-and-amd</link>
      <pubDate>Fri, 02 Dec 2011 13:18:08 -0500</pubDate>
      <dc:creator>silviocassiano</dc:creator>
      <guid isPermaLink="false">2076@/devforum/discussions</guid>
      <description><![CDATA[Good afternoon. This is the first time I have posted, and I am a newbie in this world. Could someone from the forum tell me whether it is possible to run an OpenCL application developed using the Intel SDK on NVIDIA, Radeon, and Intel architectures, or whether an application developed using the NVIDIA SDK will run on Intel, AMD, and Radeon architectures?<br /><br />My e-mail is silvio.cassiano@hotmail.com<br /><br />Thank you.<br /><br />]]></description>
   </item>
      <item>
      <title>floating point precision loss on gpu in opencl</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3346/floating-point-precison-loss-on-gpu-in-opencl</link>
      <pubDate>Thu, 12 Jan 2012 00:14:03 -0500</pubDate>
      <dc:creator>AbhijitDalvi</dc:creator>
      <guid isPermaLink="false">3346@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br />I am getting precision loss while working on floating point at 6th or 7th decimal point. I am doing simple addition operation of 64 values on CPU and GPU.<br />please see below for input values<br /><br />I Observed that if I am doing bellows operation<br />CPU Method for addition:<br />for(i=0;i&lt;64;i++)<br />{<br />sum += input[i];<br />}<br />I am getting sum value as " 31.637564 "<br /><br />GPU Method for addition using OPENCL 1.0:<br />And if I using vector addition (float8) I am getting result as " 31.637560 "<br /><br /><a href="/devforum/search?Search=%23pragma&amp;Mode=like">#pragma</a> OPENCL EXTENSION cl_khr_fp64: enable<br />__kernel void test_dist(__global float8 *gpu_output,__global float8 *gpu_input, __global float *temp_buf)<br />{<br /><br />//vector addition<br />gpu_output[0] = gpu_input[0]+gpu_input[1]+gpu_input[2]+gpu_input[3]+<br />gpu_input[4]+gpu_input[5]+ gpu_input[6]+gpu_input[7];<br /><br />//scalar addition<br />barrier(CLK_GLOBAL_MEM_FENCE);<br />temp_buf[0] = gpu_output[0].s0+gpu_output[0].s1+gpu_output[0].s2+gpu_output[0].s3+gpu_output[0].s4+gpu_output[0].s5+gpu_output[0].s6+gpu_output[0].s7;<br /><br /><br />}<br /><br />Anyone please tell me why this is happening<br />input values are below (randomly generated):<br />0.481918<br />0.963439<br />0.134709<br />0.630177<br />0.461104<br />0.868496<br />0.095309<br />0.669088<br />0.147649<br />0.588305<br />0.575121<br />0.513291<br />0.750877<br />0.275674<br />0.206702<br />0.056642<br />0.361522<br />0.901700<br />0.869289<br />0.463973<br />0.653462<br />0.438917<br />0.359844<br />0.171880<br />0.485275<br />0.734458<br />0.449446<br />0.500961<br />0.996734<br />0.041017<br />0.167516<br />0.097964<br />0.292367<br />0.675222<br />0.658773<br />0.927183<br />0.307413<br />0.770959<br />0.456038<br />0.028596<br />0.948546<br />0.598041<br />0.512803<br />0.763482<br />0.104038<br />0.413373<br />0.839412<br />0.764000<br />0.617084<br 
/>0.314493<br />0.086306<br />0.307260<br />0.814997<br />0.063601<br />0.918607<br />0.331980<br />0.689566<br />0.351146<br />0.525010<br />0.816004<br />0.321238<br />0.467574<br />0.105716<br />0.734275]]></description>
   </item>
      <item>
      <title>OpenCL: Program crash when copying data onto device</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3371/opencl-program-crash-when-copying-data-onto-device</link>
      <pubDate>Thu, 12 Jan 2012 08:45:16 -0500</pubDate>
      <dc:creator>BaiLong</dc:creator>
      <guid isPermaLink="false">3371@/devforum/discussions</guid>
      <description><![CDATA[Hey there,<br /><br />I'm completely new to OpenCL and I can't find a thread to my issue.<br /><br />My situation is: I have a single method in which I create a kernel, run and delete it. (Maybe not the smartest way, but that's only for the first tests whether OpenCL is the right way to go) The method itself is called several times (currently approx. 30), but so far I don't access any data after I free the memory.<br />My problem is: My program crashes with no error but that appears to be at random. Sometimes it crashes on the first execution of <strong>move</strong>, sometimes it runs through nicely and sometimes it crashes in between. I don't get any error code (except code 0) or crash message, the program window just closes and that's it. I only can tell it's <strong>always</strong> when copying data onto my graphics card.<br /><br />I really tried everything I could imagine! :(<br />Hopefully that's just a beginners issue.<br /><br />Here is the host method:<br /><code>      void move(<br />         const float *srcPos, float *destPos,<br />         size_t numVertices)<br />      {         <br />         cl_context context = 0;<br />         cl_command_queue commandQueue = 0;<br />         cl_program program = 0;<br />         cl_device_id device = 0;<br />         cl_kernel kernel = 0;<br />         cl_int errNum;<br /><br />         // Create an OpenCL context on first available platform<br />         context = CreateContext();<br /><br />         // Create a command-queue on the first device available<br />         // on the created context<br />         commandQueue = CreateCommandQueue(context, &amp;device);<br /><br />         char* kernelPath = "move.cl";         <br /><br />         program = CreateProgram(context, device, kernelPath);<br /><br />         // Create OpenCL kernel<br />         kernel = clCreateKernel(program, "move_kernel", NULL);         <br /><br />         // Create memory objects that will be used as 
arguments to kernel.<br />         float* result = (float*)malloc(numVertices * sizeof(float));<br /><br />         cl_mem d_result = clCreateBuffer(context, CL_MEM_WRITE_ONLY,<br />                                 sizeof(float) * numVertices, NULL, &amp;errNum);   <br /><br />         cl_mem d_srcPos = clCreateBuffer( context, CL_MEM_READ_ONLY,<br />                                 sizeof(float) * numVertices * 3, (void*)srcPos, &amp;errNum);      <br /><br />         //#########################<br />         // This is the call that causes the crash<br />         //#########################<br />         errNum = clEnqueueWriteBuffer(commandQueue, d_srcPos, CL_TRUE, 0, sizeof(float) * numVertices * 3,<br />                           (void*)srcPos, 0, NULL, NULL);<br />         //#########################<br />         if (errNum != CL_SUCCESS)<br />         {<br />            std::cerr &lt;&lt; "Error setting kernel argument." &lt;&lt; std::endl;<br />            std::cout &lt;&lt; "Error code: " &lt;&lt; errNum &lt;&lt; std::endl;   <br />            std::getchar();<br />            return;<br />         }<br /><br />         cl_mem d_destPos = clCreateBuffer( context, CL_MEM_WRITE_ONLY,<br />                                 sizeof(float) * numVertices * 3, NULL, &amp;errNum);<br /><br />         errNum = clSetKernelArg(kernel, 0, sizeof(cl_mem), &amp;d_srcPos);<br />         errNum |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &amp;d_destPos);<br />         errNum |= clSetKernelArg(kernel, 2, sizeof(int), &amp;numVertices);<br />         errNum |= clSetKernelArg(kernel, 3, sizeof(cl_mem), &amp;d_result);<br /><br />         if (errNum != CL_SUCCESS)<br />         {<br />            std::cerr &lt;&lt; "Error setting kernel arguments." 
&lt;&lt; std::endl;<br />            std::cout &lt;&lt; "Error code: " &lt;&lt; errNum &lt;&lt; std::endl;<br />            Cleanup(context, commandQueue, program, kernel, memObjects);<br />            std::getchar();<br />            return;<br />         }<br /><br />         size_t localWorkSize[1] = { 1 };<br />         size_t globalWorkSize[1] = { numVertices };<br /><br />         // Queue the kernel up for execution across the array<br />         errNum = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL,<br />                                 globalWorkSize, localWorkSize,<br />                                 0, NULL, NULL);<br /><br />         // Read the output buffer back to the Host      <br />         errNum = clEnqueueReadBuffer(commandQueue, d_result, CL_TRUE,<br />                               0, numVertices * sizeof(float), result,<br />                               0, NULL, NULL);<br /><br />         std::cout &lt;&lt; "Executed program succesfully." &lt;&lt; std::endl;<br /><br />         free(result);<br /><br />         clReleaseMemObject(d_result);<br />         clReleaseMemObject(d_srcPos);<br />         clReleaseMemObject(d_destPos);<br /><br />         clReleaseCommandQueue(commandQueue);<br /><br />         clReleaseKernel(kernel);<br /><br />         clReleaseProgram(program);<br /><br />         clReleaseContext(context);<br />}</code><br /><br />The kernel so far looks like this:<br /><code>__kernel void move_kernel(__global const float *d_srcPos,<br />                                 __global float *d_destPos,<br />                                 int numVertices,<br />                                 __global float *d_result<br />                                 )<br />{   <br />        int gid = get_global_id(0);   <br /><br />        d_result[gid] = gid;<br />}</code><br /><br />I also tried to use <code>cl_mem d_srcPos = clCreateBuffer( context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float) * numVertices * 3, 
(void*)srcPos, &amp;errNum);</code> instead, but that causes the same behavior.<br /><br />A lot of the code is still from some example file, which worked perfectly (even with data transfer), but since I added this mem-copy function I get the described crashes.<br /><br />Since I'm still working on it, the variable <strong>result</strong> is just a test variable, that I sometimes use to output values (e.g. <strong><em>get_global_id()</em></strong>, or such).<br />For the sake of readability I also deleted all the error handling like <code>if(kernel == NULL)</code>. Therefore you have to believe me I really tested everything and the only thing that causes the crash is the marked function call.<br /><br />Does anyone have an idea what could cause my Problem?<br /><br />Thanks in advance!<br /><br />Cheers,<br />--Markus<br /><br />My Device:<br />Quadro 5000<br />Driver 285.58<br /><br />PS: I also posted in the khronos forum, which gave me the idea it could also be some driver related issue.]]></description>
   </item>
      <item>
      <title>Accessing GPU Memory from Operating System Kernel</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4006/accessing-gpu-memory-from-operating-system-kernel</link>
      <pubDate>Thu, 26 Jan 2012 09:45:30 -0500</pubDate>
      <dc:creator>FrankFeinbube</dc:creator>
      <guid isPermaLink="false">4006@/devforum/discussions</guid>
      <description><![CDATA[Hi everyone,<br /><br />We are working on a research prototype where we intend to use the GPU memory for a "GPU RAM Disk".<br />At the moment we have a user-mode module that uses OpenCL to copy data to and from the GPU memory.<br /><br />We would prefer a way of accessing the GPU memory directly from the Windows kernel (thereby reducing the overhead introduced, for example, by the context switch). What is the best way to achieve this?<br /><br />Best regards,<br />Frank]]></description>
   </item>
      <item>
      <title>GPU Accelerated 2D to Stereo 3D Video Conversion</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3906/gpu-accelerated-2d-to-stereo-3d-video-conversion</link>
      <pubDate>Tue, 24 Jan 2012 16:43:20 -0500</pubDate>
      <dc:creator>DryRiver</dc:creator>
      <guid isPermaLink="false">3906@/devforum/discussions</guid>
      <description><![CDATA[Hello All,<br /><br />I have written a pretty good 2D-to-3D video conversion algorithm in C# .NET. (It took a little over 2 years of experimenting to get it right.)<br /><br />I now want to GPU-accelerate this 2D-to-3D conversion algorithm. I am hoping for a 10x to 20x speedup by using the GPU to do the pixel crunching instead of the CPU.<br /><br />My requirements are:<br /><br />- The GPU code needs to execute inside a C# .NET Windows Forms application<br /><br />- I want to use the easiest, most beginner-friendly GPU coding method possible<br /><br />Where should I start with this? CUDA.NET? OpenCL.NET? Brahma (for C#)?<br /><br />Are there any beginners' tutorials for using CUDA/OpenCL inside C# .NET?<br /><br />Are there, specifically, any image processing tutorials/examples for CUDA/OpenCL?<br /><br />Thank you for any feedback. I am a complete CUDA/OpenCL noob and am hoping for expert advice on making my first GPU-accelerated project happen.<br /><br />Best Regards,<br /><br />DryRiver<br />]]></description>
   </item>
      <item>
      <title>linker errors while executing opencl sample codes</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3421/linker-errors-while-executing-opencl-sample-codes</link>
      <pubDate>Fri, 13 Jan 2012 06:10:44 -0500</pubDate>
      <dc:creator>Prasanna</dc:creator>
      <guid isPermaLink="false">3421@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br />I am new to running OpenCL code. I have downloaded the GPU Computing SDK and drivers and am trying to run the OpenCL samples from it. I have included all the .lib files that ship with OpenCL in the SDK. When I build the project, I get the following errors in Visual Studio 2010:<br /><br />1&gt;------ Build started: Project: testopencl, Configuration: Debug Win32 ------<br />1&gt; Skipping... (no relevant changes detected)<br />1&gt; testopencl.cpp<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _shrComparefet referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clEnqueueReadBuffer@36 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clEnqueueNDRangeKernel@36 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clEnqueueWriteBuffer@36 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clSetKernelArg@16 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clCreateKernel@12 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clBuildProgram@24 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clCreateProgramWithSource@20 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _oclLoadProgSource referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _shrFindFilePath referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clCreateBuffer@24 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clCreateCommandQueue@20 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clCreateContext@24 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clGetDeviceIDs@24 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clGetPlatformIDs@12 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _shrFillArray referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _shrRoundUp referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _shrLog referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _shrSetLogFileName referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _shrCheckCmdLineFlag referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clReleaseMemObject@4 referenced in function "void __cdecl Cleanup(int,char * *,int)" (?Cleanup@@YAXHPAPADH@Z)<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clReleaseContext@4 referenced in function "void __cdecl Cleanup(int,char * *,int)" (?Cleanup@@YAXHPAPADH@Z)<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clReleaseCommandQueue@4 referenced in function "void __cdecl Cleanup(int,char * *,int)" (?Cleanup@@YAXHPAPADH@Z)<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clReleaseProgram@4 referenced in function "void __cdecl Cleanup(int,char * *,int)" (?Cleanup@@YAXHPAPADH@Z)<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clReleaseKernel@4 referenced in function "void __cdecl Cleanup(int,char * *,int)" (?Cleanup@@YAXHPAPADH@Z)<br />1&gt;C:\Users\Acer\Documents\Visual Studio 2010\Projects\testopencl\Debug\testopencl.exe : fatal error LNK1120: 25 unresolved externals<br />========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========<br /><br /><br />I am using 64-bit Windows 7 with an NVIDIA graphics card. I would be very grateful if anyone could suggest a solution to this problem.<br />Thank you.]]></description>
   </item>
      <item>
      <title>Memory Leak</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3541/memory-leak</link>
      <pubDate>Tue, 17 Jan 2012 06:29:49 -0500</pubDate>
      <dc:creator>ti6csb</dc:creator>
      <guid isPermaLink="false">3541@/devforum/discussions</guid>
      <description><![CDATA[Hi there,<br /><br />I am using the NVIDIA SDK 4.0 on 64-bit Ubuntu 10.10 with kernel version 2.6.35-30-generic.<br /><br />The OpenCL function clEnqueueNDRangeKernel() appears to leak memory when the 5th and 6th parameters (the global and local work sizes) are not arrays but single variables passed by address via &amp;. In that case valgrind reports a memory leak. The leak disappears when both variables are arrays.<br /><br />Best regards,<br />Cem]]></description>
   </item>
      <item>
      <title>Mip-map texture support - Vote for this feature and discuss!</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3376/mip-map-texture-support-vote-for-this-feature-and-discuss</link>
      <pubDate>Thu, 12 Jan 2012 10:23:06 -0500</pubDate>
      <dc:creator>Crog</dc:creator>
      <guid isPermaLink="false">3376@/devforum/discussions</guid>
      <description><![CDATA[Hi, please vote on this discussion if you want mip-map support added to CUDA. If enough of the community shows interest, NVIDIA may investigate this. The hardware is known to support the feature, so we just need CUDA to give us access to it.<br /><br />We want this because we are using CUDA and OptiX to perform ray tracing, and texture aliasing is a major issue in larger-scale/real scenes. We are limited to emulating mip-mapping in software ourselves. The interface for mip-maps already exists in OptiX; there just hasn't been enough demand from the community for the CUDA development team to implement it yet.<br /><br />Also, if anybody has ideas for optimizing a software mip-map implementation, they can discuss them here for others to find. For example, how do you efficiently pack the levels in memory and then access these locations with as few instructions as possible?<br /><br />Cheers,<br />Craig]]></description>
   </item>
      <item>
      <title>OpenCL: what is the best way to implement an atomic add with an upper bound (clamp) to prevent rollover?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3521/opencl-what-is-best-way-to-implement-an-atomic-add-with-upper-bound-clamp-to-prevent-roll-over</link>
      <pubDate>Mon, 16 Jan 2012 17:43:16 -0500</pubDate>
      <dc:creator>connertorcroboticscom</dc:creator>
      <guid isPermaLink="false">3521@/devforum/discussions</guid>
      <description><![CDATA[I need to increment a uint by a calculated amount "inc" between 0 and UINT_MAX. These calculations occur in parallel, and several threads may access the same value.<br /><br />I can do this with atom_add(pmap, inc), but there is a distinct possibility that the value may wrap.<br /><br />Is there an efficient way to prevent the rollover by clamping the value?<br /><br />Currently I increment the value, then retrieve the value and test that it is &gt; inc.<br />If I detect rollover, I set the value to UINT_MAX.<br /><br />This would be simple if atom_add() returned the new value, but atom_add() returns the "old" value.<br /><br />As it is, this two-step process is not guaranteed to be correct, since another thread can modify the value between the two steps.<br /><br />Is there a good and computationally efficient way to do what I need?<br /><br />Thanks]]></description>
   </item>
      <item>
      <title>GPU Computing SDK: problem with the OpenCL n-body simulation example.</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/2116/gpu-computing-sdk-problem-with-the-opencl-n-body-simulation-example-</link>
      <pubDate>Sun, 04 Dec 2011 08:41:50 -0500</pubDate>
      <dc:creator>Bekos</dc:creator>
      <guid isPermaLink="false">2116@/devforum/discussions</guid>
      <description><![CDATA[Hello everyone!<br /><br />I have a question regarding the n-body OpenCL simulation in the NVIDIA GPU Computing SDK. First of all, I apologize if I am asking something very silly; my knowledge of physics and n-body simulation is not very good yet. I was checking the CPU version of the n-body algorithm in the file "oclBodySystemCPU.cpp". My question is about the function void BodySystemCPU::_integrateNBodySystem(float). At line 165, this function calculates the velocity of the particle at the end of the interval, and then uses this velocity to calculate the position of the particle at the end of the interval. Isn't this wrong? I thought the correct approach is to calculate the position using the velocity of the particle at the end of the previous interval, and that the velocity calculated at line 165 should be used for the next interval. Am I missing something here? Thanks a lot for your time.<br /><br />Cheers,<br />Bekos<br />]]></description>
   </item>
      <item>
      <title>Does cuda-gdb work with OpenCL?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3211/does-cuda-gdb-work-with-opencl</link>
      <pubDate>Mon, 09 Jan 2012 18:11:52 -0500</pubDate>
      <dc:creator>connertorcroboticscom</dc:creator>
      <guid isPermaLink="false">3211@/devforum/discussions</guid>
      <description><![CDATA[I'm developing an OpenCL application on an NVIDIA processor, and would like to debug and profile the code. I've tried gDEBugger for Windows, but it seems to hang a lot.<br /><br />The code is cross-platform, so I'm curious whether the cuda-gdb tool set will help with debugging and profiling OpenCL code on a Linux (Ubuntu 10.04) platform. Does it provide useful information for OpenCL programs?<br /><br />Parallel Nsight looked promising, but I'm not using Visual Studio Professional.]]></description>
   </item>
      <item>
      <title>New Developer&#039;s Driver 285.67 OpenCL Problem: command queues are not started</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/2131/new-developers-driver-285-67-opencl-problem-command-queues-are-not-started</link>
      <pubDate>Mon, 05 Dec 2011 09:21:26 -0500</pubDate>
      <dc:creator>kontakt2</dc:creator>
      <guid isPermaLink="false">2131@/devforum/discussions</guid>
      <description><![CDATA[Hi everyone,<br /><br />I am developing an application that uses NVIDIA's OpenCL implementation. Until yesterday I had the old developer's driver (270.81) installed, and everything ran smoothly.<br />Now, with the new driver (285.67), when I enqueue a kernel in a command queue, the command is never sent to the device; it stays at status CL_QUEUED.<br /><br />To give a few details:<br />In my main thread I initialize the OpenCL runtime (platform, a command queue, etc.), and in that thread I create buffers, transfer data to the buffers, and so on. So far everything works fine.<br />But when I call clEnqueueNDRangeKernel(...) and pass a cl_event to check the status of the kernel execution, the event always stays in the state CL_QUEUED and never finishes.<br /><br />When I reinstall the old driver, everything works fine again!<br /><br />Has anyone had similar problems?]]></description>
   </item>
      <item>
      <title>Visual Profiler 4.0 Problem : Rows dropped</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/2136/visual-profiler-4-0-problem-rows-dropped</link>
      <pubDate>Mon, 05 Dec 2011 09:24:53 -0500</pubDate>
      <dc:creator>kontakt2</dc:creator>
      <guid isPermaLink="false">2136@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br /><br />I have a problem with the Visual Profiler. It has persisted for a long time, and I wanted to check whether someone here has a solution:<br /><br />I have an OpenCL application that I want to profile. When I start the Visual Profiler 4.0 and profile my application, I always get an error about dropped rows. There are some posts about this on the internet, but none of the suggestions worked for me, and most of the posts don't even have a solution.<br />Do you have the same issues with the Profiler?<br /><br />Thanks for your help]]></description>
   </item>
      <item>
      <title>OpenCL C++ Bindings</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/2086/opencl-c-bindings</link>
      <pubDate>Fri, 02 Dec 2011 14:59:16 -0500</pubDate>
      <dc:creator>robloblaw</dc:creator>
      <guid isPermaLink="false">2086@/devforum/discussions</guid>
      <description><![CDATA[I'm new to using OpenCL with my NVIDIA GTX 260 card, so please bear with me...<br /><br />I installed the CUDA Toolkit and the newest graphics driver on my Ubuntu 11.10 box and have OpenCL up and running using the C version of OpenCL (using the header file cl.h). I was looking at the Khronos Group's website and noticed they have C++ bindings for OpenCL. I read somewhere that I should simply place the cl.hpp file in the CL directory with the other OpenCL header files, but g++ gave me the following error:<br /><br />/usr/local/cuda/include/CL/cl.hpp:160:19: fatal error: GL/gl.h: No such file or directory<br />compilation terminated.<br />make: *** [smith-waterman] Error 1<br /><br />Are there some dependencies (e.g. OpenGL dependencies) that I need to download to enable C++ support?]]></description>
   </item>
      <item>
      <title>Weird OpenCL texture behaviour when using write_imagef</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1606/weird-texture-opencl-behaviour-by-using-write_imagef</link>
      <pubDate>Thu, 17 Nov 2011 11:55:04 -0500</pubDate>
      <dc:creator>florisdesmedt</dc:creator>
      <guid isPermaLink="false">1606@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br /><br />To learn how to work with texture memory, I have created a demo application.<br /><br /><code><br />Kernel Code:<br />__kernel void testimage(__write_only image2d_t outputImage, const int width, const int height, sampler_t Sam,__read_only image2d_t InputImage){<br /><br />        int x = get_global_id(0);<br />        int y = get_global_id(1);<br /><br />        if(x &lt; width &amp;&amp; y &lt; height ){<br />                int2 coord = (int2)(x,y);<br />                float2 coord3 = (float2)(x,y);<br />                int2 coord2 = (int2)(x*2,y);<br />                float4 I = read_imagef(InputImage,Sam,coord);<br /><br />                write_imagef(outputImage,coord2,I);<br />        }<br />}<br /><br />Host Code:<br /><a href="/devforum/search?Search=%23include&amp;Mode=like">#include</a> &lt;oclUtils.h&gt;<br /><a href="/devforum/search?Search=%23include&amp;Mode=like">#include</a> &lt;iostream&gt;<br /><a href="/devforum/search?Search=%23include&amp;Mode=like">#include</a> &lt;stdlib.h&gt;<br /><a href="/devforum/search?Search=%23include&amp;Mode=like">#include</a> "highgui.h"<br /><br />using namespace std;<br /><br />int main(int argc, char** argv){<br />	 cout &lt;&lt; "Starting program ..." 
&lt;&lt; endl &lt;&lt; flush;<br /><br />//initialisation<br />	 cl_int error = 0;<br />	 cl_context context;<br />	 cl_command_queue queue;<br />	 cl_device_id device;<br /><br />FILE *fp;<br />long filelen, readlen;<br />char* kernelSource;<br /><br />int W=10, H=10; //width and height of image in pixels<br /><br />// Platform<br />cl_platform_id platform[2];<br /><br />cl_device_id devices[4];<br />cl_uint nplat;<br /><br />//Platform<br />error = clGetPlatformIDs(2,platform,&amp;nplat);<br />if (error != CL_SUCCESS) {<br />   cout &lt;&lt; "Error getting platform id: "  &lt;&lt; endl&lt;&lt; flush;<br />   exit(error);<br />}<br /><br />// Device<br />cl_uint ndevices;<br />error = clGetDeviceIDs(platform[1], CL_DEVICE_TYPE_GPU, 4, devices, &amp;ndevices);<br />if (error != CL_SUCCESS) {<br />   cout &lt;&lt; "Error getting device ids: " &lt;&lt; endl&lt;&lt; flush;<br />   exit(error);<br />}<br /><br />//context<br />int numberOfDevices = 1; //amount of devices to use<br />context = clCreateContext(NULL, numberOfDevices, &amp;devices[3], NULL, NULL, &amp;error);<br />if (error != CL_SUCCESS) {<br />   cout &lt;&lt; "Error creating context: " &lt;&lt; endl&lt;&lt; flush;<br />   exit(error);<br />}<br /><br />//command queue<br />queue = clCreateCommandQueue(context, devices[3], 0, &amp;error);<br />if (error != CL_SUCCESS) {<br />   cout &lt;&lt; "Error creating command queue: "  &lt;&lt; endl&lt;&lt; flush;<br />   exit(error);<br />}<br /><br /><br />cl_image_format form2;<br />form2.image_channel_order = CL_RGBA;<br />form2.image_channel_data_type = CL_FLOAT;<br /><br />const size_t ori[3] = {0,0,0};<br />const size_t reg[3] = {W,H,1}; <br /><br />cl_mem Afbeelding2 = clCreateImage2D(context,CL_MEM_READ_WRITE,&amp;form2,W,H,0,NULL,&amp;error); <br />cl_mem Afbeelding = clCreateImage2D(context,CL_MEM_READ_WRITE,&amp;form2,W,H,0,NULL,&amp;error); <br /><br />float Geh[W*H*4]; //input data of image<br />//fill array of input data<br />for(int 
Q=0;Q&lt;W*H*4;Q++)<br />	Geh[Q] = Q;<br /><br />clEnqueueWriteImage(queue,Afbeelding, CL_TRUE, ori, reg, 0,0,Geh,0,NULL,NULL);<br /><br />fprintf(stderr,"End creating store on GPU ...\n");<br />fprintf(stderr,"Give kernel to execute ...\n");<br /><br />//reading kernel code<br />fp = fopen("kernel.cl","r");<br />fseek(fp,0L,SEEK_END);<br />filelen = ftell(fp);<br />rewind(fp);<br /><br />kernelSource = (char*)malloc(sizeof(char)*filelen+1);<br />readlen = fread(kernelSource,1,filelen,fp);<br />if(readlen != filelen){<br />        cout &lt;&lt; "PROBLEM READING KERNEL" &lt;&lt; endl;<br />}<br />kernelSource[filelen] = '\0';<br /><br />cl_program prog = clCreateProgramWithSource(context,1, (const char**)&amp;kernelSource, NULL, &amp;error);<br />assert(error == CL_SUCCESS);<br />fprintf(stderr,"Start building GPU program ...\n");<br /><br />device = devices[3];<br /><br />//create program<br />error = clBuildProgram(prog, 1, &amp;device, NULL,NULL,NULL);<br />if(error == CL_INVALID_PROGRAM)<br />	cout &lt;&lt; "Invalid program" &lt;&lt; endl&lt;&lt; flush;<br />else if(error == CL_INVALID_VALUE)<br />	cout &lt;&lt; "device list is NULL and num_devices is greater than zero, or num_devices is zero, or user data is not NULL but was passed as NULL" &lt;&lt; endl&lt;&lt; flush;<br />else if(error == CL_INVALID_DEVICE)<br />	cout &lt;&lt; "No valid device" &lt;&lt; endl&lt;&lt; flush;<br />else if(error == CL_INVALID_BUILD_OPTIONS)<br />	cout &lt;&lt; "No valid build options" &lt;&lt; endl&lt;&lt; flush;<br />else if (error == CL_INVALID_OPERATION)<br />	cout &lt;&lt; "CL_INVALID_OPERATION"&lt;&lt; endl&lt;&lt; flush;<br />else if(error == CL_COMPILER_NOT_AVAILABLE)<br />	cout &lt;&lt; "CL_COMPILER_NOT_AVAILABLE" &lt;&lt;endl&lt;&lt; flush;<br />else if(error == CL_BUILD_PROGRAM_FAILURE)<br />	cout &lt;&lt; "CL_BUILD_PROGRAM_FAILURE" &lt;&lt; endl&lt;&lt; flush;<br /><br />if(error != CL_SUCCESS){<br />	char* buildLog;<br />	size_t log_size;<br />	
clGetProgramBuildInfo(prog,device, CL_PROGRAM_BUILD_LOG, 0, NULL, &amp;log_size);<br />	buildLog = new char[log_size+1];<br />	clGetProgramBuildInfo(prog, device, CL_PROGRAM_BUILD_LOG, log_size, buildLog, NULL);<br />	buildLog[log_size] = '\0'; <br />	printf("BuildLog:\n-----------\n%s",buildLog);<br />	delete[] buildLog;<br />}<br />assert(error == CL_SUCCESS);<br /><br />fprintf(stderr,"GPU program created ...\n");<br />fprintf(stderr,"Creating kernel ...\n");<br /><br />//Kernel to use<br />cl_kernel kern = clCreateKernel(prog, "testimage", &amp;error);<br />assert(error == CL_SUCCESS);<br /><br />fprintf(stderr,"Kernel created ...\n");<br />fprintf(stderr,"Create sampler object ...\n");<br />cl_sampler Samp = clCreateSampler(context, CL_FALSE, CL_ADDRESS_CLAMP_TO_EDGE, CL_FILTER_NEAREST, &amp;error);<br />assert(error == CL_SUCCESS);<br />fprintf(stderr,"Sampler created ...\n");<br />fprintf(stderr,"Give arguments of kernel ...\n");<br /><br />error  = clSetKernelArg(kern, 0, sizeof(cl_mem), &amp;Afbeelding2);<br />error |= clSetKernelArg(kern, 1, sizeof(int), &amp;W);<br />error |= clSetKernelArg(kern, 2, sizeof(int), &amp;H);<br />error |= clSetKernelArg(kern, 3, sizeof(cl_sampler), &amp;Samp);<br />error |= clSetKernelArg(kern, 4, sizeof(cl_mem), &amp;Afbeelding);<br /><br />if(error == CL_INVALID_KERNEL){<br />	cout &lt;&lt; "CL_INVALID_KERNEL" &lt;&lt; endl;<br />}<br />if(error == CL_INVALID_ARG_INDEX){<br />	cout &lt;&lt; "CL_INVALID_ARG_INDEX" &lt;&lt; endl;<br />}<br />if(error == CL_INVALID_ARG_VALUE){<br />	cout &lt;&lt; "CL_INVALID_ARG_VALUE" &lt;&lt; endl;<br />}<br />if(error == CL_INVALID_MEM_OBJECT){<br />	cout &lt;&lt; "CL_INVALID_MEM_OBJECT" &lt;&lt; endl;<br />}<br />if(error == CL_INVALID_SAMPLER){<br />	cout &lt;&lt; "CL_INVALID_SAMPLER" &lt;&lt; endl;<br />}<br />if(error == CL_INVALID_ARG_SIZE){<br />	cout &lt;&lt; "CL_INVALID_ARG_SIZE" &lt;&lt; endl;<br />}<br /><br />assert(error == CL_SUCCESS);<br /><br />const size_t local_ws[] = 
{16,16};<br />const size_t global_ws[] = {W,H};<br />fprintf(stderr,"Execute kernel ...\n");<br /><br />//Execute kernel<br />error = clEnqueueNDRangeKernel(queue,kern, 2, NULL, global_ws, local_ws, 0, NULL, NULL);<br />if(error == CL_INVALID_PROGRAM_EXECUTABLE)<br />	cout &lt;&lt; "CL_INVALID_PROGRAM_EXECUTABLE" &lt;&lt;endl;<br />if(error == CL_INVALID_COMMAND_QUEUE)<br />	cout &lt;&lt; "CL_INVALID_COMMAND_QUEUE" &lt;&lt;endl;<br />if(error == CL_INVALID_KERNEL)<br />	cout &lt;&lt; "CL_INVALID_KERNEL" &lt;&lt;endl;<br />if(error == CL_INVALID_CONTEXT)<br />	cout &lt;&lt; "CL_INVALID_CONTEXT" &lt;&lt;endl;<br />if(error == CL_INVALID_KERNEL_ARGS)<br />	cout &lt;&lt; "CL_INVALID_KERNEL_ARGS" &lt;&lt;endl;<br />if(error == CL_INVALID_WORK_DIMENSION)<br />	cout &lt;&lt; "CL_INVALID_WORK_DIMENSION" &lt;&lt;endl;<br />if(error == CL_INVALID_GLOBAL_WORK_SIZE)<br />	cout &lt;&lt; "CL_INVALID_GLOBAL_WORK_SIZE" &lt;&lt;endl;<br />if(error == CL_INVALID_WORK_GROUP_SIZE)<br />	cout &lt;&lt; "CL_INVALID_WORK_GROUP_SIZE" &lt;&lt;endl;<br />if(error == CL_INVALID_WORK_ITEM_SIZE)<br />	cout &lt;&lt; "CL_INVALID_WORK_ITEM_SIZE" &lt;&lt;endl;<br />if(error == CL_INVALID_GLOBAL_OFFSET)<br />	cout &lt;&lt; "CL_INVALID_GLOBAL_OFFSET" &lt;&lt;endl;<br />assert(error == CL_SUCCESS);<br /><br />fprintf(stderr,"Kernel execution finished ...\n");<br />fprintf(stderr,"Read results from GPU ...\n");<br />cl_float *h_img = (cl_float*) malloc(W*H*sizeof(cl_float)*4);<br />clEnqueueReadImage(queue,Afbeelding2,CL_TRUE,ori,reg,4*sizeof(cl_float)*W,0,h_img,NULL,0,NULL);<br />fprintf(stderr,"Reading of results finished.. 
...\n");<br /><br />fprintf(stderr, "Printing Results...\n");<br />for(int j=0;j&lt;H;j++){<br />	for(int i=0;i&lt;W;i++){<br />		for(int B=0;B&lt;4;B++)<br />			printf("%f ",h_img[j*W*4+i*4+B]);<br />		printf("\n");<br />	}<br />	printf("\n");<br />}<br /><br />fprintf(stderr,"Free gpu memory ...\n");<br />free(h_img);<br />clReleaseKernel(kern);<br />clReleaseCommandQueue(queue);<br />clReleaseContext(context);<br />clReleaseMemObject(Afbeelding);<br />clReleaseMemObject(Afbeelding2);<br />clReleaseSampler(Samp);<br /><br />fprintf(stderr,"Finishing program ...\n");<br /><br />return 0;<br />}<br /></code><br /><br />I have to double the x-coordinate to prevent the contents of pixels from being overwritten by the next line. I only experience this behaviour on Ubuntu Linux, with different versions of the toolkit and different drivers. The same code has been tested on Windows systems, on both NVIDIA and AMD hardware. Only on the Linux configuration must the x-coordinate be doubled.<br /><br />Does anyone have an explanation or solution for this behaviour?<br /><br />Greetings]]></description>
   </item>
      </channel>
</rss>
