<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
      <title>Tagged with cuda-sdk - NVIDIA Developer Forums</title>
      <link>http://forums.developer.nvidia.com/devforum/discussions/tagged/cuda-sdk/feed.rss</link>
      <pubDate>Wed, 16 May 2012 17:31:08 -0400</pubDate>
      <description>Tagged with cuda-sdk - NVIDIA Developer Forums</description>
      <language>en-CA</language>
      <atom:link href="/devforum/discussions/tagged/cuda-sdk/feed.rss" rel="self" type="application/rss+xml" />
   <item>
      <title>cudaFree returning cudaErrorMemoryAllocation - bug?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8181/cudafree-returning-cudaerrormemoryallocation-bug</link>
      <pubDate>Mon, 14 May 2012 13:38:35 -0400</pubDate>
      <dc:creator>ajsimmonds</dc:creator>
      <guid isPermaLink="false">8181@/devforum/discussions</guid>
      <description><![CDATA[I have been encountering a strange problem using the CUDA 4.2 tools where our application eventually receives a cudaErrorMemoryAllocation error when trying to perform a cudaFree on cudaMalloc'd memory. The number of allocations and deallocations performed varies, but the problem can be reproduced in the app relatively easily. Once the error has been received, further frees and calls to cudaMemGetInfo continue to return it.<br /><br />To further narrow down the error I have also written a test program that simply allocates areas using cudaMalloc and, when this returns an out of memory error, releases one or more of the previously allocated areas to make space. This program, which launches no kernels, fails with the same symptoms. I have also tried this with the 4.0 tools and still receive the same error condition.<br /><br />If I limit the number of iterations such that the error is not encountered, then it is quite likely that the free memory value returned by cudaMemGetInfo is larger than the value the program started with.<br /><br />This looks for all the world to me as if there is a problem with the tracking of memory allocations within the SDK, so can anyone confirm this or possibly point me at things I may be doing wrong?<br /><br />Many thanks<br />Andrew]]></description>
   </item>
      <item>
      <title>simpleStreams example in SDK not working</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8101/simplestreams-example-in-sdk-not-working</link>
      <pubDate>Sat, 12 May 2012 03:36:29 -0400</pubDate>
      <dc:creator>madhur13490</dc:creator>
      <guid isPermaLink="false">8101@/devforum/discussions</guid>
      <description><![CDATA[I've installed the CUDA 4.1 GPU Computing SDK and GPU Computing Toolkit. I'm trying to see the performance improvement for the simpleStreams example given in the src folder, but there seems to be a problem in the new version. The streamed version consistently takes more time than the non-streamed version. I have not modified the code. It seems there is a bug in the new examples.]]></description>
   </item>
      <item>
      <title>Trouble with processing image in rows</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8121/trouble-with-processing-image-in-rows</link>
      <pubDate>Sun, 13 May 2012 05:29:07 -0400</pubDate>
      <dc:creator>laz007</dc:creator>
      <guid isPermaLink="false">8121@/devforum/discussions</guid>
      <description><![CDATA[Hello!<br />I'm making an image filter that processes the image in rows.<br />For two weeks I have been trying to figure out why it doesn't work when executed in parallel.<br />I use only threads in the Y dimension. Is that a problem?<br /><br /><br /><br />Here is part of the code:<br />BLOCKDIM_Y=16;<br />....<br />dim3 threads(1, BLOCKDIM_Y);<br />dim3 grid(1,  iDivUp(h, BLOCKDIM_Y));<br /><br />my_CUDA_filter&lt;&lt;&lt; grid, threads&gt;&gt;&gt;(sumR, sumG, sumB, mask,h,w, inD, outD, test);<br />...<br /><br />__global__ void my_CUDA_filter_simple222(int* sumR, int* sumG, int* sumB, int mask,int h,int w, u_int8_t *in, u_int8_t *out, int* test){<br />...<br />int iy = blockDim.y * blockIdx.y + threadIdx.y;<br />int ix=0;<br /><br />	if (iy&gt;=m &amp;&amp; iy&lt;(h-m)) {<br /><br />	//for(iy=m; iy&lt;h-m; iy++){<br /><br />	 ...<br />	for(ix=m+1;ix&lt;w-m;ix++){<br />	 ...<br />	 }<br />}<br /><br />The resulting image is messed up...<br />If I use for(iy=m; iy&lt;h-m; iy++){ <br />and run the kernel with a single thread (that is, with no parallelization), everything is OK.<br /><br />Any ideas?<br /><br />]]></description>
   </item>
      <item>
      <title>nvcc 4.2 pragma unroll issue</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7741/nvcc-4-2-pragma-unroll-issue</link>
      <pubDate>Tue, 01 May 2012 15:42:47 -0400</pubDate>
      <dc:creator>dlowell</dc:creator>
      <guid isPermaLink="false">7741@/devforum/discussions</guid>
      <description><![CDATA[If the loop exit condition is i&lt;=nv-1, where nv is set from a macro (nv = NV, with #define NV 16), then the unroll is implemented incorrectly.<br /><br />Example:<br /><br /><code>#define NV 16 <br />nv=NV;<br />int tid = threadIdx.x+blockDim.x*blockIdx.x;<br />#pragma unroll 2<br />for(int i=0;i&lt;=nv-1;i++){<br />  y[tid]+=a[i]*x[i*n+tid];<br />}</code><br /><br />The code above will produce incorrect code with nvcc 4.2, whereas nvcc 4.0 will produce correct code. The code below will produce correct output with nvcc 4.2.<br /><br /><code>#define NV 16 <br />nv=NV;<br />int tid = threadIdx.x+blockDim.x*blockIdx.x;<br />#pragma unroll 2<br />for(int i=0;i&lt;nv;i++){<br />  y[tid]+=a[i]*x[i*n+tid];<br />}</code><br /><br />Has anyone else had this issue?]]></description>
   </item>
      <item>
      <title>Linker error with c function in .cu file</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7881/linker-error-with-c-function-in-cu-file</link>
      <pubDate>Fri, 04 May 2012 16:08:40 -0400</pubDate>
      <dc:creator>basementscientist</dc:creator>
      <guid isPermaLink="false">7881@/devforum/discussions</guid>
      <description><![CDATA[I've created a kernel inside a .cu file. Also inside the .cu file is a C++ function that calls the kernel. Everything compiles OK, but at the final link step the C++ function is not visible to the rest of the program. How do I make the function visible?<br /><br />I am using Visual Studio 2010 on Windows 8, and the newest SDK and Toolkit.]]></description>
   </item>
      <item>
      <title>Thread indexing</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7606/thread-indexing</link>
      <pubDate>Thu, 26 Apr 2012 17:13:20 -0400</pubDate>
      <dc:creator>essaysoftware</dc:creator>
      <guid isPermaLink="false">7606@/devforum/discussions</guid>
      <description><![CDATA[I would like to index my threads from 0 to N with a <br />&lt;&lt;&lt;...&gt;&gt;&gt; launch.<br /><br />How do I do this?]]></description>
   </item>
      <item>
      <title>Cuda Multip Kernel</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7641/cuda-multip-kernel</link>
      <pubDate>Fri, 27 Apr 2012 11:44:56 -0400</pubDate>
      <dc:creator>Saouli</dc:creator>
      <guid isPermaLink="false">7641@/devforum/discussions</guid>
      <description><![CDATA[Hello there,<br />My question is: can we invoke a kernel inside another kernel?<br />Example:<br />__global__ kernel1(.....)<br />{<br />//do something<br />kernel2 &lt;&lt;&lt;...&gt;&gt;&gt;(...);<br />//with the result of kernel2, do the rest of the work of kernel1<br />}<br />Please, I need answers. Thank you for your time reading this.<br />Abdelhak]]></description>
   </item>
      <item>
      <title>Mac OS X 4.1.28 Driver installer doesn&#039;t allow me to continue/complete installation</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5366/mac-os-x-4-1-28-driver-installer-doesnt-allow-me-to-continuecomplete-installation</link>
      <pubDate>Fri, 02 Mar 2012 12:58:44 -0500</pubDate>
      <dc:creator>ischou</dc:creator>
      <guid isPermaLink="false">5366@/devforum/discussions</guid>
      <description><![CDATA[I'm trying to install all the components of the CUDA Toolkit 4.1 on a Mac OS X 10.6.8 machine.  I'm installing the 4.1.28 drivers and, after accepting the license agreement, I reach the step where I'm supposed to select a destination.  The clickable selection for "Install for all users of this computer" is grayed out, as is the "Continue" button.<br /><br />Does anyone know how I can get past this?  I'm pretty certain that my Mac should be able to support CUDA development.  The OEM graphics card is the GeForce GT 120.<br /><br />Thanks.]]></description>
   </item>
      <item>
      <title>GPU Direct supported boards</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7571/gpu-direct-supported-boards</link>
      <pubDate>Thu, 26 Apr 2012 03:56:30 -0400</pubDate>
      <dc:creator>reubensant</dc:creator>
      <guid isPermaLink="false">7571@/devforum/discussions</guid>
      <description><![CDATA[Hello<br /><br />I spoke to an NVIDIA representative at NAB about the GPUDirect technology and its support by I/O card manufacturers.<br /><br />We currently use Blackmagic cards without GPUDirect. Support on Blackmagic is coming soon (at least so they say).  <br /><br />Does anyone have experience with other boards? (AJA, Blackmagic Design, Bluefish 444, Deltacast, DVS and Matrox).<br /><br />I'm interested in Bluefish. Has anyone had an opportunity to compare them with Blackmagic?<br /><br />Thanks and Regards<br /><br />Reuben Sant<br />iMedia Ltd. ]]></description>
   </item>
      <item>
      <title>nvcc 4.2; a cicc and gcc preprocessing issue</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7441/nvcc-4-2-a-cicc-and-gcc-preprocessing-issue</link>
      <pubDate>Mon, 23 Apr 2012 16:23:17 -0400</pubDate>
      <dc:creator>dlowell</dc:creator>
      <guid isPermaLink="false">7441@/devforum/discussions</guid>
      <description><![CDATA[After upgrading to SDK 4.2 for some reason when I am building my library I now get this error below:<br /><br /><br /><br /><code>#$ cicc  -arch compute_20 -m64 -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 -g -O0 "/tmp/tmpxft_00002684_00000000-10_vecgpu" "/tmp/tmpxft_00002684_00000000-7_vecgpu.cpp3.i"  -o "/tmp/tmpxft_00002684_00000000-2_vecgpu.ptx"<br />&lt;built-in&gt;(2): error: "__STDC_HOSTED__" is predefined; attempted redefinition ignored<br /><br />&lt;built-in&gt;(8): error: "__WCHAR_TYPE__" is predefined; attempted redefinition ignored<br /><br />&lt;built-in&gt;(115): error: "__x86_64" is predefined; attempted redefinition ignored<br /><br />&lt;built-in&gt;(116): error: "__x86_64__" is predefined; attempted redefinition ignored<br /><br />&lt;built-in&gt;(126): error: "__linux__" is predefined; attempted redefinition ignored<br /><br />&lt;built-in&gt;(128): error: "__unix__" is predefined; attempted redefinition ignored<br /><br />6 errors detected in the compilation of "/tmp/tmpxft_00002684_00000000-7_vecgpu.cpp3.i".<br /># --error 0x1 --</code><br /><br /><br /><br /><br />My gcc version is 4.4 though I've attempted this on 4.3<br />I am not sure why it is getting caught on this. If the redefinition is being ignored, why is it throwing an error and stopping compilation at all? Additionally the NVCC doc still mentions cicc as nvopencc, and in fact nowhere mentions cicc.<br /><br />Has anyone else had this issue? Any tips would be greatly appreciated.]]></description>
   </item>
      <item>
      <title>Compile SDK samples on Ubuntu 10.04 plain vanilla ok, /usr/bin/ld: cannot find -lcuda on Optimus</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4331/compile-sdk-samples-on-ubuntu-10-04-plain-vanilla-ok-usrbinld-cannot-find-lcuda-on-optimus</link>
      <pubDate>Sun, 05 Feb 2012 17:04:50 -0500</pubDate>
      <dc:creator>gue22</dc:creator>
      <guid isPermaLink="false">4331@/devforum/discussions</guid>
      <description><![CDATA[Compile SDK samples on plain vanilla Ubuntu 10.04 + GTX 560 ok, on the Optimus machine and on a VMware [Fedora 14] VM w/o nVidia drv I get<br />make[1]: Entering directory `/home/gy/NVIDIA_GPU_Computing_SDK/C/src/deviceQueryDrv'<br />/usr/bin/ld: cannot find -lcuda<br />collect2: ld returned 1 exit status<br />make[1]: *** [../../bin/linux/release/deviceQueryDrv] Error 1<br />make[1]: Leaving directory `/home/gy/NVIDIA_GPU_Computing_SDK/C/src/deviceQueryDrv'<br />make: *** [src/deviceQueryDrv/Makefile.ph_build] Error 2<br /><br />[Edit: Just loaded a VMware Ubuntu 10.04 with CUDA toolkit and GPUcomp SDK just to double-check. Same error.]<br /><br />Don't see any difference in the setup of the machines [except the driver - and a compile / make should not be dependent on the drv install! The Optimus notebook seems somewhere in between - with an Intel on-board GPU and the GTX 525 via PCIe. Could that be the cause there? Dev driver installed correctly though.]<br /><br />[EDIT 2: Why 84 examples compile (global make in the C subdir) on the quad and only a handful on the Tosh and the HP with exactly the same Ubuntu 10.04.3 setup is beyond me.<br /><br />Why deviceQuery doesn't compile in the global make on the latter two machines, but compiles without a hitch in the local make is BEYOND BEYOND. - Well, there must be some issues in the tool chain.]<br />Thx<br />G.]]></description>
   </item>
      <item>
      <title>Why is a GTX680 even slower than a GTX480 when using CUDA?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7181/why-is-a-gtx680-even-slower-than-a-gtx480-when-using-cuda</link>
      <pubDate>Tue, 17 Apr 2012 07:39:41 -0400</pubDate>
      <dc:creator>nepluno</dc:creator>
      <guid isPermaLink="false">7181@/devforum/discussions</guid>
      <description><![CDATA[I've tested several apps in the GPU Computing SDK, such as GrabCutNPP. Surprisingly, I found that the GTX680 is even slower than my old GTX480 (about 0.9x). Why could this happen? In contrast, the 3DMark11 test reported that the GTX680 is 2x faster.<br /><br />The installed driver is 301.10, with a CUDA Toolkit 4.26. My OS is Windows 7 SP1. I even compiled the code using compute_30 and sm_30, but the result stayed the same.<br /><br />PS: I couldn't find a developer version driver that supports the GTX680.]]></description>
   </item>
      <item>
      <title>Cannot find Reduce1.sln</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7416/cannot-find-reduce1-sln</link>
      <pubDate>Mon, 23 Apr 2012 05:21:04 -0400</pubDate>
      <dc:creator>celebisait</dc:creator>
      <guid isPermaLink="false">7416@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br /><br />I'm new to GPU programming and CUDA. I read the CUDA C Programming Guide, which was very helpful. Now I'm reading the tutorial* from Cyril Zeller. However, it says "Open up reduce\src\reduce1.sln" on page 36/157, and I couldn't find that Visual Studio solution file.<br /><br />I have NVIDIA GPU Computing SDK 4.1. I searched in the SDK and found something at:<br /><br />"C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\src\reduction"<br /><br />but I'm not sure whether that is the same thing as in the PDF, because it doesn't have separate solution files like reduce1.sln, reduce2.sln etc.<br /><br />I would appreciate any help,<br />Sait.<br /><br />*<a href="http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/NVIDIA_CUDA_Tutorial_No_NDA_Apr08.pdf" target="_blank" rel="nofollow">http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/NVIDIA_CUDA_Tutorial_No_NDA_Apr08.pdf</a>]]></description>
   </item>
      <item>
      <title>Crash with the new LLVM compiler</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6606/crash-with-the-new-llvm-compiler</link>
      <pubDate>Mon, 02 Apr 2012 06:41:37 -0400</pubDate>
      <dc:creator>Tofic</dc:creator>
      <guid isPermaLink="false">6606@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br /><br />  OpenCL 1.1, drivers 296.10, GTX 580, 64-bits compiler, Windows 7 64-bits.<br /><br />  100% crash inside the compiler, when trying to compile this construction (well-compilable with the old compiler):<br /><strong>const struct BBox bbox = { (float4)(-.5f,-.5f,-.5f,0), (float4)(.5f,.5f,.5f,0) };<br /><br />	....</strong><br /><br />  Error: <em>OpenCL error 'Invalid binary': compilation error<br />	 ptxas application ptx input, line 13; error : Module-scoped variables in .local state space are not allowed with ABI</em><br /><br />       or<br /><br /><em>UNREACHABLE executed.</em><br /><br /><br />  Fix: remove the "const" modifier. Started with new LLVM compiler.<br /><br />Best wishes,<br />Anton]]></description>
   </item>
      <item>
      <title>Unable to build cutil64D.lib library file.</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7406/unable-to-build-cutil64d-lib-library-file-</link>
      <pubDate>Mon, 23 Apr 2012 03:22:01 -0400</pubDate>
      <dc:creator>atul2188</dc:creator>
      <guid isPermaLink="false">7406@/devforum/discussions</guid>
      <description><![CDATA[Dear All,<br /><br />	 I am new to CUDA programming and I am trying to run a CUDA program. But the project build fails with the error: cutil64D.lib file not found.<br /><br />Though I tried to build the library file by the proper methods, I am still unable to get the file.<br /><br />Please suggest something.<br /><br />Thanks.]]></description>
   </item>
      <item>
      <title>CUDA Toolkit and GPU Computing SDK</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/186/cuda-toolkit-and-gpu-computing-sdk</link>
      <pubDate>Mon, 29 Aug 2011 18:02:51 -0400</pubDate>
      <dc:creator>Nadeem Mohammad</dc:creator>
      <guid isPermaLink="false">186@/devforum/discussions</guid>
      <description><![CDATA[If you have general questions about the CUDA Toolkit that do not relate to the included libraries, just tag the question or discussion with the cuda-toolkit tag so it's easy to find.<br />The SDK contains hundreds of samples for CUDA, OpenCL, DirectCompute and the use of many libraries. Use these forums to discuss any aspect of the SDK, and be sure to use the tag below or add ones yourself.]]></description>
   </item>
      <item>
      <title>Is it possible to debug a .exe cuda app, in other words, Can cuda app be reversed ?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6881/is-it-possible-to-debug-a-exe-cuda-app-in-other-words-can-cuda-app-be-reversed-</link>
      <pubDate>Wed, 11 Apr 2012 04:08:27 -0400</pubDate>
      <dc:creator>leolord</dc:creator>
      <guid isPermaLink="false">6881@/devforum/discussions</guid>
      <description><![CDATA[Almost all software is in danger of being reverse-engineered. I want to know whether a cracker can debug an executable file that comes without source code or debug symbols.]]></description>
   </item>
      <item>
      <title>Kernel invocation line in C++ error &quot;no global operator found&quot; (Parallel Nsight)</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5711/kernel-invocation-line-in-c-error-no-global-operator-found-parallel-nsight</link>
      <pubDate>Fri, 09 Mar 2012 09:54:24 -0500</pubDate>
      <dc:creator>wdrozd</dc:creator>
      <guid isPermaLink="false">5711@/devforum/discussions</guid>
      <description><![CDATA[For some reason when trying to invoke my kernel like this:<br /><br />EvaluateKernel&lt;&lt;&lt;...&gt;&gt;&gt;(param_a, param_b, param_c);<br /><br />I get these errors:<br /><br />error C2677: binary '&lt;&lt;' : no global operator found which takes type 'dim3' (or there is no acceptable conversion)<br /><br />error C2297: '&gt;&gt;' : illegal, right operand has type 'float *'<br /><br />BTW param_a is a float*<br /><br />I have declared my Kernel using extern "C" at the beginning of the C++ file, but it seems my code is not recognizing the cuda code? My Cuda code is definitely being built by the NVCC compiler as I receive the Building NVCC (Device) messages which complete (although with a couple of warnings)<br /><br />Thanks for any help you can give.]]></description>
   </item>
      <item>
      <title>seeking a movie via CUDA Video Decoder API</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6926/seeking-a-movie-via-cuda-video-decoder-api</link>
      <pubDate>Wed, 11 Apr 2012 12:30:16 -0400</pubDate>
      <dc:creator>Joseph Laurino</dc:creator>
      <guid isPermaLink="false">6926@/devforum/discussions</guid>
      <description><![CDATA[While modifying the CUDA Video Decoder D3D9 sample in the GPU Computing SDK, we could not find a way to use any of the available APIs to seek to a specific time within a video. <br /><br />We are wondering whether seeking within a movie is possible.<br /><br />Thank you,<br />-Joseph<br /><br />]]></description>
   </item>
      <item>
      <title>Is there an efficient CUDA sorting Algorithm?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6836/is-there-an-efficient-cuda-sorting-algorithm</link>
      <pubDate>Tue, 10 Apr 2012 00:15:59 -0400</pubDate>
      <dc:creator>addio3305</dc:creator>
      <guid isPermaLink="false">6836@/devforum/discussions</guid>
      <description><![CDATA[Hi everyone.<br /><br />I'd like to implement a CUDA sorting algorithm. I found information and examples on the Internet, but I can't find an efficient algorithm because of two problems.<br /><br />First, most of the examples and theses use a power of 2 as the input size (for example, the merge sort and bitonic merge sort in the code samples), but my input size is not a power of 2.<br /><br />Second, I want to sort a massive data set, such as 10 million elements.<br /><br />Is it possible to solve these problems?<br /><br />I am looking for any reference or example that addresses them.<br /><br />Thanks for your help.<br /><br /> ]]></description>
   </item>
      <item>
      <title>NVIDIA Parallel Nsight 2.1 - debugging GTX 680 in VS2010  (Windows 7)</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6316/nvidia-parallel-insight-2-1-debugging-gtx-680-in-vs2010-windows-7</link>
      <pubDate>Sat, 24 Mar 2012 10:07:29 -0400</pubDate>
      <dc:creator>tvandervlies</dc:creator>
      <guid isPermaLink="false">6316@/devforum/discussions</guid>
      <description><![CDATA[I get the following warning when I start CUDA debugging:<br /><br /><strong>Parallel Nsight Debug<br />A CUDA context was created on a GPU that is not currently debuggable. breakpoints will be disabled.<br /><br />Adapter: Geforce GTX 680</strong><br /><br />When I change the CUDA context to the second controller, a GTS 250, and connect the monitor to the GTX 680, debugging works normally. Why is the GTX 680 not debuggable?  <br /><br />]]></description>
   </item>
      <item>
      <title>CU_DEVICE_ATTRIBUTE_CLOCK_RATE and GTX 680 (Kepler)</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6326/cu_device_attribute_clock_rate-and-gtx-680-kepler</link>
      <pubDate>Sat, 24 Mar 2012 13:35:36 -0400</pubDate>
      <dc:creator>red-ray</dc:creator>
      <guid isPermaLink="false">6326@/devforum/discussions</guid>
      <description><![CDATA[When I call cuDeviceGetAttribute() and specify CU_DEVICE_ATTRIBUTE_CLOCK_RATE for a GTX 680 (Kepler), the speed returned is 705MHz rather than 1006MHz. How can I get the correct speed using CUDA? I could use NVAPI, but would prefer not to do this.<br /><br />I am using the 4.1 CUDA SDK and 302.10 drivers.<br /><br />I have just checked CUDA 4.2 and there are no additions to typedef enum CUdevice_attribute_enum at all. I was expecting at least the Boost Clock.]]></description>
   </item>
      <item>
      <title>Crash debug symbols?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6176/crash-debug-symbols</link>
      <pubDate>Wed, 21 Mar 2012 22:50:35 -0400</pubDate>
      <dc:creator>dandrumea</dc:creator>
      <guid isPermaLink="false">6176@/devforum/discussions</guid>
      <description><![CDATA[Hello, thanks for looking to help out.<br /><br />Picked category 'Mobile' as it seems more likely for a cause.. <br /><br />Problem: I crash on CUDA/OpenGL interop on an Optimus mobile machine (intel 3000 + GT540M) with Win7SP1 inside cudaGraphicsGLRegisterBuffer. <br /><br />Report: Using CUDASDK sample simpleGL for reporting (stack below - if anyone could point me to NVIDIA debug symbols that could help?). Currently on CUDA 4.1 with driver 286.16 (see stack), but it always happens(ed), on 4.0 and earlier, with drivers like 285.62, and earlier. Here is the stack of the crash in cudaGraphicsGLRegisterBuffer:<br /><br /> 	KernelBase.dll!_RaiseException@16()  + 0x58 bytes	<br /> 	cudart32_41_28.dll!100387f7() 	<br /> 	[Frames below may be incorrect and/or missing, no symbols loaded for cudart32_41_28.dll]	<br /> 	cudart32_41_28.dll!10011d27() 	<br /> 	cudart32_41_28.dll!10008d45() 	<br /> 	cudart32_41_28.dll!1002ff2f() 	<br /> 	gdi32.dll!7614e8d9() 	<br /> 	ig4icd32.dll!025daf62() 	<br /> 	ig4icd32.dll!025be2ae() 	<br /> 	ig4icd32.dll!0259a511() 	<br /> 	ig4icd32.dll!025c192c() 	<br />&gt;	simpleGL.exe!mainCRTStartup()  Line 189	C<br /> 	kernel32.dll!75f9339a() 	<br /> 	ntdll.dll!77a89ef2() 	<br /> 	ntdll.dll!77a89ec5() 	<br /><br />Other Info: Optimus switching to NVIDIA graphics always fails obviously when "run with graphics processor", etc. is invoked for applications.<br /><br />Any suggestions are most welcome, thank you!<br /><br />Edit: March 22, 2012 3:43pm est - changed question category from 'mobile' to 'gpu computing']]></description>
   </item>
      <item>
      <title>[volumeRender] Why are unequally sized volumes rendered as cubes (i.e., scaled)?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4791/volumerender-why-are-unequally-sized-volumes-rendered-as-cubes-i-e-scaled</link>
      <pubDate>Thu, 16 Feb 2012 09:12:51 -0500</pubDate>
      <dc:creator>ivma</dc:creator>
      <guid isPermaLink="false">4791@/devforum/discussions</guid>
      <description><![CDATA[Hi!<br />I am trying out the volume renderer from the NVIDIA GPU Computing SDK 4.1/4.0 and I was wondering why it renders the Bucky.raw volume (256x256x256) accordingly but it scales unequally sized volumes such as for example the lobster (120x120x34)[1].<br /><br />Here are some resulting images (from bottom 120x120 and from the side where the volume resolution in Z direction is only 34 voxels):<br /><img src="http://img593.imageshack.us/img593/6965/lobsterbottom.png" alt="Bottom view" /><br /><img src="http://img853.imageshack.us/img853/2559/lobsterside.png" alt="Side view" /><br /><br />Does anybody have a clue why that is and possibly how to fix it?<br /><br />PS: I have tried it on a couple of other data sets as well but with the same effect.<br /><br />Greetings,<br />ivma<br /><br />[1] <a href="http://www.cg.tuwien.ac.at/courses/Visualisierung/data/lobster.zip">lobster.zip</a>]]></description>
   </item>
      <item>
      <title>Typecasting to custom struct in kernel</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5161/typecasting-to-custom-struct-in-kernel</link>
      <pubDate>Sun, 26 Feb 2012 17:17:30 -0500</pubDate>
      <dc:creator>avion85</dc:creator>
      <guid isPermaLink="false">5161@/devforum/discussions</guid>
      <description><![CDATA[Since this is my first post, greetings to everyone.<br /><br />I've encountered a problem regarding casting to a custom struct in a kernel; hopefully someone else has been in the same situation.<br /><br />I'm passing a large array as a parameter into a kernel from a .cu file, and I would like to cast it into a struct and access it as an array of structures.<br /><br />pseudo-code:<br /><br />kernels.cu (with nvcc)<br /><code><br />struct myMatrix<br />{<br />	float e[6];<br />};<br />__global__ void myKernel(float *raw, myMatrix *p){<br /> int myID = blockIdx.x * blockDim.x + threadIdx.x;<br /><br /> myMatrix m = p[myID];	  //does not work - "???" in nsight for all values <br /><br /> myMatrix n =((myMatrix *)raw)[myID];     //does not work also - "???"<br /><br /> float a = raw[0];    //works and I get correct single float values, but unstructured<br /><br /> float4 b = ((float4*)raw)[0];  //works and I get correct tuples<br /><br />//what I want:<br />myMatrix m2 = p[myID];<br />float something = m2.e[3];<br />}<br /></code><br /><br /><br />main.cu (with microsoft c compiler)<br /><code><br />float *p = [large array];<br />myKernel&lt;&lt;&lt;block,thread&gt;&gt;&gt;(p,(myMatrix*)p);<br /></code><br /><br />I am using Parallel Nsight to inspect the values and what I get is "???" while stepping through the program. I have never had problems when using built-in types like float4. However, I would of course like to have my own structures working properly.<br />Maybe the problem is in the alignment? If so, to which value do I align? <br /><br />Appreciate the help.<br />Avion<br /><br />PS. Working with Visual Studio, everything is 64-bit.<br /><br />EDIT: added another example that works.]]></description>
   </item>
      <item>
      <title>Display driver vs Developer driver  Version</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4496/display-driver-vs-developer-driver-version</link>
      <pubDate>Thu, 09 Feb 2012 11:03:13 -0500</pubDate>
      <dc:creator>4fermi</dc:creator>
      <guid isPermaLink="false">4496@/devforum/discussions</guid>
      <description><![CDATA[<br />The latest "Display Driver" from <a href="http://www.nvidia.com/object/linux-display-amd64-290.10-driver.html">the products page</a> is <strong>ver 290.10</strong>.  But the latest "Developer Driver" from <a href="http://www.developer.nvidia.com/cuda-toolkit-41#s=bcb">the CUDA Developer Toolkit 4.1 download page</a> is <strong>ver 285.05.33</strong>.<br /><br />Question:  why must CUDA developers use an older version of the driver?  After all, the CUDA product made by the developer will be used by people with the newer driver!<br />]]></description>
   </item>
      <item>
      <title>Interactive updating the texture volume</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4141/interactive-updating-the-texture-volume</link>
      <pubDate>Mon, 30 Jan 2012 10:44:28 -0500</pubDate>
      <dc:creator>sangminpk</dc:creator>
      <guid isPermaLink="false">4141@/devforum/discussions</guid>
      <description><![CDATA[Hi, <br />I am trying to update the texture (cudaArray) with CUDA while rendering it with ray-casting.<br />Since the device-to-device copy takes several seconds on my Quadro 4000, <br />it is very hard to keep the ray-casting interactive if there are many copy events. <br />Could you give me any suggestion for updating the texture (cudaArray) directly, or any better way?<br /><br />Here is my CUDA code:<br /><br />// 1. Variable declarations<br />cudaArray *d_cudaArray = 0;<br />texture tex_volume<br />unsigned char	*d_Edited_Volume_uc;<br /><br />// 2. copy data to 3D array (Device to Device) <br />// There are multiple copies during the ray-casting in my implementation<br />cudaMemcpy3DParms copyParams = {0};<br />cudaExtent volumeSize = make_cudaExtent(width, height, depth);<br />copyParams.srcPtr   = make_cudaPitchedPtr(d_Edited_Volume_uc, width*sizeof(unsigned char), width, height);<br />copyParams.dstArray = d_cudaArray;<br />copyParams.extent   = volumeSize;<br />copyParams.kind     = cudaMemcpyDeviceToDevice;<br />cutilSafeCall( cudaMemcpy3D(&amp;copyParams) );  // --&gt; Takes 3-4 seconds (the main bottleneck)<br /><br />// 3. 3D texture binding<br />cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc();<br />cutilSafeCall(cudaBindTextureToArray(tex_volume, d_cudaArray, channelDesc));   <br /><br />// 4. Texture fetching at the ray-casting algorithm<br />float sample = tex3D(tex_volume, texCoord.x, texCoord.y, texCoord.z);<br /><br />After the volume (d_Edited_Volume_uc, width*height*depth = 256^3) is updated with CUDA, it is copied to "d_cudaArray" for the texture binding. That takes most of the time and is the main bottleneck for interactive ray-casting. Without copying the volume, it runs at around 10 FPS.<br />I am wondering if I can update the texture memory directly with CUDA during the ray-casting,<br />or if there is any better way to make the rendering interactive with multiple texture copies. 
<br /><br />Thank you in advance<br />]]></description>
   </item>
      <item>
      <title>Is there a way to access an ID3D11Texture3D in CUDA (read/write)?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4221/is-there-a-way-to-access-an-id3d11texture3d-in-cuda-readwrite</link>
      <pubDate>Wed, 01 Feb 2012 09:02:28 -0500</pubDate>
      <dc:creator>SoulWiz</dc:creator>
      <guid isPermaLink="false">4221@/devforum/discussions</guid>
      <description><![CDATA[I have an ID3D11Texture3D with the following descriptor:<br /><code><br />D3D11_TEXTURE3D_DESC td;<br />ZeroMemory(&amp;td, sizeof(td));<br />td.Width = uiWidth;<br />td.Height = uiHeight;<br />td.Depth = uiDepth;<br />td.MipLevels = 1;<br />td.Format = DXGI_FORMAT_R32_FLOAT;<br />td.BindFlags = D3D11_BIND_RENDER_TARGET | D3D11_BIND_SHADER_RESOURCE;<br /></code><br /><br />I want to read and write to this texture using CUDA. Is this possible somehow?<br />I have tried the following:<br />cudaGraphicsD3D11RegisterResource &gt;&gt; cudaGraphicsMapResources &gt;&gt; cudaGraphicsSubResourceGetMappedArray<br />to READ: cudaBindTextureToArray &gt;&gt; tex3D<br />to WRITE: cudaMemcpy3D (from linear memory allocated with cudaMalloc3D)<br />but the memory copy fails with cudaErrorInvalidValue:<br /><code><br />cudaMemcpy3DParms oMemcpy3DParms;<br />memset(&amp;oMemcpy3DParms, 0, sizeof(cudaMemcpy3DParms));<br />oMemcpy3DParms.srcPtr = oPitchedPtr;<br />oMemcpy3DParms.srcPos = make_cudaPos(0, 0, 0);<br />oMemcpy3DParms.dstArray = pArray;<br />oMemcpy3DParms.dstPos = make_cudaPos(0, 0, 0);<br />oMemcpy3DParms.extent = oExtent;<br />oMemcpy3DParms.kind = cudaMemcpyDeviceToDevice;<br />cudaError oCudaError = cudaMemcpy3D(&amp;oMemcpy3DParms);<br /></code><br /><br />Any ideas? ... or does it even work?]]></description>
   </item>
      <item>
      <title>GPU computing in a virtual environment</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1781/gpu-computing-in-a-virtual-environment</link>
      <pubDate>Wed, 23 Nov 2011 15:44:52 -0500</pubDate>
      <dc:creator>bwatson</dc:creator>
      <guid isPermaLink="false">1781@/devforum/discussions</guid>
      <description><![CDATA[Assume I have server-grade hardware running VMWare ESX and hosting 1 or more virtual machines.  If I were to add NVIDIA graphics to the server, would it be possible for a program running inside one of the virtual machines to utilize the GPU for calculations?]]></description>
   </item>
      <item>
      <title>Comparing CPU with GPU code execution time</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4116/comparing-cpu-with-gpu-code-execution-time</link>
      <pubDate>Sun, 29 Jan 2012 15:54:59 -0500</pubDate>
      <dc:creator>jseba</dc:creator>
      <guid isPermaLink="false">4116@/devforum/discussions</guid>
      <description><![CDATA[Hi.<br />I am new to CUDA programming and I'm trying to do a basic comparison of execution time between CPU and GPU code.<br /><br />I'm using the vectorAdd SDK example, modified for testing:<br /><br /><code><br /><br />// Device code<br />__global__ void deviceVecAdd(const float* A, const float* B, float* C, int N)<br />{<br />  int i = blockDim.x * blockIdx.x + threadIdx.x;<br />  if (i &lt; N)<br />    C[i] = A[i] + B[i];<br />}<br /><br />// Host code<br /><br />void hostVecAdd(const float * A, const float * B, float * C, int N) {<br />  for(int i = 0; i &lt; N; i++)<br />    C[i] = A[i] + B[i];<br />}<br /><br />int main(int argc, char** argv)<br />{<br /><br />  bool useCuda = argc &gt;=2 &amp;&amp; strncmp(argv[1], "-c", 2) == 0;<br /><br />  std::cout &lt;&lt; "Vector addition" &lt;&lt; std::endl;<br />  if(useCuda)<br />    std::cout &lt;&lt; "Using CUDA" &lt;&lt; std::endl;<br /><br />  int N = 50000;<br />  size_t size = N * sizeof(float);<br /><br />  Timer timer;<br />  timer.start();<br /><br />  // Allocate input vectors h_A and h_B in host memory<br />  h_A = (float*)malloc(size);<br />  if (h_A == 0) cleanupResources();<br />  h_B = (float*)malloc(size);<br />  if (h_B == 0) cleanupResources();<br />  h_C = (float*)malloc(size);<br />  if (h_C == 0) cleanupResources();<br /><br />  // Initialize input vectors<br />  randomInit(h_A, N);<br />  randomInit(h_B, N);<br /><br />  if(useCuda) {<br /><br />    // Allocate vectors in device memory<br />    cudaMalloc((void**)&amp;d_A, size);<br />    cudaMalloc((void**)&amp;d_B, size);<br />    cudaMalloc((void**)&amp;d_C, size);<br /><br />    // Copy vectors from host memory to device memory<br />    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);<br />    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);<br /><br />    // Invoke kernel<br />    int threadsPerBlock = 256;<br />    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;<br />    
deviceVecAdd&lt;&lt;&lt;blocksPerGrid, threadsPerBlock&gt;&gt;&gt;(d_A, d_B, d_C, N);<br /><br />    // Copy result from device memory to host memory<br />    // h_C contains the result in host memory<br />    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);<br />  } else {<br />    timer.start();<br />    hostVecAdd(h_A, h_B, h_C, N);<br />  }<br /><br />  timer.stop();<br /><br />  // Verify result<br />  int i;<br />  for (i = 0; i &lt; N; ++i) {<br />    float sum = h_A[i] + h_B[i];<br />    if (fabs(h_C[i] - sum) &gt; 1e-5) {<br />      std::cout &lt;&lt; "Fail" &lt;&lt; std::endl;<br />      break;<br />    }<br />  }<br /><br />  std::cout &lt;&lt; timer.time() &lt;&lt; " ms" &lt;&lt; std::endl;<br /><br />  cleanupResources();<br />}<br /><br /></code><br /><br />The Timer class uses clock_gettime() with CLOCK_REALTIME. <br />In this test the CPU code runs an order of magnitude faster than the GPU code. Can someone tell me where my mistake is?<br /><br />Thanks]]></description>
   </item>
      <item>
      <title>Dynamic memory allocation in 2.x CUDA devices</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4321/dynamic-memory-allocation-in-2-x-cuda-devices</link>
      <pubDate>Sun, 05 Feb 2012 13:34:55 -0500</pubDate>
      <dc:creator>IndrajeetK</dc:creator>
      <guid isPermaLink="false">4321@/devforum/discussions</guid>
      <description><![CDATA[  C:\Users\DELL\Desktop\template(CUDA)&gt;"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.1\bin\nvcc.exe" -gencode=arch=compute_10,code=\"sm_10,compute_10\" -gencode=arch=compute_20,code=\"sm_20,compute_20\" --use-local-env --cl-version 2010 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin" -I"./" -I"../../common/inc" -I"../../../shared/inc" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.1\include"  -G0  --keep-dir "Debug" -maxrregcount=0  --machine 32 --compile  -g    -Xcompiler "/EHsc /nologo /Od /Zi  /MTd " -o "Win32/Debug/template.cu.obj" "C:\Users\DELL\Desktop\template(CUDA)\template.cu" <br />1&gt;  template.cu<br />1&gt;C:/Users/DELL/Desktop/template(CUDA)/template.cu(6): error : calling a host function("operator new ") from a __device__/__global__ function("mallocTest") is not allowed<br />1&gt;  <br />1&gt;C:/Users/DELL/Desktop/template(CUDA)/template.cu(7): error : calling a host function("free") from a __device__/__global__ function("mallocTest") is not allowed<br /><br /><br /><br />I am using an NVIDIA GeForce 525 with nvcc 4.1.<br /><br />The code is pretty much the same as in http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf<br /><br />pages 123 and 124.<br />Please help!<br />Thanks]]></description>
   </item>
      <item>
      <title>On streams and asynchronous execution</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3496/on-streams-and-asynchronous-execution</link>
      <pubDate>Mon, 16 Jan 2012 03:32:28 -0500</pubDate>
      <dc:creator>lucana</dc:creator>
      <guid isPermaLink="false">3496@/devforum/discussions</guid>
      <description><![CDATA[This question is just to make sure I understand correctly how CUDA streams work. <br /><br />Imagine I have a for loop like this. I am using only one stream.<br /><br />for (i=0; i &lt; N; i++)<br />{<br />	 run operations on CPU <br />	 copy results of CPU operations to CUDA kernel with cudaMemcpyAsync<br />	 call kernel &lt;&lt;&lt;  , &gt;&gt;&gt;<br />}<br /><br />My understanding is that the kernel for i and the CPU operations for i+1 at the beginning of the loop will execute concurrently, but the kernel won't start for i+1 until the CPU has finished computing results for i+1.<br /><br />Is this right? Or will the operations on CPU and GPU never overlap? Will the kernel start before the CPU has computed the proper results? Is it necessary to add some control flags to make sure the operations on the CPU have finished before the kernel starts?<br /><br />This diagram shows what I want to do. In fact it is a pipeline, but I'm still unsure if it is possible with CUDA. <br /><br />----------i = 0 -------------------- i = 1 --------------------------- i = 2<br />(t0) compute results on CPU<br />(t1) copy results to CUDA kernel -- compute results on CPU<br />(t2) execute kernel --------------- copy results to CUDA kernel -- compute results on CPU<br />(t3) ------------------------------ execute kernel --------------- copy results to CUDA kernel <br />(t4)-------------------------------------------------------------- execute kernel<br /><br />Finally, I would like to ask if it makes sense to use CUDA streams when there is a data dependency between streams, with a pipeline like the one shown above. ]]></description>
   </item>
      <item>
      <title>Is there a way to access an ID3D11Texture2D with 8 samples in CUDA (read/write)?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4201/is-there-a-way-to-access-an-id3d11texture2d-with-8-samples-in-cuda-readwrite</link>
      <pubDate>Wed, 01 Feb 2012 07:24:34 -0500</pubDate>
      <dc:creator>SoulWiz</dc:creator>
      <guid isPermaLink="false">4201@/devforum/discussions</guid>
      <description><![CDATA[I have an ID3D11Texture2D with the following descriptor:<br /><code><br />D3D11_TEXTURE2D_DESC td;<br />ZeroMemory(&amp;td, sizeof(td));<br />td.Width = m_uiWidth;<br />td.Height = m_uiHeight;<br />td.MipLevels = 1;<br />td.ArraySize = 1;<br />td.Format = DXGI_FORMAT_R32_FLOAT;<br />td.SampleDesc.Count = 8;<br />td.SampleDesc.Quality = 0;<br />td.BindFlags = D3D11_BIND_RENDER_TARGET;</code><br /><br />I want to read and write to this texture (all 8 samples) using CUDA. Is this possible somehow?<br />I have tried the following:<br />cudaGraphicsD3D11RegisterResource &gt;&gt; cudaGraphicsMapResources &gt;&gt; cudaGraphicsSubResourceGetMappedArray<br />to READ: cudaBindTextureToArray &gt;&gt; tex2DLayered<br />to WRITE: cudaMemcpy3D (from linear memory allocated with cudaMalloc3D)<br />but it looks like I cannot access all 8 samples this way.<br /><br />I also tried to have direct read/write access using a surface reference:<br />cudaGraphicsD3D11RegisterResource &gt;&gt; cudaGraphicsMapResources &gt;&gt; cudaGraphicsSubResourceGetMappedArray &gt;&gt; cudaBindSurfaceToArray<br />to READ: surf2Dread<br />to WRITE: surf2Dwrite<br /><br />Any ideas? ... or does it even work?]]></description>
   </item>
      <item>
      <title>Please update the openSUSE packages.</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4086/please-update-the-opensuse-packages-</link>
      <pubDate>Sat, 28 Jan 2012 09:51:42 -0500</pubDate>
      <dc:creator>Deanjo</dc:creator>
      <guid isPermaLink="false">4086@/devforum/discussions</guid>
      <description><![CDATA[Can you guys please update the openSUSE packages? openSUSE 11.2's support was discontinued May 12th 2011 and 11.3's support was discontinued January 20th 2012.  12.1 is the current release and all we are asking is for the package to be updated and a bit of equality in support here. Just to give a bit of perspective, support for those openSUSE versions was discontinued around the same time the latest CUDA-supported version of Ubuntu was released.]]></description>
   </item>
      <item>
      <title>opencl application is portable cuda for intel and amd?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/2076/opencl-application-is-portable-cuda-for-intel-and-amd</link>
      <pubDate>Fri, 02 Dec 2011 13:18:08 -0500</pubDate>
      <dc:creator>silviocassiano</dc:creator>
      <guid isPermaLink="false">2076@/devforum/discussions</guid>
      <description><![CDATA[Good afternoon. This is my first post and I am a newbie in this world. Could someone from the forum tell me whether an OpenCL application developed with the Intel SDK can run on NVIDIA, AMD Radeon, and Intel architectures, and whether an application developed with the NVIDIA SDK can run on Intel and AMD Radeon architectures?<br /><br />My e-mail is silvio.cassiano@hotmail.com <br /><br />Thank you.<br /><br />]]></description>
   </item>
      <item>
      <title>GPU Accelerated 2D to Stereo 3D Video Conversion</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3906/gpu-accelerated-2d-to-stereo-3d-video-conversion</link>
      <pubDate>Tue, 24 Jan 2012 16:43:20 -0500</pubDate>
      <dc:creator>DryRiver</dc:creator>
      <guid isPermaLink="false">3906@/devforum/discussions</guid>
      <description><![CDATA[Hello All,<br /><br />I have written a pretty good 2D-to-3D video conversion algorithm in C# .NET. (It took a little over 2 years of experimenting to get it right.)<br /><br />I now want to GPU accelerate this 2D-to-3D conversion algorithm. I am hoping for a 10x - 20x speedup using the GPU to do the pixel crunching, instead of the CPU. <br /><br />My requirements are:<br /><br />- The GPU code needs to execute inside a C# .NET Windows Forms Application<br /><br />- I want to use the easiest/most beginner-friendly GPU coding method possible<br /><br />Where should I start with this? CUDA.NET? OpenCL.NET? Brahma (for C#)?<br /><br />Are there any beginner's tutorials for using CUDA/OpenCL inside C# .NET?<br /><br />Are there, specifically, any Image Processing tutorials/examples for CUDA/OpenCL?<br /><br />Thank you for any feedback. I am a complete CUDA/OpenCL noob and am hoping for expert advice on making my first GPU accelerated project happen.<br /><br />Best Regards,<br /><br />                  DryRiver<br />]]></description>
   </item>
      <item>
      <title>device function pointers</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3921/device-function-pointers</link>
      <pubDate>Wed, 25 Jan 2012 04:12:51 -0500</pubDate>
      <dc:creator>micheletuttafesta</dc:creator>
      <guid isPermaLink="false">3921@/devforum/discussions</guid>
      <description><![CDATA[Dear Sirs,<br />I need a device version of the following<br />host code:<br /><br />double (**func)(double x);<br /><br />double func1(double x)<br />{<br /> return x+1.;<br />}<br /><br />double func2(double x)<br />{<br /> return x+2.;<br />}<br /><br />double func3(double x)<br />{<br /> return x+3.;<br />}<br /><br />void test(void)<br />{<br /> double x;<br /><br /> for(int i=0;i&lt;3;++i){<br />  x=func[i](2.0);<br />  printf("%g\n",x);<br /> }<br /><br />}<br /><br />int main(void)<br />{<br /> func=(double (**)(double))malloc(10*sizeof(double (*)(double)));<br /><br /> test();<br /><br /> return 0;<br />}<br /><br /><br />where func1, func2, func3<br />have to be __device__ functions<br />and "test"<br />has to be a (suitably modified) __global__ kernel.<br /><br />I have a NVIDIA GeForce GTS 450 (compute capability 2.1)<br />Thank you in advance<br />Michele<br /><br />]]></description>
   </item>
      <item>
      <title>NSIGHT doesn&#039;t let me choose threads with id greater than 15.</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3696/nsight-doesnt-let-me-choose-threads-with-id-greater-than-15-</link>
      <pubDate>Thu, 19 Jan 2012 10:41:02 -0500</pubDate>
      <dc:creator>lucana</dc:creator>
      <guid isPermaLink="false">3696@/devforum/discussions</guid>
      <description><![CDATA[I have managed to stop CUDA debugging at breakpoints. I'm working with VS2010. I can use the Debug Focus to select threads and blocks to follow. But I can't select just any of the threads/blocks I defined. The grid and block dimensions shown there are wrong. For example, I launched 1024 threads (kernel&lt;&lt;&lt;1, 1024&gt;&gt;&gt;), but it only lets me choose up to thread number 15. Is this normal? Am I doing something wrong? ]]></description>
   </item>
      <item>
      <title>linker errors while executing opencl sample codes</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3421/linker-errors-while-executing-opencl-sample-codes</link>
      <pubDate>Fri, 13 Jan 2012 06:10:44 -0500</pubDate>
      <dc:creator>Prasanna</dc:creator>
      <guid isPermaLink="false">3421@/devforum/discussions</guid>
      <description><![CDATA[Hi<br />I am new in executing opencl codes...I have downloaded the GPU Computing SDK and drivers and executing opencl samples from that...I have included all the .lib files which are there in opencl in SDK...While executing i got the following errors in visual studio 2010<br /><br />1&gt;------ Build started: Project: testopencl, Configuration: Debug Win32 ------<br />1&gt; Skipping... (no relevant changes detected)<br />1&gt; testopencl.cpp<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _shrComparefet referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clEnqueueReadBuffer@36 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clEnqueueNDRangeKernel@36 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clEnqueueWriteBuffer@36 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clSetKernelArg@16 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clCreateKernel@12 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clBuildProgram@24 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clCreateProgramWithSource@20 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _oclLoadProgSource referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _shrFindFilePath referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clCreateBuffer@24 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clCreateCommandQueue@20 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clCreateContext@24 
referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clGetDeviceIDs@24 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clGetPlatformIDs@12 referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _shrFillArray referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _shrRoundUp referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _shrLog referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _shrSetLogFileName referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _shrCheckCmdLineFlag referenced in function _main<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clReleaseMemObject@4 referenced in function "void __cdecl Cleanup(int,char * *,int)" (?Cleanup@@YAXHPAPADH@Z)<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clReleaseContext@4 referenced in function "void __cdecl Cleanup(int,char * *,int)" (?Cleanup@@YAXHPAPADH@Z)<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clReleaseCommandQueue@4 referenced in function "void __cdecl Cleanup(int,char * *,int)" (?Cleanup@@YAXHPAPADH@Z)<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clReleaseProgram@4 referenced in function "void __cdecl Cleanup(int,char * *,int)" (?Cleanup@@YAXHPAPADH@Z)<br />1&gt;testopencl.obj : error LNK2019: unresolved external symbol _clReleaseKernel@4 referenced in function "void __cdecl Cleanup(int,char * *,int)" (?Cleanup@@YAXHPAPADH@Z)<br />1&gt;C:\Users\Acer\Documents\Visual Studio 2010\Projects\testopencl\Debug\testopencl. 
exe : fatal error LNK1120: 25 unresolved externals<br />========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========<br /><br /><br />I am using 64-bit Windows 7 with an NVIDIA graphics card. It would be a great help if anyone could reply with a solution to this problem.<br />Thank you. ]]></description>
   </item>
      <item>
      <title>Many CPUs in GPU?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3596/many-cpus-in-gpu</link>
      <pubDate>Wed, 18 Jan 2012 01:56:59 -0500</pubDate>
      <dc:creator>Alfian Akbar Gozali</dc:creator>
      <guid isPermaLink="false">3596@/devforum/discussions</guid>
      <description><![CDATA[Hi all,<br /><br />Is it true that an NVIDIA GPU has many CPUs which can compute many tasks concurrently?<br />If it is true, what is the CUDA function to get the number of CPUs in a GPU?<br /><br />*)FYI: I want to build an Island Model Genetic Algorithm with my GeForce 220. I want to evaluate the GA fitness on each of the CPUs in my GPU.<br /><br />Thank you in advance.]]></description>
   </item>
      <item>
      <title>Nvda.Build.CudaTasks.SanitizePaths compile error</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3336/nvda-build-cudatasks-sanitizepaths-compile-error</link>
      <pubDate>Wed, 11 Jan 2012 21:39:02 -0500</pubDate>
      <dc:creator>jm99</dc:creator>
      <guid isPermaLink="false">3336@/devforum/discussions</guid>
      <description><![CDATA[I receive the error Nvda.Build.CudaTasks.SanitizePaths when trying to compile a program in VS2010 with SDK 4.0.<br /><br />The complete error is:<br /><br /><br />Error	1	error MSB4062: The "Nvda.Build.CudaTasks.SanitizePaths" task could not be loaded from the assembly C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\BuildCustomizations\Nvda.Build.CudaTasks.v4.0.dll. Could not load file or assembly 'file:///C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\BuildCustomizations\Nvda.Build.CudaTasks.v4.0.dll' or one of its dependencies. The system cannot find the file specified. Confirm that the &lt;UsingTask&gt; declaration is correct, that the assembly and all its dependencies are available, and that the task contains a public class that implements Microsoft.Build.Framework.ITask.<br /><br />The dll referred to does exist in the specified path.  What could be the issue?  Thanks.]]></description>
   </item>
      <item>
      <title>NSight Monitor fails to start</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/2146/nsight-monitor-fails-to-start</link>
      <pubDate>Mon, 05 Dec 2011 10:20:35 -0500</pubDate>
      <dc:creator>kemperbenny</dc:creator>
      <guid isPermaLink="false">2146@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br /><br />I've installed NSight 2.1 RC1 with the correct drivers (285.67, as posted in the release).<br /><br />I'm running Windows 7 64-bit and have two cards: an ATI (connected to the display) and a GTS 450 (whose GF106 architecture I've seen is supported).<br /><br />Every time I start the monitor (with "Run as administrator"), I see it for a short while (one to three seconds) in the task manager, then it disappears. I've looked with procmon, and the only fishy thing was some "buffer overflow" messages when reading some certificates (I don't know if that is the problem, because it looks like it continues to run a bit afterwards).<br /><br />I'm attaching the procmon trace in case it helps.<br /><br />Please help, I'm at a dead end.<br /><br />Thanks.]]></description>
   </item>
      <item>
      <title>How to compile cu files to ptx with MS 64 bit compiler WITHOUT VS Professional installed?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3076/how-to-compile-cu-files-to-ptx-with-ms-64-bit-compiler-without-vs-professional-installed</link>
      <pubDate>Thu, 05 Jan 2012 09:23:50 -0500</pubDate>
      <dc:creator>szerbst</dc:creator>
      <guid isPermaLink="false">3076@/devforum/discussions</guid>
      <description><![CDATA[Hi<br /><br />This is pretty annoying. We are working with the Microsoft 64-bit compiler from the Windows SDK on Windows 7. While digging into Optix I've now reached a point where I want to write my own CUDA programs to feed Optix instead of abusing the ptx files from the Optix samples. <br /><br />However, I only seem to be able to compile the cu files with nvcc for the machine 32 target. Using the m64 target it keeps telling me:<br /><code>Visual Studio configuration file '(null)' could not be found for installation at blablabla</code><br /><br />Googling this brought up some threads across the internet that seem to indicate that CUDA indeed relies on Visual Studio Professional being installed. Well ... I don't have it since we are using Eclipse as IDE.<br /><br />It seems to work with machine 32 because I do have Visual Studio Express installed for debugging purposes.<br /><br />So what am I missing, or what is the *official* way to compile a CUDA program for a 64-bit Windows machine if there is no Visual Studio Professional installed?<br /><br />Thanks]]></description>
   </item>
      <item>
      <title>Draw and Flood-fill with Cuda</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/2196/draw-and-flood-fill-with-cuda</link>
      <pubDate>Tue, 06 Dec 2011 04:29:56 -0500</pubDate>
      <dc:creator>elect</dc:creator>
      <guid isPermaLink="false">2196@/devforum/discussions</guid>
      <description><![CDATA[Hi all,<br /><br />I need to implement a flood-fill function in CUDA.<br /><br />I have an array of triangles and I calculate their projection on the xy plane (z=0). Now I need to draw them in a boolean matrix (the xy plane) and fill them.<br /><br />For example, the projection of one triangle may appear as follows:<br /><br />00010000<br />00011000<br />00011100<br />00000000<br /><br />The matrix is 800x800 or bigger, and the array of projected triangles is an array of floats.<br /><br />A triangle has three points, p1, p2 and p3, so my vector looks something like<br /><br />p1x,p1y,p1z,p2x,p2y,p2z,p3x,p3y,p3z<br /><br />Obviously the z components don't count in this context.<br /><br />I would like to hear opinions/suggestions/tips from someone more expert than me :) (and it's not that difficult ^^)<br /><br />So far I have thought of two main approaches:<br /><br />- Every thread draws its own triangle, starting from p1 and incrementing x and y (with a specially calculated ratio). Then it fills it.<br /><br />- Assign the calculation of each triangle to a full block. First I calculate all the intersections between the grid (i.e. x=0, ..., x=n and y=0, ..., y=n) and the triangle sides. Then I sweep my boolean matrix (vertically, horizontally or diagonally) and set to true the little squares (1x1) that have an intersection on one of their sides.<br /><br /><br /><br />What do you think?]]></description>
   </item>
      <item>
      <title>sample code running</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/2021/sample-code-running</link>
      <pubDate>Thu, 01 Dec 2011 01:07:19 -0500</pubDate>
      <dc:creator>alphato</dc:creator>
      <guid isPermaLink="false">2021@/devforum/discussions</guid>
      <description><![CDATA[It may be a stupid question.<br />Please bear with me, I'm a beginner.<br />Question: If I run the sample code from the SDK browser, it runs normally. But if I launch the exe files in the folder directly, a "CUDA driver is insufficient ......" message appears in the command prompt window. Why? Is the exe file not the same?]]></description>
   </item>
      <item>
      <title>Arbitrary number of textures?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1446/arbitrary-number-of-textures</link>
      <pubDate>Fri, 11 Nov 2011 08:44:32 -0500</pubDate>
      <dc:creator>sirpalee</dc:creator>
      <guid isPermaLink="false">1446@/devforum/discussions</guid>
      <description><![CDATA[Hello Everyone!<br /><br />I need to use a bunch of textures in my kernel, and I don't know the exact number of textures in advance. Until now I used simple arrays and did the filtering myself, but it would be more efficient to use the hardware texturing. <br /><br />Looking at the sample projects, each of them defines a texture reference in the kernel code and uses it later. Can I somehow create those references in the host code and pass them to the kernel (for example as an array)?<br /><br />According to the PTX ISA reference (3.0), I need to use 3 parameters (destination, texture reference, coordinates). Is there any way to tell the hardware to sample the first, second, etc. texture, and to assign my textures to those texture slots (and later use an array of textures)?<br /><br />Or would the best solution be to build a texture atlas? (If possible I don't want to do that...)<br /><br />I'm using the driver API and the latest 4.1 RC SDK, and I can freely add even inline PTX to my kernel.<br /><br />Cheers, Pal.]]></description>
   </item>
      <item>
      <title>New cuda version (4.1?)</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1216/new-cuda-version-4-1</link>
      <pubDate>Wed, 02 Nov 2011 08:18:27 -0400</pubDate>
      <dc:creator>kalman</dc:creator>
      <guid isPermaLink="false">1216@/devforum/discussions</guid>
      <description><![CDATA[Are we going to have a new CUDA version soon? Any hints about the new features?]]></description>
   </item>
      <item>
      <title>Is it fundamentally possible for a kernel to execute in 0.0000ms</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/921/is-it-fundamentaly-possible-for-a-kernel-to-execute-in-0-0000ms</link>
      <pubDate>Tue, 11 Oct 2011 12:27:37 -0400</pubDate>
      <dc:creator>S4N1</dc:creator>
      <guid isPermaLink="false">921@/devforum/discussions</guid>
      <description><![CDATA[Hi! I'm trying to measure the time necessary to execute my kernel and it says 0.0 ms. Is that even possible?<br /><br />cudaEvent_t start, stop;<br />cudaEventCreate(&amp;start);<br />cudaEventCreate(&amp;stop);<br /><strong>cudaEventRecord( start, 0 );</strong><br /><br />    add&lt;&lt;&lt;1000,1000,10&gt;&gt;&gt;(dev_v_a, dev_v_b, dev_v_c );<br /><br />cudaEventRecord(stop, 0);<br />cudaEventSynchronize(stop);<br />float   elapsedTime;<br />cudaEventElapsedTime( &amp;elapsedTime,<br />start, stop );<br /><br /> printf( "Time to generate:  %f ms\n", elapsedTime );<br /><br />-------------------------------------<br /><br />Apologies, there was a fault in the code.]]></description>
   </item>
      <item>
      <title>CUDA 4.0 with AS 5.6</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/991/cuda-4-0-with-as-5-6</link>
      <pubDate>Thu, 13 Oct 2011 21:12:34 -0400</pubDate>
      <dc:creator>jackson312</dc:creator>
      <guid isPermaLink="false">991@/devforum/discussions</guid>
      <description><![CDATA[I just installed CUDA 4.0 on a Dell R5500 with a C2050 card in it. I can see the card using lspci, but I am not able to query the card using the deviceQuery which is built in the SDK. <br /><br />I have used CUDA 4.0 successfully on a Dell Optiplex 740 with a 9800GT card on AS 5.5. Are there any issues with CUDA 4.0 and AS 5.6? <br /><br />I did not have any problems getting the development driver to build or load. I got the SDK to compile fine.<br /><br />Thanks,<br /><br />Jackson]]></description>
   </item>
      <item>
      <title>memory alloc</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/721/memory-alloc</link>
      <pubDate>Tue, 27 Sep 2011 06:07:14 -0400</pubDate>
      <dc:creator>irivahid</dc:creator>
      <guid isPermaLink="false">721@/devforum/discussions</guid>
      <description><![CDATA[Hi, my name is Muhammad.<br />I have a question about working with CUDA.<br />When I pass a two-dimensional array to a __global__ function, the program compiles without errors,<br />but I get the following error when running it:<br />%a.out'died due to signal 11(invalid memory refrence).<br />warning:can not tell what pointer points to,assuming global memory space.<br />Can you please help me,<br />or tell me what to submit in order to resolve the issue?]]></description>
   </item>
      </channel>
</rss>
