<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
      <title>Tagged with cuda - NVIDIA Developer Forums</title>
      <link>http://forums.developer.nvidia.com/devforum/discussions/tagged/cuda/feed.rss</link>
      <pubDate>Wed, 16 May 2012 17:31:27 -0400</pubDate>
         <description>Tagged with cuda - NVIDIA Developer Forums</description>
   <language>en-CA</language>
   <atom:link href="/devforum/discussions/tagged/cuda/feed.rss" rel="self" type="application/rss+xml" />
   <item>
      <title>Tesla vs GTX560M this is weird!</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8311/tesla-vs-gtx560m-this-is-wierd</link>
      <pubDate>Wed, 16 May 2012 16:32:03 -0400</pubDate>
      <dc:creator>Bleiner</dc:creator>
      <guid isPermaLink="false">8311@/devforum/discussions</guid>
      <description><![CDATA[Hello Everyone,<br /><br />I have been porting many of the more computationally intensive portions of my code from MATLAB to .ptx GPU code.  I have been doing the development on my laptop, where I have seen HUGE increases in speed, achieving at least a 10x speedup in a lot of areas.  All my work is double precision and involves millions of objects.  My laptop with a gaming GPU (which by design has hampered double-precision performance and fewer processing cores) can complete a task in about 46 seconds.  <br /><br />I have a desktop with a Tesla C2075 that typically outperforms my laptop by a factor of two.  When I bring this code over to the Tesla machine, it takes anywhere from 42 to 48 seconds to complete the same task.  Does anyone have any idea why this would be?  <br /><br />The only thing that comes to mind is that I upgraded my laptop to the 301.40 driver version to use Nsight Visual Studio, while the Tesla machine is still on 301.32 (or something like that).  When I attempted to upgrade the Tesla machine, it appeared that the 301.40 drivers had been removed from the website.  When I upgraded to the beta CUDA 5 drivers, version 301.53 (I think), MATLAB no longer recognized that there was a GPU attached to the system.  <br /><br />Could the issue be the driver?  Did it improve that much from 301.27 to 301.40?  Does it have anything to do with the GPU on the laptop being compute capability 2.1 while the Tesla is 2.0?  Is there a memory manager issue that 2.1 handles a lot better?  <br /><br />The strange thing is that the Tesla USED to perform at twice the speed of the laptop (and that includes all the overhead in MATLAB that takes place on the CPU), so the Tesla must have been significantly more than 2x faster.  <br /><br />Any thoughts?<br /><br />Thanks<br />Ben<br /><br />As an afterthought, here are the specs of the machines in question: 
<br />Tesla workstation<br />Dell Precision T5500<br />Xeon E5620 2.4 GHz<br />12 GB RAM<br />Tesla C2075 with 6 GB RAM, driver 301.32<br /><br />Laptop<br />Asus G74S<br />Core i7 2670QM 2.2 GHz<br />12 GB RAM<br />GTX 560M 3 GB RAM, driver 301.40<br />]]></description>
   </item>
      <item>
      <title>Solve linear system in OptiX / Dynamic memory</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8306/solve-linear-system-in-optix-dynamic-memory</link>
      <pubDate>Wed, 16 May 2012 15:28:28 -0400</pubDate>
      <dc:creator>xindon</dc:creator>
      <guid isPermaLink="false">8306@/devforum/discussions</guid>
      <description><![CDATA[Hey OptiX experts :-),<br /><br />In order to implement a moving least squares point set surface algorithm, I need to solve a linear system <strong>for each ray</strong>. The matrix sizes are variable and can change from frame to frame.<br /><br />1) As OptiX does not allow dynamic memory allocation (free/malloc or new/delete), how and where is the data of the linear system best stored? <br /><br />Each ray (or more precisely, each instance of the intersection program) needs its own local memory for the linear system. I thought of allocating a huge global buffer, but I'm not sure how to do the indexing so that the different programs don't interfere.<br /><br />2) Do I have to implement the solver manually, or are there better/faster/easier implementations already available (which might also solve the local memory problem above)?<br /><br />Thank you very much in advance!<br /><br />Regards<br />Tim<br /><br />]]></description>
   </item>
      <item>
      <title>downloading of the installer doesn&#039;t complete</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8266/downloading-of-the-installer-doesnt-complete</link>
      <pubDate>Wed, 16 May 2012 05:59:53 -0400</pubDate>
      <dc:creator>mathiapeter</dc:creator>
      <guid isPermaLink="false">8266@/devforum/discussions</guid>
      <description><![CDATA[Hi, I had Nsight Visual Studio RC2 installed and everything worked fine, but now when I try to download the final 64-bit version, it shows that the installer is about 950 MB, but the download always stops at 51%, so I can't run the installer. What could be the problem? I have tried 3 internet browsers, but the result is still the same.]]></description>
   </item>
      <item>
      <title>Cuda, PTX and Debugging symbols</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6511/cuda-ptx-and-debugging-symbols</link>
      <pubDate>Wed, 28 Mar 2012 13:08:21 -0400</pubDate>
      <dc:creator>Bleiner</dc:creator>
      <guid isPermaLink="false">6511@/devforum/discussions</guid>
      <description><![CDATA[Hello everyone,<br /><br />I have a question about .ptx files and debugging in Visual Studio 10 Professional.  I am attempting to write some PTX code to integrate into MATLAB.  However, I am ONLY writing the __global__ functions; I have not written a main function or any host code.  The functions are fairly simple, but I want to be able to debug my code while MATLAB is running it.  I am compiling my code using nvcc -ptx 'functionname.cu', and when I try nvcc -G -ptx 'functionname.cu' to get debugging information, nothing is produced except the .ptx file.  I should mention that I am relatively new to Visual Studio, which is why I am compiling from the command line.  <br /><br />I believe that if I had the symbols generated, I could attach to the MATLAB process and debug it.  Since .ptx is supposed to be just-in-time compiled code, does it not allow me to have debugging symbols?  <br /><br />Any help would be greatly appreciated.<br />Ben]]></description>
   </item>
      <item>
      <title>CUDA: clock cycles for division of one float value</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8281/cuda-clock-cycles-for-division-of-one-float-value</link>
      <pubDate>Wed, 16 May 2012 08:55:14 -0400</pubDate>
      <dc:creator>hahnerehm</dc:creator>
      <guid isPermaLink="false">8281@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br />The CUDA C Programming Guide has a table showing how many arithmetic instructions per clock cycle a multiprocessor can perform.<br />A division is performed by a couple of multiply-add instructions. But how many multiply-add instructions are needed to perform a division on a 32-bit float value?<br /><br />Thanks in advance,<br />Johannes]]></description>
   </item>
      <item>
      <title>nvvp start error code=13 (Linux)</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8211/nvv-start-error-code13-linux</link>
      <pubDate>Tue, 15 May 2012 08:04:16 -0400</pubDate>
      <dc:creator>Da Ma</dc:creator>
      <guid isPermaLink="false">8211@/devforum/discussions</guid>
      <description><![CDATA[Dear all,<br /><br />I want to profile my CUDA program. In the good old days I used computeprof without any problems. Since updating to CUDA 4.1 I have to use the new profiler, nvvp. <br /><br />Some system information:<br />Linux 3.2.1-gentoo-r2 x86_64<br />nvidia-drivers-295.41<br />dev-util/nvidia-cuda-sdk-4.1<br />dev-util/nvidia-cuda-toolkit-4.1<br />dev-java/sun-jdk-1.6.0.31<br />dev-java/sun-jre-bin-1.6.0.31<br /><br />When I start nvvp from the command line I get <br /><br /><code>JVM terminated. Exit code=13<br />/opt/cuda/libnvvp/jre/bin/java<br />-jar /opt/cuda/libnvvp/plugins/org.eclipse.equinox.launcher_1.1.0.v20100507.jar<br />-os linux<br />-ws gtk<br />-arch x86_64<br />-showsplash<br />-launcher /opt/cuda/libnvvp/nvvp<br />-name Nvvp<br />--launcher.library /opt/cuda/libnvvp/plugins/org.eclipse.equinox.launcher.gtk.linux.x86_64_1.1.1.R36x_v20100810/eclipse_1309.so<br />-startup /opt/cuda/libnvvp/plugins/org.eclipse.equinox.launcher_1.1.0.v20100507.jar<br />-exitdata a848011<br />-data @user.home/nvvp_workspace<br />-vm /opt/cuda/libnvvp/jre/bin/java<br />-vmargs<br />-jar /opt/cuda/libnvvp/plugins/org.eclipse.equinox.launcher_1.1.0.v20100507.jar <br /></code><br /><br />This tells me nothing because I'm not much of a Java expert... maybe someone can tell me what went wrong.  Should I use a newer version of Sun Java?<br /><br />Thanks a lot in advance.<br /><br />Best,<br />David]]></description>
   </item>
      <item>
      <title>Cuda FreeBSD</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8271/cuda-freebsd</link>
      <pubDate>Wed, 16 May 2012 06:18:09 -0400</pubDate>
      <dc:creator>DJs3000</dc:creator>
      <guid isPermaLink="false">8271@/devforum/discussions</guid>
      <description><![CDATA[Can I use CUDA on FreeBSD? I have a GeForce 460.]]></description>
   </item>
      <item>
      <title>Breakpoints are not hit using Nsight Visual Studio</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8216/breakpoint-dont-be-hit-using-nsight-visual-studio</link>
      <pubDate>Tue, 15 May 2012 08:49:16 -0400</pubDate>
      <dc:creator>froure</dc:creator>
      <guid isPermaLink="false">8216@/devforum/discussions</guid>
      <description><![CDATA[Hi!<br /><br />I'm trying to debug with Nsight. I set some breakpoints, but the debugger doesn't stop execution there. During debugging, these breakpoints appear with a warning icon, and when I hover the mouse over one, it says: "The breakpoint will not currently be hit. CUDA: no source correspondence for breakpoint".<br /><br />I tried to find a solution in the other forums, but I couldn't.<br /><br />Can you help me please?<br /><br />Thanks!! :D]]></description>
   </item>
      <item>
      <title>Can&#039;t use GPU local debugging with Nsight</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7061/cant-use-gpu-local-debugging-with-nsight</link>
      <pubDate>Sun, 15 Apr 2012 03:58:50 -0400</pubDate>
      <dc:creator>laz007</dc:creator>
      <guid isPermaLink="false">7061@/devforum/discussions</guid>
      <description><![CDATA[Hello!<br />I'm using a GF 8400M GS on an x64 Windows 7 machine with Visual Studio 2008.<br />I installed the new beta driver (301) and Nsight 2.2. When I tried the tutorial, nothing happened when I put breakpoints in CUDA kernels.<br />What could be the reason for that problem?<br />Does my GPU support local debugging? I ask because the documentation said that only selected GPUs support it.]]></description>
   </item>
      <item>
      <title>Linker Issue in building a CUDA application in Visual Studio 2010</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4766/linker-issue-in-building-a-cuda-application-in-visual-studio-2010</link>
      <pubDate>Thu, 16 Feb 2012 03:56:49 -0500</pubDate>
      <dc:creator>gladman</dc:creator>
      <guid isPermaLink="false">4766@/devforum/discussions</guid>
      <description><![CDATA[Reposted from old forum.<br /><br />I am trying to build an application with Visual Studio 2010 and Nsight 2.1 hosted on Windows 7 (x64) using the 4.0 and/or the 4.1 build customisation rules.  <br /><br />I am using object file output format with the default filename for the output object files - $(IntDir)%(FileName)%(Extension).obj, which correctly generates all the object files in the $(IntDir) directory.   But after all the files have compiled correctly, the linker build step then fails with the message:<br /><br />LINK : fatal error LNK1181: cannot open input file '..\.obj'<br /><br />It appears that the file listing inputs for the linker is not being generated.<br /><br />If I perform the linker step manually, the application builds correctly and runs without problems. So it seems that there is an issue with the way the build customisations are producing the input file lists for the linker.   <br /><br />I would appreciate any advice that people can offer on what might be going wrong. <br />]]></description>
   </item>
      <item>
      <title>Best way to tackle a specific MIP with CUDA</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8011/best-way-to-tackle-a-specific-mip-with-cuda</link>
      <pubDate>Thu, 10 May 2012 04:57:51 -0400</pubDate>
      <dc:creator>Caelen</dc:creator>
      <guid isPermaLink="false">8011@/devforum/discussions</guid>
      <description><![CDATA[Hi everyone,<br /><br />I am currently working on a project and I would like to have your opinion.<br />The aim of this project is to solve a problem using GPGPU programming (so CUDA).<br /><br />The problem is quite simple: I have to put a lot of containers, with different weights and sizes, in a truck. The truck is large enough to hold all the containers, so it is not a bin packing problem. The truck is divided into N areas, each of which can hold exactly one container.<br />Each container has an order of “preferred places” and I have to optimize this preference. There are also some other constraints: some containers must be neighbors, one container can take one and a half places, etc…<br /><br />The problem can be modeled as a mixed integer program with an objective function.<br />I was thinking of creating N binary variables for each container and setting variable i to 1 if the container is in area i, 0 otherwise.<br /><br />The problem is that I will have a sparse matrix (density under 5%). So, from what I've read so far, it won’t be very useful to use GPGPU.<br /><br />I am also considering giving each container a single variable that can take a value from 1 to N. Then the matrix will be denser.<br /><br />My major problem is that I don’t know the best way to tackle the problem:<br />Use a branch &amp; bound solver and, during the relaxation, use an LP solver programmed in GPGPU.<br />Or use a constraint solver in GPGPU.<br />Or use a metaheuristic solver in GPGPU.<br /><br />So, I would like to have your opinion about my problem.<br /><br />Thanks a lot ]]></description>
   </item>
      <item>
      <title>Cuda 5 Preview: GPUDirect RDMA ?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8246/cuda-5-preview-gpudirect-rdma-</link>
      <pubDate>Tue, 15 May 2012 18:01:15 -0400</pubDate>
      <dc:creator>skybuck2000</dc:creator>
      <guid isPermaLink="false">8246@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br /><br />The CUDA Toolkit 5.0 preview download page mentions the following, but I cannot find any <br />further documentation about what this is:<br /><br />"<br />GPUDirect RDMA provides fastest possible communication between GPUs and other PCI-E devices<br /> • Direct memory access (DMA) supported between NIC and GPU without the need for CPU-side data buffering<br />• Significantly improved MPISendRecv efficiency between GPU and other nodes in a network<br />• Eliminates CPU bandwidth and latency bottlenecks<br />• Works with variety of 3rd party network and storage devices<br />"<br /><br />Are these just general hardware/driver improvements, or can I somehow program CUDA to communicate with network cards (NICs)? <br /><br />Sounds somewhat interesting, so I would like to know more about it...<br /><br />Are these perhaps Tesla-only features? Does Fermi support it, or only the new Kepler? <br /><br />I found this link/doc:<br /><br /><a href="http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/13-cuda_advmpi_keeneland.pdf" target="_blank" rel="nofollow">http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/13-cuda_advmpi_keeneland.pdf</a><br /><br />It's somewhat older (?) and it mentions Linux only. Has this now changed, and is it available for Windows too?<br /><br />OK, I found some official documentation here:<br /><br /><a href="http://developer.nvidia.com/gpudirect" target="_blank" rel="nofollow">http://developer.nvidia.com/gpudirect</a><br /><br />Hmm, so far it seems to be for Tesla/data center products and Linux...<br /><br />I guess this is not for desktop/Windows (yet?) and probably requires a special network card and/or drivers? Hmm...<br /><br />Bye,<br />  Skybuck.<br />]]></description>
   </item>
      <item>
      <title>Can I redistribute nvmex and nvopts.sh?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8241/can-i-redistribute-nvmex-and-nvopts-sh</link>
      <pubDate>Tue, 15 May 2012 17:04:08 -0400</pubDate>
      <dc:creator>jeffblanchard</dc:creator>
      <guid isPermaLink="false">8241@/devforum/discussions</guid>
      <description><![CDATA[I want to release a CUDA-C (nvmex) package that I have nearly finished.  I am working on the licensing and such.  I cannot seem to find a definitive answer regarding including nvmex and nvopts.sh in my package.  Am I permitted to include these files in my package?]]></description>
   </item>
      <item>
      <title>Kepler Quadro?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8236/kepler-quadro</link>
      <pubDate>Tue, 15 May 2012 15:43:41 -0400</pubDate>
      <dc:creator>Bleiner</dc:creator>
      <guid isPermaLink="false">8236@/devforum/discussions</guid>
      <description><![CDATA[Hello Everyone,<br /><br />Not to be starting rumors, but has anyone heard anything about when the new Kepler Quadro cards will be coming out?   I am responsible for selecting new laptops for the design department at my company.  The 5010M looks to be a great card, don't get me wrong, but with my code starting to rely more on CUDA, I want to get the most powerful GPU for double-precision work I can find.  I would hate to drop a LARGE sum of money on a laptop only to have the GPU be obsolete a month later, especially when the GPU is a major component of the cost.  <br /><br />Any rumors?  Any employees around who can deny knowing anything while winking at the same time?<br /><br />Thanks<br />Ben]]></description>
   </item>
      <item>
      <title>Matlab, Mex and Cuda</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8231/matlab-mex-and-cuda</link>
      <pubDate>Tue, 15 May 2012 15:33:54 -0400</pubDate>
      <dc:creator>Bleiner</dc:creator>
      <guid isPermaLink="false">8231@/devforum/discussions</guid>
      <description><![CDATA[Hello Everyone,<br /><br />I have a question about integrating CUDA code with C code.  Here is my dilemma:  I am working in a MATLAB environment, and I want to introduce some CUDA C code to help speed up some applications.  I can do that using .ptx files without much trouble, but that is strictly for GPU-only code called by MATLAB. What if I wanted to use host C code to do some operations faster than MATLAB, move data to the GPU, and process it from within the C code?  <br /><br />I can compile C code in MATLAB using the MEX function, but as far as I know MEX won't call NVCC as a compiler (only C compilers).  So my question is: is there a way to call precompiled CUDA code (insofar as PTX is compiled) from my C function, such that my C code will compile using a standard C compiler?  Or is there another format I can compile my CUDA code to, such that I can just include it during compilation (without the need for the CUDA code to be recompiled)?  I am new to C and CUDA programming, so I am unfamiliar with a lot of the compiler issues and tricks. <br /><br />Thanks<br />Ben<br />]]></description>
   </item>
      <item>
      <title>How do I use &quot;prof_trigger&quot; (user profile triggers) from my kernel?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8226/how-do-i-use-prof_trigger-user-profile-triggers-from-my-kernel</link>
      <pubDate>Tue, 15 May 2012 13:07:03 -0400</pubDate>
      <dc:creator>m4dc4p</dc:creator>
      <guid isPermaLink="false">8226@/devforum/discussions</guid>
      <description><![CDATA[The CUPTI Event API provides counters for user profiling (prof_trigger_00 through prof_trigger_07).<br /><br />I can figure out how to read those counters, but how do I write to them, or in some other way use those counters, from my kernel?<br /><br />I am using a Tesla (compute capability 1.1) GPU.<br /><br />Thanks!]]></description>
   </item>
      <item>
      <title>Help adding external force to particles</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8111/help-adding-external-force-to-particles</link>
      <pubDate>Sat, 12 May 2012 10:29:45 -0400</pubDate>
      <dc:creator>dartwing17</dc:creator>
      <guid isPermaLink="false">8111@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br /><br />I have run a wind simulation in Flow-3D and have exported a set of wind-vectors, which describe<br />a position and a velocity. I want to use these wind-vectors in my particle system, so that when a particle is inside a wind-vector position (lets say +- 0.5), the velocity of the wind-vector should be added to the particle.<br /><br />I have implemented a way to do this, but it is very inefficient. I was just wondering if anyone had any suggestion on how to make the algorithm more efficient.<br /><br />This is my implementation (sorry for the ugly code):<br /><br />glm::vec3 checkWindPosition(glm::vec3 particlepos)<br />{<br />	for(int i = 0; i &lt; flowWindGrid.size(); i++)<br />	     if(particlepos.x &lt; flowWindGrid[i].position.x+0.5f &amp;&amp; particlepos.x &gt; flowWindGrid[i].position.x-0.5f<br />			&amp;&amp; particlepos.y &lt; flowWindGrid[i].position.y+0.5f &amp;&amp; particlepos.y &gt; flowWindGrid[i].position.y-0.5f<br />			&amp;&amp; particlepos.z &lt; flowWindGrid[i].position.z+0.5f &amp;&amp; particlepos.z &gt; flowWindGrid[i].position.z-0.5f)<br />			return flowWindGrid[i].velocity;<br />	return glm::vec3(99,99,99);<br />}<br /><br />void update()<br />{<br /> physx::PxParticleFluidReadData* rd = pf-&gt;lockParticleFluidReadData();<br /> if (rd-&gt;validParticleRange &gt; 0)<br /> {<br />	 physx::PxStrideIterator particleFlags(rd-&gt;flagsBuffer);<br />	// iterate over valid particle bitmap<br />	for (physx::PxU32 w = 0; w &lt;= (rd-&gt;validParticleRange-1) &gt;&gt; 5; w++)<br />	{<br />	   for (physx::PxU32 b = rd-&gt;validParticleBitmap[w]; b; b &amp;= b-1)<br />	   {<br />	      physx::PxU32 index = (w &lt;&lt; 5 | physx::lowestSetBit(b));<br />	      const physx::PxVec3&amp; position = rd-&gt;positionBuffer[index];<br /><br />	      if (particleFlags[index] &amp; physx::PxParticleFlag::eVALID)<br />	      {<br />                glm::vec3 newvelocity = 
checkWindPosition(glm::vec3(position.x,position.y,position.z));<br />		if(newvelocity != glm::vec3(99,99,99))<br />		{<br />		   windIndexBuffer.push_back(index);<br />		   windForce.push_back(physx::PxVec3(newvelocity.x,newvelocity.y,newvelocity.z));<br />		}<br />	      }<br />	   }<br />	}<br /> }<br /><br /> rd-&gt;unlock();<br /><br /> physx::PxStrideIterator forceBuffer(&amp;windForce[0]);<br /> physx::PxStrideIterator indexData(&amp;windIndexBuffer[0]);<br /> pf-&gt;addForces(windForce.size(), indexData, forceBuffer, physx::PxForceMode::eFORCE);<br />}]]></description>
   </item>
      <item>
      <title>Errors with cuda RC2 visual profiler</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/2596/errors-with-cuda-rc2-visual-profiler</link>
      <pubDate>Fri, 16 Dec 2011 19:31:09 -0500</pubDate>
      <dc:creator>tcelvis</dc:creator>
      <guid isPermaLink="false">2596@/devforum/discussions</guid>
      <description><![CDATA[1. Execution starts normally (as indicated by output lines). After approx. 20 seconds the progress status switches to collecting results and then aborts with the following error dialog:<br />"Unable to locate CUDA libraries and establish connection with CUDA driver<br />Error com.nvidia.viper.jni.CuException: CUDA_ERROR_INVALID_VALUE"<br /><br />Some reruns give error ... CUDA_OUT_OF_MEMORY<br /><br />2. I noticed that when nvvp is launched, 42 processes show up, all looking identical. The "top" output for each line is as follows:<br />/usr/local/cuda/libnvvp/jre/bin/java -jar /usr/local/cuda/libnvvp/plugins/org.eclipse.equinox.launcher_1.1.0.v20100507.jar -os linux -ws gtk -arch x86_64 -showsplash -launcher /usr/local/cuda/libnvvp/nvvp -name Nvvp --launcher.library /usr/local/cuda/libnvvp/plugins/org.eclipse.equinox.launcher.gtk.linux.x86_64_1.1.1.R36x_v20100810/eclipse_1309.so -startup /usr/local/cuda/libnvvp/plugins/org.eclipse.equinox.launcher_1.1.0.v20100507.jar -exitdata 6e000f -data @user.home/nvvp_workspace -vm /usr/local/cuda/libnvv<br /><br />3. The Visual Profiler user's guide included with RC2 still references computeprof and not nvvp. computeprof was not part of this distribution.<br /><br />4. This application runs fine as the executable when not using the visual profiler. The version 4.0 computeprof also worked fine on this application.<br /><br />5. Using CentOS 5.5 Linux on a quad-hex core chassis containing 8 Fermi 2090 GPUs.<br />Utilizing stream ids running 24 threads with 3 threads sharing each GPU]]></description>
   </item>
      <item>
      <title>cudaFree returning cudaErrorMemoryAllocation - bug?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8181/cudafree-returning-cudaerrormemoryallocation-bug</link>
      <pubDate>Mon, 14 May 2012 13:38:35 -0400</pubDate>
      <dc:creator>ajsimmonds</dc:creator>
      <guid isPermaLink="false">8181@/devforum/discussions</guid>
      <description><![CDATA[I have been encountering a strange problem using the CUDA 4.2 tools, where our application eventually receives a cudaErrorMemoryAllocation error when trying to perform a cudaFree on cudaMalloc'd memory. The number of allocations and deallocations performed varies, but the problem can be reproduced in the app relatively easily. Once the error has been received, further frees and also cudaMemGetInfo continue to return the error.<br /><br />To further narrow down the error, I have also written a test program that simply allocates areas using cudaMalloc and, when this returns an out of memory error, releases one or more of the previously allocated areas to make space. This program, which launches no kernels, fails with the same symptoms. I have also tried this with the 4.0 tools and still receive the same error condition.<br /><br />If I limit the number of iterations such that the error is not encountered, then it is quite likely that the free memory value returned by cudaMemGetInfo is larger than the value the program started with.<br /><br />This looks for all the world to me as if there is a problem with the tracking of memory allocations within the SDK, so can anyone confirm this or possibly point me at things I may be doing wrong?!<br /><br />Many thanks<br />Andrew]]></description>
   </item>
      <item>
      <title>ArrayFire + Nsight???</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7996/arrayfire-nsight</link>
      <pubDate>Wed, 09 May 2012 21:08:48 -0400</pubDate>
      <dc:creator>sizheng</dc:creator>
      <guid isPermaLink="false">7996@/devforum/discussions</guid>
      <description><![CDATA[I'm trying ArrayFire, but it seems that I cannot debug ArrayFire code with Nsight.<br /><br />Does anybody know how to?]]></description>
   </item>
      <item>
      <title>Parallel Nsight 2.2 RC2 - No Source Available</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7886/parallel-nsight-2-2-rc2-no-source-available</link>
      <pubDate>Fri, 04 May 2012 16:35:28 -0400</pubDate>
      <dc:creator>wdrozd</dc:creator>
      <guid isPermaLink="false">7886@/devforum/discussions</guid>
      <description><![CDATA[When selecting CUDA Debugging with the memory checker enabled, I get a kernel crash with a window in Nsight that says "No Source Available". When I click on the link "Browse to Find Source", a message says "The source code cannot be displayed".<br /><br />My application compiles fine, so clearly it can find the source (for both the C code and the CUDA code).<br /><br />Also, I have no problem stopping at a breakpoint set in my kernel prior to the crash (grid launch failure).<br /><br />The call stack says "No active Cuda Kernels".<br /><br />Can you please help me determine how to set Nsight to detect the source?<br /><br />Thanks.]]></description>
   </item>
      <item>
      <title>simpleStreams example in SDK not working</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8101/simplestreams-example-in-sdk-not-working</link>
      <pubDate>Sat, 12 May 2012 03:36:29 -0400</pubDate>
      <dc:creator>madhur13490</dc:creator>
      <guid isPermaLink="false">8101@/devforum/discussions</guid>
      <description><![CDATA[I've installed the CUDA 4.1 GPUComputingSDK and GPUComputing toolkit. I'm trying to see the performance improvement for the simpleStreams example given in the src folder, but there seems to be a problem in the new version. The streamed version consistently takes more time than the non-streamed version. I have not modified the code. It seems there is a bug in the new examples.]]></description>
   </item>
      <item>
      <title>How to write a Qt project file when the Qt project contains a &#039;.cu&#039; file</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8156/how-to-write-qt-project-file-when-the-qt-project-contain-cu-file</link>
      <pubDate>Mon, 14 May 2012 05:39:15 -0400</pubDate>
      <dc:creator>licongsheng1206163com</dc:creator>
      <guid isPermaLink="false">8156@/devforum/discussions</guid>
      <description><![CDATA[Hello All! I have written a code file named deviceQuery.cu and it compiles successfully. Now I want to put it into a Qt project, and I want to know how to write the Qt project (.pro) file. Thanks!]]></description>
   </item>
      <item>
      <title>Matrix multiplication doesn&#039;t work! Output always different...</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7861/matrix-multiplication-doesnt-work-output-always-different-</link>
      <pubDate>Fri, 04 May 2012 07:47:28 -0400</pubDate>
      <dc:creator>Z0K4</dc:creator>
      <guid isPermaLink="false">7861@/devforum/discussions</guid>
      <description><![CDATA[Hello everyone...<br /><br /><br />Recently I started playing with the CUDA computing, and I want to write a kernel that will multiply matrices... So I started searching and found that SDK has an example for matrix multiplication. The problem is that I always get a different output meaning that the resulting matrix is always different. How is that possible? Anyway, I was unable to integrate nvcc with the Visual Studio so I couldn't use debugger to see what went wrong. Any help is much appreciated! Here is the code:<br /><br /><code>#include&lt;stdio.h&gt;<br />#include&lt;cuda.h&gt;<br />#include&lt;cuda_runtime.h&gt;<br />#include&lt;cuda_runtime_api.h&gt;<br />#include&lt;device_functions.h&gt;<br /><br />static void HandleError(cudaError_t err, const char *file, int line)<br />{<br />    if(err!=cudaSuccess){<br />		printf("%s in %s file at line %s\n", cudaGetErrorString(err), file, line);<br />		exit(EXIT_FAILURE);<br />    }<br />}<br /><br /><a href="/devforum/search?Search=%23define&amp;Mode=like">#define</a> HANDLE_ERROR(err) (HandleError(err, __FILE__, __LINE__))<br /><br /><a href="/devforum/search?Search=%23ifndef&amp;Mode=like">#ifndef</a> _MATRIXMUL_KERNEL_H_<br /><a href="/devforum/search?Search=%23define&amp;Mode=like">#define</a> _MATRIXMUL_KERNEL_H_<br /><br /><a href="/devforum/search?Search=%23define&amp;Mode=like">#define</a> BLOCK_SIZE 4<br /><a href="/devforum/search?Search=%23define&amp;Mode=like">#define</a> TILE_SIZE 4<br /><br />__global__ void matrixMul( int* A, int* B, int* C, int wA, int wB)<br />{<br />	int bx = blockIdx.x;<br />    int by = blockIdx.y;<br /><br />	int tx = threadIdx.x;<br />	int ty = threadIdx.y;<br /><br /><br />	int aBegin = wA * BLOCK_SIZE * by;<br /><br />	int aEnd   = aBegin + wA - 1;<br /><br />	int aStep  = BLOCK_SIZE;<br /><br />	int bBegin = BLOCK_SIZE * bx;<br /><br />	int bStep  = BLOCK_SIZE * wB;<br /><br />	float Csub=0;<br /><br />	for (int a = aBegin, b = bBegin; a &lt;= 
aEnd; a += aStep, b += bStep) <br />	{<br />		__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];<br /><br />		__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];<br /><br />		As[ty][tx] = A[a + wA * ty + tx];<br />		Bs[ty][tx] = B[b + wB * ty + tx];<br /><br />		__syncthreads();<br /><br />#pragma unroll<br /><br />		for (int k = 0; k &lt; BLOCK_SIZE; ++k)<br />			Csub += As[ty][k] * Bs[k][tx];<br /><br />		__syncthreads();<br />	}<br /><br />	int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;<br />	C[c + wB * ty + tx] = Csub;<br />}<br /><br />#endif<br /><br />int main()<br />{<br />	int *a=(int*)malloc(BLOCK_SIZE*BLOCK_SIZE*sizeof(int));<br />	int *b=(int*)malloc(BLOCK_SIZE*BLOCK_SIZE*sizeof(int));<br />	int *c=(int*)malloc(BLOCK_SIZE*BLOCK_SIZE*sizeof(int));<br /><br />	int *dev_a, *dev_b, *dev_c;<br /><br />	HANDLE_ERROR(cudaMalloc((void**)&amp;dev_a, BLOCK_SIZE*BLOCK_SIZE*sizeof(int*)));<br />	HANDLE_ERROR(cudaMalloc((void**)&amp;dev_b, BLOCK_SIZE*BLOCK_SIZE*sizeof(int*)));<br />	HANDLE_ERROR(cudaMalloc((void**)&amp;dev_c, BLOCK_SIZE*BLOCK_SIZE*sizeof(int*)));<br /><br />	for(int i=0; i&lt;BLOCK_SIZE*BLOCK_SIZE; i++)<br />	{<br />		a[i]=1;<br />		b[i]=2;<br />	}<br /><br />	HANDLE_ERROR(cudaMemcpy(dev_a, a, BLOCK_SIZE*BLOCK_SIZE*sizeof(int), cudaMemcpyHostToDevice));<br />	HANDLE_ERROR(cudaMemcpy(dev_b, b, BLOCK_SIZE*BLOCK_SIZE*sizeof(int), cudaMemcpyHostToDevice));<br /><br />	matrixMul&lt;&lt;&lt;BLOCK_SIZE, BLOCK_SIZE&gt;&gt;&gt;(dev_a, dev_b, dev_c, BLOCK_SIZE, BLOCK_SIZE);<br /><br />	HANDLE_ERROR(cudaMemcpy(c, dev_c, BLOCK_SIZE*BLOCK_SIZE*sizeof(int), cudaMemcpyDeviceToHost));<br /><br />	for(int i=0; i&lt;BLOCK_SIZE*BLOCK_SIZE; i++)<br />	{<br />		if(i%BLOCK_SIZE==0)<br />			printf("\n\n");<br />		printf("%d\t", a[i]);<br />	}<br /><br />	for(int i=0; i&lt;BLOCK_SIZE*BLOCK_SIZE; i++)<br />	{<br />		if(i%BLOCK_SIZE==0)<br />			printf("\n\n");<br />	
	printf("%d\t", b[i]);<br />	}<br /><br />	for(int i=0; i&lt;BLOCK_SIZE*BLOCK_SIZE; i++)<br />	{<br />		if(i%BLOCK_SIZE==0)<br />			printf("\n\n");<br />		printf("%d\t", c[i]);<br />	}<br /><br />	cudaFree(dev_a);<br />	cudaFree(dev_b);<br />	cudaFree(dev_c);<br /><br />	return 0;<br />}</code><br /><br /><br />Any suggestion?<br /><br />]]></description>
   </item>
      <item>
      <title>Problems encountered using nppiGetAffineTransform and nppiWarpAffine_8u_C3R</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7436/problems-encountered-using-nppigetaffinetransform-and-nppiwarpaffine_8u_c3r</link>
      <pubDate>Mon, 23 Apr 2012 15:55:24 -0400</pubDate>
      <dc:creator>lancewellspring</dc:creator>
      <guid isPermaLink="false">7436@/devforum/discussions</guid>
      <description><![CDATA[I have 2 problems.<br />1) Function call to nppiGetAffineTransform is returning NPP_AFFINE_QUAD_INCORRECT_WARNING.<br /><em>parameter srcRoi is:</em><br />x	0	int<br />y	0	int<br />width	5000	int<br />height	5000	int<br /><em>parameter quad is:</em><br />[0][0]	0.00000000000000000	double<br />[0][1]	102.69965808786287	double<br />[1][0]	5128.9289884048958	double<br />[1][1]	0.00000000000000000	double<br />[2][0]	5230.9576202374628	double<br />[2][1]	5149.2023110261380	double<br />[3][0]	102.53232406430637	double<br />[3][1]	5251.8818716857804	double<br /><br /><strong>Does the function expect the points of quad in a specific order? Right now they are: topleft, topright, botleft, botright.<br /></strong><br /><br />2) Function call to nppiWarpAffine_8u_C3R is returning NPP_STEP_ERROR.<br /><em>parameter pSrc is 75000000 bytes. <br />parameter srcSize is:</em><br />width	5000	int<br />height	5000	int<br /><em>parameter nSrcStep is 15000. <br />parameter srcRoi is:</em><br />x	0	int<br />y	0	int<br />width	5000	int<br />height	5000	int<br /><em>parameter pDst is 82419636 bytes.<br />parameter nDstStep is 15693.<br />parameter dstRoi is:</em><br />x	0	int<br />y	0	int<br />width	5231	int<br />height	5252	int<br /><em>parameter coeffs is:</em><br />[0][0]	1.0259909958801552	double<br />[0][1]	0.020409808328179044	double<br />[0][2]	0.00000000000000000	double<br />[1][0]	-0.020544040425657707	double<br />[1][1]	1.0300464714995274	double<br />[1][2]	102.69965808786287	double<br />parameter interpolation is NPPI_INTER_CUBIC.<br /><br />I don't have any idea what is going wrong here.<br /><br />Any help is greatly appreciated!<br /><br />I'm running on a Windows 7 machine, with a Quadro FX 1800M, using Visual Studio 2010. Running the basic CUDA examples works just fine.]]></description>
   </item>
      <item>
      <title>Double Double or Arbitrary Precision</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8141/double-double-or-arbitrary-precision</link>
      <pubDate>Sun, 13 May 2012 21:43:53 -0400</pubDate>
      <dc:creator>rdunn</dc:creator>
      <guid isPermaLink="false">8141@/devforum/discussions</guid>
      <description><![CDATA[I need to compare high-precision results (i.e., at least 2x the precision of double) with results from my CUDA kernels in double and single precision, to establish how accurate they are compared to other implementations. <br /><br />I have struggled and failed to get GMP/MPIR to build on a Windows platform with Visual Studio 2010. I've also tried building GMP/MPIR with MinGW, to no avail. <br /><br />Does NVIDIA have a double-double library? I saw a forum post about it (<a href="http://forums.nvidia.com/index.php?showtopic=218452" target="_blank" rel="nofollow">http://forums.nvidia.com/index.php?showtopic=218452</a>), but I am unable to find the actual file. <br />]]></description>
   </item>
      <item>
      <title>Trouble with processing image in rows</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8121/trouble-with-processing-image-in-rows</link>
      <pubDate>Sun, 13 May 2012 05:29:07 -0400</pubDate>
      <dc:creator>laz007</dc:creator>
      <guid isPermaLink="false">8121@/devforum/discussions</guid>
      <description><![CDATA[Hello!<br />I'm making an image filter that processes the image in rows.<br />I have been trying for two weeks to figure out why it doesn't work when executed in parallel.<br />I use only threads in the Y dimension. Is that a problem?<br /><br /><br /><br />Here is part of the code:<br />BLOCKDIM_Y=16;<br />....<br />dim3 threads(1, BLOCKDIM_Y);<br />dim3 grid(1,  iDivUp(h, BLOCKDIM_Y));<br /><br />my_CUDA_filter&lt;&lt;&lt; grid, threads&gt;&gt;&gt;(sumR, sumG, sumB, mask,h,w, inD, outD, test);<br />...<br /><br />__global__ void my_CUDA_filter_simple222(int* sumR, int* sumG, int* sumB, int mask,int h,int w, u_int8_t *in, u_int8_t *out, int* test){<br />...<br />int iy = blockDim.y * blockIdx.y + threadIdx.y;<br />int ix=0;<br /><br />	if (iy&gt;=m &amp;&amp; iy&lt;(h-m)) {<br /><br />	//for(iy=m; iy&lt;h-m; iy++){<br /><br />	 ...<br />	for(ix=m+1;ix&lt;w-m;ix++){<br />	 ...<br />	 }<br />}<br /><br />The resulting image is messed up...<br />If I use for(iy=m; iy&lt;h-m; iy++){ <br />and run the kernel with a single thread (that is, with no parallelization), everything is OK.<br /><br />Any ideas?<br /><br />]]></description>
   </item>
      <item>
      <title>Portable pinned memory and multiple GPUs: Performance and stability</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7386/portable-pinned-memory-and-multiple-gpus-performance-and-stability</link>
      <pubDate>Sun, 22 Apr 2012 13:47:59 -0400</pubDate>
      <dc:creator>tbenson</dc:creator>
      <guid isPermaLink="false">7386@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br /><br />I am having some problems using portable pinned memory to share one pinned buffer between multiple GPUs.  I have two separate issues:<br /><br />	 1) Performance of transfers for the GPUs not corresponding to the allocation context are massively degraded; and<br />	 2) It tends to crash my Linux host and force a reboot.<br /><br />I included the code at the end.  There are several flags at the top of the source file to control behavior, including NDEVICES, NBUFFERS, and USE_PINNED_MEMORY.  NDEVICES is the number of GPUs to use, NBUFFERS is the number of buffers to be allocated, and USE_PINNED_MEMORY determines whether or not the buffers are pinned.  The case that fails is NDEVICES = 2, NBUFFERS = 1, and USE_PINNED_MEMORY = true. If I use as many buffers as devices, then things work with or without pinned memory.  It also works without pinned memory for any number of buffers.  However, with the failing case, I get the following:<br /><br />[host:portable_pinned]$ ./portable <br />id = 0, cudaMemcpy time = 22.32 ms<br />id = 0, val = 3.000000 (should be 3.000000)<br />id = 1, cudaMemcpy time = 5457.76 ms<br />id = 1, val = 6.000000 (should be 6.000000)<br /><br />Message from syslogd@host at Apr 22 13:25:16 ...<br /> kernel:[41786.826763] Stack:<br /><br />Message from syslogd@host at Apr 22 13:25:16 ...<br /> kernel:[41786.828257] Call Trace:<br /><br />Message from syslogd@host at Apr 22 13:25:16 ...<br /> kernel:[41786.845154] Code: f6 62 00 85 c0 74 10 e8 69 e4 65 00 0f 1f 00 eb 06 89 77 6c 89 4f 70 48 83 c5 10 5b c3 41 54 53 48 83 ec 08 48 83 ed 08 41 89 f4 &lt;39&gt; 77 6c 73 17 39 77 70 0f 87 ac 00 00 00 39 77 6c 73 09 39 77 <br /><br />The host at this point is only partially responsive and needs to be rebooted.  The system log is full of errors, but a sampling is attached.  This is using driver version 285.05.33, CUDA 4.1, Fedora 14, and kernel 2.6.35.6-45.  
The GPUs are two Tesla C2050s that reside in a Tesla S2050 compute server.  They are connected to the host via a single PCI-e cable.  This is a single host in a cluster, so updating the driver is not trivial, although I will do so if this is a known bug.<br /><br />In any case, I suspect that the kernel/driver error is just a bug as I have done something similar in the past without this problem.  However, I still had the poor performance in the past.  Above, the PCIe transfer to the CUDA context in which the allocation was not made is over 200x slower than the transfer for the context in which the allocation was made.  Is this normal?  The documentation just says that cudaHostAllocPortable allows pinned memory to be recognized by other contexts, but does not mention the performance implications of accessing the memory.<br /><br />Thanks for any help/comments,<br /><br />Tom<br /><br />The code is below.  The Timing class is just a wrapper that I have for host timing.  I can include it if needed, but already had to rework this email due to character limitations.  
The references can be commented out to compile the code.<br /><br /><code><br />#include &lt;cuda_runtime.h&gt;<br />#include &lt;cassert&gt;<br />#include &lt;cstdio&gt;<br />#include &lt;pthread.h&gt;<br />#include "timing.hpp"<br /><br />namespace<br />{<br />    const size_t BUFSIZE = 32*1024*1024;<br />    const int NDEVICES = 2;<br />    const int NBUFFERS = 1;<br />    const bool USE_PINNED_MEMORY = true;<br />}<br /><br />struct Params<br />{<br />    float *buf;<br />    int id;<br />};<br /><br />__global__ void test_kernel(float *buf, float val) { buf[0] = val; }<br /><br />void *gpu_thread(void *v)<br />{<br />    Params *params = (Params *) v;<br /><br />    cudaSetDevice(params-&gt;id);<br /><br />    float *dev_buf;<br />    assert(cudaMalloc((void **) &amp;dev_buf, sizeof(float)*BUFSIZE) == cudaSuccess);<br /><br />    double start = Timing::ElapsedTimeMs();<br />    assert(cudaMemcpy(dev_buf, params-&gt;buf, sizeof(float)*BUFSIZE, cudaMemcpyHostToDevice) == cudaSuccess);<br />    double elapsed = Timing::ElapsedTimeMs() - start;<br />    printf("id = %d, cudaMemcpy time = %.2f ms\n", params-&gt;id, elapsed);<br /><br />    test_kernel&lt;&lt;&lt;1,1&gt;&gt;&gt;( dev_buf, (params-&gt;id+1) * 3.0f );<br />    assert(cudaThreadSynchronize() == cudaSuccess);<br /><br />    float retval;<br />    assert(cudaMemcpy(&amp;retval, dev_buf, sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess);<br /><br />    printf("id = %d, val = %f (should be %f)\n", params-&gt;id, retval, (params-&gt;id+1)*3.0f);<br /><br />    assert(cudaFree(dev_buf) == cudaSuccess);<br /><br />    return NULL;<br />}<br /><br />int main(int argc, char **argv)<br />{<br />   
 float *pinned[NDEVICES];<br /><br />    assert(NBUFFERS &lt;= NDEVICES);<br /><br />    for (int i = 0; i &lt; NBUFFERS; ++i)<br />    {<br />        assert(cudaSetDevice(i) == cudaSuccess);<br />        if (USE_PINNED_MEMORY)<br />        {<br />            assert(cudaHostAlloc((void **) &amp;pinned[i], sizeof(float)*BUFSIZE, cudaHostAllocPortable) == cudaSuccess);<br />        }<br />        else<br />        {<br />            pinned[i] = new float[BUFSIZE];<br />        }<br />        for (size_t k = 0; k &lt; BUFSIZE; ++k) { pinned[i][k] = 1.0f; }<br />    }<br /><br />    pthread_t tid[NDEVICES];<br />    Params params[NDEVICES];<br /><br />    for (int i = 0; i &lt; NDEVICES; ++i)<br />    {<br />        params[i].id = i;<br />        params[i].buf = pinned[i%NBUFFERS];<br />        assert(pthread_create(tid+i, NULL, gpu_thread, (void *) &amp;params[i]) == 0);<br />    }<br /><br />    for (int i = 0; i &lt; NDEVICES; ++i)<br />    {<br />        assert(pthread_join(tid[i], NULL) == 0);<br />    }<br /><br />    for (int i = 0; i &lt; NBUFFERS; ++i)<br />    {<br />        if (USE_PINNED_MEMORY)<br />        {<br />            assert(cudaFreeHost(pinned[i]) == cudaSuccess);<br />        }<br />        else<br />        {<br />            delete [] pinned[i];<br />        }<br />    }<br /><br />    return 0;<br />}<br /></code>]]></description>
   </item>
      <item>
      <title>nvcc 4.2 pragma unroll issue</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7741/nvcc-4-2-pragma-unroll-issue</link>
      <pubDate>Tue, 01 May 2012 15:42:47 -0400</pubDate>
      <dc:creator>dlowell</dc:creator>
      <guid isPermaLink="false">7741@/devforum/discussions</guid>
      <description><![CDATA[If the exit condition is i&lt;=nv-1, where nv is set from a macro (nv = NV, with #define NV 16), then the unroll will be implemented incorrectly.<br /><br />Example: <br /><br /><code>#define NV 16 <br />nv=NV;<br />int tid = threadIdx.x+blockDim.x*blockIdx.x;<br />#pragma unroll 2<br />for(int i=0;i&lt;=nv-1;i++){<br />  y[tid]+=a[i]*x[i*n+tid];<br />}</code><br /><br />With nvcc 4.2 the code above produces incorrect code, whereas nvcc 4.0 produces correct code. The code below produces correct output with nvcc 4.2.<br /><br /><code>#define NV 16 <br />nv=NV;<br />int tid = threadIdx.x+blockDim.x*blockIdx.x;<br />#pragma unroll 2<br />for(int i=0;i&lt;nv;i++){<br />  y[tid]+=a[i]*x[i*n+tid];<br />}</code><br /><br />Has anyone else seen this issue?]]></description>
   </item>
      <item>
      <title>Linker error with c function in .cu file</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7881/linker-error-with-c-function-in-cu-file</link>
      <pubDate>Fri, 04 May 2012 16:08:40 -0400</pubDate>
      <dc:creator>basementscientist</dc:creator>
      <guid isPermaLink="false">7881@/devforum/discussions</guid>
      <description><![CDATA[I've created a kernel inside a .cu file. Also inside the .cu file is a C++ function that calls<br />the kernel. Everything compiles OK, but at the final link step the C++ function is not visible to the rest of the program. How do I make the function visible?<br /><br />I am using Visual Studio 2010 on Windows 8, and the newest SDK and Toolkit.]]></description>
   </item>
      <item>
      <title>npp problems</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7321/npp-problems</link>
      <pubDate>Thu, 19 Apr 2012 11:05:15 -0400</pubDate>
      <dc:creator>lancewellspring</dc:creator>
      <guid isPermaLink="false">7321@/devforum/discussions</guid>
      <description><![CDATA[I have 2 problems.<br />1) Function call to <code>nppiGetAffineTransform</code> is returning NPP_AFFINE_QUAD_INCORRECT_WARNING.<br /><em>parameter srcRoi is:</em><br />x	0	int<br />y	0	int<br />width	5000	int<br />height	5000	int<br /><em>parameter quad is:</em><br />[0]	0x00000000002af1d0	double [2]<br />	[0]	0.00000000000000000	double<br />	[1]	102.69965808786287	double<br />[1]	0x00000000002af1e0	double [2]<br />	[0]	5128.9289884048958	double<br />	[1]	0.00000000000000000	double<br />[2]	0x00000000002af1f0	double [2]<br />	[0]	5230.9576202374628	double<br />	[1]	5149.2023110261380	double<br />[3]	0x00000000002af200	double [2]<br />	[0]	102.53232406430637	double<br />	[1]	5251.8818716857804	double<br /><br />Does the function expect the points of quad in a specific order?  Right now they are: topleft, topright, botleft, botright.<br /><br />2) Function call to <code>nppiWarpAffine_8u_C3R</code> is returning NPP_STEP_ERROR.<br /><em>parameter pSrc is 75000000 bytes.</em> <br /><em>parameter srcSize is:</em><br />width	5000	int<br />height	5000	int<br /><em>parameter nSrcStep is 15000.</em> <br /><em>parameter srcRoi is:</em><br />x	0	int<br />y	0	int<br />width	5000	int<br />height	5000	int<br /><em>parameter pDst is 82419636 bytes.</em><br /><em>parameter nDstStep is 15693.</em><br /><em>parameter dstRoi is:</em><br />x	0	int<br />y	0	int<br />width	5231	int<br />height	5252	int<br /><em>parameter coeffs is:</em><br /><br />coeffs	0x00000000002af328	double [2][3]<br />[0]	0x00000000002af328	double [3]<br />	[0]	1.0259909958801552	double<br />	[1]	0.020409808328179044	double<br />	[2]	0.00000000000000000	double<br />[1]	0x00000000002af340	double [3]<br />	[0]	-0.020544040425657707	double<br />	[1]	1.0300464714995274	double<br />	[2]	102.69965808786287	double<br /><em>parameter interpolation is NPPI_INTER_CUBIC.</em><br /><br />I dont have any idea what is going wrong here.<br /><br />Any help is greatly appreciated!<br /><br />I'm 
running on a Windows 7 machine, with a Quadro FX 1800M, using Visual Studio 2010.  Running the basic cuda examples works just fine.]]></description>
   </item>
      <item>
      <title>ME(Computer Science) Project</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7931/mecomputer-science-project</link>
      <pubDate>Mon, 07 May 2012 02:07:54 -0400</pubDate>
      <dc:creator>priancabhosale</dc:creator>
      <guid isPermaLink="false">7931@/devforum/discussions</guid>
      <description><![CDATA[I am not familiar with CUDA technology, but I am interested in it. I want to do my ME (CSE) project in this field. Can anybody suggest a topic, and where I can start?]]></description>
   </item>
      <item>
      <title>I wrote this program but it does not work; can you please tell me what is wrong with it? I'm a new programmer</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8006/i-write-this-program-but-its-not-work-please-can-you-tell-me-what-wrong-with-it-im-new-programmer</link>
      <pubDate>Thu, 10 May 2012 04:42:48 -0400</pubDate>
      <dc:creator>analdo</dc:creator>
      <guid isPermaLink="false">8006@/devforum/discussions</guid>
      <description><![CDATA[#include <br />#include <br />#define N 16<br /><br />__global__ void matAdd(float* A, float* B, float* C) <br />{ <br />		int i= threadIdx.x + threadIdx.y*blockDim.x; <br />		C[i]= A[i] + B[i];<br />	}<br />int main()<br />	{<br />		int i;<br />		int numBlocks = 1;<br />		size_t size = N* sizeof(float);<br />		dim3 threadsPerBlock (4, 4);<br /><br />		// Initialize the vectors <br />		float A[]= {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};<br />		float B[]= {1.0,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5};<br />		float C[N];			<br /><br />		// Allocate the vectors in GPU memory <br />		float* d_A; <br />		cudaMalloc(&amp;d_A, size); <br />		float* d_B; <br />		cudaMalloc(&amp;d_B, size); <br />		float* d_C; <br />		cudaMalloc(&amp;d_C, size);<br /><br />		// Copy the vectors from CPU memory to GPU memory <br />		cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice); <br />		cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);<br /><br /><br /><br />		// Launch the kernel<br />		matAdd&lt;&lt;&gt;&gt;(A, B, C);<br /><br />		cudaMemcpy(C,d_C,size,cudaMemcpyDeviceToHost);<br /><br />		printf("C= {");<br />		for(i=0;i			{<br />				printf("%2.2f ", C);<br />			}<br />		printf("}\n");<br />		cudaFree(d_A), cudaFree(d_B), cudaFree(d_C);<br />}<br />]]></description>
   </item>
      <item>
      <title>GPU-aware hash cell size ???</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7961/gpu-aware-hash-cell-size-</link>
      <pubDate>Wed, 09 May 2012 03:01:19 -0400</pubDate>
      <dc:creator>man82bs</dc:creator>
      <guid isPermaLink="false">7961@/devforum/discussions</guid>
      <description><![CDATA[Hi guys..<br /><br />I have recently been working on a project that uses a hash table.<br /><br />I am trying to use the GPU for computations on the data in each cell.<br /><br />My problem is deciding which cell size gives the best performance.<br /><br />For example, if there are 1M normalized points with a uniform distribution, <br /><br />how many points should be included in a cell? <br /><br />Generally, the cell size is decided by heuristic experiments, but I really need ideas.<br /><br />If anyone has an idea, please help me...<br /><br />Thanks.]]></description>
   </item>
      <item>
      <title>Failed launching kernel</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7851/failed-lunching-kernal</link>
      <pubDate>Thu, 03 May 2012 19:52:11 -0400</pubDate>
      <dc:creator>Saouli</dc:creator>
      <guid isPermaLink="false">7851@/devforum/discussions</guid>
      <description><![CDATA[Hi again,<br />Despite everything I have learned about CUDA and how to use it, I still don't know it well enough. I wrote a CUDA kernel for ray casting to render some DICOM files it's k<br />I used shared memory and texture memory.<br /><code><br />void CallCUDAKernel(dim3 gridDim, dim3 blockDim,unsigned int *Outputi, int *Winds,float *Spacing, int *VolDim,float *Boxmin,float *Boxmax,float *UP,float *AT,<br />	                                float *OThreshold,float *Omega,float *angle, float *CamPos)<br /><br />{<br /><br />	RaycastingRender&lt;&lt;&lt;gridDim, blockDim&gt;&gt;&gt;(Outputi, Winds,Spacing, VolDim,Boxmin,Boxmax,UP,AT,<br />	                               OThreshold,Omega, angle,CamPos);<br />}<br /></code><br />Something like that. Before I call my kernel, I allocate all the variables in global device memory using<br />cudaMalloc and cudaMemcpy. Note that some variables have to be copied from a complex host structure to the device.<br /><br />I don't know why, but my kernel stops and gives me this error: cuda kernel : (11) invalid argument error]]></description>
   </item>
      <item>
      <title>Nsight + Geforce GTX 590</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4066/nsight-geforce-gtx-590</link>
      <pubDate>Fri, 27 Jan 2012 17:06:20 -0500</pubDate>
      <dc:creator>aresio</dc:creator>
      <guid isPermaLink="false">4066@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br /><br />I'm about to buy a new computer equipped for CUDA development. I was considering the purchase of the GeForce GTX 590, because of its dual GPU that Parallel Nsight can exploit. <br /><br />However, I checked the "supported GPUs *FULL* list" and this card does not appear. <br /><br /><a href="http://developer.nvidia.com/parallel-nsight-supported-gpus" target="_blank" rel="nofollow">http://developer.nvidia.com/parallel-nsight-supported-gpus</a><br /><br />Is it a mistake, or does Nsight actually not support the GTX 590?<br /><br />Thank you very much]]></description>
   </item>
      <item>
      <title>Getting started, price-worthy hardware?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7891/getting-started-price-worthy-hardware</link>
      <pubDate>Fri, 04 May 2012 18:15:05 -0400</pubDate>
      <dc:creator>AniSkywalker</dc:creator>
      <guid isPermaLink="false">7891@/devforum/discussions</guid>
      <description><![CDATA[Hi!<br /><br /><br /><br />I'm new to both this forum and CUDA, but it is very much in my line of interest. I already know both asm and some GPU programming (floating-point arithmetic etc.) with asm. <br /><br /><br /><br />I want to start with CUDA programming. I'm searching for price-worthy, CUDA 4 compatible hardware. Since I am very new to the subject, I'd like to be directed to hardware choices that give relevant experience when writing CUDA code. That is, if dual GPUs or dual CPUs are beneficial, I'd like to be pointed towards good and price-worthy solutions there. If a single GPU/CPU solution is a good enough place to start and get experience (say 35000-50000 lines of code), then I'd go with it. And if there is some solution that works for now and is upgradeable, I might go with it.<br /><br /><br /><br />Just to be clear, I wouldn't ask here if I weren't entirely new to this, so please don't mock me if some of my questions are plain stupid. I just don't know better ways to phrase them right now...]]></description>
   </item>
      <item>
      <title>Thread indexing</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7606/thread-indexing</link>
      <pubDate>Thu, 26 Apr 2012 17:13:20 -0400</pubDate>
      <dc:creator>essaysoftware</dc:creator>
      <guid isPermaLink="false">7606@/devforum/discussions</guid>
      <description><![CDATA[I would like to index my threads from 0 to N with a <br />&lt;&lt;&gt;&gt; launch.<br /><br />How do I do this?]]></description>
   </item>
      <item>
      <title>Batch testing with Parallel Nsight</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7806/batch-testing-with-parallel-nsight</link>
      <pubDate>Wed, 02 May 2012 20:20:01 -0400</pubDate>
      <dc:creator>nunosilva800</dc:creator>
      <guid isPermaLink="false">7806@/devforum/discussions</guid>
      <description><![CDATA[Hello.<br />I'm building an OpenGL program that is basically a visualizer, and I would like to test it under various configurations (number of lights, model to load, and textures) to assess performance, scalability, etc...<br /><br />So I would like to know how I can make a script to define a bunch of tests, so that I can leave it running them during the night and analyze the results the next day. <br />I've found the TestRunner.exe program in C:\Program Files (x86)\NVIDIA Parallel Nsight 2.2\Common, but I don't know what parameters to use with it. <br /><br />I've searched through the user guide and the interwebs, but I can't find anything resembling batch testing with Nsight....<br /><br />How can I do it?<br />Thanks.]]></description>
   </item>
      <item>
      <title>CUDA integration in vs 11 beta.</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6771/cuda-integration-in-vs-11-beta-</link>
      <pubDate>Sat, 07 Apr 2012 19:04:21 -0400</pubDate>
      <dc:creator>m227</dc:creator>
      <guid isPermaLink="false">6771@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br />Sorry if this question was asked before but here it is: is there a way to integrate CUDA in vs 11 beta the same way it is in vs 2010?<br />Thanks,<br />G. ]]></description>
   </item>
      <item>
      <title>Does the current cuda-gdb allow single-GPU debugging like Nsight 2.2? Will CUDA 5 support it?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7611/does-current-cuda-gdb-allow-single-gpu-debugging-like-nsight-2-2-in-cuda-5-will-support-it</link>
      <pubDate>Thu, 26 Apr 2012 19:31:13 -0400</pubDate>
      <dc:creator>oscarbg</dc:creator>
      <guid isPermaLink="false">7611@/devforum/discussions</guid>
      <description><![CDATA[Nsight 2.2 now supports single-GPU debugging via so-called software preemption. Does cuda-gdb support the same technology on Linux or Mac? If not, will it support it soon? It seems GTC will unveil Nsight for Mac and Linux, so I hope it is added there too, as I think it will use cuda-gdb underneath.]]></description>
   </item>
      <item>
      <title>Issue using NSight 2.2 rc2 with VS11</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7796/issue-using-nsight-2-2-rc2-with-vs11</link>
      <pubDate>Wed, 02 May 2012 15:19:27 -0400</pubDate>
      <dc:creator>diver182</dc:creator>
      <guid isPermaLink="false">7796@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br /><br />I installed VS11 (Ultimate) beta on my Win7 machine,<br />after that the 4.2x SDK and then Parallel NSight 2.2 rc2.<br />The NSight installer claimed to have made modifications to the VS11 installation.<br />But I can neither find the NSight menu on the upper pane nor the templates<br />for creating a cuda 4.2 project.<br /><br />Do I have to adjust anything to make it work or what did I miss?]]></description>
   </item>
      <item>
      <title>Getting a new laptop for CUDA and CAD work: any suggestions?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7546/getting-new-laptop-for-cuda-and-cad-work-any-suggestions</link>
      <pubDate>Wed, 25 Apr 2012 18:04:11 -0400</pubDate>
      <dc:creator>Bleiner</dc:creator>
      <guid isPermaLink="false">7546@/devforum/discussions</guid>
      <description><![CDATA[Hello Everyone,<br /><br />I am looking into buying a new laptop soon for my work.  I use AutoCAD, but only for basic things, and I am getting into leveraging GPU CUDA code to speed up massive calculations.  My question is: if I am going to lean hard on the GPU to do double-precision calculations (with a little CAD on the side), what kind of laptop should I get?  I have seen several machines that have dual GTX670M cards, but from what I am reading on the forum, the Kepler cards are not great performers in the double-precision arena.  Does anyone have any thoughts on this?  I have an Asus with a GTX560M which works well, but I am looking to gain even more performance from this next laptop.  Should I stay with the 5 series, or do you think the new Keplers (possibly two in the laptop) would provide the most performance?<br /><br />Also, do you think a Quadro would have more double-precision performance than any GeForce card?<br /><br /><br />Thanks<br />Ben<br />]]></description>
   </item>
      <item>
      <title>Cuda Kernels Stop Running After Few Iterations</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7381/cuda-kernels-stop-running-after-few-iterations</link>
      <pubDate>Sat, 21 Apr 2012 14:15:31 -0400</pubDate>
      <dc:creator>Eman</dc:creator>
      <guid isPermaLink="false">7381@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br /><br />I am writing code that calls a number of kernels inside a for loop. The loop runs for 1000 iterations. When I run the program, the kernels stop running after a number of iterations. I tried to use cudaGetLastError(), but it didn't give me any information, as the output was "Error: unknown error". As I increase the block size and the number of threads, the kernels stop running sooner. For example, with a block size of 8 it stopped at iteration 740, while with a block size of 16 it stopped at iteration 440.  In each iteration the same resources are reused, so I really don't understand what the problem is. <br /><br />Any help will be appreciated. <br /><br />Thanks, <br /> ]]></description>
   </item>
      <item>
      <title>GAI sample games don&#039;t work, GT240 CC1.2</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7826/gai-sample-games-dont-work-gt240-cc1-2</link>
      <pubDate>Thu, 03 May 2012 08:07:03 -0400</pubDate>
      <dc:creator>Pawe</dc:creator>
      <guid isPermaLink="false">7826@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br />I am working on a parallel AI game-tree-search project and just wanted to run the examples from the NVIDIA GAI project, as it has everything I would need. Unfortunately, the sample applications do not work properly on my PC (CC 1.2, Windows x64, latest drivers installed). Actually, I can only run a game with the CPU AI. When the GPU is toggled on, only the random player plays, and it always wins. I need to be sure that these libraries work properly so I can write a similar one for another game in my project. Please help!]]></description>
   </item>
      <item>
      <title>How to measure the effective memory bandwidth?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5966/how-to-measure-the-effective-memory-bandwidth</link>
      <pubDate>Thu, 15 Mar 2012 12:04:02 -0400</pubDate>
      <dc:creator>cf3372338</dc:creator>
      <guid isPermaLink="false">5966@/devforum/discussions</guid>
      <description><![CDATA[Hello.<br /><br />It has been widely said that high memory bandwidth (the data transfer rate between global memory and local cache) is the key factor in performance. My question is how to properly measure the elapsed time for the memory copy. The following is my code:<br /><br />int main (void) {<br /><br />         cudaEventRecord(start, 0); <br />         Kernel&lt;&lt;&lt; grids,1 &gt;&gt;&gt;(n, x); <br />         cudaEventRecord(stop, 0);<br />         cudaEventSynchronize(stop);<br />         cudaEventElapsedTime(&amp;elapsedTime, start, stop);  <br /><br />}<br /><br />where the kernel function is defined as:<br /><br />__global__ void Kernel (int n, double* x){<br />        int tid = blockIdx.x + blockIdx.y * gridDim.x;<br />        double y;<br />        if (tid &lt; n)<br />           y = x[tid];<br />}<br /><br />Is this a correct way to do it? I appreciate your help, feedback, and comments.]]></description>
   </item>
      <item>
      <title>cuda measure execution time</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7811/cuda-measure-execution-time</link>
      <pubDate>Thu, 03 May 2012 01:25:57 -0400</pubDate>
      <dc:creator>vlbthambawita</dc:creator>
      <guid isPermaLink="false">7811@/devforum/discussions</guid>
      <description><![CDATA[How do I measure the execution time of a CUDA program? <br />What is wrong with the following code? It always returns negative values as the result. Why?<br /><br />	 cudaEvent_t s1,e1;<br />	float time;<br />	cudaEventCreate(&amp;s1);<br />	cudaEventCreate(&amp;e1);<br />	cudaEventRecord(s1,0);<br /><br /> kernel&lt;&lt;&lt;&gt;&gt;&gt;<br /><br />        cudaEventSynchronize(e1);<br />	cudaEventElapsedTime(&amp;time,s1,e1);<br />      printf("time=%f\n",time);]]></description>
   </item>
      <item>
      <title>Streaming Multiprocessors</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7676/streaming-multiprocessors</link>
      <pubDate>Sat, 28 Apr 2012 16:08:59 -0400</pubDate>
      <dc:creator>Saouli</dc:creator>
      <guid isPermaLink="false">7676@/devforum/discussions</guid>
      <description><![CDATA[Hello <br />How can we find out the number of streaming multiprocessors (SMs) in an NVIDIA device, and how many threads each one can hold? For example, I believe the NVIDIA G80 has 16 SMs, each of which can hold up to 8 thread blocks and a maximum of 768 threads.]]></description>
   </item>
      <item>
      <title>LNK2001 Undefined symbol not found</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7801/lnk2001-undefined-symbol-not-found</link>
      <pubDate>Wed, 02 May 2012 15:58:17 -0400</pubDate>
      <dc:creator>ronthompson</dc:creator>
      <guid isPermaLink="false">7801@/devforum/discussions</guid>
      <description><![CDATA[I'm using CUDA 4.1 and Visual Studio 2008 C++. I have several routines in C that are called from my main program in C++. The C and CUDA routines are compiled with nvcc, and the C++ is compiled with the compiler supplied with VS 2008.  It all compiles fine, but it won't link because it can't find the C routines.  I declare the C routines with extern "C" at global scope in my main file, but it does not help. I believe it is linking with the VS 2008 linker and not nvcc.  What am I doing wrong?<br /><br />Ron]]></description>
   </item>
      </channel>
</rss>
