<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
      <title>Tagged with gpu-profiling - NVIDIA Developer Forums</title>
      <link>http://forums.developer.nvidia.com/devforum/discussions/tagged/gpu-profiling/feed.rss</link>
      <pubDate>Wed, 16 May 2012 17:32:36 -0400</pubDate>
         <description>Tagged with gpu-profiling - NVIDIA Developer Forums</description>
   <language>en-CA</language>
   <atom:link href="/devforum/discussions/tagged/gpu-profiling/feed.rss" rel="self" type="application/rss+xml" />
   <item>
      <title>LiveKernelEvent</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8276/livekernelevent</link>
      <pubDate>Wed, 16 May 2012 07:11:50 -0400</pubDate>
      <dc:creator>Michi10</dc:creator>
      <guid isPermaLink="false">8276@/devforum/discussions</guid>
      <description><![CDATA[Hello NVIDIA support,<br /><br />For the past three weeks I have had a serious problem: when I watch videos or play games, a picture like this appears on the screen.<br /><br /><a href="http://www.abload.de/image.php?img=foto0089s2dri.jpg" target="_blank" rel="nofollow">http://www.abload.de/image.php?img=foto0089s2dri.jpg</a><br /><br />After a few minutes, this image follows.<br /><br /><a href="http://www.abload.de/image.php?img=foto00903gckl.jpg" target="_blank" rel="nofollow">http://www.abload.de/image.php?img=foto00903gckl.jpg</a><br /><br />I then looked at the error report, but I cannot make sense of it, so all I can do is copy it here for you:<br /><br />Product<br />Windows<br /><br />Problem<br />Graphics card error<br /><br />Date<br />16.05.2012 11:06<br /><br />Status<br />Not reported<br /><br />Description<br />A problem with the video hardware caused Windows to stop working correctly.<br /><br />Problem signature<br />Problem event name: LiveKernelEvent<br />OS version: 6.0.6002.2.2.0.768.3<br />Locale ID: 1031<br /><br />Files that help describe the problem<br />WD-20120516-1106.dmp<br />sysdata.xml<br />Version.txt<br /><br />Additional information about the problem<br />BCCode: 117<br />BCP1: 87FA1510<br />BCP2: 92358ACE<br />BCP3: 00000000<br />BCP4: 00000000<br />OS Version: 6_0_6002<br />Service Pack: 2_0<br />Product: 768_1<br /><br />I took my computer to the store where I bought the parts and the graphics card, and now even the technician is confused and does not know what to do. My graphics card is only one month old.
A stress test of the whole PC revealed nothing. The driver version is 296.10; I have uninstalled and reinstalled this software five times, always with the same result.<br /><br />What could cause the screen to suddenly turn pink and the whole computer to freeze?<br /><br />I keep wondering what it could possibly be.<br /><br />The technician told me the computer is fine, so what is the error?<br /><br />I would be glad if you could give me an answer to this problem very soon.<br /><br />Kind regards<br />Michael Hirschegger]]></description>
   </item>
      <item>
      <title>How do I use &quot;prof_trigger&quot; (user profile triggers) from my kernel?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8226/how-do-i-use-prof_trigger-user-profile-triggers-from-my-kernel</link>
      <pubDate>Tue, 15 May 2012 13:07:03 -0400</pubDate>
      <dc:creator>m4dc4p</dc:creator>
      <guid isPermaLink="false">8226@/devforum/discussions</guid>
      <description><![CDATA[The CUPTI Event API provides counters for user profiling (prof_trigger_00 through prof_trigger_07).<br /><br />I can figure out how to read those counters, but how do I write to them, or in some other way use them, from my kernel?<br /><br />I am using a Tesla GPU (compute capability 1.1).<br /><br />Thanks!]]></description>
   </item>
      <item>
      <title>Portable pinned memory and multiple GPUs: Performance and stability</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7386/portable-pinned-memory-and-multiple-gpus-performance-and-stability</link>
      <pubDate>Sun, 22 Apr 2012 13:47:59 -0400</pubDate>
      <dc:creator>tbenson</dc:creator>
      <guid isPermaLink="false">7386@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br /><br />I am having some problems using portable pinned memory to share one pinned buffer between multiple GPUs.  I have two separate issues:<br /><br />	 1) Transfer performance for GPUs other than the one whose context made the allocation is massively degraded; and<br />	 2) The program tends to crash my Linux host and force a reboot.<br /><br />I have included the code at the end.  There are several flags at the top of the source file to control behavior, including NDEVICES, NBUFFERS, and USE_PINNED_MEMORY.  NDEVICES is the number of GPUs to use, NBUFFERS is the number of buffers to be allocated, and USE_PINNED_MEMORY determines whether or not the buffers are pinned.  The case that fails is NDEVICES = 2, NBUFFERS = 1, and USE_PINNED_MEMORY = true. If I use as many buffers as devices, then things work with or without pinned memory.  It also works without pinned memory for any number of buffers.  However, with the failing case, I get the following:<br /><br />[host:portable_pinned]$ ./portable <br />id = 0, cudaMemcpy time = 22.32 ms<br />id = 0, val = 3.000000 (should be 3.000000)<br />id = 1, cudaMemcpy time = 5457.76 ms<br />id = 1, val = 6.000000 (should be 6.000000)<br /><br />Message from syslogd@host at Apr 22 13:25:16 ...<br /> kernel:[41786.826763] Stack:<br /><br />Message from syslogd@host at Apr 22 13:25:16 ...<br /> kernel:[41786.828257] Call Trace:<br /><br />Message from syslogd@host at Apr 22 13:25:16 ...<br /> kernel:[41786.845154] Code: f6 62 00 85 c0 74 10 e8 69 e4 65 00 0f 1f 00 eb 06 89 77 6c 89 4f 70 48 83 c5 10 5b c3 41 54 53 48 83 ec 08 48 83 ed 08 41 89 f4 &lt;39&gt; 77 6c 73 17 39 77 70 0f 87 ac 00 00 00 39 77 6c 73 09 39 77 <br /><br />At this point the host is only partially responsive and needs to be rebooted.  The system log is full of errors, but a sampling is attached.  This is using driver version 285.05.33, CUDA 4.1, Fedora 14, and kernel 2.6.35.6-45.  
The GPUs are two Tesla C2050s that reside in a Tesla S2050 compute server.  They are connected to the host via a single PCI-e cable.  This is a single host in a cluster, so updating the driver is not trivial, although I will do so if this is a known bug.<br /><br />In any case, I suspect that the kernel/driver error is just a bug as I have done something similar in the past without this problem.  However, I still had the poor performance in the past.  Above, the PCIe transfer to the CUDA context in which the allocation was not made is over 200x slower than the transfer for the context in which the allocation was made.  Is this normal?  The documentation just says that cudaHostAllocPortable allows pinned memory to be recognized by other contexts, but does not mention the performance implications of accessing the memory.<br /><br />Thanks for any help/comments,<br /><br />Tom<br /><br />The code is below.  The Timing class is just a wrapper that I have for host timing.  I can include it if needed, but already had to rework this email due to character limitations.  
The references to Timing can be commented out to compile the code.<br /><br /><code><br />#include &lt;cuda_runtime.h&gt;<br />#include &lt;cassert&gt;<br />#include &lt;cstdio&gt;<br />#include &lt;pthread.h&gt;<br />#include "timing.hpp"<br /><br />namespace<br />{<br />    const size_t BUFSIZE = 32*1024*1024;<br />    const int NDEVICES = 2;<br />    const int NBUFFERS = 1;<br />    const bool USE_PINNED_MEMORY = true;<br />}<br /><br />struct Params<br />{<br />    float *buf;<br />    int id;<br />};<br /><br />__global__ void test_kernel(float *buf, float val) { buf[0] = val; }<br /><br />void *gpu_thread(void *v)<br />{<br />    Params *params = (Params *) v;<br /><br />    cudaSetDevice(params-&gt;id);<br /><br />    float *dev_buf;<br />    assert(cudaMalloc((void **) &amp;dev_buf, sizeof(float)*BUFSIZE) == cudaSuccess);<br /><br />    double start = Timing::ElapsedTimeMs();<br />    assert(cudaMemcpy(dev_buf, params-&gt;buf, sizeof(float)*BUFSIZE, cudaMemcpyHostToDevice) == cudaSuccess);<br />    double elapsed = Timing::ElapsedTimeMs() - start;<br />    printf("id = %d, cudaMemcpy time = %.2f ms\n", params-&gt;id, elapsed);<br /><br />    test_kernel&lt;&lt;&lt;1,1&gt;&gt;&gt;( dev_buf, (params-&gt;id+1) * 3.0f );<br />    assert(cudaThreadSynchronize() == cudaSuccess);<br /><br />    float retval;<br />    assert(cudaMemcpy(&amp;retval, dev_buf, sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess);<br /><br />    printf("id = %d, val = %f (should be %f)\n", params-&gt;id, retval, (params-&gt;id+1)*3.0f);<br /><br />    assert(cudaFree(dev_buf) == cudaSuccess);<br /><br />    return NULL;<br />}<br /><br />int main(int argc, char **argv)<br />{<br />   
 float *pinned[NDEVICES];<br /><br />    assert(NBUFFERS &lt;= NDEVICES);<br /><br />    for (int i = 0; i &lt; NBUFFERS; ++i)<br />    {<br />        assert(cudaSetDevice(i) == cudaSuccess);<br />        if (USE_PINNED_MEMORY)<br />        {<br />            assert(cudaHostAlloc((void **) &amp;pinned[i], sizeof(float)*BUFSIZE, cudaHostAllocPortable) == cudaSuccess);<br />        }<br />        else<br />        {<br />            pinned[i] = new float[BUFSIZE];<br />        }<br />        for (size_t k = 0; k &lt; BUFSIZE; ++k) { pinned[i][k] = 1.0f; }<br />    }<br /><br />    pthread_t tid[NDEVICES];<br />    Params params[NDEVICES];<br /><br />    for (int i = 0; i &lt; NDEVICES; ++i)<br />    {<br />        params[i].id = i;<br />        params[i].buf = pinned[i%NBUFFERS];<br />        assert(pthread_create(tid+i, NULL, gpu_thread, (void *) &amp;params[i]) == 0);<br />    }<br /><br />    for (int i = 0; i &lt; NDEVICES; ++i)<br />    {<br />        assert(pthread_join(tid[i], NULL) == 0);<br />    }<br /><br />    for (int i = 0; i &lt; NBUFFERS; ++i)<br />    {<br />        if (USE_PINNED_MEMORY)<br />        {<br />            assert(cudaFreeHost(pinned[i]) == cudaSuccess);<br />        }<br />        else<br />        {<br />            delete [] pinned[i];<br />        }<br />    }<br /><br />    return 0;<br />}<br /></code>]]></description>
   </item>
      <item>
      <title>Getting started, price-worthy hardware?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7891/getting-started-price-worthy-hardware</link>
      <pubDate>Fri, 04 May 2012 18:15:05 -0400</pubDate>
      <dc:creator>AniSkywalker</dc:creator>
      <guid isPermaLink="false">7891@/devforum/discussions</guid>
      <description><![CDATA[Hi!<br /><br /><br /><br />I'm new to both this forum and CUDA, but it is very much in my line of interest. I already know asm and have done some GPU programming (floating-point arithmetic etc.) with asm. <br /><br /><br /><br />I want to start with CUDA programming. I'm searching for price-worthy, CUDA 4 compatible hardware. Since I am very new to the subject, I'd like to be directed to hardware choices that give relevant experience when writing CUDA code. That is, if dual GPUs or dual CPUs are beneficial, I'd like to be pointed towards good and price-worthy solutions there. If a single GPU/CPU solution is a good enough place to start and get experience (say 35000-50000 lines of code), then I'd go with it. And if there is some solution that works for now and is upgradeable, I might go with it.<br /><br /><br /><br />Just to be clear, I wouldn't ask here if I weren't entirely new to this, so please don't mock me if some of my questions are plain stupid. I just don't know better ways to phrase them right now...]]></description>
   </item>
      <item>
      <title>Batch testing with Parallel Nsight</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7806/batch-testing-with-parallel-nsight</link>
      <pubDate>Wed, 02 May 2012 20:20:01 -0400</pubDate>
      <dc:creator>nunosilva800</dc:creator>
      <guid isPermaLink="false">7806@/devforum/discussions</guid>
      <description><![CDATA[Hello.<br />I'm building an OpenGL program that is basically a visualizer, and I would like to test it under various configurations (number of lights, model to load, and textures) to assess performance, scalability, etc.<br /><br />So I would like to know how I can write a script that defines a batch of tests, so that I can leave it running them overnight and analyze the results the next day. <br />I've found the TestRunner.exe program in C:\Program Files (x86)\NVIDIA Parallel Nsight 2.2\Common, but I don't know what parameters to use with it. <br /><br />I've searched through the user guide and the web, but I can't find anything resembling batch testing with Nsight.<br /><br />How can I do it?<br />Thanks.]]></description>
   </item>
      <item>
      <title>How to measure the effective memory bandwidth?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5966/how-to-measure-the-effective-memory-bandwidth</link>
      <pubDate>Thu, 15 Mar 2012 12:04:02 -0400</pubDate>
      <dc:creator>cf3372338</dc:creator>
      <guid isPermaLink="false">5966@/devforum/discussions</guid>
      <description><![CDATA[Hello.<br /><br />It is often said that high memory bandwidth (the data transfer rate between global memory and local cache) is the key factor in performance. My question is how to properly measure the elapsed time of a memory copy. My code is as follows:<br /><br />int main (void) {<br /><br />         cudaEventRecord(start, 0); <br />         Kernel&lt;&lt;&lt; grids,1 &gt;&gt;&gt;(n, x); <br />         cudaEventRecord(stop, 0);<br />         cudaEventSynchronize(stop);<br />         cudaEventElapsedTime(&amp;elapsedTime, start, stop);  <br /><br />}<br /><br />where the kernel function is defined as:<br /><br />__global__ void Kernel (int n, double* x){<br />        int tid = blockIdx.x + blockIdx.y * gridDim.x;<br />        double y;<br />        if (tid &lt; n)<br />           y = x[tid];<br />}<br /><br />Is this a correct way to measure it? I would appreciate your help, feedback, and comments.]]></description>
   </item>
      <item>
      <title>Streaming Multiprocessors</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7676/streaming-multiprocessors</link>
      <pubDate>Sat, 28 Apr 2012 16:08:59 -0400</pubDate>
      <dc:creator>Saouli</dc:creator>
      <guid isPermaLink="false">7676@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br />How can we find out the number of streaming multiprocessors (SMs) in an NVIDIA device, and how many threads each one can hold? For example, I believe the NVIDIA G80 has 16 SMs, each of which can hold up to 8 thread blocks and a maximum of 768 threads.]]></description>
   </item>
      <item>
      <title>HW Debug Support</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7751/hw-debug-support</link>
      <pubDate>Tue, 01 May 2012 19:41:18 -0400</pubDate>
      <dc:creator>Vector</dc:creator>
      <guid isPermaLink="false">7751@/devforum/discussions</guid>
      <description><![CDATA[Which gpus have hardware debug support to use with NSight 2.2's single gpu debug feature?<br />Thanks<br />]]></description>
   </item>
      <item>
      <title>Timing asynch streams with cudaDeviceSynchronize and two events</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7746/timing-asynch-streams-with-cudadevicesynchronize-and-two-events</link>
      <pubDate>Tue, 01 May 2012 18:20:21 -0400</pubDate>
      <dc:creator>dlowell</dc:creator>
      <guid isPermaLink="false">7746@/devforum/discussions</guid>
      <description><![CDATA[Here is what the docs say about cudaEventRecord on stream 0:<br />Records an event. If stream is non-zero, the event is recorded after all preceding operations in stream have been completed; otherwise, it is recorded after all preceding operations in the CUDA context have been completed.<br /><br /><code>      cudaStream_t *streamid;<br />      streamid = (cudaStream_t*)malloc(nstreams*sizeof(cudaStream_t));<br />      float elapsed;<br />      cudaEvent_t start, stop;<br />      cudaEventCreate(&amp;start);<br />      cudaEventCreate(&amp;stop);<br />      cudaDeviceSynchronize();<br />      cudaEventRecord(start,0);<br />      /*invoke device kernel*/<br />       for(i=0;i&lt;nstreams;i++){<br />          cudaStreamCreate(&amp;(streamid[i]));<br />          orcu_kernel&lt;&lt;&lt;dimGrid,dimBlock,0,streamid[i]&gt;&gt;&gt;(n,dev_y,dev_x);<br />      }<br />      cudaDeviceSynchronize();<br />      cudaEventRecord(stop,0);<br />      cudaEventSynchronize(stop);<br />      cudaEventElapsedTime(&amp;elapsed,start,stop);<br />      cudaEventDestroy(start);<br />      cudaEventDestroy(stop);<br />      for(i=0;i&lt;nstreams;i++){<br />         cudaStreamDestroy(streamid[i]);<br />      }</code><br /><br />According to the API docs, as long as we stick with the zero stream, this is fine. <br /><br />I'm checking whether a more experienced CUDA dev knows if this is a legitimate way of timing multiple streams.<br /><br />Update: <br />Also, according to the literature out there, I don't need cudaDeviceSynchronize() after the kernel call...<br /><br />Thanks all. ]]></description>
   </item>
      <item>
      <title>cudaEvent timers vs. Host timers</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7541/cudaevent-timers-vs-host-timers</link>
      <pubDate>Wed, 25 Apr 2012 16:35:14 -0400</pubDate>
      <dc:creator>dlowell</dc:creator>
      <guid isPermaLink="false">7541@/devforum/discussions</guid>
      <description><![CDATA[In doing performance testing we are trying two methods of timing:<br />CUDA event-based timing and a system timer.<br />We are running on a Fermi 2070 (sm_20) with CUDA SDK 4.2.<br />I haven't seen anything on the internet that makes it clear whether one is superior to the other for timing. I've seen cudaDeviceSynchronize() used for this purpose. Any insight would be valuable. <br /><br />Thanks ahead of time!<br /><br /><br />The first is the built-in event-based timing:<br /><br /><code>cudaEventRecord(start, 0);<br />kernel&lt;&lt;&lt;grid,block&gt;&gt;&gt;(devy,devx,alpha,length);<br />cudaEventRecord(stop, 0);<br />cudaEventSynchronize(stop);</code><br /><br /><br />The second is a system-time-based timer using a barrier.<br /><br /><code>  start = getclock();<br />  kernel&lt;&lt;&lt;dimGrid,dimBlock&gt;&gt;&gt;(devy,devx,alpha,length);<br />  cudaDeviceSynchronize();<br />  finish = getclock();</code><br /><br />where getclock() is defined as:<br /><br /><code>#include &lt;sys/time.h&gt;<br />double getclock(){<br />  struct timezone tzp;<br />  struct timeval tp;<br />  gettimeofday (&amp;tp, &amp;tzp);<br />  return (tp.tv_sec + tp.tv_usec*1.0e-6);<br />}</code><br /><br /><br />The kernel we are running is:<br /><br /><code>__global__ void  kernel(double* devY,double* devX, double alpha, int length){<br /> /* w &lt;- y + alpha*x */<br />  int tid = blockIdx.x*blockDim.x+threadIdx.x;<br />  if(tid&lt;length){<br />    devY[tid]=alpha*devX[tid];<br />  }<br />}</code><br />]]></description>
   </item>
      <item>
      <title>2 GTX265 cards not detected for debugging</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6796/2-gtx265-cards-not-detected-for-debugging</link>
      <pubDate>Mon, 09 Apr 2012 01:11:30 -0400</pubDate>
      <dc:creator>vinaybgavirangaswamy</dc:creator>
      <guid isPermaLink="false">6796@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br /><br />Thank you for reading this and taking the time to help me configure my system for debugging.<br /><br />I am trying to do line-by-line code debugging on a system with 2 GTX 265 cards. I have listed my system configuration below in case it helps:<br /><br />Motherboard: Gigabyte 990fxa-ud3<br />Graphics cards: GTX265 (one from Zotac and the other from ASUS)<br />OS: Windows 7 (64 bit)<br />Graphics driver: devdriver_4.1_winvista-win7_64_286.19_general<br />CUDA toolkit: cudatoolkit_4.1.28_win_64<br />Parallel Nsight: Parallel_Nsight_Win64_2.1.0.12046<br />SLI: disabled<br />WDDM TDR enabled: false<br />WPF hardware acceleration: disabled by running the registry file in the Nsight common folder<br /><br />I am not able to make the other graphics card headless, as I do not have that option in the NVIDIA Control Panel. I have attached a few pictures of what I see in the Control Panel.<br /><br />Please help me, as I am trying to use this for a course project.<br /><br />Thank you in advance!]]></description>
   </item>
      <item>
      <title>calculating on shader</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5906/calculating-on-shader</link>
      <pubDate>Wed, 14 Mar 2012 10:47:06 -0400</pubDate>
      <dc:creator>zimmerlinde</dc:creator>
      <guid isPermaLink="false">5906@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br /><br />Everybody talks about the shaders of the Tegra SoC. I read that there are x vertex and y fragment shaders, but nowhere can I find an explanation of how to use this parallelism from GLSL.<br /><br />How can I use more than one vertex or fragment shader? I've never seen arguments or code for using more than one. I have no problem using one fragment and one vertex shader, but I don't understand how I can use more than one. Does the compiler choose how many shaders to use?<br /><br />Thank you for your answers.]]></description>
   </item>
      <item>
      <title>Parallel Nsight crash</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6726/parallel-nsight-crash</link>
      <pubDate>Thu, 05 Apr 2012 22:07:14 -0400</pubDate>
      <dc:creator>2930</dc:creator>
      <guid isPermaLink="false">6726@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br />My program crashes on the target machine inside NVDA.Graphics.Interception.100.dll, after a call to the D3D11 swap chain's Present.<br />Does anyone know what could be causing this? I can profile the sample programs without a problem.<br />The target machine is a laptop with a Quadro 2000M.<br />I've attached a capture of the relevant parts of the call stack.<br /><br />Thanks.]]></description>
   </item>
      <item>
      <title>Optimus chipsets and NVApi</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6596/optimus-chipsets-and-nvapi</link>
      <pubDate>Mon, 02 Apr 2012 03:18:25 -0400</pubDate>
      <dc:creator>PiiX</dc:creator>
      <guid isPermaLink="false">6596@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br /><br />Here is my problem: since I got this new Optimus chipset (nVidia GT540m + Core i5), I have had a lot of trouble with graphics and performance problems, so I tried to debug the NVIDIA context.<br /><br />Strangely enough, I found a reportable bug that way.<br /><br />NvAPI doesn't work properly in my framework.<br /><br />The code:<br /><br />NvDisplayHandle pNvDispHandle1;<br />NvAPI_Status res = NvAPI_Initialize();<br />res = NvAPI_EnumNvidiaDisplayHandle(0, &amp;pNvDispHandle1);<br />NvAPI_Unload();<br /><br />NvAPI initializes fine, but "NvAPI_EnumNvidiaDisplayHandle" returns "-6": device not found.<br />This happens whether I start with the Intel or the NVIDIA GPU.<br />But this little piece of code does work on a standard workstation with a GeForce 9600 and an Intel E8400.<br /><br />So I mainly want to ask whether someone with an Optimus PC can call this function and get "NVAPI_OK" returned.]]></description>
   </item>
      <item>
      <title>unexpected situation when using nvidia-smi command</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6246/unexpected-situation-when-using-nvidia-smi-command</link>
      <pubDate>Fri, 23 Mar 2012 03:44:26 -0400</pubDate>
      <dc:creator>AlanKao2012</dc:creator>
      <guid isPermaLink="false">6246@/devforum/discussions</guid>
      <description><![CDATA[Hi there:<br /><br />When everything is fine, running "nvidia-smi" gives output like this:<br />+------------------------------------------------------+                       <br />| NVIDIA-SMI 2.285.05   Driver Version: 285.05.33      |                       <br />|-------------------------------+----------------------+----------------------+<br />| Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |<br />| Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |<br />|===============================+======================+======================|<br />| 0.  Tesla C2070               | 0000:02:00.0  On     |       Off            |<br />|  38%   80 C  P8    Off /  Off |   1%   54MB / 6143MB |    0%     Default    |<br />|-------------------------------+----------------------+----------------------|<br />| 1.  Tesla C2070               | 0000:03:00.0  Off    |       Off            |<br />|  30%   57 C  P8    Off /  Off |   0%   10MB / 6143MB |    0%     Default    |<br />|-------------------------------+----------------------+----------------------|<br />| 2.  Tesla C2070               | 0000:83:00.0  Off    |       Off            |<br />|  30%   66 C  P8    Off /  Off |   0%   10MB / 6143MB |    0%     Default    |<br />|-------------------------------+----------------------+----------------------|<br />| 3.  
Tesla C2070               | 0000:84:00.0  Off    |       Off            |<br />|  30%   67 C  P8    Off /  Off |   0%   10MB / 6143MB |    0%     Default    |<br />|-------------------------------+----------------------+----------------------|<br />| Compute processes:                                               GPU Memory |<br />|  GPU  PID     Process name                                       Usage      |<br />|=============================================================================|<br />|  No running compute processes found                                         |<br />+-----------------------------------------------------------------------------+<br /><br />But sometimes, it halts for a while, and:<br />+------------------------------------------------------+                       <br />| NVIDIA-SMI 2.285.05   Driver Version: 285.05.33      |                       <br />|-------------------------------+----------------------+----------------------+<br />| Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |<br />| Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |<br />|===============================+======================+======================|<br />| 0.  Tesla C2070               | 0000:02:00.0  On     |       Off            |<br />|  38%   ERR!  P8    Off /  Off |   1%   54MB / 6143MB |   99%     Default    |<br />|-------------------------------+----------------------+----------------------|<br />| 1.  Tesla C2070               | 0000:03:00.0  Off    |       Off            |<br />|  30%   57 C  P8    Off /  Off |   0%   10MB / 6143MB |    0%     Default    |<br />|-------------------------------+----------------------+----------------------|<br />| 2.  
Tesla C2070               | 0000:83:00.0  Off    |       Off            |<br />|  30%   66 C  P8    Off /  Off |   0%   10MB / 6143MB |    0%     Default    |<br />|-------------------------------+----------------------+----------------------|<br />| 3.  Tesla C2070               | 0000:84:00.0  Off    |       Off            |<br />|  30%   67 C  P8    Off /  Off |   0%   10MB / 6143MB |    0%     Default    |<br />|-------------------------------+----------------------+----------------------|<br />| Compute processes:                                               GPU Memory |<br />|  GPU  PID     Process name                                       Usage      |<br />|=============================================================================|<br />|  No running compute processes found                                         |<br />+-----------------------------------------------------------------------------+<br /><br />This happens on a Supermicro 7046GT server with 4 Tesla C2070 cards. Each time the error occurs I reboot the machine, but it then often fails to boot successfully and needs to be rebooted again.<br />This is very disruptive, and it DOES cause errors when trying to run computations on the GPUs.<br /><br />Has anyone else encountered this situation?<br />Is it more likely to be a hardware problem (motherboard, GPU, ...) or a software problem (program misuse, driver, OS)?<br /><br />Any reply will be appreciated. Thank you very much.]]></description>
   </item>
      <item>
      <title>nsight debug session of opengl application missing extensions</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6006/nsight-debug-session-of-opengl-application-missing-extensions</link>
      <pubDate>Fri, 16 Mar 2012 07:03:06 -0400</pubDate>
      <dc:creator>phpfreaked9</dc:creator>
      <guid isPermaLink="false">6006@/devforum/discussions</guid>
      <description><![CDATA[I have an OpenGL application which I would like to profile with Nsight. During my debug session I noticed that a lot of extensions are not initialized. I use GLEW for the initialization, and the application works without Nsight attached. One example of an extension that remains uninitialized is frame-buffer.<br /><br />I have reinstalled the latest drivers and disabled WDDM TDR. The monitor states that it has been properly configured for debugging. However, the issue persists. Could anybody lend me a hand with this?]]></description>
   </item>
      <item>
      <title>Issue reading environment variables on remote machine</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6051/issue-reading-environment-variables-on-remote-machine</link>
      <pubDate>Sat, 17 Mar 2012 14:54:23 -0400</pubDate>
      <dc:creator>diver182</dc:creator>
      <guid isPermaLink="false">6051@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br /><br />I'm using Parallel Nsight in a remote setup (laptop -&gt; remote machine).<br />Both systems use the same software versions (Cuda Toolkit 4.1/Nsight 2.1/Windows 7)<br />The application requires the use of environment variables to retrieve paths to test data.<br /><br />The following part of a routine causes an assertion failure,<br />which does not appear when executing the application either on the laptop or the remote machine explicitly<br />(meaning without Parallel Nsight, by executing the application directly on either laptop or remote machine):<br /><br /><code><br />char * val;<br />size_t reqSize;<br /><br />// env_var := name of the environment variable to retrieve the value for<br />getenv_s(&amp;reqSize, NULL, 0, env_var); <br /><br />val = (char*)malloc(reqSize * sizeof(char));<br />if (!val)<br />   _RPT0(_CRT_ERROR, "Failed to ...");<br /><br />getenv_s(&amp;reqSize, val, reqSize, env_var);<br />// ...<br /></code><br /><br />The assertion failure states:<br />"File: f:\dd\vctools\crt_bld\self_x86\crt\ src\getenv.c Line:266<br />Expression: (buffer != NULL &amp;&amp; sizeInTChars &gt; 0) || (buffer == NULL &amp;&amp; sizeInTChars == 0)"<br /><br />(Please note that drive f: does not exist on either the laptop or the remote machine).<br /><br />What could be the cause for this behaviour and how would I fix this?<br /><br />Suggestions greatly appreciated. ]]></description>
   </item>
      <item>
      <title>Instrumented driver for profiling on Linux</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5256/instrumented-driver-for-profiling-on-linux</link>
      <pubDate>Wed, 29 Feb 2012 05:58:11 -0500</pubDate>
      <dc:creator>thorfdbg</dc:creator>
      <guid isPermaLink="false">5256@/devforum/discussions</guid>
      <description><![CDATA[Dear NVIDIA team,<br /><br />I am currently looking into OpenCL development and am missing a method for profiling my kernel code, which runs slower than expected. The nvvp profiler on Linux is of less help than expected, as it cannot collect all the data necessary for a full analysis [I get an error saying "CUPTI_ERROR_PARAMETER_SIZE_NOT_SUFFICIENT"]. I believe this might be because I'm not using an instrumented X driver for my 560GT graphics card. However, the latest available driver I could find, NVPerfKit-Linux-x86_64-195.36.31, does not support the Fermi-based chips. Where would I find newer instrumented drivers and/or a fully functional nvvp profiler?]]></description>
   </item>
      <item>
      <title>NSight: Crashes</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5566/nsight-crashes</link>
      <pubDate>Wed, 07 Mar 2012 10:59:25 -0500</pubDate>
      <dc:creator>MatthiasK</dc:creator>
      <guid isPermaLink="false">5566@/devforum/discussions</guid>
      <description><![CDATA[Hey,<br /><br />I'm trying to use NSight for graphics profiling &amp; debugging (D3D11). I encountered a lot of issues which make NSight somewhat unusable for me at this time:<br /><br />1) Resizing a swap chain does not work with NSight. It fails with a debug message (triggered by IDXGISwapChain::ResizeBuffers):<br /><br />"DXGI Error: Swapchain cannot be resized unless all outstanding buffer references have been released"<br /><br />Without NSight attached everything works/resizes fine.<br /><br />2) The frame profiling tool immediately crashes Visual Studio when I click on an event, so there is no way to see or investigate any performance data at all. This is really a pity since it's exactly what I'm after.<br /><br />3) Viewing any kind of buffer (e.g. vertex buffer, rgba8 texture raw memory buffer, ...) crashes Visual Studio.<br /><br />4) Frame timings do not work at all. Error: "Timeout waiting for workload results". No crash.<br /><br />5) Minor: The texture array slider sometimes does not work and always snaps back to level 1. No crash.<br /><br />The chosen target is an (intentionally old) GT8800 on Vista-32. The host is running Win7-64 with Visual Studio 2010.<br /><br />Is there anything I can do to circumvent all those problems? Will a new version address them?<br /><br />Thanks, <br />-Matthias]]></description>
   </item>
      <item>
      <title>Using performance counters under Windows 7 64 bit</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4921/using-performance-counters-under-windows-7-64-bit</link>
      <pubDate>Mon, 20 Feb 2012 10:15:12 -0500</pubDate>
      <dc:creator>gdewan</dc:creator>
      <guid isPermaLink="false">4921@/devforum/discussions</guid>
      <description><![CDATA[I have been trying to get the nvidia performance counters working but having no luck.  The drivers I am using are the ones that are part of the latest NSight.<br /><br />When I initially call NVPMInit, I get NVPM_OK returned.  But If I try to call any NVPM* function after that, it returns NVPM_ERROR_NOT_INITIALIZED.<br /><br />What do I need to do to get NVPM working?<br /><br />Additional information about my setup:<br /><br />[Display]<br />Operating System:	Windows 7 Professional, 64-bit (Service Pack 1)<br />DirectX version:	11.0 <br />GPU processor:		GeForce GTX 580<br />Driver version:		286.16<br />DirectX support:	11<br />CUDA Cores:		512 <br />Core clock:		772 MHz <br />Shader clock:		1544 MHz<br />Memory clock:		2004 MHz (4008 MHz data rate) <br />Memory interface:	384-bit <br />Total available graphics memory:	4095 MB<br />Dedicated video memory:	1536 MB GDDR5<br />System video memory:	0 MB<br />Shared system memory:	2559 MB<br />Video BIOS version:	70.10.20.00.80<br />IRQ:			16<br />Bus:			PCI Express x16 Gen2<br /><br />[Components]<br /><br />easyUpdatusAPIU64.DLL		1.5.20.0		NVIDIA Update Components<br />WLMerger.exe		1.5.20.0		NVIDIA Update Components<br />Nvlhr.exe		1.5.20.0		NVIDIA Update Components<br />daemonu.exe		1.5.20.0		NVIDIA Update Components<br />ComUpdatusPS.dll		1.5.20.0		NVIDIA Update Components<br />ComUpdatus.exe		1.5.20.0		NVIDIA Update Components<br />NvUpdtr.dll		1.5.20.0		NVIDIA Update Components<br />NvUpdt.dll		1.5.20.0		NVIDIA Update Components<br />nvui.dll		7.17.12.8616		NVIDIA User Experience Driver Component<br />nvxdsync.exe		8.17.12.8616		NVIDIA User Experience Driver Component<br />nvxdplcy.dll		8.17.12.8616		NVIDIA User Experience Driver Component<br />nvxdbat.dll		8.17.12.8616		NVIDIA User Experience Driver Component<br />nvxdapix.dll		8.17.12.8616		NVIDIA User Experience Driver Component<br />NVCPL.DLL		8.17.12.8616		NVIDIA User Experience Driver Component<br />nvCplUIR.dll		3.9.732.0		
NVIDIA Control Panel<br />nvCplUI.exe		3.9.732.0		NVIDIA Control Panel<br />nvWSSR.dll		6.14.12.8616		NVIDIA Workstation Server<br />nvWSS.dll		6.14.12.8616		NVIDIA Workstation Server<br />nvViTvSR.dll		6.14.12.8616		NVIDIA Video Server<br />nvViTvS.dll		6.14.12.8616		NVIDIA Video Server<br />NVSTVIEW.EXE		7.17.12.8616		NVIDIA 3D Vision Photo Viewer<br />NVSTTEST.EXE		7.17.12.8616		NVIDIA 3D Vision Test Application<br />NVSTRES.DLL		7.17.12.8616		NVIDIA 3D Vision Module  (0)<br />nvDispSR.dll		6.14.12.8616		NVIDIA Display Server<br />NVMCTRAY.DLL		8.17.12.8616		NVIDIA Media Center Library<br />nvDispS.dll		6.14.12.8616		NVIDIA Display Server<br />PhysX		09.11.0621		NVIDIA PhysX<br />NVCUDA.DLL		8.17.12.8616		NVIDIA CUDA 4.1.1 driver<br />nvGameSR.dll		6.14.12.8616		NVIDIA 3D Settings Server<br />nvGameS.dll		6.14.12.8616		NVIDIA 3D Settings Server<br />]]></description>
   </item>
      <item>
      <title>SDK 4.1 NVVP visual profiler - unable to collect metric values (looking for memory access data)</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5056/sdk-4-1-nvvp-visual-profiler-unable-to-collect-metric-values-looking-for-memory-access-data</link>
      <pubDate>Thu, 23 Feb 2012 14:10:40 -0500</pubDate>
      <dc:creator>connertorcroboticscom</dc:creator>
      <guid isPermaLink="false">5056@/devforum/discussions</guid>
      <description><![CDATA[I'm running the Linux version of the 4.1 release of NVVP, the new visual profiler, with driver 285.05.33 on an NVIDIA GTX-560.<br /><br />I cannot get the system to collect metric/event data even though this is a compute capability 2.0 card. The test program executes multiple times (typically 24), but then returns:<br />Metric/Event Collection Failed<br />Unable to collect metric and event values<br />4<br /><br />The code is developed in OpenCL, not CUDA.<br /><br />I am looking for the memory access information (what was coalesced) that was previously available with the old computeprof tool under 4.0, but I can't find that information under NVVP. I was hoping that the metric data would give that information.<br /><br />Thanks for any help.<br /><br />David<br /><br />]]></description>
   </item>
      <item>
      <title>NVIDIA Parallel Nsight 2.1 Release Candidate 2 now available!</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/2171/nvidia-parallel-nsight-2-1-release-candidate-2-now-available</link>
      <pubDate>Mon, 05 Dec 2011 19:16:15 -0500</pubDate>
      <dc:creator>Sebastien Domine</dc:creator>
      <guid isPermaLink="false">2171@/devforum/discussions</guid>
      <description><![CDATA[<br /><p class="MsoNormal">Dear Parallel Nsight User,</p><br /><p class="MsoNormal">Building on the NVIDIA Parallel Nsight™ 2.1 Release Candidate 1 release with multiple bug fixes and stability improvements, we are proud to announce the release of <b>NVIDIA Parallel Nsight™ 2.1 Release Candidate 2</b>. This release brings support for the new <b>CUDA Toolkit 4.1</b> Release Candidate 2, which can be downloaded under the CUDA Registered Developer Program (<a href="http://www.developer.nvidia.com/join">www.developer.nvidia.com/join</a>). Parallel Nsight 2.1 adds a number of new features to enhance debugging and profiling capabilities. </p><br /><p class="MsoNormal">This release requires <b>NVIDIA Display Driver Release 285.86</b>, available on the same download site. </p><br /><ul style="list-style-type:disc;margin-top:0in;"><li class="MsoNormal" style="margin-bottom:.0001pt;"> Traced workloads can now <b>navigate the dependencies and call stack</b> to allow the developer to follow through GPU workloads, corresponding API calls and host code that was the cause of the activity.</li><li class="MsoNormal" style="margin-bottom:.0001pt;"><b>CUDA warp watch</b> visualizes variables and expressions across an entire CUDA warp.</li></ul>]]></description>
   </item>
      <item>
      <title>Performance degradation in 4.1 over 4.0</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4436/degradion-performance-4-1-over-4-0</link>
      <pubDate>Wed, 08 Feb 2012 09:29:40 -0500</pubDate>
      <dc:creator>kalman</dc:creator>
      <guid isPermaLink="false">4436@/devforum/discussions</guid>
      <description><![CDATA[Hi all,<br />because our application must not only be fast but must also perform<br />some operations within fixed deadlines (we analyze a continuous radio signal),<br />we benchmark all our algorithms several times per day.<br /><br />We are experiencing a clear degradation when moving from the old CUDA 4.0 to CUDA 4.1.<br /><br />I have attached 4 images showing the historical performance data of 4 algorithms <br />(they are not all the affected ones, just the simplest ones for showing you the kernel code).<br /><br />For all graphs the reported time is in milliseconds (y-axis).<br /><br />All kernels are launched in this way:<br /><br />#define BLOCK_SIZE (1&lt;&lt;9)<br /><br />dim3 myThreads(BLOCK_SIZE);<br />dim3 myGrid( (aSize + BLOCK_SIZE - 1) / BLOCK_SIZE);<br /><br />Kernel&lt;&lt;&lt; myGrid, myThreads&gt;&gt;&gt;(.....);<br /><br />We have the C2050 cards with ECC off.<br /><br />============================================================================<br />Sum of two complex vectors (2^20 complex):  add_cc.png<br /><br />__global__ void<br />VectorVectorSumKernelCC_O(const float2* aIn1,<br />                          const float2* aIn2,<br />                          float2* aOut,<br />                          const unsigned int aSize) {<br />  const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;<br />  if (myPos &lt; aSize) {<br />    aOut[myPos].x = aIn1[myPos].x + aIn2[myPos].x;<br />    aOut[myPos].y = aIn1[myPos].y + aIn2[myPos].y;<br />  }<br />}<br />============================================================================<br />Product of two complex vectors (2^20 complex): mul_cc.png<br /><br />__global__ void<br />MulKernel_cv_cv_o(const float2* aIn1,<br />                  const float2* aIn2,<br />                  float2* aOut,<br />                  const unsigned int aSize) {<br />  const unsigned int myPos = blockIdx.x * blockDim.x + 
threadIdx.x;<br />  if (myPos &lt; aSize) {<br />    const float myReal1 = aIn1[myPos].x;<br />    const float myReal2 = aIn2[myPos].x;<br />    const float myImag1 = aIn1[myPos].y;<br />    const float myImag2 = aIn2[myPos].y;<br />    aOut[myPos].x = myReal1 * myReal2 - myImag1 * myImag2;<br />    aOut[myPos].y = myReal1 * myImag2 + myImag1 * myReal2;<br />  }<br />}<br />============================================================================<br />Product of two complex vectors (2^20 complex), in place: mul_cc_i.png<br /><br />__global__ void<br />MulKernel_cv_cv_i(const float2* aIn,<br />                        float2* aInOut,<br />                  const unsigned int aSize) {<br />  const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;<br />  if (myPos &lt; aSize) {<br />    const float myTmp = aInOut[myPos].x;<br />    const float myInR = aIn[myPos].x;<br />    const float myInI = aIn[myPos].y;<br />    aInOut[myPos].x = myInR * aInOut[myPos].x - myInI * aInOut[myPos].y;<br />    aInOut[myPos].y = myInR * aInOut[myPos].y + myInI * myTmp;<br />  }<br />}<br />============================================================================<br />Tone generation (2^20 vector long): tone.png<br /><br />__global__ void<br />ComplexExpKernel(float2* aInOut,<br />                 const unsigned int aSize,<br />                 const float aMagnitude,<br />                 const float aNormalizedFrequency,<br />                 const float aInverseFrequency,<br />                 const float aPhase) {<br /><br />  const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;<br />  if (myPos &lt; aSize) {<br />    const float myArgument = aNormalizedFrequency * fmodf((float)myPos, aInverseFrequency) + aPhase;<br />    aInOut[myPos].x = aMagnitude * __cosf(myArgument);<br />    aInOut[myPos].y = aMagnitude * __sinf(myArgument);<br />  }<br />}<br />============================================================================ ]]></description>
   </item>
      <item>
      <title>Can we change the number of running cores in Tesla Card?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4021/can-we-change-the-number-of-running-cores-in-tesla-card</link>
      <pubDate>Thu, 26 Jan 2012 14:14:30 -0500</pubDate>
      <dc:creator>hugepuff</dc:creator>
      <guid isPermaLink="false">4021@/devforum/discussions</guid>
      <description><![CDATA[I want to know how to change the number of running cores on a Tesla card.<br />When you launch a kernel, you specify how many threads and stream processors you are going to use.<br />However, if you have a GPU card with 480 cores and you are running a program with only 120 threads, what is the status of the other 360 stream processors?<br />Are they automatically shut down by an NVIDIA power-saving strategy? ]]></description>
   </item>
      <item>
      <title>CUDA Profiler 4.1 issue -</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1591/cuda-profiler-4-1-issue-</link>
      <pubDate>Thu, 17 Nov 2011 05:57:09 -0500</pubDate>
      <dc:creator>omegaRho</dc:creator>
      <guid isPermaLink="false">1591@/devforum/discussions</guid>
      <description><![CDATA[Hi, <br /><br />I am getting this error and am unable to figure out how to fix it. <br /><br />[quote]<br />Unable to collect metric and event values. <br />The order of kernel execution does not match the timeline. To associate events and metrics with the correct kernel, the application must behave identically on each run. Discarding all collected events and metrics.<br />[/quote]<br /><br />What does "identically" mean here? Does it mean that the timing must be exactly the same for each run?<br /><br />I am running the 4.1 version of the profiler on Linux. The CUDA 4.0 profiler gave me no issues with the same executable.]]></description>
   </item>
      <item>
      <title>Need a driver update for GeForce 330M</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4121/need-a-driver-update-for-geforce-330m</link>
      <pubDate>Sun, 29 Jan 2012 16:41:39 -0500</pubDate>
      <dc:creator>twist69</dc:creator>
      <guid isPermaLink="false">4121@/devforum/discussions</guid>
      <description><![CDATA[My notebook is an Acer Aspire 5745G-434G64Mn.<br />My OS is Windows 7 Ultimate x64.<br />My video card is a GeForce 330M (HW ID is PCI\VEN_10DE&amp;DEV_0A29&amp;SUBSYS_035B1025&amp;REV_A2).<br /><br />I need a driver update because some games and apps need the latest drivers, but when I download the 330M drivers from NVIDIA, they are not compatible with my video card.<br />Sometimes the video driver crashes, but I can't give any details about that problem right now. I think it will be fixed once you release a version of the latest drivers that is compatible with my card.<br /><br />Help me please!]]></description>
   </item>
      <item>
      <title>PerfHUD executable</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3661/perfhud-executable</link>
      <pubDate>Wed, 18 Jan 2012 17:00:11 -0500</pubDate>
      <dc:creator>Jim Gray</dc:creator>
      <guid isPermaLink="false">3661@/devforum/discussions</guid>
      <description><![CDATA[Hi! I'm trying to get started with PerfHUD to analyze a DirectX application on Windows/PC. Unfortunately, I'm not even able to find the PerfHUD executable. :(<br /><br />I've read the document titled "Getting Started with PerfHUD 6.0". It says "Launch PerfHUD by double-clicking the desktop icon, or using the Start Menu."<br /><br />Well, I've installed the PerfKit *twice* and I still don't have a desktop icon or a Start Menu item for PerfHUD. Also, searching my hard drive, I don't find perfhud.exe or anything like that. For reference, the file I'm installing from is NVIDIA_PerfKit_x86_6.72.0719.0645.exe.<br /><br />Obviously, I'm missing some crucial information. Any help would be appreciated! Thanks!<br /><br />Jim Gray<br />]]></description>
   </item>
      <item>
      <title>CUDA Command Line Profiler</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3226/cuda-command-line-profiler</link>
      <pubDate>Mon, 09 Jan 2012 22:33:06 -0500</pubDate>
      <dc:creator>lemonherb</dc:creator>
      <guid isPermaLink="false">3226@/devforum/discussions</guid>
      <description><![CDATA[Update:<br /><br />It seems to be working correctly if I use a larger data set. I'll do some more experiments to confirm this. I guess these two types of counters aren't as sensitive as some of the others.<br /><br />Hi. I was trying to use the command line profiler to see how many DRAM reads my code was doing.<br />However, neither<br />l2_subp0_read_sector_misses<br />nor<br />fb_subp0_read_sectors<br />gives any meaningful answers. <br /><br />For a simple data-copy kernel that does <br />trg[blockDim.x * gridDim.x + threadIdx.x] = src[blockDim.x * gridDim.x + threadIdx.x]<br />these counters always return '11' no matter how much data I copy, whereas other counters such as <br />gld_request<br />and<br />l1_global_load_miss<br />give the correct readings.<br /><br />I am using a GTX580 with CUDA 4.0. 'nvcc --version' returns<br /><br />nvcc: NVIDIA (R) Cuda compiler driver<br />Copyright (c) 2005-2011 NVIDIA Corporation<br />Built on Thu_May_12_11:09:45_PDT_2011<br />Cuda compilation tools, release 4.0, V0.2.1221<br /><br />Thank you!<br />]]></description>
   </item>
      <item>
      <title>Debugging Shaders</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3016/debugging-shaders</link>
      <pubDate>Tue, 03 Jan 2012 17:58:14 -0500</pubDate>
      <dc:creator>NiteLordz</dc:creator>
      <guid isPermaLink="false">3016@/devforum/discussions</guid>
      <description><![CDATA[I am hoping to use Nsight to debug a shader issue I am having. However, when I view the shader list, it says that the shader is not available, but allows me to see the disassembly. This doesn't help :(<br /><br />My question is: what options do I need to enable to allow me to debug the shaders?<br /><br />I run an offline shader compiler that compiles my code into byte code (with the DEBUG flag set), and once inside my engine I load the file using <br /><br />hr = device-&gt;CreateVertexShader(byteCodeVS_, byteCodeSizeVS_, &amp;vertexShader_);<br /><br />Thanks much for your help]]></description>
   </item>
      <item>
      <title>OpenGL application crashes after starting with NSight Graphics Debugging</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/2001/opengl-application-crashes-after-starting-with-nsight-graphics-debugging</link>
      <pubDate>Wed, 30 Nov 2011 13:42:56 -0500</pubDate>
      <dc:creator>Ident8</dc:creator>
      <guid isPermaLink="false">2001@/devforum/discussions</guid>
      <description><![CDATA[If I start my application from Visual Studio 2010 through the NSight bar using Graphics Debugging, the application creates the OpenGL context as usual, works for a bit and then crashes.<br /><br />I cannot keep my console open to see if something special was written there, because I can't "start without debugging". Also, if I write the console output into a txt log file, the NSight messages are not printed there; they only go to the console, and I cannot redirect them with cout. But after starting about 20 times and trying to figure out what NSight writes into my console, I still couldn't see anything out of the ordinary in the short time I had to look at the log.<br /><br />I checked all Visual Studio 2010 outputs; there is nothing special written there by NSight. I don't see why it crashes.<br />Also, I cannot run any of the samples. I use the latest early release candidate of NVIDIA NSight. I get errors if I try to build any sample project; the errors relate to fx files, and I gave up after a while.<br /><br />My graphics card is a Quadro FX1800M. My application uses OpenGL (2 contexts; the 1st gets destroyed after a short bit and then the 2nd gets created, but that shouldn't be an issue). I use VS 2010 on a Windows 7 x64 machine, but the program is compiled as x86.<br /><br />]]></description>
   </item>
      <item>
      <title>Does the PerfSDK support the 580 GTX?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1496/does-the-perfsdk-support-the-580-gtx</link>
      <pubDate>Mon, 14 Nov 2011 06:52:44 -0500</pubDate>
      <dc:creator>Claude Dareau</dc:creator>
      <guid isPermaLink="false">1496@/devforum/discussions</guid>
      <description><![CDATA[I am having trouble using PerfHUD with a 580 GTX. I see the message "Data unavailable. Failed to initialize NVIDIA PerfSDK." and then, after a seemingly random time period, the PC reboots. <br /><br />We also have a couple of 460 GTXs that are working just fine, so I am working on the assumption that the 580 GTX is not supported. Am I correct to make this assumption?<br /><br />The driver version is 285.62; we are running a 32-bit DX9-based application, so we are using the 32-bit version of PerfHUD; the OS is 64-bit Win7. This is true for both the working 460 GTX and the non-working 580 GTX configs.]]></description>
   </item>
      <item>
      <title>How to enable PerfHUD ES on an Android Honeycomb (3.1) device from a Windows host</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/756/how-to-enable-the-perfhud-es-on-android-honeycomb3-1-device-from-window-host</link>
      <pubDate>Wed, 28 Sep 2011 23:51:33 -0400</pubDate>
      <dc:creator>maceo1975</dc:creator>
      <guid isPermaLink="false">756@/devforum/discussions</guid>
      <description><![CDATA[Hi there,<br />my test device is the Acer A500 tablet with Android 3.1.<br />I've followed the tegra_perfhud_quickstart document to install and set up the PerfHUD profiling tool to monitor my Android 3.1 device.<br />However, I cannot find the perfhud_switch executable file (enable_perfhud.bat) to enable PerfHUD on my device.<br />Where can I get it?<br />(I saw an Android 3.1 tablet used as a target device in the video demo on your website that introduces PerfHUD, so there must be a way to do this on Honeycomb.)<br /><br />Thanks....]]></description>
   </item>
      <item>
      <title>Profiling GLSL 4 shaders</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1121/profiling-glsl-4-shaders</link>
      <pubDate>Sun, 23 Oct 2011 16:07:30 -0400</pubDate>
      <dc:creator>[Deleted User]</dc:creator>
      <guid isPermaLink="false">1121@/devforum/discussions</guid>
      <description><![CDATA[<div class="Deleted">The user and all related content has been deleted.</div>]]></description>
   </item>
      <item>
      <title>NVIDIA Parallel Nsight 2.1 Release Candidate 1 now available for download!</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1431/nvidia-parallel-nsight-2-1-release-candidate-1-now-available-for-download</link>
      <pubDate>Thu, 10 Nov 2011 20:50:50 -0500</pubDate>
      <dc:creator>Sebastien Domine</dc:creator>
      <guid isPermaLink="false">1431@/devforum/discussions</guid>
      <description><![CDATA[Dear Parallel Nsight Users,<br /><br />NVIDIA is proud to announce the new release of NVIDIA Parallel Nsight™ 2.1 Release Candidate 1. This new release brings support for the new CUDA Toolkit 4.1 Release Candidate 1, which can be downloaded under the CUDA Registered Developer Program (www.developer.nvidia.com/join). Parallel Nsight 2.1 also adds new features for an enhanced CUDA debugging and profiling experience, and new features for DirectX graphics developers such as the ability to edit shaders, while the application is running, and measure drawcall timings for more advanced profiling analysis. <br /><br />Parallel Nsight 2.1 RC1 can be downloaded from <a href="http://parallelnsight.nvidia.com/content/parallel-nsight-early-access" target="_blank" rel="nofollow">http://parallelnsight.nvidia.com/content/parallel-nsight-early-access</a>, and requires Driver Release 285.67, available on the same download site.<br /><br />For a complete list of the new exciting 2.1 features, go to <a href="http://parallelnsight.nvidia.com/content/parallel-nsight-early-access" target="_blank" rel="nofollow">http://parallelnsight.nvidia.com/content/parallel-nsight-early-access</a>.<br /><br />We recommend developers to give this new release an early trial and send us feedback via the developer tools forums or report issues by emailing ParallelNsight-Support@nvidia.com. <br /><br />The NVIDIA Developer Tools Team<br />]]></description>
   </item>
      <item>
      <title>OpenGL ES 2.0; Tegra2; GPU; How many GPU cores are used by the glDrawArrays/glDrawElements functions?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1401/opengl-es-2-0-tegra2-gpu-how-much-gpu-cores-are-used-into-gldrawarraysgldrawelements-functions</link>
      <pubDate>Thu, 10 Nov 2011 07:58:29 -0500</pubDate>
      <dc:creator>GreenTroll</dc:creator>
      <guid isPermaLink="false">1401@/devforum/discussions</guid>
      <description><![CDATA[Hi everybody!<br /><br />Does anybody have information about how many GPU cores are used when I call glDrawArrays/glDrawElements?<br /><br />A few more details to explain my question.<br />The Tegra2 processor has a 4-core GPU, and libGLESv2.so is used to drive it.<br />After all the preparatory work is done (creating and linking shaders, uploading textures, etc.), I call the draw function, which starts rasterization and creates the image in the framebuffer. <br />I think the draw function should use as many cores as possible to make rasterization faster,<br />but I can't find any documents that confirm my theory. <br />The OpenGL specification only describes its own API level and, understandably, nothing about the levels below it. NVIDIA does not document how libGLESv2.so is implemented.<br /><br /><br />Thanks for your attention.<br />With best wishes.<br /><br /><br /> ]]></description>
   </item>
      <item>
      <title>DirectX Application crashes when run with Parallel NSight 2.0</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1016/directx-application-crashes-when-run-with-parallel-nsight-2-0</link>
      <pubDate>Sat, 15 Oct 2011 10:41:34 -0400</pubDate>
      <dc:creator>TiagoVCota</dc:creator>
      <guid isPermaLink="false">1016@/devforum/discussions</guid>
      <description><![CDATA[When I try to profile any DirectX application with NSight the application crashes (I believe it happens when Present() is called).<br /><br />I'm trying to profile on an Asus N55SF with an nVidia GeForce GT 555M(+Intel HD Graphics) and with driver version 285.38 installed (the only driver that I'm able to install after version 268.xx).<br /><br />What might be causing this problem?]]></description>
   </item>
      <item>
      <title>Very Slow OpenGL Performance with GeForce 580/590 GTX</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/876/very-slow-opengl-performance-with-geforce-580590-gtx</link>
      <pubDate>Fri, 07 Oct 2011 05:41:26 -0400</pubDate>
      <dc:creator>hanno hugenberg</dc:creator>
      <guid isPermaLink="false">876@/devforum/discussions</guid>
      <description><![CDATA[Hi all,<br /><br />We ran into some (for us) inexplicable performance issues with our new Zotac GeForce 580 GTX and Zotac GeForce 590 GTX cards.<br /><br />The software we use is our own development and is based on OpenSG 1.8 (which uses OpenGL 2.1 as far as I know). <br />The test hardware runs Windows 7 and XP, both 64-bit. <br />The driver was the latest: 280.26<br /><br />We have a scene with around 800 scene graph nodes and 4 million triangles.<br /><br />Resulting FPS (identical view and resolution):<br /><br />Geforce 8800:         11<br />Quadro 5600:          22<br />Quadro 4600:          16<br />Geforce 580gtx:       ~3.5 ?<br />Geforce 590gtx:       ~3.5 ?<br /><br />So we only reach around 3.5 fps with the newest graphics cards? It's reproducible on different PCs with other Zotac 580/590 GeForce cards.<br /><br />We tried a lot, thought about it for a week and have reached no conclusion.<br /><br />Can you help? Any suggestions as to why we get this drop in frame rate?<br />We bought around nine 590 cards for our cluster system to increase the frame rate, and not only the electric power consumption :P<br /><br />What can it be? Driver issues? Bad RAM, an incompatible mainboard, backward incompatibility with lower OpenGL versions?<br /><br />Thank you very much in advance for your thoughts and time.<br /><br />Hanno Hugenberg<br />Fraunhofer IFF Magdeburg<br />Germany]]></description>
   </item>
      <item>
      <title>CUDA filter where the output of one block is the input of the next block</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/851/cuda-filter-with-ouput-of-this-block-is-the-input-of-the-next-block</link>
      <pubDate>Thu, 06 Oct 2011 11:26:19 -0400</pubDate>
      <dc:creator>nguyenxh</dc:creator>
      <guid isPermaLink="false">851@/devforum/discussions</guid>
      <description><![CDATA[I am working on a filter and am having trouble implementing the following piece of code to process an image on the GPU:<br /><br />for(int h=0; h    for(int w=1; w    image[h][w] = (1-a)*image[h][w] + a*image[h][w-1];<br />    }<br />}<br />If I define:<br /><br />dim3 threads_perblock(32, 32)<br /><br />then the threads within each 32x32 block can communicate with each other, but they cannot communicate with the threads of other blocks.<br /><br />Within a thread block, I can translate that piece of code using shared memory. However, at the block edges, for example image[0][31] and image[0][32], the pixels are in different thread blocks: image[0][32] needs the value of image[0][31] to calculate its own value, but the two are in different thread blocks.<br /><br />That is the problem.<br /><br />How would I solve this?<br /><br />Thanks in advance.]]></description>
   </item>
      <item>
      <title>Function profiling on GPU with cuda</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/561/function-profiling-on-gpu-with-cuda</link>
      <pubDate>Mon, 19 Sep 2011 04:30:21 -0400</pubDate>
      <dc:creator>nikaiw</dc:creator>
      <guid isPermaLink="false">561@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br />I'm developing with CUDA on Linux. I recently tried using the "NVIDIA Compute Visual Profiler". However, the results were below my expectations: I was expecting more profiling information about the different GPU functions, such as the GPU compute time for each function.<br /><br />I guess and understand that the GPU can't really differentiate between the functions, because they are all inlined by nvcc and merged into a single loadable binary file.<br /><br />But is there any way to get such information? For example, by leaving a label that the profiling tool would know we have reached, or otherwise by using a special inline PTX instruction for this purpose?<br /><br />I'm interested in any way to do this. Maybe it is possible to get this information with Parallel Nsight on Windows and two GPUs? Maybe it's even possible to get better information using two GPUs on Linux?<br /><br />For the record, I'm running the code and profiler remotely using ssh.<br /><br />Thanks :)<br /><br />EDIT: Moved to "GPU Computing" Category.]]></description>
   </item>
      <item>
      <title>Driver installation problem with Nvidia Optimus configuration: intel HD and GT540M (GF108) on Vista</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/716/driver-installation-problem-with-nvidia-optimus-configuration-intel-hd-and-gt540m-gf108-on-vista</link>
      <pubDate>Mon, 26 Sep 2011 16:15:19 -0400</pubDate>
      <dc:creator>mbets</dc:creator>
      <guid isPermaLink="false">716@/devforum/discussions</guid>
      <description><![CDATA[<img src="http://nickbets.dyn-o-saur.com/flash/foto/nvidia_driver/NvidiaGT540M_Problem.jpg" alt="Nvidia Driver installation problem" />]]></description>
   </item>
      <item>
      <title>HOW CAN I USE ALL THE THREADS IN MY GPU CARD</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/571/how-can-i-use-all-the-threads-in-my-gpu-card</link>
      <pubDate>Mon, 19 Sep 2011 16:28:20 -0400</pubDate>
      <dc:creator>CHENHE</dc:creator>
      <guid isPermaLink="false">571@/devforum/discussions</guid>
      <description><![CDATA[Hi all,<br /><br />I am using my 9400GT for some experiments and want to use all the threads on this GPU. Is the total equal to (number of threads per block) * (number of blocks in one grid) * (number of grids it supports)?<br /><br />Chen<br />University of Nebraska-Lincoln<br /><br />EDIT: Moved to GPU Computing Category]]></description>
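      <!--
      A minimal sketch of how the relevant limits could be read from the runtime, assuming device 0 is the card in question. A kernel launch creates a single grid, so the number of threads launched is (threads per block) * (blocks in the grid). The number of threads that can be resident on the hardware at once is a separate, smaller figure: multiprocessor count times the per-multiprocessor thread limit.

      ```cuda
      #include <cstdio>
      #include <cuda_runtime.h>

      int main() {
          cudaDeviceProp prop;
          if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {   // device 0 is an assumption
              printf("no CUDA device found\n");
              return 1;
          }

          // Launch limits: how large a single block and a single grid may be.
          printf("device: %s\n", prop.name);
          printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
          printf("max grid size: %d x %d x %d\n",
                 prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

          // Hardware limit: threads that can be resident on all multiprocessors at once.
          long long resident =
              (long long)prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
          printf("multiprocessors: %d\n", prop.multiProcessorCount);
          printf("max resident threads: %lld\n", resident);
          return 0;
      }
      ```

      Launching more threads than the resident limit is fine: the remaining blocks are scheduled as earlier blocks finish.
      -->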
   </item>
      <item>
      <title>Does anyone know what cache policy the GPU uses between global memory and local area memory?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/621/anyone-knows-what-is-the-cache-policy-between-global-memory-and-local-area-memory-that-gpu-using</link>
      <pubDate>Tue, 20 Sep 2011 15:15:10 -0400</pubDate>
      <dc:creator>CHENHE</dc:creator>
      <guid isPermaLink="false">621@/devforum/discussions</guid>
      <description><![CDATA[Any comments, documents, or links will be appreciated!]]></description>
   </item>
      <item>
      <title>What are the steps for adding a 3D application to the Optimus driver profile</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/456/what-are-the-steps-to-do-for-adding-a-3d-application-to-the-optimus-driver-profile</link>
      <pubDate>Mon, 12 Sep 2011 07:22:11 -0400</pubDate>
      <dc:creator>oivindolavesencompusoftno</dc:creator>
      <guid isPermaLink="false">456@/devforum/discussions</guid>
      <description><![CDATA[I have a 3D application and need to add it to the Optimus driver profile but I’m not able to find where to start. What are the procedures for adding an application to the Optimus driver profile?]]></description>
   </item>
      </channel>
</rss>
