Understanding Natural Language with Deep Neural Networks Using Torch

Originally published at: https://developer.nvidia.com/blog/understanding-natural-language-deep-neural-networks-using-torch/

This post was co-written by Soumith Chintala and Wojciech Zaremba of Facebook AI Research. Language is the medium of human communication. Giving machines the ability to learn and understand language enables products and possibilities that are not imaginable today. One can understand language at varying granularities. When you learn a new language, you start with…

The performance comparison section could use some clarification. Both the CPU model and the means of parallelization used in the CPU code are omitted. One can only hope that it's not some low-end, several-years-old i7 (mobile? do desktop i7s at 2.7 GHz even exist?) that you compare the latest GeForce cards against. And then we have not even talked about power consumption and other aspects.

Given that past, even very recent, posts on this blog have made highly questionable and rather unfair performance comparisons, I cannot help thinking that the same is happening here.

I'm afraid that in a more honest comparison, e.g. an i7-5930K vs. a GTX 980, or 2x i7-4790K vs. a GTX 980, the 5-10x difference shown above would change drastically.

It is time to change the culture of sloppy comparisons, and here I'm giving the authors the benefit of the doubt that this was just a mistake.

Hi pSz!

Soumith should respond to your specific comments, but I'd like to discuss your claim that "past, even very recent, posts on this blog have made highly questionable and rather unfair performance comparisons."

I think this claim is quite untrue, but if you can provide some pointers to specific examples, I'd be happy to investigate.

I think that the "culture of sloppy comparisons" actually changed years ago, and at NVIDIA we are very careful with accelerated computing comparisons. Again, if you can point out specific examples on this blog that you think are "sloppy" I'd be happy to look into the details.

Mark

Mark, coincidentally, it is your recent 7.0 RC overview blog article that compares the performance of the new solver library on a K40 against some 4-5-year-old Sandy Bridge desktop CPU.

Hi pSz. Hope you are enjoying GTC! I have updated the post with a more recent comparison, against a quite expensive Haswell-based Xeon CPU. http://devblogs.nvidia.com/...

Thanks for fixing it, I really appreciate your responsiveness to criticism! I have not seen any comment by Soumith, though.

I hope the cuSOLVER home page also gets updated soon: https://developer.nvidia.co... !

Hey pSz,

While your skepticism is good, I've given the source code (and instructions on how to run it) right in the blog post; there is nothing to hide here.
I've rerun the benchmark on the DIGITS box, which has a 6-core Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz. Hopefully that satisfies your constraints for a fair comparison. It takes 207 minutes in total on the CPU and 29 minutes on the GPU (which is actually an even bigger speedup than the one I've given in the blog post). The BLAS used is OpenBLAS (latest trunk) with multiple cores enabled for all BLAS operations.

A full gist is here for you:
https://gist.github.com/sou...

p.s.: sorry for the delayed reply. This is the earliest that I could take time to rerun the benchmarks on something more appropriate.

--
Soumith

Thanks Soumith for the clarification.

I still believe that, to have a complete description of what your "Table 2" compares, at least the following information is necessary:
- exact model number of the CPU;
- compiler, flags, etc. as well as means of parallelization used in the CPU code (or a reference to what the code used is capable of, e.g. threading, SIMD);
- amount of resources used on the CPU to execute the experiment (number of threads, HT on/off, turbo boost, etc.)

The same applies to the GPU code and experiments done with it! Without all of that, IMHO such comparisons belong to the blogs of the specialist communities who may not care much about such "irrelevant" technicalities.

> I've given the source code (and instructions) on how to run the code right in the blogpost, there is nothing to really hide here.

I suspect that you're missing my point, after this comment even more so. This is the "Parallel Forall" blog, not a machine-learning-specialist one. Based on its description (http://devblogs.nvidia.com/...), the blog advertises itself as focused on "detailed technical information on a variety of massively parallel programming topics" and mentions "high-performance programming techniques" as one of the topics discussed. Finally, in the last paragraph it highlights how GPU computing claims its space in the world of HPC/scientific computing.

To live up to these ideals, I believe at least this blog (but preferably the entire HPC/computing division at NVIDIA, including marketing) needs to become (even) better at being more "C" and less "H". With a "parent and provider" living off of the gamer community, and with opponents as well as partners deeply embedded in the scientific and technical computing world (and in the minds of those who are part of it), I think it is highly beneficial, if not necessary, for new players like GPUs and NVIDIA to be as honest as possible with the coders visiting such a technical blog. And if the competition pulls dirty tricks, disprove their numbers, or ask the community to do it; I'm sure many will gladly contribute to the extent possible!

BTW, Mark: it just occurred to me that you explicitly refer to the particularly high price of the CPU you compare against.

That reminded me of something, so I did a quick search. Although I'm fine with comparing socket vs. board, to be fair I should share what I found:

- The Xeon E5-2697v3 costs $2,769.99*, has a 145W TDP;
- The Tesla K40 costs $3,499.99*, has a 235W TDP.

I can only wonder what a comparison against a pair of CPUs that match the price and TDP of the K40 would look like, e.g. two E5-2680v3: 2x$1,799.99* / 2x120W (capped at 117.5W each :).

*Prices from newegg.com

pSz, have you considered the cost of memory? The Xeon CPU price does not include 12GB of GDDR5 RAM with ECC, while the K40 price does. The Xeon CPU TDP does not include the power for the memory, while the K40 TDP does.

Good point! However, the GPU does need a host too, doesn't it?

Today this host would most often have at least as much memory as the GPU, and in fact you will rarely have drastically less memory in GPU-equipped servers than in non-GPU-equipped ones (given a fixed set of use cases with a certain memory requirement). Plugging the K40 into a desktop box defeats the purpose of the Tesla (among other things its ECC), so the actual difference between the CPU-only and the CPU+GPU server platforms that we should be comparing will likely boil down to two cases. One is single-socket + GPU vs. dual-socket without GPU, if density is not a concern. However, as a single-socket machine has only so much memory bandwidth and so many PCIe lanes on the CPU side, quite likely a more realistic comparison is dual-socket lower-end CPU + GPU (e.g. 2x E5-2620v3 + 2x K40) vs. dual-socket higher-end CPU (e.g. 2x E5-2680v3).

PS: You could of course throw a third class of systems into the comparison: exotic stuff like the One Stop Systems 16-way 3U HDCA box, which gives 14*16=224 GPUs per 42U rack if we don't count switches and hosts, or more realistically 12*16=192 GPUs/rack (if feeding this beast with power and dissipating its heat is even possible in this setup). For workloads that are *very* parallel and very GPU-friendly such a system can do miracles. However, even this density is not unheard of in the CPU-only world. Take the Dell PowerEdge FX2 (up to four half-wide 1U 2-socket server modules in 2U), which allows ~164 sockets in a 42U rack, or the Supermicro MicroBlade (up to 28 2-socket modules in 7U), which based on specs allows up to 192 sockets per 42U rack.

I've given you the exact model number of the CPU in my first comment. The code is multi-thread capable (using OpenMP where appropriate) and SIMD-enabled for the BLAS operations (the sigmoid and softmax are not SIMD, understandably so, as the instructions to vectorize those operations are not obvious or universal). The amount of resources used on the CPU was not recorded.

Perhaps I'm confused, but where exactly does your comment state what your article's "Table 2" compares against? And in my humble opinion, your article itself needs amending; comments that merely provide _additional_ data do not fix the existing content.

Please consider Mark's actions as inspiration. Instead of posting additional random info in the comments section, he actually updated the post itself with new data and complete benchmark info *first* and foremost.

Let me say it again: I applaud his prompt and effective actions. At the same time, we (you and I) are instead arguing here about something simple and straightforward: that the benchmark data in your article is incomplete (and possibly bogus).

Soumith, thanks for the wonderful post! It was very interesting to read your blog. At any rate, I have a few questions about the source code for the LSTM. How can I get help / advice understanding the code? Specifically, I am wondering about the lines that look like the following:

local i2h = nn.Linear(params.rnn_size, 4*params.rnn_size)(x)

What exactly is the variable x doing there? Why is the function in the following format?

foo(param1, param2)(param3) ?

This would make sense if foo returns a function, but it doesn't seem that way...

Thanks in advance for your help!

-Hidekazu

Hi Hidekazu,

I wouldn't do justice to explaining this, compared to this excellent post by Adam Paszke who tears apart that piece of code and explains it with the help of nice diagrams and math:

http://apaszke.github.io/ls...
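
For quick reference, here is a minimal sketch of that call pattern, assuming the nngraph package and using made-up layer sizes. nngraph overloads the call operator on nn modules, so calling a module with a graph node returns another graph node rather than a tensor; that is why nn.Linear(...)(x) is valid:

require 'nn'
require 'nngraph'

-- nngraph makes module(node) return a new graph node instead of a tensor.
local x = nn.Identity()()         -- input node of the graph
local h = nn.Linear(10, 40)(x)    -- Linear module applied to the node x
local y = nn.Tanh()(h)            -- another node, wrapping the Tanh output
local g = nn.gModule({x}, {y})    -- assemble the nodes into a single module

print(g:forward(torch.randn(10))) -- the resulting gModule acts like any nn module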

Thanks very much for the reference! This is very helpful!

Hi Walt. What do you mean by: "how do I talk to it"?

Thanks a lot for such an easily understandable example of an LSTM.

Hi Soumith,
In the TDNN/CNN example, I believe there's some issue with this line:
m:add(nn.TemporalConvolution(sentenceLength, 150, embeddingSize))
From the nn readme for Temporal Convolution:
module = nn.TemporalConvolution(inputFrameSize, outputFrameSize, kW, [dW])
In the above example, I guess the input frame size should be embeddingSize and the output frame size 150.
Hence, something like this seems more appropriate (with kW being the desired kernel width):
m:add(nn.TemporalConvolution(embeddingSize, 150, kW))
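
For what it's worth, a quick shape check (with made-up values for sentenceLength, embeddingSize and the kernel width kW) illustrates the expected input layout: TemporalConvolution takes an nInputFrames x inputFrameSize tensor, i.e. sentenceLength x embeddingSize for a sentence of word embeddings.

require 'nn'

-- Made-up sizes; only the argument order matters for the point above.
local sentenceLength, embeddingSize, kW = 37, 100, 3

-- inputFrameSize = embeddingSize, outputFrameSize = 150, kernel width = kW
local conv = nn.TemporalConvolution(embeddingSize, 150, kW)

-- One sentence: sentenceLength frames, each an embeddingSize-dim embedding.
local input = torch.randn(sentenceLength, embeddingSize)
print(conv:forward(input):size())  -- (sentenceLength - kW + 1) x 150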