====== Resources ======

  * [[https://docs.nvidia.com/cuda/index.html | Nvidia programming documents]]
  * Videos: [[https://www.bilibili.com/video/BV15E411x7yT?p=1 | Su Tonghua, Harbin Institute of Technology]] | [[https://www.bilibili.com/video/BV1Rg411T7H6 | CUDA Programming by Fan Zheyong]]

====== Tutorials ======

  * [[http://supercomputingblog.com/cuda-tutorials/ | Supercomputing Blog tutorial]]: a more detailed tutorial, with background knowledge and nice examples.
  * {{gpu:cuda-by-example.pdf | CUDA by Example}}: I would highly recommend this book. It works very well as a tutorial and is easy to read. NB: due to copyright, please DON'T redistribute this PDF file!
  * [[http://docs.nvidia.com/cuda/index.html | CUDA official documentation]]
  * [[http://www.udacity.com/overview/Course/cs344/CourseRev/1 | A video course mentioned by Lenna]]
  * [[http://geco.mines.edu/tesla/cuda_tutorial_mio/index.html | A brief tutorial]]
  * [[http://www.cvg.ethz.ch/teaching/2011spring/gpgpu/cuda_memory.pdf | CUDA Memory Architecture]]
  * [[http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0644-GTC2012-Molecule-Dynamics-GPU.pdf | A presentation by Scott Le Grand]]
  * [[http://sbel.wisc.edu/Courses/ME964/2012/ | Lectures from the University of Wisconsin-Madison]]
  * [[http://code.google.com/p/stanford-cs193g-sp2010/wiki/GettingStartedWithCUDA | Stanford CS193G tutorial]]: really concise; you should have basic knowledge of the C language before reading it.

====== Higher-level wrappers of CUDA/OpenCL ======

  * [[https://developer.nvidia.com/thrust | Thrust]]: a high-level C++ interface to CUDA (released by NVIDIA); [[https://github.com/thrust | GitHub link]]
  * [[http://www.accelereyes.com/products/arrayfire | ArrayFire]]: a commercial library for C/C++/Fortran; supports both CUDA and OpenCL
  * [[http://viennacl.sourceforge.net/index.html | ViennaCL]]: a C++ interface supporting CUDA/OpenCL/OpenMP
  * [[http://code.google.com/p/cudpp/ | CUDPP]]: similar to Thrust; seems no longer actively developed
  * [[http://mathema.tician.de/software/pycuda | PyCUDA]]: a Python wrapper for CUDA

====== MD packages supporting GPUs ======

  * [[http://ambermd.org/gpus/ | AMBER]]
  * [[http://www.ks.uiuc.edu/Research/gpu/ | NAMD]]
  * [[https://simtk.org/home/openmm/ | OpenMM]]
  * [[https://simtk.org/project/xml/downloads.xml?group_id=161#package_id600 | Gromacs via OpenMM]]
  * [[https://simtk.org/project/xml/downloads.xml?group_id=161#package_id1009 | Tinker via OpenMM]]
  * [[http://gcl.cis.udel.edu/projects/fenzi/index.php | Fen Zi]]
  * [[http://www.acellera.com/products/acemd/ | ACEMD]]
  * [[http://www.charmm.org/documentation/c37b1/gpu.html | CHARMM 37]]
  * [[http://en.wikipedia.org/wiki/Molecular_modeling_on_GPUs | Molecular modeling on GPUs]]

====== Purchasing NVIDIA GPUs ======

  * Recommended vendor: [[http://www.colfax-intl.com/NHome.html | Colfax]]
  * Where to buy: http://www.nvidia.com/object/tesla_wtb.html
  * Online shopping (GTX 580): http://www.nvidia.com/object/buy_now_results_ci.html?id=GFGTX580

====== FAQs ======

===== Texture memory =====

  * See this [[http://stackoverflow.com/questions/12340265/what-is-the-size-of-my-cuda-texture-memory | reference]]:

> It is a common misconception, but there is no such thing as "texture memory" in CUDA GPUs. There are only //textures//: global memory allocations accessed through dedicated hardware, which has built-in caching, filtering, and addressing limitations that lead to the size limits reported in the documentation and by device query. So the limit is either roughly the amount of free global memory (allowing for padding and alignment in CUDA arrays) or the dimensional limits quoted there.
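The point above, that a texture is just a view onto an ordinary global-memory allocation, can be sketched with the CUDA texture-object API. This is a minimal sketch under assumed defaults (error checking omitted, buffer size arbitrary), not a complete program pattern:

```cuda
#include <cuda_runtime.h>

__global__ void readThroughTexture(cudaTextureObject_t tex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(tex, i);  // read via the texture cache path
}

int main()
{
    const int n = 1024;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));   // plain global memory
    cudaMalloc(&d_out, n * sizeof(float));

    // Describe the existing global-memory buffer as a texture resource:
    // no separate "texture memory" is allocated anywhere.
    cudaResourceDesc res = {};
    res.resType                = cudaResourceTypeLinear;
    res.res.linear.devPtr      = d_in;
    res.res.linear.desc        = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    readThroughTexture<<<(n + 255) / 256, 256>>>(tex, d_out, n);
    cudaDeviceSynchronize();

    cudaDestroyTextureObject(tex);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Note that the size limit discussed in the reference shows up here only through `sizeInBytes`: the underlying allocation is bounded by free global memory, while dedicated texture hardware imposes the dimensional limits.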
===== Local memory =====

  * See this [[http://developer.download.nvidia.com/CUDA/training/register_spilling.pdf | reference]]
  * Not really a "memory": the bytes are stored in global memory
  * Differences from global memory:
    * Addressing is resolved by the compiler
    * Stores are cached in L1

===== How to choose block size and grid size =====

  * http://stackoverflow.com/questions/4391162/cuda-determining-threads-per-block-blocks-per-grid

===== Streaming multiprocessors, blocks, and threads =====

  * http://stackoverflow.com/questions/3519598/streaming-multiprocessors-blocks-and-threads-cuda

====== Applications ======

===== Nonlinear fitting =====

  * http://devernay.free.fr/hacks/cminpack/index.html
  * https://github.com/zitmen/cuLM
  * http://www.amazon.com/CUDA-Application-Design-Development-Farber/dp/0123884268
  * https://devtalk.nvidia.com/default/topic/392660/cuda-programming-and-performance/nonlinear-least-squares-/2/
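Regarding the block-size/grid-size FAQ above, the common starting point from the linked thread is: pick a block size that is a multiple of the warp size (32 threads), then derive the grid size from the problem size by ceiling division. A minimal sketch (the block size of 256 is an assumption, a reasonable default rather than a tuned optimum):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // guard: the last block may run past n
        x[i] = a * x[i];
}

int main()
{
    const int n = 100000;       // problem size, not a multiple of the block size
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    const int block = 256;                      // a multiple of the 32-thread warp
    const int grid  = (n + block - 1) / block;  // ceiling division so grid*block >= n

    scale<<<grid, block>>>(d_x, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    return 0;
}
```

The final choice of block size still depends on occupancy (registers and shared memory per block), which is what the linked discussion covers in detail.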