Resources

NVIDIA programming documents
Videos: Harbin Institute of Technology, Su Tonghua (哈工大-苏统华) | gpu | Fan Zheyong, CUDA Programming (樊哲勇-CUDA编程)

Tutorials

Tutorial. A more detailed tutorial with background material and good examples.
CUDA by Example. Highly recommended: it works well as a tutorial and is easy to read. NB: For copyright reasons, please do not redistribute this PDF file.
CUDA official documents
A video course mentioned by Lenna
A brief tutorial
CUDA Memory Architecture
A presentation by Scott Le Grand
Lecture from University of Wisconsin-Madison
Tutorial. Very concise. You should have basic knowledge of C before reading it.

Higher-Level wrapping of CUDA/OpenCL

Thrust: a high-level C++ interface to CUDA (released by NVIDIA), GitHub link
ArrayFire: a commercial library for C/C++/Fortran; supports both CUDA and OpenCL
ViennaCL: a C++ interface supporting CUDA/OpenCL/OpenMP
cudapp: similar to Thrust; does not appear to be under active development
PyCUDA: a Python wrapper for CUDA

MD packages supporting GPU

Purchasing NVIDIA GPUs

Recommended vendor: Colfax
Where to buy: http://www.nvidia.com/object/tesla_wtb.html
Online shopping (GTX580): http://www.nvidia.com/object/buy_now_results_ci.html?id=GFGTX580

FAQs

Texture memory

See this Ref

It is a common misconception, but there is no such thing as “texture memory” in CUDA GPUs. There are only textures, which are global memory allocations accessed through dedicated hardware which has inbuilt cache, filtering and addressing limitations which lead to the size limits you see reported in the documentation and device query. So the limit is either roughly the free amount of global memory (allowing for padding and alignment in CUDA arrays) or the dimensional limits you already quoted.
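
To make the point above concrete, here is a minimal sketch that binds a texture object to an ordinary `cudaMalloc` buffer and reads it back with `tex1Dfetch`. The kernel name `copy_via_texture` and the helper `texture_roundtrip` are hypothetical names chosen for this illustration; it assumes a device and toolkit that support texture objects (compute capability 3.0 or later).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread reads one element through the texture unit.
__global__ void copy_via_texture(cudaTextureObject_t tex, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float>(tex, i);
}

// Hypothetical helper: returns true if the data read through the texture
// matches what was written to the underlying global-memory buffer.
bool texture_roundtrip(int n) {
    std::vector<float> host(n), back(n, -1.0f);
    for (int i = 0; i < n; ++i) host[i] = 0.5f * i;

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Describe the existing global-memory buffer as a linear texture resource.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_in;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;  // return raw floats, no conversion

    cudaTextureObject_t tex = 0;
    bool ok = cudaCreateTextureObject(&tex, &res, &td, nullptr) == cudaSuccess;
    if (ok) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        copy_via_texture<<<blocks, threads>>>(tex, d_out, n);
        ok = cudaMemcpy(back.data(), d_out, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
        cudaDestroyTextureObject(tex);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok && back == host;
}

int main() {
    std::printf("texture read matches global memory: %s\n",
                texture_roundtrip(1 << 16) ? "yes" : "no");
    return 0;
}
```

Note that no separate "texture memory" is allocated anywhere: the only device allocations are the two `cudaMalloc` calls, and the texture object is just a view onto one of them.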

Local memory

See this Ref
Not really a “memory” – bytes are stored in global memory
Differences from global memory:
- Addressing is resolved by the compiler
- Stores are cached in L1
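
As an illustration, a per-thread array indexed with a value known only at run time is a typical case where the compiler may place the array in local memory. The sketch below is an assumption-laden example, not a guarantee: whether `scratch` actually ends up in local memory depends on the architecture and optimization level. Compiling with `nvcc -Xptxas -v` prints the per-kernel resource usage so you can check. The names `dynamic_index` and `run_dynamic_index` are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kScratch = 64;

__global__ void dynamic_index(const int* idx, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float scratch[kScratch];  // private to each thread
    for (int k = 0; k < kScratch; ++k) scratch[k] = float(k + i);
    // The index is only known at run time, so the array cannot simply be
    // mapped onto registers; the compiler is likely to use local memory.
    out[i] = scratch[idx[i] % kScratch];
}

// Hypothetical helper: runs the kernel and checks the result on the host.
bool run_dynamic_index(int n) {
    std::vector<int> idx(n);
    std::vector<float> out(n, -1.0f);
    for (int i = 0; i < n; ++i) idx[i] = (7 * i) % 1000;

    int* d_idx = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_idx, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_idx);
        return false;
    }
    cudaMemcpy(d_idx, idx.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    dynamic_index<<<blocks, threads>>>(d_idx, d_out, n);
    bool ok = cudaMemcpy(out.data(), d_out, n * sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_idx);
    cudaFree(d_out);

    for (int i = 0; ok && i < n; ++i)
        ok = (out[i] == float(idx[i] % kScratch + i));
    return ok;
}

int main() {
    std::printf("result correct: %s\n", run_dynamic_index(4096) ? "yes" : "no");
    return 0;
}
```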

How to choose block size and grid size

http://stackoverflow.com/questions/4391162/cuda-determining-threads-per-block-blocks-per-grid
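
A common pattern, sketched here under assumptions, is to pick the block size first and then derive the grid size by ceiling division so that every element is covered. The runtime function `cudaOccupancyMaxPotentialBlockSize` (available since CUDA 6.5) can suggest a block size; the kernel `scale`, the helper `blocks_for`, and the fallback value of 256 threads are choices made only for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;  // guard: the last block may be partially filled
}

// Ceiling division: the smallest number of blocks that covers n elements.
int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

int main() {
    const int n = 1000003;  // deliberately not a multiple of the block size
    std::vector<float> h(n, 1.0f);

    float* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) {
        std::printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Ask the runtime for a block size that maximizes occupancy for this
    // kernel; fall back to 256 threads if the query fails.
    int min_grid = 0, block = 0;
    if (cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, scale, 0, 0)
            != cudaSuccess || block <= 0) {
        block = 256;
    }
    int grid = blocks_for(n, block);

    scale<<<grid, block>>>(d, 2.0f, n);
    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    std::printf("block=%d grid=%d first=%g last=%g\n",
                block, grid, h[0], h[n - 1]);
    return 0;
}
```

The suggested block size is a starting point only. Occupancy is not the same as performance, so it is worth timing a few candidate sizes (multiples of the warp size) for the kernel that matters.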

Streaming multiprocessors, Blocks and Threads

http://stackoverflow.com/questions/3519598/streaming-multiprocessors-blocks-and-threads-cuda
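
To see how these quantities look on a particular card, the following sketch queries device 0 with `cudaGetDeviceProperties` and prints the number of streaming multiprocessors and the thread limits. The helper `max_blocks_by_threads` is a hypothetical name; it gives only an upper bound from the thread limit, since the real number of resident blocks per SM also depends on register use, shared memory use, and a per-SM block cap.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Upper bound on resident blocks per SM considering the thread limit alone.
int max_blocks_by_threads(int max_threads_per_sm, int threads_per_block) {
    return max_threads_per_sm / threads_per_block;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no CUDA device found\n");
        return 1;
    }
    std::printf("device:                 %s\n", prop.name);
    std::printf("multiprocessors (SMs):  %d\n", prop.multiProcessorCount);
    std::printf("warp size:              %d\n", prop.warpSize);
    std::printf("max threads per block:  %d\n", prop.maxThreadsPerBlock);
    std::printf("max threads per SM:     %d\n", prop.maxThreadsPerMultiProcessor);

    int block = 256;  // example block size, chosen arbitrarily
    std::printf("blocks of %d threads per SM (thread limit only): %d\n",
                block,
                max_blocks_by_threads(prop.maxThreadsPerMultiProcessor, block));
    return 0;
}
```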
