CUDA is Nvidia’s C-like API for non-graphics number crunching on their 8xxx-series and newer video cards. For certain operations, it is amazingly fast. Unfortunately, it is painful in the extreme to use, especially when compared to NumPy, Python’s wonderful scientific computing package.
So, to marry the two, I wrote some wrapper code for myself. It’s pretty much only good for one thing: multiplying large matrices together really fast. But it’s really good at it, and it’s really easy to use. For example:
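import numpy
from pycublas import CUBLASMatrix
# Wrap two float32 NumPy matrices: A is 2x3, B is 3x2
A = CUBLASMatrix( numpy.mat([[1,2,3],[4,5,6]],numpy.float32) )
B = CUBLASMatrix( numpy.mat([[2,3],[4,5],[6,7]],numpy.float32) )
# Multiply through the wrapper; C holds the 2x2 product
C = A*B
# np_mat() hands the result back to NumPy for printing
print(C.np_mat())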
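To sanity-check the example above without a GPU, the same product can be computed on the CPU with plain NumPy. This is only a reference sketch; the names A_host, B_host and expected are made up here and are not part of pycublas. Working the sums by hand, the product should come out as [[28, 34], [64, 79]].

```python
import numpy

# Same inputs as the example, kept as ordinary float32 NumPy arrays
A_host = numpy.array([[1, 2, 3], [4, 5, 6]], dtype=numpy.float32)
B_host = numpy.array([[2, 3], [4, 5], [6, 7]], dtype=numpy.float32)

# CPU reference product; a (2x3) times a (3x2) gives a (2x2)
expected = numpy.dot(A_host, B_host)
print(expected)
```

If the card is available, comparing this against C.np_mat() with numpy.allclose would be a reasonable check.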
All CUBLAS alloc and free calls are mapped to the CUBLASMatrix object’s life in Python, so you don’t have to worry about memory management. (Other than filling up the card, of course.)
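The general idea of tying an allocation to an object's lifetime can be sketched in a few lines. This is not the pycublas source: DeviceBuffer is a hypothetical stand-in that only bumps a counter where a real wrapper would make its CUBLAS alloc and free calls. It also assumes CPython, where reference counting runs __del__ as soon as the last reference goes away.

```python
class DeviceBuffer(object):
    """Hypothetical stand-in for a GPU allocation (not pycublas code)."""
    live = 0  # number of buffers currently "allocated"

    def __init__(self, nbytes):
        self.nbytes = nbytes
        DeviceBuffer.live += 1  # a real wrapper would allocate card memory here

    def __del__(self):
        DeviceBuffer.live -= 1  # a real wrapper would free card memory here


buf = DeviceBuffer(4 * 1024 * 1024)
count_while_alive = DeviceBuffer.live  # one buffer is live

del buf  # last reference dropped, so CPython calls __del__ right away
count_after_del = DeviceBuffer.live  # back to zero
```

On interpreters without reference counting, the free could happen later, so an explicit release method would be the safer design there.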
Here are some performance numbers, given as speedup over numpy (memory transfer times are included):
(4160×4160)*(4160×4160) = 43.0X
(4096×4096)*(4096×4096) = 34.0X
(3900×3900)*(3900×3900) = 47.3X
(2048×2048)*(2048×2048) = 28.2X
(1024×1024)*(1024×1024) = 58.8X
(512×512)*(512×512) = 24.1X
(256×256)*(256×256) = 6.3X
(128×128)*(128×128) = 1.1X
CPU: Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz stepping 06
GPU: nVidia Corporation GeForce 8800 GT (rev a2)
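For anyone who wants to take a similar measurement on their own hardware, here is a rough sketch of one way to do it. It is not necessarily how the numbers above were taken. best_time and speedup are hypothetical helpers, only the NumPy side actually runs, and the GPU side is left as a comment because it needs pycublas and a CUDA card.

```python
import time
import numpy

def best_time(fn, repeats=3):
    """Run fn a few times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.time()
        fn()
        best = min(best, time.time() - start)
    return best

def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run was than the CPU run."""
    return cpu_seconds / gpu_seconds

n = 256
a = numpy.random.rand(n, n).astype(numpy.float32)
b = numpy.random.rand(n, n).astype(numpy.float32)

# CPU side: plain NumPy matrix product
cpu_seconds = best_time(lambda: numpy.dot(a, b))

# GPU side (assumption: timed end to end, so memory transfers are included):
# gpu_seconds = best_time(
#     lambda: (CUBLASMatrix(numpy.mat(a)) * CUBLASMatrix(numpy.mat(b))).np_mat())
# print(speedup(cpu_seconds, gpu_seconds))
```

Wrapping the construction of both CUBLASMatrix objects and the np_mat() call inside the timed function is what would make the transfer cost part of the figure.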
Note: This version only supports float32.
Note: CUBLAS limits matrix dims to (65536×65536).
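Both notes can be checked on the NumPy side before anything is sent to the card. The sketch below uses a hypothetical helper, prepare_for_gpu, which is not part of pycublas. It assumes the 65536 limit applies to each dimension and is inclusive.

```python
import numpy

MAX_DIM = 65536  # per-dimension limit from the note above (assumed inclusive)

def prepare_for_gpu(array):
    """Hypothetical helper: return a 2-D float32 array or raise ValueError."""
    m = numpy.asarray(array, dtype=numpy.float32)  # NumPy defaults to float64
    if m.ndim != 2:
        raise ValueError("expected a 2-D matrix, got %d dimension(s)" % m.ndim)
    if max(m.shape) > MAX_DIM:
        raise ValueError("matrix dims %r exceed %d" % (m.shape, MAX_DIM))
    return m

ready = prepare_for_gpu([[1, 2, 3], [4, 5, 6]])
print(ready.dtype)
```

Keep in mind that casting float64 data down to float32 loses precision, so results may differ slightly from a double-precision NumPy product.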
Source code available here: pycublas.py (rename download to pycublas.py to use)