LuaJIT is so damn fast

While reading the libdnn source code, I started surveying the existing technologies around DNNs. Aside from Caffe and Theano, one emerging framework that kept showing up in my reading was Torch 7. The unusual part of Torch 7 is that it is built on LuaJIT, which in the era of R and Python is a quite noticeable choice. I had read on 云風's blog that LuaJIT is pretty good, but I never had time to get my feet wet. Now it's time.

I used an arbitrary benchmark to test its performance. It's not based on solid analysis or scientific proof, but I think it's enough to get some clues about how performant LuaJIT is. The test is to create a 1 GB file with dd, read it in 512-byte chunks, use each chunk as a key, and check whether any duplicates exist in the byte stream.

dd if=/dev/urandom of=large_random.txt bs=1024 count=0 seek=$[1000*1000]
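A side note: with count=0, that dd doesn't actually copy any bytes from /dev/urandom; it just seeks past the end of the output, leaving a sparse ~1 GB file that reads back as zeros (so every 512-byte block is in fact identical). A quick way to check for sparseness, sketched in Python (the file name here is just illustrative):

```python
import os

def file_stats(path):
    """Return (apparent_size, allocated_bytes); a sparse file
    allocates far less on disk than its apparent size."""
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512  # st_blocks is counted in 512-byte units

# create a sparse 1 MB file, the same way dd's count=0/seek trick does
with open("sparse_demo.bin", "wb") as f:
    f.truncate(1000 * 1000)

apparent, allocated = file_stats("sparse_demo.bin")
print(apparent, allocated)
```

On most file systems the apparent size is 1,000,000 bytes while little or nothing is actually allocated on disk.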

Let's take a look at the results. The same algorithm was implemented in Lua, Python, C++, Haskell, and CoffeeScript. LuaJIT is lightning fast, at least in this case: it only took around one second on my MacBook Air (2013 model). Haskell follows; I'm quite impressed that Haskell's hash map can be tuned to be so fast (the code is borrowed from a Stack Overflow thread from a long time ago). Then comes C++, at around four seconds. I think that's because I did extra work copying the buffer's contents into a string, or else unordered_map is just not that well optimized. Next is Python; since most of its dictionary operations are implemented in C, its speed doesn't surprise me.

Node.js's speed is quite disappointing. I'm not sure whether the third-party streaming library is dragging it down or it is a problem inherent to Node.js. It took around fifteen seconds to finish, with io.js running a little faster than Node. I may well have done something wrong here, so any corrections are welcome.

➜  uniq-blocks git:(master) ✗ time luajit lua/uniq-blocks.lua large_random.txt
luajit lua/uniq-blocks.lua large_random.txt  0.87s user 0.30s system 98% cpu 1.183 total
➜  uniq-blocks git:(master) ✗ time haskell/dist/build/uniq-blocks/uniq-blocks large_random.txt
haskell/dist/build/uniq-blocks/uniq-blocks large_random.txt  1.99s user 0.24s system 98% cpu 2.263 total
➜  uniq-blocks git:(master) ✗ time cpp/a.out large_random.txt
cpp/a.out large_random.txt  3.75s user 0.33s system 99% cpu 4.098 total
➜  uniq-blocks git:(master) ✗ time python python/ large_random.txt
python python/ large_random.txt  4.54s user 1.41s system 99% cpu 5.974 total
➜  uniq-blocks git:(master) ✗ time node js/uniq-blocks.js large_random.txt
reading large_random.txt …
node js/uniq-blocks.js large_random.txt  15.17s user 0.71s system 99% cpu 15.921 total
➜  uniq-blocks git:(master) ✗ time /opt/boxen/homebrew/Cellar/iojs/1.2.0/bin/iojs js/uniq-blocks.js large_random.txt
reading large_random.txt …
/opt/boxen/homebrew/Cellar/iojs/1.2.0/bin/iojs js/uniq-blocks.js   13.24s user 0.62s system 99% cpu 13.931 total
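As a rough back-of-the-envelope, dividing the ~1,024 MB file size by the wall-clock totals above gives a throughput figure per runtime (a quick sketch; the numbers are the "total" column from my runs):

```python
# rough throughput: ~1,024 MB file divided by the wall-clock totals above
size_mb = 1024 * 1000 * 1000 / 1e6
totals = {
    "luajit":  1.183,
    "haskell": 2.263,
    "c++":     4.098,
    "python":  5.974,
    "io.js":   13.931,
    "node":    15.921,
}
for name, secs in totals.items():
    print("%-8s %7.1f MB/s" % (name, size_mb / secs))
```

LuaJIT comes out around 865 MB/s, while Node manages roughly 64 MB/s.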

With these figures, I am quite impressed by LuaJIT. I will definitely invest more of my time in it.

The code for each language follows.

Python:

import os
import sys

def dedupeFile(dd, fp):
    fd = os.open(fp, os.O_RDONLY)
    # os.read returns b"" at end of file, which stops the iterator
    for block in iter(lambda: os.read(fd, 512), b""):
        dd.setdefault(block, 0)
        dd[block] = dd[block] + 1
    os.close(fd)
    return dd

dd = {}
dedupeFile(dd, sys.argv[1])
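To sanity-check the counting logic before pointing it at the 1 GB file, you can feed it a tiny hand-made file (a sketch; it inlines the same counting loop and uses a throwaway temp file):

```python
import os
import tempfile

# two identical 512-byte blocks followed by one distinct block
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a" * 512 + b"a" * 512 + b"b" * 512)

dd = {}
fd = os.open(tmp.name, os.O_RDONLY)
for block in iter(lambda: os.read(fd, 512), b""):
    dd[block] = dd.get(block, 0) + 1
os.close(fd)
os.unlink(tmp.name)

print(dd[b"a" * 512], dd[b"b" * 512])  # the "a" block was seen twice, "b" once
```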
Lua:

function usage(program_name)
    print("Usage: " .. program_name .. " filename")
end

function dedupeFile(file)
    local f = io.open(file, "rb")
    local t = {}

    while true do
        local block = f:read(512)

        if not block then break end

        if t[block] == nil then
            t[block] = 1
        else
            t[block] = t[block] + 1
        end
    end

    f:close()
end

if arg[1] == nil then
    usage(arg[0])
    do return end
end

filename = arg[1]
dedupeFile(filename)
CoffeeScript:

fs = require 'fs'
BlockStream = require 'block-stream'

printLine = (line) -> process.stdout.write line + '\n'

usage = -> printLine("Usage: " + process.argv[1] + " <filename>")

if process.argv.length < 3
    usage()
else
    filename = process.argv[2]
    printLine("reading " + filename + " …")

    d = {}

    dedupeFile = (chunk) ->
        if chunk of d
            d[chunk] += 1
        else
            d[chunk] = 1

    block = new BlockStream(512)
    fs.createReadStream(filename).pipe(block)
    block.on('data', (chunk) -> dedupeFile(chunk))
    block.on('end', -> printLine('Done'))
Haskell:

import           Control.Monad
import qualified Data.ByteString.Lazy as LB
import           Data.Foldable
import           Data.HashMap.Strict hiding (foldl')
import           Data.Int
import qualified Data.List            as DL
import           System.Environment

type DummyDedupe = HashMap LB.ByteString Int64

toBlocks :: Int64 -> LB.ByteString -> [LB.ByteString]
toBlocks n bs | LB.null bs = []
              | otherwise = let (block, rest) = LB.splitAt n bs
                            in block : toBlocks n rest

dedupeBlocks :: [LB.ByteString] -> DummyDedupe -> DummyDedupe
dedupeBlocks = flip $ DL.foldl' (\acc block -> insertWith (+) block 1 $! acc)

dedupeFile :: FilePath -> DummyDedupe -> IO DummyDedupe
dedupeFile fp dd = LB.readFile fp >>= return . (`dedupeBlocks` dd) . toBlocks 512

main :: IO ()
main = do
  dd <- getArgs >>= (`dedupeFile` empty) . head
  putStrLn . show . (*512) . size $ dd
  putStrLn . show . (*512) . foldl' (+) 0 $ dd
C++:

#include <unordered_map>
#include <string>
#include <fstream>
#include <iostream>
#include <array>

void dedupeFile(std::ifstream &f)
{
    std::unordered_map<std::string, int> M;
    std::array<char, 512> buffer = { 0 };
    char *ptr = buffer.data();

    while (f.read(ptr, 512)) {
        // take all 512 bytes as the key; a plain std::string(ptr) would
        // stop at the first NUL byte
        std::string key(ptr, 512);
        if (M.count(key) > 0) {
            M[key] += 1;
        } else {
            M[key] = 1;
        }
    }
}

int main(int argc, char *argv[])
{
    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <filename>" << std::endl;
        return 1;
    }

    std::ifstream is(argv[1], std::ifstream::binary);
    if (is) {
        dedupeFile(is);
    } else {
        std::cerr << "reading failed" << std::endl;
        return 1;
    }

    return 0;
}