    terracuda

    Summary

    We will create a highly abstract CUDA API for Lua, aimed at programmers unfamiliar with GPU-level parallelism.

    Background

    Lua is a fast, lightweight, and embeddable scripting language found in places like Wikipedia, World of Warcraft, Photoshop Lightroom, and more. Lua's simple syntax and dynamic typing also make it an ideal language for novice programmers. Traditionally, languages like Lua sit miles of abstraction above low-level parallel frameworks like CUDA, and consequently GPU parallelism has been limited to programmers using a systems language like C++. Frameworks like Terra work to close that gap, making low-level programming accessible from a high-level interface. Even so, these interfaces still require a number of calls to C libraries and intimate knowledge of the CUDA API. For example, the following code runs a simple CUDA kernel in Terra:

    -- C standard library and the CUDA runtime, exposed to Terra
    local C = terralib.includecstring [[
    #include <cuda_runtime.h>
    #include <stdlib.h>
    ]]
    
    local tid = cudalib.nvvm_read_ptx_sreg_tid_x -- threadIdx.x
    
    -- Kernel: each thread writes its own thread index into result
    terra foo(result : &float)
        var t = tid()
        result[t] = t
    end
    
    local R = terralib.cudacompile({ bar = foo })
    
    terra run_cuda_code(N : int)
        var data : &float
        C.cudaMalloc([&&opaque](&data), sizeof(float) * N)
        -- Launch configuration: one block of N threads, no shared memory, default stream
        var launch = terralib.CUDAParams { 1,1,1, N,1,1, 0, nil }
        R.bar(&launch, data)
        var results : &float = [&float](C.malloc(sizeof(float) * N))
        C.cudaMemcpy(results, data, sizeof(float) * N, 2) -- 2 = cudaMemcpyDeviceToHost
        return results
    end
    
    results = run_cuda_code(16)
    

    Other high-level CUDA bindings like PyCUDA and JCuda suffer from the same problem.
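
    To make the contrast concrete, the sketch below shows the kind of interface we are aiming for. The module name terracuda and the function map are hypothetical placeholders, not an existing library; the real API remains to be designed.

    -- Hypothetical sketch only: "terracuda" and "map" are placeholder names
    -- for the API we intend to build, not an existing library.
    local terracuda = require("terracuda")
    
    -- An ordinary Lua table of numbers.
    local data = {}
    for i = 1, 1024 do data[i] = i end
    
    -- Apply a Lua function to every element of the table on the GPU.
    -- Device allocation, the launch configuration, and the copy back to
    -- host memory would all be handled inside the library.
    local results = terracuda.map(data, function(x) return x * x end)

    The intent is that a call like this compiles the anonymous function into a CUDA kernel through Terra, while the caller never touches cudaMalloc, launch parameters, or memcpy flags.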

    The Challenge

    The problem is challenging first and foremost at the level of architecture. Designing an API is never easy, and exposing GPU-level parallelism to a language as high-level as Lua requires a great deal of care to remain usable while still being useful. Creating such an API requires significant knowledge of the abstraction layers between Lua, C, and CUDA, as well as of the typical use cases for high-level parallelism.

    My partner and I know neither Terra nor LLVM (which Terra compiles to), so creating these high-level bindings requires a great deal of initial investment. The existing interface between Terra and CUDA is sketchy at best, so we will need to add significant new functionality to Terra in order for the Circle Renderer to work properly.

    Resources

    For machines, we'll just be using any computers equipped with NVIDIA GPUs (i.e. Will's laptop and the Gates 5k machines). No other special hardware/software will be needed. We'll be building upon the Terra language and also using LuaGL for some of the demos.

    Goals

    The project has three main areas: writing the API, creating programs using the API, and benchmarking the code against other languages/compilers.

    We plan to achieve:

    • Writing the API
      • Allow arbitrary Lua code to be executed on the GPU over a table.
      • Optimize thread and warp usage for the input data.
      • Abstract the API such that the user needs no C libraries and as little Terra as possible.
    • Creating programs
      • Implement a simple SAXPY routine.
      • Write matrix operations such as transpose and pseudoinverse/SVD.
      • Port the Assignment 2 Circle Renderer to vanilla Lua (using LuaGL).
    • Benchmarking
      • For each program, benchmark it against equivalent implementations in vanilla Lua, Terra without CUDA, and C (a pure-Lua baseline is sketched after this list).
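
    As a concrete reference point for those benchmarks, the sketch below shows what a pure-Lua SAXPY (y = a*x + y) baseline looks like; the CUDA-backed versions would be measured against serial loops of this kind.

    -- Vanilla Lua SAXPY baseline: out[i] = a * x[i] + y[i], computed serially.
    -- This is one of the reference implementations the benchmarks would use.
    local function saxpy(a, x, y)
        local out = {}
        for i = 1, #x do
            out[i] = a * x[i] + y[i]
        end
        return out
    end
    
    -- Small usage example with two length-4 vectors.
    local x = { 1, 2, 3, 4 }
    local y = { 10, 20, 30, 40 }
    local result = saxpy(2, x, y) -- { 12, 24, 36, 48 }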

    We hope to achieve:

    • Outperform equivalent implementations in vanilla C.
    • Implement shared memory in Terra.
    • Implement linking against libraries like cuBLAS.