Opportunities for container environments on Cray XC30 with GPU devices
Lucas Benedicic, Miguel Gila, Sadaf Alam, Thomas C. Schulthess
Thanks to the significant popularity gained lately by Docker, the HPC community has recently started exploring container technology and the potential benefits its use would bring to users of supercomputing systems like the Cray XC series. In this paper, we explore the feasibility of diverse, non-traditional data- and computing-oriented use cases with practically no overhead, thus achieving native execution performance. Working in close collaboration with NERSC and an engineering team at Nvidia, CSCS is extending the Shifter framework in order to enable GPU access to containers at scale. We also briefly discuss the security implications of using containers within a shared HPC system, in order to provide a service that compromises neither the stability of the system nor the privacy of its users. Furthermore, we describe several valuable lessons learned through our analysis and share the challenges we encountered.
Keywords Linux containers; Docker; GPU; GPGPU; HPC systems
In contrast with long-established hypervisor-based virtualization technologies, containers provide a level of virtualization that allows running multiple isolated user-space instances on top of a common host kernel. Since a hypervisor emulates the hardware, the "guest" operating system and the "host" operating system run different kernels, and the guest communicates with the actual hardware through an abstraction layer provided by the hypervisor. This additional software layer creates a performance overhead due to the mapping between the emulated and bare-metal hardware. Containers, on the other hand, are light, flexible and easy to deploy: their size is measured in megabytes, whereas hypervisor-based solutions require a much larger software stack and gigabytes of memory. This characteristic makes containers easily transferable across the nodes of an HPC system (horizontal scaling) and allows multiple containers to be deployed within a single compute node, thereby increasing its density (vertical scaling).
The role of graphics processing units (GPUs) is becoming increasingly important in providing power-efficient and massively parallel computational power to the scientific community in general and to HPC in particular. It is well known that even a single GPU-CPU combination provides advantages that multiple CPUs on their own do not offer, owing to the distinctive design of discrete GPUs.
Despite previous studies on GPU virtualization, the possibilities provided by different virtualization approaches in a strict HPC context still remain unclear. The lack of standardized designs and tools that would enable container access to GPU devices means this is still an active area of research. For this reason, it is important to understand the tradeoffs and the technical requirements that container-based technology imposes on GPU devices when deployed in a hybrid supercomputing system. One example of such a system is the Cray XC30 called Piz Daint, which is in production at the Swiss National Supercomputing Centre (CSCS) in Lugano, Switzerland. The system features 28 cabinets with a total of 5,272 compute nodes, each of which is equipped with an 8-core 64-bit Intel Sandy Bridge CPU (Intel® Xeon® E5-2670), an Nvidia® Tesla® K20X with 6 gigabytes of GDDR5 memory, and 32 gigabytes of host memory.
Working in close collaboration with the National Energy Research Scientific Computing Center (NERSC) and an engineering team of the Nvidia CUDA division, CSCS is extending the Shifter framework (Jacobsen 2015) in order to enable scalable GPU access from containers. Container environments open up opportunities to enable workloads that were typically constrained by a specialized lightweight operating system, allowing CSCS to consolidate workloads and workflows that currently require dedicated clusters and specialized systems. As an example, by closely collaborating with the Large Hadron Collider (LHC) experiments ATLAS, CMS and LHCb and their Swiss representatives in the Worldwide LHC Computing Grid (WLCG), CSCS is able to use the Shifter framework to enable complex, experiment-specific High Energy Physics (HEP) workflows on our Cray supercomputers.
The preliminary results of this work show an improvement in vertical scaling of the system and consolidation of complex workflows with minimal impact on users and performance. This is possible thanks to the deployment of multiple independent containers (processes) sharing the same GPU device. The increased density can significantly improve the overall performance of distributed, GPU-enabled applications by increasing GPU utilization and, at the same time, reducing their communication footprint. Additionally, it is possible to tailor specific versions of the CUDA toolkit and scientific libraries to different applications without having to perform a complex configuration at the system level, even for different applications sharing the same compute node. Using examples and results from a subset of LHC experiment workflows, we demonstrate that there is minimal impact on the user interface (the job submission script) and on resource utilization compared to a dedicated environment.
The layout of the paper is as follows: we begin with the motivation for this work, specifically the extension of containers to include GPU resources and the design challenges associated with incorporating one or more GPU devices. This is followed by implementation details for GPU and LHC workflows in the Cray environment. In section 4, we describe vertical scaling of the solution to accommodate GPU and node sharing among multiple containers. We conclude with future plans and opportunities to build on our efforts.
The goal is to provide container users with the ability to access the compute resources of the GPUs available in the Piz Daint hybrid system. In particular, access to the compute resources of Nvidia GPUs such as the Tesla K20X installed on each compute node of Piz Daint is provided through CUDA (Cook 2012).
CUDA is Nvidia's GPGPU solution that provides access to the GPU hardware through a C-like language (CUDA C), rather than the traditional approach of relying on the graphics programming interface. CUDA extends the C language by allowing the programmer to define CUDA kernels, i.e., special C functions that are executed in parallel by many concurrent CUDA threads on the GPU. CUDA exposes two programming interfaces: the driver API and the runtime API, the latter being built on top of the driver API. Providing container users with access to the CUDA runtime API by extending Shifter's functionality is the main focus of this paper. Since CUDA is not a fully open standard, some internal details have not been officially documented. For this reason, an engineering team at Nvidia has been engaged to understand the details of the underlying hardware driver and its runtime libraries.
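To illustrate the layering described above, the following minimal sketch (a generic CUDA example, not code from Shifter; the kernel name and sizes are illustrative) defines a kernel and invokes it through runtime-API calls such as cudaMallocManaged() and the <<<...>>> launch syntax. It is precisely these runtime-API entry points, which in turn depend on the driver API and the Nvidia driver libraries, that must be reachable from inside a container:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A CUDA kernel: a special C function executed in parallel by many
// concurrent CUDA threads on the GPU.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x;
    // Runtime-API calls; under the hood these go through the driver API
    // and therefore require the Nvidia driver stack on the node.
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    // Kernel launch: n threads in blocks of 256.
    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);
    cudaFree(x);
    return 0;
}
```

Compiled with nvcc, this example requires a GPU and the matching driver at run time, which is exactly the dependency a container image cannot carry on its own.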
Certain scientific workflows do not adhere to common HPC practices. Consider, for example, the case of the WLCG, where the software is pre-built and pre-validated by each of the LHC experiments and centrally exposed to all computing facilities through an HTTP-based, read-only filesystem. In this specific context, the software is pre-packaged for RHEL-compatible operating systems, and running it on the Cray Linux Environment is not straightforward: rebuilding all the software while taking into account the interdependencies of the various applications is a very complex task on its own and can potentially imply application re-validation by the experiments.
Shifter enabled our Cray XC supercomputers to overcome this limitation by making it possible to run unmodified ATLAS, CMS and LHCb production jobs.
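From the user's perspective, a Shifter job differs from a native batch job essentially only in the image request and the shifter wrapper. The sketch below follows the Shifter/SLURM integration documented by NERSC; the image name and payload path are illustrative placeholders, not an actual LHC production job:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:30:00
#SBATCH --image=docker:centos:7   # request a Docker image, converted by Shifter

# The application binary and its RHEL-compatible userland come from the
# image; the kernel, scheduler and interconnect remain those of the Cray host.
srun shifter /path/to/experiment/payload.sh
```

Because the payload runs inside the image's userland, the pre-built, pre-validated experiment software can execute without rebuilding it against the Cray Linux Environment.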
From a security standpoint, Shifter suits the HPC environment much better than Docker. This is because, by default, Docker allows any user to become root within the context of the container, which may have serious implications if the container has access to shared filesystems. Consider the case in which a container image has malicious code embedded in it: since no validation or security checks are usually performed on images prior to executing applications, this malicious code could gain root access to any filesystem mounted on the compute node. Shifter, on the other hand, runs applications in userland, with only a very small part of it requiring root privileges (SUID) to create and destroy loop devices and mountpoints. Effectively, this limits the impact of malicious code to what the user could already do outside the container.