PEARC20 Workshop: HAL: Computer System for Scalable Deep Learning

We describe the design, deployment, and operation of a computer system purpose-built to run deep learning frameworks efficiently. The system consists of 16 IBM POWER9 servers with four NVIDIA V100 GPUs each, interconnected with a Mellanox EDR InfiniBand fabric, and a DDN all-flash storage array. The system is tailored toward efficient execution of the IBM Watson Machine Learning enterprise software stack, which combines popular open-source deep learning frameworks. We built a custom management software stack to enable efficient use of the system by a diverse community of users, and we provide guides and recipes for running deep learning workloads at scale using all available GPUs. We demonstrate scaling of a PyTorch-based deep neural network to produce state-of-the-art performance results.
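As a rough illustration of what "all available GPUs" means for a job on this hardware, the sketch below works out the total number of worker processes (one per GPU) and maps a global rank to a node and a local GPU. The server and GPU counts come from the description above; the function names and the block-wise rank layout are assumptions made for this example, not details of the system's actual software stack.

```python
# Minimal sketch: process layout for a cluster of 16 nodes x 4 GPUs.
# NODES and GPUS_PER_NODE come from the abstract; the helper names and
# the block-wise rank ordering are hypothetical.

NODES = 16
GPUS_PER_NODE = 4


def world_size(nodes: int = NODES, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Total number of processes when one process drives one GPU."""
    return nodes * gpus_per_node


def placement(rank: int, gpus_per_node: int = GPUS_PER_NODE) -> tuple:
    """Map a global rank to (node index, local GPU index).

    Assumes ranks are assigned block-wise: ranks 0-3 on node 0,
    ranks 4-7 on node 1, and so on.
    """
    if rank < 0:
        raise ValueError("rank must be non-negative")
    return divmod(rank, gpus_per_node)


if __name__ == "__main__":
    print("world size:", world_size())
    print("rank 5 runs on (node, local GPU):", placement(5))
```

Under these assumptions a full-system job spans 64 processes, and a launcher would pass each process its global rank and local GPU index derived in this way.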
