Deep Learning MRI (HAL)

HAL Cluster

The Deep Learning Major Research Instrument Project

The instrument will serve as a focal point for Illinois’ rapidly expanding and globally relevant deep learning research community, enable expansion of several diverse research programs, and contribute to STEM education and training.

NCSA’s new Deep Learning Major Research Instrument Project will develop and deploy an innovative instrument for accelerating deep learning research at the University of Illinois. The instrument will integrate the latest computing, storage, and interconnect technologies in a purpose-built, shared-use system. In turn, the instrument will deliver unprecedented performance for extremely data-intensive research across many diverse disciplines, such as computer vision, natural language processing, artificial intelligence, healthcare, and education. The instrument’s development will be driven by Illinois’ deep learning community in collaboration with IBM and NVIDIA.

Co-PIs

William “Bill” Gropp: NCSA Director

Volodymyr Kindratenko: NCSA Senior Research Scientist

Roy Campbell: Sohaib and Sara Abbasi Professor, University of Illinois Department of Computer Science

Jian Peng: Assistant Professor, University of Illinois Department of Computer Science

Campus Collaborators
  • G. Allen, Astronomy/Education
  • R. Brunner, Astronomy
  • A. Das, Electrical and Computer Engineering
  • M. Do, Electrical and Computer Engineering
  • E. Escudero, NCSA
  • Y. Fan, Electrical and Computer Engineering
  • R. Farivar, Computer Science
  • D. Forsyth, Computer Science
  • D. George, Astronomy
  • I. Gupta, Computer Science
  • J. Han, Computer Science
  • Y. Hashash, Civil and Environmental Engineering
  • H. Hashemi, Computer Science
  • N. He, Industrial and Enterprise Systems Engineering
  • J. Hockenmaier, Computer Science
  • N. Hovakimyan, Mechanical Science and Engineering
  • T. Huang, Electrical and Computer Engineering
  • M. Hudson, Crop Sciences
  • W.M. Hwu, Electrical and Computer Engineering
  • M.H. Johnson, Electrical and Computer Engineering
  • D. Katz, NCSA
  • G. Ko, Electrical and Computer Engineering
  • S. Koyejo, Computer Science
  • S. Lazebnik, Computer Science
  • K. McHenry, NCSA
  • R. Nagi, Industrial and Enterprise Systems Engineering
  • K. Nahrstedt, Computer Science
  • L. Paquette, Education
  • G. Robinson, Carl R. Woese Institute for Genomic Biology
  • R. Rutenbar, Computer Science
  • M. Sammons, Computer Science
  • L. Schwartz, Linguistics
  • S. Sinha, Computer Science
  • J. Sirignano, Industrial and Enterprise Systems Engineering
  • P. Smaragdis, Computer Science
  • E. Tajkhorshid, Molecular and Cellular Biology
  • J. Towns, NCSA
  • M. Turk, Information Sciences
  • S. Wang, Geography
  • R. Yeh, Electrical and Computer Engineering
  • J. Yu, Electrical and Computer Engineering
  • C. Zhai, Computer Science

Resources

How to Request Access to Currently Deployed DL Resources

  • HAL cluster — a 16-node IBM POWER9 cluster with NVIDIA V100 GPUs. For hardware details and current system status, see this wiki page. The system provides the IBM PowerAI platform, which includes Caffe, TensorFlow, and PyTorch.
  • Nano cluster — an 8-node Intel Xeon cluster with NVIDIA P100 and V100 GPUs. For hardware details, current cluster status, and access requests, please see this wiki page. Several DL frameworks, including TensorFlow and PyTorch, are installed on this system.
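Because the installed framework sets differ between the two clusters, it can help to confirm which frameworks are importable in your environment before submitting work. The sketch below uses only the Python standard library; the candidate module names (`tensorflow`, `torch`) are examples, and the modules actually provided on HAL or Nano may differ — check the wiki pages for the current lists.

```python
import importlib.util

def available_frameworks(candidates=("tensorflow", "torch")):
    """Return the subset of candidate modules that are importable here.

    The default candidate names are illustrative; the frameworks
    installed on a given cluster may use different module names.
    """
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]

if __name__ == "__main__":
    print("Importable frameworks:", available_frameworks())
```

`importlib.util.find_spec` locates a module without importing it, so the check is cheap and has no side effects even for large frameworks.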

Access Requirements

  1. You must provide a short description of the project (1 paragraph) and answer a few questions about the intended use of the system. If the description is unclear to us, we will contact you for more information.
  2. You agree to provide a short written report about your experiences with the system, both positive and negative, as well as titles and reference information (e.g., BibTeX entries) for any papers you produce that make use of these systems.
  3. You agree to acknowledge the resource with the credit statement listed below.
  4. You understand and agree that the provided hardware and software are experimental and not of production quality. While we will work with users to minimize disruption and accommodate their needs to the best of our abilities, we do not guarantee system or software availability, data integrity, or quality of service (e.g., we may reboot the system in the middle of your job, install new hardware or software that renders your prior work unusable, or delete your files). We will make every effort to warn users before such actions, but we cannot guarantee advance notice.
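As an illustration of the reference information requested in requirement 2, the HAL system paper listed under Publications below could be reported as the following BibTeX entry (the entry key is an arbitrary label of your choosing):

```bibtex
@inproceedings{kindratenko2020hal,
  author    = {Kindratenko, Volodymyr and Mu, Dawei and Zhan, Yan and
               Maloney, John and Hashemi, Sayed Hadi and Rabe, Benjamin and
               Xu, Ke and Campbell, Roy and Peng, Jian and Gropp, William},
  title     = {{HAL}: Computer System for Scalable Deep Learning},
  booktitle = {Practice and Experience in Advanced Research Computing
               (PEARC '20)},
  year      = {2020},
  publisher = {ACM},
  address   = {New York, NY, USA},
  doi       = {10.1145/3311790.3396649}
}
```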

Acknowledge Use of These Systems

Please include the following credit statements with any publication you produce:

For HAL cluster: This work utilizes resources supported by the National Science Foundation’s Major Research Instrumentation program, grant #1725729, as well as the University of Illinois at Urbana-Champaign.

For Nano cluster: This work utilizes resources provided by the Innovative Systems Laboratory at the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign.


Research Results

Publicly Released Software

Publications

  • Volodymyr Kindratenko, Dawei Mu, Yan Zhan, John Maloney, Sayed Hadi Hashemi, Benjamin Rabe, Ke Xu, Roy Campbell, Jian Peng, and William Gropp. 2020. HAL: Computer System for Scalable Deep Learning. In Practice and Experience in Advanced Research Computing (PEARC ’20), July 26–30, 2020, Portland, OR, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3311790.3396649
  • Ramshankar Venkatakrishnan, Ashish Misra, and Volodymyr Kindratenko. 2020. High-Level Synthesis-Based Approach for Accelerating Scientific Codes on FPGAs. Computing in Science & Engineering 22(4): 104–109. https://doi.org/10.1109/MCSE.2020.2996072
  • Ashish Misra and Volodymyr Kindratenko. 2020. HLS-Based Acceleration Framework for Deep Convolutional Neural Networks. In International Symposium on Applied Reconfigurable Computing. Springer, Cham. https://doi.org/10.1007/978-3-030-44534-8_17
  • S. Hashemi, P. Rausch, B. Rabe, K. Chou, S. Liu, V. Kindratenko, R. Campbell, “tensorflow-tracing: A Performance Tuning Framework for Production,” In Proc. 2019 USENIX Conference on Operational Machine Learning (OpML’19), 2019.
  • S. H. Hashemi, S. Abdu Jyothi, R. H. Campbell, “TicTac: Improving Distributed Deep Learning With Communication Scheduling,” SysML Conference 2019.
  • S. H. Hashemi, S. Abdu Jyothi, R. H. Campbell, “On Importance of Execution Ordering in Graph-Based Distributed Machine Learning Systems,” SysML Conference 2018.

Presentations and Posters

  • V. Kindratenko, “POWER9 AI Cluster at NCSA” University Power Systems HPC/AI User Meeting, December 21, 2019, SC19 – Denver, CO.
  • S. H. Hashemi, S. Abdu Jyothi, and R. H. Campbell, “Network Efficiency through Model-Awareness in Distributed Machine Learning Systems,” NSDI ’18, Seattle, WA.
  • S. H. Hashemi and R. H. Campbell, “Making a Case for Timed RPCs in Iterative Systems,” OSDI ’18, San Diego, CA.
  • S. H. Hashemi, B. Rabe, V. Kindratenko, “Building a Scalable Deep Learning Platform,” University Power Systems HPC/AI User Meeting, December 15, 2018, SC18 – Dallas, TX.
Center for Artificial Intelligence Innovation
1205 W. Clark St.
Urbana, Illinois 61801
Email: caii_ai@lists.illinois.edu