Efficient Execution of Machine Learning Workloads on GPUs

Thesis (Ph.D.) -- Seoul National University Graduate School : College of Engineering, Department of Computer Science and Engineering, February 2023. 전병곤. Machine learning (ML) workloads are becoming increasingly important in many types of real-world applications. We attribute this trend to the development of software systems for ML, which have facilitated the widespread adoption of heterogeneous accel...

Bibliographic Details
Main Author:	유경인
Other Authors:	전병곤, Gyeong-In Yu, College of Engineering, Department of Computer Science and Engineering
Format:	Doctoral or Postdoctoral Thesis
Language:	English
Published:	Seoul National University Graduate School, 2023
Subjects:	machine learning; deep learning; scheduling; inference serving; generative models; Transformer; joint training; 621.39; Orca
Online Access:	https://hdl.handle.net/10371/193329 https://dcollection.snu.ac.kr/common/orgView/000000175556

Description
Summary:	Thesis (Ph.D.) -- Seoul National University Graduate School : College of Engineering, Department of Computer Science and Engineering, February 2023. 전병곤. Machine learning (ML) workloads are becoming increasingly important in many types of real-world applications. We attribute this trend to the development of software systems for ML, which have facilitated the widespread adoption of heterogeneous accelerators such as GPUs. Today's ML software stack has become far more efficient, but not all use cases are well supported. In this dissertation, we study how to improve the execution efficiency of ML workloads on GPUs from a software system perspective. We identify workloads where current ML systems use GPUs inefficiently and devise new system techniques that handle those workloads efficiently. We first present Nimble, an ML execution engine equipped with carefully optimized GPU scheduling. The proposed scheduling techniques improve execution efficiency by up to 22.34×. Second, we propose Orca, an inference serving system specialized for Transformer-based generative models. By incorporating new scheduling and batching techniques, Orca significantly outperforms state-of-the-art systems, achieving a 36.9× throughput improvement at the same level of latency. The last topic of this dissertation is WindTunnel, a framework that translates classical ML pipelines into neural networks, providing GPU training capabilities for classical ML workloads. WindTunnel also allows joint training of pipeline components via backpropagation, resulting in improved accuracy over both the original pipeline and neural network baselines.
Recent trends show that machine learning (ML) workloads are playing an increasingly important role in a wide variety of applications. This is because the development of system software for ML has enabled the widespread use of heterogeneous accelerators such as GPUs. Thanks to the attention of many researchers, the system software stack for ML is clearly improving day by day, but it still does not deliver high efficiency in every use case. This dissertation studies how to improve the execution efficiency of ML workloads on GPUs from a system software perspective. Specifically, it aims to identify workloads for which today's ML systems fail to use GPUs efficiently and, further, to devise system techniques that handle those workloads efficiently. This dissertation first introduces Nimble, an ML execution engine equipped with optimized GPU scheduling. Through new scheduling techniques, Nimble's GPU execution, compared with existing systems, ...

Similar Items