TDML -- A Trustworthy Distributed Machine Learning Framework ...

Recent years have witnessed a surge in deep learning research, marked by the introduction of expansive generative models like OpenAI's SORA and GPT, Meta AI's LLAMA series, and Google's FLAN, BART, and Gemini models. However, the rapid advancement of large models (LM) has intensified...

Full description

Bibliographic Details
Main Authors: Wang, Zhen, Wang, Qin, Yu, Guangsheng, Chen, Shiping
Format: Report
Language:unknown
Published: arXiv 2024
Subjects:
DML
Online Access:https://dx.doi.org/10.48550/arxiv.2407.07339
https://arxiv.org/abs/2407.07339
Description
Summary:Recent years have witnessed a surge in deep learning research, marked by the introduction of expansive generative models like OpenAI's SORA and GPT, Meta AI's LLAMA series, and Google's FLAN, BART, and Gemini models. However, the rapid advancement of large models (LM) has intensified the demand for computing resources, particularly GPUs, which are crucial for their parallel processing capabilities. This demand is exacerbated by limited GPU availability due to supply chain delays and monopolistic acquisition by major tech firms. Distributed Machine Learning (DML) methods, such as Federated Learning (FL), mitigate these challenges by partitioning data and models across multiple servers, though implementing optimizations like tensor and pipeline parallelism remains complex. Blockchain technology emerges as a promising solution, ensuring data integrity, scalability, and trust in distributed computing environments, but still lacks guidance on building practical DML systems. In this paper, we propose a ...