ORCA: A Network and Architecture Co-design for Offloading μs-scale Datacenter Applications


Bibliographic Details
Main Authors: Yuan, Yifan; Huang, Jinghan; Sun, Yan; Wang, Tianchen; Nelson, Jacob; Ports, Dan R. K.; Wang, Yipeng; Wang, Ren; Tai, Charlie; Kim, Nam Sung
Format: Report
Language: unknown
Published: arXiv 2022
Online Access: https://dx.doi.org/10.48550/arxiv.2203.08906
https://arxiv.org/abs/2203.08906
Description
Summary: In response to the "datacenter tax" and "killer microseconds" problems for datacenter applications, diverse solutions, including SmartNIC-based ones, have been proposed. Nonetheless, they often suffer from high communication overhead over network and/or PCIe links. To tackle the limitations of the current solutions, this paper proposes ORCA, a holistic network and architecture co-design solution that leverages current RDMA and emerging cache-coherent off-chip interconnect technologies. Specifically, ORCA consists of four hardware and software components: (1) a unified abstraction of inter- and intra-machine communications managed by one-sided RDMA write and cache-coherent memory write; (2) efficient notification of requests to accelerators, assisted by cache coherence; (3) a cache-coherent accelerator architecture that directly processes requests received by the NIC; and (4) adaptive device-to-host data transfer for modern server memory systems consisting of both DRAM and NVM, exploiting state-of-the-art features in CPUs and PCIe. We prototype ORCA with a commercial system and evaluate three popular datacenter applications: an in-memory key-value store, a chain replication-based distributed transaction system, and deep learning recommendation model inference. The evaluation shows that ORCA provides 30.1% to 69.1% lower latency, up to 2.5x higher throughput, and 3x higher power efficiency than the current state-of-the-art solutions.