Refactoring the EVP solver for improved performance – a case study based on CICE v6.5

Bibliographic Details
Main Authors: Rasmussen, Till Andreas Soya; Poulsen, Jacob; Ribergaard, Mads Hvid; Sasanka, Ruchira; Craig, Anthony P.; Hunke, Elizabeth Clare; Rethmeier, Stefan
Format: Text
Language: English
Published: 2024
Subjects:
Online Access: https://doi.org/10.5194/gmd-2024-40
https://gmd.copernicus.org/preprints/gmd-2024-40/
Description
Summary: This study focuses on the performance of CICE and its Elastic-Viscous-Plastic (EVP) dynamical solver. The study was conducted in two steps. First, the standard EVP solver was extracted from CICE for experiments with refactored versions of it. Second, one refactored version was integrated and tested as part of the full model. Two dominant bottlenecks were revealed. The first is the number of MPI and OpenMP synchronization points required for halo exchanges during each time step, combined with the irregular domain of active sea ice points. The second is the lack of Single Instruction Multiple Data (SIMD) code generation. The study refactors the standard EVP solver based on two generic patterns. The first pattern shows how general finite differences on masked multidimensional arrays can be expressed so that compilers generate significantly better code. The primary change is that the memory access pattern is changed from random access to direct access. The second pattern presents an alternative approach to handling static grid properties. The measured single-core performance improves by more than a factor of five compared to the standard implementation. The refactored implementation scales strongly on the Intel® Xeon® Scalable Processor Series node until the available bandwidth of the node is exhausted. For the Intel® Xeon® CPU Max Series there is sufficient bandwidth for the strong scaling to continue across all cores on the node, resulting in a single-node improvement factor of 35 over the standard implementation. The study also shows improved performance on GPUs.
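
The summary does not reproduce the refactored code, so the following is only a minimal sketch of the general idea behind the first pattern: replacing a loop that visits active points of a masked 2-D array through index lists with a loop over a contiguous 1-D array of active points. The stencil (a one-sided west difference), the function names `diff_indirect`, `pack`, `diff_packed` and `scatter`, and the data layout are all assumptions made for illustration and are not taken from CICE. Python shows only the access pattern here; any SIMD benefit would arise in compiled code.

```python
# Illustrative sketch only. All names and the west-difference stencil are
# assumptions for this example; they are not taken from the CICE source.

def diff_indirect(field, mask):
    """Masked 2-D version: active points are reached through an index list."""
    nj, ni = len(field), len(field[0])
    # Index list of active points that also have an active west neighbour.
    idx = [(j, i) for j in range(nj) for i in range(1, ni)
           if mask[j][i] and mask[j][i - 1]]
    out = [[0.0] * ni for _ in range(nj)]
    for j, i in idx:  # scattered (indirect) access into the 2-D arrays
        out[j][i] = field[j][i] - field[j][i - 1]
    return out


def pack(field, mask):
    """One-time setup: copy active points into a contiguous 1-D list and
    precompute the 1-D position of each point's west neighbour (-1 if none)."""
    pos, values, coords = {}, [], []
    for j, row in enumerate(field):
        for i, value in enumerate(row):
            if mask[j][i]:
                pos[(j, i)] = len(values)
                values.append(value)
                coords.append((j, i))
    west = [pos.get((j, i - 1), -1) for j, i in coords]
    return values, west, coords


def diff_packed(values, west):
    """Compact version: one loop over k = 0..n-1 with unit-stride output."""
    return [values[k] - values[west[k]] if west[k] >= 0 else 0.0
            for k in range(len(values))]


def scatter(result, coords, nj, ni):
    """Copy packed results back to a 2-D array (only needed for comparison)."""
    out = [[0.0] * ni for _ in range(nj)]
    for value, (j, i) in zip(result, coords):
        out[j][i] = value
    return out
```

In this sketch the packing step runs once, while the difference loop would run every sub-cycle of the solver, so the setup cost is amortized. Both versions produce the same values at active points; only the way memory is traversed differs.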