Concurrent Online Testing for Many Core Systems-on-Chips

Shrinking transistor sizes have introduced new challenges and opportunities for system-on-chip (SoC) design and reliability. Smaller transistors are more susceptible to early lifetime failure and electronic wear-out, greatly reducing their reliable lifetimes. However, smaller transistors will also a...

Bibliographic Details
Other Authors: Mahapatra, Rabinarayan N., Walker, Duncan M., Kim, Eun J., Choi, Seong G.
Format: Thesis
Language: English
Published: 2010
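Subjects: concurrent online testing, many core, safety-critical systems, in-field testing, electronic wearout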
Online Access: http://hdl.handle.net/1969.1/ETD-TAMU-2010-12-8615
id fttexasamuniv:oai:repository.tamu.edu:1969.1/ETD-TAMU-2010-12-8615
record_format openpolar
spelling fttexasamuniv:oai:repository.tamu.edu:1969.1/ETD-TAMU-2010-12-8615 2023-05-15T15:33:35+02:00 Concurrent Online Testing for Many Core Systems-on-Chips Mahapatra, Rabinarayan N. Walker, Duncan M. Kim, Eun J. Choi, Seong G. December 2010 http://hdl.handle.net/1969.1/ETD-TAMU-2010-12-8615 en_US eng http://hdl.handle.net/1969.1/ETD-TAMU-2010-12-8615 concurrent online testing many core safety-critical systems in-field testing electronic wearout Thesis 2010 fttexasamuniv 2014-03-30T10:12:52Z Thesis Attu Texas A&M University Digital Repository
institution Open Polar
collection Texas A&M University Digital Repository
op_collection_id fttexasamuniv
language English
topic concurrent online testing
many core
safety-critical systems
in-field testing
electronic wearout
description Shrinking transistor sizes have introduced new challenges and opportunities for system-on-chip (SoC) design and reliability. Smaller transistors are more susceptible to early lifetime failure and electronic wear-out, greatly reducing their reliable lifetimes. However, smaller transistors will also allow SoCs to contain hundreds of processing cores and other infrastructure components with the potential for increased reliability through massive structural redundancy. Concurrent online testing (COLT) can provide sufficient reliability and availability to systems with this redundancy. COLT manages the process of testing a subset of processing cores while the rest of the system remains operational. This can be considered a temporary, graceful degradation of system performance that increases reliability while maintaining availability. In this dissertation, techniques to assist COLT are proposed and analyzed. The techniques described in this dissertation focus on two major aspects of COLT feasibility: recovery time and test delivery costs. To reduce the time between failure and recovery, and thereby increase system availability, an anomaly-based test triggering unit (ATTU) is proposed to initiate COLT when anomalous network behavior is detected. Previous COLT techniques have relied on initiating tests periodically. However, the testing period is determined from a device's mean time between failures (MTBF), and calculating MTBF is exceedingly difficult and imprecise. To address the test delivery costs associated with COLT, a distributed test vector storage (DTVS) technique is proposed to eliminate the dependency of test delivery costs on core location. Previous COLT techniques have relied on a single location to store test vectors, and it has been demonstrated that centralized storage of tests scales poorly as the number of cores per SoC grows.
Assuming that the SoC organizes its processing cores with a regular topology, DTVS uses an interleaving technique to optimally distribute the test vectors across the entire chip. DTVS is analyzed both empirically and analytically, and a testing protocol using DTVS is described. COLT is feasible only if the applications running concurrently are largely unaffected. The effect of COLT on application execution time is also measured in this dissertation, and an application-aware COLT protocol is proposed and analyzed. Application interference is greatly reduced through this technique.
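The abstract's point about test delivery cost depending on core location can be illustrated with a small sketch. This is not the dissertation's DTVS scheme, whose details are not given in this record. It assumes a hypothetical 8x8 mesh, a test set split into 5 chunks, a hop-count (Manhattan distance) cost model, and an invented interleaving rule `(x + 2*y) mod 5`. All function and variable names are illustrative.

```python
from itertools import product

def manhattan(a, b):
    """Hop count between two mesh nodes (assumed cost model)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def interleaved_chunk(x, y, num_chunks):
    """Hypothetical interleaving rule: the one chunk stored at node (x, y)."""
    return (x + 2 * y) % num_chunks

def fetch_cost_centralized(core, store, num_chunks):
    """Total hops to pull every chunk from a single central store."""
    return num_chunks * manhattan(core, store)

def fetch_cost_distributed(core, width, height, num_chunks):
    """Total hops when each chunk is fetched from its nearest holder."""
    nodes = list(product(range(width), range(height)))
    total = 0
    for chunk in range(num_chunks):
        holders = [n for n in nodes
                   if interleaved_chunk(n[0], n[1], num_chunks) == chunk]
        total += min(manhattan(core, h) for h in holders)
    return total

def cost_spread(costs):
    """Gap between the most and least expensive core positions."""
    return max(costs) - min(costs)

# Assumed parameters, chosen only for illustration.
WIDTH, HEIGHT, NUM_CHUNKS = 8, 8, 5
cores = list(product(range(WIDTH), range(HEIGHT)))

central = [fetch_cost_centralized(c, (0, 0), NUM_CHUNKS) for c in cores]
distributed = [fetch_cost_distributed(c, WIDTH, HEIGHT, NUM_CHUNKS)
               for c in cores]

print("centralized spread:", cost_spread(central))
print("distributed spread:", cost_spread(distributed))
```

Under these assumptions, a core far from the central store pays many more hops than one next to it, while with interleaving every core finds each chunk within a few hops, so the cost varies much less with position.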
author2 Mahapatra, Rabinarayan N.
Walker, Duncan M.
Kim, Eun J.
Choi, Seong G.
format Thesis
title Concurrent Online Testing for Many Core Systems-on-Chips
publishDate 2010
url http://hdl.handle.net/1969.1/ETD-TAMU-2010-12-8615
genre Attu
genre_facet Attu
op_relation http://hdl.handle.net/1969.1/ETD-TAMU-2010-12-8615
_version_ 1766364111297314816