Concurrent Online Testing for Many Core Systems-on-Chips

Bibliographic Details
Main Author: Lee, Jason Daniel
Other Authors: Mahapatra, Rabinarayan N., Walker, Duncan M., Kim, Eun J., Choi, Seong G.
Format: Thesis
Language: English
Published: 2010
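Subjects: concurrent online testing, many core, safety-critical systems, in-field testing, electronic wearout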
Online Access: https://hdl.handle.net/1969.1/ETD-TAMU-2010-12-8615
id fttexasamuniv:oai:oaktrust.library.tamu.edu:1969.1/ETD-TAMU-2010-12-8615
record_format openpolar
spelling fttexasamuniv:oai:oaktrust.library.tamu.edu:1969.1/ETD-TAMU-2010-12-8615 2023-07-16T03:57:33+02:00 December 2010 application/pdf en_US eng Thesis text fttexasamuniv 2023-06-27T22:22:35Z Texas A&M University Digital Repository
institution Open Polar
collection Texas A&M University Digital Repository
op_collection_id fttexasamuniv
language English
topic concurrent online testing
many core
safety-critical systems
in-field testing
electronic wearout
description Shrinking transistor sizes have introduced new challenges and opportunities for system-on-chip (SoC) design and reliability. Smaller transistors are more susceptible to early lifetime failure and electronic wear-out, greatly reducing their reliable lifetimes. However, smaller transistors will also allow SoCs to contain hundreds of processing cores and other infrastructure components, with the potential for increased reliability through massive structural redundancy. Concurrent online testing (COLT) can provide sufficient reliability and availability to systems with this redundancy. COLT manages the process of testing a subset of processing cores while the rest of the system remains operational. This can be considered a temporary, graceful degradation of system performance that increases reliability while maintaining availability. In this dissertation, techniques to assist COLT are proposed and analyzed. The techniques focus on two major aspects of COLT feasibility: recovery time and test delivery costs. To reduce the time between failure and recovery, and thereby increase system availability, an anomaly-based test triggering unit (ATTU) is proposed to initiate COLT when anomalous network behavior is detected. Previous COLT techniques have relied on initiating tests periodically. However, the testing period must be derived from a device's mean time between failures (MTBF), and calculating MTBF is exceedingly difficult and imprecise. To address the test delivery costs associated with COLT, a distributed test vector storage (DTVS) technique is proposed to eliminate the dependency of test delivery costs on core location. Previous COLT techniques have relied on a single location to store test vectors, and it has been demonstrated that centralized storage of tests scales poorly as the number of cores per SoC grows. Assuming that the SoC organizes its processing cores with a regular topology, DTVS uses an interleaving technique to optimally distribute the test vectors across the entire ...
author2 Mahapatra, Rabinarayan N.
Walker, Duncan M.
Kim, Eun J.
Choi, Seong G.
format Thesis
author Lee, Jason Daniel
title Concurrent Online Testing for Many Core Systems-on-Chips
publishDate 2010
url https://hdl.handle.net/1969.1/ETD-TAMU-2010-12-8615