Task-Level Checkpointing and Localized Recovery to Tolerate Permanent Node Failures for Nested Fork–Join Programs in Clusters

Reitz, Lukas; Fohry, Claudia

dc.date.accessioned	2024-03-18T09:50:46Z
dc.date.available	2024-03-18T09:50:46Z
dc.date.issued	2024-03-13
dc.identifier	doi:10.17170/kobra-202403149779
dc.identifier.uri	http://hdl.handle.net/123456789/15562
dc.description.sponsorship	Funded as part of Projekt DEAL	eng
dc.language.iso	eng
dc.rights	Attribution 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	*
dc.subject	asynchronous many-task programming	eng
dc.subject	fault tolerance	eng
dc.subject	task-level checkpointing	eng
dc.subject	work stealing	eng
dc.subject.ddc	004
dc.title	Task-Level Checkpointing and Localized Recovery to Tolerate Permanent Node Failures for Nested Fork–Join Programs in Clusters	eng
dc.type	Article
dcterms.abstract	Exascale supercomputers consist of millions of processing units, and this number is still growing. Therefore, hardware failures, such as permanent node failures, become increasingly frequent. They can be tolerated with system-level Checkpoint/Restart, which saves the whole application state transparently and, if needed, restarts the application from the saved state; or with application-level checkpointing, which saves only relevant data via explicit calls in the program. The former approach requires no additional programming expense, whereas the latter is more efficient and allows program execution to continue on the intact resources after failures (localized shrinking recovery). An increasingly popular programming paradigm is asynchronous many-task (AMT) programming. Here, programmers identify parallel tasks, and a runtime system assigns the tasks to worker threads. Since tasks have clearly defined interfaces, the runtime system can automatically extract and save their interface data. This approach, called task-level checkpointing (TC), combines the respective strengths of system-level and application-level checkpointing. AMTs come in many variants, and so far, TC has only been applied to a few, rather simple variants. This paper considers TC for a different AMT variant: nested fork–join (NFJ) programs that run on clusters of multicore nodes under work stealing. We present the first TC scheme for this setting. It performs a localized shrinking recovery and can handle multiple node failures. In experiments with four benchmarks, we observed execution time overheads of around 44 % at 1536 workers, and negligible recovery costs. Additionally, we developed and experimentally validated a prediction model for the running times of the scheme.	eng
dcterms.accessRights	open access
dcterms.creator	Reitz, Lukas
dcterms.creator	Fohry, Claudia
dc.relation.doi	doi:10.1007/s42979-024-02624-8
dc.subject.swd	Programmierung	ger
dc.subject.swd	Fehlertoleranz	ger
dc.subject.swd	Fixpunkt <Datensicherung>	ger
dc.subject.swd	Cluster	ger
dc.type.version	publishedVersion
dcterms.source.identifier	eissn:2661-8907
dcterms.source.journal	SN Computer Science	eng
dcterms.source.volume	Volume 5
kup.iskup	false
dcterms.source.articlenumber	320

Files in this item

Name: license_rdf
Size: 908 bytes
Format: application/rdf+xml

Name: s42979_024_02624_8.pdf
Size: 1.313 MB
Format: PDF

Except where otherwise noted, this item's license is described as: Attribution 4.0 International