计算机工程与科学
COMPUTER ENGINEERING & SCIENCE
2009
Issue 11
pp. 87-90
4 pages
刘杰%张亦添%迟利华%徐涵%蒋杰%胡庆丰
high performance computing%fault tolerance%checkpoint/restart%parallel programs
Large-scale scientific and engineering computing requires numerical simulations of unprecedented complexity and the processing of unprecedented volumes of data, so a fault-tolerant environment that automatically reschedules and reloads failed programs is needed. Building on parallel jobs and the checkpoint/restart function provided by the system, we design a user-level fault-tolerant auto-scheduling environment for parallel jobs, comprising algorithms for automatic failure detection, automatic reloading, and checkpoint data integrity assurance. Test results show that the environment preserves the integrity of checkpoint data and that, after an application exits with an error, it detects the failure and resubmits the job automatically. Parallel jobs are thus rescheduled without user intervention, avoiding wasted system resources and computing time.
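The abstract names three components (failure detection, automatic reloading, and checkpoint data integrity) without giving the algorithms. As a rough illustration only, one way such a loop could be arranged is sketched below. Everything in it is an assumption rather than the authors' method: the function names `write_checkpoint`, `load_checkpoint`, and `run_with_restart`, the SHA-256 digest stored next to the payload, the atomic rename, and the restart limit are all hypothetical. An in-process callable stands in for the parallel job; a real environment would instead inspect the exit status reported by the batch system.

```python
import hashlib
import json
import os


def write_checkpoint(path, state):
    """Write state with a SHA-256 digest, then rename atomically (assumed scheme)."""
    payload = json.dumps(state, sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"digest": digest, "payload": payload}, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename is atomic, so a crash mid-write leaves the old checkpoint intact.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or None if the file is missing or fails the digest check."""
    try:
        with open(path) as f:
            record = json.load(f)
        payload = record["payload"]
        if hashlib.sha256(payload.encode()).hexdigest() != record["digest"]:
            return None
        return json.loads(payload)
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        return None


def run_with_restart(job, path, max_restarts=3):
    """Run job(state) until it succeeds, reloading the last valid checkpoint after each failure.

    Returns (job result, number of restarts). Re-raises once max_restarts is exceeded.
    """
    restarts = 0
    while True:
        state = load_checkpoint(path)  # None means "start from scratch"
        try:
            return job(state), restarts
        except Exception:
            # Failure detected: count it and resubmit without user intervention.
            restarts += 1
            if restarts > max_restarts:
                raise
```

In this sketch a corrupt or half-written checkpoint is treated the same as a missing one, so the job restarts from the beginning rather than from bad data. Whether the paper's environment falls back in the same way is not stated in the abstract.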