计算机科学与探索
Journal of Frontiers of Computer Science & Technology
2015
Issue 10
pp. 1153-1162
10 pages
车永刚 (Che Yonggang)%张理论 (Zhang Lilun)%王勇献 (Wang Yongxian)%徐传福 (Xu Chuanfu)%程兴华 (Cheng Xinghua)
multicore%many integrated core%CFD application%OpenMP%performance analysis
Multicore and manycore processors have become the mainstream architectures in high performance computing, and OpenMP programming is one of the primary means of exploiting their parallel computing capability. Using a systematic approach that combines hardware performance counter based measurement with model based analysis, this paper evaluates the OpenMP performance of a real-world, high-order, structured-grid CFD (computational fluid dynamics) application on two platforms: the Intel Xeon E5 Sandy Bridge multicore processor and the Intel Knights Corner many integrated core coprocessor. The analysis focuses on how OpenMP library overhead, load balance among OpenMP threads, and main memory access bandwidth affect the application's performance. The results show that the redundant computation introduced by OpenMP parallelization has little effect on parallel efficiency, whereas the serial portion of the computation and load imbalance affect it significantly. Main memory access bandwidth significantly affects the achieved floating-point performance. The paper also compares the application's performance on the two architectures and discusses directions for further performance optimization.
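The abstract's finding that the serial portion and load imbalance limit parallel efficiency can be illustrated with a small model. The sketch below is not the model used in the paper; it is a minimal Amdahl's-law estimate, extended with an assumed load-imbalance factor (slowest thread's work divided by the mean thread work). The function names, the 5% serial fraction, the 16-thread count, and the 1.2 imbalance factor are all hypothetical values chosen for illustration.

```python
def amdahl_speedup(serial_fraction: float, threads: int, imbalance: float = 1.0) -> float:
    """Estimated speedup over one thread.

    serial_fraction: share of single-thread runtime that cannot be parallelized.
    threads:         number of OpenMP threads.
    imbalance:       (max thread work) / (mean thread work); 1.0 means perfect balance.
                     This factor is an illustrative assumption, not taken from the paper.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if threads < 1:
        raise ValueError("threads must be >= 1")
    if imbalance < 1.0:
        raise ValueError("imbalance must be >= 1.0")
    parallel_fraction = 1.0 - serial_fraction
    # The parallel region ends only when the most heavily loaded thread finishes.
    parallel_time = parallel_fraction * imbalance / threads
    return 1.0 / (serial_fraction + parallel_time)


def parallel_efficiency(serial_fraction: float, threads: int, imbalance: float = 1.0) -> float:
    """Speedup divided by thread count (1.0 is ideal)."""
    return amdahl_speedup(serial_fraction, threads, imbalance) / threads


if __name__ == "__main__":
    # Hypothetical inputs: 5% serial work, 16 threads, with and without imbalance.
    for imb in (1.0, 1.2):
        eff = parallel_efficiency(0.05, 16, imbalance=imb)
        print(f"imbalance={imb:.1f}  efficiency={eff:.3f}")
```

Under these assumed inputs, even a 5% serial fraction pulls the efficiency at 16 threads well below 1, and any imbalance lowers it further, which matches the direction of the effect reported in the abstract.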