计算机工程
COMPUTER ENGINEERING
2014
Issue 3
pp. 59-62, 81
5 pages in total
MapReduce model; dimension; fact; surrogate key; parallel lookup; aggregation
Traditional Extract, Transform, Load (ETL) tools are inefficient when they face massive fact data in a data warehouse. To address this, two algorithms for parallel fact processing are designed and implemented on the Hadoop platform. Approaching the problem from two angles, surrogate key lookup for fact tables and multi-granularity pre-aggregation of facts, the paper proposes a multi-way parallel lookup algorithm on slowly changing dimension tables and an algorithm for aggregating fact data at different granularities. The first algorithm takes both slowly changing dimensions and big dimensions into account. It uses the distributed cache to copy small dimension tables into the memory of every data node, and it partitions the fact data and the big dimension data with the same partition function, which solves the problem of insufficient memory. The multi-way surrogate key lookup is performed in the Map stage, which avoids the network delay caused by data transmission. The second algorithm adds a Merge stage after the Reduce stage, which effectively solves the problem of aggregating fact data at different granularities. Experimental results show that, compared with the Hive data warehouse, the two algorithms process warehouse fact data in parallel more efficiently.
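The abstract gives no implementation detail, so the following is only a minimal single-process sketch of the two ideas, not the authors' Hadoop code. It assumes that each small slowly changing dimension is held in memory as a map from business key to date-ordered versions (standing in for the distributed cache copy), and that coarser aggregates can be derived from finer ones in a later step (one possible reading of the Merge stage). All table contents and the names `PRODUCT_DIM`, `STORE_DIM`, `lookup_surrogate_key`, `map_fact`, and `aggregate` are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical small SCD tables, as each mapper might hold them in memory:
# business key -> list of (valid_from, surrogate_key), sorted by valid_from.
PRODUCT_DIM = {
    "P1": [("2013-01-01", 101), ("2013-07-01", 102)],
    "P2": [("2013-01-01", 201)],
}
STORE_DIM = {
    "S1": [("2013-01-01", 11)],
    "S2": [("2013-03-01", 21)],
}


def lookup_surrogate_key(dim, business_key, date):
    """Return the surrogate key of the version valid on `date`, or None."""
    versions = dim.get(business_key, [])
    starts = [valid_from for valid_from, _ in versions]
    i = bisect_right(starts, date) - 1  # last version starting on or before date
    return versions[i][1] if i >= 0 else None


def map_fact(record):
    """Map-side multi-way lookup: resolve all dimension keys in one pass."""
    product, store, date, amount = record
    return (
        lookup_surrogate_key(PRODUCT_DIM, product, date),
        lookup_surrogate_key(STORE_DIM, store, date),
        date,
        amount,
    )


def aggregate(rows, key_fn):
    """Sum the last field of each row per key (stands in for reduce or merge)."""
    totals = defaultdict(float)
    for row in rows:
        totals[key_fn(row)] += row[-1]
    return dict(totals)


facts = [
    ("P1", "S1", "2013-02-10", 5.0),
    ("P1", "S1", "2013-08-02", 7.0),
    ("P1", "S1", "2013-08-20", 1.0),
    ("P2", "S2", "2013-08-15", 3.0),
]
keyed = [map_fact(f) for f in facts]

# Finest granularity: (product key, store key, day).
daily = aggregate(keyed, lambda r: (r[0], r[1], r[2]))

# Coarser granularity, derived from the daily result rather than the raw
# facts (assumed role of the Merge stage): (product key, store key, month).
monthly = aggregate(
    [(p, s, d[:7], amt) for (p, s, d), amt in daily.items()],
    lambda r: (r[0], r[1], r[2]),
)
```

Because every dimension key is resolved inside `map_fact`, no fact record has to be shuffled across the network just to join it with a dimension. In a real cluster the big dimensions would not fit in memory, which is where partitioning facts and big dimensions with the same function would come in; that part is not modeled here.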