Computer Engineering (计算机工程)
2014
Issue 5
pp. 12-16, 20
6 pages in total
microblog data%simulated login%user network%user influence%Internet public opinion%priority queue
Commonly used Web crawlers and algorithms that collect data through the microblog API struggle to meet the microblog data needs of public opinion monitoring systems. To address this, the paper proposes an algorithm that simulates a browser login to the microblog site and scrapes data from its Web pages, making it easy to obtain all the data on any microblog user's pages. A user network is built from the relationships among microblog users, and new users are discovered through this network. To obtain high-quality microblog data, a mathematical model is established that computes user influence from factors such as the number of posts, posting frequency, number of followers, number of reposts, and number of comments. A priority queue is then built with influence as the main factor, so that users with greater influence have their data collected more frequently, while a time interval is also computed so that data from inactive users is still collected. Experimental results show that the algorithm is general, requires no manual intervention, and obtains high-quality information quickly.
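The influence-driven scheduling described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual mathematical model, so everything specific here is an assumption made for illustration: the `UserStats` fields, the log-scaled weighted sum in `influence`, the weights in `WEIGHTS`, and the interval bounds in `crawl_interval` are all hypothetical. The sketch only shows the general idea of a priority queue in which higher-influence users are revisited more often while a capped interval ensures inactive users are still crawled.

```python
import heapq
import math
from dataclasses import dataclass


@dataclass
class UserStats:
    # Hypothetical per-user statistics, named after the factors in the abstract.
    name: str
    posts: int
    posts_per_day: float
    followers: int
    reposts: int
    comments: int


# Assumed weights; the paper's real model is not stated in the abstract.
WEIGHTS = {"posts": 0.1, "posts_per_day": 0.2, "followers": 0.3,
           "reposts": 0.2, "comments": 0.2}


def influence(u: UserStats) -> float:
    """Illustrative influence score: weighted sum of log-scaled factors."""
    return sum(w * math.log1p(getattr(u, field)) for field, w in WEIGHTS.items())


def crawl_interval(score: float, min_s: float = 60.0, max_s: float = 86400.0) -> float:
    """Seconds between crawls: shorter for higher scores, capped at max_s
    so that low-influence (inactive) users are still visited."""
    return max(min_s, min(max_s, max_s / (1.0 + score)))


class CrawlScheduler:
    """Priority queue keyed on the next due time of each user."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, str]] = []
        self._users: dict[str, UserStats] = {}

    def add(self, user: UserStats, now: float = 0.0) -> None:
        self._users[user.name] = user
        heapq.heappush(self._heap, (now + crawl_interval(influence(user)), user.name))

    def pop_due(self) -> tuple[float, str]:
        """Return the next (due_time, name) and reschedule that user."""
        due, name = heapq.heappop(self._heap)
        step = crawl_interval(influence(self._users[name]))
        heapq.heappush(self._heap, (due + step, name))
        return due, name


if __name__ == "__main__":
    sched = CrawlScheduler()
    sched.add(UserStats("celebrity", 5000, 20.0, 1_000_000, 50_000, 30_000))
    sched.add(UserStats("lurker", 10, 0.1, 20, 1, 2))
    for _ in range(10):
        print(sched.pop_due())
```

Keying the heap on the next due time rather than on the raw influence score is one possible design choice: it lets a single queue express both "more influential means more frequent" and "nobody is starved", because every user is re-queued with a finite interval after each visit.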