Design and Implementation of Distributed Netnews Crawling System Based on Scrapy


Design and Implementation of Distributed Netnews Crawling System Based on Scrapy

Xidian University Master's Thesis

A thesis submitted to XIDIAN UNIVERSITY in partial fulfillment of the requirements for the degree of Master in Computer Application Technology

By Ma Lianshuai
Supervisor: Yao Yong, Associate Professor
December 2015

ABSTRACT

Driven by the rapid development of the Internet, the basic patterns of daily life have quietly changed. Earlier ways of exchanging goods and disseminating information have become the "non-mainstream" of the new era, and the Internet has replaced them as a necessity of social life. News is one of the most important channels through which people obtain information. With the development and application of network technology, news media have evolved into a new media that merges traditional media with Internet media, and the public has a growing number of ways to access news. As the time lag of online news keeps shrinking, more social groups are turning to the Internet for news, big-data research based on online news is becoming increasingly popular, and the research community's demand for online news data keeps growing. Against this background, this thesis designs and implements a distributed network news crawling system that collects online news data to support related research.

Based on the research topic, the thesis introduces the origin, development, and working principle of web crawlers, the structure and workflow of the Scrapy framework, the composition of Scrapy-Redis and the function of each of its components, and the concepts related to Graphite. After an in-depth analysis of the characteristics of news crawlers, the crawling strategy and the extraction fields are designed according to the characteristics of news webpages. Built on the Scrapy framework, the system uses a custom download middleware to keep the crawler from being blocked by websites, uses the Redis database to deploy a distributed crawler cluster with a master-slave structure to improve crawling efficiency, uses Graphite to visualize the system state, and uses Selenium to crawl data from dynamic webpages. A data processing module is also designed and implemented; its main functions are data cleaning, encoding conversion, object addition, and data classification.

To test the performance of the system, four major news columns of Tencent (domestic news, international news, society news, and military news) were chosen as crawl targets. In a 10-hour run, the system collected more than 30,000 news items and several million comments. Finally, through three basic data analysis experiments, the thesis examines six features of online news from three perspectives (news content, online media, and user comments): public opinion hotspots, time-dimension characteristics of news, user browsing preferences, media influence, the gender distribution of commenting users, and the regional distribution of commenting users. These experiments verify the objectivity and accuracy of the data and the diversity of its features.

Keywords: network news, distributed crawler, data processing, data analysis
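
As an illustration of the data processing functions listed in the abstract (data cleaning and encoding conversion) and of the duplicate-ID filtering named in Figure 3.12, the following is a minimal Python sketch, not the thesis's actual module. It assumes, hypothetically, that each crawled record arrives as one GBK-encoded JSON line with `id` and `title` fields; the function names `transcode`, `clean`, `process`, and `to_utf8_line`, the field names, and the choice of GBK as the source encoding are all invented for this example.

```python
import json


def transcode(raw: bytes, source_encoding: str = "gbk") -> str:
    """Decode raw bytes from the assumed source encoding into a Python str.

    Bytes that cannot be decoded are replaced with U+FFFD instead of raising.
    """
    return raw.decode(source_encoding, errors="replace")


def clean(text: str) -> str:
    """Strip leading/trailing whitespace and collapse inner runs to one space."""
    return " ".join(text.split())


def process(raw_lines, source_encoding="gbk"):
    """Transcode, clean, and de-duplicate records by their (assumed) `id` field.

    The first record seen for each id is kept; later duplicates are dropped.
    """
    seen_ids = set()
    records = []
    for raw in raw_lines:
        record = json.loads(transcode(raw, source_encoding))
        if record["id"] in seen_ids:
            continue  # a record with this id was already kept
        seen_ids.add(record["id"])
        record["title"] = clean(record["title"])
        records.append(record)
    return records


def to_utf8_line(record: dict) -> bytes:
    """Serialize one record as a UTF-8 encoded JSON line."""
    return (json.dumps(record, ensure_ascii=False) + "\n").encode("utf-8")
```

Keeping the set of seen ids in memory is adequate for a run on the scale the abstract reports (tens of thousands of news items); a crawl spread across several machines would more plausibly keep that set in a shared store, which this sketch does not attempt.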

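List of Figures

Figure 2.1 Working principle of a web crawler
Figure 2.2 Example of traversal strategies
Figure 2.3 Basic structure of a distributed crawling system
Figure 2.4 Scrapy framework structure
Figure 2.5 Scrapy working principle
Figure 2.6 Redis working principle
Figure 2.7 Graphite component structure
Figure 2.8 Tree structure of the Graphite web application
Figure 3.1 Architecture of the network news crawling system
Figure 3.2 Webpage link processing flowchart
Figure 3.3 Principle of crawling dynamic webpages with Selenium
Figure 3.4 Distributed crawler architecture
Figure 3.5 Data storage scheme
Figure 3.6 Composition of the crawler program
Figure 3.7 XPath usage example
Figure 3.8 Principle of preventing the crawler from being blocked by websites
Figure 3.9 Flowchart of the dynamic webpage data crawling program
Figure 3.10 Scheduler class hierarchy
Figure 3.11 Composition of the data processing module
Figure 3.12 Filtering news data with duplicate IDs
Figure 3.13 Data classification method
Figure 3.14 File preview before encoding conversion
Figure 3.15 File preview after encoding conversion
Figure 3.16 Screenshot of Scrapy running
Figure 3.17 Screenshot of Redis data processing
Figure 3.18 File list of the crawled data
Figure 3.19 System status monitoring chart
Figure 4.1 Trend in the number of news items published, January-May
Figure 4.2 Trend in the number of news items published in March
Figure 4.3 Regional distribution of commenting users 1
Figure 4.4 Regional distribution of commenting users 2
Figure 4.5 Regional distribution of commenting users 3

List of Tables

Table 3.1 Examples of regular expressions for webpage URLs
Table 3.2 Item composition
Table 3.3 List of Graphite monitoring information
Table 3.4 Deployment environment information
Table 3.5 Format of news data
Table 3.6 Format of comment data
Table 3.7 Master hardware environment information
Table 3.8 Slave hardware environment information
Table 4.1 Top ten news items by number of comments in March
Table 4.2 Top ten news items by number of comments in May
Table 4.3 Top ten news items users most liked to browse in March
Table 4.4 Top ten news items users most liked to browse in April
Table 4.5 Top ten news items users most liked to browse in May
Table 4.6 Top ten news items by number of searches in March
Table 4.7 Top ten news items by number of searches in April
Table 4.8 Top ten news items by number of searches in May
Table 4.9 Top ten media outlets by number of news items published, January-May 2015
Table 4.10 Top ten media outlets by number of hot news items published, January-May 2015
Table 4.11 Gender composition of commenting users 1
Table 4.12 Gender composition of commenting users 2
Table 4.13 Gender composition of commenting users 3

List of Abbreviations

FOAF: Friend-of-a-Friend (an XML/RDF vocabulary)
URL: Uniform Resource Locator
DNS: Domain Name System
PNG: Portable Network Graphics
JSON: JavaScript Object Notation
JS: JavaScript (a scripting language)
UTF-8: 8-bit Unicode Transformation Format
API: Application Programming Interface
FIFO: First In First Out (first-in, first-out queue)
LIFO: Last In First Out (last-in, first-out)

Contents

Abstract (Chinese)
ABSTRACT
List of Figures
List of Tables
List of Abbreviations
Chapter 1 Introduction
  1.1 Background and significance of the topic
  1.2 Main research content and work
  1.3 Organization of the thesis
  1.4 Chapter summary
Chapter 2 Web Crawlers and the Scrapy Framework
  2.1 Web crawlers
    2.1.1 Origin of web crawlers
    2.1.2 Basic principles of web crawlers
  2.2 The Scrapy framework
    2.2.1 Scrapy framework structure
    2.2.2 Scrapy working principle and workflow
  2.3 Principles of Scrapy-Redis
    2.3.1 Overview of Redis
    2.3.2 Basic composition and principles of Scrapy-Redis
  2.4 Introduction to Graphite
  2.5 Chapter summary
Chapter 3 Design and Implementation of the Distributed Network News Crawling System
  3.1 Characteristics of network news crawlers
  3.2 Design of the distributed network news crawling system
    3.2.1 Overall system architecture design
    3.2.2 Crawling strategy design
    3.2.3 Crawled field design
    3.2.4 Design of the dynamic webpage crawling method
    3.2.5 Distributed design of the crawler
    3.2.6 Graphite-based system monitoring component
    3.2.7 Design of the data storage module
  3.3 Implementation of the distributed network news crawling system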
