以前也用过爬虫,比如使用nutch爬取指定种子,基于爬到的数据做搜索,还大致看过一些源码。当然,nutch对于爬虫考虑的是十分全面和细致的。每当看到屏幕上唰唰过去的爬取到的网页信息以及处理信息的时候,总感觉这很黑科技。正好这次借助梳理Spring MVC的机会,想自己弄个小爬虫,简单没关系,有些小bug也无所谓,我需要的只是一个能针对某个种子网站能爬取我想要的信息就可以了。有Exception就去解决,可能是一些API使用不当,也可能是遇到了http请求状态异常,又或是数据库读写有问题,就是在这个报exception和解决exception的过程中,JewelCrawler(儿子的小名)已经可以能够独立的爬取数据,并且还有一项基于Word2Vec算法做个情感分析的小技能。

  后面可能还会有未知的Exception等着解决,也有一些性能需要优化,比如和数据库的交互,数据的读写等等。但是目测年内没有太多精力放这上面了,所以今天做一个简单的总结,而且前两篇主要侧重的是功能和结果,这篇来说说JewelCrawler是如何诞生的,并将代码放到Github上(源码地址在文章最后),有兴趣的可以关注下(仅供交流学习,请勿他用,考虑下douban君。多一点真诚,少一点伤害)

环境介绍

  开发工具:Intellij idea 14

  数据库: Mysql 5.5 + 数据库管理工具Navicat(可用来连接查询数据库)

  语言:Java

  Jar包管理:Maven

  版本管理:Git

目录结构

  其中

  com.ansj.vec是Word2Vec算法的Java版本实现

  com.jackie.crawler.doubanmovie是爬虫实现模块,其中又包括

  有些包是空的,因为这些模块还没有用上,其中

    constants包是存放常量类

    crawl包存放爬虫入口程序

    entity包映射数据库表的实体类

    test包存放测试类

    utils包存放工具类

  resource模块存放的是配置文件和资源文件,比如

    beans.xml:Spring上下文的配置文件

    seed.properties:种子文件

    stopwords.dic:停用词库

    comment12031715.txt:爬取的短评数据

    tokenizerResult.txt:使用IKAnalyzer分词后的结果文件

    vector.mod:基于Word2Vec算法训练的模型数据

  test模块是测试模块,用于编写UT.

数据库配置

  1. 添加依赖的包

  JewelCrawler使用的maven管理,所以只需要在pom.xml中添加相应的依赖就可以了

<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-jdbc</artifactId>
<version>4.1.1.RELEASE</version>
</dependency>
<dependency>
<groupId>commons-pool</groupId>
<artifactId>commons-pool</artifactId>
<version>1.6</version>
</dependency>
<dependency>
<groupId>commons-dbcp</groupId>
<artifactId>commons-dbcp</artifactId>
<version>1.4</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.38</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.38</version>
</dependency>

  

  2. 声明数据源bean

  我们需要在beans.xml中声明数据源的bean

<context:property-placeholder location="classpath*:*.properties"/>
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close">
<property name="driverClassName" value="${jdbc.driver}"/>
<property name="url" value="${jdbc.url}"/>
<property name="username" value="${jdbc.username}"/>
<property name="password" value="${jdbc.password}"/>
</bean>

注意: 这里是绑定了外部配置文件jdbc.properties,具体数据源的参数从该文件读取。

  如果遇到问题“SQL [insert into user(id) values(?)]; Field 'name' doesn't  have a default value;”解决方法是设置表的相应字段为自增长字段。

解析页面遇到的问题

  对于爬到的网页数据需要解析dom结构,拿到自己想要的数据,期间遇到如下错误

  org.htmlparser.Node不识别

  解决方法:添加jar包依赖

<dependency>
<groupId>org.htmlparser</groupId>
<artifactId>htmlparser</artifactId>
<version>1.6</version>
</dependency>

  

  org.apache.http.HttpEntity不识别

  解决方法:添加jar包依赖

<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.2</version>
</dependency> 

  当然这是期间遇到的问题,最后用的是Jsoup做的页面解析。

maven仓库下载速度慢

  之前使用的是默认的maven中央仓库,下载jar包的速度很慢,不知道是我的网络问题还是其他原因,后来在网上找到了阿里云的maven仓库,更新后,相比之前简直是秒下,吐血推荐。

<mirrors>
<mirror>
<id>alimaven</id>
<name>aliyun maven</name>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
<mirrorOf>central</mirrorOf>
</mirror>
</mirrors>

  找到maven的settings.xml文件,添加这个镜像即可。

读取resource模块下文件的一种方法

  比如读取seed.properties文件

@Test
public void testFile(){
File seedFile = new File(this.getClass().getResource("/seed.properties").getPath());
System.out.print("===========" + seedFile.length() + "===========" );
}

  

有关正则表达式

  使用regrex正则表达式的时候,如果匹配上了定义的Pattern,则需要先调用matcher的find方法然后才能使用group方法找到子串。直接调用group方法是没有办法找到你想要的结果的。

  我看了下上面Matcher类的源码

package java.util.regex;

import java.util.Objects;

public final class Matcher implements MatchResult {

    /**
* The Pattern object that created this Matcher.
*/
Pattern parentPattern; /**
* The storage used by groups. They may contain invalid values if
* a group was skipped during the matching.
*/
int[] groups; /**
* The range within the sequence that is to be matched. Anchors
* will match at these "hard" boundaries. Changing the region
* changes these values.
*/
int from, to; /**
* Lookbehind uses this value to ensure that the subexpression
* match ends at the point where the lookbehind was encountered.
*/
int lookbehindTo; /**
* The original string being matched.
*/
CharSequence text; /**
* Matcher state used by the last node. NOANCHOR is used when a
* match does not have to consume all of the input. ENDANCHOR is
* the mode used for matching all the input.
*/
static final int ENDANCHOR = 1;
static final int NOANCHOR = 0;
int acceptMode = NOANCHOR; /**
* The range of string that last matched the pattern. If the last
* match failed then first is -1; last initially holds 0 then it
* holds the index of the end of the last match (which is where the
* next search starts).
*/
int first = -1, last = 0; /**
* The end index of what matched in the last match operation.
*/
int oldLast = -1; /**
* The index of the last position appended in a substitution.
*/
int lastAppendPosition = 0; /**
* Storage used by nodes to tell what repetition they are on in
* a pattern, and where groups begin. The nodes themselves are stateless,
* so they rely on this field to hold state during a match.
*/
int[] locals; /**
* Boolean indicating whether or not more input could change
* the results of the last match.
*
* If hitEnd is true, and a match was found, then more input
* might cause a different match to be found.
* If hitEnd is true and a match was not found, then more
* input could cause a match to be found.
* If hitEnd is false and a match was found, then more input
* will not change the match.
* If hitEnd is false and a match was not found, then more
* input will not cause a match to be found.
*/
boolean hitEnd; /**
* Boolean indicating whether or not more input could change
* a positive match into a negative one.
*
* If requireEnd is true, and a match was found, then more
* input could cause the match to be lost.
* If requireEnd is false and a match was found, then more
* input might change the match but the match won't be lost.
* If a match was not found, then requireEnd has no meaning.
*/
boolean requireEnd; /**
* If transparentBounds is true then the boundaries of this
* matcher's region are transparent to lookahead, lookbehind,
* and boundary matching constructs that try to see beyond them.
*/
boolean transparentBounds = false; /**
* If anchoringBounds is true then the boundaries of this
* matcher's region match anchors such as ^ and $.
*/
boolean anchoringBounds = true; /**
* No default constructor.
*/
Matcher() {
} /**
* All matchers have the state used by Pattern during a match.
*/
Matcher(Pattern parent, CharSequence text) {
this.parentPattern = parent;
this.text = text; // Allocate state storage
int parentGroupCount = Math.max(parent.capturingGroupCount, 10);
groups = new int[parentGroupCount * 2];
locals = new int[parent.localCount]; // Put fields into initial states
reset();
}
....
/**
* Returns the input subsequence matched by the previous match.
*
* <p> For a matcher <i>m</i> with input sequence <i>s</i>,
* the expressions <i>m.</i><tt>group()</tt> and
* <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(),</tt> <i>m.</i><tt>end())</tt>
* are equivalent. </p>
*
* <p> Note that some patterns, for example <tt>a*</tt>, match the empty
* string. This method will return the empty string when the pattern
* successfully matches the empty string in the input. </p>
*
* @return The (possibly empty) subsequence matched by the previous match,
* in string form
*
* @throws IllegalStateException
* If no match has yet been attempted,
* or if the previous match operation failed
*/
public String group() {
return group(0);
}
/**
* Returns the input subsequence captured by the given group during the
* previous match operation.
*
* <p> For a matcher <i>m</i>, input sequence <i>s</i>, and group index
* <i>g</i>, the expressions <i>m.</i><tt>group(</tt><i>g</i><tt>)</tt> and
* <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(</tt><i>g</i><tt>),</tt> <i>m.</i><tt>end(</tt><i>g</i><tt>))</tt>
* are equivalent. </p>
*
* <p> <a href="Pattern.html#cg">Capturing groups</a> are indexed from left
* to right, starting at one. Group zero denotes the entire pattern, so
* the expression <tt>m.group(0)</tt> is equivalent to <tt>m.group()</tt>.
* </p>
*
* <p> If the match was successful but the group specified failed to match
* any part of the input sequence, then <tt>null</tt> is returned. Note
* that some groups, for example <tt>(a*)</tt>, match the empty string.
* This method will return the empty string when such a group successfully
* matches the empty string in the input. </p>
*
* @param group
* The index of a capturing group in this matcher's pattern
*
* @return The (possibly empty) subsequence captured by the group
* during the previous match, or <tt>null</tt> if the group
* failed to match part of the input
*
* @throws IllegalStateException
* If no match has yet been attempted,
* or if the previous match operation failed
*
* @throws IndexOutOfBoundsException
* If there is no capturing group in the pattern
* with the given index
*/
public String group(int group) {
if (first < 0)
throw new IllegalStateException("No match found");
if (group < 0 || group > groupCount())
throw new IndexOutOfBoundsException("No group " + group);
if ((groups[group*2] == -1) || (groups[group*2+1] == -1))
return null;
return getSubSequence(groups[group * 2], groups[group * 2 + 1]).toString();
}
/**
* Attempts to find the next subsequence of the input sequence that matches
* the pattern.
*
* <p> This method starts at the beginning of this matcher's region, or, if
* a previous invocation of the method was successful and the matcher has
* not since been reset, at the first character not matched by the previous
* match.
*
* <p> If the match succeeds then more information can be obtained via the
* <tt>start</tt>, <tt>end</tt>, and <tt>group</tt> methods. </p>
*
* @return <tt>true</tt> if, and only if, a subsequence of the input
* sequence matches this matcher's pattern
*/
public boolean find() {
int nextSearchIndex = last;
if (nextSearchIndex == first)
nextSearchIndex++; // If next search starts before region, start it at region
if (nextSearchIndex < from)
nextSearchIndex = from; // If next search starts beyond region then it fails
if (nextSearchIndex > to) {
for (int i = 0; i < groups.length; i++)
groups[i] = -1;
return false;
}
return search(nextSearchIndex);
} /**
* Initiates a search to find a Pattern within the given bounds.
* The groups are filled with default values and the match of the root
* of the state machine is called. The state machine will hold the state
* of the match as it proceeds in this matcher.
*
* Matcher.from is not set here, because it is the "hard" boundary
* of the start of the search which anchors will set to. The from param
* is the "soft" boundary of the start of the search, meaning that the
* regex tries to match at that index but ^ won't match there. Subsequent
* calls to the search methods start at a new "soft" boundary which is
* the end of the previous match.
*/
boolean search(int from) {
this.hitEnd = false;
this.requireEnd = false;
from = from < 0 ? 0 : from;
this.first = from;
this.oldLast = oldLast < 0 ? from : oldLast;
for (int i = 0; i < groups.length; i++)
groups[i] = -1;
acceptMode = NOANCHOR;
boolean result = parentPattern.root.match(this, from, text);
if (!result)
this.first = -1;
this.oldLast = this.last;
return result;
}
...
}

  原因是这样的:这里如果不先调用find方法,直接调用group,可以发现group方法调用group(int group),该方法的方法体中有if first<0,显然这里这个条件是成立的,因为first的初始值就是-1,所以这里会抛异常。但是如果调用find方法,可以发现,最终会调用search(nextSearchIndex),注意这里的nextSearchIndex已被last赋值,而last的值为0,再跳转到search方法中

boolean search(int from) {
this.hitEnd = false;
this.requireEnd = false;
from = from < 0 ? 0 : from;
this.first = from;
this.oldLast = oldLast < 0 ? from : oldLast;
for (int i = 0; i < groups.length; i++)
groups[i] = -1;
acceptMode = NOANCHOR;
boolean result = parentPattern.root.match(this, from, text);
if (!result)
this.first = -1;
this.oldLast = this.last;
return result;
}

  这个nextSearchIndex传给了from,而from在方法体中被赋值给了first,所以,调用了find方法之后,这个的first就不在是-1,也就不是抛异常了。

  源码已经上传至Github:https://github.com/DMinerJackie/JewelCrawler

  以上说的问题比较碎,都是在遇到问题和解决问题的时候的一些总结。在具体操作的时候还会遇到其他问题,有问题或者建议的话欢迎提出来^^。

  最后放几张截止目前爬取的数据

  Record表

  其中存储的是79032条,爬取过的网页有48471条

  movie表

  目前爬取了2964部影视作品

  comments表

  爬取了29711条记录

  如果您觉得阅读本文对您有帮助,请点一下“推荐”按钮,您的“推荐”将是我最大的写作动力!如果您想持续关注我的文章,请扫描二维码,关注JackieZheng的微信公众号,我会将我的文章推送给您,并和您一起分享我日常阅读过的优质文章。


 

Java豆瓣电影爬虫——小爬虫成长记(附源码)的更多相关文章

  1. Java之函数式接口@FunctionalInterface详解(附源码)

    Java之函数式接口@FunctionalInterface详解 函数式接口的定义 在java8中,满足下面任意一个条件的接口都是函数式接口: 1.被@FunctionalInterface注释的接口 ...

  2. 史林枫:开源HtmlAgilityPack公共小类库封装 - 网页采集(爬虫)辅助解析利器【附源码+可视化工具推荐】

    做开发的,可能都做过信息采集相关的程序,史林枫也经常做一些数据采集或某些网站的业务办理自动化操作软件. 获取目标网页的信息很简单,使用网络编程,利用HttpWebResponse.HttpWebReq ...

  3. Java平台调用Python平台已有算法(附源码及解析)

    1. 问题描述 Java平台要调用Pyhon平台已有的算法,为了减少耦合度,采用Pyhon平台提供Restful 接口,Java平台负责来调用,采用Http+Json格式交互. 2. 解决方案 2.1 ...

  4. 教你用纯Java实现一个网页版的Xshell(附源码)

    前言 最近由于项目需求,项目中需要实现一个WebSSH连接终端的功能,由于自己第一次做这类型功能,所以首先上了GitHub找了找有没有现成的轮子可以拿来直接用,当时看到了很多这方面的项目,例如:Gat ...

  5. Android 音视频深入 八 小视频录制(附源码下载)

    本篇项目地址,求starthttps://github.com/979451341/Audio-and-video-learning-materials/tree/master/%E5%B0%8F%E ...

  6. java打字游戏-一款快速提升java程序员打字速度的游戏(附源码)

    一.效果如图: 源码地址:https://gitee.com/hoosson/TYPER 纯干货,别忘了留个赞哦!

  7. 一款Java开源的Springboot即时通讯 IM,附源码

    # 开篇 电商平台最不能缺的就是即时通讯,例如通知类下发,客服聊天等.今天,就来给大家分享一个开源的即时通讯系统.如对文章不感兴趣可直接跳至文章末尾,有获取源码链接的方法. 但文章内容是需要你简单的过 ...

  8. Java豆瓣电影爬虫——抓取电影详情和电影短评数据

    一直想做个这样的爬虫:定制自己的种子,爬取想要的数据,做点力所能及的小分析.正好,这段时间宝宝出生,一边陪宝宝和宝妈,一边把自己做的这个豆瓣电影爬虫的数据采集部分跑起来.现在做一个概要的介绍和演示. ...

  9. PYTHON爬虫实战_垃圾佬闲鱼爬虫转转爬虫数据整合自用二手急速响应捡垃圾平台_3(附源码持续更新)

    说明 文章首发于HURUWO的博客小站,本平台做同步备份发布. 如有浏览或访问异常图片加载失败或者相关疑问可前往原博客下评论浏览. 原文链接 PYTHON爬虫实战_垃圾佬闲鱼爬虫转转爬虫数据整合自用二 ...

  10. 微信小程序——智能小秘“遥知之”源码分享(语义理解基于olami)

    微信小程序智能生活小秘书开发详解 >>>>>>>>>>>>>>>>>>>>> ...

随机推荐

  1. Windows内核遍历驱动模块源码分析

    要获取windows 内核中所有驱动模块信息,调用 系统服务函数 NtQuerySystemInformation,参数SystemInformationClass 传入SystemModuleInf ...

  2. noj [1480] 懒惰的风纪委Elaine (多重背包)

    http://ac.nbutoj.com/Problem/view.xhtml?id=1480 [1480] 懒惰的风纪委Elaine 时间限制: 1000 ms 内存限制: 65535 K 问题描述 ...

  3. c++构造析构顺序

    class A { public: A(){ cout << "constrcut A" << endl; }; ~A(){ cout << & ...

  4. strictmode

    最新的Android平台中(Android 2.3起),新增加了一个新的类,叫StrictMode(android.os.StrictMode).这个类可以用来帮助开发者改进他们编写的应用,并且提供了 ...

  5. 来一波CSS兼容问题小总结吧

    1.DOCTYPE 影响 CSS 处理; 2.火狐 谷歌等浏览器 设置 padding 后, div 会增加 height 和 width, 但 IE 不会, 故需要用 !important 多设一个 ...

  6. 存图方式---邻接表&amp;邻接矩阵&amp;前向星

    基于vector存图 struct Edge { int u, v, w; Edge(){} Edge(int u, int v, int w):u(u), v(v), w(w){} }; vecto ...

  7. LeetCode算法题-Sum of Two Integers(Java实现)

    这是悦乐书的第210次更新,第222篇原创 01 看题和准备 今天介绍的是LeetCode算法题中Easy级别的第78题(顺位题号是371).计算两个整数a和b的总和,但不允许使用运算符+和 - .例 ...

  8. 爬虫之xpath用法

    导包用: from lxml import etree

  9. AGC027 C - ABland Yard 拓扑排序

    目录 题目链接 题解 代码 题目链接 AGC027 C - ABland Yard 题解 发现有解的充要条件是有一个形为AABBAABBAABB的环 此时每一个点至少与两个不同颜色的点相连 对于初始不 ...

  10. iOS-如何让xcode自动检查内存泄露

    在project-setting中找到 “Run Static Analyzer” 键,然后把值修改为“YES”.这样在编码的时候,xcode就可以自动为我们检查内存泄露了.