Does anyone here know programming? How do you write a web crawler (scraper)?

Post 1

Posted 2012-5-22 06:47

Last edited by emucxg on 2012-5-22 06:47

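I've been wanting to build a website lately, initially just to earn a bit of PR.

The rough idea is to scrape articles first, then insert each article's title and content into a SQL database.

The data would be refreshed at regular intervals.

I've never built a scraper before, so I don't really know how to go about it.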
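
For what it's worth, the "title and content into a SQL database" step could look something like the sketch below. It is only a minimal illustration under several assumptions: SQLite stands in for whatever SQL database is actually used, the `articles` table and the names `ArticleParser`, `open_db` and `save_article` are made up for this example, and the "content" is simply all visible text on the page rather than a properly extracted article body.

```python
import sqlite3
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects the <title> text and the visible text of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) articles table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, title TEXT, content TEXT)"
    )
    return conn


def save_article(conn, url, html):
    """Parse one downloaded page and insert or refresh its row."""
    parser = ArticleParser()
    parser.feed(html)
    conn.execute(
        "INSERT OR REPLACE INTO articles (url, title, content) VALUES (?, ?, ?)",
        (url, parser.title.strip(), "\n".join(parser.text_parts)),
    )
    conn.commit()
```

Because the URL is the primary key and the insert uses `INSERT OR REPLACE`, running the same job again later (from cron or any other scheduler) refreshes existing rows instead of duplicating them, which is one simple way to get the periodic update.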


新建文件夹

Post 9

Posted 2012-5-22 17:26

Last edited by 新建文件夹 on 2012-5-23 11:00

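I built one of these back when I was an undergraduate.
Download the page to local disk first, then parse out the links in it and add them to a list kept in a database.
Then keep downloading by working through the links in that list.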
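
The loop described above can be sketched roughly as follows. This is a minimal illustration and not the code from that undergraduate project: the `queue` table, the names `LinkParser`, `fetch` and `crawl`, and the `max_pages` limit are all assumptions made for the example, and SQLite is used only because it ships with Python.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects absolute http(s) links from the <a href> tags of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def fetch(url):
    """Download one page; a real crawler would also handle charsets properly."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(conn, seed, fetch_page=fetch, max_pages=50):
    """Download pages starting from seed, keeping the link list in the database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        "url TEXT PRIMARY KEY, done INTEGER DEFAULT 0)"
    )
    conn.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (seed,))
    pages = {}
    while len(pages) < max_pages:
        # Take the oldest link that has not been downloaded yet.
        row = conn.execute(
            "SELECT url FROM queue WHERE done = 0 ORDER BY rowid LIMIT 1"
        ).fetchone()
        if row is None:
            break
        url = row[0]
        conn.execute("UPDATE queue SET done = 1 WHERE url = ?", (url,))
        try:
            html = fetch_page(url)
        except Exception:
            conn.commit()  # mark the failed URL as done and move on
            continue
        pages[url] = html
        parser = LinkParser(url)
        parser.feed(html)
        # The primary key makes duplicate links a no-op.
        conn.executemany(
            "INSERT OR IGNORE INTO queue (url) VALUES (?)",
            [(link,) for link in parser.links],
        )
        conn.commit()
    return pages
```

Keeping the list in a table means a crawl can be stopped and resumed, and the primary key on `url` prevents the same page from being queued twice. Anything pointed at real sites would also want a delay between requests and a `robots.txt` check, both left out here.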

anomaly

Post 8

Posted 2012-5-22 15:22

Wikipedia lists plenty of them:
Open-source crawlers
Aspseek is a crawler, indexer and search engine written in C++ and licensed under the GPL.
DataparkSearch is a crawler and search engine released under the GNU General Public License.
GNU Wget is a command-line-operated crawler written in C and released under the GPL. It is typically used to mirror Web and FTP sites.
GRUB is an open source distributed search crawler that Wikia Search used to crawl the web.
Heritrix is the Internet Archive's archival-quality crawler, designed for archiving periodic snapshots of a large portion of the Web. It was written in Java.
ht://Dig includes a Web crawler in its indexing engine.
HTTrack uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in C and released under the GPL.
ICDL Crawler is a cross-platform web crawler written in C++ and intended to crawl Web sites based on Web-site Parse Templates using only the computer's free CPU resources.
mnoGoSearch is a crawler, indexer and search engine written in C and licensed under the GPL (Linux machines only).
Nutch is a crawler written in Java and released under an Apache License. It can be used in conjunction with the Lucene text-indexing package.
Open Search Server is search engine and web crawler software released under the GPL.
Pavuk is a command-line Web mirror tool with an optional X11 GUI crawler, released under the GPL. It has a bunch of advanced features compared to wget and httrack, e.g., regular-expression-based filtering and file creation rules.
PHP-Crawler is a simple PHP and MySQL based crawler released under the BSD license. Easy to install, it became popular for small MySQL-driven websites on shared hosting.
the tkWWW Robot, a crawler based on the tkWWW web browser (licensed under GPL).
YaCy, a free distributed search engine, built on principles of peer-to-peer networks (licensed under GPL).
Seeks, a free distributed search engine (licensed under Affero General Public License).

That said, isn't it a bit outdated to still be handling crawler data with SQL these days?
