Wikipedia lists plenty of them:
Open-source crawlers
Aspseek is a crawler, indexer, and search engine written in C++ and licensed under the GPL.
DataparkSearch is a crawler and search engine released under the GNU General Public License.
GNU Wget is a command-line-operated crawler written in C and released under the GPL. It is typically used to mirror Web and FTP sites (a sample invocation follows this list).
GRUB is an open source distributed search crawler that Wikia Search used to crawl the web.
Heritrix is the Internet Archive's archival-quality crawler, designed for archiving periodic snapshots of a large portion of the Web. It was written in Java.
ht://Dig includes a Web crawler in its indexing engine.
HTTrack uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in C and released under the GPL.
ICDL Crawler is a cross-platform web crawler written in C++, intended to crawl Web sites based on Web-site Parse Templates using only the computer's free CPU resources.
mnoGoSearch is a crawler, indexer, and search engine written in C and licensed under the GPL (Linux machines only).
Nutch is a crawler written in Java and released under an Apache License. It can be used in conjunction with the Lucene text-indexing package.
Open Search Server is a search engine and web crawler software released under the GPL.
Pavuk is a command-line Web mirror tool with an optional X11 GUI, released under the GPL. It has a bunch of advanced features compared to wget and httrack, e.g., regular-expression-based filtering and file-creation rules.
PHP-Crawler is a simple PHP- and MySQL-based crawler released under the BSD license. Easy to install, it became popular for small MySQL-driven websites on shared hosting.
tkWWW Robot, a crawler based on the tkWWW web browser (licensed under the GPL).
YaCy, a free distributed search engine built on principles of peer-to-peer networks (licensed under the GPL).
Seeks, a free distributed search engine (licensed under the Affero General Public License).
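Since Wget is the one most people actually reach for, here is a minimal mirroring invocation as a sketch; the URL is just a placeholder:

wget --mirror --convert-links --page-requisites --no-parent https://example.com/

--mirror turns on recursion and timestamping, --convert-links rewrites links so the copy browses locally, --page-requisites pulls in the images and CSS each page needs, and --no-parent keeps the crawl from climbing above the starting directory.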
But isn't it a bit behind the times to still be using SQL to handle crawler data in this day and age?
I went to the woods because I wished to live deliberately, to front only the essential facts of life, and see if I could not learn what it had to teach, and not, when I came to die, discover that I had not lived.