Crawlee: an open-source web scraping and browser automation library

Technology | 08-25 | Source: HelloWeb3

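Introduction

ℹ️ Crawlee is the successor to Apify SDK. It was completely rewritten in TypeScript for a better developer experience and stronger anti-blocking features. The interface is almost identical to Apify SDK, so upgrading is straightforward. Read the upgrade guide to learn about the changes.

Crawlee covers crawling and scraping end to end and helps you build reliable scrapers quickly.

Even with the default configuration, your crawlers appear human-like and fly under the radar of modern bot protections. Crawlee gives you the tools to crawl the web for links, scrape data, and store it to disk or the cloud, while staying configurable to suit your project's needs.

Crawlee is available as the crawlee NPM package.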

Installation

We recommend visiting the introduction tutorial in the Crawlee documentation for more information.

Crawlee requires Node.js 16 or higher.
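The version requirement can be checked up front before installing anything. The following is a minimal sketch, not part of Crawlee: `isSupportedNodeVersion` is a hypothetical helper that compares the major component of a version string against the minimum of 16 stated above.

```typescript
// Hypothetical helper (not part of Crawlee): checks a Node.js version string
// such as "18.17.1" or "v16.0.0" against a minimum major version.
function isSupportedNodeVersion(version: string, minMajor: number = 16): boolean {
    // Strip an optional leading "v" and take the major component.
    const [majorPart = ''] = version.replace(/^v/, '').split('.');
    const major = Number.parseInt(majorPart, 10);
    // A non-numeric version string yields NaN and is treated as unsupported.
    return Number.isInteger(major) && major >= minMajor;
}
```

Passing `process.versions.node` to this helper checks the interpreter that is currently running; running `node --version` in a terminal gives the same information.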

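Using the Crawlee CLI

The fastest way to try Crawlee is to use the Crawlee CLI and select the Getting started example. The CLI will install all the necessary dependencies and add boilerplate code for you to work with.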

npx crawlee create my-crawler
cd my-crawler
npm start

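Manual installation

If you prefer to add Crawlee to your own project, try the example below. Because it uses PlaywrightCrawler, we also need to install Playwright. It is not bundled with Crawlee, which keeps the install size down.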

npm install crawlee playwright

import { PlaywrightCrawler, Dataset } from 'crawlee';

// PlaywrightCrawler crawls the web using a headless
// browser controlled by the Playwright library.
const crawler = new PlaywrightCrawler({
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await Dataset.pushData({ title, url: request.loadedUrl });

        // Extract links from the current page
        // and add them to the crawling queue.
        await enqueueLinks();
    },
    // Uncomment this option to see the browser window.
    // headless: false,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://crawlee.dev']);
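To make the "extract links and add them to the crawling queue" step more concrete, here is a rough sketch of the kind of filtering such a step might perform. This is not Crawlee's implementation: `selectLinksToEnqueue` is a hypothetical helper, and it assumes that links are resolved against the page URL, kept only when they are http(s) links on the same hostname, and deduplicated after dropping the fragment.

```typescript
// Hypothetical helper, not Crawlee's implementation: resolve raw hrefs against
// the page URL, keep same-hostname http(s) links, and drop duplicates.
function selectLinksToEnqueue(pageUrl: string, hrefs: string[]): string[] {
    const base = new URL(pageUrl);
    const seen = new Set<string>();
    for (const href of hrefs) {
        let resolved: URL;
        try {
            resolved = new URL(href, base);
        } catch {
            continue; // Skip hrefs that cannot be parsed as URLs.
        }
        // Ignore non-web schemes such as mailto: or javascript:.
        if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') continue;
        // Assumed policy: stay on the hostname of the current page.
        if (resolved.hostname !== base.hostname) continue;
        resolved.hash = ''; // Fragments point at the same document.
        seen.add(resolved.href);
    }
    return Array.from(seen);
}

const links = selectLinksToEnqueue('https://crawlee.dev/docs/', [
    '/api',
    'quick-start',
    'https://crawlee.dev/api#top',
    'https://example.com/',
    'mailto:a@b.c',
]);
console.log(links);
// → [ 'https://crawlee.dev/api', 'https://crawlee.dev/docs/quick-start' ]
```

In the real crawler, `enqueueLinks()` takes care of this step; the sketch only illustrates why a crawl started from one URL does not wander across the whole web.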

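By default, Crawlee stores data in ./storage in the current working directory. You can override this directory through the Crawlee configuration. For details, see the configuration guide, request storage, and result storage.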
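The default layout can be illustrated with a small sketch. The helpers below are hypothetical and not part of the Crawlee API. They assume the override is supplied through a `CRAWLEE_STORAGE_DIR` environment variable (check the configuration guide for the exact option name) and that datasets live under `datasets/<name>` inside the storage directory, as the `./storage/datasets/default` comment in the example suggests.

```typescript
// Hypothetical helpers for illustration; not part of the Crawlee API.

// Returns the storage directory: the override if one is set, else ./storage.
// The CRAWLEE_STORAGE_DIR variable name is an assumption; see the configuration guide.
function resolveStorageDir(env: Record<string, string | undefined>): string {
    const override = env['CRAWLEE_STORAGE_DIR']?.trim();
    return override ? override : './storage';
}

// Builds the directory of a named dataset inside the storage directory.
function datasetDir(storageDir: string, datasetName: string = 'default'): string {
    const trimmed = storageDir.replace(/\/+$/, ''); // Drop trailing slashes.
    return `${trimmed}/datasets/${datasetName}`;
}
```

With no override, `datasetDir(resolveStorageDir({}))` yields `./storage/datasets/default`, which is where the example above writes its JSON results.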