Jeff Pereira b4292c1ea3 setting up docker to ts playwright service		2024-07-03 11:55:39 -07:00
helpers	new playwright service	2024-06-26 12:32:30 -07:00
Dockerfile	setting up docker to ts playwright service	2024-07-03 11:55:39 -07:00
README.md	Changed port and added "using with firecrawl" section on readme	2024-06-28 11:51:24 -03:00
api.ts	Changed port and added "using with firecrawl" section on readme	2024-06-28 11:51:24 -03:00
package.json	new playwright service	2024-06-26 12:32:30 -07:00
tsconfig.json	setting up docker to ts playwright service	2024-07-03 11:55:39 -07:00

README.md

Playwright Scrape API

This is a simple web scraping service built with Express and Playwright.

Features

Scrapes HTML content from specified URLs.
Blocks requests to known ad-serving domains.
Blocks media files to reduce bandwidth usage.
Uses random user-agent strings to avoid detection.
Applies a wait strategy so that the page is fully rendered before its HTML is returned.
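
The blocking and user-agent features above could be sketched as two small pure functions. This is an illustration only, not the code in api.ts: the domain list, the media resource types, the user-agent strings, and the names `shouldBlock` and `pickUserAgent` are all assumptions.

```typescript
// Hypothetical sketch: the lists and function names below are illustrative
// and are not taken from this service's source.

// A few example ad-serving hosts; a real list would be much longer.
const AD_DOMAINS: readonly string[] = [
  "doubleclick.net",
  "googlesyndication.com",
  "adservice.google.com",
];

// Resource types treated as media here (an assumption).
const MEDIA_TYPES: readonly string[] = ["image", "media"];

// Example user-agent strings to rotate through.
const USER_AGENTS: readonly string[] = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
];

// True when the hostname is a listed ad domain or one of its subdomains.
function isAdHost(hostname: string): boolean {
  return AD_DOMAINS.some((d) => hostname === d || hostname.endsWith("." + d));
}

// Decide whether a request should be aborted, given its URL and resource type.
function shouldBlock(url: string, resourceType: string): boolean {
  if (MEDIA_TYPES.includes(resourceType)) return true;
  try {
    return isAdHost(new URL(url).hostname);
  } catch {
    return false; // unparseable URL: let the request through
  }
}

// Pick one user agent; the random source is injectable to keep this testable.
function pickUserAgent(random: () => number = Math.random): string {
  return USER_AGENTS[Math.floor(random() * USER_AGENTS.length)];
}
```

With Playwright, a predicate like this could be called from a `page.route("**/*", ...)` handler that calls `route.abort()` or `route.continue()` based on `route.request().url()` and `route.request().resourceType()`. The chosen string could be passed as the `userAgent` option of `browser.newContext()`.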

Install

npm install
npx playwright install

RUN

Build and start the compiled service:

npm run build
npm start

Or, during development:

npm run dev

USE

curl -X POST http://localhost:3003/scrape \
-H "Content-Type: application/json" \
-d '{
  "url": "https://example.com",
  "wait_after_load": 1000,
  "timeout": 15000,
  "headers": {
    "Custom-Header": "value"
  },
  "check_selector": "#content"
}'
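
The same request can be sent from TypeScript. The sketch below mirrors the fields of the curl example. Which fields are optional, the `ScrapeRequest` and `FetchLike` names, and the `scrape` helper itself are assumptions. The response format is not documented here, so the body is returned as raw text.

```typescript
// Request body fields as they appear in the curl example.
// Treating everything except `url` as optional is an assumption.
interface ScrapeRequest {
  url: string;
  wait_after_load?: number; // example value 1000; units are not stated (likely ms)
  timeout?: number;         // example value 15000; units are not stated (likely ms)
  headers?: Record<string, string>;
  check_selector?: string;
}

// Minimal structural type so the sketch works with the global fetch
// or with a stub in tests.
type FetchLike = (
  input: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

// POST a scrape request and return the raw response body.
async function scrape(
  fetchImpl: FetchLike,
  endpoint: string,
  request: ScrapeRequest,
): Promise<string> {
  const response = await fetchImpl(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(request),
  });
  const body = await response.text();
  if (!response.ok) {
    throw new Error(`Scrape request failed with status ${response.status}: ${body}`);
  }
  return body;
}
```

On Node 18 or later, pass the global `fetch` as `fetchImpl` and the URL from the curl command as `endpoint`.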

USING WITH FIRECRAWL

Add PLAYWRIGHT_MICROSERVICE_URL=http://localhost:3003/scrape to /apps/api/.env. This configures the Firecrawl API to send its scraping operations to this Playwright microservice.
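
As a rough illustration of how a consumer might read this setting, here is a hypothetical helper. It is not taken from the Firecrawl codebase. The name `resolvePlaywrightUrl` and the validation rules are assumptions.

```typescript
// Hypothetical helper: shows one way to read and validate the variable.
type Env = Record<string, string | undefined>;

// Returns the configured endpoint, or undefined when the variable is unset
// or blank. Throws if the value is not a valid http(s) URL.
function resolvePlaywrightUrl(env: Env): string | undefined {
  const raw = env["PLAYWRIGHT_MICROSERVICE_URL"]?.trim();
  if (!raw) return undefined;

  const parsed = new URL(raw); // throws a TypeError on malformed input
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error(`Unsupported protocol in PLAYWRIGHT_MICROSERVICE_URL: ${parsed.protocol}`);
  }
  return parsed.toString();
}
```

In a Node process this could be called as `resolvePlaywrightUrl(process.env)`. An `undefined` result would then mean that the microservice is not configured.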