Description

Discover and scrape public Facebook pages and groups by location and category with browser simulation and export data in JSON or CSV formats.

Usage Instructions (SKILL.md)

Facebook Page & Group Scraper

Name: Facebook Scraper
Author: arulmozhiv

Part of ScrapeClaw — a suite of production-ready, agentic social media scrapers for Instagram, YouTube, X/Twitter, and Facebook built with Python & Playwright, no API keys required.

A browser-based Facebook page and group discovery and scraping tool.

---
name: facebook-scraper
description: Discover and scrape Facebook pages and public groups from your browser.
emoji: 📘
version: 1.0.0
author: influenza
tags:
  - facebook
  - scraping
  - social-media
  - page-discovery
  - group-discovery
  - business-pages
metadata:
  clawdbot:
    requires:
      bins:
        - python3
        - chromium

    config:
      stateDirs:
        - data/output
        - data/queue
        - thumbnails
      outputFormats:
        - json
        - csv
---

Overview

This skill provides a two-phase Facebook scraping system:

Page/Group Discovery
Browser Scraping

Features

🔍 - Discover Facebook pages and groups by location and category
🌐 - Full browser simulation for accurate scraping
🛡️ - Browser fingerprinting, human behavior simulation, and stealth scripts
📊 - Page/group info, stats, images, and engagement data
💾 - JSON/CSV export with downloaded thumbnails
🔄 - Resume interrupted scraping sessions
⚡ - Auto-skip private groups, low-like pages, empty profiles
📂 - Supports pages, groups, and public profiles via --type flag

Getting Google API Credentials (Optional)

Go to Google Cloud Console
Create a new project or select existing
Enable "Custom Search API"
Create API credentials → API Key
Go to Programmable Search Engine
Create a search engine with facebook.com as the site to search
Copy the Search Engine ID
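
As a rough sketch of how the API key and Search Engine ID from the steps above might be used, the snippet below builds a request URL for the Google Custom Search JSON API. The endpoint and the `key`, `cx`, `q`, and `num` parameters belong to that API; the query wording and the `build_discovery_url` helper are assumptions, since the skill's real discovery queries are not documented.

```python
from urllib.parse import urlencode

# Public endpoint of the Google Custom Search JSON API.
CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_discovery_url(api_key: str, search_engine_id: str,
                        location: str, category: str,
                        entity_type: str = "page") -> str:
    """Return a Custom Search URL for one location/category pair.

    Hypothetical helper: the query text is a guess, not the skill's own.
    """
    query = f"{category} {location}"
    if entity_type == "group":
        # Assumed hint to bias results toward facebook.com/groups URLs.
        query += " inurl:groups"
    params = {"key": api_key, "cx": search_engine_id, "q": query, "num": 10}
    return f"{CSE_ENDPOINT}?{urlencode(params)}"


page_url = build_discovery_url("YOUR_API_KEY", "YOUR_ENGINE_ID", "Miami", "restaurant")
group_url = build_discovery_url("YOUR_API_KEY", "YOUR_ENGINE_ID", "New York", "fitness", "group")
print(page_url)
print(group_url)
```

Because the search engine is restricted to facebook.com in the Programmable Search Engine settings, the query itself does not need a `site:` filter.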

Usage

Agent Tool Interface

For OpenClaw agent integration, the skill provides JSON output:

# Discover Facebook pages (returns JSON)
discover --location "Miami" --category "restaurant" --type page --output json

# Discover Facebook groups (returns JSON)
discover --location "New York" --category "fitness" --type group --output json

# Scrape single page (returns JSON)
scrape --page-name examplebusiness --output json

# Scrape single group (returns JSON)
scrape --page-name examplegroup --type group --output json
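
For an agent that calls these commands from Python, one possible approach is to build the argument list, run it, and decode the JSON from standard output. The sketch below covers only the argument building and decoding; the helper names are hypothetical, and the shape of the returned JSON is assumed to follow the structures documented under Output Data.

```python
import json
from typing import Any, List


def discover_args(location: str, category: str, entity_type: str = "page") -> List[str]:
    """Argument list for the documented `discover` command with JSON output."""
    return ["discover", "--location", location, "--category", category,
            "--type", entity_type, "--output", "json"]


def scrape_args(page_name: str, entity_type: str = "page") -> List[str]:
    """Argument list for the documented `scrape` command with JSON output."""
    args = ["scrape", "--page-name", page_name]
    if entity_type != "page":
        args += ["--type", entity_type]
    return args + ["--output", "json"]


def parse_result(stdout: str) -> Any:
    """Decode the JSON a command printed; raises ValueError if it is malformed."""
    return json.loads(stdout)


# To actually run a command, something like the following could be used
# (assumption: the entry point is `python main.py`, as in the proxy section):
#   subprocess.run(["python", "main.py", *scrape_args("examplebusiness")],
#                  capture_output=True, text=True, check=True)

# Hypothetical sample output, used here only to exercise parse_result.
sample_stdout = '{"page_name": "examplebusiness", "entity_type": "page"}'
result = parse_result(sample_stdout)
print(scrape_args("examplegroup", "group"))
print(result["page_name"])
```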

Output Data

Page/Group Data Structure

{
  "page_name": "example_business",
  "display_name": "Example Business",
  "entity_type": "page",
  "category": "Restaurant",
  "subcategory": "Italian Restaurant",
  "about": "Family-owned Italian restaurant since 1985",
  "followers": 45000,
  "page_likes": 42000,
  "location": "Miami, FL",
  "address": "123 Main St, Miami, FL 33101",
  "phone": "+1-555-0123",
  "email": "[email protected]",
  "website": "https://example.com",
  "hours": "Mon-Sat 11AM-10PM",
  "is_verified": false,
  "page_tier": "mid",
  "profile_pic_local": "thumbnails/example_business/profile_abc123.jpg",
  "cover_photo_local": "thumbnails/example_business/cover_def456.jpg",
  "recent_posts": [
    {"post_url": "https://facebook.com/example_business/posts/123", "reactions": 320, "comments": 45, "shares": 12}
  ],
  "scrape_timestamp": "2026-02-20T14:30:00"
}

Group Data Structure

{
  "page_name": "example_group",
  "display_name": "Miami Fitness Community",
  "entity_type": "group",
  "about": "A community for fitness enthusiasts in Miami",
  "members": 15000,
  "privacy": "Public",
  "posts_per_day": 25,
  "location": "Miami",
  "page_tier": "mid",
  "profile_pic_local": "thumbnails/example_group/profile_abc123.jpg",
  "cover_photo_local": "thumbnails/example_group/cover_def456.jpg",
  "scrape_timestamp": "2026-02-20T14:30:00"
}
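
Pages and groups share most fields but report audience size differently (`page_likes`/`followers` versus `members`). The following sketch shows one way a consumer might normalize the two record types. The helpers are hypothetical, and preferring `page_likes` over `followers` for pages is an assumption.

```python
from typing import Any, Dict

# Trimmed copies of the two example records above.
PAGE = {
    "page_name": "example_business",
    "entity_type": "page",
    "followers": 45000,
    "page_likes": 42000,
    "recent_posts": [
        {"post_url": "https://facebook.com/example_business/posts/123",
         "reactions": 320, "comments": 45, "shares": 12}
    ],
}
GROUP = {"page_name": "example_group", "entity_type": "group", "members": 15000}


def audience_size(record: Dict[str, Any]) -> int:
    """Members for groups; page likes (falling back to followers) for pages."""
    if record.get("entity_type") == "group":
        return int(record.get("members", 0))
    return int(record.get("page_likes", record.get("followers", 0)))


def total_engagement(record: Dict[str, Any]) -> int:
    """Sum reactions, comments, and shares over recent_posts (absent for groups)."""
    return sum(
        post.get("reactions", 0) + post.get("comments", 0) + post.get("shares", 0)
        for post in record.get("recent_posts", [])
    )


print(audience_size(PAGE), total_engagement(PAGE))    # 42000 377
print(audience_size(GROUP), total_engagement(GROUP))  # 15000 0
```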

Page Tiers

Tier	Likes/Members Range
nano	< 1,000
micro	1,000 - 10,000
mid	10,000 - 100,000
macro	100,000 - 1M
mega	> 1,000,000
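
The tier table can be expressed as a small lookup function. This is a minimal sketch, not the skill's own code: the table does not say which tier the shared boundary values belong to, so the sketch assumes lower bounds are inclusive (10,000 is `mid`, 100,000 is `macro`) and that exactly 1,000,000 is still `macro`, since `mega` is listed as strictly greater.

```python
def page_tier(count: int) -> str:
    """Map a like/member count to a tier name from the table above."""
    if count < 0:
        raise ValueError("count must not be negative")
    if count < 1_000:
        return "nano"
    if count < 10_000:
        return "micro"
    if count < 100_000:
        return "mid"
    if count <= 1_000_000:
        return "macro"
    return "mega"


# Both documented examples (42,000 likes and 15,000 members) land in "mid".
print(page_tier(42_000), page_tier(15_000))
```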

File Outputs

Queue files: data/queue/{location}_{category}_{type}_{timestamp}.json
Scraped data: data/output/{page_name}.json
Thumbnails: thumbnails/{page_name}/profile_*.jpg, thumbnails/{page_name}/cover_*.jpg
Export files: data/export_{timestamp}.json, data/export_{timestamp}.csv
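
To illustrate the naming patterns above, here is a hedged sketch that builds the queue and output paths. The timestamp format and the way spaces in a location are handled are not specified in this document, so `%Y%m%d_%H%M%S` and lowercase-with-underscores are assumptions, and both helper names are hypothetical.

```python
from datetime import datetime
from pathlib import Path


def queue_path(location: str, category: str, entity_type: str, when: datetime) -> Path:
    """data/queue/{location}_{category}_{type}_{timestamp}.json"""
    slug = location.strip().lower().replace(" ", "_")  # assumed normalization
    stamp = when.strftime("%Y%m%d_%H%M%S")             # assumed timestamp format
    return Path("data/queue") / f"{slug}_{category}_{entity_type}_{stamp}.json"


def output_path(page_name: str) -> Path:
    """data/output/{page_name}.json"""
    return Path("data/output") / f"{page_name}.json"


when = datetime(2026, 2, 20, 14, 30, 0)
print(queue_path("New York", "fitness", "group", when).as_posix())
print(output_path("example_business").as_posix())
```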

Configuration

Edit config/scraper_config.json:

{
  "google_search": {
    "enabled": true,
    "api_key": "",
    "search_engine_id": "",
    "queries_per_location": 3
  },
  "scraper": {
    "headless": false,
    "min_likes": 1000,
    "download_thumbnails": true,
    "max_thumbnails": 6
  },
  "cities": ["New York", "Los Angeles", "Miami", "Chicago"],
  "categories": ["restaurant", "retail", "fitness", "real-estate", "healthcare", "beauty"]
}
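
Note that the sample above enables Google search while leaving both credentials empty. A quick sanity check before a run could look like the sketch below; `check_config` is a hypothetical helper, and the rules it applies are assumptions about what a valid configuration needs rather than checks the skill is known to perform.

```python
import json
from typing import Any, Dict, List

SAMPLE_CONFIG = """
{
  "google_search": {"enabled": true, "api_key": "", "search_engine_id": "",
                    "queries_per_location": 3},
  "scraper": {"headless": false, "min_likes": 1000,
              "download_thumbnails": true, "max_thumbnails": 6},
  "cities": ["New York", "Los Angeles", "Miami", "Chicago"],
  "categories": ["restaurant", "retail", "fitness", "real-estate", "healthcare", "beauty"]
}
"""


def check_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    search = config.get("google_search", {})
    if search.get("enabled"):
        for key in ("api_key", "search_engine_id"):
            if not search.get(key):
                problems.append(f"google_search.{key} is empty but google_search.enabled is true")
    if config.get("scraper", {}).get("min_likes", 0) < 0:
        problems.append("scraper.min_likes must not be negative")
    for key in ("cities", "categories"):
        if not config.get(key):
            problems.append(f"{key} is empty")
    return problems


problems = check_config(json.loads(SAMPLE_CONFIG))
for problem in problems:
    print(problem)
```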

Filters Applied

The scraper automatically filters out:

❌ Private groups
❌ Pages with < 1,000 likes (configurable)
❌ Deactivated or removed pages
❌ Non-existent pages/groups
❌ Already scraped entries (deduplication)
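
The data-driven filters above can be sketched as a single function that returns the reason an entry would be skipped. This is illustrative only: deactivated and non-existent pages can only be detected in the browser and are not modeled here, and the `skip_reason` helper and its return strings are hypothetical.

```python
from typing import Any, Dict, Iterable, Optional


def skip_reason(record: Dict[str, Any], min_likes: int = 1000,
                already_scraped: Iterable[str] = ()) -> Optional[str]:
    """Return why a record would be filtered out, or None to keep it."""
    if record.get("page_name") in set(already_scraped):
        return "already scraped"
    if record.get("entity_type") == "group":
        # Groups are kept only when explicitly marked public.
        if str(record.get("privacy", "")).lower() != "public":
            return "private group"
        return None
    if record.get("page_likes", 0) < min_likes:
        return "below min_likes"
    return None


records = [
    {"page_name": "example_business", "entity_type": "page", "page_likes": 42000},
    {"page_name": "tiny_shop", "entity_type": "page", "page_likes": 500},
    {"page_name": "example_group", "entity_type": "group", "privacy": "Public"},
    {"page_name": "closed_group", "entity_type": "group", "privacy": "Private"},
]
kept = [r["page_name"] for r in records if skip_reason(r) is None]
print(kept)
```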

Troubleshooting

Login Issues

Ensure credentials are correct
Handle verification codes when prompted
Wait if rate limited (the script will auto-retry)

No Pages Discovered

Check Google API key and quota
Verify Search Engine ID is configured for facebook.com
Try different location/category combinations

Rate Limiting

Reduce scraping speed (increase delays)
Use multiple Facebook accounts
Run during off-peak hours
Use a residential proxy (see below)
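
"Increase delays" is commonly implemented as exponential backoff with jitter. The sketch below computes such a retry schedule; the skill's actual retry policy is not documented, so the base delay, cap, and the `retry_delays` helper are all assumptions.

```python
import random
from typing import List, Optional


def retry_delays(max_retries: int, base: float = 5.0, cap: float = 300.0,
                 rng: Optional[random.Random] = None) -> List[float]:
    """Return one wait time in seconds per retry, doubling each time up to `cap`."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        # Jitter: pick a point in the upper half of the window so parallel
        # workers do not retry at the same instant.
        delays.append(rng.uniform(ceiling / 2, ceiling))
    return delays


delays = retry_delays(5, rng=random.Random(0))
print([round(d, 1) for d in delays])
```

A caller would sleep for each value in turn (for example with `time.sleep`) before retrying the rate-limited request.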

🌐 Residential Proxy Support

Why Use a Residential Proxy?

Running a scraper at scale without a residential proxy will get your IP blocked fast. Here's why proxies are essential for long-running scrapes:

Advantage	Description
Avoid IP Bans	Residential IPs look like real household users, not data-center bots. Facebook is far less likely to flag them.
Automatic IP Rotation	Each request (or session) gets a fresh IP, so rate-limits never stack up on one address.
Geo-Targeting	Route traffic through a specific country/city so scraped content matches the target audience's locale.
Sticky Sessions	Keep the same IP for a configurable window (e.g. 10 min) — critical for maintaining a Facebook login session.
Higher Success Rate	Rotating residential IPs deliver 95%+ success rates compared to ~30% with data-center proxies on Facebook.
Long-Running Scrapes	Scrape thousands of pages/groups over hours or days without interruption.
Concurrent Scraping	Run multiple browser instances across different IPs simultaneously.

Recommended Proxy Providers

We have affiliate partnerships with top residential proxy providers. Using these links supports continued development of this skill:

Provider	Best For	Sign Up
Bright Data	World's largest residential network, 72M+ IPs, enterprise-grade	👉 Sign Up for Bright Data
IProyal	Premium residential pool, pay-as-you-go, 195+ countries	👉 Sign Up for IProyal
Storm Proxies	Fast & reliable residential IPs, developer-friendly API	👉 Sign Up for Storm Proxies
NetNut	ISP-grade residential network, 52M+ IPs, direct connectivity	👉 Sign Up for NetNut

Setup Steps

1. Get Your Proxy Credentials

Sign up with any provider above, then grab:

Username (from your provider dashboard)
Password (from your provider dashboard)
Host and Port are pre-configured per provider (or use custom)

2. Configure Entirely via Environment Variables

export PROXY_ENABLED=true
export PROXY_PROVIDER=netnut       # brightdata | iproyal | stormproxies | netnut | custom
export PROXY_USERNAME=your_user
export PROXY_PASSWORD=your_pass
export PROXY_COUNTRY=us            # optional: two-letter country code
export PROXY_STICKY=true           # optional: keep same IP per session

3. Provider-Specific Host/Port Defaults

These are auto-configured when you set the provider name:

Provider	Host	Port
Bright Data	`brd.superproxy.io`	`22225`
IProyal	`proxy.iproyal.com`	`12321`
Storm Proxies	`rotating.stormproxies.com`	`9999`
NetNut	`gw-resi.netnut.io`	`5959`

Override with "host" and "port" in config or PROXY_HOST / PROXY_PORT env vars if your plan uses a different gateway.
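
The defaults-with-override behavior described above can be sketched as a dictionary lookup. The host and port values are copied from the table; the `resolve_gateway` helper is hypothetical and only approximates whatever the skill does internally.

```python
from typing import Optional, Tuple

# Host/port defaults from the provider table above.
PROVIDER_DEFAULTS = {
    "brightdata": ("brd.superproxy.io", 22225),
    "iproyal": ("proxy.iproyal.com", 12321),
    "stormproxies": ("rotating.stormproxies.com", 9999),
    "netnut": ("gw-resi.netnut.io", 5959),
}


def resolve_gateway(provider: str, host: Optional[str] = None,
                    port: Optional[int] = None) -> Tuple[str, int]:
    """Use explicit host/port when given, else the provider's defaults."""
    default_host, default_port = PROVIDER_DEFAULTS.get(provider, (None, None))
    host = host or default_host
    port = port or default_port
    if host is None or port is None:
        raise ValueError(f"provider {provider!r} needs an explicit host and port")
    return host, int(port)


print(resolve_gateway("netnut"))
print(resolve_gateway("brightdata", port=33335))  # hypothetical override port
```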

4. Custom Proxy Provider

For any other proxy service, set provider to custom and supply host/port manually:

{
  "proxy": {
    "enabled": true,
    "provider": "custom",
    "host": "your.proxy.host",
    "port": 8080,
    "username": "user",
    "password": "pass"
  }
}
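
A `proxy` block like the one above maps naturally onto the proxy settings Playwright accepts (`server`, `username`, `password`). The conversion below is a minimal sketch under the assumption of a plain HTTP gateway; `to_playwright_proxy` is a hypothetical helper, not part of the skill.

```python
import json
from typing import Any, Dict, Optional

CUSTOM_CONFIG = json.loads("""
{
  "proxy": {
    "enabled": true,
    "provider": "custom",
    "host": "your.proxy.host",
    "port": 8080,
    "username": "user",
    "password": "pass"
  }
}
""")


def to_playwright_proxy(proxy_cfg: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Convert the config block into Playwright's proxy dict, or None if disabled."""
    if not proxy_cfg.get("enabled"):
        return None
    return {
        "server": f"http://{proxy_cfg['host']}:{proxy_cfg['port']}",  # assumes HTTP scheme
        "username": proxy_cfg.get("username", ""),
        "password": proxy_cfg.get("password", ""),
    }


proxy = to_playwright_proxy(CUSTOM_CONFIG["proxy"])
print(proxy)
# With Playwright installed, the dict could be passed along as, for example:
#   browser = playwright.chromium.launch(proxy=proxy)
```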

Running the Scraper with Proxy

Once configured, the scraper picks up the proxy automatically — no extra flags needed:

# Discover and scrape as usual — proxy is applied automatically
python main.py discover --location "Miami" --category "restaurant" --type page
python main.py scrape --page-name examplebusiness

# The log will confirm proxy is active:
# INFO - Proxy enabled: <ProxyManager provider=netnut enabled host=gw-resi.netnut.io:5959>
# INFO - Browser using proxy: netnut → gw-resi.netnut.io:5959

Using the Proxy Manager Programmatically

from proxy_manager import ProxyManager

# From config (auto-reads config/scraper_config.json)
pm = ProxyManager.from_config()

# From environment variables
pm = ProxyManager.from_env()

# Manual construction
pm = ProxyManager(
    provider="netnut",
    username="your_user",
    password="your_pass",
    country="us",
    sticky=True
)

# For Playwright browser context
proxy = pm.get_playwright_proxy()
# → {"server": "http://gw-resi.netnut.io:5959", "username": "your_user-country-us-session-abc123", "password": "your_pass"}

# For requests / aiohttp
proxies = pm.get_requests_proxy()
# → {"http": "http://user:pass@host:port", "https": "http://user:pass@host:port"}

# Force new IP (rotates session ID)
pm.rotate_session()

# Debug info
print(pm.info())
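
The Playwright example above shows a proxy username that carries `-country-` and `-session-` suffixes, which suggests that sticky sessions and geo-targeting are encoded in the username. The sketch below reproduces that pattern. The exact format differs between providers, so treat the suffix layout and both helper names as assumptions rather than the `ProxyManager` implementation.

```python
import secrets
from typing import Optional


def new_session_id() -> str:
    """Random 8-character hex token; a new value typically maps to a new exit IP."""
    return secrets.token_hex(4)


def session_username(username: str, country: Optional[str] = None,
                     session_id: Optional[str] = None) -> str:
    """Append optional country and sticky-session suffixes to the base username."""
    parts = [username]
    if country:
        parts += ["country", country.lower()]
    if session_id:
        parts += ["session", session_id]
    return "-".join(parts)


print(session_username("user", "us", "abc123"))          # user-country-us-session-abc123
print(session_username("user", "us", new_session_id()))  # fresh session, fresh IP
```

Under this scheme, rotating a session amounts to generating a new session ID while keeping the same base username and password.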

Best Practices for Long-Running Scrapes

Always use sticky sessions — Facebook requires consistent IPs during a login session. Set "sticky": true.
Target the right country — Set "country": "us" (or your target region) so Facebook serves content in the expected locale.
Combine with existing anti-detection — This scraper already has fingerprinting, stealth scripts, and human behavior simulation. The proxy is the final layer.
Rotate sessions between accounts — Call pm.rotate_session() when switching Facebook accounts to get a fresh IP.
Use delays — Even with proxies, respect delay_between_profiles in config (default 5-10s) to avoid aggressive patterns.
Monitor your proxy dashboard — All providers (Bright Data, IProyal, Storm Proxies, NetNut) have dashboards showing bandwidth usage and success rates.

Safe Usage Recommendations

This skill describes browser-based scraping that will likely require installing Python/Playwright/Chromium, providing Facebook account credentials (and possibly multiple accounts), Google API keys (optional), and proxy credentials — none of which are declared in the registry. Before installing or using it, ask the publisher for: (1) source code or a reproducible install script; (2) an explicit list of environment variables and how credentials are supplied/stored; (3) a reliable install spec that pins packages and explains where files are written; and (4) an explanation of how authentication/verification flows are handled safely. Treat the skill as potentially privacy-invasive: only run it in an isolated environment (VM/container) and avoid providing real personal or high-privilege credentials until you can audit the code. Also consider legal/terms-of-service risks: automated scraping of Facebook and using multiple accounts or residential proxies can violate Facebook’s terms and local law.

Functional Analysis

Type: OpenClaw Skill
Name: facebook-scraper
Version: 0.1.2

The skill is classified as suspicious because its design requires handling highly sensitive credentials, specifically Facebook login credentials, Google API keys, and proxy authentication details, as outlined in SKILL.md. While these capabilities are plausibly needed for an advanced web scraper, they introduce significant security risks if the underlying code is compromised or malicious. The `SKILL.md` itself does not contain explicit instructions for prompt injection or other malicious actions by the agent, but the inherent risk associated with managing such sensitive data warrants a 'suspicious' classification.

Capability Assessment

⚠ Purpose & Capability

The SKILL.md declares runtime requirements (inside its front-matter) of python3 and chromium and describes Playwright-style browser scraping, fingerprinting, and downloading thumbnails. The registry metadata lists no required binaries, no env vars, and no config paths, which is a clear mismatch. The declared capabilities (browser scraping, proxies, credentialed logins) would legitimately require those binaries and configuration, so their absence from the registry is an inconsistency.

⚠ Instruction Scope

The runtime instructions direct the agent to discover and scrape Facebook pages/groups, download thumbnails, persist queue/output files in data/queue and data/output, and handle Facebook login flows and verification codes. They also recommend using multiple Facebook accounts and residential proxies. The SKILL.md therefore expects handling of sensitive credentials and persistent local storage. The skill does not limit or document how credentials are provided or protected, and the registry did not declare those inputs.

⚠ Install Mechanism

There is no install spec in the registry (instruction-only). However, the text implies non-trivial dependencies (Python, Playwright, Chromium, stealth scripts) and filesystem layout. The absence of an install mechanism means there is no authoritative, auditable way the agent will obtain or verify those dependencies — increasing risk and operational ambiguity.

⚠ Credentials

Although the skill's operation logically requires Facebook account credentials (for login flows), optional Google API key/Search Engine ID, and possibly proxy credentials, the registry declares no required environment variables or primary credential. This omission is disproportionate: the skill asks users to supply multiple sensitive secrets (accounts, API keys, proxy auth) in prose but does not declare them or explain storage/usage, which is a security and privacy risk.

✓ Persistence & Privilege

The skill does not request always:true and is user-invocable (defaults). It will create local files (data/queue, data/output, thumbnails) according to SKILL.md; that is expected for a scraper and is appropriately scoped. There is no indication it modifies other skills or system-wide agent settings.

Version History

v0.1.2

facebook-scraper 0.1.2
- Documentation updated in SKILL.md; no functional or feature changes in this version.
- Existing usage instructions, proxy setup, configuration, and output details remain unchanged.

v0.1.1

- Added prominent section about ScrapeClaw suite and clarified integration with other social media scrapers.
- Introduced detailed documentation for residential proxy setup, including benefits, recommended providers, and setup steps.
- Enhanced guidance on running long scrapes reliably and avoiding IP bans.
- Updated troubleshooting and best practices to address rate limiting and login/session management.
- No changes to scraper commands, output data formats, or configuration structure.

v0.1.0

Initial release of browser-based Facebook page and group scraper.
- Discover and scrape Facebook pages, public groups, and profiles with browser automation.
- Supports location and category-based discovery, human-like browsing, and stealth features.
- Outputs comprehensive JSON/CSV data, including stats, images, and engagement.
- Automatically skips private groups, low-like pages, and duplicates.
- Configurable filters, Google Search API integration, and thumbnail downloads.
- Resumes interrupted sessions and exports data to organized directories.

Metadata

Slug facebook-scraper

Version 0.1.2

License —

Total installs 2

Current installs 2

Number of versions 3

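FAQ

What is Facebook Scraper?

Discover and scrape public Facebook pages and groups by location and category with browser simulation and export data in JSON or CSV formats. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 1,090 cumulative downloads to date.

How do I install Facebook Scraper?

Run the command "/install facebook-scraper" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is Facebook Scraper free?

Yes, Facebook Scraper is completely free (free and open source) and can be freely downloaded, installed, and used.

Which platforms does Facebook Scraper support?

Facebook Scraper is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

Who developed Facebook Scraper?

It is developed and maintained by ArulmozhiV (@arulmozhiv); the current version is v0.1.2.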
