Description

Browser-based tool to discover Instagram profiles by location/category and scrape their public info, stats, images, and engagement with export options.

README (SKILL.md)

Instagram Profile Scraper

Name: Instagram Scraper
Author: arulmozhiv

A browser-based Instagram profile discovery and scraping tool.

Part of ScrapeClaw — a suite of production-ready, agentic social media scrapers for Instagram, YouTube, X/Twitter, and Facebook built with Python & Playwright, no API keys required.

---
name: instagram-scraper
description: Discover and scrape Instagram profiles from your browser.
emoji: 📸
version: 1.0.6
author: influenza
tags:
  - instagram
  - scraping
  - social-media
  - influencer-discovery
metadata:
  clawdbot:
    requires:
      bins:
        - python3
        - chromium

    config:
      stateDirs:
        - data/output
        - data/queue
        - thumbnails
      outputFormats:
        - json
        - csv
---

Overview

This skill provides a two-phase Instagram scraping system:

Profile Discovery
Browser Scraping

Features

🔍 - Discover Instagram profiles by location and category
🌐 - Full browser simulation for accurate scraping
🛡️ - Browser fingerprinting, human behavior simulation, and stealth scripts
📊 - Profile info, stats, images, and engagement data
💾 - JSON/CSV export with downloaded thumbnails
🔄 - Resume interrupted scraping sessions
⚡ - Auto-skip private accounts, low followers, empty profiles
🌍 - Built-in residential proxy support with 4 providers

Getting Google API Credentials (Optional)

Go to Google Cloud Console
Create a new project or select existing
Enable "Custom Search API"
Create API credentials → API Key
Go to Programmable Search Engine
Create a search engine with instagram.com as the site to search
Copy the Search Engine ID

Usage

Agent Tool Interface

For OpenClaw agent integration, the skill provides JSON output:

# Discover profiles (returns JSON)
discover --location "Miami" --category "fitness" --output json

# Scrape single profile (returns JSON)
scrape --username influencer123 --output json

Output Data

Profile Data Structure

{
  "username": "example_user",
  "full_name": "Example User",
  "bio": "Fashion blogger | NYC",
  "followers": 125000,
  "following": 1500,
  "posts_count": 450,
  "is_verified": false,
  "is_private": false,
  "influencer_tier": "macro",
  "category": "fashion",
  "location": "New York",
  "profile_pic_local": "thumbnails/example_user/profile_abc123.jpg",
  "content_thumbnails": [
    "thumbnails/example_user/content_1_def456.jpg",
    "thumbnails/example_user/content_2_ghi789.jpg"
  ],
  "post_engagement": [
    {"post_url": "https://instagram.com/p/ABC123/", "likes": 5420, "comments": 89}
  ],
  "scrape_timestamp": "2025-02-09T14:30:00"
}
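
As a rough illustration of how a consumer might use this record, the sketch below computes an average engagement rate from the `followers` and `post_engagement` fields shown above. The `engagement_rate` function and its formula (average likes plus comments per post, divided by followers) are assumptions made for this example; the skill itself does not document such a metric.

```python
import json

# Trimmed sample record using field names from the structure above.
sample = json.loads("""
{
  "username": "example_user",
  "followers": 125000,
  "post_engagement": [
    {"post_url": "https://instagram.com/p/ABC123/", "likes": 5420, "comments": 89}
  ]
}
""")


def engagement_rate(profile: dict) -> float:
    """Hypothetical metric: mean (likes + comments) per post / followers."""
    posts = profile.get("post_engagement", [])
    followers = profile.get("followers", 0)
    if not posts or followers <= 0:
        return 0.0  # undefined rate is reported as zero
    total = sum(p.get("likes", 0) + p.get("comments", 0) for p in posts)
    return total / len(posts) / followers


rate = engagement_rate(sample)
print(f"{sample['username']}: {rate:.2%}")
```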

Influencer Tiers

Tier	Follower Range
nano	< 1,000
micro	1,000 - 10,000
mid	10,000 - 100,000
macro	100,000 - 1M
mega	> 1,000,000
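
The tier table can be expressed as a small lookup function. This is a minimal sketch, not the skill's own code: the name `influencer_tier` is hypothetical, and because the table's ranges share their endpoints, the sketch assumes each lower bound is inclusive and that exactly 1,000,000 followers still counts as macro (since mega is listed as "> 1,000,000").

```python
def influencer_tier(followers: int) -> str:
    """Map a follower count to a tier name from the table above.

    Boundary handling is an assumption: lower bounds are inclusive,
    and mega starts strictly above 1,000,000.
    """
    if followers < 1_000:
        return "nano"
    if followers < 10_000:
        return "micro"
    if followers < 100_000:
        return "mid"
    if followers <= 1_000_000:
        return "macro"
    return "mega"


for count in (500, 5_000, 50_000, 125_000, 2_000_000):
    print(count, influencer_tier(count))
```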

File Outputs

Queue files: data/queue/{location}_{category}_{timestamp}.json
Scraped data: data/output/{username}.json
Thumbnails: thumbnails/{username}/profile_*.jpg, thumbnails/{username}/content_*.jpg
Export files: data/export_{timestamp}.json, data/export_{timestamp}.csv
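
Because each scraped profile lands in its own `data/output/{username}.json` file, a caller can tell which usernames still need scraping by checking for those files. The sketch below shows one way to do that; `output_path` and `pending` are hypothetical helpers, and the skill's actual resume logic is not documented here.

```python
from pathlib import Path
from typing import Iterable, List


def output_path(username: str, output_dir: str = "data/output") -> Path:
    """Build the per-profile output path described above."""
    return Path(output_dir) / f"{username}.json"


def pending(usernames: Iterable[str], output_dir: str = "data/output") -> List[str]:
    """Return the usernames that have no output file yet."""
    return [u for u in usernames if not output_path(u, output_dir).exists()]


print(output_path("example_user"))
```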

Configuration

Edit config/scraper_config.json:

{
  "proxy": {
    "enabled": false,
    "provider": "brightdata",
    "country": "",
    "sticky": true,
    "sticky_ttl_minutes": 10
  },
  "google_search": {
    "enabled": true,
    "api_key": "",
    "search_engine_id": "",
    "queries_per_location": 3
  },
  "scraper": {
    "headless": false,
    "min_followers": 1000,
    "download_thumbnails": true,
    "max_thumbnails": 6
  },
  "cities": ["New York", "Los Angeles", "Miami", "Chicago"],
  "categories": ["fashion", "beauty", "fitness", "food", "travel", "tech"]
}
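
To see how the `cities` and `categories` lists could drive discovery, the sketch below expands them into (location, category) pairs. It assumes that discovery is run once per combination, which the configuration suggests but does not state; `discovery_jobs` is a hypothetical helper and the embedded config is a trimmed copy of the example above.

```python
import json
from itertools import product

# Trimmed copy of the example configuration above.
CONFIG_TEXT = """
{
  "google_search": {"enabled": true, "queries_per_location": 3},
  "cities": ["New York", "Los Angeles", "Miami", "Chicago"],
  "categories": ["fashion", "beauty", "fitness", "food", "travel", "tech"]
}
"""


def discovery_jobs(config: dict) -> list:
    """Return every (location, category) pair implied by the config."""
    return list(product(config.get("cities", []), config.get("categories", [])))


config = json.loads(CONFIG_TEXT)
jobs = discovery_jobs(config)
print(len(jobs), "discovery jobs, first:", jobs[0])
```

With four cities and six categories this yields 24 pairs, which is worth keeping in mind when estimating Google API quota.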

Filters Applied

The scraper automatically filters out:

❌ Private accounts
❌ Accounts with < 1,000 followers (configurable)
❌ Accounts with no posts
❌ Non-existent/removed accounts
❌ Already scraped accounts (deduplication)
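
The filter list above can be read as a single decision function. The following is a minimal sketch under stated assumptions: `skip_reason` and its reason strings are hypothetical, a missing account is represented as `None`, and the field names come from the profile data structure shown earlier. The order of checks is also an assumption.

```python
from typing import Optional, Set


def skip_reason(
    profile: Optional[dict],
    min_followers: int = 1000,
    seen: Optional[Set[str]] = None,
) -> Optional[str]:
    """Return why a profile would be skipped, or None to keep it."""
    if profile is None:
        return "not_found"  # non-existent or removed account
    if seen and profile.get("username") in seen:
        return "duplicate"  # already scraped
    if profile.get("is_private"):
        return "private"
    if profile.get("followers", 0) < min_followers:
        return "low_followers"
    if profile.get("posts_count", 0) == 0:
        return "no_posts"
    return None


candidate = {"username": "example_user", "followers": 125000,
             "posts_count": 450, "is_private": False}
print(skip_reason(candidate))
```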

Troubleshooting

Login Issues

Ensure credentials are correct
Handle verification codes when prompted
Wait if rate limited (the script will auto-retry)

No Profiles Discovered

Check Google API key and quota
Verify Search Engine ID is configured for instagram.com
Try different location/category combinations

Rate Limiting

Reduce scraping speed (increase delays in config)
Run during off-peak hours
Use a residential proxy (see below)

🌐 Residential Proxy Support

Why Use a Residential Proxy?

Running a scraper at scale without a residential proxy will get your IP blocked fast. Here's why proxies are essential for long-running scrapes:

Advantage	Description
Avoid IP Bans	Residential IPs look like real household users, not data-center bots. Instagram is far less likely to flag them.
Automatic IP Rotation	Each request (or session) gets a fresh IP, so rate-limits never stack up on one address.
Geo-Targeting	Route traffic through a specific country/city so scraped content matches the target audience's locale.
Sticky Sessions	Keep the same IP for a configurable window (e.g. 10 min) — critical for maintaining a consistent browsing session.
Higher Success Rate	Rotating residential IPs deliver 95%+ success rates compared to ~30% with data-center proxies on Instagram.
Long-Running Scrapes	Scrape thousands of profiles over hours or days without interruption.
Concurrent Scraping	Run multiple browser instances across different IPs simultaneously.

Recommended Proxy Providers

We have affiliate partnerships with top residential proxy providers. Using these links supports continued development of this skill:

Provider	Best For	Sign Up
Bright Data	World's largest network, 72M+ IPs, enterprise-grade	👉 Get Bright Data
IProyal	Pay-as-you-go, 195+ countries, no traffic expiry	👉 Get IProyal
Storm Proxies	Fast & reliable, developer-friendly API, competitive pricing	👉 Get Storm Proxies
NetNut	ISP-grade network, 52M+ IPs, direct connectivity	👉 Get NetNut

Setup Steps

1. Get Your Proxy Credentials

Sign up with any provider above, then grab:

Username (from your provider dashboard)
Password (from your provider dashboard)
Host and Port are pre-configured per provider (or use custom)

2. Configure via Environment Variables

export PROXY_ENABLED=true
export PROXY_PROVIDER=brightdata    # brightdata | iproyal | stormproxies | netnut | custom
export PROXY_USERNAME=your_user
export PROXY_PASSWORD=your_pass
export PROXY_COUNTRY=us             # optional: two-letter country code
export PROXY_STICKY=true            # optional: keep same IP per session

3. Provider-Specific Host/Port Defaults

These are auto-configured when you set the provider name:

Provider	Host	Port
Bright Data	`brd.superproxy.io`	`22225`
IProyal	`proxy.iproyal.com`	`12321`
Storm Proxies	`rotating.stormproxies.com`	`9999`
NetNut	`gw-resi.netnut.io`	`5959`

Override with PROXY_HOST / PROXY_PORT env vars if your plan uses a different gateway.
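
The default-plus-override behaviour described above could look roughly like the sketch below. It is not the skill's `ProxyManager` code: `resolve_gateway` is a hypothetical function, the defaults are copied from the table, and the environment is passed in as a plain mapping so the logic can be tried without touching real variables.

```python
import os
from typing import Mapping, Optional, Tuple

# Host/port defaults copied from the provider table above.
PROVIDER_DEFAULTS = {
    "brightdata": ("brd.superproxy.io", 22225),
    "iproyal": ("proxy.iproyal.com", 12321),
    "stormproxies": ("rotating.stormproxies.com", 9999),
    "netnut": ("gw-resi.netnut.io", 5959),
}


def resolve_gateway(env: Optional[Mapping[str, str]] = None) -> Tuple[str, int]:
    """Pick host and port: PROXY_HOST / PROXY_PORT win over provider defaults."""
    env = os.environ if env is None else env
    provider = env.get("PROXY_PROVIDER", "custom")
    default_host, default_port = PROVIDER_DEFAULTS.get(provider, (None, None))
    host = env.get("PROXY_HOST") or default_host
    port = env.get("PROXY_PORT") or default_port
    if host is None or port is None:
        raise ValueError("custom provider needs PROXY_HOST and PROXY_PORT")
    return host, int(port)


print(resolve_gateway({"PROXY_PROVIDER": "brightdata"}))
print(resolve_gateway({"PROXY_PROVIDER": "brightdata", "PROXY_PORT": "33335"}))
```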

4. Custom Proxy Provider

For any other proxy service, set provider to custom and supply host/port manually:

{
  "proxy": {
    "enabled": true,
    "provider": "custom",
    "host": "your.proxy.host",
    "port": 8080,
    "username": "user",
    "password": "pass"
  }
}

Running the Scraper with Proxy

Once configured, the scraper picks up the proxy automatically — no extra flags needed:

# Discover and scrape as usual — proxy is applied automatically
python main.py discover --location "Miami" --category "fitness"
python main.py scrape --username influencer123

# The log will confirm proxy is active:
# INFO - Proxy enabled: <ProxyManager provider=brightdata enabled host=brd.superproxy.io:22225>
# INFO - Browser using proxy: brightdata → brd.superproxy.io:22225

Using the Proxy Manager Programmatically

from proxy_manager import ProxyManager

# From config (auto-reads config/scraper_config.json)
pm = ProxyManager.from_config()

# From environment variables
pm = ProxyManager.from_env()

# Manual construction
pm = ProxyManager(
    provider="brightdata",
    username="your_user",
    password="your_pass",
    country="us",
    sticky=True
)

# For Playwright browser context
proxy = pm.get_playwright_proxy()
# → {"server": "http://brd.superproxy.io:22225", "username": "your_user-country-us-session-abc123", "password": "your_pass"}

# For requests / aiohttp
proxies = pm.get_requests_proxy()
# → {"http": "http://user:pass@host:port", "https": "http://user:pass@host:port"}

# Force new IP (rotates session ID)
pm.rotate_session()

# Debug info
print(pm.info())
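
The example output above suggests that the country and a session ID are encoded into the proxy username, and that rotating the session changes that ID. The sketch below illustrates the idea only; `StickySession` is a hypothetical stand-in, not the real `ProxyManager`, and the `user-country-xx-session-id` layout is inferred from the sample comment rather than from any provider's documentation.

```python
import secrets


class StickySession:
    """Hypothetical illustration of a session-tagged proxy username."""

    def __init__(self, username: str, country: str = "", sticky: bool = True):
        self.username = username
        self.country = country
        self.sticky = sticky
        self.session_id = secrets.token_hex(4)

    def proxy_username(self) -> str:
        """Append country and session tags when they are set."""
        parts = [self.username]
        if self.country:
            parts += ["country", self.country]
        if self.sticky:
            parts += ["session", self.session_id]
        return "-".join(parts)

    def rotate(self) -> None:
        """Pick a new session ID that differs from the current one."""
        old = self.session_id
        while self.session_id == old:
            self.session_id = secrets.token_hex(4)


session = StickySession("your_user", country="us")
before = session.proxy_username()
session.rotate()
after = session.proxy_username()
print(before, "->", after)
```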

Best Practices for Long-Running Scrapes

Use sticky sessions — Instagram requires consistent IPs during a browsing session. Set "sticky": true.
Target the right country — Set "country": "us" (or your target region) so Instagram serves content in the expected locale.
Combine with existing anti-detection — This scraper already has fingerprinting, stealth scripts, and human behavior simulation. The proxy is the final layer.
Rotate sessions between batches — Call pm.rotate_session() between large batches of profiles to get a fresh IP.
Use delays — Even with proxies, respect delay_between_profiles in config to avoid aggressive patterns.
Monitor your proxy dashboard — All providers have dashboards showing bandwidth usage and success rates.
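
The batching, rotation, and delay advice above can be combined into one loop. This is a sketch under assumptions: `scrape_in_batches` is hypothetical, `scrape_one` and `rotate` are callables supplied by the caller (for example a scraping function and `pm.rotate_session`), and the batch size and delay values are placeholders rather than recommended settings.

```python
import time
from typing import Callable, Iterable, List


def scrape_in_batches(
    usernames: Iterable[str],
    scrape_one: Callable[[str], object],
    rotate: Callable[[], None],
    batch_size: int = 25,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[object]:
    """Scrape sequentially, rotating the proxy session between batches."""
    names = list(usernames)
    results = []
    for index, name in enumerate(names):
        if index > 0 and index % batch_size == 0:
            rotate()  # fresh IP at the start of each new batch
        results.append(scrape_one(name))
        if index < len(names) - 1:
            sleep(delay_seconds)  # pause between profiles, not after the last
    return results


# Demo with stubs: no network access, no real proxy.
rotations = []
demo = scrape_in_batches(
    ["a", "b", "c", "d", "e"],
    scrape_one=str.upper,
    rotate=lambda: rotations.append(1),
    batch_size=2,
    delay_seconds=0.0,
)
print(demo, "rotations:", len(rotations))
```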

Usage Guidance

This skill's description matches its scraping purpose, but its SKILL.md requires additional binaries, credentials, and proxy accounts that are not declared in the registry metadata. Before installing, ask the publisher for: (1) a clear list of required binaries/dependencies and an install spec (how to install Playwright, Python, Chromium), (2) a list of all environment variables or credentials the skill will read/write (Instagram login, Google API key, proxy username/password), and (3) the full source or a trustable homepage. Only provide Instagram credentials or proxy credentials if you trust the author — prefer running the skill in an isolated environment or VM because it writes scraped data and thumbnails to disk. Also consider legal/ToS risks of scraping Instagram and beware of affiliate links/recommended paid proxy providers. If the publisher cannot justify or document the missing requirements, treat the skill as untrustworthy.

Capability Analysis

Type: OpenClaw Skill
Name: instagram-scraper
Version: 1.0.7

The skill describes a powerful web scraping tool utilizing full browser automation (Playwright) and extensive network access, including residential proxy support. While the `SKILL.md` does not contain explicit prompt injection or direct evidence of malicious intent, it details capabilities that are inherently risky, such as handling sensitive Google API keys and proxy credentials (username/password) via environment variables. The potential for misuse or vulnerabilities in the underlying (unseen) code that implements these powerful features, even if plausibly needed for the stated purpose, warrants a 'suspicious' classification.

Capability Assessment

⚠ Purpose & Capability

The SKILL.md clearly expects a Python + Playwright/Chromium environment, state directories (data/output, thumbnails), and proxy support; however, the registry metadata lists no required binaries, env vars, or config paths. That mismatch means the declared requirements do not match what the skill actually needs to perform scraping.

⚠ Instruction Scope

Runtime instructions tell the agent to run a two-phase discovery/scrape pipeline, edit config/scraper_config.json, handle Instagram login/verification, use Google Custom Search optionally, download thumbnails to local paths, and operate residential proxies. These steps involve reading/writing local files and handling credentials (login codes, API keys, proxy credentials) even though those secrets are not declared in metadata.

ℹ Install Mechanism

There is no install spec (instruction-only), which reduces installer risk. However, SKILL.md expects external dependencies (python3, chromium, Playwright) and persistent state directories; the absence of an install step means the agent or user must provision these themselves — this is an operational gap that should be clarified.

⚠ Credentials

The skill does not declare any required environment variables or a primary credential, but the instructions implicitly require Instagram login credentials (and handling of verification codes), optional Google API key and search engine ID, and residential proxy provider credentials. Not declaring these sensitive needs is disproportionate and hides what secrets the skill will need or handle.

✓ Persistence & Privilege

The skill is not force-enabled (always:false) and does not request elevated system-wide privileges in the metadata. It does expect to create and use local state directories and files (data/, thumbnails/), which is reasonable for a scraper but should be made explicit.

Version History

v1.0.7

- Updated SKILL.md to bump the documented version from 1.0.3 to 1.0.6.
- No code or feature changes; changelog reflects documentation version update only.

v1.0.6

instagram-scraper 1.0.6
- No file changes detected in this version.
- Behavior, features, and documentation remain the same as the previous release.

v1.0.5

- Added official mention and integration with ScrapeClaw suite, highlighting broader social scraping capabilities.
- Introduced a comprehensive "Residential Proxy Support" section, with setup instructions, provider recommendations, and environment variable usage for enhanced scraping reliability and scalability.
- Expanded configuration documentation to include proxy settings and usage.
- Updated troubleshooting advice and feature list to cover new proxy functionality and provide improved guidance.
- No changes to core scraping features or interfaces.

v1.0.4

- Version bump to 1.0.4 with no file changes detected.
- No new features, bug fixes, or documentation updates introduced in this release.

v1.0.3

- Renamed package to `instagram-scraper` and updated version to 1.0.3.
- Updated YAML metadata: removed environment variable requirements for Google/Instagram credentials.
- Removed installation and CLI usage instructions from documentation.
- Streamlined documentation for agent tool interface and usage, focusing on JSON outputs.
- Other minor content and formatting updates for clarity.

v1.0.2

- Major simplification: core implementation files and anti-detection logic removed; only documentation remains.
- SKILL.md updated to reflect a reduced feature set and generic browser scraping.
- References to Playwright-specific and anti-detection features trimmed or generalized.
- Installation and usage instructions retain previous structure, but advanced configuration details are removed.
- Many technical details, example configs, and troubleshooting information are either omitted or made optional.

v1.0.1

- Internal updates made to discovery.py with no visible user-facing changes.
- Documentation and usage instructions remain unchanged.

v1.0.0

Initial release of Instagram Profile Scraper.
- Discover and scrape Instagram profiles using Google Custom Search API and Playwright with anti-detection.
- Includes full-featured profile discovery (by location/category), browser automation, and stealth scraping.
- Exports comprehensive data (bio, stats, images, engagement, etc.) to local JSON/CSV files with thumbnails.
- Features session checkpoint/resume, smart filtering (e.g., skip private/low-follower/empty accounts), and configuration options.
- Designed for robust anti-detection: browser fingerprinting, human simulation, stealth scripts, and rate limit handling.

Metadata

Slug instagram-scraper

Version 1.0.7

License —

All-time Installs 12

Active Installs 11

Total Versions 8

Frequently Asked Questions

What is Instagram Scraper?

Browser-based tool to discover Instagram profiles by location/category and scrape their public info, stats, images, and engagement with export options. It is an AI Agent Skill for Claude Code / OpenClaw, with 3010 downloads so far.

How do I install Instagram Scraper?

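Run "/install instagram-scraper" in the OpenClaw or Claude Code chat to install it. The SKILL.md also expects python3, Chromium, and Playwright to be available, and there is no install spec that provisions them, so set those up separately.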

Is Instagram Scraper free?

Yes, Instagram Scraper is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Instagram Scraper support?

Instagram Scraper is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Instagram Scraper?

It is built and maintained by ArulmozhiV (@arulmozhiv); the current version is v1.0.7.
