Scrapoxy
What is Scrapoxy?
Scrapoxy is a super proxy manager that orchestrates all your proxies in one place, rather than spreading their management across multiple scrapers.
Deployed on your own infrastructure, Scrapoxy serves as a single proxy endpoint for your scrapers.
It builds a pool of private proxies from your datacenter subscription, integrates them with proxy vendors, manages IP rotation and fingerprinting, and smartly routes traffic to avoid bans.
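As a rough illustration of the single-endpoint idea, the sketch below points Python's standard `urllib` at one proxy URL. The host, port, and credentials are placeholders assumed for this example, not values taken from this page; substitute those of your own deployment.

```python
import urllib.request
from urllib.parse import quote


def build_proxy_url(host: str, port: int, username: str, password: str) -> str:
    """Build an HTTP proxy URL with percent-encoded credentials."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"http://{user}:{secret}@{host}:{port}"


def make_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS requests through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Hypothetical endpoint and credentials, for illustration only.
PROXY_URL = build_proxy_url("localhost", 8888, "my-project", "my-secret")
opener = make_opener(PROXY_URL)
# opener.open("http://example.com", timeout=10) would go through the proxy.
```

Every scraper configured this way shares the same endpoint, so proxy selection and rotation stay out of the scraper code.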
What is Scrapoxy not?
Scrapoxy is not:
- a proxy list manager like ProxyBroker2;
- a web scraper like Scrapy, Crawlee or Octoparse;
- a cloud provider like AWS, GCP or Azure;
- a browser farm like Puppeteer, Selenium or Playwright;
- a proxy service like Rayobyte, IPRoyal or Zyte.
What can you do with Scrapoxy?
- Integrate it into your web scraping stack to manage proxies, whether as an individual or a company;
- Contribute to the project by submitting issues or pull requests;
- Distribute the code under the AGPLv3 license, ensuring the owner's name remains intact.
What can't you do with Scrapoxy?
- Use it for any activities that are illegal in your jurisdiction;
- Modify or redistribute the source code under a license other than AGPLv3;
- Sell Scrapoxy, whether as a standalone service or incorporated into another product.
Why Scrapoxy?
I started developing the Scrapoxy project in 2015.
At that time, I was working with Scrapy and encountering issues with my scrapers getting banned. There were also few low-cost solutions for obtaining IP addresses. Additionally, manually installing proxies was too time-consuming and tedious.
A solution was needed to automate these tasks.
Scrapoxy initially focused on managing the AWS provider. Users could start and stop instances and get a new IP address each time.
However, an essential element was missing: routing.
I integrated this part so that Scrapoxy became the only entry point for scrapers in a proxy infrastructure. This allowed it to autonomously distribute traffic and handle proxy rotation when a ban was detected.
My goal was to make proxy management accessible to everyone, so I open-sourced the project under the AGPLv3 license. Several users requested the addition of new providers, and the project grew.
Now, Scrapoxy smartly manages both datacenter providers and proxy services. It intercepts and modifies requests to ensure consistency in your scraping stack, which is crucial when facing ban issues.
Staying consistent in your scraping stack is the primary focus, and Scrapoxy helps you achieve that.
Features
Datacenter Providers with easy installation
Scrapoxy supports many datacenter providers like AWS, Azure, or GCP.
It installs a proxy image in each datacenter so that proxy instances launch quickly. Traffic is routed through these proxy instances to provide many IP addresses.
Scrapoxy handles the startup/shutdown of proxy instances to rotate IP addresses effectively.
Proxy Services
Scrapoxy supports many proxy services like Rayobyte, IPRoyal or Zyte.
It connects to these services and uses a variety of parameters, such as country or OS type, to create a diverse set of proxies.
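To make the idea of parameter-driven diversity concrete, here is a minimal sketch that enumerates every combination of a few parameters. The function name, the parameter keys, and the example values are hypothetical; each proxy service defines its own options.

```python
from itertools import product


def build_proxy_variants(countries, os_types):
    """Return one parameter set per (country, OS type) combination."""
    return [
        {"country": country, "os": os_type}
        for country, os_type in product(countries, os_types)
    ]


# Two countries and three OS types yield six distinct proxy profiles.
variants = build_proxy_variants(["us", "fr"], ["windows", "macos", "linux"])
```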
Hardware
Scrapoxy supports several types of 4G proxy farm hardware, such as Proxidize.
It uses their APIs to handle IP rotation on 4G networks.
Free Proxy Lists
Scrapoxy supports lists of HTTP/HTTPS proxies and SOCKS4/SOCKS5 proxies.
It takes care of testing their connectivity to aggregate them into the proxy pool.
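The sketch below shows one way such a list could be parsed and filtered before joining a pool. It is an illustration under assumptions, not a description of Scrapoxy's internals: the `scheme://host:port` line format, the helper names, and the plain TCP connection check (weaker than a real request through the proxy) are all choices made for this example.

```python
import socket
from urllib.parse import urlsplit

SUPPORTED_SCHEMES = {"http", "https", "socks4", "socks5"}


def parse_proxy(line):
    """Parse 'scheme://host:port' into (scheme, host, port), or None if invalid."""
    try:
        parts = urlsplit(line.strip())
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return None
    if parts.scheme not in SUPPORTED_SCHEMES or not parts.hostname or port is None:
        return None
    return parts.scheme, parts.hostname, port


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def build_pool(lines, check=is_reachable):
    """Keep only the well-formed proxies that pass the connectivity check."""
    pool = []
    for line in lines:
        proxy = parse_proxy(line)
        if proxy is not None and check(proxy[1], proxy[2]):
            pool.append(proxy)
    return pool
```

Passing `check` as a parameter keeps the filtering logic testable without touching the network.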
Timeout free
Scrapoxy only routes traffic to online proxies.
This feature is useful with residential proxies. Sometimes, proxies may be too slow or inactive. Scrapoxy detects these offline nodes and excludes them from the proxy pool.
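A minimal sketch of this kind of filtering is shown below. The record fields and both thresholds are assumptions made for the example; the actual detection criteria are not described here.

```python
from dataclasses import dataclass


@dataclass
class ProxyHealth:
    name: str
    last_response_at: float  # timestamp of the last successful response, in seconds
    last_latency: float      # duration of the last response, in seconds


def online_proxies(proxies, now, max_silence=30.0, max_latency=5.0):
    """Return the names of proxies that answered recently and quickly enough."""
    return [
        p.name
        for p in proxies
        if now - p.last_response_at <= max_silence and p.last_latency <= max_latency
    ]
```

Traffic would then be routed only to the names this function returns.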
Auto-Rotate proxies
Scrapoxy automatically changes IP addresses at regular intervals.
Scrapers can have thousands of IP addresses without managing proxy rotation.
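Interval-based rotation can be sketched as a periodic check that picks the proxies whose lifetime has exceeded a limit. The function name and the fixed interval are hypothetical; a real rotation policy may add randomness or other criteria.

```python
def proxies_due_for_rotation(started_at, now, interval):
    """Return the names of proxies that have been running for at least `interval` seconds.

    `started_at` maps each proxy name to its start timestamp in seconds.
    """
    return sorted(
        name for name, start in started_at.items() if now - start >= interval
    )
```

Each returned proxy would be stopped and replaced by a fresh one with a new IP address.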
Auto-Scale proxies
Scrapoxy monitors incoming traffic and automatically scales the number of proxies according to your needs.
It also reduces proxy count to minimize your costs.
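One simple way to picture traffic-driven scaling is a target count derived from the request rate and clamped between a minimum and a maximum. This is a sketch under assumptions: the per-proxy capacity and the bounds are hypothetical parameters, and the actual scaling policy may differ.

```python
import math


def target_proxy_count(requests_per_minute, per_proxy_capacity, min_proxies, max_proxies):
    """Return how many proxies to run for the current request rate."""
    if per_proxy_capacity <= 0:
        raise ValueError("per_proxy_capacity must be positive")
    needed = math.ceil(requests_per_minute / per_proxy_capacity)
    # Clamp so the pool never drops below the floor or exceeds the budget.
    return max(min_proxies, min(max_proxies, needed))
```

With a capacity of 60 requests per minute per proxy, 250 requests per minute calls for 5 proxies, while an idle period falls back to the minimum.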
Sticky sessions on Browser
Scrapoxy can keep the same IP address for a scraping session, even for browsers.
It includes an HTTP request/response interception mechanism that injects a session cookie, ensuring continuity of the IP address throughout the browser session.
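The cookie-based approach can be sketched as a router that maps a session cookie to a proxy. The cookie name, the class, and the round-robin assignment are all hypothetical choices for this example, not Scrapoxy's actual mechanism.

```python
import uuid
from http.cookies import SimpleCookie

SESSION_COOKIE = "sticky-session"  # hypothetical cookie name


class StickyRouter:
    """Assign each new session to a proxy and keep returning that proxy."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        if not self._proxies:
            raise ValueError("at least one proxy is required")
        self._sessions = {}
        self._next = 0

    def route(self, cookie_header):
        """Return (proxy, set_cookie); set_cookie is None for a known session."""
        cookie = SimpleCookie(cookie_header or "")
        morsel = cookie.get(SESSION_COOKIE)
        if morsel is not None and morsel.value in self._sessions:
            return self._sessions[morsel.value], None
        # Unknown session: pick the next proxy and issue a new session cookie.
        session_id = uuid.uuid4().hex
        proxy = self._proxies[self._next % len(self._proxies)]
        self._next += 1
        self._sessions[session_id] = proxy
        return proxy, f"{SESSION_COOKIE}={session_id}; Path=/"
```

Because the browser sends the injected cookie back on every request, all requests of that browser session reach the same proxy.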
Ban management
Scrapoxy injects the name of the proxy into the HTTP responses.
When a scraper detects that a ban has occurred, it can notify Scrapoxy to remove the proxy from the pool.
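On the scraper side, this flow might look like the sketch below. The header name, the ban heuristics, and the `remove_proxy` callback are assumptions for illustration; check your deployment for the real header and for how removal requests are sent.

```python
PROXY_NAME_HEADER = "x-scrapoxy-proxyname"  # assumed header name


def is_banned(status_code, body):
    """Hypothetical ban heuristic: blocking status codes or a captcha page."""
    return status_code in (403, 429) or "captcha" in body.lower()


def handle_response(status_code, headers, body, remove_proxy):
    """Call remove_proxy(name) when the response looks like a ban.

    Returns the name of the reported proxy, or None if nothing was reported.
    """
    if not is_banned(status_code, body):
        return None
    # HTTP header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    proxy_name = lowered.get(PROXY_NAME_HEADER)
    if proxy_name is not None:
        remove_proxy(proxy_name)
    return proxy_name
```

In practice, `remove_proxy` would be a function that notifies Scrapoxy; here it is left as a callback so the detection logic stays independent of any API.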
Traffic interception
Scrapoxy intercepts HTTP requests/responses to modify headers, keeping your scraping stack consistent. It can add session cookies or specific headers like User-Agent.
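Header rewriting of this kind can be illustrated with a small function that replaces selected headers regardless of their case. The function name and the example values are hypothetical.

```python
def apply_header_overrides(headers, overrides):
    """Return a copy of `headers` where each override replaces any header
    with the same name, compared case-insensitively."""
    override_keys = {name.lower() for name in overrides}
    result = {
        name: value
        for name, value in headers.items()
        if name.lower() not in override_keys
    }
    result.update(overrides)
    return result


# Example: force one User-Agent for every request, whatever the client sent.
rewritten = apply_header_overrides(
    {"user-agent": "python-urllib/3.12", "Accept": "*/*"},
    {"User-Agent": "Mozilla/5.0 (example)"},
)
```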
Traffic monitoring
Scrapoxy measures incoming and outgoing traffic to provide an overview of your scraping session.
It tracks metrics such as the number of requests, active proxy count, requests per proxy, and more.
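The metrics listed above can be pictured with a small counter-based structure. This is a sketch with hypothetical names, not the data model Scrapoxy uses.

```python
from collections import Counter


class TrafficStats:
    """Accumulate request counts and byte totals per proxy."""

    def __init__(self):
        self.requests_per_proxy = Counter()
        self.bytes_sent = 0
        self.bytes_received = 0

    def record(self, proxy_name, sent, received):
        """Register one request routed through `proxy_name`."""
        self.requests_per_proxy[proxy_name] += 1
        self.bytes_sent += sent
        self.bytes_received += received

    @property
    def total_requests(self):
        return sum(self.requests_per_proxy.values())

    @property
    def active_proxy_count(self):
        # A proxy counts as active once it has served at least one request.
        return len(self.requests_per_proxy)
```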
Coverage monitoring
Scrapoxy displays the geographic coverage of your proxies, helping you understand their global distribution.
Easy-to-use and production-ready
Scrapoxy is suitable for both beginners and experts.
It can be started in seconds using Docker, or deployed in a complex, distributed environment with Kubernetes.
Free and Open Source
Scrapoxy is free and open source, under the AGPLv3 license.
All contributions must remain under this license.