pcdt-scraper

A PyChromeDevTools-based web scraper with Selenium-like syntax.

pcdt-scraper Documentation

Table of Contents

  1. Introduction
  2. Requirements
  3. Installation
  4. Getting Started
  5. Core Features
  6. API Reference
  7. Examples
  8. Troubleshooting

Introduction

pcdt-scraper is a Python web scraping library that combines PyChromeDevTools with a Selenium-like syntax. It is designed for websites that block plain HTTP clients but serve a real browser: every request goes through an actual Chrome/Chromium instance driven over Chrome's DevTools Protocol, so pages see ordinary browser traffic while your code keeps a familiar, Selenium-style interface.

Requirements

  • Python 3
  • Google Chrome or Chromium, launched with remote debugging enabled (see Getting Started)
  • PyChromeDevTools, which pip pulls in automatically during installation

Installation

Install using pip:

pip install pcdt-scraper

Or using pip3:

pip3 install pcdt-scraper
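
If you prefer an isolated install, the standard virtual-environment workflow applies (plain Python tooling, nothing specific to pcdt-scraper):

python3 -m venv .venv
source .venv/bin/activate
pip install pcdt-scraper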

Getting Started

1. Start Chrome/Chromium in Debug Mode

First, you need to run Chrome or Chromium with remote debugging enabled:

# Regular mode
chromium --remote-debugging-port=9222 --remote-allow-origins=*
# Or headless mode
chromium --remote-debugging-port=9222 --remote-allow-origins=* --headless
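
To confirm the debugging endpoint is up before scraping, you can query Chrome's DevTools HTTP interface; its /json/version route returns browser metadata as JSON. A quick check using only the Python standard library:

import json
import urllib.request

# The DevTools HTTP endpoint answers on the debugging port when Chrome is
# started with --remote-debugging-port
with urllib.request.urlopen("http://localhost:9222/json/version") as resp:
    info = json.load(resp)

print(info["Browser"])  # e.g. "Chrome/126.0.0.0"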

2. Basic Usage

from pcdt_scraper import WebScraper

# Initialize the scraper
scraper = WebScraper()

try:
    # Navigate to a webpage
    scraper.get("https://www.example.com")
    
    # Find elements and extract data
    element = scraper.find_element_by_class_name("my-class")
    text = element.text()
    
finally:
    # Always close the scraper
    scraper.close()
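
Because WebScraper exposes close(), the standard library's contextlib.closing can stand in for the explicit try/finally; a minimal sketch using only the methods shown above:

from contextlib import closing

from pcdt_scraper import WebScraper

# closing() guarantees scraper.close() runs, even if an exception is raised
with closing(WebScraper()) as scraper:
    scraper.get("https://www.example.com")
    print(scraper.find_element_by_class_name("my-class").text())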

Core Features

WebScraper Class

The main class that handles all scraping operations. It provides:

  • Navigation with get(url, timeout=60)
  • Element lookup via the find_element_by_* and find_elements_by_* families
  • Page content retrieval with get_page_source() and get_page_content()
  • Browser cleanup with close() and its alias quit()

ElementWrapper Class

Wraps web elements with convenient methods:

  • text(): return the element's text content
  • get_attribute(name): read an attribute value, e.g. get_attribute("href")
  • Nested finders such as find_element_by_class_name() for searching within the element

Elements Class

Collection class for handling multiple elements:

  • Returned by the find_elements_by_* methods
  • Iterable like a list, yielding one ElementWrapper per match
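
The three classes fit together like this: WebScraper locates elements, multi-element finders return an Elements collection, and each item in it is an ElementWrapper. A sketch using only methods from the API reference below:

from pcdt_scraper import WebScraper

scraper = WebScraper()
try:
    scraper.get("https://www.example.com")
    items = scraper.find_elements_by_tag_name("li")  # an Elements collection
    for item in items:                               # each item is an ElementWrapper
        print(item.text())
finally:
    scraper.close()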

API Reference

WebScraper Methods

scraper.get(url, timeout=60)  # Navigate to a webpage
scraper.close()  # Close the browser
scraper.quit()   # Alias for close()
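
The timeout argument is the number of seconds get() waits for the page to finish loading; the troubleshooting section's "timed out after X seconds" error fires when it is exceeded. For slow pages, raise it:

scraper.get("https://www.example.com", timeout=120)  # wait up to two minutes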

Element Finding Methods

# Single element finders
scraper.find_element_by_id(id_)
scraper.find_element_by_class_name(class_name)
scraper.find_element_by_tag_name(tag_name)
scraper.find_element_by_name(name)
scraper.find_element_by_css_selector(css_selector)
scraper.find_element_by_xpath(xpath)  # Limited support

# Multiple elements finders
scraper.find_elements_by_class_name(class_name)
scraper.find_elements_by_tag_name(tag_name)
scraper.find_elements_by_name(name)
scraper.find_elements_by_css_selector(css_selector)

Page Content

scraper.get_page_source()      # Get page source (alias)
scraper.get_page_content()     # Get parsed page content
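
For example, to save a snapshot of the rendered page for offline inspection (a sketch assuming get_page_source() returns the page HTML as a string):

scraper.get("https://www.example.com")

html = scraper.get_page_source()
with open("snapshot.html", "w", encoding="utf-8") as f:
    f.write(html)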

Examples

1. Basic Scraping

from pcdt_scraper import WebScraper

scraper = WebScraper()
try:
    scraper.get("https://www.example.com")
    title = scraper.find_element_by_tag_name("h1").text()
    print(f"Page title: {title}")
finally:
    scraper.close()

2. Working with Multiple Elements

from pcdt_scraper import WebScraper

scraper = WebScraper()
try:
    scraper.get("https://www.example.com")
    links = scraper.find_elements_by_tag_name("a")
    
    for link in links:
        href = link.get_attribute("href")
        text = link.text()
        print(f"Link: {text} -> {href}")
finally:
    scraper.close()

3. Using CSS Selectors

from pcdt_scraper import WebScraper

scraper = WebScraper()
try:
    scraper.get("https://www.example.com")
    elements = scraper.find_elements_by_css_selector(".content article")
    
    for element in elements:
        title = element.find_element_by_class_name("title").text()
        print(f"Article title: {title}")
finally:
    scraper.close()

Troubleshooting

Common Issues

  1. ConnectionError
    • Error: “Got ConnectionError, it seems your chrome remote instance is not running”
    • Solution: Ensure Chrome/Chromium is running with remote debugging enabled
  2. Page Load Timeout
    • Error: “Page load timed out after X seconds”
    • Solution: Increase the timeout parameter in the get() method (see the sketch after this list)
  3. Element Not Found
    • Solution:
      • Check if the element exists in the page source
      • Try different selector methods
      • Ensure the page has fully loaded
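
A sketch combining the fixes for issues 2 and 3, using only the documented API: raise the timeout for slow pages, then check the served HTML before relying on a selector:

from pcdt_scraper import WebScraper

scraper = WebScraper()
try:
    # Issue 2: give slow pages more time than the default 60 seconds
    scraper.get("https://www.example.com", timeout=120)

    # Issue 3: confirm the element is in the page source before querying it
    html = scraper.get_page_source()
    if "my-class" in html:
        print(scraper.find_element_by_class_name("my-class").text())
    else:
        print("Element not found; content may be injected after load")
finally:
    scraper.close()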

Best Practices

  1. Always use try-finally blocks to ensure proper cleanup
  2. Close the scraper after use
  3. Handle potential exceptions appropriately
  4. Use appropriate timeouts for your use case
  5. Choose the most specific selector method available

This documentation provides a comprehensive guide to using pcdt-scraper. For more information or to contribute to the project, visit the GitHub repository.