Extract content from websites and convert to clean markdown for your AI applications. Perfect for LLM training, RAG systems, and content ingestion.
Get started with web scraping in just a few lines of code
```typescript
import { SerpexClient } from 'serpex';

const client = new SerpexClient('your-api-key-here');

async function extractContent() {
  try {
    const result = await client.extract({
      urls: [
        'https://example.com/article1',
        'https://example.com/article2'
      ]
    });
    console.log('Extraction successful:', result);
  } catch (error) {
    console.error('Extraction failed:', error);
  }
}

extractContent();
```

Complete API specification for the web scraping endpoint
`urls` (required): Array of URLs to extract content from (maximum 10 URLs)
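Since the endpoint rejects malformed URLs and caps a batch at 10, it can help to validate client-side before spending a request. A minimal sketch (the helper name and return shape are ours; only the 10-URL limit comes from the spec above):

```typescript
// Client-side pre-flight check for an extraction batch.
// The 10-URL cap mirrors the `urls` parameter spec; the rest is illustrative.
const MAX_URLS = 10;

interface UrlCheck {
  valid: string[];
  invalid: string[];
  ok: boolean;
}

function validateUrls(urls: string[]): UrlCheck {
  const valid: string[] = [];
  const invalid: string[] = [];
  for (const u of urls) {
    try {
      new URL(u); // throws a TypeError on malformed input
      valid.push(u);
    } catch {
      invalid.push(u);
    }
  }
  return {
    valid,
    invalid,
    ok: invalid.length === 0 && urls.length > 0 && urls.length <= MAX_URLS,
  };
}
```

If `ok` is false, fix or drop the offending entries (or split the batch) before calling the API.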
Include your API key in the Authorization header:
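For callers not using the SDK, the request can be built by hand. In this sketch the `Bearer` scheme, the request shape, and the helper name are assumptions, not documented values; only "API key in the Authorization header" comes from the text above:

```typescript
// Raw HTTP request construction without the SDK.
// ASSUMPTIONS: Bearer scheme and JSON body shape; verify against your dashboard.
function buildExtractRequest(apiKey: string, urls: string[]) {
  return {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`, // API key goes in the Authorization header
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ urls }),
  };
}

// Usage (endpoint placeholder, not a documented URL):
// const res = await fetch(EXTRACT_ENDPOINT,
//   buildExtractRequest('your-api-key-here', ['https://example.com/article1']));
```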
Different response scenarios you might encounter
```json
{
  "success": true,
  "results": [
    {
      "url": "https://example.com/article1",
      "success": true,
      "markdown": "# Article Title\n\nThis is the main content of the article converted to markdown format...\n\n## Section Header\n\nMore content here...",
      "status_code": 200
    },
    {
      "url": "https://example.com/article2",
      "success": true,
      "markdown": "# Second Article\n\nContent from the second webpage...\n\n### Subsection\n\nAdditional content...",
      "status_code": 200
    }
  ],
  "metadata": {
    "total_urls": 2,
    "processed_urls": 2,
    "successful_crawls": 2,
    "failed_crawls": 0,
    "credits_used": 6,
    "response_time": 2150,
    "timestamp": "2025-01-22T10:30:20.000Z"
  }
}
```

Detailed breakdown of all response fields
Result fields:

- `url`: The URL that was crawled
- `success`: Whether the crawl was successful
- `markdown`: Clean markdown content (if successful)
- `status_code`: HTTP status code of the response
- `crawled_at`: Timestamp when the URL was crawled
- `extraction_mode`: Method used for content extraction

Metadata fields:

- `total_urls`: Total number of URLs requested
- `processed_urls`: Number of URLs processed
- `successful_crawls`: Number of successful extractions
- `failed_crawls`: Number of failed extractions
- `credits_used`: Credits consumed (3 per URL)
- `response_time`: Total processing time in milliseconds
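The metadata lends itself to a quick billing sanity check: at 3 credits per URL, `credits_used` should track the number of URLs processed (the sample response above shows 6 credits for 2 URLs). A sketch, with the interface merely mirroring the documented fields — whether failed crawls are billed is not stated here, so treat a mismatch as a prompt to check, not an error:

```typescript
// Per the field breakdown: 3 credits per URL.
const CREDITS_PER_URL = 3;

interface ExtractMetadata {
  total_urls: number;
  processed_urls: number;
  successful_crawls: number;
  failed_crawls: number;
  credits_used: number;
  response_time: number;
}

// Expected spend for a batch, assuming every processed URL is billed.
function expectedCredits(meta: ExtractMetadata): number {
  return meta.processed_urls * CREDITS_PER_URL;
}
```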
Common applications for the web scraping API
- Prepare clean, structured content for training language models
- Extract and process content for retrieval-augmented generation
- Analyze and process web content for insights and research
- Build comprehensive knowledge bases from web sources
Common errors and how to handle them
Ensure all URLs are valid and properly formatted. The API will return specific invalid URLs in the error response.
Respect rate limits and implement exponential backoff for retries.
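The backoff advice above can be wrapped in a small generic helper. This is not part of the SDK — the function name, defaults, and delay schedule are our own sketch:

```typescript
// Retry with exponential backoff: delays grow as baseDelayMs * 2^attempt.
// Tune maxRetries/baseDelayMs for your rate limits; not an SDK feature.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break; // out of retries
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}

// Usage with the SDK call from the quickstart:
// const result = await withBackoff(() => client.extract({ urls }));
```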
Some URLs may fail while others succeed. Check the `success` field for each result individually.
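Partial failure handling can be sketched as a simple partition over the results array. The interface mirrors the documented result fields; the helper itself is illustrative:

```typescript
// Mirrors the documented per-result fields.
interface ExtractResult {
  url: string;
  success: boolean;
  markdown?: string; // present only on successful crawls
  status_code: number;
}

// Split a batch response into successes (feed downstream) and failures (retry/log).
function partitionResults(results: ExtractResult[]) {
  const succeeded = results.filter((r) => r.success);
  const failed = results.filter((r) => !r.success);
  return { succeeded, failed };
}

// Usage: pass succeeded[].markdown to your ingestion pipeline,
// and re-queue or log failed[].url with its status_code.
```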