Browser Use - AgentBay - Alibaba Cloud Documentation Center

What is AgentBay AIBrowser

AgentBay AIBrowser is a managed platform for running headless and headed browsers at scale. It provides the infrastructure to create and manage sessions, initialize browser instances, and allocate underlying hardware resources on demand. It is designed for web automation scenarios such as filling in forms, simulating user operations, and orchestrating complex, multi-step tasks on modern dynamic websites.

The AgentBay AIBrowser API provides a simple interface for controlling browsers and practical tools for creating and managing sessions. You can use its advanced AI capabilities to execute tasks described in natural language.

Main features

Automation framework compatibility: Works with Playwright and Puppeteer through the Chrome DevTools Protocol (CDP).
Secure and scalable infrastructure: Managed sessions, isolated environments, and elastic resource allocation.
Observability: Session playback, session inspector, and real-time mode for live debugging.
Advanced capabilities: Context management, IP proxy, and stealth/fingerprint options.
AI-driven PageUseAgent: Executes complex web flow tasks using natural language.
Rich API operations: Provides simple interfaces for session management, browser lifecycle control, and proxy operations.

Quick Start (Python)

This minimal, runnable example shows how to initialize a browser with the AgentBay Python software development kit (SDK) and drive it with Playwright through CDP. The sample code performs the following steps:

Authenticates by building an AgentBay client with your API key to establish a trusted channel.
Configures a new execution environment by creating a session with a browser-enabled image. This ensures the required runtime is available.
Initializes the session's browser using BrowserOption(). This starts a remote browser instance ready for automation.
Retrieves the CDP endpoint URL with get_endpoint_url() and connects using Playwright's connect_over_cdp. This bridges your local code to the remote browser.
Opens a new page over the established connection and navigates to a website. You can inspect or manipulate the DOM as you would with a local browser.
Deletes the session when all work is complete to release the allocated resources.

Prerequisites:

Set your API key: export AGENTBAY_API_KEY=your_api_key
Install dependencies: pip install wuying-agentbay-sdk playwright
Install the Playwright browser: python -m playwright install chromium

import os
import asyncio
from agentbay import AgentBay
from agentbay.session_params import CreateSessionParams
from agentbay.browser.browser import BrowserOption
from playwright.async_api import async_playwright

async def main():
    api_key = os.getenv("AGENTBAY_API_KEY")
    if not api_key:
        raise RuntimeError("AGENTBAY_API_KEY environment variable not set")

    agent_bay = AgentBay(api_key=api_key)

    # Create a session. Use an image with a browser preinstalled.
    params = CreateSessionParams(image_id="browser_latest")
    session_result = agent_bay.create(params)
    if not session_result.success:
        raise RuntimeError(f"Failed to create session: {session_result.error_message}")

    session = session_result.session

    # Initialize the browser. BrowserOption supports stealth, proxy, fingerprint, and more.
    ok = await session.browser.initialize_async(BrowserOption())
    if not ok:
        raise RuntimeError("Browser initialization failed")

    endpoint_url = session.browser.get_endpoint_url()

    # Connect Playwright over CDP and automate.
    async with async_playwright() as p:
        browser = await p.chromium.connect_over_cdp(endpoint_url)
        page = await browser.new_page()
        await page.goto("https://www.aliyun.com")
        print("Title:", await page.title())
        await browser.close()

    session.delete()

if __name__ == "__main__":
    asyncio.run(main())
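
The sample above calls session.delete() only on the success path, so an exception during initialization or navigation would leave the session allocated. The following is a minimal sketch of one way to guarantee cleanup. The managed_session helper is hypothetical and not part of the SDK. It assumes only what the sample shows: the create result exposes success, error_message, and session, and the session has a delete() method.

```python
from contextlib import contextmanager


@contextmanager
def managed_session(create_session):
    """Yield a session and always delete it afterwards.

    `create_session` is any zero-argument callable that returns a result
    with `success`, `error_message`, and `session` attributes, for example
    `lambda: agent_bay.create(params)` from the sample above.
    (Hypothetical helper, not part of the AgentBay SDK.)
    """
    result = create_session()
    if not result.success:
        raise RuntimeError(f"Failed to create session: {result.error_message}")
    session = result.session
    try:
        yield session
    finally:
        # Runs on success and on error, so the session is released either way.
        session.delete()
```

With this helper, the body of main() could run inside `with managed_session(lambda: agent_bay.create(params)) as session:` and drop the explicit session.delete() call.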

Key browser APIs:

Browser.initialize(option: BrowserOption) -> bool / initialize_async(…): Starts a browser instance for the session.
Browser.get_endpoint_url() -> str: Returns the CDP WebSocket endpoint. Use it with Playwright's connect_over_cdp.
Browser.is_initialized() -> bool: Checks if the browser is ready.
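
The calls listed above can be combined into a small guard that refuses to hand out an endpoint for a browser that is not ready. This is a sketch only. The get_cdp_endpoint name is hypothetical, and the function relies on nothing beyond the is_initialized() and get_endpoint_url() methods described above.

```python
def get_cdp_endpoint(browser):
    """Return the CDP endpoint URL, or raise if the browser is not ready.

    `browser` is expected to behave like `session.browser`: it must offer
    `is_initialized()` and `get_endpoint_url()`.
    (Hypothetical helper, not part of the AgentBay SDK.)
    """
    if not browser.is_initialized():
        raise RuntimeError("Browser is not initialized; call initialize() first")
    return browser.get_endpoint_url()
```

Failing early here gives a clearer error than a connection failure inside Playwright's connect_over_cdp.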

Basic configuration

Web pages may require different configurations and display environments. You can customize the browser's identity and window size to tailor the user experience for specific device types or audiences. The following example sets a custom user agent and a precise window size. It performs the following steps:

Authenticates and creates a session that hosts a browser.
Sets a custom user agent that identifies the browser as Chrome on macOS, and a 1366 x 768 viewport.
Starts the browser using initialize_async, requests the CDP endpoint, and establishes a connection through Playwright.
Visits a website and prints the effective User-Agent and viewport size so that you can verify them.

import os
import asyncio
from agentbay import AgentBay
from agentbay.session_params import CreateSessionParams
from agentbay.browser.browser import BrowserOption, BrowserViewport
from playwright.async_api import async_playwright

CUSTOM_UA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"

async def main():
    agent_bay = AgentBay(api_key=os.environ["AGENTBAY_API_KEY"])  # Authenticate.

    params = CreateSessionParams(image_id="browser_latest")       # Provision a browser-ready session.
    result = agent_bay.create(params)
    if not result.success:
        raise RuntimeError(result.error_message)

    session = result.session

    # Define the browser's appearance and behavior.
    option = BrowserOption(
        user_agent=CUSTOM_UA,                    # Set a custom identity.
        viewport=BrowserViewport(width=1366, height=768),  # Set a window size for a common laptop.
    )

    ok = await session.browser.initialize_async(option)
    if not ok:
        raise RuntimeError("Browser initialization failed")

    endpoint_url = session.browser.get_endpoint_url()      # Get the CDP endpoint.

    async with async_playwright() as p:
        browser = await p.chromium.connect_over_cdp(endpoint_url)  # Connect and take control.
        page = await browser.new_page()

        await page.goto("https://www.whatismybrowser.com/detect/what-is-my-user-agent")
        # Verify the new user agent and window size.
        ua = await page.evaluate("navigator.userAgent")
        w = await page.evaluate("window.innerWidth")
        h = await page.evaluate("window.innerHeight")
        print("Effective UA:", ua)
        print("Viewport:", w, "x", h)

        await browser.close()

    session.delete()  # Delete the session to release resources.

if __name__ == "__main__":
    asyncio.run(main())
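
The sample prints the effective values and leaves the comparison to the reader. A minimal sketch of an automated check follows. The find_mismatches function is a hypothetical helper that compares plain values, so it makes no SDK or Playwright calls. Whether window.innerWidth and window.innerHeight match the configured viewport exactly on the remote browser is an assumption you should confirm in your environment.

```python
def find_mismatches(expected_ua, expected_size, actual_ua, actual_size):
    """Compare the configured identity with what the page reports.

    Sizes are (width, height) pairs. Returns a list of human-readable
    problems; an empty list means both checks passed.
    (Hypothetical helper, not part of the AgentBay SDK.)
    """
    problems = []
    if actual_ua != expected_ua:
        problems.append(
            f"user agent: expected {expected_ua!r}, got {actual_ua!r}"
        )
    if tuple(actual_size) != tuple(expected_size):
        problems.append(
            f"viewport: expected {tuple(expected_size)}, got {tuple(actual_size)}"
        )
    return problems
```

In the sample above, this could be called as find_mismatches(CUSTOM_UA, (1366, 768), ua, (w, h)) after the three page.evaluate calls.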

Using PageUseAgent [Beta]

PageUseAgent lets you drive a browser with natural-language instructions. Calls to PageUseAgent rely on a large language model and consume tokens, which are billed on a pay-as-you-go basis. During the Beta period, the service is available as a free trial.

The following example shows how to search for a book on Google and open the first result. It performs the following steps:

Creates a session and initializes the remote browser.
Connects Playwright over CDP and navigates to the Google home page.
Instructs the agent to type the book title into the search box and submit it. The agent translates the instruction into page operations.
Instructs the agent to click the first search result.
Closes the browser and deletes the session.

import os
import asyncio
from agentbay import AgentBay
from agentbay.session_params import CreateSessionParams
from agentbay.browser.browser import BrowserOption
from agentbay.browser.browser_agent import ActOptions
from playwright.async_api import async_playwright

BOOK_QUERY = "The Pragmatic Programmer"

async def main():
    agent_bay = AgentBay(api_key=os.environ["AGENTBAY_API_KEY"])  # Authenticate.

    params = CreateSessionParams(image_id="browser_latest")       # Provision a session with a browser image.
    result = agent_bay.create(params)
    if not result.success:
        raise RuntimeError(result.error_message)
    session = result.session

    # Initialize the remote browser.
    if not await session.browser.initialize_async(BrowserOption()):
        raise RuntimeError("Browser initialization failed")

    endpoint_url = session.browser.get_endpoint_url()

    async with async_playwright() as p:
        browser = await p.chromium.connect_over_cdp(endpoint_url)
        page = await browser.new_page()

        # Navigate to the page.
        await page.goto("https://www.google.com")

        # Ask the agent to type the book name into the search box and submit.
        act_result = await session.browser.agent.act_async(ActOptions(
            action=f"Type '{BOOK_QUERY}' into the search box and submit",
        ), page)
        print("act_result:", act_result.success, act_result.message)

        # Let the agent open the first search result.
        open_first = await session.browser.agent.act_async(ActOptions(
            action="Click the first result in the search results",
        ), page)
        print("open_first:", open_first.success, open_first.message)

        # Pause briefly to observe the result.
        await page.wait_for_timeout(5000)
        await browser.close()

    session.delete()

if __name__ == "__main__":
    asyncio.run(main())

About PageUseAgent.act:

variables: Interpolates dynamic values into the instruction, which enables reusable prompts.
act operates on an active Playwright page by retrieving its underlying context_id and page_id.
act returns a structured ActResult that contains success and message for easy logging and recovery flows.
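
The idea of a reusable prompt can also be illustrated on the caller's side. The sketch below uses Python's standard string.Template to fill a placeholder before the text is passed as the action. It is a local illustration only and does not show the SDK's own variables mechanism, whose placeholder syntax is not covered in this document. The SEARCH_PROMPT and render_action names are hypothetical.

```python
from string import Template

# A reusable instruction with one placeholder (hypothetical example prompt).
SEARCH_PROMPT = Template("Type '$query' into the search box and submit")


def render_action(template, **values):
    """Fill every placeholder in `template`.

    Template.substitute raises KeyError when a placeholder has no value,
    so an incomplete prompt never reaches the agent.
    """
    return template.substitute(**values)
```

The rendered string could then be passed as ActOptions(action=render_action(SEARCH_PROMPT, query=BOOK_QUERY)).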

Limitations

PageUseAgent does not include a long-term planner and does not orchestrate multi-step plans on its own. It relies on the caller, or a higher-level agent, to break a task down into steps and call act or other PageUseAgent methods for each step.

The strength of PageUseAgent lies in executing precise, atomic web operations, such as clicks, fills, and scrolls, quickly and consistently.

PageUseAgent prioritizes throughput and accuracy for each step. It leaves complex task planning and branching logic to an external controller.
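
The external controller described above can be as simple as a loop that issues one instruction at a time and stops at the first failure. The following is a minimal sketch under stated assumptions. The run_steps function is hypothetical, and act stands for any async callable that takes an instruction string and returns an object with success and message, for example a thin wrapper around session.browser.agent.act_async(ActOptions(action=step), page).

```python
import asyncio


async def run_steps(act, steps):
    """Run instructions in order and stop at the first failure.

    `act` is an async callable: instruction string -> result with
    `success` and `message` attributes.
    Returns a log of (step, success, message) tuples for the steps that ran.
    (Hypothetical controller, not part of the AgentBay SDK.)
    """
    log = []
    for step in steps:
        result = await act(step)
        log.append((step, result.success, result.message))
        if not result.success:
            # Leave recovery or re-planning to the caller.
            break
    return log
```

Stopping at the first failure keeps each act call atomic and leaves branching and retry decisions to the caller, in line with the division of responsibility described above.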