UI automation - AgentBay - Alibaba Cloud Documentation Center

This topic describes the UI automation capabilities of the AgentBay software development kit (SDK) for cloud computer environments, covering mouse, keyboard, and screen operations.

Overview

The Computer Use module provides UI automation for cloud computers. It covers three groups of operations:

Mouse operations - Precise control over clicks, movement, dragging, and scrolling.
Keyboard operations - Type text and send key combinations (shortcuts).
Screen operations - Take screenshots and retrieve screen information.

Create a session

from agentbay import AgentBay
from agentbay.session_params import CreateSessionParams

agent_bay = AgentBay()
# image_id selects the cloud computer image: windows_latest or linux_latest
session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session
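
The examples in this topic create a session and delete it at the end. If an operation raises an exception in between, the delete call is skipped and the session keeps running. One way to guard against that is a small context manager. This is a minimal sketch: `managed_session` is a hypothetical helper, not part of the SDK, and it assumes that a failed `create()` call leaves `.session` unset.

```python
from contextlib import contextmanager


@contextmanager
def managed_session(agent_bay, session_params):
    """Create a session and delete it on exit, even if an error occurs.

    Hypothetical helper. It only uses agent_bay.create(...).session and
    agent_bay.delete(session), as shown in the examples on this page.
    """
    session = agent_bay.create(session_params).session
    if session is None:
        # Assumption: a failed create call leaves .session unset.
        raise RuntimeError("Session creation failed")
    try:
        yield session
    finally:
        # Runs on normal exit and when the body raises.
        agent_bay.delete(session)
```

With this helper, the body of each example can be written as `with managed_session(agent_bay, session_params) as session:` followed by the `session.computer` calls.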

Mouse operations

Click operations

The click_mouse() method clicks at the specified screen coordinates. Pass a value of the MouseButton enumeration to select the click type. The following values are supported:

MouseButton.LEFT
MouseButton.RIGHT
MouseButton.MIDDLE
MouseButton.DOUBLE_LEFT

from agentbay.computer import MouseButton

session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session

# Left-click
result = session.computer.click_mouse(x=500, y=300, button=MouseButton.LEFT)
if result.success:
    print("Left-click successful")
# Output: Left-click successful

# Right-click
result = session.computer.click_mouse(x=500, y=300, button=MouseButton.RIGHT)
if result.success:
    print("Right-click successful")
# Output: Right-click successful

# Middle-click
result = session.computer.click_mouse(x=500, y=300, button=MouseButton.MIDDLE)
if result.success:
    print("Middle-click successful")
# Output: Middle-click successful

# Double left-click
result = session.computer.click_mouse(x=500, y=300, button=MouseButton.DOUBLE_LEFT)
if result.success:
    print("Double left-click successful")
# Output: Double left-click successful

agent_bay.delete(session)
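
The click examples above print a message on success and stay silent on failure. In an automation script it is often more useful to stop as soon as a step fails. The following sketch wraps that check in a hypothetical helper, `require_success`. It relies on the `success` attribute used throughout this topic. The `error_message` attribute is an assumption, so it is read defensively with `getattr`.

```python
def require_success(result, action):
    """Return `result` unchanged, or raise if the operation failed.

    Hypothetical helper, not part of the SDK.
    """
    if not result.success:
        # Assumption: failed results may carry an `error_message` attribute.
        detail = getattr(result, "error_message", "") or "no details available"
        raise RuntimeError(f"{action} failed: {detail}")
    return result
```

For example, `require_success(session.computer.click_mouse(x=500, y=300, button=MouseButton.LEFT), "left-click")` raises instead of continuing after a failed click.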

Move mouse

Move the mouse cursor to the specified coordinates.

session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session

result = session.computer.move_mouse(x=600, y=400)
if result.success:
    print("Mouse move successful")
# Output: Mouse move successful

agent_bay.delete(session)

Drag mouse

Drag the mouse from one point to another. Use the MouseButton enum to specify which button is held during the drag. The following button types are supported:

MouseButton.LEFT
MouseButton.RIGHT
MouseButton.MIDDLE

from agentbay.computer import MouseButton

session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session

# Left-drag
result = session.computer.drag_mouse(
    from_x=100,
    from_y=100,
    to_x=200,
    to_y=200,
    button=MouseButton.LEFT,
)
if result.success:
    print("Drag operation successful")
# Output: Drag operation successful

agent_bay.delete(session)

Scroll wheel

Scroll the mouse wheel at the specified coordinates. Use the ScrollDirection enum to specify the direction and the amount parameter to specify how far to scroll. The following directions are supported:

ScrollDirection.UP
ScrollDirection.DOWN
ScrollDirection.LEFT
ScrollDirection.RIGHT

from agentbay.computer import ScrollDirection

session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session

# Scroll up
result = session.computer.scroll(x=500, y=500, direction=ScrollDirection.UP, amount=3)
if result.success:
    print("Scroll up successful")
# Output: Scroll up successful

# Scroll down
result = session.computer.scroll(x=500, y=500, direction=ScrollDirection.DOWN, amount=5)
if result.success:
    print("Scroll down successful")
# Output: Scroll down successful

agent_bay.delete(session)

Get cursor position

Retrieve the current cursor coordinates. The result.data value is a JSON string that contains the x and y coordinates.

import json

session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session

result = session.computer.get_cursor_position()
if result.success:
    cursor_data = json.loads(result.data)
    print(f"Cursor position: x={cursor_data['x']}, y={cursor_data['y']}")
# Output: Cursor position: x=512, y=384

agent_bay.delete(session)
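
Reading the cursor position back is a simple way to confirm that a move landed where expected. The sketch below combines move_mouse() and get_cursor_position() for that purpose. `move_and_verify` and its `tolerance` parameter are hypothetical names. The `computer` argument is expected to behave like session.computer, and result.data is assumed to be a JSON string with x and y keys, as in the example above.

```python
import json


def move_and_verify(computer, x, y, tolerance=0):
    """Move the cursor, read its position back, and compare the two.

    Hypothetical helper. Returns True only if both calls succeed and the
    reported position is within `tolerance` pixels of the target.
    """
    if not computer.move_mouse(x=x, y=y).success:
        return False
    result = computer.get_cursor_position()
    if not result.success:
        return False
    position = json.loads(result.data)
    return (
        abs(position["x"] - x) <= tolerance
        and abs(position["y"] - y) <= tolerance
    )
```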

Keyboard operations

Text input

Type a string of text on the cloud computer.

session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session

result = session.computer.input_text("Hello AgentBay!")
if result.success:
    print("Text input successful")
# Output: Text input successful

agent_bay.delete(session)

Key press

Send a key combination to the cloud computer. Pass the keys as a list, with any modifier keys first.

session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session

# Press Ctrl+A to select all
result = session.computer.press_keys(keys=["Ctrl", "a"])
if result.success:
    print("Key press successful")
# Output: Key press successful

# Press Ctrl+C to copy
result = session.computer.press_keys(keys=["Ctrl", "c"])
if result.success:
    print("Copy command sent")
# Output: Copy command sent

agent_bay.delete(session)

Release key

When press_keys() is called with hold=True, the cloud computer keeps the keys held down. Call release_keys() after the related operations are complete. Otherwise, the held keys can interfere with later key operations.

session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session

# Hold down the Ctrl key
session.computer.press_keys(keys=["Ctrl"], hold=True)

# ... Perform other operations ...

# Release the Ctrl key
result = session.computer.release_keys(keys=["Ctrl"])
if result.success:
    print("Key release successful")
# Output: Key release successful

agent_bay.delete(session)
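
Because a held key must always be released, the press and release calls pair naturally with a try/finally block. The following sketch packages that pattern as a context manager. `held_keys` is a hypothetical helper, not part of the SDK. It uses only press_keys(keys=..., hold=True) and release_keys(keys=...) as shown above, and expects `computer` to behave like session.computer.

```python
from contextlib import contextmanager


@contextmanager
def held_keys(computer, keys):
    """Hold `keys` down for the duration of the with-block.

    Hypothetical helper. The keys are released even if the block raises.
    """
    result = computer.press_keys(keys=keys, hold=True)
    if not result.success:
        raise RuntimeError(f"Could not hold keys: {keys}")
    try:
        yield
    finally:
        # Always release, so a failure inside the block cannot leave keys held.
        computer.release_keys(keys=keys)
```

For example, `with held_keys(session.computer, ["Ctrl"]):` followed by several click_mouse() calls keeps Ctrl held for each click and releases it afterward.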

Screen operations

Screenshot

Capture a snapshot of the current screen. The screenshot is saved to cloud storage, and a download URL is returned.

session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session

result = session.computer.screenshot()
if result.success:
    screenshot_url = result.data
    print(f"Screenshot URL: {screenshot_url}")
# Output: Screenshot URL: https://***.***.aliyuncs.com/***/screenshot_1234567890.png?***

agent_bay.delete(session)

Get screen dimensions

Retrieve the screen width, height, and DPI scaling factor. The result.data value is a JSON string.

import json

session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session

result = session.computer.get_screen_size()
if result.success:
    screen_data = json.loads(result.data)
    print(f"Screen width: {screen_data['width']}")
    print(f"Screen height: {screen_data['height']}")
    print(f"DPI scaling factor: {screen_data['dpiScalingFactor']}")
# Output: Screen width: 1024
# Output: Screen height: 768
# Output: DPI scaling factor: 1.0

agent_bay.delete(session)
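
The screen dimensions make it possible to avoid hard-coded pixel coordinates. The sketch below maps relative positions (0.0 to 1.0) to pixels and clicks there. `to_pixels` and `click_relative` are hypothetical helpers, not SDK functions. The sketch assumes the width and height keys shown above and does not apply dpiScalingFactor.

```python
import json


def to_pixels(rel_x, rel_y, width, height):
    """Map relative coordinates in [0, 1] to pixel coordinates on the screen."""
    # Clamp out-of-range input so the result always stays on screen.
    rel_x = min(max(rel_x, 0.0), 1.0)
    rel_y = min(max(rel_y, 0.0), 1.0)
    x = min(int(rel_x * width), width - 1)
    y = min(int(rel_y * height), height - 1)
    return x, y


def click_relative(computer, rel_x, rel_y, button):
    """Click at a position given as a fraction of the screen size.

    Hypothetical helper. `computer` is expected to behave like
    session.computer, and `button` should be a MouseButton value.
    """
    result = computer.get_screen_size()
    if not result.success:
        return result
    size = json.loads(result.data)
    x, y = to_pixels(rel_x, rel_y, size["width"], size["height"])
    return computer.click_mouse(x=x, y=y, button=button)
```

On the 1024 x 768 screen from the example output, `to_pixels(0.5, 0.5, 1024, 768)` returns `(512, 384)`.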

Troubleshooting

FAQ

"Tool not found" error
- Ensure that the session was created with a cloud computer image, such as windows_latest or linux_latest.

How do I handle the download URL returned after taking a screenshot?
- Screenshots are automatically saved to cloud storage (OSS).
- result.data contains the download URL, not the raw image data.
- Use this URL to download the screenshot file.
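
Downloading the screenshot from the returned URL can be done with the Python standard library. The following is a minimal sketch. `download_screenshot` is a hypothetical helper, and the 30-second default timeout is an arbitrary choice. The URL is used exactly as returned in result.data.

```python
import shutil
import urllib.request


def download_screenshot(url, dest_path, timeout=30):
    """Download the file at `url` to `dest_path` and return the path.

    Hypothetical helper. Raises urllib.error.URLError (or a subclass)
    if the download fails.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(dest_path, "wb") as out_file:
            # Stream the body to disk instead of loading it all into memory.
            shutil.copyfileobj(response, out_file)
    return dest_path
```

For example, after a successful screenshot() call, `download_screenshot(result.data, "screenshot.png")` saves the image to the local working directory.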