This topic describes the UI automation capabilities of the AgentBay software development kit (SDK) for cloud computer environments, covering mouse, keyboard, and screen operations.
Overview
The Computer Use module provides UI automation for cloud computers, including:
Mouse operations - Precise control over clicks, movement, dragging, and scrolling.
Keyboard operations - Input text and send key combinations (shortcuts).
Screen operations - Take screenshots and retrieve screen information.
Create a session
from agentbay import AgentBay
from agentbay.session_params import CreateSessionParams
agent_bay = AgentBay()
# Use windows_latest or linux_latest
session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session
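Every example in this topic creates a session and later calls agent_bay.delete(session). If an operation raises in between, the delete call is skipped and the session stays open. The following is a minimal sketch of one way to guarantee cleanup. managed_session is a hypothetical helper, not part of the AgentBay SDK, and it assumes that a failed create call leaves the session attribute unset or None.

```python
from contextlib import contextmanager


@contextmanager
def managed_session(agent_bay, session_params):
    """Create a session, yield it, and always delete it afterwards.

    Hypothetical helper, not part of the AgentBay SDK. It relies only on
    the create(...).session and delete(session) calls used in this topic.
    """
    result = agent_bay.create(session_params)
    # Assumption: a failed create leaves .session unset or None.
    session = getattr(result, "session", None)
    if session is None:
        raise RuntimeError("Session creation failed")
    try:
        yield session
    finally:
        # Runs even if the body raises, so the cloud session is not leaked.
        agent_bay.delete(session)
```

With this helper, the body of an example can be written as "with managed_session(agent_bay, session_params) as session:" followed by the session.computer calls, and no explicit delete call is needed.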
Mouse operations
Click operations
The click_mouse() method supports various types of clicks. Use the MouseButton enumeration to ensure type safety. The following button types are supported:
MouseButton.LEFT
MouseButton.RIGHT
MouseButton.MIDDLE
MouseButton.DOUBLE_LEFT
from agentbay.computer import MouseButton
session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session
# Left-click
result = session.computer.click_mouse(x=500, y=300, button=MouseButton.LEFT)
if result.success:
    print("Left-click successful")
# Output: Left-click successful
# Right-click
result = session.computer.click_mouse(x=500, y=300, button=MouseButton.RIGHT)
if result.success:
    print("Right-click successful")
# Output: Right-click successful
# Middle-click
result = session.computer.click_mouse(x=500, y=300, button=MouseButton.MIDDLE)
if result.success:
    print("Middle-click successful")
# Output: Middle-click successful
# Double left-click
result = session.computer.click_mouse(x=500, y=300, button=MouseButton.DOUBLE_LEFT)
if result.success:
    print("Double left-click successful")
# Output: Double left-click successful
agent_bay.delete(session)
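The click examples print a message only when result.success is true, so a failed click passes silently. The sketch below shows one possible way to turn a failed result into an exception. require_success and OperationFailed are hypothetical names. Only the success attribute appears in this topic, so error_message is an assumed attribute and is read defensively.

```python
class OperationFailed(RuntimeError):
    """Raised when a computer operation reports success == False."""


def require_success(result, action):
    """Return result unchanged if it succeeded, otherwise raise.

    Hypothetical helper. Only result.success appears in this topic;
    error_message is an assumed attribute, read defensively.
    """
    if not result.success:
        detail = getattr(result, "error_message", "") or "no details available"
        raise OperationFailed(f"{action} failed: {detail}")
    return result
```

A call such as require_success(session.computer.click_mouse(x=500, y=300, button=MouseButton.LEFT), "left-click") then either returns the result or stops the script at the first failed step.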
Move mouse
Move the mouse cursor to the specified coordinates.
session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session
result = session.computer.move_mouse(x=600, y=400)
if result.success:
    print("Mouse move successful")
# Output: Mouse move successful
agent_bay.delete(session)
Drag mouse
Using the MouseButton enum, drag the mouse from one point to another. The following button types are supported:
MouseButton.LEFT
MouseButton.RIGHT
MouseButton.MIDDLE
from agentbay.computer import MouseButton
session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session
# Left-drag
result = session.computer.drag_mouse(
    from_x=100,
    from_y=100,
    to_x=200,
    to_y=200,
    button=MouseButton.LEFT
)
if result.success:
    print("Drag operation successful")
# Output: Drag operation successful
agent_bay.delete(session)
Scroll wheel
Using the ScrollDirection enum, scroll the mouse wheel at a specific coordinate. The following directions are supported:
ScrollDirection.UP
ScrollDirection.DOWN
ScrollDirection.LEFT
ScrollDirection.RIGHT
from agentbay.computer import ScrollDirection
session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session
# Scroll up
result = session.computer.scroll(x=500, y=500, direction=ScrollDirection.UP, amount=3)
if result.success:
    print("Scroll up successful")
# Output: Scroll up successful
# Scroll down
result = session.computer.scroll(x=500, y=500, direction=ScrollDirection.DOWN, amount=5)
if result.success:
    print("Scroll down successful")
# Output: Scroll down successful
agent_bay.delete(session)
Get cursor position
import json
session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session
result = session.computer.get_cursor_position()
if result.success:
    cursor_data = json.loads(result.data)
    print(f"Cursor position: x={cursor_data['x']}, y={cursor_data['y']}")
# Output: Cursor position: x=512, y=384
agent_bay.delete(session)
Keyboard operations
Text input
session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session
result = session.computer.input_text("Hello AgentBay!")
if result.success:
    print("Text input successful")
# Output: Text input successful
agent_bay.delete(session)
Key press
Input key combinations, with support for modifier keys.
session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session
# Press Ctrl+A to select all
result = session.computer.press_keys(keys=["Ctrl", "a"])
if result.success:
    print("Key press successful")
# Output: Key press successful
# Press Ctrl+C to copy
result = session.computer.press_keys(keys=["Ctrl", "c"])
if result.success:
    print("Copy command sent")
# Output: Copy command sent
agent_bay.delete(session)
Release key
When hold=True is set for a key press, the cloud computer holds the key down. After the related operations are complete, the key must be released to avoid conflicts with other key operations.
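Because a held key must always be released, it can help to pair the two calls so that the release happens even when an intermediate operation raises. The following is a minimal sketch of that idea. held_keys is a hypothetical helper, not an SDK function, and it uses only press_keys(keys=..., hold=True) and release_keys(keys=...) as they appear in this topic.

```python
from contextlib import contextmanager


@contextmanager
def held_keys(computer, keys):
    """Hold keys for the duration of a with-block, then release them.

    Hypothetical helper built on press_keys(keys=..., hold=True) and
    release_keys(keys=...) as shown in this topic.
    """
    result = computer.press_keys(keys=keys, hold=True)
    if not result.success:
        raise RuntimeError(f"Could not hold keys: {keys}")
    try:
        yield
    finally:
        # Release even if the body raises, so no key stays stuck.
        computer.release_keys(keys=keys)
```

For example, "with held_keys(session.computer, ["Ctrl"]):" followed by several click_mouse calls would perform the clicks while Ctrl is held. The basic press-and-release sequence without a helper looks like this: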
session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session
# Hold down the Ctrl key
session.computer.press_keys(keys=["Ctrl"], hold=True)
# ... Perform other operations ...
# Release the Ctrl key
result = session.computer.release_keys(keys=["Ctrl"])
if result.success:
    print("Key release successful")
# Output: Key release successful
agent_bay.delete(session)
Screen operations
Screenshot
Capture a snapshot of the current screen. The screenshot is saved to cloud storage, and a download URL is returned.
session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session
result = session.computer.screenshot()
if result.success:
    screenshot_url = result.data
    print(f"Screenshot URL: {screenshot_url}")
# Output: Screenshot URL: https://***.***.aliyuncs.com/***/screenshot_1234567890.png?***
agent_bay.delete(session)
Get screen dimensions
import json
session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session
result = session.computer.get_screen_size()
if result.success:
    screen_data = json.loads(result.data)
    print(f"Screen width: {screen_data['width']}")
    print(f"Screen height: {screen_data['height']}")
    print(f"DPI scaling factor: {screen_data['dpiScalingFactor']}")
# Output: Screen width: 1024
# Output: Screen height: 768
# Output: DPI scaling factor: 1.0
agent_bay.delete(session)
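The screen dimensions can be used to derive coordinates for the mouse operations instead of hard-coding them. The sketch below parses the JSON string from result.data and computes a center point and a clamped point. The three function names are hypothetical, and the clamping assumes zero-based pixel coordinates with the origin at the top-left corner, which this topic does not state explicitly.

```python
import json


def parse_screen_size(data):
    """Return (width, height) from the JSON string in result.data."""
    info = json.loads(data)
    return int(info["width"]), int(info["height"])


def screen_center(width, height):
    """Return the integer pixel coordinates of the screen center."""
    return width // 2, height // 2


def clamp_to_screen(x, y, width, height):
    """Clamp (x, y) into [0, width - 1] x [0, height - 1].

    Assumption: coordinates are zero-based with the origin at top-left.
    """
    return max(0, min(x, width - 1)), max(0, min(y, height - 1))
```

For a 1024 x 768 screen, screen_center returns (512, 384), and clamp_to_screen(2000, -5, 1024, 768) returns (1023, 0). The resulting values can be passed as the x and y arguments of move_mouse or click_mouse.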
Troubleshooting
FAQ
"Tool not found" error
Ensure your environment is running on a cloud computer image, such as windows_latest or linux_latest.
How do I handle the download link (URL) returned after taking a screenshot?
Screenshots are automatically saved to cloud storage (OSS).
The result.data variable contains the download URL, not the raw image data itself.
Use this URL to download the screenshot file.
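Downloading the file needs nothing beyond the Python standard library. The following is a minimal sketch, assuming the URL in result.data can be fetched with a plain GET request and no extra headers. download_screenshot, dest_path, and the 30-second timeout are illustrative choices, not part of the SDK.

```python
import shutil
import urllib.request


def download_screenshot(url, dest_path, timeout=30):
    """Stream the file at url to dest_path and return dest_path.

    Hypothetical helper. Assumes the URL is directly fetchable with a
    plain GET request.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(dest_path, "wb") as out_file:
            # Copy in chunks so large images are not held fully in memory.
            shutil.copyfileobj(response, out_file)
    return dest_path
```

After a successful screenshot() call, download_screenshot(result.data, "screenshot.png") would save the image to the current directory.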