Automate Mouse, Keyboard & Screenshots via AgentBay SDK

This topic describes how to use the AgentBay SDK to automate mouse, keyboard, and screen operations in cloud computer environments.

Overview

The Computer Use module in the AgentBay SDK provides UI automation for cloud computers. It supports three categories of operations:

Mouse operations - Click, move, drag, and scroll with precise coordinate control.
Keyboard operations - Type text and send key combinations (keyboard shortcuts).
Screen operations - Capture screenshots and retrieve screen dimensions.

Method summary

Category	Method	Description
Mouse	`click_mouse`	Click at specified coordinates.
Mouse	`move_mouse`	Move cursor without clicking.
Mouse	`drag_mouse`	Drag from one point to another.
Mouse	`scroll`	Scroll the mouse wheel at a location.
Mouse	`get_cursor_position`	Get current cursor coordinates.
Keyboard	`input_text`	Type a string at the cursor position.
Keyboard	`press_keys`	Send key combinations or shortcuts.
Keyboard	`release_keys`	Release previously held keys.
Screen	`screenshot`	Capture the screen and get a download URL.
Screen	`get_screen_size`	Get screen dimensions and DPI scaling.

Coordinate system

All mouse and screen operations use a pixel-based coordinate system. The origin (0, 0) is at the top-left corner of the screen. The X axis increases to the right, and the Y axis increases downward. Use get_screen_size() to determine the available screen dimensions, and get_cursor_position() to find the current cursor location.
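
As a minimal sketch of this coordinate system, the helper below pulls an out-of-range point back to the nearest on-screen pixel. The name clamp_to_screen is hypothetical and not part of the SDK. The width and height arguments are assumed to come from get_screen_size(), and valid pixels are assumed to run from 0 to width - 1 and from 0 to height - 1.

```python
def clamp_to_screen(x: int, y: int, width: int, height: int) -> tuple[int, int]:
    """Return the on-screen pixel nearest to (x, y).

    Hypothetical helper, not an SDK function. The origin (0, 0) is the
    top-left corner, so the bottom-right pixel is (width - 1, height - 1).
    """
    clamped_x = min(max(x, 0), width - 1)
    clamped_y = min(max(y, 0), height - 1)
    return (clamped_x, clamped_y)
```

For example, on a 1024 x 768 screen, clamp_to_screen(-5, 2000, 1024, 768) yields (0, 767), the bottom-left corner.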

Result objects

Operations in the Computer Use module return one of two result types:

BoolResult (returned by mouse clicks, movement, dragging, scrolling, text input, and key operations):

Property	Type	Description
`success`	`bool`	Whether the operation completed successfully.
`data`	`bool` or `None`	The boolean result of the operation.
`error_message`	`str`	Error details if the operation failed. Empty string on success.

OperationResult (returned by screenshot, get_cursor_position, and get_screen_size):

Property	Type	Description
`success`	`bool`	Whether the operation completed successfully.
`data`	varies	Operation-specific data. For `get_cursor_position` and `get_screen_size`, this is a JSON string that you must parse with `json.loads()`. For `screenshot`, this is a download URL string.
`error_message`	`str`	Error details if the operation failed. Empty string on success.
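
Because both result types share the success, data, and error_message properties, one small helper can handle them uniformly. The following is a sketch only: unwrap and OperationFailed are hypothetical names that the SDK does not provide, and the helper relies only on the three properties listed in the tables above.

```python
import json
from typing import Any


class OperationFailed(RuntimeError):
    """Hypothetical exception for a result whose success flag is False."""


def unwrap(result: Any, parse_json: bool = False) -> Any:
    """Return result.data, or raise OperationFailed if the call failed.

    Set parse_json=True for results whose data is a JSON string, such as
    those from get_cursor_position() and get_screen_size().
    """
    if not result.success:
        raise OperationFailed(result.error_message or "operation failed")
    return json.loads(result.data) if parse_json else result.data
```

With an active session, a call such as unwrap(session.computer.get_cursor_position(), parse_json=True) would return the parsed position or raise with the reported error message.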

Prerequisites

Before you begin, make sure you have:

The AgentBay SDK installed (pip install wuying-agentbay-sdk) and configured with valid credentials.
A supported cloud computer image (windows_latest or linux_latest).

Create a session

Before you can perform any UI automation, you must create a session connected to a cloud computer.

from agentbay import AgentBay, CreateSessionParams

agent_bay = AgentBay()
# Use windows_latest or linux_latest
session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session

When you are finished, delete the session to release resources:

agent_bay.delete(session)

All code examples in the following sections assume that you have already created an agent_bay instance and an active session as shown above. Session creation and deletion code is omitted for brevity.
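
To make sure a session is deleted even when automation code raises an exception, you can wrap creation and deletion in a context manager. This is one possible pattern, not an SDK feature: managed_session is a hypothetical name, and the None check assumes that a failed create() call leaves the session attribute empty.

```python
from contextlib import contextmanager


@contextmanager
def managed_session(agent_bay, session_params):
    """Create a session and delete it when the with-block exits.

    Hypothetical helper built on the create() and delete() calls shown above.
    """
    session = agent_bay.create(session_params).session
    if session is None:
        # Assumption: a failed create() call returns no session object.
        raise RuntimeError("session creation failed")
    try:
        yield session
    finally:
        # Runs on normal exit and on exceptions, so resources are released.
        agent_bay.delete(session)
```

Usage would look like: with managed_session(agent_bay, session_params) as session: followed by the calls on session.computer.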

Mouse operations

click_mouse

Performs a mouse click at the specified screen coordinates.

Signature

session.computer.click_mouse(x, y, button=MouseButton.LEFT)

Parameters

Parameter	Type	Required	Default	Description
`x`	`int`	Yes	-	Horizontal position in pixels.
`y`	`int`	Yes	-	Vertical position in pixels.
`button`	`MouseButton`	No	`MouseButton.LEFT`	The mouse button to use.

MouseButton enum values

Import: from agentbay import MouseButton

Value	Description
`MouseButton.LEFT`	Standard left-click.
`MouseButton.RIGHT`	Right-click (context menu).
`MouseButton.MIDDLE`	Middle-click (scroll wheel button).
`MouseButton.DOUBLE_LEFT`	Double left-click.

Returns: A BoolResult object. Check result.success to confirm the click was registered.

Example

from agentbay import MouseButton

# Left-click (default button)
result = session.computer.click_mouse(x=500, y=300)
if result.success:
    print("Left-click successful")
# Output: Left-click successful

# Right-click
result = session.computer.click_mouse(x=500, y=300, button=MouseButton.RIGHT)
if result.success:
    print("Right-click successful")
# Output: Right-click successful

# Middle-click
result = session.computer.click_mouse(x=500, y=300, button=MouseButton.MIDDLE)
if result.success:
    print("Middle-click successful")
# Output: Middle-click successful

# Double left-click
result = session.computer.click_mouse(x=500, y=300, button=MouseButton.DOUBLE_LEFT)
if result.success:
    print("Double left-click successful")
# Output: Double left-click successful

move_mouse

Moves the mouse cursor to the specified coordinates without clicking.

Signature

session.computer.move_mouse(x, y)

Parameters

Parameter	Type	Required	Description
`x`	`int`	Yes	Horizontal position in pixels.
`y`	`int`	Yes	Vertical position in pixels.

Returns: A BoolResult object.

Example

result = session.computer.move_mouse(x=600, y=400)
if result.success:
    print("Mouse move successful")
# Output: Mouse move successful

drag_mouse

Drags the mouse from one point to another while holding the specified button.

Signature

session.computer.drag_mouse(from_x, from_y, to_x, to_y, button=MouseButton.LEFT)

Parameters

Parameter	Type	Required	Default	Description
`from_x`	`int`	Yes	-	Starting horizontal position in pixels.
`from_y`	`int`	Yes	-	Starting vertical position in pixels.
`to_x`	`int`	Yes	-	Ending horizontal position in pixels.
`to_y`	`int`	Yes	-	Ending vertical position in pixels.
`button`	`MouseButton`	No	`MouseButton.LEFT`	The mouse button to hold during the drag.

Supported button values for drag: MouseButton.LEFT, MouseButton.RIGHT, MouseButton.MIDDLE.

MouseButton.DOUBLE_LEFT is not supported for drag operations.

Returns: A BoolResult object.

Example

from agentbay import MouseButton

result = session.computer.drag_mouse(
    from_x=100,
    from_y=100,
    to_x=200,
    to_y=200,
    button=MouseButton.LEFT
)
if result.success:
    print("Drag operation successful")
# Output: Drag operation successful

scroll

Scrolls the mouse wheel at a specific location on the screen.

Signature

session.computer.scroll(x, y, direction=ScrollDirection.UP, amount=1)

Parameters

Parameter	Type	Required	Default	Description
`x`	`int`	Yes	-	Horizontal position where the scroll occurs, in pixels.
`y`	`int`	Yes	-	Vertical position where the scroll occurs, in pixels.
`direction`	`ScrollDirection`	No	`ScrollDirection.UP`	The direction to scroll.
`amount`	`int`	No	`1`	Number of scroll increments.

ScrollDirection enum values

Import: from agentbay import ScrollDirection

Value	Description
`ScrollDirection.UP`	Scroll upward.
`ScrollDirection.DOWN`	Scroll downward.
`ScrollDirection.LEFT`	Scroll left.
`ScrollDirection.RIGHT`	Scroll right.

Returns: A BoolResult object.

Example

from agentbay import ScrollDirection

# Scroll up
result = session.computer.scroll(x=500, y=500, direction=ScrollDirection.UP, amount=3)
if result.success:
    print("Scroll up successful")
# Output: Scroll up successful

# Scroll down
result = session.computer.scroll(x=500, y=500, direction=ScrollDirection.DOWN, amount=5)
if result.success:
    print("Scroll down successful")
# Output: Scroll down successful

get_cursor_position

Returns the current position of the mouse cursor.

Signature

session.computer.get_cursor_position()

Parameters: None.

Returns: An OperationResult object. When result.success is True, result.data contains a JSON string with x and y fields.

Example

import json

result = session.computer.get_cursor_position()
if result.success:
    cursor_data = json.loads(result.data)
    print(f"Cursor position: x={cursor_data['x']}, y={cursor_data['y']}")
# Output: Cursor position: x=512, y=384
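
One use for the parsed position is moving the cursor relative to where it currently is. The sketch below combines get_cursor_position() and move_mouse() for that purpose. move_relative is a hypothetical helper, and its computer argument is assumed to be session.computer.

```python
import json


def move_relative(computer, dx: int, dy: int):
    """Move the cursor by (dx, dy) pixels from its current position.

    Hypothetical helper: returns the failed result from
    get_cursor_position() unchanged, otherwise the result of move_mouse().
    """
    position = computer.get_cursor_position()
    if not position.success:
        return position  # Let the caller inspect error_message.
    current = json.loads(position.data)
    return computer.move_mouse(x=current["x"] + dx, y=current["y"] + dy)
```

The helper does not check that the new position is on screen; add a bounds check if the offset can push the cursor past an edge.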

Keyboard operations

input_text

Types a string of text at the current cursor position.

Signature

session.computer.input_text(text)

Parameters

Parameter	Type	Required	Description
`text`	`str`	Yes	The text to type.

Returns: A BoolResult object.

Example

result = session.computer.input_text("Hello AgentBay!")
if result.success:
    print("Text input successful")
# Output: Text input successful

press_keys

Sends one or more keys simultaneously, with support for modifier keys. Use this method for keyboard shortcuts such as Ctrl+C or Alt+Tab.

Signature

session.computer.press_keys(keys, hold=False)

Parameters

Parameter	Type	Required	Default	Description
`keys`	`list[str]`	Yes	-	A list of key names to press simultaneously.
`hold`	`bool`	No	`False`	When set to `True`, the keys are held down instead of being pressed and released. You must call `release_keys()` afterward.

Returns: A BoolResult object.

Example

# Press Ctrl+A to select all
result = session.computer.press_keys(keys=["Ctrl", "a"])
if result.success:
    print("Key press successful")
# Output: Key press successful

# Press Ctrl+C to copy
result = session.computer.press_keys(keys=["Ctrl", "c"])
if result.success:
    print("Copy command sent")
# Output: Copy command sent

release_keys

Releases keys that were previously held down with press_keys(hold=True). Always release held keys when you are done to prevent them from interfering with subsequent operations.

Signature

session.computer.release_keys(keys)

Parameters

Parameter	Type	Required	Description
`keys`	`list[str]`	Yes	A list of key names to release.

Returns: A BoolResult object.

Example

# Hold down the Ctrl key
session.computer.press_keys(keys=["Ctrl"], hold=True)

# ... Perform other operations ...

# Release the Ctrl key
result = session.computer.release_keys(keys=["Ctrl"])
if result.success:
    print("Key release successful")
# Output: Key release successful
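
If the operations between the hold and the release can fail, a context manager helps guarantee that the keys are released. The following sketch uses only press_keys(hold=True) and release_keys() as described above. held_keys is a hypothetical name, not an SDK function.

```python
from contextlib import contextmanager


@contextmanager
def held_keys(computer, keys):
    """Hold the given keys for the duration of a with-block.

    Hypothetical helper; `computer` is assumed to be session.computer.
    """
    result = computer.press_keys(keys=keys, hold=True)
    if not result.success:
        raise RuntimeError(f"could not hold {keys}: {result.error_message}")
    try:
        yield
    finally:
        # Release even if the body raises, so no key stays held down.
        computer.release_keys(keys=keys)
```

For example, with held_keys(session.computer, ["Ctrl"]): followed by several click_mouse() calls would perform the clicks with Ctrl held and then release it.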

Screen operations

screenshot

Captures the current screen, saves the image to Object Storage Service (OSS), and returns a URL that you can use to download the image file.

Signature

session.computer.screenshot()

Parameters: None.

Returns: An OperationResult object. When result.success is True, result.data contains the download URL for the screenshot image (not the raw image data).

Example

result = session.computer.screenshot()
if result.success:
    screenshot_url = result.data
    print(f"Screenshot URL: {screenshot_url}")
# Output: Screenshot URL: https://***.***.aliyuncs.com/***/screenshot_1234567890.png?***
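
To save the image locally, fetch the returned URL with any HTTP client. The sketch below uses only the Python standard library. download_screenshot is a hypothetical helper, and it assumes the URL can be fetched directly without additional authentication headers.

```python
import shutil
import urllib.request
from pathlib import Path


def download_screenshot(url: str, destination: str, timeout: float = 30.0) -> Path:
    """Stream the file at `url` to `destination` and return the saved path.

    Hypothetical helper; assumes the URL needs no extra request headers.
    """
    target = Path(destination)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with target.open("wb") as out:
            # Copy in chunks so large images are not held fully in memory.
            shutil.copyfileobj(response, out)
    return target
```

With a successful screenshot result, download_screenshot(result.data, "screenshot.png") would write the image to the current directory.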

get_screen_size

Returns the screen dimensions and display scale factor of the cloud computer.

Signature

session.computer.get_screen_size()

Parameters: None.

Returns: An OperationResult object. When result.success is True, result.data contains a JSON string with the following fields:

Field	Type	Description
`width`	`int`	Screen width in pixels.
`height`	`int`	Screen height in pixels.
`dpiScalingFactor`	`float`	The display scale factor (DPI scaling). A value of `1.0` means 100% scaling (96 DPI).

Example

import json

result = session.computer.get_screen_size()
if result.success:
    screen_data = json.loads(result.data)
    print(f"Screen width: {screen_data['width']}")
    print(f"Screen height: {screen_data['height']}")
    print(f"DPI scaling factor: {screen_data['dpiScalingFactor']}")
# Output: Screen width: 1024
# Output: Screen height: 768
# Output: DPI scaling factor: 1.0
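
A common follow-up is deriving target coordinates from the screen size instead of hard-coding them. As a small sketch, the hypothetical helper screen_center below parses the JSON string from get_screen_size() and returns the center point.

```python
import json


def screen_center(screen_json: str) -> tuple[int, int]:
    """Return the center point of the screen described by `screen_json`.

    Hypothetical helper; expects the JSON string from get_screen_size(),
    which contains integer `width` and `height` fields.
    """
    screen = json.loads(screen_json)
    return (screen["width"] // 2, screen["height"] // 2)
```

For the 1024 x 768 screen in the example above, screen_center(result.data) returns (512, 384), which you can pass to click_mouse() or move_mouse().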

Troubleshooting

"Tool not found" error

Symptom: You receive a "Tool not found" error when calling a Computer Use method.

Cause: The session is not running on a supported cloud computer image.

Solution: When you create a session, make sure the image_id parameter is set to a valid value such as windows_latest or linux_latest.

from agentbay import CreateSessionParams

# Correct - use a supported image ID
session_params = CreateSessionParams(image_id="windows_latest")

Screenshot returns a URL, not image data

Symptom: result.data from screenshot() contains a URL string instead of raw image bytes.

Cause: This is the expected behavior. The screenshot() method stores the image in Object Storage Service (OSS) and returns a download URL.

Solution: Use the returned URL to download the image with an HTTP client or browser.

Held keys not released

Symptom: Subsequent keyboard operations behave unexpectedly after using press_keys(hold=True).

Cause: Keys held with press_keys(hold=True) remain active until explicitly released.

Solution: Always pair press_keys(hold=True) with a corresponding release_keys() call.

# Hold a key
session.computer.press_keys(keys=["Shift"], hold=True)

# Perform operations that need Shift held...

# Always release afterward
session.computer.release_keys(keys=["Shift"])

Coordinates out of bounds

Symptom: A mouse operation produces unexpected results or no visible effect.

Cause: The target coordinates may be outside the screen area. The SDK does not validate coordinate ranges on the client side; coordinates are sent directly to the cloud computer.

Solution: Call get_screen_size() first to determine the valid coordinate range, and make sure that 0 <= x < width and 0 <= y < height. The bottom-right pixel is (width - 1, height - 1), not (width, height).
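
Because the SDK does not validate coordinates, you may want to add a client-side check of your own. The following is one possible sketch: click_if_on_screen is a hypothetical wrapper, not an SDK method, and it treats the last valid pixel as (width - 1, height - 1).

```python
import json


def click_if_on_screen(computer, x: int, y: int):
    """Click at (x, y) only if the point lies inside the reported screen.

    Hypothetical wrapper; `computer` is assumed to be session.computer.
    Raises ValueError for off-screen points instead of sending them.
    """
    size = computer.get_screen_size()
    if not size.success:
        raise RuntimeError(f"get_screen_size failed: {size.error_message}")
    screen = json.loads(size.data)
    if not (0 <= x < screen["width"] and 0 <= y < screen["height"]):
        raise ValueError(
            f"({x}, {y}) is outside the {screen['width']}x{screen['height']} screen"
        )
    return computer.click_mouse(x=x, y=y)
```

This wrapper queries the screen size on every call. If you click often, fetch the size once and reuse it.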