AgentBay: Desktop application automation

Last Updated: Mar 25, 2026

This topic describes how to use the AgentBay SDK to automate mouse, keyboard, and screen operations in cloud computer environments.

Overview

The Computer Use module in the AgentBay SDK provides UI automation for cloud computers. It supports three categories of operations:

  • Mouse operations - Click, move, drag, and scroll with precise coordinate control.

  • Keyboard operations - Type text and send key combinations (keyboard shortcuts).

  • Screen operations - Capture screenshots and retrieve screen dimensions.

Method summary

| Category | Method | Description |
| --- | --- | --- |
| Mouse | click_mouse | Click at specified coordinates. |
| Mouse | move_mouse | Move cursor without clicking. |
| Mouse | drag_mouse | Drag from one point to another. |
| Mouse | scroll | Scroll the mouse wheel at a location. |
| Mouse | get_cursor_position | Get current cursor coordinates. |
| Keyboard | input_text | Type a string at the cursor position. |
| Keyboard | press_keys | Send key combinations or shortcuts. |
| Keyboard | release_keys | Release previously held keys. |
| Screen | screenshot | Capture the screen and get a download URL. |
| Screen | get_screen_size | Get screen dimensions and DPI scaling. |

Coordinate system

All mouse and screen operations use a pixel-based coordinate system. The origin (0, 0) is at the top-left corner of the screen. The X axis increases to the right, and the Y axis increases downward. Use get_screen_size() to determine the available screen dimensions, and get_cursor_position() to find the current cursor location.
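
As a minimal sketch of working in this coordinate system, the helper below converts a fractional position (for example, halfway across and a quarter of the way down) into pixel coordinates. The name to_pixels is hypothetical and not part of the SDK. The width and height are assumed to come from get_screen_size(), and the sketch assumes the last addressable pixel is (width - 1, height - 1).

```python
def to_pixels(rel_x: float, rel_y: float, width: int, height: int) -> tuple[int, int]:
    """Map fractions in [0.0, 1.0] to pixel coordinates with a top-left origin."""
    if not (0.0 <= rel_x <= 1.0 and 0.0 <= rel_y <= 1.0):
        raise ValueError("rel_x and rel_y must be between 0.0 and 1.0")
    # Assumption: valid pixels run from 0 to width - 1 and 0 to height - 1.
    x = round(rel_x * (width - 1))
    y = round(rel_y * (height - 1))
    return x, y


# The corners map to the first and last pixel on each axis.
print(to_pixels(0.0, 0.0, 1024, 768))  # (0, 0)
print(to_pixels(1.0, 1.0, 1024, 768))  # (1023, 767)
```

The returned pair can then be passed as x and y to any of the mouse methods described below.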

Result objects

Operations in the Computer Use module return one of two result types:

BoolResult (returned by click_mouse, move_mouse, drag_mouse, scroll, input_text, press_keys, and release_keys):

| Property | Type | Description |
| --- | --- | --- |
| success | bool | Whether the operation completed successfully. |
| data | bool or None | The boolean result of the operation. |
| error_message | str | Error details if the operation failed. Empty string on success. |

OperationResult (returned by screenshot, get_cursor_position, and get_screen_size):

| Property | Type | Description |
| --- | --- | --- |
| success | bool | Whether the operation completed successfully. |
| data | varies | Operation-specific data. For get_cursor_position and get_screen_size, this is a JSON string that you must parse with json.loads(). For screenshot, this is a download URL string. |
| error_message | str | Error details if the operation failed. Empty string on success. |
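
Because both result types expose success, data, and error_message, one small helper can cover the check-then-parse pattern for the JSON-returning methods. The following is a sketch only: parse_json_data is a hypothetical helper, and StubResult is a local stand-in that mimics the documented properties so the example runs without a session. Neither is part of the SDK.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class StubResult:
    """Stand-in with the three documented properties; not an SDK class."""
    success: bool
    data: Any = None
    error_message: str = ""


def parse_json_data(result) -> dict:
    """Return result.data parsed as JSON, or raise if the operation failed."""
    if not result.success:
        raise RuntimeError(f"Operation failed: {result.error_message}")
    return json.loads(result.data)


ok = StubResult(success=True, data='{"x": 512, "y": 384}')
print(parse_json_data(ok))  # {'x': 512, 'y': 384}
```

In real code, you would pass the object returned by get_cursor_position() or get_screen_size() instead of the stub.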

Prerequisites

Before you begin, make sure you have:

  • The AgentBay SDK installed (pip install wuying-agentbay-sdk) and configured with valid credentials.

  • A supported cloud computer image (windows_latest or linux_latest).

Create a session

Before you can perform any UI automation, you must create a session connected to a cloud computer.

from agentbay import AgentBay, CreateSessionParams

agent_bay = AgentBay()
# Use windows_latest or linux_latest
session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session

When you are finished, delete the session to release resources:

agent_bay.delete(session)

All code examples in the following sections assume that you have already created an agent_bay instance and an active session as shown above. Session creation and deletion code is omitted for brevity.
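
If an automation step raises an exception, the delete call can be skipped and the session is left running. One way to guard against this is a context manager that always deletes the session. This is a sketch, not an SDK feature: managed_session is a hypothetical helper that relies only on the create() and delete() calls shown above.

```python
from contextlib import contextmanager


@contextmanager
def managed_session(agent_bay, session_params):
    """Create a session, yield it, and delete it even if the body raises."""
    session = agent_bay.create(session_params).session
    try:
        yield session
    finally:
        # Runs on normal exit and on exceptions, so resources are released.
        agent_bay.delete(session)
```

You would use it as `with managed_session(agent_bay, session_params) as session:` and place the automation calls inside the block.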

Mouse operations

click_mouse

Performs a mouse click at the specified screen coordinates.

Signature

session.computer.click_mouse(x, y, button=MouseButton.LEFT)

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| x | int | Yes | - | Horizontal position in pixels. |
| y | int | Yes | - | Vertical position in pixels. |
| button | MouseButton | No | MouseButton.LEFT | The mouse button to use. |

MouseButton enum values

Import: from agentbay import MouseButton

| Value | Description |
| --- | --- |
| MouseButton.LEFT | Standard left-click. |
| MouseButton.RIGHT | Right-click (context menu). |
| MouseButton.MIDDLE | Middle-click (scroll wheel button). |
| MouseButton.DOUBLE_LEFT | Double left-click. |

Returns: A BoolResult object. Check result.success to confirm the click was registered.

Example

from agentbay import MouseButton

# Left-click (default button)
result = session.computer.click_mouse(x=500, y=300)
if result.success:
    print("Left-click successful")
# Output: Left-click successful

# Right-click
result = session.computer.click_mouse(x=500, y=300, button=MouseButton.RIGHT)
if result.success:
    print("Right-click successful")
# Output: Right-click successful

# Middle-click
result = session.computer.click_mouse(x=500, y=300, button=MouseButton.MIDDLE)
if result.success:
    print("Middle-click successful")
# Output: Middle-click successful

# Double left-click
result = session.computer.click_mouse(x=500, y=300, button=MouseButton.DOUBLE_LEFT)
if result.success:
    print("Double left-click successful")
# Output: Double left-click successful

move_mouse

Moves the mouse cursor to the specified coordinates without clicking.

Signature

session.computer.move_mouse(x, y)

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| x | int | Yes | Horizontal position in pixels. |
| y | int | Yes | Vertical position in pixels. |

Returns: A BoolResult object.

Example

result = session.computer.move_mouse(x=600, y=400)
if result.success:
    print("Mouse move successful")
# Output: Mouse move successful

drag_mouse

Drags the mouse from one point to another while holding the specified button.

Signature

session.computer.drag_mouse(from_x, from_y, to_x, to_y, button=MouseButton.LEFT)

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| from_x | int | Yes | - | Starting horizontal position in pixels. |
| from_y | int | Yes | - | Starting vertical position in pixels. |
| to_x | int | Yes | - | Ending horizontal position in pixels. |
| to_y | int | Yes | - | Ending vertical position in pixels. |
| button | MouseButton | No | MouseButton.LEFT | The mouse button to hold during the drag. |

Supported button values for drag: MouseButton.LEFT, MouseButton.RIGHT, MouseButton.MIDDLE.

MouseButton.DOUBLE_LEFT is not supported for drag operations.

Returns: A BoolResult object.

Example

from agentbay import MouseButton

result = session.computer.drag_mouse(
    from_x=100,
    from_y=100,
    to_x=200,
    to_y=200,
    button=MouseButton.LEFT
)
if result.success:
    print("Drag operation successful")
# Output: Drag operation successful
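
Because MouseButton.DOUBLE_LEFT is not supported for drag operations, it can help to reject it before making the call. The sketch below checks the member name against the documented list. It assumes that MouseButton is a standard Python Enum exposing a .name attribute, which this topic does not confirm. StandInButton is a local stand-in with placeholder values, and ensure_drag_button is a hypothetical helper.

```python
from enum import Enum


class StandInButton(Enum):
    """Local stand-in mirroring the documented member names; values are placeholders."""
    LEFT = "left"
    RIGHT = "right"
    MIDDLE = "middle"
    DOUBLE_LEFT = "double_left"


# Member names documented as supported for drag operations.
DRAG_BUTTON_NAMES = {"LEFT", "RIGHT", "MIDDLE"}


def ensure_drag_button(button) -> None:
    """Raise ValueError if the button is not one documented for drag_mouse."""
    if button.name not in DRAG_BUTTON_NAMES:
        raise ValueError(f"{button.name} is not supported for drag operations")


ensure_drag_button(StandInButton.LEFT)  # passes silently
```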

scroll

Scrolls the mouse wheel at a specific location on the screen.

Signature

session.computer.scroll(x, y, direction=ScrollDirection.UP, amount=1)

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| x | int | Yes | - | Horizontal position where the scroll occurs, in pixels. |
| y | int | Yes | - | Vertical position where the scroll occurs, in pixels. |
| direction | ScrollDirection | No | ScrollDirection.UP | The direction to scroll. |
| amount | int | No | 1 | Number of scroll increments. |

ScrollDirection enum values

Import: from agentbay import ScrollDirection

| Value | Description |
| --- | --- |
| ScrollDirection.UP | Scroll upward. |
| ScrollDirection.DOWN | Scroll downward. |
| ScrollDirection.LEFT | Scroll left. |
| ScrollDirection.RIGHT | Scroll right. |

Returns: A BoolResult object.

Example

from agentbay import ScrollDirection

# Scroll up
result = session.computer.scroll(x=500, y=500, direction=ScrollDirection.UP, amount=3)
if result.success:
    print("Scroll up successful")
# Output: Scroll up successful

# Scroll down
result = session.computer.scroll(x=500, y=500, direction=ScrollDirection.DOWN, amount=5)
if result.success:
    print("Scroll down successful")
# Output: Scroll down successful

get_cursor_position

Returns the current position of the mouse cursor.

Signature

session.computer.get_cursor_position()

Parameters: None.

Returns: An OperationResult object. When result.success is True, result.data contains a JSON string with x and y fields.

Example

import json

result = session.computer.get_cursor_position()
if result.success:
    cursor_data = json.loads(result.data)
    print(f"Cursor position: x={cursor_data['x']}, y={cursor_data['y']}")
# Output: Cursor position: x=512, y=384

Keyboard operations

input_text

Types a string of text at the current cursor position.

Signature

session.computer.input_text(text)

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | str | Yes | The text to type. |

Returns: A BoolResult object.

Example

result = session.computer.input_text("Hello AgentBay!")
if result.success:
    print("Text input successful")
# Output: Text input successful

press_keys

Sends one or more keys simultaneously, with support for modifier keys. Use this method for keyboard shortcuts such as Ctrl+C or Alt+Tab.

Signature

session.computer.press_keys(keys, hold=False)

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| keys | list[str] | Yes | - | A list of key names to press simultaneously. |
| hold | bool | No | False | When set to True, the keys are held down instead of being pressed and released. You must call release_keys() afterward. |

Returns: A BoolResult object.

Example

# Press Ctrl+A to select all
result = session.computer.press_keys(keys=["Ctrl", "a"])
if result.success:
    print("Key press successful")
# Output: Key press successful

# Press Ctrl+C to copy
result = session.computer.press_keys(keys=["Ctrl", "c"])
if result.success:
    print("Copy command sent")
# Output: Copy command sent
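
If your shortcuts are stored as strings such as "Ctrl+a", a small parser can turn them into the list that press_keys expects. This is a sketch under assumptions: shortcut_to_keys is a hypothetical helper, and it passes key names through unchanged, so it only works for names that the SDK accepts (this topic shows "Ctrl", "Shift", and single letters).

```python
def shortcut_to_keys(shortcut: str) -> list[str]:
    """Split a 'Ctrl+Shift+s' style string into a list of key names."""
    keys = [part.strip() for part in shortcut.split("+")]
    # An empty part means a leading, trailing, or doubled '+'.
    if not all(keys):
        raise ValueError(f"Malformed shortcut: {shortcut!r}")
    return keys


print(shortcut_to_keys("Ctrl+a"))  # ['Ctrl', 'a']
```

The result can be passed directly as the keys argument, for example `session.computer.press_keys(keys=shortcut_to_keys("Ctrl+c"))`.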

release_keys

Releases keys that were previously held down with press_keys(hold=True). Always release held keys when you are done to prevent them from interfering with subsequent operations.

Signature

session.computer.release_keys(keys)

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| keys | list[str] | Yes | A list of key names to release. |

Returns: A BoolResult object.

Example

# Hold down the Ctrl key
session.computer.press_keys(keys=["Ctrl"], hold=True)

# ... Perform other operations ...

# Release the Ctrl key
result = session.computer.release_keys(keys=["Ctrl"])
if result.success:
    print("Key release successful")
# Output: Key release successful
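
To make sure that held keys are released even when an intermediate step fails, the hold and release calls can be wrapped in a context manager. The following is a sketch, not an SDK feature: held_keys is a hypothetical helper, and its computer argument is assumed to be session.computer. It uses only the press_keys and release_keys calls documented above.

```python
from contextlib import contextmanager


@contextmanager
def held_keys(computer, keys):
    """Hold keys for the duration of the with-block, then release them."""
    result = computer.press_keys(keys=keys, hold=True)
    if not result.success:
        raise RuntimeError(f"Could not hold {keys}: {result.error_message}")
    try:
        yield
    finally:
        # Release on normal exit and on exceptions alike.
        computer.release_keys(keys=keys)
```

For example, `with held_keys(session.computer, ["Shift"]):` followed by a series of click_mouse calls would hold Shift for all of the clicks and release it afterward.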

Screen operations

screenshot

Captures the current screen and returns a download URL. The screenshot is saved to Object Storage Service (OSS), and a URL is returned that you can use to download the image file.

Signature

session.computer.screenshot()

Parameters: None.

Returns: An OperationResult object. When result.success is True, result.data contains the download URL for the screenshot image (not the raw image data).

Example

result = session.computer.screenshot()
if result.success:
    screenshot_url = result.data
    print(f"Screenshot URL: {screenshot_url}")
# Output: Screenshot URL: https://***.***.aliyuncs.com/***/screenshot_1234567890.png?***
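
Because result.data is a URL and not image bytes, saving the screenshot locally takes one more step. The sketch below streams the URL to a file by using only the Python standard library. download_screenshot is a hypothetical helper, and it assumes that the returned URL can be fetched with a plain GET request and no additional headers.

```python
import shutil
import urllib.request
from pathlib import Path


def download_screenshot(url: str, dest: str, timeout: float = 30.0) -> Path:
    """Stream the image at url to dest and return the destination path."""
    dest_path = Path(dest)
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            dest_path.open("wb") as out:
        # Copy in chunks so large images are not held in memory at once.
        shutil.copyfileobj(response, out)
    return dest_path
```

You would call it as `download_screenshot(result.data, "screenshot.png")` after checking result.success.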

get_screen_size

Returns the screen dimensions and display scale factor of the cloud computer.

Signature

session.computer.get_screen_size()

Parameters: None.

Returns: An OperationResult object. When result.success is True, result.data contains a JSON string with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| width | int | Screen width in pixels. |
| height | int | Screen height in pixels. |
| dpiScalingFactor | float | The display scale factor (DPI scaling). A value of 1.0 means 100% scaling (96 DPI). |

Example

import json

result = session.computer.get_screen_size()
if result.success:
    screen_data = json.loads(result.data)
    print(f"Screen width: {screen_data['width']}")
    print(f"Screen height: {screen_data['height']}")
    print(f"DPI scaling factor: {screen_data['dpiScalingFactor']}")
# Output: Screen width: 1024
# Output: Screen height: 768
# Output: DPI scaling factor: 1.0
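
The parsed fields are often used to derive other values, such as the screen center for a click target. The sketch below works on a sample JSON string so that it runs without a session. describe_screen is a hypothetical helper, and the effective DPI calculation assumes that DPI scales linearly from the 96 DPI baseline stated for a factor of 1.0.

```python
import json

BASE_DPI = 96  # DPI at 100% scaling, per the dpiScalingFactor description


def describe_screen(data: str) -> dict:
    """Derive the center point and effective DPI from get_screen_size data."""
    info = json.loads(data)
    return {
        "center": (info["width"] // 2, info["height"] // 2),
        # Assumption: effective DPI is the baseline multiplied by the factor.
        "effective_dpi": BASE_DPI * info["dpiScalingFactor"],
    }


sample = '{"width": 1024, "height": 768, "dpiScalingFactor": 1.0}'
print(describe_screen(sample))  # {'center': (512, 384), 'effective_dpi': 96.0}
```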

Troubleshooting

"Tool not found" error

Symptom: You receive a "Tool not found" error when calling a Computer Use method.

Cause: The session is not running on a supported cloud computer image.

Solution: When you create a session, make sure the image_id parameter is set to a valid value such as windows_latest or linux_latest.

# Correct - use a supported image ID
session_params = CreateSessionParams(image_id="windows_latest")

Screenshot returns a URL, not image data

Symptom: result.data from screenshot() contains a URL string instead of raw image bytes.

Cause: This is the expected behavior. The screenshot() method stores the image in Object Storage Service (OSS) and returns a download URL.

Solution: Use the returned URL to download the image with an HTTP client or browser.

Held keys not released

Symptom: Subsequent keyboard operations behave unexpectedly after using press_keys(hold=True).

Cause: Keys held with press_keys(hold=True) remain active until explicitly released.

Solution: Always pair press_keys(hold=True) with a corresponding release_keys() call.

# Hold a key
session.computer.press_keys(keys=["Shift"], hold=True)

# Perform operations that need Shift held...

# Always release afterward
session.computer.release_keys(keys=["Shift"])

Coordinates out of bounds

Symptom: A mouse operation produces unexpected results or no visible effect.

Cause: The target coordinates may be outside the screen area. The SDK does not validate coordinate ranges on the client side; coordinates are sent directly to the cloud computer.

Solution: Call get_screen_size() first to determine the valid coordinate range, and make sure that 0 <= x < width and 0 <= y < height.
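
Because the SDK does not validate coordinates on the client side, you can add your own check before each mouse call. The following sketch clamps a point into the screen area. clamp_point is a hypothetical helper, and it assumes that the last addressable pixel is (width - 1, height - 1), with width and height taken from get_screen_size().

```python
def clamp_point(x: int, y: int, width: int, height: int) -> tuple[int, int]:
    """Clamp (x, y) into the on-screen range 0..width-1 by 0..height-1."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return max(0, min(x, width - 1)), max(0, min(y, height - 1))


print(clamp_point(1500, -20, 1024, 768))  # (1023, 0)
```

Clamping silently moves the target, so for clicks on specific controls it may be safer to raise an error when a point falls outside the screen instead of adjusting it.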