UI automation

This topic describes the UI automation features of the AgentBay SDK for cloud phones. These features include touch operations, text input, UI element detection, and screen operations.

Overview

The AgentBay SDK for cloud phones provides the following UI automation features:

Touch operations: Perform tap and swipe gestures to interact with the cloud phone.
Text input: Enter text and send hardware keypress events.
UI element detection: Find and interact with UI elements.
Screen operations: Capture screenshots for visual verification.

Create a session

from agentbay import AgentBay
from agentbay.session_params import CreateSessionParams

agent_bay = AgentBay()
session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session
# The session is created. You can now automate the cloud phone.
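
Every snippet in this topic ends with agent_bay.delete(session). As a minimal sketch, that create/delete pair can be wrapped in a context manager so the session is released even when an automation step raises. The helper name cloud_phone_session is hypothetical and not part of the SDK; it relies only on the create(...).session and delete(session) calls shown above.

```python
from contextlib import contextmanager


@contextmanager
def cloud_phone_session(agent_bay, session_params):
    """Create a session and delete it when the block exits, even on error.

    Hypothetical helper, not part of the AgentBay SDK.
    """
    # Same calls as in the snippet above: create(...).session, then delete(session).
    session = agent_bay.create(session_params).session
    try:
        yield session
    finally:
        agent_bay.delete(session)
```

With the objects from the snippet above, usage would look like `with cloud_phone_session(agent_bay, session_params) as session:` followed by the automation calls in the indented block.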

Touch operations

Tap gestures

You can tap the screen at specific coordinates:

session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session

# Tap at the specified coordinates.
result = session.mobile.tap(x=500, y=300)
if result.success:
    print("Tap successful")  # Output: Tap successful
else:
    print(f"Tap failed: {result.error_message}")

agent_bay.delete(session)

Swipe gestures

You can perform a swipe gesture from one point to another:

session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session

# Swipe up (from bottom to top).
result = session.mobile.swipe(
    start_x=100,
    start_y=500,
    end_x=100,
    end_y=200,
    duration_ms=300
)
if result.success:
    print("Swipe up successful")  # Output: Swipe up successful

# Swipe left (from right to left).
result = session.mobile.swipe(
    start_x=500,
    start_y=300,
    end_x=100,
    end_y=300,
    duration_ms=300
)
if result.success:
    print("Swipe left successful")  # Output: Swipe left successful

agent_bay.delete(session)

Parameters:

start_x, start_y: The starting coordinates.
end_x, end_y: The ending coordinates.
duration_ms: The duration of the swipe in milliseconds. The default value is 300.
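
The swipe direction is determined entirely by the start and end coordinates. The following sketch derives those coordinates from a direction name. The function swipe_coordinates is hypothetical, and width and height are assumptions that the caller must supply, because this topic does not describe a way to query the screen size.

```python
def swipe_coordinates(direction, width, height, margin=0.2):
    """Return swipe() keyword arguments for a centered directional swipe.

    Hypothetical helper. width and height are the screen size in pixels,
    supplied by the caller (an assumption, not read from the SDK).
    margin is the fraction of the screen left untouched at each edge.
    """
    if not 0 <= margin < 0.5:
        raise ValueError("margin must be in [0, 0.5)")
    left = int(width * margin)
    right = width - left
    top = int(height * margin)
    bottom = height - top
    mid_x, mid_y = width // 2, height // 2
    # The finger moves in the named direction: "up" goes from bottom to top.
    paths = {
        "up": (mid_x, bottom, mid_x, top),
        "down": (mid_x, top, mid_x, bottom),
        "left": (right, mid_y, left, mid_y),
        "right": (left, mid_y, right, mid_y),
    }
    if direction not in paths:
        raise ValueError(f"unknown direction: {direction!r}")
    start_x, start_y, end_x, end_y = paths[direction]
    return {"start_x": start_x, "start_y": start_y, "end_x": end_x, "end_y": end_y}
```

Because the keys match the swipe parameter names, the result can be unpacked directly, for example `session.mobile.swipe(**swipe_coordinates("up", 1080, 1920), duration_ms=300)`, where 1080 and 1920 are placeholder dimensions.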

Text input

Input text

You can input text into the currently active input field:

session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session

result = session.mobile.input_text("Hello AgentBay!")
if result.success:
    print("Text input successful")  # Output: Text input successful

agent_bay.delete(session)

Send hardware keypress events

You can send Android hardware keypress events using KeyCode constants:

from agentbay.mobile.mobile import KeyCode

session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session

# Press the HOME key.
result = session.mobile.send_key(KeyCode.HOME)
if result.success:
    print("HOME key pressed")  # Output: HOME key pressed

# KeyCode values: HOME=3, BACK=4, VOLUME_UP=24, VOLUME_DOWN=25, POWER=26, MENU=82
print(f"HOME keycode value: {KeyCode.HOME}")  # Output: HOME keycode value: 3

agent_bay.delete(session)

Available KeyCode constants:

KeyCode	Value	Description
`KeyCode.HOME`	3	Home button
`KeyCode.BACK`	4	Back button
`KeyCode.VOLUME_UP`	24	Volume up button
`KeyCode.VOLUME_DOWN`	25	Volume down button
`KeyCode.POWER`	26	Power button
`KeyCode.MENU`	82	Menu button

Note: All hardware keypresses can be automated on cloud phones. Each keypress event is sent to the Android system, which then handles it.
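
Navigation often needs several keypresses in a row, such as pressing BACK twice and then HOME. The sketch below sends keys in order and stops at the first failure. press_keys is a hypothetical helper, not an SDK function; it uses only send_key and the success and error_message fields shown above.

```python
def press_keys(session, keys):
    """Send each key in order and stop at the first failure.

    Hypothetical helper. Returns (sent_count, error_message), where
    error_message is None if every keypress succeeded.
    """
    keys = list(keys)
    for sent, key in enumerate(keys):
        result = session.mobile.send_key(key)
        if not result.success:
            # Report how many keys were sent before the failure.
            return sent, result.error_message
    return len(keys), None
```

For example, `press_keys(session, [KeyCode.BACK, KeyCode.BACK, KeyCode.HOME])` would return `(3, None)` if all three events are accepted.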

UI element detection

Get all UI elements

You can retrieve all UI elements in the current screen hierarchy:

session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session

result = session.mobile.get_all_ui_elements(timeout_ms=2000)
if result.success:
    print(f"Found {len(result.elements)} UI elements")  # Output: Found 2172 UI elements
    for element in result.elements:
        # The element structure varies. Check the element data.
        print(f"Element: {element}")
        # Example output: Element data contains UI hierarchy information
else:
    print(f"Failed: {result.error_message}")

agent_bay.delete(session)

Parameter:

timeout_ms: The timeout period in milliseconds to wait for UI elements. The default value is 2000.
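
If the screen is still loading, a single call may return no elements. A minimal sketch of a polling wrapper follows. wait_for_ui_elements and its attempts and pause_s parameters are hypothetical; only get_all_ui_elements, timeout_ms, success, and elements come from the example above.

```python
import time


def wait_for_ui_elements(session, attempts=3, timeout_ms=2000, pause_s=1.0):
    """Poll get_all_ui_elements() until it returns a non-empty list.

    Hypothetical helper. Returns the list of elements, or an empty list
    if every attempt failed or came back empty.
    """
    for attempt in range(attempts):
        result = session.mobile.get_all_ui_elements(timeout_ms=timeout_ms)
        if result.success and result.elements:
            return result.elements
        if attempt < attempts - 1:
            time.sleep(pause_s)  # Give the screen time to finish loading.
    return []
```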

Screen operations

Take a screenshot

You can capture a screenshot of the current cloud phone screen:

session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session

result = session.mobile.screenshot()
if result.success:
    screenshot_url = result.data
    print(f"Screenshot URL: {screenshot_url}")
    # Output: Screenshot URL: https://***.***.aliyuncs.com/***/screenshot_1234567890.png?***
else:
    print(f"Screenshot failed: {result.error_message}")

agent_bay.delete(session)

Best practices

Always use a cloud phone OS image

UI automation for cloud phones requires a cloud phone OS image. The following example uses mobile_latest:

# Correct - Use a cloud phone OS image.
session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session

# Incorrect - Cannot be used for cloud phone operations.
session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session

Properly use screenshot URLs

The screenshot operation returns an OSS URL, not the image data itself:

result = session.mobile.screenshot()
if result.success:
    screenshot_url = result.data
    print(f"Screenshot available at: {screenshot_url}")
    # Output: Screenshot available at: https://***.***.aliyuncs.com/***/screenshot_1234567890.png?***
    # Use the URL to download or display the screenshot.
else:
    print(f"Screenshot failed: {result.error_message}")
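
To save the image locally, the URL has to be fetched separately. The sketch below uses only the Python standard library. download_screenshot is a hypothetical helper, and it assumes the returned URL can be fetched with a plain GET request without extra headers, which this topic does not state explicitly.

```python
import shutil
import urllib.request


def download_screenshot(url, dest_path, timeout_s=30):
    """Stream the file at url to dest_path and return dest_path.

    Hypothetical helper. Assumes the URL is directly fetchable.
    """
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        with open(dest_path, "wb") as out_file:
            # Copy in chunks so the whole image is never held in memory.
            shutil.copyfileobj(response, out_file)
    return dest_path
```

A call such as `download_screenshot(result.data, "screenshot.png")` would then write the image to the current directory.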

Common use cases

Use case 1: Application navigation

import time

from agentbay import AgentBay
from agentbay.session_params import CreateSessionParams

agent_bay = AgentBay()
session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session

try:
    # Tap the application icon.
    tap_result = session.mobile.tap(x=200, y=400)
    print(f"App tap result: {tap_result.success}")  # Output: App tap result: True

    # Wait for the application to load.
    time.sleep(2)
    
    # Swipe to navigate.
    swipe_result = session.mobile.swipe(
        start_x=400,
        start_y=600,
        end_x=100,
        end_y=600,
        duration_ms=300
    )
    print(f"Navigation swipe result: {swipe_result.success}")  # Output: Navigation swipe result: True
    
    # Tap the button.
    button_result = session.mobile.tap(x=300, y=800)
    print(f"Button tap result: {button_result.success}")  # Output: Button tap result: True
    
finally:
    agent_bay.delete(session)

Use case 2: Mobile form filling

from agentbay import AgentBay
from agentbay.session_params import CreateSessionParams

agent_bay = AgentBay()
session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session

try:
    # Tap the username field.
    username_tap = session.mobile.tap(x=300, y=400)
    print(f"Username field focused: {username_tap.success}")  # Output: Username field focused: True
    
    # Enter the username.
    username_input = session.mobile.input_text("john_doe")
    print(f"Username entered: {username_input.success}")  # Output: Username entered: True
    
    # Tap the password field.
    password_tap = session.mobile.tap(x=300, y=500)
    print(f"Password field focused: {password_tap.success}")  # Output: Password field focused: True
    
    # Enter the password.
    password_input = session.mobile.input_text("secure_password")
    print(f"Password entered: {password_input.success}")  # Output: Password entered: True
    
    # Tap the login button.
    login_tap = session.mobile.tap(x=300, y=650)
    print(f"Login button pressed: {login_tap.success}")  # Output: Login button pressed: True
    
finally:
    agent_bay.delete(session)
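
The form-filling example prints each result but keeps going after a failed step, so text could be typed into a field that never received focus. One possible way to stop early is sketched below. run_steps is a hypothetical helper, not part of the SDK; it relies only on the success and error_message fields of the result objects.

```python
def run_steps(steps):
    """Run (label, action) pairs in order and stop at the first failure.

    Hypothetical helper. Each action is a zero-argument callable that
    returns a result object with success and error_message attributes.
    Returns (ok, failed_label); failed_label is None when all steps pass.
    """
    for label, action in steps:
        result = action()
        if not result.success:
            print(f"{label} failed: {result.error_message}")
            return False, label
    return True, None
```

Each step can be wrapped in a lambda so that it runs only when its turn comes, for example `run_steps([("focus username", lambda: session.mobile.tap(x=300, y=400)), ("enter username", lambda: session.mobile.input_text("john_doe"))])`.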

Use case 3: UI element discovery and interaction

from agentbay import AgentBay
from agentbay.session_params import CreateSessionParams

agent_bay = AgentBay()
session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session

try:
    # Get all clickable elements.
    result = session.mobile.get_clickable_ui_elements(timeout_ms=3000)
    
    if result.success:
        print(f"Found {len(result.elements)} clickable elements")  # Output: Found 3 clickable elements
        
        # Analyze the elements to find the target.
        for i, element in enumerate(result.elements):
            print(f"Element {i+1}: {element}")  
            # Example output:
            # Element 1: UI element with interaction capabilities
            # Element 2: UI element with interaction capabilities  
            # Element 3: UI element with interaction capabilities
    
    # Take a screenshot for verification.
    screenshot = session.mobile.screenshot()
    if screenshot.success:
        screenshot_url = screenshot.data
        print(f"Screenshot URL: {screenshot_url}")
        # Output: Screenshot URL: https://***.***.aliyuncs.com/***/screenshot_1234567890.png?***
    
finally:
    agent_bay.delete(session)

Use case 4: Scrolling through content

import time

from agentbay import AgentBay
from agentbay.session_params import CreateSessionParams

agent_bay = AgentBay()
session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session

try:
    # Scroll down multiple times.
    for i in range(3):
        scroll_result = session.mobile.swipe(
            start_x=300,
            start_y=800,
            end_x=300,
            end_y=200,
            duration_ms=400
        )
        print(f"Scroll down {i+1}: {scroll_result.success}")  # Output: Scroll down 1: True, etc.

        # Pause briefly between scrolls.
        time.sleep(1)
    
    # Scroll up.
    up_result = session.mobile.swipe(
        start_x=300,
        start_y=200,
        end_x=300,
        end_y=800,
        duration_ms=400
    )
    print(f"Scroll up result: {up_result.success}")  # Output: Scroll up result: True
    
finally:
    agent_bay.delete(session)

Troubleshooting

FAQ

"Tool not found" error
- Make sure you are using a cloud phone OS image, such as image_id="mobile_latest".
- Verify that the session was created successfully.
- Check that the API key and endpoint are configured correctly.

Hardware keypress operations
- Hardware keypress operations send keypress events to the Android system.
- Check the result.success status to verify that the keypress was sent successfully.
- Example of checking for errors:
```
result = session.mobile.send_key(KeyCode.HOME)
if not result.success:
    print(f"Key press failed: {result.error_message}")
```

UI element detection returns an empty result
- Increase the timeout_ms parameter.
- Take a screenshot to check the current UI status.
- Make sure the target screen has finished loading.

Using the screenshot URL
- The screenshot operation returns an OSS URL in result.data, not the image data itself.
- Use the URL to download the screenshot if needed.

Swipe gesture does not work as expected
- Verify that the coordinates are within the screen boundaries.
- Adjust duration_ms to change the gesture speed.
- Make sure the start and end coordinates create a meaningful swipe gesture.
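
The swipe checks above can be sketched as a small pre-flight function. check_swipe is hypothetical, and the width, height, and min_distance values are assumptions chosen by the caller, because this topic does not define the screen size or a minimum swipe length.

```python
def check_swipe(start_x, start_y, end_x, end_y, width, height, min_distance=10):
    """Return a list of problems with a planned swipe; empty means none found.

    Hypothetical helper. width and height are the assumed screen size in
    pixels, and min_distance is an arbitrary threshold for a usable swipe.
    """
    problems = []
    for name, value, limit in (
        ("start_x", start_x, width),
        ("end_x", end_x, width),
        ("start_y", start_y, height),
        ("end_y", end_y, height),
    ):
        if not 0 <= value < limit:
            problems.append(f"{name}={value} is outside 0..{limit - 1}")
    # A swipe whose start and end nearly coincide behaves like a tap.
    distance = ((end_x - start_x) ** 2 + (end_y - start_y) ** 2) ** 0.5
    if distance < min_distance:
        problems.append(f"swipe distance {distance:.0f}px is below {min_distance}px")
    return problems
```

Running the check before calling session.mobile.swipe makes it easier to tell a bad coordinate apart from a failure on the device.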