This topic describes the UI automation features of the AgentBay SDK for cloud phones. These features include touch operations, text input, UI element detection, and screen operations.
Overview
The AgentBay SDK for cloud phones provides the following UI automation features:
Touch operations: Perform tap and swipe gestures to interact with the cloud phone.
Text input: Enter text and send hardware keypress events.
UI element detection: Find and interact with UI elements.
Screen operations: Capture screenshots for visual verification.
Create a session
from agentbay import AgentBay
from agentbay.session_params import CreateSessionParams
agent_bay = AgentBay()
session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session
# The session is created. You can now automate the cloud phone.
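The examples in this topic create a session and delete it when they finish. The sketch below wraps that pattern in a context manager so the session is deleted even if an automation step raises an exception. `managed_session` is a hypothetical helper, not part of the SDK. It assumes only the two calls shown in this topic, `agent_bay.create(params).session` and `agent_bay.delete(session)`.

```python
from contextlib import contextmanager


@contextmanager
def managed_session(agent_bay, session_params):
    """Create a session and always delete it on exit.

    Hypothetical helper (not an SDK function). It relies only on
    agent_bay.create(params).session and agent_bay.delete(session).
    """
    session = agent_bay.create(session_params).session
    try:
        yield session
    finally:
        # Runs even if the automation code inside the "with" block raises.
        agent_bay.delete(session)


# Usage sketch (requires the SDK and valid credentials):
# with managed_session(agent_bay, CreateSessionParams(image_id="mobile_latest")) as session:
#     session.mobile.tap(x=500, y=300)
```

This keeps cleanup in one place instead of repeating `agent_bay.delete(session)` at the end of every script.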
Touch operations
Tap gestures
You can tap the screen at specific coordinates:
session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session
# Tap at the specified coordinates.
result = session.mobile.tap(x=500, y=300)
if result.success:
    print("Tap successful")  # Output: Tap successful
else:
    print(f"Tap failed: {result.error_message}")
agent_bay.delete(session)
Swipe gestures
You can perform a swipe gesture from one point to another:
session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session
# Swipe up (from bottom to top).
result = session.mobile.swipe(
    start_x=100,
    start_y=500,
    end_x=100,
    end_y=200,
    duration_ms=300
)
if result.success:
    print("Swipe up successful")  # Output: Swipe up successful
# Swipe left (from right to left).
result = session.mobile.swipe(
    start_x=500,
    start_y=300,
    end_x=100,
    end_y=300,
    duration_ms=300
)
if result.success:
    print("Swipe left successful")  # Output: Swipe left successful
agent_bay.delete(session)
Parameters:
start_x, start_y: The starting coordinates.
end_x, end_y: The ending coordinates.
duration_ms: The duration of the swipe in milliseconds. The default value is 300.
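Because a swipe is defined by its start and end coordinates, it can help to derive them from a direction and the screen size instead of hard-coding them. The following is a minimal sketch. `swipe_coordinates` is a hypothetical helper, and the screen width and height are values you must supply yourself, because this topic does not describe how to query them.

```python
def swipe_coordinates(direction, width, height, margin=0.2):
    """Return (start_x, start_y, end_x, end_y) for a swipe in one direction.

    width, height: screen size in pixels (assumed to be known by the caller).
    margin: fraction of the screen left untouched at each edge.
    """
    if not 0 <= margin < 0.5:
        raise ValueError("margin must be at least 0 and less than 0.5")
    dx = round(width * margin)
    dy = round(height * margin)
    left, right = dx, width - dx
    top, bottom = dy, height - dy
    cx, cy = width // 2, height // 2
    paths = {
        "up": (cx, bottom, cx, top),      # finger moves from bottom to top
        "down": (cx, top, cx, bottom),    # finger moves from top to bottom
        "left": (right, cy, left, cy),    # finger moves from right to left
        "right": (left, cy, right, cy),   # finger moves from left to right
    }
    if direction not in paths:
        raise ValueError(f"unknown direction: {direction!r}")
    return paths[direction]
```

The four values can then be passed as `start_x`, `start_y`, `end_x`, and `end_y` to `session.mobile.swipe`.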
Text input
Input text
You can input text into the currently active input field:
session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session
result = session.mobile.input_text("Hello AgentBay!")
if result.success:
    print("Text input successful")  # Output: Text input successful
agent_bay.delete(session)
Send hardware keypress events
You can send Android hardware keypress events using KeyCode constants:
from agentbay.mobile.mobile import KeyCode
session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session
# Press the HOME key.
result = session.mobile.send_key(KeyCode.HOME)
if result.success:
    print("HOME key pressed")  # Output: HOME key pressed
# KeyCode values: HOME=3, BACK=4, VOLUME_UP=24, VOLUME_DOWN=25, POWER=26, MENU=82
print(f"HOME keycode value: {KeyCode.HOME}") # Output: HOME keycode value: 3
agent_bay.delete(session)
Available KeyCode constants:
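| KeyCode | Value | Description |
| KeyCode.HOME | 3 | Home button |
| KeyCode.BACK | 4 | Back button |
| KeyCode.VOLUME_UP | 24 | Volume up button |
| KeyCode.VOLUME_DOWN | 25 | Volume down button |
| KeyCode.POWER | 26 | Power button |
| KeyCode.MENU | 82 | Menu button |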
Note: All hardware keypresses can be automated for cloud phones. Keypress events are sent to the Android system and executed accordingly.
UI element detection
Get all UI elements
You can retrieve all UI elements in the current screen hierarchy:
session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session
result = session.mobile.get_all_ui_elements(timeout_ms=2000)
if result.success:
    print(f"Found {len(result.elements)} UI elements")  # Output: Found 2172 UI elements
    for element in result.elements:
        # The element structure varies. Check the element data.
        print(f"Element: {element}")
        # Example output: Element data contains UI hierarchy information
else:
    print(f"Failed: {result.error_message}")
agent_bay.delete(session)
Parameter:
timeout_ms: The timeout period in milliseconds to wait for UI elements. The default value is 2000.
Screen operations
Take a screenshot
You can capture a screenshot of the current cloud phone screen:
session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session
result = session.mobile.screenshot()
if result.success:
    screenshot_url = result.data
    print(f"Screenshot URL: {screenshot_url}")
    # Output: Screenshot URL: https://***.***.aliyuncs.com/***/screenshot_1234567890.png?***
else:
    print(f"Screenshot failed: {result.error_message}")
agent_bay.delete(session)
Best practices
Always use a cloud phone OS image
UI automation for cloud phones requires a cloud phone OS image. The following example uses mobile_latest:
# Correct - Use a cloud phone OS image.
session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session
# Incorrect - Cannot be used for cloud phone operations.
session_params = CreateSessionParams(image_id="windows_latest")
session = agent_bay.create(session_params).session
Properly use screenshot URLs
The screenshot operation returns an OSS URL, not the image data itself:
result = session.mobile.screenshot()
if result.success:
    screenshot_url = result.data
    print(f"Screenshot available at: {screenshot_url}")
    # Output: Screenshot available at: https://***.***.aliyuncs.com/***/screenshot_1234567890.png?***
    # Use the URL to download or display the screenshot.
else:
    print(f"Screenshot failed: {result.error_message}")
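Because `result.data` is a URL and not image bytes, a separate download step is needed to get the image onto local disk. The sketch below uses only the Python standard library. `download_screenshot` is a hypothetical helper, and the 30-second timeout is an arbitrary default, not a value from the SDK.

```python
import shutil
import urllib.request
from pathlib import Path


def download_screenshot(url, destination, timeout=30):
    """Download the image behind a screenshot URL to a local file.

    Returns the destination as a pathlib.Path.
    """
    destination = Path(destination)
    # Create the target folder if it does not exist yet.
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(destination, "wb") as out_file:
        # Stream the response to disk instead of loading it into memory.
        shutil.copyfileobj(response, out_file)
    return destination


# Usage sketch:
# path = download_screenshot(result.data, "screenshots/home.png")
```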
Common use cases
Use case 1: Application navigation
import time
from agentbay import AgentBay
from agentbay.session_params import CreateSessionParams
agent_bay = AgentBay()
session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session
try:
    # Tap the application icon.
    tap_result = session.mobile.tap(x=200, y=400)
    print(f"App tap result: {tap_result.success}")  # Output: App tap result: True
    # Wait for the application to load.
    time.sleep(2)
    # Swipe to navigate.
    swipe_result = session.mobile.swipe(
        start_x=400,
        start_y=600,
        end_x=100,
        end_y=600,
        duration_ms=300
    )
    print(f"Navigation swipe result: {swipe_result.success}")  # Output: Navigation swipe result: True
    # Tap the button.
    button_result = session.mobile.tap(x=300, y=800)
    print(f"Button tap result: {button_result.success}")  # Output: Button tap result: True
finally:
    agent_bay.delete(session)
Use case 2: Mobile form filling
from agentbay import AgentBay
from agentbay.session_params import CreateSessionParams
agent_bay = AgentBay()
session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session
try:
    # Tap the username field.
    username_tap = session.mobile.tap(x=300, y=400)
    print(f"Username field focused: {username_tap.success}")  # Output: Username field focused: True
    # Enter the username.
    username_input = session.mobile.input_text("john_doe")
    print(f"Username entered: {username_input.success}")  # Output: Username entered: True
    # Tap the password field.
    password_tap = session.mobile.tap(x=300, y=500)
    print(f"Password field focused: {password_tap.success}")  # Output: Password field focused: True
    # Enter the password.
    password_input = session.mobile.input_text("secure_password")
    print(f"Password entered: {password_input.success}")  # Output: Password entered: True
    # Tap the login button.
    login_tap = session.mobile.tap(x=300, y=650)
    print(f"Login button pressed: {login_tap.success}")  # Output: Login button pressed: True
finally:
    agent_bay.delete(session)
Use case 3: UI element discovery and interaction
from agentbay import AgentBay
from agentbay.session_params import CreateSessionParams
agent_bay = AgentBay()
session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session
try:
    # Get all clickable elements.
    result = session.mobile.get_clickable_ui_elements(timeout_ms=3000)
    if result.success:
        print(f"Found {len(result.elements)} clickable elements")  # Output: Found 3 clickable elements
        # Analyze the elements to find the target.
        for i, element in enumerate(result.elements):
            print(f"Element {i+1}: {element}")
        # Example output:
        # Element 1: UI element with interaction capabilities
        # Element 2: UI element with interaction capabilities
        # Element 3: UI element with interaction capabilities
    # Take a screenshot for verification.
    screenshot = session.mobile.screenshot()
    if screenshot.success:
        screenshot_url = screenshot.data
        print(f"Screenshot URL: {screenshot_url}")
        # Output: Screenshot URL: https://***.***.aliyuncs.com/***/screenshot_1234567890.png?***
finally:
    agent_bay.delete(session)
Use case 4: Scrolling through content
import time
from agentbay import AgentBay
from agentbay.session_params import CreateSessionParams
agent_bay = AgentBay()
session_params = CreateSessionParams(image_id="mobile_latest")
session = agent_bay.create(session_params).session
try:
    # Scroll down multiple times.
    for i in range(3):
        scroll_result = session.mobile.swipe(
            start_x=300,
            start_y=800,
            end_x=300,
            end_y=200,
            duration_ms=400
        )
        print(f"Scroll down {i+1}: {scroll_result.success}")  # Output: Scroll down 1: True, etc.
        # Pause briefly between scrolls.
        time.sleep(1)
    # Scroll up.
    up_result = session.mobile.swipe(
        start_x=300,
        start_y=200,
        end_x=300,
        end_y=800,
        duration_ms=400
    )
    print(f"Scroll up result: {up_result.success}")  # Output: Scroll up result: True
finally:
    agent_bay.delete(session)
Troubleshooting
FAQ
"Tool not found" error
Make sure you are using a cloud phone OS image, such as image_id="mobile_latest".
Verify that the session was created successfully.
Check that the API key and endpoint are configured correctly.
Hardware keypress operations
Hardware keypress operations send keypress events to the Android system.
Check the result.success status to verify that the keypress was sent successfully.
Example of error troubleshooting:
result = session.mobile.send_key(KeyCode.HOME)
if not result.success:
    print(f"Key press failed: {result.error_message}")
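The same `success` and `error_message` check applies to every operation in this topic, so it can be factored into one function that stops a script at the first failed step. The following is a minimal sketch. `require_success` and `StepFailed` are hypothetical names, and the only assumption about the result object is that it exposes `success` and, optionally, `error_message`, as in the examples above.

```python
class StepFailed(RuntimeError):
    """Raised when an automation step reports that it did not succeed."""


def require_success(result, step):
    """Return result unchanged if it succeeded, otherwise raise StepFailed.

    step: a short human-readable label used in the error message.
    """
    if not result.success:
        message = getattr(result, "error_message", "") or "no error message"
        raise StepFailed(f"{step} failed: {message}")
    return result


# Usage sketch:
# require_success(session.mobile.send_key(KeyCode.HOME), "press HOME")
# require_success(session.mobile.tap(x=500, y=300), "tap icon")
```

Raising an exception instead of printing makes a failed step visible to a surrounding `try`/`finally` block, so the session is still deleted.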
UI element detection returns an empty result
Increase the timeout_ms parameter.
Take a screenshot to check the current UI status.
Make sure the target screen has finished loading.
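The first and last suggestions can be combined into a retry loop that calls the detection method again with a longer timeout after a short pause. The following is a sketch under stated assumptions. `get_elements_with_retry` is a hypothetical helper, the timeout sequence and pause are arbitrary values, and the result object is assumed to expose `success` and `elements` as in the examples above.

```python
import time


def get_elements_with_retry(fetch, timeouts_ms=(2000, 4000, 8000), pause_s=1.0):
    """Call fetch(timeout_ms=...) with growing timeouts until elements appear.

    fetch: a callable such as session.mobile.get_all_ui_elements.
    Returns the last result, whether or not it contained elements.
    """
    if not timeouts_ms:
        raise ValueError("timeouts_ms must contain at least one value")
    result = None
    for attempt, timeout_ms in enumerate(timeouts_ms):
        if attempt:
            # Give the target screen time to finish loading before retrying.
            time.sleep(pause_s)
        result = fetch(timeout_ms=timeout_ms)
        if result.success and result.elements:
            break
    return result


# Usage sketch:
# result = get_elements_with_retry(session.mobile.get_all_ui_elements)
```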
Screenshot URL usage
The screenshot operation returns an OSS URL, not the image data itself.
result.data contains the download URL. Use the URL to download the screenshot if needed.
Swipe gesture does not work as expected
Verify that the coordinates are within the screen boundaries.
Adjust duration_ms to accommodate different gesture speeds.
Make sure the start and end coordinates create a meaningful swipe gesture.
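The coordinate checks above can be run before the swipe is sent. The following is a minimal sketch. `check_swipe` is a hypothetical helper, the screen size must be supplied by the caller, and the 50-pixel minimum distance is an arbitrary threshold for a "meaningful" swipe, not a value defined by the SDK.

```python
import math


def check_swipe(start_x, start_y, end_x, end_y, width, height, min_distance=50):
    """Return a list of problems with a planned swipe (empty if none found).

    width, height: screen size in pixels (assumed to be known by the caller).
    min_distance: shortest swipe length, in pixels, that is treated as valid.
    """
    problems = []
    points = (("start", start_x, start_y), ("end", end_x, end_y))
    for name, x, y in points:
        if not (0 <= x < width and 0 <= y < height):
            problems.append(f"{name} point ({x}, {y}) is outside the {width}x{height} screen")
    distance = math.hypot(end_x - start_x, end_y - start_y)
    if distance < min_distance:
        problems.append(f"swipe distance {distance:.0f} px is shorter than {min_distance} px")
    return problems


# Usage sketch:
# for problem in check_swipe(100, 500, 100, 200, width=1080, height=1920):
#     print(problem)
```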