Extracts text from images using optical character recognition (OCR) via the /green/image/scan API. The API returns the detected text, its position in the image, and a review suggestion — all in a single synchronous call.
Prerequisites
Before you begin, make sure you have:
An Alibaba Cloud account with Content Moderation enabled
An AccessKey ID and AccessKey Secret
Images accessible via public HTTP or HTTPS URLs
How it works
Submit a request to /green/image/scan with scenes set to ["ocr"] and a list of image URLs.
Content Moderation downloads each image, runs OCR, and returns the detected text and bounding box coordinates.
Check the suggestion field in the response to decide whether the detected text requires manual review.
Results are typically returned within 1 second. The maximum response time is 6 seconds; requests that exceed this limit return a timeout error.
Usage notes
Billing: Calling this operation incurs charges. For pricing details, see the billing documentation.
Image download timeout: If an image cannot be downloaded within 3 seconds, the request returns a timeout error. Store images in a stable, low-latency service such as Object Storage Service (OSS) or Content Delivery Network (CDN) to minimize download failures.
Text-heavy images: OCR processing time increases with the number of words in an image. For images containing large amounts of text — such as scanned documents — use asynchronous moderation instead.
Image requirements:
Protocol: HTTP or HTTPS URLs only
Formats: PNG, JPG, JPEG, BMP, GIF, WEBP
Maximum size: 20 MB (applies to both synchronous and asynchronous moderation)
Minimum recommended resolution: 256 × 256 pixels
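The requirements above can be checked on the client before a request is sent, which avoids paying for calls that are likely to fail. The following is a minimal sketch, not part of the API: the function name check_image is hypothetical, the file-extension test is only a heuristic (a URL does not have to end in the image format), and 20 MB is assumed to mean 20 × 1024 × 1024 bytes. The 2,048-character URL limit comes from the url task parameter.

```python
from urllib.parse import urlparse
import posixpath

ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".gif", ".webp"}
MAX_URL_LENGTH = 2048
MAX_SIZE_BYTES = 20 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
MIN_SIDE_PX = 256                  # recommended minimum, not a hard limit


def check_image(url, size_bytes=None, width=None, height=None):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        problems.append("URL must use HTTP or HTTPS")
    if len(url) > MAX_URL_LENGTH:
        problems.append("URL is longer than 2,048 characters")
    # Heuristic only: the real format is decided by the file content.
    extension = posixpath.splitext(parsed.path)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append("file extension is not a supported image format")
    if size_bytes is not None and size_bytes > MAX_SIZE_BYTES:
        problems.append("image is larger than 20 MB")
    if width is not None and height is not None:
        if width < MIN_SIDE_PX or height < MIN_SIDE_PX:
            problems.append("resolution is below the recommended 256 x 256")
    return problems
```

Size and dimensions are optional arguments because they are only known if the caller has already inspected the file.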
QPS limits
This operation supports up to 10 requests per second (QPS) per account. Exceeding this limit triggers throttling.
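One way to stay under the limit is to space requests out on the client. The sketch below is an illustration only: MinIntervalLimiter is a hypothetical helper, it assumes a single process sends all requests for the account, and it enforces a fixed gap of 1 / max_qps seconds between calls rather than modelling the service's actual throttling behavior.

```python
import time


class MinIntervalLimiter:
    """Space calls at least 1 / max_qps seconds apart (single-threaded use)."""

    def __init__(self, max_qps=10, clock=time.monotonic, sleep=time.sleep):
        self._min_gap = 1.0 / max_qps
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = self._next_allowed - now
        if delay > 0:
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self._min_gap
        return max(delay, 0.0)
```

Call limiter.wait() immediately before each request. If several workers share one account, they would need a shared limiter or a lower per-worker rate.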
Submit a request
Endpoint
POST http(s)://[Endpoint]/green/image/scan
Request parameters
| Name | Type | Required | Description |
|---|---|---|---|
| scenes | StringArray | Yes | The detection scenario. Set to ["ocr"]. |
| tasks | JSONArray | Yes | The images to scan. Up to 100 items per request. To submit 100 items in a single request, you must increase the number of concurrent tasks to more than 100. See Task parameters. |
| bizType | String | No | The business scenario identifier. Default: default. Use this to apply a custom moderation policy configured in the Content Moderation console. If not set, the default policy applies. For setup instructions, see Customize policies for machine-assisted moderation. |
Task parameters
Each element in the tasks array describes one image to scan.
| Name | Type | Required | Description |
|---|---|---|---|
| url | String | Yes | The public HTTP or HTTPS URL of the image. Maximum length: 2,048 characters. |
| dataId | String | No | The custom identifier for this image. Must be unique within the request. Returned in the response for correlation. |
| interval | Integer | No | The frame capture interval for GIF or long images. See GIF and long image moderation. |
| maxFrames | Integer | No | The maximum number of frames to capture. Default: 1. See GIF and long image moderation. |
The interval and maxFrames parameters must be used together.
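The request and task parameters above can be assembled as in the following sketch. It only builds the JSON bodies; signing and sending them are out of scope here. The function name build_ocr_bodies and the img-N dataId scheme are assumptions made for illustration, and the batch size of 100 assumes the account's concurrency has been raised as described for the tasks parameter.

```python
MAX_TASKS_PER_REQUEST = 100  # upper bound stated for the tasks parameter


def build_ocr_bodies(image_urls, biz_type=None, interval=None, max_frames=None):
    """Split image URLs into request bodies of at most 100 tasks each."""
    # interval and maxFrames must be supplied together or not at all.
    if (interval is None) != (max_frames is None):
        raise ValueError("interval and maxFrames must be used together")

    bodies = []
    for start in range(0, len(image_urls), MAX_TASKS_PER_REQUEST):
        batch = image_urls[start:start + MAX_TASKS_PER_REQUEST]
        tasks = []
        for offset, url in enumerate(batch):
            # Hypothetical dataId scheme; it only needs to be unique per request.
            task = {"dataId": f"img-{start + offset}", "url": url}
            if interval is not None:
                task["interval"] = interval
                task["maxFrames"] = max_frames
            tasks.append(task)
        body = {"scenes": ["ocr"], "tasks": tasks}
        if biz_type is not None:
            body["bizType"] = biz_type
        bodies.append(body)
    return bodies
```

Each returned dictionary can be serialized with json.dumps and sent as the body of one POST request.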
Interpret the response
A successful call (HTTP 200) returns a data array where each element corresponds to one submitted image.
Response fields
| Name | Type | Description |
|---|---|---|
| code | Integer | The result code for this image. 200 indicates success. |
| msg | String | The result message. |
| taskId | String | The system-generated ID for this detection task. |
| dataId | String | The dataId value from your request, if provided. |
| url | String | The image URL from your request. |
| results | Array | The detection results. Present when code is 200. See Result fields. |
Result fields
Each element in results contains the OCR output for the image.
| Name | Type | Description |
|---|---|---|
| scene | String | The detection scenario. Always ocr. |
| label | String | The classification of the detection result. Valid values: ocr (text was detected), normal (no text found). |
| suggestion | String | The recommended action. Valid values: pass (no action needed), review (text requires human review). |
| ocrData | Array | The full detected text, combined into a single string that is usually stored in the first array element. Not returned if no text is detected. |
| ocrLocations | Array | The position and content of each detected text region. Not returned if no text is detected. See ocrLocation fields. |
| frames | Array | Per-frame OCR results for GIF images. Returned only when multiple frames are captured. |
| rate | Float | A confidence score. Not meaningful in the OCR scenario. |
ocrLocation fields
Each entry in ocrLocations describes one detected text region. The coordinate origin is the upper-left corner of the image, with x increasing to the right and y increasing downward.
| Name | Type | Description |
|---|---|---|
| text | String | The text detected in this region. |
| x | Float | The horizontal distance from the left edge of the image to the left edge of the text region, in pixels. |
| y | Float | The vertical distance from the top edge of the image to the top edge of the text region, in pixels. |
| w | Float | The width of the text region, in pixels. |
| h | Float | The height of the text region, in pixels. |
ocrData contains the combined text from all detected regions as a single string. ocrLocations gives the position of each individual text region. Use ocrLocations when you need to locate or highlight specific text within the image.
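The response, result, and ocrLocation fields described above could be read as in the following sketch. The function name extract_ocr and the shape of the returned dictionaries are assumptions for illustration. Joining ocrData entries with a space is also an assumption, since the combined text is usually in the first element. Bounding boxes are converted from (x, y, w, h) to (left, top, right, bottom) corners, which is the form many drawing libraries expect.

```python
def extract_ocr(response):
    """Flatten a parsed /green/image/scan response into one entry per image."""
    if response.get("code") != 200:
        raise RuntimeError(f"request failed: {response.get('code')} {response.get('msg')}")

    entries = []
    for item in response.get("data", []):
        if item.get("code") != 200:
            # Per-image failure: results is only present when code is 200.
            entries.append({"dataId": item.get("dataId"), "error": item.get("msg")})
            continue
        for result in item.get("results", []):
            if result.get("scene") != "ocr":
                continue
            boxes = [
                (loc["text"], loc["x"], loc["y"], loc["x"] + loc["w"], loc["y"] + loc["h"])
                for loc in result.get("ocrLocations", [])  # absent if no text found
            ]
            entries.append({
                "dataId": item.get("dataId"),
                "text": " ".join(result.get("ocrData", [])),
                "boxes": boxes,
                "needs_review": result.get("suggestion") == "review",
            })
    return entries
```

An entry with needs_review set to True corresponds to suggestion: review and can be routed to a manual review queue.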
GIF and long image moderation
By default, only the first frame of a GIF or long image is scanned. Use interval and maxFrames together to scan multiple frames.
interval: Scan one frame out of every n frames, where n is the value of interval.
maxFrames: Cap the total number of frames scanned.
If interval × maxFrames is less than the total number of frames in the image, the system automatically adjusts the interval to ceil(total_frames / maxFrames) to distribute coverage evenly.
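The adjustment rule can be written out as a small calculation. This is a sketch of the rule as stated in the preceding paragraph, not the service's implementation, and the function name effective_interval is hypothetical.

```python
import math


def effective_interval(total_frames, interval, max_frames):
    """Return the interval actually used for an image with total_frames frames."""
    if total_frames <= 0 or interval <= 0 or max_frames <= 0:
        raise ValueError("all arguments must be positive integers")
    # If the requested sampling cannot reach the end of the image,
    # widen the interval so max_frames samples span the whole image.
    if interval * max_frames < total_frames:
        return math.ceil(total_frames / max_frames)
    return interval
```

For a 300-frame GIF with interval 2 and maxFrames 100, 2 × 100 = 200 is less than 300, so the interval becomes ceil(300 / 100) = 3. For a 150-frame GIF with the same settings, the interval stays at 2.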
What counts as a long image:
| Orientation | Condition |
|---|---|
| Portrait (tall) | Height > 400 px AND height:width ratio > 2.5:1. Frame count = round(height ÷ width). |
| Landscape (wide) | Width > 400 px AND width:height ratio > 2.5:1. Frame count = round(width ÷ height). |
Example: With interval: 2 and maxFrames: 100, the system scans one frame out of every two frames, up to a maximum of 100 frames. Charges apply per frame scanned.
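The long-image conditions in the table can be turned into a quick estimate of how many frames an image is split into, which helps when estimating per-frame charges. The sketch below is an approximation under stated assumptions: the function name long_image_frames is hypothetical, and because the rounding mode for exact halves is not specified, round-half-up is assumed.

```python
def long_image_frames(width, height):
    """Return the frame count for a long image, or None if it is not long."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    def round_half_up(value):
        # Assumption: halves round up; the rounding mode is not documented.
        return int(value + 0.5)

    # Portrait (tall): height > 400 px and height:width ratio > 2.5:1.
    if height > 400 and height / width > 2.5:
        return round_half_up(height / width)
    # Landscape (wide): width > 400 px and width:height ratio > 2.5:1.
    if width > 400 and width / height > 2.5:
        return round_half_up(width / height)
    return None
```

For example, a 300 × 1200 image has a height:width ratio of 4 and is treated as 4 frames, while a 1000 × 400 image has a ratio of exactly 2.5 and does not qualify.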
Example
Request
POST http(s)://[Endpoint]/green/image/scan
<Common request parameters>
{
"scenes": ["ocr"],
"tasks": [
{
"dataId": "test_data_xxxx",
"url": "https://aliyundoc.com/test_image_xxxx.png"
}
]
}
Response
{
"code": 200,
"msg": "OK",
"requestId": "C4AB08A9-AD75-4410-859B-0B9EF6DFC3C4",
"data": [
{
"code": 200,
"msg": "OK",
"dataId": "test_data_xxxx",
"taskId": "img5A@k7a@B4q@6K@d9nfKgOs-1s****",
"url": "https://aliyundoc.com/test_image_xxxx.png",
"extras": {},
"results": [
{
"scene": "ocr",
"label": "ocr",
"suggestion": "review",
"rate": 99.91,
"ocrData": [
"hello, this is a test text."
],
"ocrLocations": [
{
"text": "hello",
"x": 41,
"y": 84,
"w": 83,
"h": 26
},
{
"text": " this is a test text.",
"x": 78,
"y": 114,
"w": 95,
"h": 25
}
]
}
]
}
]
}
In this example, suggestion: review means the detected text requires human review before you take further action. ocrData contains the full combined string "hello, this is a test text.", while ocrLocations gives the exact pixel position of each text fragment within the image.
What's next
SDK overview: Use a pre-built client instead of constructing raw HTTP requests.
Request structure: Learn how to construct and sign requests.
Customize policies for machine-assisted moderation: Configure custom moderation policies using bizType.