AI Guardrails: Synchronous detection

Last Updated: Mar 31, 2026

The /green/image/scan operation extracts text from images using optical character recognition (OCR). It returns the detected text, its position in the image, and a review suggestion, all in a single synchronous call.

Prerequisites

Before you begin, make sure you have:

  • An Alibaba Cloud account with Content Moderation enabled

  • An AccessKey ID and AccessKey Secret

  • Images accessible via public HTTP or HTTPS URLs

How it works

  1. Submit a request to /green/image/scan with scenes set to ["ocr"] and a list of image URLs.

  2. Content Moderation downloads each image, runs OCR, and returns the detected text and bounding box coordinates.

  3. Check the suggestion field in the response to decide whether the detected text requires manual review.

Results are typically returned within 1 second. The maximum response time is 6 seconds; requests that exceed this limit return a timeout error.

Usage notes

Billing: Calling this operation incurs charges. For pricing details, see the billing documentation.

Image download timeout: If an image cannot be downloaded within 3 seconds, the request returns a timeout error. Store images in a stable, low-latency service such as Object Storage Service (OSS) or Content Delivery Network (CDN) to minimize download failures.

Text-heavy images: OCR processing time increases with the number of words in an image. For images containing large amounts of text — such as scanned documents — use asynchronous moderation instead.

Image requirements:

  • Protocol: HTTP or HTTPS URLs only

  • Formats: PNG, JPG, JPEG, BMP, GIF, WEBP

  • Maximum size: 20 MB (applies to both synchronous and asynchronous moderation)

  • Minimum recommended resolution: 256 × 256 pixels
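The image requirements above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: the function name check_image, the use of the URL file extension as a stand-in for the real image format, and the reading of 20 MB as 20 × 1024 × 1024 bytes are all assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlparse

# Formats listed in the image requirements, expressed as file extensions.
ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".gif", ".webp"}
MAX_SIZE_BYTES = 20 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
MIN_RECOMMENDED_SIDE = 256         # recommended minimum, in pixels


def check_image(url, size_bytes=None, width=None, height=None):
    """Return a list of problems found; an empty list means none were found.

    size_bytes, width, and height are optional because the caller may not
    know them before the image is downloaded.
    """
    problems = []
    parsed = urlparse(url)
    if parsed.scheme.lower() not in ("http", "https"):
        problems.append("protocol must be HTTP or HTTPS")
    # Assumption: the extension in the URL path reflects the image format.
    extension = posixpath.splitext(parsed.path)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append("extension does not match a supported format")
    if size_bytes is not None and size_bytes > MAX_SIZE_BYTES:
        problems.append("larger than 20 MB")
    if width is not None and height is not None:
        if width < MIN_RECOMMENDED_SIDE or height < MIN_RECOMMENDED_SIDE:
            problems.append("below the recommended 256 x 256 resolution")
    return problems
```

A URL without a recognizable extension can still point to a valid image, so an extension mismatch is better treated as a warning than as a hard failure.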

QPS limits

This operation supports up to 10 queries per second (QPS) per account. Requests that exceed this limit are throttled.
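One way to stay under the limit is to space calls out on the client. The sketch below is an illustration only: the class name MinIntervalLimiter is hypothetical, and it limits a single process, so several processes sharing one account would still need to coordinate.

```python
import time


class MinIntervalLimiter:
    """Space calls at least 1/qps seconds apart within one process."""

    def __init__(self, qps=10, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / qps
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._next_allowed = None

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        # The following call may start no earlier than one interval from now.
        self._next_allowed = now + self.min_interval
        return delay
```

Call wait() immediately before each request. With the default of 10 QPS, consecutive calls are kept at least 0.1 seconds apart.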

Submit a request

Endpoint

POST http(s)://[Endpoint]/green/image/scan

Request parameters

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| scenes | StringArray | Yes | The detection scenario. Set to ["ocr"]. |
| tasks | JSONArray | Yes | The images to scan. Up to 100 items per request. To submit 100 items in a single request, you must increase the number of concurrent tasks to more than 100. See Task parameters. |
| bizType | String | No | The business scenario identifier. Default: default. Use this to apply a custom moderation policy configured in the Content Moderation console. If not set, the default policy applies. For setup instructions, see Customize policies for machine-assisted moderation. |

Task parameters

Each element in the tasks array describes one image to scan.

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| url | String | Yes | The public HTTP or HTTPS URL of the image. Maximum length: 2,048 characters. |
| dataId | String | No | The custom identifier for this image. Must be unique within the request. Returned in the response for correlation. |
| interval | Integer | No | The frame capture interval for GIF or long images. See GIF and long image moderation. |
| maxFrames | Integer | No | The maximum number of frames to capture. Default: 1. See GIF and long image moderation. |

The interval and maxFrames parameters must be used together.
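The parameter rules above can be enforced while assembling the request body. The following is a minimal sketch under stated assumptions: the function name build_ocr_request and its input shape (a list of dictionaries keyed like the task parameters) are hypothetical, and signing and sending the request with the common request parameters is not shown.

```python
MAX_TASKS = 100
MAX_URL_LENGTH = 2048


def build_ocr_request(images, biz_type=None):
    """Build the JSON body for /green/image/scan from a list of task dicts."""
    if len(images) > MAX_TASKS:
        raise ValueError("tasks accepts up to 100 items per request")
    seen_ids = set()
    tasks = []
    for image in images:
        url = image["url"]
        if len(url) > MAX_URL_LENGTH:
            raise ValueError("url exceeds 2,048 characters")
        task = {"url": url}
        data_id = image.get("dataId")
        if data_id is not None:
            # dataId must be unique within the request.
            if data_id in seen_ids:
                raise ValueError(f"duplicate dataId: {data_id}")
            seen_ids.add(data_id)
            task["dataId"] = data_id
        has_interval = "interval" in image
        has_max_frames = "maxFrames" in image
        # interval and maxFrames must be used together.
        if has_interval != has_max_frames:
            raise ValueError("interval and maxFrames must be used together")
        if has_interval:
            task["interval"] = image["interval"]
            task["maxFrames"] = image["maxFrames"]
        tasks.append(task)
    body = {"scenes": ["ocr"], "tasks": tasks}
    if biz_type is not None:
        body["bizType"] = biz_type
    return body
```

Serialize the returned dictionary with json.dumps to obtain the request body.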

Interpret the response

A successful call (HTTP 200) returns a data array where each element corresponds to one submitted image.

Response fields

| Name | Type | Description |
| --- | --- | --- |
| code | Integer | The result code for this image. 200 indicates success. |
| msg | String | The result message. |
| taskId | String | The system-generated ID for this detection task. |
| dataId | String | The dataId value from your request, if provided. |
| url | String | The image URL from your request. |
| results | Array | The detection results. Present when code is 200. See Result fields. |

Result fields

Each element in results contains the OCR output for the image.

| Name | Type | Description |
| --- | --- | --- |
| scene | String | The detection scenario. Always ocr. |
| label | String | The classification of the detection result. Valid values: ocr (text was detected), normal (no text found). |
| suggestion | String | The recommended action. Valid values: pass (no action needed), review (text requires human review). |
| ocrData | Array | The full detected text, combined into a single string that is usually stored in the first array element. Not returned if no text is detected. |
| ocrLocations | Array | The position and content of each detected text region. Not returned if no text is detected. See ocrLocation fields. |
| frames | Array | Per-frame OCR results for GIF images. Returned only when multiple frames are captured. |
| rate | Float | A confidence score. Not meaningful in the OCR scenario. |

ocrLocation fields

Each entry in ocrLocations describes one detected text region. The coordinate origin is the upper-left corner of the image, with x increasing to the right and y increasing downward.

| Name | Type | Description |
| --- | --- | --- |
| text | String | The text detected in this region. |
| x | Float | The horizontal distance from the left edge of the image to the left edge of the text region, in pixels. |
| y | Float | The vertical distance from the top edge of the image to the top edge of the text region, in pixels. |
| w | Float | The width of the text region, in pixels. |
| h | Float | The height of the text region, in pixels. |

ocrData contains the combined text from all detected regions as a single string. ocrLocations gives the position of each individual text region. Use ocrLocations when you need to locate or highlight specific text within the image.
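To highlight a region, the x, y, w, and h values usually need to be turned into corner coordinates. The sketch below shows one way to do that; the helper names region_to_box and regions_containing are hypothetical and not part of the API.

```python
def region_to_box(region):
    """Convert an ocrLocations entry to (left, top, right, bottom) in pixels.

    The origin is the upper-left corner of the image, so right = x + w
    and bottom = y + h.
    """
    left = region["x"]
    top = region["y"]
    return (left, top, left + region["w"], top + region["h"])


def regions_containing(ocr_locations, keyword):
    """Return the boxes of regions whose text contains keyword.

    The match is case-insensitive.
    """
    needle = keyword.lower()
    return [
        region_to_box(region)
        for region in ocr_locations
        if needle in region["text"].lower()
    ]
```

The resulting tuples follow the left, top, right, bottom order that many image libraries expect for rectangles.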

GIF and long image moderation

By default, only the first frame of a GIF or long image is scanned. Use interval and maxFrames together to scan multiple frames.

  • interval: Scan one frame out of every n frames, where n is the value of interval.

  • maxFrames: Cap the total number of frames scanned.

If interval × maxFrames is less than the total number of frames in the image, the system automatically adjusts the interval to ceil(total_frames / maxFrames) to distribute coverage evenly.

What counts as a long image:

| Orientation | Condition |
| --- | --- |
| Portrait (tall) | Height > 400 px AND height:width ratio > 2.5:1. Frame count = round(height ÷ width). |
| Landscape (wide) | Width > 400 px AND width:height ratio > 2.5:1. Frame count = round(width ÷ height). |

Example: With interval: 2 and maxFrames: 100, the system scans one frame out of every two frames, up to a maximum of 100 frames. Charges apply per frame scanned.
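The long-image conditions and the interval adjustment described above can be estimated on the client, for example to predict how many frames a request will be billed for. This is a sketch under stated assumptions: the function names are hypothetical, width and height are assumed to be positive, and round is assumed to mean rounding half up, which the conditions do not specify.

```python
import math


def _round_half_up(value):
    # Assumption: "round" means round half up; the rounding mode is not stated.
    return math.floor(value + 0.5)


def long_image_frames(width, height):
    """Return the frame count of a long image, or None if it is not one."""
    if height > 400 and height / width > 2.5:      # portrait (tall)
        return _round_half_up(height / width)
    if width > 400 and width / height > 2.5:       # landscape (wide)
        return _round_half_up(width / height)
    return None


def effective_interval(interval, max_frames, total_frames):
    """Return the interval after the automatic adjustment, if one applies."""
    if interval * max_frames < total_frames:
        return math.ceil(total_frames / max_frames)
    return interval
```

For a 500 × 2000 pixel image, long_image_frames returns 4. With interval 2, maxFrames 10, and 45 frames in total, effective_interval returns 5 because 2 × 10 is less than 45.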

Example

Request

POST http(s)://[Endpoint]/green/image/scan
<Common request parameters>

{
    "scenes": ["ocr"],
    "tasks": [
        {
            "dataId": "test_data_xxxx",
            "url": "https://aliyundoc.com/test_image_xxxx.png"
        }
    ]
}

Response

{
    "code": 200,
    "msg": "OK",
    "requestId": "C4AB08A9-AD75-4410-859B-0B9EF6DFC3C4",
    "data": [
        {
            "code": 200,
            "msg": "OK",
            "dataId": "test_data_xxxx",
            "taskId": "img5A@k7a@B4q@6K@d9nfKgOs-1s****",
            "url": "https://aliyundoc.com/test_image_xxxx.png",
            "extras": {},
            "results": [
                {
                    "scene": "ocr",
                    "label": "ocr",
                    "suggestion": "review",
                    "rate": 99.91,
                    "ocrData": [
                        "hello, this is a test text."
                    ],
                    "ocrLocations": [
                        {
                            "text": "hello",
                            "x": 41,
                            "y": 84,
                            "w": 83,
                            "h": 26
                        },
                        {
                            "text": " this is a test text.",
                            "x": 78,
                            "y": 114,
                            "w": 95,
                            "h": 25
                        }
                    ]
                }
            ]
        }
    ]
}

In this example, suggestion: review means the detected text requires human review before taking further action. ocrData contains the full combined string "hello, this is a test text.", while ocrLocations gives the exact pixel position of each text fragment within the image.
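A response like the one above can be reduced to one summary per image. The following is a minimal sketch; the function name summarize_response is hypothetical, and joining multiple ocrData elements with a newline is an assumption, since the text is usually stored in the first element.

```python
def summarize_response(response):
    """Return one summary dict per element of the response's data array."""
    summaries = []
    for item in response.get("data", []):
        summary = {
            "dataId": item.get("dataId"),
            "code": item["code"],
            "suggestion": None,
            "text": "",
        }
        # results is present only when the per-image code is 200.
        if item["code"] == 200:
            for result in item.get("results", []):
                if result.get("scene") != "ocr":
                    continue
                summary["suggestion"] = result.get("suggestion")
                # ocrData is absent when no text is detected.
                summary["text"] = "\n".join(result.get("ocrData", []))
        summaries.append(summary)
    return summaries
```

Images whose summary has suggestion set to review can then be routed to a human review queue, while images with a non-200 code can be retried or logged.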
