AnalyticDB: Migrate data from a self-managed Qdrant cluster to AnalyticDB for PostgreSQL

Updated: Jun 19, 2024

Qdrant is a vector similarity search engine that is used to store, search, and manage vectors. You can use Python to migrate collection data from a self-managed Qdrant cluster to an AnalyticDB for PostgreSQL instance.

Prerequisites

  • A Qdrant cluster is created.

  • Python is installed. Python 3.8 or later is recommended.

  • The required Python libraries are installed.

    pip install psycopg2
    pip install qdrant-client==1.6.0
    pip install pyyaml
    pip install tqdm
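
    Optionally, you can run a quick check to confirm that all four libraries can be imported before you run the migration scripts. This check is not part of the original procedure; it simply fails fast if a dependency is missing.

    python -c "import psycopg2, qdrant_client, yaml, tqdm"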

Migration procedure

Step 1: Export data from Qdrant

  1. Prepare the export script export.py and its configuration file qdrant2csv.yaml, and create an output directory. In this topic, the output directory is named output.

    The export script export.py is as follows.

    import yaml
    import json
    from qdrant_client import QdrantClient
    import os
    from enum import IntEnum
    from tqdm import tqdm
    
    with open("./qdrant2csv.yaml", "r") as f:
        config = yaml.safe_load(f)
    
    
    print("configuration:")
    print(config)
    
    qdrant_config = config["qdrant"]
    
    
    class DataType(IntEnum):
        ID = 1
        FLOAT_VECTOR = 2
        JSON = 3
    
    
    def data_convert_to_str(data, dtype, delimiter):
        if dtype == DataType.ID:
            return str(data)
        elif dtype == DataType.FLOAT_VECTOR:
            return "{" + ", ".join(str(x) for x in data) + "}"
        elif dtype == DataType.JSON:
            return str(data).replace(delimiter, f"\\{delimiter}").replace("\"", "\\\"")
        raise Exception(f"Unsupported DataType {dtype}")
    
    
    def csv_write_rows(datum, fd, fields_types, delimiter="|"):
        for data in datum:
            for i in range(len(data)):
                data[i] = data_convert_to_str(data[i], fields_types[i], delimiter)
            fd.write(delimiter.join(data) + "\n")
    
    
    def csv_write_header(headers, fd, delimiter="|"):
        fd.write(delimiter.join(headers) + "\n")
    
    
    def dump_collection(collection_name: str):
        results = []
        file_cnt = 0
        print("connecting to qdrant...")
        client = QdrantClient(**qdrant_config)
    
        export_config = config["export"]
        tmp_path = os.path.join(export_config["output_path"], collection_name)
        if not os.path.exists(tmp_path):
            os.mkdir(tmp_path)
    
        # fetch info of collection
        fields_meta_list = ["id bigint"]
        fields_types = [DataType.ID]
        headers = ["id"]
        collection = client.get_collection(collection_name)
        total_num = collection.points_count
        if isinstance(collection.config.params.vectors, dict):
            # multi vectors
            for vec_name in collection.config.params.vectors.keys():
                fields_types.append(DataType.FLOAT_VECTOR)
                fields_meta_list.append(f"{vec_name} real[]")
                headers.append(vec_name)
        else:
            # single vector
            fields_types.append(DataType.FLOAT_VECTOR)
            fields_meta_list.append("vector real[]")
            headers.append("vector")
    
        fields_types.append(DataType.JSON)
        fields_meta_list.append("payload json")
        headers.append("payload")
    
        fields_meta_str = ','.join(fields_meta_list)
        create_table_sql = f"CREATE TABLE {collection_name} " \
                           f" ({fields_meta_str});"
    
        with open(os.path.join(export_config["output_path"], collection_name, "create_table.sql"), "w") as f_d:
            f_d.write(create_table_sql)
    
        print(create_table_sql)
    
        def write_to_csv_file(col_names, data):
            if len(results) == 0:
                return
            nonlocal file_cnt
            assert(file_cnt <= 1e9)
            output_file_name = os.path.join(export_config["output_path"], collection_name, f"{str(file_cnt).zfill(10)}.csv")
            with open(output_file_name, "w", newline="") as csv_file:
                # write header
                csv_write_header(col_names, csv_file)
                # write data
                csv_write_rows(data, csv_file, fields_types)
                file_cnt += 1
                results.clear()
    
        offset_id = None
    
        with tqdm(total=total_num, bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt}") as pbar:
            while True:
                res = client.scroll(collection_name=collection_name,
                                    limit=1000,
                                    offset=offset_id,
                                    with_payload=True,
                                    with_vectors=True)
    
                records = res[0]
                for record in records:
                    # append id
                    record_list = [record.id]
                    # append vectors
                    if isinstance(record.vector, dict):
                        # multi vector
                        for vector_name in headers[1:-1]:
                            record_list.append(record.vector[vector_name])
                    else:
                        # single vector
                        record_list.append(record.vector)
                    # append payload
                    record_list.append(json.dumps(record.payload, ensure_ascii=False))
                    results.append(record_list)
    
                    if len(results) >= export_config["max_line_in_file"]:
                        write_to_csv_file(headers, data=results)
    
                    pbar.update(1)
    
                if len(res) == 0 or len(res[0]) == 0 or res[1] is None:
                    # finished
                    break
                else:
                    offset_id = res[1]
    
        write_to_csv_file(headers, data=results)
    
    
    for name in config["export"]["collections"]:
        dump_collection(name)
    

    The export configuration file qdrant2csv.yaml is as follows.

    qdrant:  # Settings used to connect to Qdrant
        host: 'localhost'  # Host address of the Qdrant cluster
        port: 6333        # HTTP port of Qdrant. Default value: 6333.
        grpc_port: 6334   # gRPC port of Qdrant. Default value: 6334.
        api_key: ''  # API key used for authentication in Qdrant Cloud
        url: ''      # Host name or a string in the format "Optional[scheme], host, Optional[port], Optional[prefix]"
        location: '' # Set to ':memory:' to connect in in-memory mode. Any other non-empty string is treated the same as the url parameter. If left empty, host and port are used for the connection.

    export:
        collections:
            - 'test_collection'
            - 'multi'                 # List all collections to be exported
        max_line_in_file: 40000       # Maximum number of rows per CSV file
        output_path: './output'       # Output directory for the exported data
    
  2. Place the export script export.py, the configuration file qdrant2csv.yaml, and the output directory output in the same directory. The directory structure is as follows.

    ├── export.py
    ├── qdrant2csv.yaml
    └── output
  3. Modify the configuration items in qdrant2csv.yaml based on the information about your Qdrant cluster.

  4. Run the Python script and check the output.

    python export.py

    The output is as follows.

    .
    ├── export.py
    ├── qdrant2csv.yaml
    └── output
        ├── test_collection
        │   ├── 0000000000.csv
        │   ├── 0000000001.csv
        │   ├── 0000000002.csv
        │   └── create_table.sql
        └── multi
            ├── 0000000000.csv
            └── create_table.sql
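
    The generated files follow directly from the export script: create_table.sql contains the DDL that dump_collection prints, and each CSV chunk starts with a header row followed by |-delimited rows in which vectors are written as real[] array literals and payloads as escaped JSON strings. For a hypothetical test_collection that stores a single unnamed 3-dimensional vector, the files would look roughly like this (all values are illustrative only):

    -- output/test_collection/create_table.sql
    CREATE TABLE test_collection  (id bigint,vector real[],payload json);

    -- output/test_collection/0000000000.csv (first lines)
    id|vector|payload
    1|{0.05, 0.61, 0.76}|{\"city\": \"Berlin\"}
    2|{0.19, 0.81, 0.75}|{\"city\": \"London\"}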

Step 2: Import data into the AnalyticDB for PostgreSQL vector database

  1. Prepare the import script import.py, the import configuration file csv2adbpg.yaml, and the data to be imported in a directory named data (the output directory generated in the export step).

    The import script import.py is as follows.

    import psycopg2
    import yaml
    import glob
    import os
    
    if __name__ == "__main__":
        with open('csv2adbpg.yaml', 'r') as config_file:
            config = yaml.safe_load(config_file)
    
        print("current config:" + str(config))
    
        db_host = config['database']['host']
        db_port = config['database']['port']
        db_name = config['database']['name']
        schema_name = config['database']['schema']
        db_user = config['database']['user']
        db_password = config['database']['password']
        data_path = config['data_path']
    
        conn = psycopg2.connect(
            host=db_host,
            port=db_port,
            database=db_name,
            user=db_user,
            password=db_password,
            options=f'-c search_path={schema_name},public'
        )
    
        cur = conn.cursor()
    
        # check schema
        cur.execute("SELECT schema_name FROM information_schema.schemata WHERE schema_name = %s", (schema_name,))
        existing_schema = cur.fetchone()
        if existing_schema:
            print(f"Schema {schema_name} already exists.")
        else:
            # create schema
            cur.execute(f"CREATE SCHEMA {schema_name}")
            print(f"Created schema: {schema_name}")
    
        for table_name in os.listdir(data_path):
            table_folder = os.path.join(data_path, table_name)
            print(f"Begin Process table: {table_name}")
            if os.path.isdir(table_folder):
                create_table_file = os.path.join(table_folder, 'create_table.sql')
                with open(create_table_file, 'r') as file:
                    create_table_sql = file.read()
                try:
                    cur.execute(create_table_sql)
                except psycopg2.errors.DuplicateTable as e:
                    print(e)
                    conn.rollback()
                    continue
                print(f"Created table: {table_name}")
    
                cnt = 0
                csv_files = glob.glob(os.path.join(table_folder, '*.csv'))
                for csv_file in csv_files:
                    with open(csv_file, 'r') as file:
                        copy_command = f"COPY {table_name} FROM STDIN DELIMITER '|' HEADER"
                        cur.copy_expert(copy_command, file)
                    cnt += 1
                    print(f"Imported data from: {csv_file} | {cnt}/{len(csv_files)} file(s) Done")
    
            conn.commit()
            print(f"Finished import table: {table_name}")
            print('#'*60)
    
        cur.close()
        conn.close()
    

    The import configuration file csv2adbpg.yaml is as follows.

    database:
        host: "192.16.XX.XX"         # Public endpoint of the AnalyticDB for PostgreSQL instance
        port: 5432                   # Port number of the AnalyticDB for PostgreSQL instance
        name: "vector_database"      # Name of the destination database
        user: "username"             # Database account of the AnalyticDB for PostgreSQL instance
        password: ""                 # Password of the database account
        schema: "public"             # Name of the destination schema. The schema is created automatically if it does not exist.

    data_path: "./data"            # Source directory of the data to be imported
    
  2. Place the import script import.py, the configuration file csv2adbpg.yaml, and the data directory in the same directory. The directory structure is as follows.

    .
    ├── csv2adbpg.yaml
    ├── data
    │   ├── test_collection
    │   │   ├── 0000000000.csv
    │   │   ├── 0000000001.csv
    │   │   ├── 0000000002.csv
    │   │   └── create_table.sql
    │   └── multi
    │       ├── 0000000000.csv
    │       └── create_table.sql
    └── import.py
  3. Modify the configuration items in csv2adbpg.yaml based on the information about your AnalyticDB for PostgreSQL instance.

  4. Run the Python script.

    python import.py
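
    If the import succeeds, the script prints its progress for each table. The output looks similar to the following; the table order, paths, and file counts depend on your data.

    Schema public already exists.
    Begin Process table: test_collection
    Created table: test_collection
    Imported data from: ./data/test_collection/0000000000.csv | 1/3 file(s) Done
    Imported data from: ./data/test_collection/0000000001.csv | 2/3 file(s) Done
    Imported data from: ./data/test_collection/0000000002.csv | 3/3 file(s) Done
    Finished import table: test_collection
    ############################################################
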
  5. Check whether the data has been imported into the AnalyticDB for PostgreSQL vector database as expected.
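
    A simple way to verify the import is to compare the row count of each table with the points_count value that Qdrant reports for the corresponding collection, and to spot-check a few rows. The schema and table names below are placeholders from this example; replace them with your own.

    -- Row count should match the collection's points_count in Qdrant.
    SELECT COUNT(*) FROM public.test_collection;
    -- Spot-check that vectors and payloads were copied as expected.
    SELECT id, vector, payload FROM public.test_collection LIMIT 5;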

  6. Rebuild the required vector indexes. For more information, see Create vector indexes.
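
    As an illustration only, and assuming the vector index syntax described in the Create vector indexes topic (an ann access method with dimension and HNSW parameters), rebuilding an index on the exported vector column might look like the following. Confirm the exact access method name and WITH options against that topic before running it.

    -- Illustrative sketch; verify the syntax in the Create vector indexes topic.
    -- The dim value must match the dimension of the vectors exported from Qdrant.
    CREATE INDEX ON test_collection USING ann (vector) WITH (dim = '3', hnsw_m = '64', pq_enable = '0');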

References

For more information about Qdrant, see the Qdrant product documentation.