Restore downloaded cloud disk backup data to a self-managed MongoDB database - ApsaraDB for MongoDB

This topic describes how to use mongorestore to restore the cloud disk backup files of an ApsaraDB for MongoDB instance to a self-managed MongoDB database.

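Background information

MongoDB provides a pair of official backup and restore tools: mongodump and mongorestore. Logical backups of ApsaraDB for MongoDB are generated by mongodump. To restore a logical backup to a self-managed MongoDB database, use mongorestore.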

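Usage notes

MongoDB is updated continuously, and older versions of mongorestore are not compatible with newer versions of MongoDB. Select a mongorestore version that is compatible with your MongoDB version. For information about how to select a version, see mongorestore.
Even if a collection contains very little data and has only one BSON file, such as myDatabase/myCollection/data/myCollection_0_part0.bson, you must still merge or rename the BSON file, because mongorestore uses the file name prefix (the part before .bson) as the collection name when it processes a BSON file.
The cloud disk backup download also handles empty collections whose schema is retained: it produces an empty BSON file that carries the database and collection names. mongorestore processes such empty files normally.
For sharded cluster instances, the downloaded cloud disk backup files no longer contain shard routing information, so the backup data can be restored to any standalone, replica set, or sharded cluster instance. To restore the data to a sharded cluster instance, you must pre-shard the collections yourself.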
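
The file naming rule in the notes above can be illustrated with a short Python sketch. It assumes the layout shown in the example path (`<database>/<collection>/data/<collection>_<n>_part<m>.bson`) and computes the path at which mongorestore would pick up the collection under its intended name. The function name `restore_target` is hypothetical and is not part of any ApsaraDB or MongoDB tool.

```python
import re
from pathlib import PurePosixPath

# Assumed part-file naming, taken from the example path in the notes:
#   <database>/<collection>/data/<collection>_<n>_part<m>.bson
PART_FILE = re.compile(r"^.+_.+_part\d+\.bson$")


def restore_target(part_path: str) -> str:
    """Return the <database>/<collection>.bson path a part file should end up at."""
    path = PurePosixPath(part_path)
    if not PART_FILE.match(path.name) or path.parent.name != "data":
        raise ValueError("not a backup part file: {}".format(part_path))
    collection_dir = path.parent.parent   # <database>/<collection>
    database_dir = collection_dir.parent  # <database>
    # mongorestore takes the collection name from the file name before ".bson",
    # so the merged or renamed file must be called <collection>.bson.
    return str(database_dir / (collection_dir.name + ".bson"))
```

For the example path in the notes, the sketch returns `myDatabase/myCollection.bson`, which is the name the merged or renamed file needs so that the data lands in `myCollection` rather than in a collection named after the part file.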

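Prerequisites

MongoDB of the same version as the ApsaraDB for MongoDB instance is downloaded and installed on the client that hosts the self-managed MongoDB database (an on-premises server or an Elastic Compute Service (ECS) instance). For installation instructions, see Install MongoDB.
The logical backup is downloaded. If it is not, see Download backup files.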

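Procedure

Copy the downloaded backup file to the client that hosts the self-managed MongoDB database (the client on which mongorestore is installed).
Decompress the backup package.
Backup files can be downloaded in two formats, tar.zst and tar.gz, which use the zstd and gzip compression algorithms respectively. You can select the download format with the UseZstd parameter of the CreateDownload API.
tar.zst (console download)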
```
zstd -d -c <tar.zst backup package> | tar -xvf - -C <extraction directory>
```
Make sure that the zstd tool is installed locally and that the extraction directory already exists.
Example:
```
mkdir -p ./download_test/test1
zstd -d -c test1.tar.zst | tar -xvf - -C ./download_test/test1/
```
tar.gz (default format for OpenAPI downloads)
```
tar -zxvf <tar.gz backup package> -C <extraction directory>
```
Make sure that the extraction directory already exists.
Example:
```
mkdir -p ./download_test/test1
tar -zxvf testDB.tar.gz -C ./download_test/test1/
```
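
If you prefer to script the tar.gz extraction, the same step can be sketched with the Python standard library. This is a minimal sketch, not part of the official procedure: the helper name `extract_tar_gz` is hypothetical, and the sketch covers only the gzip format, because the standard `tarfile` module is not assumed to support zstd here.

```python
import os
import tarfile


def extract_tar_gz(archive_path: str, dest_dir: str) -> list:
    """Extract a tar.gz backup package into dest_dir and return the member names."""
    os.makedirs(dest_dir, exist_ok=True)  # same effect as `mkdir -p`
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        if hasattr(tarfile, "data_filter"):
            # Python versions with extraction filters can reject unsafe members
            # such as absolute paths or entries that escape dest_dir.
            archive.extractall(dest_dir, filter="data")
        else:
            archive.extractall(dest_dir)
    return names
```

Calling `extract_tar_gz("testDB.tar.gz", "./download_test/test1")` mirrors the shell example above and creates the extraction directory if it is missing.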

Merge the BSON files.

On a device with a Python 3 environment, save the following script as merge_bson_files.py.

```
import argparse
import os
import re
import shutil
import struct
from typing import Optional

# Part files are named <collection>_<n>_part<m>.bson
PART_FILE_PATTERN = re.compile(r'^.+_.+_part\d+\.bson$')


def natural_sort_key(name: str) -> list:
    """Sort key that compares digit runs numerically, so part10 follows part2."""
    tokens = re.split(r'([0-9]+)', name)
    # re.split with a capture group alternates text and digit runs: odd indexes are digits
    return [int(tok) if i % 2 else tok for i, tok in enumerate(tokens)]


def merge_single_bson_dir(input_dir: str, output_dir: str, namespace: str) -> None:
    """
    Merge the BSON part files in a single directory.

    Args:
        input_dir (str): Path of the directory that contains the BSON part files.
        output_dir (str): Path of the directory for the output file.
        namespace (str): Name of the output file, without the extension.
    """
    try:
        # Collect all files that match <name>_<n>_part<m>.bson, in part order
        files = [f for f in os.listdir(input_dir) if PART_FILE_PATTERN.match(f)]
        files.sort(key=natural_sort_key)

        if not files:
            print("No matching .bson files found in {}".format(input_dir))
            return

        output_file = os.path.join(output_dir, "{}.bson".format(namespace))
        if os.path.exists(output_file):
            print("Output file {} already exists, skipping...".format(output_file))
            return

        print("Merging {} files into {}...".format(len(files), output_file))

        # Stream each part file into the output, one document at a time
        total_files = len(files)
        with open(output_file, "wb") as out_f:
            for index, filename in enumerate(files, 1):
                file_path = os.path.join(input_dir, filename)
                print("  Processing file {}/{}: {}...".format(index, total_files, filename))

                try:
                    with open(file_path, "rb") as in_f:
                        while True:
                            # Read the 4-byte BSON document size
                            size_data = in_f.read(4)
                            if not size_data:
                                break
                            if len(size_data) < 4:
                                print("  Warning: truncated size header in {}".format(filename))
                                break

                            # Parse the size (little-endian int32, counts these 4 bytes too)
                            doc_size = struct.unpack("<i", size_data)[0]
                            if doc_size < 5:
                                # The smallest valid BSON document is 5 bytes;
                                # stop here instead of looping forever
                                print("  Warning: invalid document size {} in {}".format(doc_size, filename))
                                break

                            # Read the rest of the document
                            doc_data = size_data + in_f.read(doc_size - 4)
                            if len(doc_data) != doc_size:
                                print("  Warning: truncated document in {}".format(filename))
                                break

                            out_f.write(doc_data)
                except Exception as e:
                    print("Error reading {}: {}".format(filename, str(e)))
    except Exception as e:
        print("Error in merge_single_bson_dir: {}".format(str(e)))


def merge_bson_files_recursive(input_root: str, output_root: Optional[str] = None) -> None:
    """
    Walk the backup directory tree and merge all BSON part files.

    Args:
        input_root (str): Root directory that contains the BSON files.
        output_root (str): Root directory for the output files. Defaults to input_root.
    """
    if output_root is None:
        output_root = input_root

    # Make sure the output root directory exists
    os.makedirs(output_root, exist_ok=True)

    print("Scanning directories in {}...".format(input_root))

    # Iterate over every entry in the input root (one directory per database)
    for item in os.listdir(input_root):
        item_path = os.path.join(input_root, item)

        # Only directories are processed
        if os.path.isdir(item_path):
            print("Processing directory: {}".format(item))

            # Create the matching output directory
            output_item_path = os.path.join(output_root, item)
            os.makedirs(output_item_path, exist_ok=True)

            # Iterate over the collection directories of this database
            for item_d in os.listdir(item_path):
                sub_item_path = os.path.join(item_path, item_d)
                if not os.path.isdir(sub_item_path):
                    # Skip plain files, for example output of an earlier in-place run
                    continue
                for sub_item in os.listdir(sub_item_path):
                    data_path = os.path.join(sub_item_path, sub_item)
                    # A "data" directory holds the BSON part files to merge
                    if os.path.isdir(data_path) and sub_item == "data":
                        # The namespace is the name of the parent (collection) directory
                        namespace = os.path.basename(sub_item_path)
                        merge_single_bson_dir(data_path, output_item_path, namespace)
                    # A .metadata.json file is copied to the output directory as is
                    elif sub_item.endswith(".metadata.json"):
                        src_file = os.path.join(sub_item_path, sub_item)
                        target_file = os.path.join(output_item_path, sub_item)
                        shutil.copy(src_file, target_file)
                        print("Copied metadata file: {}".format(sub_item))
            print("Finished processing directory: {}".format(item))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Recursively merge BSON files")
    parser.add_argument("input_root", help="Root directory that contains the BSON files")
    parser.add_argument("-o", "--output_root", help="Root directory for the output files. Defaults to the input root")

    args = parser.parse_args()
    merge_bson_files_recursive(args.input_root, args.output_root)
```

Run the command:

```
python merge_bson_files.py <input_directory> -o <output_directory>
```
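
Before you restore, it can be useful to sanity-check a merged file. A .bson file is a plain concatenation of BSON documents, and each document starts with a little-endian int32 that holds its total size and ends with a zero byte. The following sketch walks those length prefixes and counts the documents. The function name `count_bson_documents` is hypothetical, and the sketch checks only the framing, not the contents of each document.

```python
import struct


def count_bson_documents(path: str) -> int:
    """Count the documents in a .bson file by walking the length prefixes."""
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                return count  # clean end of file
            if len(header) < 4:
                raise ValueError("truncated length prefix in {}".format(path))
            (doc_size,) = struct.unpack("<i", header)
            if doc_size < 5:
                # 4 size bytes plus the terminating zero byte is the minimum
                raise ValueError("invalid document size {} in {}".format(doc_size, path))
            body = f.read(doc_size - 4)
            if len(body) != doc_size - 4 or body[-1:] != b"\x00":
                raise ValueError("truncated or corrupt document in {}".format(path))
            count += 1
```

An empty file yields a count of 0, which matches the empty BSON files that the backup produces for empty collections.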

Use mongorestore to restore the backup data to the database instance.

```
# Restore a single collection
mongorestore --uri=<mongodb-uri> --db <db> --collection <collection> <xxx.bson>
# Example: restore a single collection
mongorestore --uri='mongodb://127.x.x.x:27017/?authSource=admin' --db testDB --collection coll1 ./testDB/coll1.bson
# Restore a single database
mongorestore --uri=<mongodb-uri> --db <db> --dir </path/to/bson/dir>
# Example: restore a single database
mongorestore --uri='mongodb://127.x.x.x:27017/?authSource=admin' --db testDB --dir ./testDB
# Restore an entire instance
mongorestore --uri=<mongodb-uri> --dir </path/to/bson/dir>
# Example: restore an entire instance
mongorestore --uri='mongodb://127.x.x.x:27017/?authSource=admin' --dir ./
```
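
If you drive the restore from a script, the three command shapes above can be assembled programmatically. The following is a minimal sketch that only builds the argument list from the flags shown above (`--uri`, `--db`, `--collection`, `--dir`). The function name `build_mongorestore_args` is hypothetical, and the check that a collection needs a database is a guard added for this sketch.

```python
def build_mongorestore_args(uri, path, db=None, collection=None):
    """Build a mongorestore argument list for a collection, database, or instance restore."""
    if collection is not None and db is None:
        # Guard added for this sketch: a single-collection restore names its database
        raise ValueError("collection requires db")
    args = ["mongorestore", "--uri={}".format(uri)]
    if db is not None:
        args += ["--db", db]
    if collection is not None:
        # Single-collection restore: path is the merged <collection>.bson file
        args += ["--collection", collection, path]
    else:
        # Database or instance restore: path is a directory of BSON files
        args += ["--dir", path]
    return args
```

The resulting list can be passed to `subprocess.run(args, check=True)`. Passing a list rather than a shell string avoids quoting problems with special characters in the URI.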

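Parameters:

<mongodb-uri>: the high-availability connection address of the self-managed or cloud MongoDB instance. The URI contains the username, the password, and the IP address and port of the server. For details, see the official documentation.
<db>: the name of the database to restore.
<collection>: the name of the collection to restore.
<xxx.bson>: the backup BSON file that corresponds to the collection to restore.
</path/to/bson/dir>: the directory that contains the BSON files to restore.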

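FAQ

How do I restore data to a self-managed database if the instance type does not support downloading backup files?

You can use Data Transmission Service (DTS) to migrate the instance data to the self-managed database. For more information, see the migration solutions whose source is a self-managed MongoDB database or ApsaraDB for MongoDB.
Alternatively, use mongodump and mongorestore, the backup and restore tools that ship with MongoDB, to back up and restore the instance.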
