
GenAI on Alibaba Cloud [Part 2]: Chat with Your PDF (Building a RAG System)

This article introduces how to build a “Chat with PDF” tool using RAG so Qwen can answer questions based on private documents.

In Part 1, we learned how to connect to Alibaba Cloud's Qwen model using Python. We asked it general questions and it gave us great answers.

But what if you ask Qwen: "How do I reset my password on my company's internal portal?" or "Summarize this specific legal contract?"

Qwen will fail. Why? Because generic LLMs do not know your private data.

To fix this, we need RAG (Retrieval-Augmented Generation). In this episode, we are going to build a "Chat with PDF" tool. We will feed a PDF document into our system, and Qwen will answer questions based only on that document.

What is RAG? (The "Open Book Exam" Analogy)

Standard LLMs work like a student taking a closed-book exam. They have to rely on their memory (training data). If they don't know the answer, they might guess (hallucinate).

RAG changes this to an open-book exam.

  1. Retrieval: When you ask a question, the system first looks up the relevant page in your "textbook" (your PDF).
  2. Generation: It sends that specific page to the LLM and says, "Read this page and answer the user's question."

To achieve this, we will use two Alibaba Cloud models:

Text-Embedding-v3: To convert text into search-friendly numbers.

Qwen-Plus: To read the text and generate the answer.
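
If you want to see each model on its own before wiring them together, here is a minimal, self-contained sketch. It assumes the same DASHSCOPE_API_KEY and international endpoint we set up in Part 1 and reuse again in Step 3:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# text-embedding-v3: text in, a list of 1024 numbers (a vector) out
emb = client.embeddings.create(
    model="text-embedding-v3",
    input=["The red light blinks when the water tank is empty."],
    dimensions=1024,
)
print(len(emb.data[0].embedding))  # 1024

# qwen-plus: a prompt in, a written answer out
chat = client.chat.completions.create(
    model="qwen-plus",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(chat.choices[0].message.content)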

Step 1: The Setup

We need a library to read PDFs. We will use pdfplumber because it handles text extraction very accurately. We also need numpy for the vector math.

Open your terminal and run:

pip install openai numpy pdfplumber python-dotenv

Step 2: The Logic (How to "Read" a PDF)

Computers can't understand text directly; they understand numbers. To search a PDF, we have to:

  1. Extract text from the PDF.
  2. Chunk it (split it into pages or paragraphs).
  3. Embed it (Turn those chunks into lists of numbers called "Vectors").

When the user asks a question, we convert their question into a Vector, too. Then, we simply look for the PDF chunk that is "mathematically closest" to the question.
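
What does "mathematically closest" mean? Vectors that point in roughly the same direction describe roughly the same meaning, and cosine similarity measures exactly that. Here is a tiny illustration with made-up 3-dimensional vectors (real text-embedding-v3 vectors have 1024 dimensions, but the math is identical):

import numpy as np

# Made-up 3-D "embeddings" (purely illustrative values)
question = np.array([0.9, 0.1, 0.0])             # "Why is the red light blinking?"
page_about_light = np.array([0.8, 0.2, 0.1])     # page describing the warning light
page_about_warranty = np.array([0.0, 0.1, 0.9])  # unrelated warranty page

def cosine_similarity(a, b):
    # 1.0 = pointing the same way (very similar), near 0.0 = unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(question, page_about_light))     # ~0.98 (best match)
print(cosine_similarity(question, page_about_warranty))  # ~0.01 (ignored)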

Step 3: The Code

Create a file named pdf_rag.py.
We are sticking to the OpenAI-compatible method we used in Part 1, so your code works globally without region errors.
Copy and paste the following code:

import os
import numpy as np
import pdfplumber
from openai import OpenAI
from dotenv import load_dotenv

# 1. Load API Key
# Make sure you have a .env file with: DASHSCOPE_API_KEY=sk-your_key
load_dotenv()

# 2. Setup Client (International Endpoint)
# We point the base_url to Alibaba Cloud's International server.
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"), 
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

def extract_text_from_pdf(pdf_path):
    """
    Reads a PDF and splits it into chunks (one chunk per page).
    """
    chunks = []
    if not os.path.exists(pdf_path):
        print(f"Error: File '{pdf_path}' not found.")
        return []

    print(f"Loading {pdf_path}...")
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text()
            if text:
                # We prepend "Page X" so the AI can cite its sources!
                chunks.append(f"[Page {i+1}] {text}")
    
    print(f"Successfully loaded {len(chunks)} pages.")
    return chunks

def get_embedding(text):
    """
    Calls Alibaba Cloud 'text-embedding-v3' to turn text into vectors.
    """
    # Clean up newlines to improve embedding quality
    text = text.replace("\n", " ")
    try:
        response = client.embeddings.create(
            model="text-embedding-v3", 
            input=[text],
            dimensions=1024
        )
        return response.data[0].embedding
    except Exception as e:
        print(f"Error getting embedding: {e}")
        return []

def find_best_match(query, corpus_embeddings, corpus_text):
    """
    Compares the user's question against all PDF pages using Cosine Similarity.
    """
    # 1. Embed the user's question
    query_vec = get_embedding(query)
    if not query_vec: return None

    # 2. Compare against every page in the PDF
    scores = []
    for doc_vec in corpus_embeddings:
        # Cosine similarity: dot product divided by both vectors' magnitudes
        score = np.dot(query_vec, doc_vec) / (
            np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)
        )
        scores.append(score)

    # 3. Get the index of the highest score
    best_idx = np.argmax(scores)
    return corpus_text[best_idx]

# --- MAIN APPLICATION ---
if __name__ == "__main__":
    
    # SETUP: Put a PDF file in the same folder and name it 'manual.pdf'
    pdf_filename = "manual.pdf" 
    
    print("--- Step 1: Processing PDF ---")
    kb_text = extract_text_from_pdf(pdf_filename)

    if not kb_text:
        print("Exiting: No text found.")
        exit()

    print("--- Step 2: Generating Embeddings (This takes a few seconds) ---")
    kb_vectors = [get_embedding(text) for text in kb_text]
    print("--- Ready! Ask questions about your PDF. ---\n")

    while True:
        user_query = input("You: ")
        if user_query.lower() in ['exit', 'quit']: break

        print("Searching document...")
        
        # 1. RETRIEVE: Find the best page
        best_context = find_best_match(user_query, kb_vectors, kb_text)
        if best_context is None:
            print("Sorry, the search failed. Please try again.")
            continue
        
        # 2. AUGMENT: Create the prompt
        prompt = f"""
        You are a helpful assistant. Answer the user's question based ONLY on the content below.
        If the answer is not in the content, say "I don't know."
        
        Document Content:
        {best_context}
        
        User Question: {user_query}
        """

        # 3. GENERATE: Ask Qwen
        try:
            completion = client.chat.completions.create(
                model="qwen-plus", # Strong reasoning model
                messages=[
                    {'role': 'system', 'content': 'You are a helpful assistant.'},
                    {'role': 'user', 'content': prompt}
                ]
            )
            print(f"\nQwen: {completion.choices[0].message.content}\n")
            print("-" * 50)
            
        except Exception as e:
            print(f"Error: {e}")

Step 4: Run It

  1. Find a PDF (e.g., a washing machine manual, a resume, or a business report).
  2. Rename it to manual.pdf and place it in the same folder as your python script.
  3. Run the script:

python pdf_rag.py

Example Output:

Imagine I uploaded a PDF about a coffee machine.

You: Why is the red light blinking?

Searching document...

Qwen: According to [Page 4], the red light blinks when the water tank is empty. Please refill the tank.

Why This is Powerful

Notice what just happened:

  1. We used text-embedding-v3 (International version) to understand the meaning of your question, not just keywords.
  2. We found the exact page with the answer.
  3. We used qwen-plus to read that page and explain it to you in plain English.

We didn't need a massive vector database or complex infrastructure. Just a few lines of Python code and Alibaba Cloud Model Studio.

Optimization Tips for Production

If you want to build this for a real startup or enterprise app:

  1. Vector Database: Instead of storing kb_vectors in a Python list (which disappears when you close the script), store them in Alibaba Cloud AnalyticDB for PostgreSQL. It has built-in vector search.
  2. Chunking: We split by "Page". For better accuracy, you might want to split by paragraph or by overlapping windows (e.g., 500 words at a time); a minimal sketch follows this list.
  3. Hybrid Search: Combine Vector search (semantic) with Keyword search (Elasticsearch) for the best results.
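
As a starting point for the chunking tip, here is one simple way to split a page's text into overlapping word windows before embedding it. The 200-word window and 50-word overlap are arbitrary defaults to tune for your own documents:

def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping windows of roughly chunk_size words."""
    words = text.split()
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Usage: inside extract_text_from_pdf(), embed each of these smaller chunks
# instead of embedding the whole page as one block.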

What's Next?

You now have a working RAG chatbot! But what if you want to build an AI that can take action? What if you want it to not just read the manual, but actually book a meeting or send an email?

In Part 3, we will explore AI Agents and Function Calling. We will teach Qwen how to use tools to interact with the real world.

See you then!


Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.
