Configure the Text Summarization component - Platform For AI

The Text Summarization component uses a TextRank-based automatic summarization algorithm to extract key sentences from a document. The resulting summary is concise and coherent, and it accurately reflects the main ideas of the original document. This topic describes how to configure the Text Summarization component.
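The ranking step behind TextRank can be sketched as follows. This is a minimal illustration of the standard weighted TextRank update, not the component's actual implementation, which this topic does not describe. The function name `textrank` and its argument names are hypothetical. The default values mirror the Damping factor, Maximum iterations, and Convergence coefficient parameters described in this topic.

```python
def textrank(sim, damping=0.85, max_iter=100, epsilon=1e-6):
    """Score sentences by power iteration over a sentence similarity graph.

    sim is a square matrix (list of lists) where sim[j][i] is the
    similarity between sentence j and sentence i. Hypothetical helper.
    """
    n = len(sim)
    if n == 0:
        return []
    # Total edge weight leaving each sentence, ignoring self-similarity.
    out_weight = [sum(sim[j][k] for k in range(n) if k != j) for j in range(n)]
    scores = [1.0] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if j != i and out_weight[j] > 0:
                    # Sentence j passes on a share of its score,
                    # proportional to how similar it is to sentence i.
                    rank += sim[j][i] / out_weight[j] * scores[j]
            new_scores.append((1 - damping) + damping * rank)
        # Stop once no score changes by more than epsilon.
        delta = max(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < epsilon:
            break
    return scores


# Sentence 0 is similar to both other sentences, so it should rank highest.
example_scores = textrank([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
print(example_scores)
```

A higher score means that the sentence is more central to the document. The sentences with the highest scores are the candidates for the summary.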

Limits

The supported compute engine is MaxCompute.

Usage notes

Add a Sentence Splitting component upstream to split the text into one sentence per row.

Component configuration

You can configure the component parameters in either of the following ways.

Method 1: Use the GUI

You can configure the component parameters on the workflow page of Designer.

Tab	Parameter	Description
Fields Setting	Column for document ID	Enter the name of the column that contains the document IDs.
Fields Setting	Sentence column	Specify a single column.
Parameters Setting	Number of key sentences to output	Default value: 3.
Parameters Setting	Sentence similarity calculation method	The method used to calculate sentence similarity. Valid values: lcs_sim, levenshtein_sim, ssk, and cosine.
Parameters Setting	Weight of matching string	This parameter takes effect when Sentence similarity calculation method is set to ssk. Default value: 0.5.
Parameters Setting	Length of substring	This parameter takes effect when Sentence similarity calculation method is set to ssk or cosine. Default value: 2.
Parameters Setting	Damping factor	Default value: 0.85.
Parameters Setting	Maximum iterations	Default value: 100.
Parameters Setting	Convergence coefficient	Default value: 0.000001.
Execution tuning	Number of cores	Allocated automatically.
Execution tuning	Memory per core	Allocated automatically.
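To give a feel for what a sentence similarity method computes, the following sketch shows two possible measures: one based on the longest common subsequence and one based on the cosine of substring counts. The formulas the component actually uses are not stated in this topic. The normalization by the longer sentence, the character-level comparison, and the function names `lcs_sim` and `kgram_cosine` are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_sim(a, b):
    """LCS length divided by the longer string length (assumed normalization)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def kgram_cosine(a, b, k=2):
    """Cosine similarity over counts of length-k character substrings (assumed)."""
    grams_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    grams_b = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    dot = sum(count * grams_b[gram] for gram, count in grams_a.items())
    norm_a = sqrt(sum(c * c for c in grams_a.values()))
    norm_b = sqrt(sum(c * c for c in grams_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


print(lcs_sim("abcd", "abed"))
print(kgram_cosine("abab", "abab", k=2))
```

In this sketch, `k` plays the role of the Length of substring parameter: a larger value makes the cosine measure stricter because longer substrings must match.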

Method 2: Use PAI commands

You can use PAI commands to configure the component parameters. To do so, use the SQL Script component to call the PAI command. For more information, see SQL Script.

PAI -name TextSummarization
    -project algo_public
    -DinputTableName="test_input"
    -DoutputTableName="test_output"
    -DdocIdCol="doc_id"
    -DsentenceCol="sentence"
    -DtopN=2
    -Dlifecycle=30;

Parameter	Required	Description	Default value
inputTableName	Yes	The name of the input table.	None
inputTablePartitions	No	The partitions of the input table that are used for computation.	All partitions of the input table
outputTableName	Yes	The name of the output table.	None
docIdCol	Yes	The name of the column that contains the document IDs.	None
sentenceCol	Yes	The sentence column. You can specify only one column.	None
topN	No	The number of top-ranked key sentences to output.	3
similarityType	No	The method used to calculate sentence similarity. Valid values: lcs_sim, levenshtein_sim, ssk, and cosine.	lcs_sim
lambda	No	The weight of the matching string. This parameter is available when `similarityType` is set to ssk.	0.5
k	No	The length of the substring. This parameter is available when `similarityType` is set to ssk or cosine.	2
dampingFactor	No	The damping factor.	0.85
maxIter	No	The maximum number of iterations.	100
epsilon	No	The convergence coefficient.	0.000001
lifecycle	No	The lifecycle of the output table.	None
coreNum	No	The number of cores used for computation.	Allocated automatically by the system
memSizePerCore	No	The memory required for each core.	Allocated automatically by the system
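When the same command has to be generated for several tables, it can be convenient to assemble the command text from a dictionary of parameters. The following sketch only builds the command string. It does not submit the command to MaxCompute. The helper name `build_pai_command` is hypothetical and is not part of any PAI or MaxCompute SDK.

```python
def build_pai_command(name, project, params):
    """Render a PAI command string from a dict of -D parameters.

    Hypothetical helper for illustration: string values are quoted,
    numeric values are written as-is.
    """
    lines = [f"PAI -name {name}", f"    -project {project}"]
    for key, value in params.items():
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        lines.append(f"    -D{key}={rendered}")
    return "\n".join(lines) + ";"


command = build_pai_command(
    "TextSummarization",
    "algo_public",
    {
        "inputTableName": "test_input",
        "outputTableName": "test_output",
        "docIdCol": "doc_id",
        "sentenceCol": "sentence",
        "topN": 2,
        "lifecycle": 30,
    },
)
print(command)
```

The generated text can then be pasted into a SQL Script component.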

Example

Prepare the input table `test_input`. The following table shows the sample data.

You can use the MaxCompute client to create the table and use Tunnel commands to upload the data. For more information about how to install and configure the MaxCompute client, see Connect using the local client (odpscmd). For more information about Tunnel commands, see Tunnel commands.

doc_id	sentence
1000897	Since the COVID-19 outbreak, the consumption of wild animals has become a prominent issue. This poses a great risk to public health and has drawn widespread social concern. Public security, forestry, and market regulation departments across the country have launched special campaigns to combat the illegal hunting, trafficking, and consumption of wild animals, achieving notable success. While cracking down on these illegal activities, law enforcement found that a large consumer base, enormous poaching profits, and the difficulty and high cost of identification are key reasons the illegal wildlife trade continues to thrive.

Where:

doc_id: the document ID column.
sentence: the sentence column.

Use the Sentence Splitting component to split the text in the `sentence` column into one sentence per row. The output table is named `test_output` and contains the following rows. For more information, see Sentence Splitting.

doc_id	sentence
1000897	Since the COVID-19 outbreak, the consumption of wild animals has become a prominent issue.
1000897	This poses a great risk to public health and has drawn widespread social concern.
1000897	Public security, forestry, and market regulation departments across the country have launched special campaigns to combat the illegal hunting, trafficking, and consumption of wild animals, achieving notable success.
1000897	While cracking down on these illegal activities, law enforcement found that a large consumer base, enormous poaching profits, and the difficulty and high cost of identification are key reasons the illegal wildlife trade continues to thrive.
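The effect of this splitting step can be approximated locally with a few lines of Python. This is a naive sketch that splits on sentence-ending punctuation followed by whitespace. It is not how the Sentence Splitting component is implemented, and it mishandles abbreviations such as "e.g.". The function name `split_sentences` is hypothetical.

```python
import re


def split_sentences(doc_id, text):
    """Split text into (doc_id, sentence) rows, one sentence per row.

    Naive rule: break after '.', '!' or '?' when followed by whitespace.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [(doc_id, sentence) for sentence in sentences if sentence]


sample = (
    "Since the COVID-19 outbreak, the consumption of wild animals has "
    "become a prominent issue. This poses a great risk to public health "
    "and has drawn widespread social concern."
)
for row in split_sentences("1000897", sample):
    print(row)
```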

Run the following PAI command to generate the text summary.

You can use the SQL Script component or the ODPS SQL Node component to run the following PAI command.

PAI -name TextSummarization
    -project algo_public
    -DinputTableName="test_output"
    -DoutputTableName="test_output1"
    -DdocIdCol="doc_id"
    -DsentenceCol="sentence"
    -DtopN=2
    -Dlifecycle=30;

The output table has two columns: doc_id and abstract.

doc_id	abstract
1000897	Since the COVID-19 outbreak, the consumption of wild animals has become a prominent issue. Public security, forestry, and market regulation departments across the country have launched special campaigns to combat the illegal hunting, trafficking, and consumption of wild animals, achieving notable success.
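In the sample output, the two selected sentences appear in the same order as in the source document. The following sketch shows one way such an abstract could be assembled from per-sentence scores. Restoring document order is an assumption that matches this sample, not a documented behavior of the component. The function name `build_abstract` and the scores in the usage example are hypothetical.

```python
def build_abstract(sentences, scores, top_n=3):
    """Join the top_n highest-scoring sentences in their original order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:top_n])  # restore document order (assumption)
    return " ".join(sentences[i] for i in chosen)


# Hypothetical scores, for illustration only.
abstract = build_abstract(["A.", "B.", "C.", "D."], [0.9, 0.1, 1.2, 0.3], top_n=2)
print(abstract)
```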

References

The Sentence Splitting component preprocesses data by splitting text segments into one sentence per row. For more information, see Sentence Splitting.
For more information about Designer, see Designer overview.