Development of a Keyword Extractor Using the Bidirectional Encoder Representations from Transformers (BERT) Model

Muhammad Mubarok Azzam(1), Ade Jamal(2),


(1) Universitas Al-Azhar Indonesia
(2) Universitas Al-Azhar Indonesia
(*) Corresponding Author

Abstract


To extract key information from documents, keyword extraction is often used as an automated process for identifying the most relevant words and phrases. Models such as Rapid Automatic Keyword Extraction (RAKE) and Yet Another Keyword Extractor (YAKE) operate on the statistical properties of text without considering semantic similarity. Bidirectional Encoder Representations from Transformers (BERT), a bidirectional transformer model, addresses this limitation by converting phrases and documents into vectors that capture semantic meaning. This research tests a keyword extraction system on the abstracts of Indonesian theses using the BERT model "cahya/bert-base-indonesian-1.5G" from HuggingFace. The study additionally employs three similarity measures (Cosine Similarity, Euclidean Distance, Manhattan Distance) to score the similarity between the text and candidate keywords. The results show that YAKE performed best overall, followed by RAKE. The BERT model showed lower performance, although among the BERT variants, Euclidean Distance outperformed Cosine Similarity and Manhattan Distance.
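The ranking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: in the study, the document and candidate vectors come from the "cahya/bert-base-indonesian-1.5G" BERT model, whereas here toy 3-dimensional vectors stand in for the embeddings. The function name `rank_candidates` and the toy data are assumptions for illustration only.

```python
import numpy as np

def rank_candidates(doc_vec, cand_vecs, metric="cosine"):
    """Rank candidate keyword vectors against a document vector.

    Cosine similarity is higher-is-better; the two distance metrics
    are lower-is-better, so they are negated to keep a single
    descending-score ordering across all three measures.
    """
    doc_vec = np.asarray(doc_vec, dtype=float)
    cand_vecs = np.asarray(cand_vecs, dtype=float)
    if metric == "cosine":
        scores = (cand_vecs @ doc_vec) / (
            np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(doc_vec)
        )
    elif metric == "euclidean":
        scores = -np.linalg.norm(cand_vecs - doc_vec, axis=1)
    elif metric == "manhattan":
        scores = -np.abs(cand_vecs - doc_vec).sum(axis=1)
    else:
        raise ValueError(f"unknown metric: {metric}")
    return np.argsort(-scores)  # candidate indices, best first

# Toy stand-ins for BERT embeddings of one document and three candidates
doc = [1.0, 0.0, 1.0]
cands = [[1.0, 0.1, 0.9],   # close to the document vector
         [0.0, 1.0, 0.0],   # orthogonal to it
         [0.5, 0.5, 0.5]]
print(rank_candidates(doc, cands, "cosine"))  # → [0 2 1]
```

On these toy vectors all three measures agree on the ordering; on real embeddings they can disagree, which is exactly the comparison the study reports.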






DOI: http://dx.doi.org/10.36722/exc.v2i1.3378
