Enhancing K-Means Clustering for Journal Articles using TF-IDF and LDA Feature Extraction

Authors

  • Dewi Fatmarani Surianto dewifatmaranis@unm.ac.id
  • Dewi Fatmawati Surianto Universitas Bakrie

DOI:

https://doi.org/10.47709/brilliance.v4i2.5547

Keywords:

clustering, feature extraction, k-means, TF-IDF, LDA

Abstract

Clustering is a fundamental technique in data analysis, particularly in unsupervised learning, used to group data with similar characteristics. However, the effectiveness of the K-Means algorithm in text clustering depends heavily on proper feature extraction. This study proposes an enhanced feature extraction approach that integrates Term Frequency-Inverse Document Frequency (TF-IDF) and Latent Dirichlet Allocation (LDA) to improve clustering performance on journal article datasets. The dataset consists of 427 journal article abstracts collected from Google Scholar. The preprocessing steps include tokenization, stopword removal, and TF-IDF vectorization, followed by topic extraction using LDA, the output of which serves as the input features for the K-Means clustering algorithm. The optimal number of clusters is determined using the Silhouette Score, with the best result obtained at k=9, achieving a score of 0.6806. The results demonstrate that the combination of TF-IDF and LDA produces more informative text representations, significantly enhancing clustering quality. The practical implications of this study include improved accuracy in academic document clustering, with applications in journal recommendation systems, digital library indexing, and research trend analysis. This study contributes to text mining and data science by proposing a systematic preprocessing framework for document clustering. Future research could explore its application to full-text articles, hierarchical clustering, or deep learning-based models to further improve clustering performance.
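The cluster-selection step described in the abstract (run K-Means for several values of k and keep the k with the highest Silhouette Score) can be illustrated with a minimal, standard-library Python sketch. This is not the authors' code. The `doc_topics` vectors are invented toy topic proportions standing in for LDA output, the function names are hypothetical, and the deterministic farthest-first seeding is a simplification chosen here for reproducibility; the abstract does not state how K-Means was initialized.

```python
import math


def farthest_first_centers(points, k):
    """Deterministic seeding (an assumption of this sketch): start at the
    first point, then repeatedly add the point farthest from its nearest
    already-chosen center."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, max_iters=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_first_centers(points, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centers are stable
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def silhouette_score(points, labels):
    """Mean silhouette coefficient; points in singleton clusters score 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy stand-in for per-document LDA topic proportions (three topics).
doc_topics = [
    (0.90, 0.05, 0.05), (0.85, 0.10, 0.05), (0.80, 0.10, 0.10),
    (0.05, 0.90, 0.05), (0.10, 0.85, 0.05), (0.10, 0.80, 0.10),
    (0.05, 0.05, 0.90), (0.05, 0.10, 0.85), (0.10, 0.10, 0.80),
]

# Score each candidate k and keep the one with the highest silhouette.
scores = {}
for k in range(2, 6):
    scores[k] = silhouette_score(doc_topics, kmeans(doc_topics, k))
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 4))
```

On real data the feature vectors would come from an actual TF-IDF and LDA stage, and a library implementation of K-Means and the Silhouette Score would normally replace these hand-written functions.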

Published

2024-12-30

How to Cite

Surianto, D. F., & Surianto, D. F. (2024). Enhancing K-Means Clustering for Journal Articles using TF-IDF and LDA Feature Extraction. Brilliance: Research of Artificial Intelligence, 4(2), 964–972. https://doi.org/10.47709/brilliance.v4i2.5547
