
Spam Detection on YouTube Comments Using Advanced Machine Learning Models: A Comparative Study

Authors

  • Gregorius Airlangga, Atma Jaya Catholic University of Indonesia

DOI:

10.47709/brilliance.v4i2.4670

Keywords:

Spam Detection, Machine Learning, YouTube Comments, Text Classification, LinearSVC




Abstract

The exponential growth of user-generated content on platforms like YouTube has brought a rise in spam comments, which degrade the user experience and complicate content moderation efforts. This research presents a comparative study of machine learning models for detecting spam comments on YouTube. It evaluates traditional and ensemble models, namely the Linear Support Vector Classifier (LinearSVC), RandomForest, LightGBM, XGBoost, and a VotingClassifier, to identify the most effective approach for automated spam detection. The dataset consists of labeled YouTube comments, converted to features with Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. Each model was trained and evaluated with stratified 10-fold cross-validation to support robust, generalizable estimates. LinearSVC outperformed all other models, achieving an accuracy of 95.33%, an F1-score of 95.32%, a precision of 95.46%, and a recall of 95.33%. These results point to LinearSVC as a candidate for real-time spam detection systems, balancing accuracy with computational efficiency. Ensemble models such as RandomForest and VotingClassifier also performed well but did not surpass the simpler LinearSVC in this setting. Future work will explore deep learning techniques, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), to capture more complex patterns and further improve spam detection accuracy on social media platforms like YouTube.
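The evaluation protocol named in the abstract (stratified folds, then accuracy, precision, recall, and F1) can be illustrated with a minimal standard-library sketch. It is not the author's code: the function names `stratified_folds` and `weighted_scores`, the round-robin fold assignment, the toy labels, and the support-weighted averaging are all assumptions made for illustration. The abstract does not state the averaging scheme, although a recall equal to the accuracy (both 95.33%) is what support-weighted averaging would produce.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k=10):
    """Assign sample indices to k folds so each fold keeps the class ratio.

    Hypothetical helper: indices of each class are dealt round-robin
    across the folds (no shuffling, for reproducibility of the sketch).
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def weighted_scores(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1.

    Assumption: per-class scores are averaged with weights equal to each
    class's share of the true labels.
    """
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n
    precision = recall = f1 = 0.0
    for cls, support in Counter(y_true).items():
        tp = sum(t == cls and p == cls for t, p in pairs)
        predicted = sum(p == cls for p in y_pred)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support / n
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return accuracy, precision, recall, f1


# Toy data (made up): 30 spam (1) and 20 legitimate (0) comments.
labels = [1] * 30 + [0] * 20
folds = stratified_folds(labels, k=10)

# Toy predictions for one held-out fold (made up).
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
accuracy, precision, recall, f1 = weighted_scores(y_true, y_pred)
```

With these toy labels, every fold holds three spam and two legitimate comments, mirroring the 60/40 class ratio of the full set. In a full experiment, each fold would serve once as the test set while a model is fitted on the other nine, and the per-fold scores would then be averaged.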




Article History

Submitted Date: 2024-09-16
Accepted Date: 2024-09-17
Published Date: 2024-10-04

How to Cite

Airlangga, G. (2024). Spam Detection on YouTube Comments Using Advanced Machine Learning Models: A Comparative Study. Brilliance: Research of Artificial Intelligence, 4(2), 500-508. https://doi.org/10.47709/brilliance.v4i2.4670
