Bots Recognition in Social Networks Using the Random Forest Algorithm

Mechanical Engineering and Computer Science. 2019; : 24-41

Khachatrian M. G., Klyucharev P. G.

https://doi.org/10.24108/0419.0001473

Abstract

Online social networks play an important role in the lives of millions of people as a communication tool. However, they are also an arena of information warfare. One instrument of such warfare is bots, understood here as software designed to imitate the behavior of a real user in online social networks.

The aim of this work is to develop a model for detecting bots in online social networks. The model is built with the Random Forest machine learning algorithm. Because machine learning algorithms need as much data as possible, the bot recognition problem is addressed for Twitter, an online social network that is widely used in bot detection research.

To train and test the Random Forest algorithm, a dataset of Twitter accounts was used that comprises more than 3,000 users and more than 6,000 bots. During training and testing, the hyperparameters of the algorithm that yield the highest F₁ score were determined. The implementation language was Python, which is widely used for machine learning tasks.
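
The hyperparameter search described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: it assumes scikit-learn, replaces the Twitter account features with synthetic data that has a similar class imbalance (about one third users, two thirds bots), and uses a hypothetical parameter grid, since the abstract does not give the actual search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for account features (assumption): label 0 = user,
# label 1 = bot, with roughly the 1:2 ratio of the 3,000 / 6,000 dataset.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[1 / 3, 2 / 3], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid; the search space used in the paper is not stated here.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "criterion": ["gini", "entropy"],
}

# Cross-validated grid search that selects the combination with the
# highest F1 score for the positive class (bots).
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)

best_params = search.best_params_
# GridSearchCV refits the best model on the full training split by default,
# so predict() here uses the selected hyperparameters.
test_f1 = f1_score(y_test, search.predict(X_test))
print(best_params, round(test_f1, 3))
```

Scoring on a held-out split after the search keeps the reported F₁ separate from the data used to pick the hyperparameters.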

To compare the developed model with those of other authors, it was tested on two datasets of Twitter accounts, each consisting half of bots and half of real users. On these datasets the model reached F₁ scores of 0.973 and 0.923, which are fairly high compared with the results reported by other authors.
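
For readers unfamiliar with the metric, the F₁ score is the harmonic mean of precision and recall for the positive class (here, bots). The sketch below computes it from confusion-matrix counts. The counts are hypothetical and chosen only to illustrate a balanced test set; they are not results from the paper.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)  # share of flagged accounts that are bots
    recall = tp / (tp + fn)     # share of bots that were flagged
    return 2 * precision * recall / (precision + recall)


# Hypothetical balanced test set: 100 bots and 100 real users.
tp, fn = 80, 20  # bots detected / bots missed
fp, tn = 10, 90  # users wrongly flagged / users correctly passed

score = f1_from_counts(tp, fp, fn)
print(round(score, 3))  # 0.842
```

True negatives do not enter the formula, so F₁ reflects only how well the bot class is recovered.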

As a result, this work yields a model that recognizes bots in the Twitter online social network and achieves high accuracy metrics.
