YouTube Spam Collection

March 25, 2017

The YouTube Spam Collection is a public set of comments curated for spam research on user-generated content. It contains 1,956 real comments collected from five YouTube videos that were among the 10 most viewed during the collection period. Each comment is labeled as either legitimate (ham) or spam, so the collection can be used to develop and evaluate spam filtering algorithms.

The collection is divided into five datasets, one per video. The video IDs and class counts are as follows:

Psy – “Gangnam Style” (9bZkp7q19f0): 175 spam, 175 ham (350 total).
Katy Perry – “Roar” (CevxZvSJLk8): 175 spam, 175 ham (350 total).
LMFAO – “Party Rock Anthem” (KQ6zr6kCPj8): 236 spam, 202 ham (438 total).
Eminem – “Love the Way You Lie” (uelHwf8o7_U): 245 spam, 203 ham (448 total).
Shakira – “Waka Waka” (pRpeEdMmmQ0): 174 spam, 196 ham (370 total).
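
As a quick consistency check, the per-video counts listed above can be tallied in a few lines of Python. This is only an illustrative sketch: the dictionary layout and the `summarize` helper are assumptions made for this example, not part of the dataset's documentation.

```python
# Per-video class counts, copied from the list above.
# The dictionary layout is an assumption made for this sketch.
SUBSETS = {
    "Psy": {"video_id": "9bZkp7q19f0", "spam": 175, "ham": 175, "total": 350},
    "Katy Perry": {"video_id": "CevxZvSJLk8", "spam": 175, "ham": 175, "total": 350},
    "LMFAO": {"video_id": "KQ6zr6kCPj8", "spam": 236, "ham": 202, "total": 438},
    "Eminem": {"video_id": "uelHwf8o7_U", "spam": 245, "ham": 203, "total": 448},
    "Shakira": {"video_id": "pRpeEdMmmQ0", "spam": 174, "ham": 196, "total": 370},
}


def summarize(subsets):
    """Return (spam, ham, all comments) summed over every subset."""
    spam = sum(s["spam"] for s in subsets.values())
    ham = sum(s["ham"] for s in subsets.values())
    return spam, ham, spam + ham


# Each stated per-video total should equal spam + ham for that video.
for name, s in SUBSETS.items():
    assert s["spam"] + s["ham"] == s["total"], name

spam_total, ham_total, total = summarize(SUBSETS)
print(f"spam={spam_total} ham={ham_total} total={total}")
print(f"spam share={spam_total / total:.3f}")
```

The counts add up to 1,005 spam and 951 ham comments, which matches the 1,956 comments stated for the whole collection and shows that the classes are close to balanced.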

The chronological order of comments is preserved, which suits studies that depend on arrival order, such as online learning experiments. The dataset contains no missing values, so it can be used for classification tasks, particularly machine-learning-based spam filtering, without imputation.
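
Because the comments are stored in chronological order, a time-ordered holdout (train on the earliest comments, test on the latest) is one natural way to evaluate a filter. The sketch below shows the idea with a deliberately tiny Naive Bayes classifier written with the standard library only. Several things here are assumptions for illustration and should be checked against the downloaded files: the column names `CONTENT` and `CLASS`, the encoding 1 = spam and 0 = ham, and the sample rows, which are made up and are not taken from the dataset. The classifier is a generic baseline, not the method used in the publications listed below.

```python
import csv
import io
import math
from collections import Counter

# Made-up rows for illustration only; column names and the
# 1 = spam / 0 = ham encoding are assumptions about the CSV files.
SAMPLE = """COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
c1,a,2013-11-07T06:20:48,check out my channel please,1
c2,b,2013-11-07T12:37:15,this song is amazing,0
c3,c,2013-11-08T17:34:21,subscribe to my channel for free gifts,1
c4,d,2013-11-09T08:28:43,love this video so much,0
c5,e,2013-11-10T16:05:38,please subscribe to my channel,1
c6,f,2013-11-11T10:00:00,this song is so good,0
"""


def load_comments(csv_file):
    """Read (text, label) pairs, keeping the file's row order."""
    return [(row["CONTENT"], int(row["CLASS"])) for row in csv.DictReader(csv_file)]


def chronological_split(rows, train_fraction=0.7):
    """Split without shuffling: earliest rows train, latest rows test."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.class_counts = Counter()
        self.vocab = set()

    def fit(self, rows):
        for text, label in rows:
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.class_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in (0, 1):
            # Smoothed log prior, then smoothed log likelihood per token.
            score = math.log((self.class_counts[label] + 1) / (total_docs + 2))
            # The extra +1 reserves probability mass for unseen words.
            denom = sum(self.word_counts[label].values()) + len(self.vocab) + 1
            for tok in text.lower().split():
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


rows = load_comments(io.StringIO(SAMPLE))
train, held_out = chronological_split(rows)
model = NaiveBayes().fit(train)
predictions = [model.predict(text) for text, _ in held_out]
accuracy = sum(p == y for p, (_, y) in zip(predictions, held_out)) / len(held_out)
print(f"train={len(train)} test={len(held_out)} accuracy={accuracy:.2f}")
```

To run this on the real data, `io.StringIO(SAMPLE)` would be replaced by an opened CSV file from one of the five datasets. Shuffling before the split is avoided on purpose, since it would let the model train on comments posted after the ones it is tested on.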

Publications

MDLText e Indexação Semântica aplicados na Detecção de Spam nos Comentários do YouTube
R.M. SILVA, T.C. ALBERTO, T.A. ALMEIDA
iSys – Revista Brasileira de Sistemas de Informação, 10(3), 49-73, 2017

Towards Filtering Undesired Short Text Messages using an Online Learning Approach with Semantic Indexing
R.M. SILVA, T.C. ALBERTO, T.A. ALMEIDA, A. YAMAKAMI
Expert Systems with Applications, Elsevier, 83(2017), 314-325, 2017

Filtrando Comentários do YouTube através de Classificação Online baseada no Princípio MDL e Indexação Semântica
R.M. SILVA, T.C. ALBERTO, T.A. ALMEIDA, A. YAMAKAMI
Anais do XIII Encontro Nacional de Inteligência Artificial e Computacional (ENIAC’16), Recife, Brazil, October, 2016

Post or Block? Advances in Automatically Filtering Undesired Comments
T.C. ALBERTO, J. VON LOCHTER, T.A. ALMEIDA
Journal of Intelligent & Robotic Systems, 80, 245-259, Springer, 2015

TubeSpam: Comment Spam Filtering on YouTube
T.C. ALBERTO, J. VON LOCHTER, T.A. ALMEIDA
Proceedings of the 14th IEEE International Conference on Machine Learning and Applications (ICMLA’15), 138-143, Miami, FL, USA, December, 2015

Filtragem Automática de Spam nos Comentários do YouTube
T.C. ALBERTO, J. VON LOCHTER, T.A. ALMEIDA
Anais do XII Encontro Nacional de Inteligência Artificial e Computacional (ENIAC’15), Natal, Brazil, November, 2015

Learning to Block Undesired Comments in the Blogosphere
T.A. ALMEIDA, T.C. ALBERTO
Proceedings of the 12th IEEE International Conference on Machine Learning and Applications (ICMLA’13), Vol. 2, 261-266, Miami, FL, USA, December, 2013

Aprendizado de Máquina Aplicado na Detecção Automática de Comentários Indesejados
T.C. ALBERTO, T.A. ALMEIDA
Anais do X Encontro Nacional de Inteligência Artificial e Computacional (ENIAC’13), Fortaleza, Brazil, October, 2013

Data

The collection is intended for researchers and developers who build and evaluate models for filtering spam on online platforms.

The datasets and documentation are publicly available at Kaggle and the UCI Machine Learning Repository.