Shallow Text Clustering Does Not Mean Weak Topics: How Topic Identification Can Leverage Bigram Features - CIRAD - Centre de coopération internationale en recherche agronomique pour le développement
Conference paper, Year: 2016

Abstract

Text clustering and topic learning are two closely related tasks. In this paper, we show that topics can be learnt without strictly requiring an exact categorization. In particular, experiments on two real case studies, using a vocabulary based on bigram features, yield readable topics that cover most of the documents. Precision at 10 reaches 74% on a dataset of scientific abstracts with 10,000 features, which is 4% less than with unigrams only, but the resulting topics are more interpretable.
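The abstract does not say how the bigram vocabulary is built or exactly how precision at 10 is computed, so the following is only a minimal sketch under stated assumptions: whitespace tokenisation, a vocabulary capped by raw frequency, and precision at k taken as the share of the first k ranked items that carry the expected label. The function names `bigrams`, `build_vocabulary`, and `precision_at_k` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def bigrams(tokens):
    """Return the adjacent word pairs of a token list, joined by a space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def build_vocabulary(documents, max_features):
    """Keep the max_features most frequent bigrams across all documents.

    Assumption: lowercased whitespace tokenisation and a plain frequency
    cut-off; the paper may select its features differently.
    """
    counts = Counter()
    for text in documents:
        counts.update(bigrams(text.lower().split()))
    return [feature for feature, _ in counts.most_common(max_features)]


def precision_at_k(ranked_labels, target_label, k=10):
    """Fraction of the first k ranked items whose label equals target_label."""
    top = ranked_labels[:k]
    if not top:
        return 0.0
    return sum(label == target_label for label in top) / len(top)


# Toy corpus, invented for illustration only.
docs = [
    "text clustering groups similar documents",
    "topic learning and text clustering are related",
]
vocab = build_vocabulary(docs, max_features=3)
```

With a cap such as `max_features=10000`, the same call would give a vocabulary of the size mentioned in the abstract, provided the corpus contains that many distinct bigrams.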
Main file
dmnlp revisedJulienVelcin2016.pdf (278.62 KB)
Origin: Files produced by the author(s)

Dates and versions

lirmm-01362434, version 1 (08-09-2016)

Identifiers

  • HAL Id: lirmm-01362434, version 1

Cite

Julien Velcin, Mathieu Roche, Pascal Poncelet. Shallow Text Clustering Does Not Mean Weak Topics: How Topic Identification Can Leverage Bigram Features. DMNLP: Data Mining and Natural Language Processing, Sep 2016, Riva del Garda, Italy. ⟨lirmm-01362434⟩
