Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

On the Helmholtz principle for text mining

Dadachev, Boris 2015. On the Helmholtz principle for text mining. PhD Thesis, Cardiff University.
Item availability restricted.

PDF - Accepted Post-Print Version (2MB)
PDF - Supplemental Material (105kB), restricted to repository staff only

Abstract

The majority of text mining systems rely on bag-of-words approaches, representing textual documents as multisets of their constituent words. Combined with term weighting mechanisms, this simple representation makes it possible to derive features that can be used as input by many different algorithms and for a variety of applications, including document classification, information retrieval and sentiment analysis. Since the performance of many mining algorithms depends directly on term weights, techniques for quantifying term importance are central to text processing. This thesis takes advantage of recent advances in keyword extraction mechanisms, which go one step further and keep only the most important words by selecting the terms with the highest weights. More precisely, building on a recent keyword extraction technique, we develop novel text mining algorithms for information retrieval, text segmentation and summarization. We find that these algorithms provide state-of-the-art performance under standard evaluation techniques. However, contrary to many state-of-the-art algorithms, we try to make as few assumptions as possible about the data being analyzed while maintaining good performance in terms of both speed and accuracy. As such, our algorithms can work with inputs from a variety of domains and languages, and they can also run in environments with limited resources. Additionally, in a field that tends to be dominated by empirical approaches, we strive to rely on sound and rigorous mathematical principles.
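
As a rough illustration of the bag-of-words representation mentioned in the abstract, the following minimal Python sketch builds a multiset of words for a document and attaches a simple weight to each term. The function names, the tokenization rule and the choice of relative frequency as the weight are assumptions made for this example only; they are not the Helmholtz-principle weighting developed in the thesis.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return the document as a multiset (term -> number of occurrences).

    Tokenization here is an assumption: lowercase runs of ASCII letters.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


def relative_frequencies(bag):
    """Weight each term by its share of all tokens in the document.

    This is a placeholder weighting scheme chosen for illustration,
    not the term weighting method of the thesis.
    """
    total = sum(bag.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in bag.items()}


doc = "Text mining systems mine text"
bag = bag_of_words(doc)
weights = relative_frequencies(bag)

# Keep only the highest-weighted term, in the spirit of keyword extraction.
top_term = max(weights, key=weights.get)
print(bag)
print(top_term, weights[top_term])
```

Word order is discarded in this representation, so any downstream algorithm sees only which terms occur and how strongly each is weighted.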

Item Type: Thesis (PhD)
Date Type: Completion
Status: Unpublished
Schools: Mathematics
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Date of First Compliant Deposit: 30 March 2016
Last Modified: 10 Jun 2022 14:55
URI: https://orca.cardiff.ac.uk/id/eprint/86418
