Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 
WelshClear Cookie - decide language by browser settings

Corpus-consulting probabilistic approach to parsing: the CCPX parser and its complementary components

Day, Michael David 2007. Corpus-consulting probabilistic approach to parsing: the CCPX parser and its complementary components. PhD Thesis, Cardiff University.

[img] PDF - Accepted Post-Print Version
Download (29MB)

Abstract

Corpus linguistics is now a major field in the study of language. In recent years corpora that are syntactically analysed have become available to researchers, and these clearly have great potential for use in the field of parsing natural language. This thesis describes a project that exploits this possibility. It makes four distinct contributions to these two fields. The first is an updated version of a corpus that is (a) analysed in terms of the rich syntax of Systemic Functional Grammar (SFG), and (b) annotated using the extensible Mark-up Language (XML). The second contribution is a native XML corpus database, and the third is a sophisticated corpus query tool for accessing it. The fourth contribution is a new type of parser that is both corpus-consulting and probabilistic. It draws its knowledge of syntactic probabilities from the corpus database, and it stores its working data within the database, so that it is strongly database-oriented. SFG has been widely used in natural language generation for approaching two decades, but it has been used far less frequently in parsing (the first stage in natural language understanding). Previous SFG corpus-based parsers have utilised traditional parsing algorithms, but they have experienced problems of efficiency and coverage, due to (a) the richness of the syntax and (b) the challenge of parsing unrestricted spoken and written texts. The present research overcomes these problems by introducing a new type of parsing algorithm that is 'semi-deterministic' (as human readers are), and utilises its knowledge of the rules—including probabilities—of English syntax. A language, however, is constantly evolving. New words and uses are added, while others become less frequent and drop out altogether. The new parsing system seeks to replicate this. As new sentences are parsed they are added to the corpus, and this slowly changes the frequencies of the words and the syntactic patterns. The corpus is in this sense dynamic, and so simulates a human's changing knowledge of words and syntax.

Item Type: Thesis (PhD)
Status: Unpublished
Schools: Computer Science & Informatics
Subjects: Q Science > QA Mathematics > QA76 Computer software
ISBN: 9781303213052
Date of First Compliant Deposit: 30 March 2016
Last Modified: 12 Feb 2016 23:12
URI: http://orca-mwe.cf.ac.uk/id/eprint/54704

Citation Data

Cited 1 time in Google Scholar. View in Google Scholar

Actions (repository staff only)

Edit Item Edit Item

Downloads

Downloads per month over past year

View more statistics