Designing efficient algorithms for querying large corpora


  • Paul Meurer University of Bergen



I describe several new efficient algorithms for querying large annotated corpora. The search algorithms as they are implemented in several popular corpus search engines are less than optimal in two respects: regular expression string matching in the lexicon is done in linear time, and regular expressions over corpus positions are evaluated starting in those corpus positions that match the constraints of the initial edges of the corresponding network. To address these shortcomings, I have developed an algorithm for regular expression matching on suffix arrays that allows fast lexicon lookup, and a technique for running finite state automata from edges with lowest corpus counts. The implementation of the lexicon as suffix array also lends itself to an elegant and efficient treatment of multi-valued and set-valued attributes. The described techniques have been implemented in a fully functional corpus management system and are also used in a treebank query system.



opyright (c) 2014-2020 Simon Fraser University * Copyright (c) 2003-2020 John Willinsky * Distributed under the GNU GPL v3. For full terms see the file docs/COPYING. * * @brief Common site frontend footer. * * @uses $isFullWidth bool Should this page be displayed without sidebars? This * represents a page-level override, and doesn't indicate whether or not * sidebars have been configured for thesite. *}