Expressing queries as finite state automata


Spellchecking: The Lucene spellchecker builds a separate index to find correction candidates. Perhaps our fuzzy enumeration is now fast enough for small edit distances (e.g.

Scoring that doesn't need stopwords: for now, use CommonGrams!

Language-specific resources: lucene-hunspell could provide these. Language-independent tokenization: Unicode rules go a long way.

TODO: Support expansion models, too, in Lucene.

Experimental results: 125k docs, English test collection. Results are for TD queries. Inverted the S-Stemmer: 6 declarative rewrite rules to regex. Competitive with traditional stemming.

Automata expansion: a natural fit for morphology. Use set intersection operators: Minus to subtract the exact-match case, Union to search multiple languages. Efficient operation: doesn't explode for languages with complex morphology.

Expansion instead: Don't remove data at index time. Expand the query instead. A single field now works well for all queries: exact match, wildcard, expanded, etc. Simplifies search configuration. Tuning relevance is easier, with no re-indexing. No need to worry about language ID for docs.

Stemming: Stemmers work at index and query time (walked, walking -> walk). Can increase retrieval effectiveness. Some problems: mistakes (international -> intern); must determine the language of documents; multilingual cases can get messy; tuning is difficult (must re-index); unfriendly (wildcards on stemmed terms).

Regex, Wildcard, Fuzzy: Without a constant prefix, exhaustive. Regex: (http|ftp)://foo.com. Wildcard: ?oo?ar. Fuzzy: foobar~. Re-implemented as automata queries: just parsers that produce a DFA. Improved performance and scalability: (http|ftp)://foo.com examines 2 terms.

Query API improvements: Automata might need to do many seeks around the term dictionary. Depends on what is in the term dictionary. Depends on the state machine structure. MultiTermQuery API further improved: easier and more efficient to skip around.

The intersection of the two emits search results.
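The "expand the query instead of stemming" idea from the language-support slides can be illustrated with a small sketch. This is not the inverted S-Stemmer from the talk and it does not use Lucene: `expand_plural` is a hypothetical toy rule for English-style plurals, and plain Python sets of matching terms stand in for the automaton Minus and Union operators. A real implementation would presumably combine the automata first and only then touch the index.

```python
import re

def expand_plural(term):
    """Hypothetical toy expansion: a regex for `term` plus simple plural forms."""
    if term.endswith("y"):
        # pony -> pony | ponies
        return re.compile(re.escape(term[:-1]) + r"(y|ies)")
    # walk -> walk | walks | walkes (over-generation is harmless for a sketch)
    return re.compile(re.escape(term) + r"(s|es)?")

def search(term_dictionary, pattern):
    """Return every indexed term that the whole pattern accepts."""
    return {t for t in term_dictionary if pattern.fullmatch(t)}

# The index keeps the original surface forms; nothing was stemmed away.
terms = {"pony", "ponies", "ponytail", "walk", "walks", "walked",
         "caballo", "caballos"}

expanded = search(terms, expand_plural("pony"))
# Minus: subtract the exact-match case, leaving only the expanded variants.
variants_only = expanded - {"pony"}
# Union: search expansions for two languages against the same field.
multilingual = (search(terms, expand_plural("walk"))
                | search(terms, expand_plural("caballo")))
```

Because the expansion happens at query time, changing the rule in `expand_plural` changes the results immediately, with no re-indexing.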
Another way to think of it: the index is a state machine that recognizes Terms and transduces matching Documents. AutomatonQuery represents a user's search need as an FSM.

Automaton Queries: Only explore subtrees that can lead to an accept state of some finite state machine. AutomatonQuery traverses the term dictionary and the state machine in parallel.
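The parallel traversal described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Lucene's actual enumeration: the term dictionary is a sorted Python list, the DFA is hand-built as a trie that accepts a fixed set of strings (standing in for a compiled regex such as (http|ftp)://foo.com), and `build_trie_dfa` and `intersect` are hypothetical names. When the DFA dies on some prefix of a term, the sketch seeks past every term that shares that prefix instead of testing them one by one.

```python
from bisect import bisect_left

def build_trie_dfa(words):
    """Build a DFA (as a trie) that accepts exactly the given words."""
    transitions = [{}]          # state id -> {character: next state id}
    accepting = set()
    for word in words:
        state = 0
        for ch in word:
            if ch not in transitions[state]:
                transitions.append({})
                transitions[state][ch] = len(transitions) - 1
            state = transitions[state][ch]
        accepting.add(state)
    return transitions, accepting

def intersect(sorted_terms, dfa):
    """Walk the sorted term list and the DFA together.

    Returns (matching terms, number of terms actually examined).
    """
    transitions, accepting = dfa
    matches, examined = [], 0
    i = 0
    while i < len(sorted_terms):
        term = sorted_terms[i]
        examined += 1
        state, dead_at = 0, None
        for pos, ch in enumerate(term):
            nxt = transitions[state].get(ch)
            if nxt is None:
                dead_at = pos
                break
            state = nxt
        if dead_at is None:
            if state in accepting:
                matches.append(term)
            i += 1
        else:
            # No term starting with this prefix can reach an accept state,
            # so seek to the first term that sorts after the whole subtree.
            # (Assumes the dead character is not the maximum code point.)
            prefix = term[:dead_at + 1]
            upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
            i = bisect_left(sorted_terms, upper, i + 1)
    return matches, examined

terms = sorted(["aardvark", "apple", "apricot", "banana", "ftp://foo.com",
                "http://foo.com", "https://foo.com", "zebra", "zoo"])
dfa = build_trie_dfa(["http://foo.com", "ftp://foo.com"])
matches, examined = intersect(terms, dfa)
```

In this toy dictionary the walk skips the remaining terms under "a" and "z" after the first rejection in each group, so fewer terms are examined than the dictionary contains. How much is saved depends on what is in the term dictionary and on the structure of the state machine.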

Lucene 2.9: Fast Numeric Ranges. Indexes at different levels of precision. But typically this is a small number: e.g.

Inverted Index: Like a TreeMap. Before 2.9, queries only operate on one subtree. For example, Regex and Fuzzy exhaustively evaluate all terms, unless you give them a constant prefix. TermQuery is just a special case: it looks at one leaf node.

Commercial support via several companies. Just the library: embed for your own uses, e.g. Eclipse. For a search server, see Solr. For web search + crawler, see Nutch. Website:
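The TreeMap picture of the inverted index given above can be made concrete with a sorted list of terms. The sketch below is an assumption-laden toy, not Lucene code: `postings`, `term_query`, `prefix_range`, and `regex_query` are hypothetical names, and Python's `re` stands in for the pre-automaton regex matching. It shows why a constant prefix matters: with one, only a contiguous slice of the sorted terms is scanned; without one, every term is evaluated.

```python
from bisect import bisect_left
import re

# Hypothetical toy index: term -> list of document ids.
postings = {"bar": [2], "foo": [1, 3], "foobar": [3],
            "football": [4], "zoo": [5]}
sorted_terms = sorted(postings)

def term_query(term):
    """A term query looks at exactly one entry (one leaf)."""
    return postings.get(term, [])

def prefix_range(prefix):
    """Return the contiguous slice of terms that start with `prefix`.

    Assumes no term contains the maximum code point U+10FFFF.
    """
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, prefix + "\U0010ffff")
    return sorted_terms[lo:hi]

def regex_query(pattern, constant_prefix=""):
    """Scan only the subtree under `constant_prefix` (all terms if empty).

    Returns (matching terms, number of terms evaluated).
    """
    candidates = prefix_range(constant_prefix)
    compiled = re.compile(pattern)
    hits = [t for t in candidates if compiled.fullmatch(t)]
    return hits, len(candidates)
```

For example, `regex_query(r"foo.*ar")` evaluates all five terms, while `regex_query(r"foo.*ar", "foo")` evaluates only the three terms under "foo"; both return the same match.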

Introduction to Lucene: Open Source Search Engine Library. Not just Java, ported to other languages too.

Finite-State Queries in Lucene
Robert Muir

Agenda: Introduction to Lucene. Improving inexact matching: Background; Regular Expression, Wildcard, Fuzzy Queries. Additional use cases: Language support (expansion versus stemming); Improved spellchecking. Other ongoing developments in Lucene.

Robert received his MS in Computer Science from Johns Hopkins and BS in CS from Radford University. For the last few years Robert has been working on foreign language NLP problems - "I really enjoy working with Lucene, as it's always receptive to better int'l/language support, even though everyone seems to be a performance freak."


Finite-State Queries in Lucene:
* Background, improvement/evolution of MultiTermQuery API in 2.9 and Flex
* Implementing existing Lucene queries with NFA/DFA for better performance: Wildcard, Regex, Fuzzy
* How you can use this Query programmatically to improve relevance (I'll use an English test collection/English examples)

Quick overview of other Lucene features in development, such as:
* Flexible Indexing
* "More-Flexible" Scoring: challenges/supporting BM25, more vector-space models, field-specific scoring, etc.
* Improvements to analysis

Bonus:
* Lucene / Solr merger explanation and future plans

About the presenter: Robert Muir is a super-active Lucene developer. He works as a software developer for Abraxas Corporation.