What is Information Retrieval?
According to Wikipedia,
Information retrieval (IR) is the process of obtaining information system resources that are relevant to an information need from a collection of those resources.
In simple words, Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). ~ Manning, Raghavan, and Schütze, 2008
Information retrieval is about finding something that is already part of your data, as fast as possible. Machine learning involves techniques to generalize existing knowledge to new data, as accurately as possible. Data mining is primarily about discovering something hidden in your data that you did not know before, as “new” as possible.
Related but different courses:
- Database management. Common: data modeling, query processing, efficiency. Different: unstructured data, subjective evaluation, ranked retrieval.
- NLP. Common: language modeling, indexing. Different: user modeling, ranked retrieval.
- Machine learning. Common: text classification, text mining. Different: user satisfaction, limited training.
Ok enough. Let us try to understand it in a structured format.
Overview
The following topics are normally discussed in any IR class:
- Retrieval framework
- Models: Boolean, vector space, probabilistic, language modelling
- Systems: search engine framework, inverted indexes — construction and storage, compression, algorithms
- Ecosystem: TREC and other commonly used collections
- Evaluation: metrics for evaluation and experiment protocols
- Advanced retrieval models: phrase queries and proximity-based models, diversification, passage retrieval
- Web search engines: crawling, query logs, click-through
- Learning to Rank: methods, embeddings and semantic matching by query expansion, neural information retrieval
- Knowledge graphs: entity-centric search enhanced using knowledge graphs
- Responsible IR: fake-news, privacy, bias and fairness
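Two of the topics above, the Boolean model and the inverted index, can be sketched in a few lines of Python. This is only a toy illustration, not how a production search engine is built: the function names `build_inverted_index` and `boolean_and`, the three sample documents, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive whitespace tokenization (an assumption; real systems
        # also normalize, stem, and handle punctuation).
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def boolean_and(index, query):
    """Return IDs of documents that contain every query term (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical three-document collection.
docs = {
    1: "information retrieval finds relevant documents",
    2: "machine learning generalizes to new data",
    3: "retrieval models rank documents",
}
index = build_inverted_index(docs)
print(boolean_and(index, "retrieval documents"))  # [1, 3]
```

The index is built once and queries only touch the postings of their own terms, which is why this structure sits at the core of most search engine frameworks. Note that the Boolean model returns an unordered match set; the other models in the list exist to rank results.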
Basic Terminology
IR can be studied from several views: relevance-centric, user-centric, entity-centric, time-centric, fairness-centric, performance-centric, and expert-centric.
Term: A semantic unit, such as a word, a phrase, or potentially the root of a word
Document: A sequence or set of terms, expressing ideas about one or more topics in natural language
Query: A request for documents pertaining to a topic; the user's expression of an information need
Information Need: The information or knowledge that the user is currently looking for, which exists in the user's mind before it is expressed as a query
Collection: A set of documents
Information Retrieval System: An automatic system that attempts to find relevant documents for a given query
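To make these definitions concrete, here is a minimal sketch of an information retrieval system: it takes a collection and a query, breaks both into terms, and returns document IDs ranked by a simple TF-IDF score. The weighting shown (raw term frequency times log(N/df)) is just one common variant, and the names `tokenize`, `rank`, and the sample collection are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Split text into lowercase terms (naive whitespace tokenization)."""
    return text.lower().split()


def rank(collection, query):
    """Score documents by the summed TF-IDF of the query terms, best first."""
    n_docs = len(collection)
    doc_terms = {doc_id: Counter(tokenize(text))
                 for doc_id, text in collection.items()}
    # Document frequency: number of documents that contain each term.
    df = Counter(term for counts in doc_terms.values() for term in counts)
    scores = {}
    for doc_id, counts in doc_terms.items():
        score = sum(
            counts[t] * math.log(n_docs / df[t])
            for t in tokenize(query)
            if t in counts
        )
        if score > 0:  # keep only documents that matched something informative
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical collection: document ID -> document text.
collection = {
    "d1": "cheap flights to paris",
    "d2": "paris travel guide and paris hotels",
    "d3": "learning to rank for search",
}
print(rank(collection, "paris hotels"))  # ['d2', 'd1']
```

Here the query "paris hotels" stands in for the user's information need, and `d2` ranks first because it mentions "paris" twice and is the only document containing the rarer term "hotels".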
Recommended books
- Introduction to Information Retrieval (Manning, Raghavan, and Schütze, 2008)
- Modern Information Retrieval: The Concepts And Technology Behind Search (Baeza-Yates and Ribeiro-Neto, 2010)
- Search Engines: Information Retrieval in Practice (Croft, Metzler and Strohman, 2009)
- Information Retrieval: Implementing and Evaluating Search Engines (Büttcher, Clarke and Cormack, 2010)