Summary and study resources for beginners: Information Retrieval and Web Search

Pseudo
3 min read · Jun 22, 2021

About Information Retrieval and Web Search

What is Information Retrieval?

According to Wikipedia,

Information retrieval (IR) is the process of obtaining information system resources that are relevant to an information need from a collection of those resources.

In simpler words, information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). ~ Manning, Raghavan, and Schütze, 2008

Information retrieval is about finding something that is already part of your data, as fast as possible. Machine learning involves techniques that generalize existing knowledge to new data, as accurately as possible. Data mining is primarily about discovering something hidden in your data that you did not know before, as “new” as possible.

Related but different courses

• Database management
  • Common: Data modeling, query processing, efficiency
  • Different: Unstructured data, subjective evaluation, ranked retrieval
• NLP
  • Common: Language modeling, indexing
  • Different: User modeling, ranked retrieval
• Machine Learning
  • Common: Text classification, text mining
  • Different: User satisfaction, limited training

Ok enough. Let us try to understand it in a structured format.

Overview

The following topics are normally discussed in any IR class:

  1. Retrieval framework
  2. Models: Boolean, vector space, probabilistic, language modelling
  3. Systems: search engine framework, inverted indexes — construction and storage, compression, algorithms
  4. Ecosystem: TREC and other commonly used collections
  5. Evaluation: metrics for evaluation and experiment protocols
  6. Advanced retrieval models: phrase queries and proximity-based models, diversification, passage retrieval
  7. Web search engines: crawling, query logs, click-through
  8. Learning to Rank: methods, embeddings and semantic matching by query expansion, neural information retrieval
  9. Knowledge graphs: entity-centric search enhanced using knowledge graphs
  10. Responsible IR: fake-news, privacy, bias and fairness
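
To make topics 2 and 3 a little more concrete, here is a minimal sketch of an inverted index with a Boolean AND query on top of it. The function names (`tokenize`, `build_inverted_index`, `boolean_and`) and the toy collection are made up for illustration, and the whitespace tokenizer is a simplifying assumption; real search engines add stemming, stop-word handling, compression and much more.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberate simplification)."""
    return text.lower().split()


def build_inverted_index(collection):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in collection.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def boolean_and(index, query):
    """Return ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical toy collection, invented for this example.
collection = {
    1: "web search engines crawl the web",
    2: "inverted indexes make search fast",
    3: "crawling and indexing the web",
}

index = build_inverted_index(collection)
print(boolean_and(index, "web search"))  # [1]
print(boolean_and(index, "search"))      # [1, 2]
```

The Boolean model only says whether a document matches or not. The other models in topic 2 (vector space, probabilistic, language modelling) go further and assign each document a score so that results can be ranked.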

Basic Terminology

• Relevance-centric view
• User-centric view
• Entity-centric view
• Time-centric view
• Fairness-centric view
• Performance-centric view
• Expert-centric view

Term — A semantic unit: a word, a phrase, or potentially the root of a word

Document — A sequence/set of terms, expressing ideas about one or more topics in natural language

Query — A request for documents pertaining to a topic. Also, the expression of information needs by the user.

Information Need — An innate idea of information/knowledge that the user is currently looking for

Collection — A set of documents

Information Retrieval System — An automatic system that attempts to find relevant documents for a given query
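
The terms above fit together as in the following sketch: a collection is a dictionary of documents, a query is a short string of terms, and the "system" is one function that returns documents ranked by a simple TF-IDF score. This is only one possible scoring scheme, chosen here for brevity; the `rank` function, the document ids and the texts are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Split a document or query into terms (here: lowercase words)."""
    return text.lower().split()


def rank(collection, query):
    """Return (doc_id, score) pairs for matching documents, best first."""
    # Term frequencies for every document in the collection.
    doc_terms = {doc_id: Counter(tokenize(text)) for doc_id, text in collection.items()}
    n_docs = len(collection)
    query_terms = set(tokenize(query))

    scores = {}
    for doc_id, counts in doc_terms.items():
        score = 0.0
        for term in query_terms:
            # Document frequency: how many documents contain the term.
            df = sum(1 for c in doc_terms.values() if term in c)
            if df:
                score += counts[term] * math.log(n_docs / df)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical collection: a set of documents, each a sequence of terms.
collection = {
    "d1": "information retrieval finds relevant documents",
    "d2": "databases store structured data",
    "d3": "retrieval of documents from large collections of documents",
}

# The query expresses the user's information need.
results = rank(collection, "retrieval documents")
print([doc_id for doc_id, _ in results])  # ['d3', 'd1']
```

Here `d3` comes first because it mentions "documents" twice, and `d2` is left out because it shares no term with the query. Whether that ranking really satisfies the user's information need is what evaluation is meant to measure.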

Recommended books

  1. Introduction to Information Retrieval (Manning, Raghavan, and Schütze, 2008)
  2. Modern Information Retrieval: The Concepts And Technology Behind Search (Baeza-Yates and Ribeiro-Neto, 2010)
  3. Search Engines: Information Retrieval in Practice (Croft, Metzler and Strohman, 2009)
  4. Information Retrieval: Implementing and Evaluating Search Engines (Büttcher, Clarke and Cormack, 2010)


Hey all! I’m here to share experiences and the best of my learnings with you. Drop a mail at spseudo001@gmail.com