Tags for optimal information retrieval - part 1: motivation

This is part one of a two-article series. I recommend reading them in order:


In information management systems, tags are one of several ways to arrange and filter items. The goal of using tags is to help users navigate and retrieve information efficiently. Here are two examples of real-world datasets with tags:

  • Persons and skills. The items are persons and the tags are their skills. Tag \(i\) is applied to person \(j\) if the person has the skill. The tag “data science” might apply to me.
  • Movies and keywords. The items are movies and the tags are keywords that describe the movies. Tag \(i\) is applied to movie \(j\) if the keyword is relevant for the movie. As an example, the tag “superhero” could be applied to the movie “Ant-Man.”

The figure below shows an example of a tag structure. We visualize tags in rows and items in columns. If you had to pick two out of the four tags to describe the items, which two would you choose to help your users navigate the items? The answer is not obvious.

The plan for this article is to discuss what a descriptive set of tags looks like and create an objective function that can be used to mathematically evaluate and choose descriptive tags. We’ll also apply the proposed method to the two real-world datasets introduced above and study the results.

What does a descriptive set of tags look like?

Let’s investigate which properties a good tag structure might have. It’s always a good idea to fetch a pen and paper and play with a few simple ideas when attacking problems like these. Here are my own initial thoughts about the problem.

No tag should cover very few or very many items. A tag that covers all or almost all items is hardly descriptive. For instance, a movie tag “color-tv” does not appear worthwhile, since almost all films are shot in color. Conversely, extremely specific tags are also quite worthless, for instance a tag like “romantic horror filmed outdoors.”

Equal group size. Tags group the items, and in a descriptive tag structure each group should have roughly the same size. If we had to choose between the two sets of tags below, we would prefer the more balanced set of tags on the right. Notice that equal group size and equal tag size are not identical concepts—two tags could each apply to the same number of items but also overlap on some items, inducing a third group of a different size.

Balanced coverage. We define the coverage of an item as the number of tags that are applied to it. The figure below shows full one-coverage, two-coverage and three-coverage. In general some items might have large coverage while others have little or no coverage. We might wish for balanced coverage—there is no good reason for some items to have no tags while others have very many tags.

Binary trees are in a sense optimal. To compress as much information as possible into a set of tags, a binary tree structure is optimal. Using a binary tree, \(m\) tags induce \(2^m\) distinct groups onto the items. However, the groups induced by binary trees have very unbalanced coverage. Some items are covered by zero tags, while others will be covered by all \(m\) tags.

Binary trees are optimal if the information management system shows exactly the set of items corresponding to the tags applied and not applied. But information systems rarely do—if you go to a website with tags and you apply no tags, you’ll see every item, not the items with no tags. For our purposes, binary trees are not the answer—their optimality does not correspond to how tags are typically used in information systems.
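The coverage imbalance of the binary-tree structure is easy to see concretely. The sketch below is my own illustration: with \(m = 3\) tags, item \(j\) receives the binary representation of \(j\) as its tag vector, inducing \(2^m\) distinct groups.

```python
import numpy as np

m = 3     # number of tags
n = 2**m  # a binary tree induces 2**m distinct groups
# Tag i is applied to item j iff bit i of j is set
T = np.array([[(j >> i) & 1 for j in range(n)] for i in range(m)])

coverage = T.sum(axis=0)  # number of tags applied to each item
print(coverage.min(), coverage.max())  # 0 3: item 0 has no tags, item 7 has all three
```

The coverage ranges from zero tags to all \(m\) tags, which is exactly the imbalance described above.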

The objective function: minimize number of items viewed

It is tempting to start working with the properties above directly. We could search for a set of tags that (1) induces groups of roughly equal size and (2) has roughly equal coverage of all items. But going down this path would be a modeling mistake.

Instead of trying to satisfy several ad-hoc properties at once in a messy multi-objective problem, we should pause and think—what is it that we’re trying to accomplish here, really?

The goal of using tags is to help users retrieve information as efficiently as possible. Let’s pursue this thought and create a unified objective function. To do so, we must first understand how tags are used in more detail.

Tags and filters

Go to a few websites that use tags and look for an item: an apartment, a movie, or a new dishwasher. You’ll quickly realize that tags are used differently depending on the website and context:

  • Tags as one-of-many filters over mutually exclusive outcomes (“cat” or “dog”)
  • Tags as OR-filters (selecting tags “a” and “b” shows the union “a OR b”)
  • Tags as AND-filters (selecting tags “a” and “b” shows the intersection “a AND b”)
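The difference between OR- and AND-filters is easy to demonstrate in code. Below is a sketch with a hypothetical two-tag matrix (rows are tags, columns are items; the data is made up for illustration):

```python
import numpy as np

# Hypothetical tag matrix: row 0 is tag "a", row 1 is tag "b"
T = np.array([[1, 1, 0, 0, 0, 1],
              [1, 0, 1, 1, 1, 1]])

selected = [0, 1]  # the user activates both tags
or_mask = np.any(T[selected], axis=0)   # union: "a OR b"
and_mask = np.all(T[selected], axis=0)  # intersection: "a AND b"
print(or_mask.sum(), and_mask.sum())  # 6 2
```

Activating more tags enlarges the OR result set and shrinks the AND result set.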

The UC Irvine Machine Learning Repository uses OR-filters for tags related to the “data types” filter on their datasets. There are 665 datasets in total, and if we activate the “Tabular” tag we filter down to 48 datasets. If we add the “Text” tag we increase the search result to 121 datasets. Adding more tags enlarges the result set by widening the selection.

The Norwegian marketplace FINN.no uses AND-filters for real estate properties. When we activate the “Elevator” tag and the “Fireplace” tag, we are presented with housing that has both an elevator and a fireplace. Adding more tags narrows the selection and decreases the size of the result set.

The objective function: We assume that we’re using tags with AND-filters, and that we’re looking for a randomly selected, but specific item. The goal is to minimize the number of items viewed on average before we find the item we’re looking for.

A view means considering an item, like going through a deck of cards one by one and asking “is this my card?” Each item considered this way counts as a view, and each view costs the searcher time.

Example computation

The equation for the objective function, which we will present shortly, is a bit opaque. To make the equation more understandable we introduce it with an example.

Consider the following example with \(m=3\) tags and \(n=6\) items. The tag structure is described by a binary matrix, with entry \(1\) if tag \(i\) is applied to item \(j\). Zeros are omitted for readability.

\begin{align*}  \begin{bmatrix} 1 & 1 &  &  &  & 1 \\ 1 &  & 1 & 1 & 1 & 1 \\  &  &  &  & 1 & 1 \\ \end{bmatrix} \end{align*}

Suppose we’re looking for the first item with tags \([1, 1, 0]\). We know exactly what we’re looking for, so we apply tags one and two to our AND-filter. The result of filtering all items on tags one and two is the result set \(\{[1, 1, 0], [1, 1, 1]\}\). It is coincidental that we happen to be looking for the first item in this set. We want the objective to be invariant to permutations of the columns, and we assume that we have no control of the order of the filtered results. If we’re looking for the first item we’ll view one item, which is the item we wanted to find. If we’re looking for the second item we’ll view two items before we find what we’re looking for. On average we view \((1 + 2)/2 = 3/2\) items.

Now suppose we’re looking for the second item, with tags \([1, 0, 0]\). Again we know exactly what we’re looking for, so we apply the first tag to filter the items. The result set is \(\{[1, 1, 0], [1, 0, 0], [1, 1, 1]\}\), with three items. The ordering is arbitrary and on average we view \((1+2+3)/3 = 2\) items before finding what we’re looking for.

We continue this way for every item. If we’re looking for the third item we’ll view \(3\) items on average before we find it, \(3\) for the fourth, \(3/2\) for the fifth and \(1\) for the sixth.

Averaging over all the six items which we might be looking for, we’ll view

\begin{align*} \frac{3/2 + 2 + 3 + 3 + 3/2 + 1}{6} = 2 \end{align*}

items on average before we find a specific item, if the item we’re looking for was drawn uniformly at random.

Equation and code

We now formalize the logic above into an equation and into code. Consider a binary matrix \(T \in \{0,1\}^{m \times n}\) consisting of \(m\) tags and \(n\) items. An equation that implements the computation above is

\begin{align*} f(T) = \frac{1}{2} + \frac{1}{2n} \sum_{j=1}^{n} \sum_{k=1}^{n} \prod_{i=1}^{m} \left[ T_{ik} \geq T_{ij} \right]. \end{align*}

The square bracket is the Iverson bracket, which evaluates to \(1\) if the argument inside it is true and \(0\) otherwise.

Let’s untangle the equation a little. Given items \(j\) and \(k\), the expression \(\prod_{i=1}^{m} \left[ T_{ik} \geq T_{ij} \right]\) equals \(1\) if applying filters for tags associated with item \(j\) retains item \(k\). If item \(k\) is excluded by the filter, then the expression becomes zero. We can think of the product as the AND function applied to a set of logical implications computed using the “greater than or equal to” function. The total number of items in the result set when we apply tags corresponding to item \(j\) is therefore \(\sum_{k=1}^{n} \prod_{i=1}^{m} \left[ T_{ik} \geq T_{ij} \right]\).
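As a quick sketch, the inner product-sum can be computed for a single item with NumPy, using the example matrix from the previous section:

```python
import numpy as np

T = np.array([[1, 1, 0, 0, 0, 1],
              [1, 0, 1, 1, 1, 1],
              [0, 0, 0, 0, 1, 1]])

j = 1  # the second item, with tags [1, 0, 0]
# prod_i [T[i, k] >= T[i, j]] for every item k, then sum over k
result_set_size = np.prod(T >= T[:, [j]], axis=0).sum()
print(result_set_size)  # 3, matching the example computation above
```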

Assume that applying filters for item \(j\) leads to a result set with \(\ell\) items. On average we then have to view

\begin{align*} \frac{1 + 2 + 3 + \cdots + \ell}{\ell} = \frac{\ell (\ell + 1)/2}{\ell} = \frac{\ell + 1}{2} \end{align*}

items before finding the specific item we’re looking for.

Combining these observations and re-arranging a little leads to the objective function \(f\) above. The idea is perhaps more easily understood through code. Below is a Python snippet that implements the objective function.

import numpy as np


def items_viewed(T):
    """Assume we're looking for a random item and we apply
    the correct AND-filters. How many items must we look
    at on average before finding it?"""

    views = []  # Average items viewed before finding each item

    # Loop over each item (column) in T
    for j in range(T.shape[1]):
        # Applying the AND-filter, this is the number of items found
        num_found = np.all(T >= T[:, [j]], axis=0).sum()

        # On average we look at (1 + 2 + ... + num_found) / num_found
        # items, and (1 + 2 + ... + k) / k = (k + 1)/2
        views.append((num_found + 1) / 2)

    return np.mean(views)  # Average over every item


# An example where T has shape (tags, items)
T = np.array([[1, 1, 0, 0, 0, 1], 
              [1, 0, 1, 1, 1, 1], 
              [0, 0, 0, 0, 1, 1]])
items_viewed(T)  # 2.0

The function runs in \(\mathcal{O}(mn^2)\) time.
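As a side note, the double loop can also be vectorized. The sketch below is an alternative formulation, not the article’s code: item \(k\) survives item \(j\)’s filter exactly when item \(j\) has no tag that item \(k\) lacks.

```python
import numpy as np

def items_viewed_fast(T):
    """Vectorized equivalent of items_viewed: still O(mn^2) work,
    but without the Python-level loop over items."""
    n = T.shape[1]
    # violations[j, k] = number of tags item j has that item k lacks
    violations = T.T @ (1 - T)
    sizes = (violations == 0).sum(axis=1)  # result-set size per item j
    return 0.5 + sizes.sum() / (2 * n)

T = np.array([[1, 1, 0, 0, 0, 1],
              [1, 0, 1, 1, 1, 1],
              [0, 0, 0, 0, 1, 1]])
print(items_viewed_fast(T))  # 2.0
```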

A penalty for the number of tags

The objective function \(f\) never penalizes adding more tags to the system (rows to the matrix \(T\)). A thousand tags for a thousand items is perfectly OK. We will now introduce a penalty for the number of tags. This makes the model more realistic and helps us choose an optimal subset of tags without having to specify the size of the subset beforehand.

Recall that \(f(T)\) is the expected number of items we have to view before we find the item we’re looking for, averaged over all possible items that we could be looking for in the first place. We can also penalize viewing the tags themselves. After all, in a database we have to view all the \(m\) tags to decide whether or not to apply each tag. Just like viewing items, this represents a cost of time wasted.

The simplest way to modify the objective to account for the number of tags is to penalize viewing a tag roughly equally to viewing an item:

\begin{align*} g(T; \alpha) = f(T) + \alpha m = \frac{1}{2} + \frac{1}{2n} \sum_{j=1}^{n} \sum_{k=1}^{n} \prod_{i=1}^{m} \left[ T_{ik} \geq T_{ij} \right] + \alpha m \end{align*}

The number \(\alpha \geq 0\) is the relative cost of viewing a tag compared to an item. We set \(\alpha=1\) unless we explicitly state otherwise.

While \(f\) can answer questions like “which ten tags out of these \(m\) are the best ones?”, the function \(g\) does not require choosing the number of tags beforehand. Therefore \(g\) can answer “which subset of these \(m\) tags is the best one?”

To get a feeling for the objective function, here are a few matrices \(T\) and the objectives \(f(T)\).

With \(n=12\) items and no tags, the objective is \(1/2 + n/2 = 1/2 + 12/2 = 6.5\). If one tag is available, the optimal structure is to cover half the items with it. If two tags are available, the optimal structure is for each tag to cover half the items with no overlap. If three tags are available, there are several minima. One of them is a structure where each tag covers a third of the items with no overlap. Another allows symmetric pairwise overlaps between the tags, but no overlap between all three.
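The single-tag claim can be checked numerically. Assuming \(n = 12\) items as in the computation above, a tag covering \(k\) items gives those \(k\) items result sets of size \(k\), while the untagged items see all \(n\) items:

```python
import numpy as np

n = 12

def f_one_tag(k):
    """Objective f for a single tag covering k of n items (a sketch)."""
    sizes = np.array([k] * k + [n] * (n - k))
    return np.mean((sizes + 1) / 2)

values = [f_one_tag(k) for k in range(n + 1)]
print(f_one_tag(0), int(np.argmin(values)))  # 6.5 6: the best tag covers half the items
```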

We discuss theoretical properties of the optimal tag structure in part two of this article series.

Computational solution: finding an optimal subset

Given a dataset as a matrix \(T \in \{0,1\}^{m \times n}\), we want to find a subset of the \(m\) tags that minimizes \(g\). Typically \(m\) is large and we wish to eliminate tags that are redundant.

We encode the solution as a binary vector \(\mathbf{x} \in \{0,1\}^{m}\). If \(x_i = 1\), then tag \(i\) is chosen. In Python we use \(\mathbf{x}\) to subset the rows of \(T\), and penalize the number of tags chosen \(\sum_i x_i\).

def evaluate(x, alpha=1):
    """Given a binary vector x, evaluate the objective g."""
    # Cast x to a boolean mask so it selects rows of the global
    # matrix T, rather than acting as integer indices
    return items_viewed(T[x.astype(bool), :]) + alpha * np.sum(x)

Finding the optimal \(\mathbf{x}\) is a combinatorial optimization problem with a large solution space. There are \(2^m\) possible solutions, but even in such unfathomably large search spaces we can often find high-quality solutions. A number of heuristic algorithms can be used: greedy algorithms, simulated annealing, tabu search, genetic algorithms, variable neighborhood search, and so forth.

Greedy forward selection

Greedy forward selection starts with an all-zero solution vector \(\mathbf{x} = \mathbf{0}\). The first iteration considers every tag, and picks the one that leads to the lowest objective. Subsequent iterations consider every tag not already chosen, and after each iteration one new tag is added to the solution. The algorithm continues to add tags in this manner until the objective no longer improves.
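A minimal sketch of the procedure, with `items_viewed` repeated so the snippet is self-contained. The stopping rule (stop as soon as no single addition improves the objective) is my assumption:

```python
import numpy as np

def items_viewed(T):
    """Objective f: average number of items viewed."""
    views = []
    for j in range(T.shape[1]):
        num_found = np.all(T >= T[:, [j]], axis=0).sum()
        views.append((num_found + 1) / 2)
    return np.mean(views)

def greedy_forward(T, alpha=1.0):
    """Greedy forward selection minimizing g(T) = f(T) + alpha * m.
    Returns the indices of the chosen tags (rows of T)."""
    m = T.shape[0]
    chosen = np.zeros(m, dtype=bool)
    best = items_viewed(T[chosen, :])  # objective with no tags: (n + 1) / 2
    while True:
        candidates = [i for i in range(m) if not chosen[i]]
        if not candidates:
            break
        scores = []
        for i in candidates:
            trial = chosen.copy()
            trial[i] = True
            scores.append(items_viewed(T[trial, :]) + alpha * trial.sum())
        best_idx = int(np.argmin(scores))
        if scores[best_idx] >= best:
            break  # no tag improves the objective: stop
        chosen[candidates[best_idx]] = True
        best = scores[best_idx]
    return np.flatnonzero(chosen)

T = np.array([[1, 1, 0, 0, 0, 1],
              [1, 0, 1, 1, 1, 1],
              [0, 0, 0, 0, 1, 1]])
print(greedy_forward(T, alpha=0.0))  # [0 1 2]: with free tags, all three are chosen
```

On the example matrix with \(\alpha=1\) the algorithm selects no tags at all, since even the best single tag raises \(g\) above the no-tag baseline of \((n+1)/2 = 3.5\).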

Each of the outer iterations requires evaluating \(\mathcal{O}(m)\) candidate tags, and each evaluation of the objective function takes \(\mathcal{O}(mn^2)\) time. The number of outer iterations is at most \(\mathcal{O}(m)\), so the worst case complexity is \(\mathcal{O}(m^3 n^2)\).

Greedy forward selection performs well in practice, but it’s not guaranteed to find an optimal solution to this problem. An alternative approach would be a greedy backward algorithm—starting with \(\mathbf{x} = \mathbf{1}\) and iteratively removing tags.

Simulated annealing

Simulated annealing starts with an all-zero vector \(\mathbf{x} = \mathbf{0}\). In each iteration we choose a random element of \(\mathbf{x}\) and flip it from zero to one or vice versa. If the new candidate solution is better, or if the temperature-dependent acceptance criterion triggers, we keep the candidate solution and try to improve upon it.
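A sketch of this procedure follows. The geometric cooling schedule and Metropolis-style acceptance rule are standard choices I have assumed, not details from the article; `items_viewed` is repeated so the snippet is self-contained.

```python
import numpy as np

def items_viewed(T):
    """Objective f: average number of items viewed."""
    views = []
    for j in range(T.shape[1]):
        num_found = np.all(T >= T[:, [j]], axis=0).sum()
        views.append((num_found + 1) / 2)
    return np.mean(views)

def anneal(T, alpha=1.0, iters=2000, t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing over binary tag-selection vectors (a sketch)."""
    rng = np.random.default_rng(seed)
    m = T.shape[0]
    x = np.zeros(m, dtype=bool)

    def g(x):
        return items_viewed(T[x, :]) + alpha * x.sum()

    current = g(x)
    best_x, best = x.copy(), current
    temp = t0
    for _ in range(iters):
        i = rng.integers(m)
        x[i] = ~x[i]  # flip one random element
        candidate = g(x)
        # Always accept improvements; accept worse solutions with a
        # probability that shrinks as the temperature cools
        if candidate < current or rng.random() < np.exp((current - candidate) / temp):
            current = candidate
            if current < best:
                best_x, best = x.copy(), current
        else:
            x[i] = ~x[i]  # revert the flip
        temp *= cooling
    return best_x, best

T = np.array([[1, 1, 0, 0, 0, 1],
              [1, 0, 1, 1, 1, 1],
              [0, 0, 0, 0, 1, 1]])
x, value = anneal(T, alpha=0.0)
```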

Results on real-world datasets

We now consider a few real-world datasets. For each dataset we run (1) greedy forward selection and (2) simulated annealing with \(\sqrt{n}m\) iterations.

Dataset: consultants and roles

Folq is a Norwegian company that matches independent software consultants with projects. One of the ways consultants describe themselves is by using binary tags for roles. Examples of roles are “Tech Lead”, “Scrummaster” and “Data Scientist.” Most roles are Norwegian words, but many of them are not too far from their English counterparts.

The dataset contains \(m=48\) roles (tags) and \(n=2255\) consultants (items). Are all \(48\) roles needed, or could some of them be removed? Let’s see what the model has to say about the matter.

Greedy. The best objective achieved was \(99.57\), and \(29\) out of \(48\) roles were chosen.

Simulated annealing. The best objective achieved was \(100.43\), and \(30\) out of \(48\) roles were chosen.

The figure below shows how roles were chosen in each iteration of the forward greedy algorithm. A subset of approximately \(30\) roles is all we need to efficiently find consultants in this dataset. Notice the diminishing returns as more and more roles are added; around \(15\) roles already seem to do a perfectly okay job.

It should go without saying that mathematical models must be interpreted with care. Highly relevant tags could exist but not be in this dataset, because no one created them in the first place. Conversely, tags could be chosen by the model but still be nonsensical in the real world (for instance a tag for “blonde hair” for consultants). Models are often helpful, but work best alongside humans who understand the domain at hand.

Below is the full tag structure. The tags on the top were added first by the greedy solution. The items are sorted as binary vectors. Beware that even though the tag “Fullstackutvikler” (full stack developer) was added first by the greedy algorithm, it is not necessarily the most important tag in the context of all other chosen tags.

The following roles were discarded: AI-rådgiver, AI-utvikler, CISO, Content Manager, Endringsleder, Informasjonsarkitekt, Organisasjonsutvikler, Penetrasjonstester, Personvernrådgiver, Produkteier, Produktleder, Prosessleder, Security governance manager, Sikkerhetsanalytiker, Sikkerhetsarkitekt, Sikkerhetsrevisor, Sikkerhetsutvikler, Smidig coach and Virksomhetsarkitekt.  

Dataset: consultants and skills

Folq also allows consultants to describe themselves using skills, such as “SQL”, “Java”, “Agile” and “UI-design.” This dataset contains \(m=291\) skills (tags) and \(n=2255\) consultants (items).

Greedy. The best objective achieved was \(58.02\), and \(38\) out of \(291\) skills were chosen.

Simulated annealing. The best objective achieved was \(57.01\), and \(38\) out of \(291\) skills were chosen.

Simulated annealing found a solution that was slightly better than the greedy solution. The figure below shows which skills were chosen in each iteration of the forward greedy algorithm.

Below is the full tag structure. The tags on the top were added first by the greedy algorithm.

Dataset: movies and keywords

We will now consider a dataset from the GitHub repository IMDB-Movie-Dataset-Analysis. It’s quite large, so we select a random subset of movies. After filtering, the dataset has \(m=3341\) keywords (tags) and \(n=2000\) movies (items).

We set \(\alpha=10\) on this dataset, since very many tags are chosen if \(\alpha=1\). This dataset has a sub-optimal structure with little overlap between tags.

Greedy. The best objective achieved was \(859.39\), and \(17\) out of \(3341\) keywords were chosen.

Simulated annealing. We do not bother with simulated annealing on this dataset, since there is so little overlap between the tags. The greedy algorithm does a much better job in reasonable time.

The objective decreases very slowly as keywords are added. The keywords are not very descriptive, in the sense that they won’t help a user find a movie quickly.

The tag structure reveals why the objective decreases so slowly. A tag structure such as this is sub-optimal—each tag covers a small set of movies, and there is little overlap between the tags.

The same movies are also tagged with genres. There are \(m=20\) movie genres, and \(11\) are selected by our greedy selection if we run it with \(\alpha=10\). The objective function is \(234.12\) if we use genres—much better than the \(859.39\) achieved with keywords. Genres result in a lower objective function value while also using fewer tags.

Summary and references

Tags should help users retrieve information efficiently. We assumed that a user is looking for a particular item and knows which tags to apply. Then we asked: “which tags will, on average, minimize time spent searching for the item?” We formulated this as a mathematical optimization problem.

For concrete datasets we solve the problem using greedy forward selection or simulated annealing. This works reasonably well for small-to-medium sized datasets. It does not scale to large datasets, since evaluating the objective function takes \(\mathcal{O}(mn^2)\) time, where \(m\) is the number of chosen tags and \(n\) is the number of items.

Minimizing the worst case, i.e. the maximum number of views, is also an option, but it is generally harder due to discontinuities in the objective function. An interesting question is whether the objective function can be evaluated more efficiently, either by reusing previous computations or by exploiting the subset structure of the tags. A more in-depth analysis of optimal tag structures is presented in a follow-up article.

Literature review

Wikipedia has entries on subject indexing, tagging, taxonomies and information retrieval—but I found no research into optimal tag structures.

Two questions on Stack Overflow ask about the minimum number of tags that cover all items. This is the NP-hard set cover problem, but solving it does not lead to efficient information retrieval—the number of items covered by each tag is not accounted for.

In the paper “Limiting Tags Fosters Efficiency,” the authors use the conditional entropy of items given tags as a proxy for retrieval efficiency. The optimal solution is then one unique tag per item. A system with that many tags is obviously only “efficient” in a very specific and somewhat theoretical sense.

The idea in the paper “Mining Association Rules in Folksonomies” could be used to create a hierarchical structure out of non-hierarchical tags, probably with varying results. The paper “Improved Annotation of the Blogosphere via Autotagging and Hierarchical Clustering” appears to be in the same vein.