Tags for optimal information retrieval - part 2: theory

This is part two of a two-article series. I recommend reading the first article before this one.


In this article we investigate optimal tag structures. In the previous article we answered “which subset of a given set of \(m\) tags is best?”, and we will now focus on the continuous problem “given \(m\) tags, how should they be structured if we have freedom to arrange them as we wish?”

Consider again the case where we’re looking for a specific item. The item is uniformly distributed on the interval \([0, 1]\). Assume that we have \(m\) tags that we can cover the interval with, to help us find the item as quickly as possible.

Each combination of tags induces a sub-interval. In the figure above, the two tags \(1\) and \(2\) induce intervals with lengths \(p_1\), \(p_{12}\) and \(p_2\). The goal is to choose the tag structure (interval lengths) to minimize the total expected length of interval searched as we look for an item. As in the previous article, we assume that we are searching using an AND-filter.

Small problems

To get a feeling for the problem, we’ll consider the cases \(m=1, 2, 3\) in detail before generalizing.

A single tag

Consider first a single tag. The tag covers an interval of length \(0 \leq p_1 \leq 1\), as shown in the figure below. Our goal is to find the optimal interval length \(p_1\).

There are two outcomes: either the item we’re looking for is in the interval covered by \(p_1\) or it is not. The probability that the item we’re looking for is in the interval is \(p_1\), since it is drawn uniformly from \([0, 1]\). If the item is in the interval \(p_1\), then on average we will look through half the interval length \(p_1/2\) before we find it. On the other hand, the probability that the item is not in the interval is \(1 - p_1\). In that case, we have to look through the entire interval \([0, 1]\) since we cannot apply any tags. On average we find it after searching half the interval, so \(1/2\) is the expected length of interval searched.

Combining these two cases, the objective function that measures the expected total length of interval searched is

\begin{align*} f(p_1) = p_1 \times (p_1)/2 + (1 - p_1) \times (1)/2. \end{align*}

We minimize this function subject to the constraint \(0 \leq p_1 \leq 1\). The minimizer is \(p_1^\star = 1/2\), and \(f(p_1^\star) = 3/8\).
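
As a quick numerical sanity check, here is a minimal sketch (assuming Python with scipy available) that minimizes \(f(p_1)\) over the unit interval:

```python
from scipy.optimize import minimize_scalar

# Expected search length when a single tag covers an interval of length p1
def f(p1):
    return p1 * p1 / 2 + (1 - p1) * 1 / 2

res = minimize_scalar(f, bounds=(0, 1), method="bounded")
print(res.x, res.fun)  # approximately 0.5 and 0.375 = 3/8
```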

Two tags

Now consider two tags. There are three intervals covered by at least one tag: \(p_1\) is the interval covered by only tag \(1\), \(p_2\) is the interval covered by only tag \(2\) and \(p_{12}\) is the interval covered by both tags. The figure below shows the situation:

Again we sum probabilities times outcomes. Each outcome is the expected interval searched before we find the item. Notice that if the item is in the interval covered by \(p_1\) and we apply the filter for tag \(1\), then an AND-filter will present us with all items covered by tag \(1\), irrespective of whether or not they are also covered by tag \(2\). This means that we have to search through both intervals \(p_1\) and \(p_{12}\).

The objective function becomes:

\begin{align*} f(p_1, p_2, p_{12}) &= p_1 \times (p_1 + p_{12})/2 \\ &+ p_2 \times (p_2 + p_{12})/2 \\ &+ p_{12} \times (p_{12})/2 \\ &+ (1 - p_1 - p_2 - p_{12}) \times (1)/2 \end{align*}

We must minimize this function, subject to \(p_1 + p_2 + p_{12} \leq 1\) and \(p_1, p_2, p_{12} \geq 0\). The minimizer is \((p_1^\star, p_2^\star, p_{12}^\star)  = (1/2, 1/2, 0)\), which leads to the objective function value \(1/4\).
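
The same sanity check for two tags, here using scipy's SLSQP solver (my choice of solver and starting point, not part of the analysis itself):

```python
import numpy as np
from scipy.optimize import minimize

def f(p):
    p1, p2, p12 = p
    return (p1 * (p1 + p12) / 2 + p2 * (p2 + p12) / 2
            + p12 * p12 / 2 + (1 - p1 - p2 - p12) / 2)

res = minimize(f, x0=np.full(3, 0.1), method="SLSQP",
               bounds=[(0, 1)] * 3,
               constraints=[{"type": "ineq", "fun": lambda p: 1 - p.sum()}])
print(res.x, res.fun)  # approximately (0.5, 0.5, 0) and 0.25
```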

Three tags

A pattern starts to emerge. The case of three tags can be visualized as follows.

There are \(2^m-1\) intervals covered by at least one tag; with \(m=3\) that makes seven. The objective function becomes:

\begin{align*} f(p_1, p_2, \ldots, p_{123}) &= p_1 \times (p_1 + p_{12} + p_{13} + p_{123})/2 \\ &+ p_2 \times (p_2 + p_{12} + p_{23} + p_{123})/2 \\ &+ p_3 \times (p_3 + p_{13} + p_{23} + p_{123})/2 \\ &+ p_{12} \times (p_{12} + p_{123})/2 \\ &+ p_{13} \times (p_{13} + p_{123})/2 \\ &+ p_{23} \times (p_{23} + p_{123})/2 \\ &+ p_{123} \times ( p_{123})/2 \\ &+ (1 - p_1 - p_2- p_3- p_{12}- p_{13}- p_{23}- p_{123}) \times (1)/2 \end{align*}

The constraints for the minimization problem are \(\sum_i p_i \leq 1\) and \(p_i \geq 0\) for every variable \(p_i\). While the \(m=1\) and \(m=2\) cases have unique minimizers, here the solution is a convex combination

\begin{align*} (1-\beta) \times (1/3, 1/3, 1/3, 0, 0, 0, 0) + \beta \times(0, 0, 0, 1/3, 1/3, 1/3, 0) \end{align*}

for any \(0 \leq \beta \leq 1\).
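
It is easy to verify that every point on this segment attains the same objective value; a minimal sketch:

```python
def f(p1, p2, p3, p12, p13, p23, p123):
    uncovered = 1 - (p1 + p2 + p3 + p12 + p13 + p23 + p123)
    return (p1 * (p1 + p12 + p13 + p123) / 2
            + p2 * (p2 + p12 + p23 + p123) / 2
            + p3 * (p3 + p13 + p23 + p123) / 2
            + p12 * (p12 + p123) / 2
            + p13 * (p13 + p123) / 2
            + p23 * (p23 + p123) / 2
            + p123 * p123 / 2
            + uncovered / 2)

for beta in (0.0, 0.25, 0.5, 1.0):
    a, b = (1 - beta) / 3, beta / 3
    print(f(a, a, a, b, b, b, 0.0))  # 1/6 for every beta
```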

General problem with \(m\) tags

In the general problem, we sum over all children in the power set of tags. The power set can be visualized as a Hasse diagram, shown below for \(m=4\) tags. The general function for \(m\) tags takes \(2^m-1\) variables as inputs. Let \(P(m)\) be the set of intervals corresponding to the non-empty subsets of the \(m\) tags, for instance \(P(3) = \{p_{1}, p_{12}, p_{2}, p_{13}, p_{123}, p_{23}, p_{3}\}\).

The general function considers each interval in the power set in turn, and loops down through the Hasse diagram to all children (supersets). Writing \(\operatorname{children}(p_i)\) for the total length of all strict supersets of \(p_i\), the function is

\begin{align*} f(\boldsymbol{p}) &= \sum_{p_i \in P(m)} p_i \times (p_i + \operatorname{children}(p_i))/2 \\ &+ (1 - \sum_{p_i \in P(m)} p_i)/2. \end{align*}

As always, the function must be minimized subject to the constraints \(p_i \geq 0\) for all \(p_i \in P(m)\) and \(\sum_{p_i \in P(m)} p_i \leq 1\).
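
Here is a sketch of the general objective in Python, representing each interval by the subset of tags that covers it (the dictionary representation and helper names are my own choices):

```python
from itertools import combinations

def powerset(m):
    """All non-empty subsets of {1, ..., m} as frozensets."""
    return [frozenset(c) for r in range(1, m + 1)
            for c in combinations(range(1, m + 1), r)]

def f(p):
    """p maps each non-empty subset of tags to its interval length."""
    total = 0.0
    for s, length in p.items():
        # children(s): total length of all strict supersets of s
        children = sum(l for t, l in p.items() if t > s)
        total += length * (length + children) / 2
    return total + (1 - sum(p.values())) / 2  # uncovered remainder

# The optimal two-tag structure from earlier evaluates to 1/4:
print(f({frozenset({1}): 0.5, frozenset({2}): 0.5, frozenset({1, 2}): 0.0}))
```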

Another way to construct this function is to use a recursive definition of the power sets, ordering the variables as follows:

  • \(m=1\) yields the ordered set \(P(1) = \{p_{1}\}\).
  • \(m=2\) yields the ordered set \(P(2) = \{p_{1}, p_{12}, p_{2}\}\).
  • \(m=3\) yields the ordered set \(P(3) = \{p_{1}, p_{12}, p_{2}, p_{13}, p_{123}, p_{23}, p_{3}\}\).
  • In general \(P(m) = P(m-1) \cup \{ p_i \cup \{m\} \mid p_i \in P(m-1) \} \cup \{ p_m \}\), as the sketch below shows.
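
A direct translation of the recursion, using `frozenset` as my representation of a subset of tags:

```python
def ordered_powerset(m):
    """P(m): first P(m-1), then each set in P(m-1) with tag m added, then {m}."""
    if m == 1:
        return [frozenset({1})]
    prev = ordered_powerset(m - 1)
    return prev + [s | {m} for s in prev] + [frozenset({m})]

print(ordered_powerset(2))  # [frozenset({1}), frozenset({1, 2}), frozenset({2})]
```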

Given this ordering of the variables, we can write the objective function as a quadratic form

\begin{align*} f(\boldsymbol{p}) = \frac{1}{2} \boldsymbol{p}^T N_{m} \boldsymbol{p} - \frac{1}{2} \boldsymbol{1}^T\boldsymbol{p} + \frac{1}{2}, \end{align*}

where the non-symmetric matrix \(N_{m} \in \{0, 1\}^{(2^m-1) \times (2^m-1)}\) is recursively defined as

\begin{align*} N_1 = \begin{bmatrix}1\end{bmatrix} \quad N_{k+1}= \begin{bmatrix} N_k & N_k & \mathbf{0} \\ \mathbf{0} & N_k & \mathbf{0} \\ \mathbf{0}^T & \mathbf{1}^T & 1 \\ \end{bmatrix}. \end{align*}
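
One way to realize this recursion with numpy (a sketch under the variable ordering above):

```python
import numpy as np

def N(m):
    """Build the (2^m - 1) x (2^m - 1) matrix N_m recursively."""
    if m == 1:
        return np.array([[1.0]])
    A = N(m - 1)
    k = A.shape[0]
    zeros_col = np.zeros((k, 1))
    bottom = np.hstack([np.zeros(k), np.ones(k), [1.0]])
    return np.vstack([np.hstack([A, A, zeros_col]),
                      np.hstack([np.zeros((k, k)), A, zeros_col]),
                      bottom])

def f(p, N_m):
    return 0.5 * p @ N_m @ p - 0.5 * p.sum() + 0.5

p = np.array([0.5, 0.0, 0.5])  # (p1, p12, p2) from the two-tag case
print(f(p, N(2)))              # 0.25, matching the earlier result
```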

Reducing the number of variables with symmetry

We make three observations about the objective function:

  • The function \(f(\boldsymbol{p})\) is invariant under permutations of the values of variables whose sets have the same cardinality in the power set (the same coverage of the items). For instance, when \(m=2\) the solutions \((p_1, p_2, p_{12}) = (1/3, 1/2, 0)\) and \((p_1, p_2, p_{12}) = (1/2, 1/3, 0)\) yield identical objective function values (see the sketch after this list).
  • If only the variables that represent a given cardinality in the power set are non-zero, then the optimal solution is to let them be equal to each other. In other words, if we consider only one level in the Hasse diagram, then the optimal solution is to spread out the coverage equally across all variables associated with that level. For instance, if we only consider variables that represent cardinality three in the \(m=4\) problem, then \(p_{123} = p_{124} = p_{134} = p_{234} = 1/4\) is optimal.
  • If we consider sets of variables with a given cardinality, i.e., a single level in the Hasse diagram, then the level with the most variables is the optimal level. With \(m\geq2\) tags this is level \(\lfloor m/2 \rfloor\), which has \(\binom{m}{\lfloor m/2 \rfloor}\) variables representing unique intervals all having \(\lfloor m/2 \rfloor\)-coverage. This will be explained in more detail later in the article.
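
The first observation is easy to check numerically with the two-tag objective from earlier:

```python
from math import isclose

def f(p1, p2, p12):
    return (p1 * (p1 + p12) / 2 + p2 * (p2 + p12) / 2
            + p12 * p12 / 2 + (1 - p1 - p2 - p12) / 2)

# Swapping the values of p1 and p2 leaves the objective unchanged
print(isclose(f(1/3, 1/2, 0.0), f(1/2, 1/3, 0.0)))  # True
```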

We conjecture that it suffices to check solutions where all variables representing sets of the same cardinality in the power set take the same value; for instance, if \(m=3\) then we assume \(p_{1} = p_{2} = p_{3}\) and \(p_{12} = p_{13} = p_{23}\). By this symmetry argument, we reduce the number of variables from \(2^m-1\) to \(m\).

Let \(q_i\) be the common value of the variables that represent intervals covered by exactly \(i\) tags. For instance, if \(m=3\) then \(q_1\) is the value of the variables \(p_1, p_2, p_3\), \(q_2\) is the value of \(p_{12}, p_{13}, p_{23}\), and \(q_3\) is the value of \(p_{123}\).

With respect to the new variable vector \(\boldsymbol{q}\), the problem becomes:

\begin{align*} f(\boldsymbol{q}) &= \frac{1}{2} \sum_{i=1}^m \binom{m}{i} q_i \sum_{k=i}^m q_k \binom{m-i}{k-i} - \frac{1}{2} \sum_{i=1}^m \binom{m}{i} q_i + \frac{1}{2} \\ &= \frac{1}{2} \sum_{i=1}^m \sum_{k=i}^m \binom{m}{k} \binom{k}{i} q_i q_k - \frac{1}{2} \sum_{i=1}^m \binom{m}{i} q_i + \frac{1}{2} \\ &= \frac{1}{2} \boldsymbol{q}^T C_m \boldsymbol{q} - \frac{1}{2} \boldsymbol{c}_m^T \boldsymbol{q} + \frac{1}{2} \end{align*}

We minimize this function with the constraint \(\boldsymbol{c}_m^T \boldsymbol{q} \leq 1\) and \(\boldsymbol{q} \geq \boldsymbol{0}\).
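
Here is a sketch of the reduced problem in Python (the helper name `reduced` is my own; the construction follows the double sum above):

```python
from math import comb
import numpy as np

def reduced(m):
    """Upper-triangular matrix C_m and vector c_m of the reduced problem."""
    C = np.zeros((m, m))
    for i in range(1, m + 1):
        for k in range(i, m + 1):
            C[i - 1, k - 1] = comb(m, k) * comb(k, i)
    c = np.array([comb(m, i) for i in range(1, m + 1)], dtype=float)
    return C, c

def f(q, C, c):
    return 0.5 * q @ C @ q - 0.5 * c @ q + 0.5

C, c = reduced(2)
print(f(np.array([0.5, 0.0]), C, c))  # 0.25, the two-tag optimum again
```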

The problem is not convex

With one and two tags the problem is strictly convex. With three tags the problem is convex, but not strictly convex. With four or more tags the problem is neither convex nor concave.
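
One way to check these claims numerically is to inspect the eigenvalues of the symmetric part of \(C_m\), since only the symmetric part contributes to the quadratic form. A sketch:

```python
from math import comb
import numpy as np

for m in range(1, 6):
    # C_m as defined above (upper triangular; the math is 1-indexed)
    C = np.array([[comb(m, k + 1) * comb(k + 1, i + 1) if k >= i else 0
                   for k in range(m)] for i in range(m)], dtype=float)
    S = (C + C.T) / 2
    eigvals = np.linalg.eigvalsh(S)
    print(m, eigvals.min(), eigvals.max())
# m = 1, 2: all eigenvalues positive (strictly convex)
# m = 3: smallest eigenvalue numerically zero (convex, not strictly)
# m = 4, 5: eigenvalues of both signs (neither convex nor concave)
```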

Here is an example with \(m=4\) showing the non-convexity of \(f(\boldsymbol{q})\). Along one search direction in four-dimensional space the function is convex, but along another search direction it is concave.

For each \(m\) we will now search along a path that represents convex combinations of neighboring levels in the Hasse diagram. In the figure below \(0\) maps to \((0, 0, 0, \ldots)\), \(1\) maps to \((1/\binom{m}{1}, 0, 0, \ldots)\), \(2\) maps to \((0, 1/\binom{m}{2}, 0, \ldots)\) and so forth. A decimal value like \(1.3\) maps to a convex combination of the vectors that \(1\) and \(2\) map to.

We observe that setting \(q_i = 1/\binom{m}{i}\) at level \(i = \lfloor m/2 \rfloor\) minimizes the function along the path searched in the figure. A formal proof that this is the best achievable solution escapes me. However, in millions of computer simulations no better solution was found for \(m\geq 2\), so I am quite confident that it’s the optimal solution.
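
My experiments were of roughly this form, sampling random feasible points and comparing them against the conjectured optimum (a sketch reusing `reduced` and `f` from the earlier sketch; the sampling scheme shown here is one arbitrary choice):

```python
from math import comb
import numpy as np

m = 5
C, c = reduced(m)                      # from the earlier sketch
conjectured = 1 / (2 * comb(m, m // 2))

rng = np.random.default_rng(42)
for _ in range(100_000):
    x = rng.random(m)
    q = x * rng.random() / (c @ x)     # random feasible point: c^T q <= 1
    assert f(q, C, c) >= conjectured - 1e-12
```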

Two bounds on the optimal solution

We present two upper bounds on the optimal value of the minimization problem: the minimum is at least as low as these.

One-coverage. A bound for \(m \geq 2\) can be established by covering each item with a single tag, where each tag covers an interval of length \(1/m\). This is represented by \(\boldsymbol{q} = (1/m, 0, 0, \ldots)\), which is clearly a feasible solution. The objective function value is \(f(\boldsymbol{q}) = 1/(2m)\). In the discrete case of \(n\) items, the objective function \(g\), which penalizes the number of tags too, becomes

\begin{align*} g(m) = \frac{n}{2m} + \alpha m. \end{align*}

Differentiating with respect to \(m\) and solving \(g'(m) = 0\) shows that the optimal number of tags in this situation is \(m^\star = \sqrt{n/(2 \alpha)}\). Furthermore, \(g(m^\star) = \sqrt{2n\alpha}\). To summarize, if we construct a set of roughly \(\sqrt{n/(2 \alpha)}\) mutually exclusive tags, we have a reasonable starting point in a real-world system.
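
For example, with hypothetical values \(n = 10^4\) and \(\alpha = 1\):

```python
from math import sqrt

n, alpha = 10_000, 1.0          # hypothetical values for illustration
m_star = sqrt(n / (2 * alpha))  # about 70.7 tags
g_star = sqrt(2 * n * alpha)    # about 141.4 expected cost
print(round(m_star, 1), round(g_star, 1))
```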

The conjectured optimal coverage. We conjecture that the following is an optimal solution for \(m \geq 2\) tags. Choose a full, balanced \(\lfloor m/2 \rfloor\)-coverage. In other words, choose the level in the Hasse diagram with the largest number of variables. This is represented by setting \(q_i = 1/\binom{m}{i}\) at level \(i = \lfloor m/2 \rfloor\), with the remaining entries in \(\boldsymbol{q}\) set to zero. The objective function value can be shown to be \(f(\boldsymbol{q}) = \frac{1}{2 \binom{m}{\lfloor m/2 \rfloor}}\).

Solving

\begin{align*} g(m) = \frac{n}{2 \binom{m}{\lfloor m/2 \rfloor}} + \alpha m \end{align*}

for the optimal \(m\) must be done numerically, as no simple analytical expression is available. For instance, if we set \(\alpha=1\) and have \(n=10^2\) items, then \(m=7\) tags is the optimal number, given that those tags obey the conjectured optimal structure. With \(n=10^3\) items, \(m=10\) tags is all we need; \(n=10^4\) items require \(m=14\) tags, and \(n=10^6\) items require only \(m=21\) tags.
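
The numerical search is a short loop; here is a sketch that reproduces the numbers above (the cap of 60 on the search range is an arbitrary choice):

```python
from math import comb

def g(m, n, alpha=1.0):
    """Penalized expected search length under the conjectured structure."""
    return n / (2 * comb(m, m // 2)) + alpha * m

for n in (10**2, 10**3, 10**4, 10**6):
    best_m = min(range(2, 60), key=lambda m: g(m, n))
    print(n, best_m)  # 7, 10, 14 and 21 tags respectively
```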

Here are two examples showing optimal tag structures:

Summary and references

In the article leading up to this one we solved a combinatorial optimization problem using real-world data. That led us to wonder more about the theoretically optimal tag structure, which we investigated in this article.

We began by solving small cases with one, two and three tags. Only after understanding the objective function for these small cases did we generalize. This is how I typically work with mathematical problems—mastering the small concrete cases first, then attempting to generalize.

After writing down the objective function for \(m\) tags, we made observations regarding the symmetry of the function. We reduced it from a function taking \(2^m-1\) arguments to one taking \(m\) arguments. Then we determined that the function is not convex and established two upper bounds, one of which we conjecture to be the optimal solution. I was unable to prove this, but ran millions of computer simulations trying out random tag structures, and not a single one beat the bound, so I believe the conjecture is true.

For me this was an interesting problem to look at, since I found no literature on this subject. Wikipedia has entries on subject indexing, tagging, taxonomies and information retrieval—but I found no research into how tags can be used for maximally efficient information retrieval.