Solving two-sum using relational algebra: from LeetCode to index data structures
I’m trying to build a good mental model for which types of problems admit faster solutions by building “index” data structures. I have found that it helps to think about the logical structure of a puzzle and extract its relational form; from that, one can read off invariants, which often lead to data structures that exploit them.
We will explore this approach via SQL and its relational algebra, where the “index data structures” are literally database indexes.
Two Sum Problem Statement
In this post, I want to give some intuition for this approach by deconstructing the very first problem on LeetCode: https://leetcode.com/problems/two-sum/description/. We slightly modify it to assume that potentially many 2-sums exist in the array:
Given an array of integers `nums` and an integer `target`, return all pairs of indices whose two numbers add up to `target`.
Logically, this is an enumeration problem that can be written as:
Enumerate: { (i,j) | i≠j ∧ nums[i] + nums[j] == target }
where enumerate means to list the entire set of indices that match the predicate P(i,j) := i≠j ∧ nums[i] + nums[j] == target.
Rust O(n^2) Solution
Let’s start by solving this problem in Rust, without any fanciness. We just enumerate all pairs (i, j) and check whether the predicate holds:
fn two_sum(nums: Vec<i32>, target: i32) -> Vec<(i32, i32)> {
    let mut result = Vec::new();
    for i in 0..nums.len() {
        for j in i + 1..nums.len() {
            if nums[i] + nums[j] == target {
                // push the *indices*, not the values
                result.push((i as i32, j as i32));
            }
        }
    }
    result
}
Rust O(n) Solution using Hashmap
Naturally the follow-up interview question is: is there a way to speed this up? And it turns out that there is: we can use a hashmap to trade space for time, and end up with an O(n) time solution at the cost of extra O(n) space (for storing the hashmap):
use std::collections::HashMap;

fn two_sum(nums: Vec<i32>, target: i32) -> Vec<(i32, i32)> {
    // Map each needed complement (target - nums[i]) to its index i.
    // Note: with duplicate values, later inserts overwrite earlier
    // ones, so only one j per complement survives.
    let mut index = HashMap::with_capacity(nums.len());
    for (i, num) in nums.iter().enumerate() {
        index.insert(target - num, i);
    }
    let mut result = Vec::new();
    for (i, num) in nums.iter().enumerate() {
        if let Some(&j) = index.get(num) {
            if i < j {
                result.push((i as i32, j as i32));
            }
        }
    }
    result
}
Predicate Logic view of the Hashmap solution
The set-theoretic formulation of the problem hides how the indices are found/iterated on.
Enumerate: { (i,j) | i≠j ∧ nums[i] + nums[j] == target }
We can reformulate the above to make that double for-loop iteration more explicit:
# First rewrite `nums` as a set of (index, value) pairs.
# This set will be the SQL table below.
nums := { (i, x) | nums[i] = x }

solution = { (i, j) | (i, x) ∈ nums,  # loop over all (idx, value) pairs
                      (j, y) ∈ nums,  # loop over all (idx, value) pairs
                      x + y = k,      # filter (k is the target)
                      i < j }
Here the double loop over nums is made explicit via (i,x) ∈ nums, (j,y) ∈ nums.
And the way to speed this up is to replace one of them with an index, which is just the implementation of a function x → [j]: given a value x, the index returns all indices j that complete the predicate (i.e. all j with nums[j] = k − x):
# Precompute, for each value x, the set of indices that complete it to k
target_index = { (k - x, i) | nums[i] = x }

solution = { (i, j) | (i, x) ∈ nums,        # loop over all (idx, value) pairs
                      j ∈ target_index[x],  # lookup in index
                      i < j }
Note that this solution is doing the EXACT same thing as the Rust solution above: first building the index, then looping (once, instead of twice nested!) over the array.
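As a sketch of this formulation in Rust (`two_sum_all` is a hypothetical helper, not part of the solutions above): storing *all* indices per complement in a multimap makes the lookup `j ∈ target_index[x]` literal, and also handles duplicate values, which the single-slot HashMap version silently drops.

```rust
use std::collections::HashMap;

// Multimap version of `target_index`: each needed complement
// (target - x) maps to *all* indices holding x, so duplicate
// values yield every matching pair.
fn two_sum_all(nums: &[i32], target: i32) -> Vec<(usize, usize)> {
    // complement -> list of indices i with nums[i] = target - complement
    let mut index: HashMap<i32, Vec<usize>> = HashMap::new();
    for (i, &x) in nums.iter().enumerate() {
        index.entry(target - x).or_default().push(i);
    }
    let mut result = Vec::new();
    for (i, &x) in nums.iter().enumerate() {
        // j ∈ target_index[x], i.e. all j with nums[j] = target - x
        if let Some(js) = index.get(&x) {
            for &j in js {
                if i < j {
                    result.push((i, j));
                }
            }
        }
    }
    result
}
```

The inner loop over `js` is exactly the `j ∈ target_index[x]` enumeration; total work is O(n) plus the number of output pairs.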
SQL O(n^2) Solution
We can first assume that we have a nums table to represent the array.
CREATE TABLE nums (
idx INTEGER, -- the index i
val INTEGER -- the value nums[i]
);
-- Insert sample data (e.g., [2, 7, 11, 15] with target k=9)
INSERT INTO nums VALUES
(0, 2),
(1, 7),
(2, 11),
(3, 15);
SELECT a.idx AS i, b.idx AS j
FROM nums a
JOIN nums b ON a.val + b.val = k
WHERE a.idx < b.idx;
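For intuition, the nested-loop join above can be spelled out as a plain Rust iterator chain over the (idx, val) rows (a sketch; `nested_loop_join` is an illustrative name, and k stands for the target, 9 for the sample data):

```rust
// The SQL nested-loop join, spelled out as Rust iterators over the
// (idx, val) rows of the `nums` table: FROM nums a JOIN nums b
// ON a.val + b.val = k WHERE a.idx < b.idx.
fn nested_loop_join(nums: &[(i32, i32)], k: i32) -> Vec<(i32, i32)> {
    nums.iter()
        .flat_map(|&(i, x)| {
            nums.iter()
                .filter(move |&&(j, y)| x + y == k && i < j)
                .map(move |&(j, _)| (i, j))
        })
        .collect()
}
```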
SQL O(n) Solution using Hash Join
-- Create index on val (to give us the mapping val->idx)
CREATE INDEX idx_val ON nums(val);
SELECT a.idx AS i, b.idx AS j
FROM nums a
JOIN nums b ON b.val = k - a.val
WHERE a.idx < b.idx;
We can now run EXPLAIN QUERY PLAN on these two queries to see that one uses a nested for-loop and the other the index (we substitute k := 9):
sqlite> EXPLAIN QUERY PLAN
SELECT a.idx AS i, b.idx AS j
FROM nums a
JOIN nums b ON a.val + b.val = 9
WHERE a.idx < b.idx;
QUERY PLAN
|--SCAN a
`--SCAN b
sqlite> EXPLAIN QUERY PLAN
SELECT a.idx AS i, b.idx AS j
FROM nums a
JOIN nums b ON b.val = 9 - a.val
WHERE a.idx < b.idx;
QUERY PLAN
|--SCAN a
`--SEARCH b USING INDEX idx_val (val=?)
Note that the only difference between using the index and not using it here is whether we write JOIN nums b ON a.val + b.val = 9 or b.val = 9 - a.val. The rewritten form isolates the indexed column b.val on one side of the equality, giving the planner a concrete key to probe the index with; still, I would have thought that sqlite3 would be smart enough to perform this rewrite itself…
Bloom Filter
Another interesting case is using a Common Table Expression instead of the explicit index, in which case sqlite3 builds an automatic covering index and guards it with a bloom filter:
sqlite> EXPLAIN QUERY PLAN
WITH target AS (
SELECT val, idx
FROM nums
)
SELECT
a.idx AS i,
b.idx AS j
FROM nums a
JOIN target b ON b.val = 9 - a.val
WHERE a.idx < b.idx;
QUERY PLAN
|--SCAN a
|--BLOOM FILTER ON nums (val=?)
`--SEARCH nums USING AUTOMATIC COVERING INDEX (val=?)
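To get some intuition for what that BLOOM FILTER step buys, here is a minimal, hedged Bloom-filter sketch in Rust (std-only; the sizing and double-hashing scheme are illustrative choices, not sqlite's actual implementation): a compact bit array that answers "definitely absent" with no false negatives, letting the join skip most index probes for keys that cannot match.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative Bloom filter: a bit array probed at k positions per key.
// false => key definitely absent; true => key *maybe* present.
struct BloomFilter {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl BloomFilter {
    fn new(num_bits: usize, num_hashes: u64) -> Self {
        BloomFilter { bits: vec![false; num_bits], num_hashes }
    }

    // Derive the probe positions from two hash values via the standard
    // "double hashing" trick: position_i = h1 + i * h2.
    fn positions<T: Hash>(&self, item: &T) -> Vec<usize> {
        let mut s1 = DefaultHasher::new();
        item.hash(&mut s1);
        let h1 = s1.finish();
        let mut s2 = DefaultHasher::new();
        (item, 0xdead_beefu64).hash(&mut s2); // arbitrary illustrative salt
        let h2 = s2.finish() | 1; // force odd so probes spread
        (0..self.num_hashes)
            .map(|i| (h1.wrapping_add(i.wrapping_mul(h2)) % self.bits.len() as u64) as usize)
            .collect()
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        for p in self.positions(item) {
            self.bits[p] = true;
        }
    }

    fn may_contain<T: Hash>(&self, item: &T) -> bool {
        self.positions(item).iter().all(|&p| self.bits[p])
    }
}
```

In the query plan above, the cheap `BLOOM FILTER ON nums (val=?)` check runs before the more expensive `SEARCH ... USING AUTOMATIC COVERING INDEX` probe for each row of `a`.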
What about Merge Joins?
Relational/SQL experts will quickly notice that we haven’t used a merge join. That is, we’ve used a hash index but not a B+tree index (typically what is used to maintain the sorted order a merge join needs). Unfortunately for 2-sum, given that we want the original indices, we cannot sort in place, so a merge-join algorithm still costs as much space as a hash index (unless the nums array is already sorted!):
| algorithm | time | space |
|---|---|---|
| nested loop | O(n^2) | O(1) |
| hash join | O(n) | O(n) |
| merge join | O(n log n) | O(n) |
Still, it is instructive to look at the merge join algorithm and ask what it would be equivalent to if written in Rust. First, the SQL query would look like:
-- Create a CTE with the "complement" values (note: the target is 10 in this run)
WITH nums_with_complement AS (
SELECT idx, val, 10 - val AS complement
FROM nums
)
SELECT a.idx AS i, b.idx AS j
FROM nums_with_complement a
JOIN nums_with_complement b ON a.val = b.complement
WHERE a.idx < b.idx;
Apparently sqlite doesn’t implement merge join (I’m not sure why), but this will work in postgres:
test_db=# EXPLAIN (ANALYZE)
WITH nums_with_complement AS (
SELECT
idx,
val,
10 - val AS complement
FROM nums
ORDER BY val
)
SELECT a.idx AS i, b.idx AS j, a.val, b.val
FROM nums_with_complement a
JOIN nums_with_complement b ON a.val = b.complement
WHERE a.idx < b.idx;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=512.02..970.23 rows=8513 width=16) (actual time=0.056..0.058 rows=3 loops=1)
Merge Cond: (a.val = b.complement)
Join Filter: (a.idx < b.idx)
Rows Removed by Join Filter: 3
CTE nums_with_complement
-> Sort (cost=164.16..169.81 rows=2260 width=12) (actual time=0.028..0.028 rows=8 loops=1)
Sort Key: nums.val
Sort Method: quicksort Memory: 25kB
-> Seq Scan on nums (cost=0.00..38.25 rows=2260 width=12) (actual time=0.013..0.014 rows=8 loops=1)
-> Sort (cost=171.11..176.76 rows=2260 width=8) (actual time=0.041..0.041 rows=8 loops=1)
Sort Key: a.val
Sort Method: quicksort Memory: 25kB
-> CTE Scan on nums_with_complement a (cost=0.00..45.20 rows=2260 width=8) (actual time=0.030..0.032 rows=8 loops=1)
-> Sort (cost=171.11..176.76 rows=2260 width=12) (actual time=0.012..0.012 rows=8 loops=1)
Sort Key: b.complement
Sort Method: quicksort Memory: 25kB
-> CTE Scan on nums_with_complement b (cost=0.00..45.20 rows=2260 width=12) (actual time=0.000..0.001 rows=8 loops=1)
Planning Time: 0.178 ms
Execution Time: 0.110 ms
(19 rows)
And funnily enough, the merge join algorithm is internally doing the exact same thing as a famous LeetCode technique: two pointers moving toward each other over the sorted array! In effect, it implicitly merges the array’s head with its own tail…
fn two_sum(nums: Vec<i32>, target: i32) -> Vec<(i32, i32)> {
    // Keep the original indices around, then sort by value.
    let mut sorted_nums: Vec<(usize, i32)> = nums.into_iter().enumerate().collect();
    sorted_nums.sort_by_key(|&(_, num)| num);
    let mut result = Vec::new();
    if sorted_nums.is_empty() {
        return result; // avoid underflow on `len() - 1` below
    }
    let mut left = 0;
    let mut right = sorted_nums.len() - 1;
    while left < right {
        let sum = sorted_nums[left].1 + sorted_nums[right].1;
        if sum == target {
            // report the original indices, smaller first
            let (i, j) = (sorted_nums[left].0, sorted_nums[right].0);
            result.push((i.min(j) as i32, i.max(j) as i32));
            left += 1;
            right -= 1;
        } else if sum < target {
            left += 1;
        } else {
            right -= 1;
        }
    }
    result
}
When can index data structures help?
This was just a first glimpse into thinking logically about problem definitions. This systematization can often help figure out when, where, and how data structures can lead to faster algorithms.
First, it’s often helpful to think of data structures in terms of the invariants/properties that they afford:
- Sorted order → Binary search tree, skip list
- Hash bucketing → Hash table
- Prefix structure → Trie
- Range decomposition → Segment tree, interval tree
- Disjoint sets → Union-find
- Partial order → Heap
The main question to ask oneself when looking at a problem formulation is whether such an invariant could be useful to reduce its search space:
indexes help when queries can be decomposed into subproblems that the index’s invariant makes cheap.
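As a tiny concrete instance of that principle: maintaining the sorted-order invariant lets a membership query discard half of the remaining search space at every step, decomposing it into O(log n) cheap subproblems instead of an O(n) scan. A sketch using std’s `binary_search` (the function name is illustrative):

```rust
// The sorted-order invariant at work: `binary_search` halves the
// remaining range at each step, so membership costs O(log n) instead
// of the O(n) linear scan an unsorted Vec would require.
fn contains_sorted(sorted: &[i32], x: i32) -> bool {
    sorted.binary_search(&x).is_ok()
}
```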
Translating a problem into a logical expression can then help more systematically determine whether these invariants exist. Here’s a (very likely non-exhaustive) list of logical operators and how they relate to invariants/data structures, generated with the help of Claude:
- ∃x (existence) → Bloom filter (probabilistic membership)
- min(x) → Heap (maintains partial order: parent ≤ children)
- x in range [a,b] → Interval tree (decomposes space into intervals)
- ∧ of predicates → Bitmap index (each bit = one predicate result)
- Reachability → Transitive closure precomputation
- Dominance → Range tree/kd-tree (geometric decomposition)
I plan to dive deeper into some of these in future articles.