Data structures in Python
- 18. September 2021 (modified 19. September 2021)
- #algorithms
I recently held a seminar for some colleagues about data structures. In preparation for the talk, I took some notes about how common data types are used in Python and how they are implemented. Below are my notes. The code snippets are all Python 3.7.
Table of Contents
What are data structures?
A data structure is a way of organizing information in a computer’s memory. An analogy to libraries and books is fitting. The books in a library are first categorized by topic, then sorted by the last name of the author. When the books are organized like this, we don’t have to inspect every book on the shelf to find the one we’re looking for. To find a section in the book itself, we can either use the table of contents or the index.
These are all clever ways of organizing information. The goal is to minimize the time searching through data. In the same manner, data structures organize data for a computer program.
This article contains some notes on abstract data types and their implementation using data structures. I also show how to use the various data structures in Python, and provide some comments about how data types are implemented in C++.
On terminology
Wikipedia makes a distinction between abstract data types and data structures. For instance, a queue is an abstract data type. A specific implementation of a queue might use a linked list, which is a data structure.
- An abstract data type is a user-centric description. It provides a conceptual model and a set of methods for the data type.
- A data structure is a more implementation-centric description. It provides a concrete and efficient implementation of the methods needed for the data type.
Matters are complicated by how different programming languages use these terms, and the fact that some textbooks refer to both concepts simply as data structures.
For instance, Python’s dict
and C++’s unordered_map
are both maps (an abstract data type) implemented using hash tables (a data structure).
Method names also differ vastly. Consider the abstract data type queue:
- In the book “Introduction to algorithms” by Cormen et al., we
enqueue
anddequeue
items to a queue. - In C++ we
push
andpop
. - In Python we
append
andpop
.
The examples above are not all that problematic, since we understand what the operations mean.
The names push
and append
and enqueue
all signify that we’re adding an item.
The terms vary wildly across languages, implementations and literature.
Abstract data types
Below is a list of common abstract data types, and how to use them in Python. Each data type has a description, a note on how it can be implemented, some common methods that we expect to be implemented efficiently, and a Python example. The big-\(\mathcal{O}\) notation showcasing the complexity of the methods is generally the worst case, but sometimes it’s the amortized complexity.
Stack
A stack behaves like a stack of coins on a table, where we are only allowed to touch the top coin. Adding a coin to the stack is called pushing to the stack. Removing a coin from the top is called popping from the stack.
Stacks are useful when the last-in-first-out (LIFO) property is important. This is true for depth-first search and for recursion call stacks in programming languages.
Can be implemented by: linked list, array, list
Method | Description | Complexity |
---|---|---|
size() |
Return the number of items in the stack | \(\mathcal{O}\left( 1 \right)\) |
push(item) |
Add a new item to the stack | \(\mathcal{O}\left( 1 \right)\) |
peek() |
Get the most recently added item | \(\mathcal{O}\left( 1 \right)\) |
pop() |
Remove the most recently added item | \(\mathcal{O}\left( 1 \right)\) |
In Python we can use a list
as a stack.
It’s crucially important that we push and pop using append
and pop
from the back of the list.
If we push and pop in the front of the list, the time complexity becomes \(\mathcal{O}\left( n \right)\) since all the other elements have to be moved in memory.
>>> stack = []
>>> stack.append("Tom")
>>> stack.append("Mary")
>>> print(stack[-1]) # Peek at most recently added item
>>> while stack:
... print(f"Popped: {stack.pop()}")
Popped: Mary
Popped: Tom
Queue
The queue data type is analogous to a queue in a convenience store. The person who enters the queue first, is served first. People get in line at one end, and are served from the other end of the queue.
Queues are particularly useful when the first-in-first-out (FIFO) property is important. An obvious example is software that simulates real-life queues. Queues that coordinate work in concurrent programs also come to mind.
Can be implemented by: linked list, doubly-linked list
Method | Description | Complexity |
---|---|---|
size() |
Return the number of items in the queue | \(\mathcal{O}\left( 1 \right)\) |
enqueue(item) |
Add a new item to the queue | \(\mathcal{O}\left( 1 \right)\) |
peek() |
Get the least recently added item | \(\mathcal{O}\left( 1 \right)\) |
dequeue() |
Remove the least recently added item | \(\mathcal{O}\left( 1 \right)\) |
In Python we use the double ended queue in collections.deque
as a queue.
We avoid using list
, since operations performed at the front of the list run in \(\mathcal{O}\left( n \right)\) time.
Bad idea!
>>> from collections import deque
>>> queue = deque()
>>> queue.append("Tom")
>>> queue.append("Mary")
>>> while queue:
... print(f"Popped: {queue.popleft()}")
Popped: Tom
Popped: Mary
If we wanted to, we could’ve used appendleft
to enqueue and pop
to dequeue instead.
Double ended queue (deque)
A double ended queue, which supports pushing and popping from both the front and the back.
Can be implemented by: doubly-linked list
Method | Description | Complexity |
---|---|---|
size() |
Return the number of items in the deque | \(\mathcal{O}\left( 1 \right)\) |
push_back(item) |
Add to the back of the deque | \(\mathcal{O}\left( 1 \right)\) |
pop_back() |
Remove from the back of the deque | \(\mathcal{O}\left( 1 \right)\) |
push_front(item) |
Add to the front of the deque | \(\mathcal{O}\left( 1 \right)\) |
pop_front() |
Remove from the front of the deque | \(\mathcal{O}\left( 1 \right)\) |
In python we use collections.deque
, just as we saw above.
Priority queue
Priority queues are queues where each element has a priority. In a min (max) priority queue the priority represents cost (utility).
A priority queue is crucial for Dijkstra’s shortest-path algorithm. More generally, the wide class of graph search methods called best-first searches all rely on priority queues.
Priority queues are almost always implemented using binary heaps, which are in turn implemented on top of dynamic arrays (such as Python’s list
).
The more general balanced search tree can do almost everything a priority queue can do and more.
The advantage of a priority queue is its ease of implementation, somewhat lower overhead and \(\mathcal{O}\left( n \right)\) heapify
method.
Can be implemented by: binary heap, other types of heaps, balanced search trees
Method | Description | Complexity |
---|---|---|
heapify(items) |
Create a new priority queue | \(\mathcal{O}\left( n \right)\) |
insert(item, priority) |
Add a new item | \(\mathcal{O}\left( \ln n \right)\) |
minimum() |
Get the minimum item | \(\mathcal{O}\left( 1 \right)\) |
extract_minimum() |
Get the minimum element and pop it | \(\mathcal{O}\left( \ln n \right)\) |
Some implementations also support contains(item)
, update_key(item, new_key_value)
and delete(item)
—all of which can be implemented in \(\mathcal{O}\left( \ln n \right)\) time.
Some types of heaps also support fast merging (or melding), which is useful if two or more priority queues need to be combined.
In Python we use the heapq
module to manipulate a list
as a priority queue.
>>> import heapq
>>> priority_queue = [(1, "Mary"), (3, "Tom"), (0, "James")]
>>> heapq.heapify(priority_queue) # Changes list in place
>>> priority_queue # Notice that the order has changed
[(0, 'James'), (3, 'Tom'), (1, 'Mary')]
>>> print(f"The smallest item is: {priority_queue[0]}")
The smallest item is: (0, 'James')
>>> heapq.heappush(priority_queue, (2, "Alice"))
>>> while priority_queue: # Get all items
... priority, item = heapq.heappop(priority_queue)
... print(f"Popped item {item} with priority {priority}.")
Popped item James with priority 0.
Popped item Mary with priority 1.
Popped item Alice with priority 2.
Popped item Tom with priority 3.
Set
A set data type models the mathematical concept of a set: a collection of unique items (no duplicates).
Sets typically have fast membership testing, as well as support for set operations such as union
and intersection
.
Can be implemented by: hash table, red-black tree, other balanced binary search trees
The methods contains
, add
and remove
below run in \(\mathcal{O}\left( \ln n \right)\) time instead of \(\mathcal{O}\left( 1 \right)\) when a balanced binary search tree is used instead of a hash table.
The advantage of using a self-balanced tree is that minimum
, maximum
, successor
and predecessor
can be implemented in \(\mathcal{O}\left( \ln n \right)\) time.
When sets are implemented with hash tables, the following complexity is achieved:
Method | Description | Complexity |
---|---|---|
size() |
Return the number of items in the set | \(\mathcal{O}\left( 1 \right)\) |
contains(item) |
Check if the item is in the set | \(\mathcal{O}\left( 1 \right)\) |
add(item) |
Add an item | \(\mathcal{O}\left( 1 \right)\) |
remove(item) |
Remove an item | \(\mathcal{O}\left( 1 \right)\) |
In Python we use set
, which is implemented with a hash table.
>>> football_players = set(["Joe", "Mary", "Joanna"])
>>> handball_players = {"Brad", "Joanna", "Ashley"}
>>> "Joe" in football_players # Fast membership testing in O(1) time
True
>>> football_players.intersection(handball_players) # Plays both sports
{'Joanna'}
>>> football_players.union(handball_players) # Plays any sport
{'Ashley', 'Brad', 'Joanna', 'Joe', 'Mary'}
>>> football_players - handball_players # Plays only football
{'Joe', 'Mary'}
In C++ we have two types of sets:
set
, a sorted set implemented using a red-black tree (a self-balancing tree).unordered_set
, an unsorted set implemented using a hash table.
Map
A map stores key-value pairs, where looking up a value by a key is fast.
Can be implemented by: hash table, red-black tree, other self-balanced trees
Method | Description | Complexity |
---|---|---|
size() |
Return the number of items in the map | \(\mathcal{O}\left( 1 \right)\) |
insert(key, value) |
Insert a new key-value pair | \(\mathcal{O}\left( 1 \right)\) |
lookup(key) |
Return the value if it exists | \(\mathcal{O}\left( 1 \right)\) |
remove(key) |
Remove the value if it exists | \(\mathcal{O}\left( 1 \right)\) |
In Python the dict
is a map type, implemented by a hash table.
>>> phonebook = {"Mary": 92839482, "Joe": 92857294}
>>> len(phonebook) # The size
2
>>> phonebook["Tom"] = 18572957 # Insert a new key and value
>>> phonebook.get("Mary", None) # Lookup a value, return None if not found
92839482
>>> phonebook.pop("Joe") # Remove a key and value
92857294
In C++ we have two types of maps:
map
, a sorted map implemented using a red-black trees (a self balancing tree)unordered_map
, an unsorted map implemented using a hash table
Balanced Search Tree
The balanced search tree is in many ways the ultimate sorted collection. Every method runs in logarithmic time, the data type is extremely versatile, and many interesting problems can be solved by augmenting the data structure. The four most common implementations are given below.
Can be implemented by:
- Red-black trees. Each node stores a color. Upon insertion and deletion, the tree is re-balanced using the color properties to guarantee near-perfect balance.
- AVL trees. Each node has a balance factor. Upon insertion and deletion the balance factor might change, causing the tree to become unbalanced. If this happens, the tree is rebalanced using so-called rotations.
- Splay trees. Recently accessed elements are moved to the top of the tree. A splay tree is not guaranteed to be balanced, but for real-world access patterns it’s often well-balanced. The performance characteristics are amortized costs.
- B-trees. The B-tree allows nodes to have more than two children. It’s the standard index implementation in almost all relational databases. In databases, the bottleneck in terms of time is reading from disk, and organizing an index in a B-tree ensures few disk reads.
Method | Description | Complexity |
---|---|---|
contains(key) |
Does the tree contain key ? |
\(\mathcal{O}\left( \ln n \right)\) |
predecessor(key) |
Return greatest key smaller than key |
\(\mathcal{O}\left( \ln n \right)\) |
successor(key) |
Return smallest key greater than key |
\(\mathcal{O}\left( \ln n \right)\) |
minimum() |
Return the minimum key | \(\mathcal{O}\left( \ln n \right)\) |
maximum() |
Return the maximum key | \(\mathcal{O}\left( \ln n \right)\) |
insert(key) |
Insert a new key | \(\mathcal{O}\left( \ln n \right)\) |
delete(key) |
Delete a key | \(\mathcal{O}\left( \ln n \right)\) |
The Python standard library has no balanced search tree.
However, the sortedcontainers project is a pure Python implementation of sorted lists, sorted dicts, and sorted sets.
Interestingly it leverages Python’s built in list
instead of implementing an actual tree.
Other important abstract data types
I mention three other abstract data types that are important. Graphs are well-known from the computer science curriculum. Linear algebra is the workhorse of numerical methods, machine learning and optimization. Dataframes are the applied statistician’s best friend.
- Graph types. In Python, the package NetworkX implements graphs and their algorithms. For graphs, representing the objects and a few methods is not enough. Working efficiently with graphs requires a package that implements all the common graph algorithms, such as search, shortest path, DAG algorithms, and so forth. NetworkX has all this and much more.
- Linear algebra types. To a software engineer, a matrix is a two-dimensional array. In many other disciplines, a matrix is something more – it comes with a matrix product, several important factorizations, eigenvalues, eigenvectors, and so forth. A good workspace for linear algebra requires implementations of numerical routines for mathematical operations. The implementations must be efficient, both in terms of big-\(\mathcal{O}\) and in terms of constants. Both dense and sparse matrices should be supported. In the Python world, NumPy and SciPy has all this.
- DataFrame types. For those who work with data sets, a DataFrame type is essential. It should implement SQL like methods, such as
GROUP BY
,WHERE
,ORDER BY
,SELECT
, and all the rest. The pandas package is often the correct choice in Python.
Summary
I believe there are roughly three levels to knowledge and maturity in applied data structures.
- A novice does not know the complexity of the methods in his language of choice. This was me for years when I started programming. I had no idea that membership testing was fast for sets, but slow for lists. I had no qualms about pushing and popping the front-most element in a list. I didn’t know what big-\(\mathcal{O}\) notation was.
- An intermediate user of a language knows the big-\(\mathcal{O}\) complexity of the methods he uses. He is able to use the correct data structure to solve everyday problems.
- An advanced user has knowledge of data structures that are not in the standard library of his language of choice. With some work, he can augment existing data structures to solve novel problems. He is able to reason in detail about big-\(\mathcal{O}\) complexity of methods. He is also aware of the potential downsides of the big-\(\mathcal{O}\) model itself.
Notes and references
- Wikipedia has a list of data structures.
- VisuAlgo shows visualizations of data structures and algorithms. The website is created by Steven Halim, who also wrote the book Competitive Programming: The Lower Bound of Programming Contests.
- Data Structure Visualizations by professor Galles at the University of San Francisco.
- The Algorithms and Data Structures Cheatsheet accompanying the book Algorithms by Robert Sedgewick and Kevin Wayne.
- The book Introduction to Algorithms by Cormen et al.