21  DSA

21.1 Introduction

Data structures and algorithms, referred to as DSA, is an advanced topic in computer science.


At the macro level, data, and correspondingly the need for data analysis, is growing exponentially, with new use cases appearing in every domain. It is thus necessary to store and operate on data efficiently. Storage has become cheap, so efficiency/speed is now the more important concern.

Efficiency, in the sense of speed, is one of the most needed properties of a computer program. How storage and operations are managed is a major factor in a program's efficiency.

There are multiple ways to increase performance.

  • Hardware: faster CPUs, which are hitting physical limits
  • Program design: concurrency, multi-threading, multi-processing, async IO
  • DSA: has been instrumental in improving application performance

At the micro level, a program instructs the CPU to perform operations, each of which takes time.

  • load data into RAM
  • fetch data from RAM
  • operate on data
  • write data onto RAM

All these operations happen at very large scale even in simple programs, e.g. basic data analysis. Therefore, designing how the data is structured (data structures) and operated upon (algorithms) reduces the number of operations and improves performance.


Computer scientists study the topic rigorously, including the underlying mathematics, to come up with new data structures and algorithms. There are 2 main challenges

  • finding solutions, optimal in speed and design, to problems in the ever expanding field of data
  • proving and communicating the solutions

As a user, the problem is to match the use case with the right data structure, if one exists.

As an advanced user or developer, there might be a need to implement the required data structures.

Some languages, like Python, provide implementations of basic data structures by default, while others, like C, leave it to the user to implement them.

In Python, sequence types like list and tuple, and mapping types like set and dict, are end results of this field of study and are implemented and provided by default.

The choice of data structure determines which algorithms are available for the different operations it supports.

Choosing the data structure depends on the use case

  • how is the data to be structured?
  • which types of operations are to be performed?
  • how frequently?

This section of the course aims to introduce some of the core results in current use, along with the context and intuition behind the developments in the field. This should help

  • everyone
    • when using these concepts while writing programs
  • someone aiming to get into software development
    • by laying the foundation for advanced study of the topic

The study of data structures and algorithms can be summarized as below.

  • problem specification: interface
    • specification for structure of data
    • specification for operations needed to be performed on data
  • solution/implementation
    • data structure: implementation of how to store the data with given structure
    • algorithms: implementation of how to perform operations for a given data structure
  • measuring efficiency
    • asymptotic notation
    • computation model: word RAM model

21.1.1 Interface vs data structure

| Interface | Data structure & algorithms |
|---|---|
| An interface is a specification of the structure of data and of the operations that must be supported. | A data structure is an implementation that stores data with the specified structure. Algorithms for a given data structure are the implementations of the operations. |
| An interface is a specification. | A data structure with its algorithms is an implementation. |
| An interface is the problem definition. | A data structure with its algorithms is a solution. |
| There can be multiple data structures for the same interface, with different performance. | A data structure might implement multiple interfaces, with different performance. |

21.2 Measuring efficiency

There are 2 major costs involved while using a data structure

  • time taken to perform operations
  • memory (space) used on RAM

Measuring time and memory usage depends on 2 factors

  • computation model (hardware)
  • how the data structure uses this model

Typically time is more critical as memory is considered cheap in the context of present day computer hardware.

21.2.1 Computation model

The computational model abstracts hardware performance in general terms so that the behavior of different operations can be studied. Typically the word RAM model is used in theoretical studies; in practice behavior may differ.

The implication of the word RAM model's assumptions is that all elementary operations take a constant amount of time

  • read, write, delete at a RAM location
  • mathematical operations
  • logical operations

RAM stands for random access memory.

21.2.2 Measuring time

21.2.2.1 Context

Time complexity of an algorithm can be measured in different ways based on different contexts.

  • how is the time complexity measured?
    • measure the time to run the algorithm
    • measure the operations performed and how they grow with the size of the data structure
  • which time is captured?
    • best case time
    • average time
    • worst case time

Measuring time directly has the following disadvantages

  • it depends on the quality of the machine
    • tests can be run on various machines, but that is inefficient
    • testing a slow algorithm on a very fast machine leads to false conclusions
  • it depends on the size of the data
    • tests can be run on different data sizes, but it is difficult to test very large sizes

It is more useful to measure time in terms of the operations performed and then see how their count grows with the size of the data. This approach removes the dependency on the machine.

Using the best case time to decide time complexity can give false conclusions, as an algorithm that is fast on small or favorable inputs may perform slowly on large ones.

Using the average time gives more information, but it requires the probability distribution of inputs, i.e. which inputs occur more frequently. An accurate distribution is very difficult to obtain, hence this option is rarely feasible.

Worst case time is used because it guarantees a bound on performance and reduces noise when comparing the performance of different algorithms.

21.2.2.2 Solution

Time complexity is measured for worst case performance, using asymptotic notation for the number of operations performed as a function of the data size, usually denoted \(n\).

21.2.2.3 Asymptotic notation

Asymptotic notation is used to get an idea of asymptotic growth ignoring scaling factors and constants. More formally, asymptotic notation represents a set of functions.

Asymptotic growth describes how a function behaves as its underlying variable grows very large. For example, if the performance of an algorithm on a data structure takes the form \(T(n) = 3n^2\), the question is how \(T(n)\) grows as \(n \to \infty\), where \(T(n)\) is the time the algorithm takes and \(n\) is the size of the data structure.

Below is the set of asymptotic notations.

| Name | Notation | Description |
|---|---|---|
| Asymptotically Equal | \(f \sim g\) | same asymptotic growth, \(f(n)/g(n) \to 1\) |
| Big O | \(f \in O(g)\) | upper bound |
| Omega | \(f \in \Omega(g)\) | lower bound |
| Theta | \(f \in \Theta(g)\) | both upper and lower bound |
| Small o | \(f \in o(g)\) | strict upper bound, \(f\) grows strictly slower than \(g\) |
| Little omega | \(f \in \omega(g)\) | strict lower bound, \(f\) grows strictly faster than \(g\) |

Theta and Big O are the most commonly used notations in measuring complexity.

The notation is sometimes written \(f(n) = O(g(n))\), but it is more accurate to write \(f(n) \in O(g(n))\), as \(O(g(n))\) denotes a collection of functions.

Definition 21.1 (O Notation) A non-negative function \(f(n)\) is in \(O(g(n))\) if and only if there exist constants \(c > 0\) and \(n_0 \in \mathbb{R}\) such that \(f(n) \le c \cdot g(n) \quad \forall \ n > n_0\).

Definition 21.2 (Omega Notation) A non-negative function \(f(n)\) is in \(\Omega(g(n))\) if and only if there exist constants \(c > 0\) and \(n_0 \in \mathbb{R}\) such that \(f(n) \ge c \cdot g(n) \quad \forall \ n > n_0\).

Definition 21.3 (Theta Notation) A non-negative function \(f(n)\) is in \(\Theta(g(n))\) if and only if \(f(n) \in (O(g(n)) \cap \Omega(g(n)))\).

For example, if it is concluded that the number of operations in an algorithm is given by the function \(T(n) = 3n^2 + 4n + 2\), then \(T(n) \in \Theta(n^2)\). Theta notation allows stating complexity in simpler terms, ignoring scaling and constant factors.
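As a quick check of Definitions 21.1 and 21.2 for this example, with constants chosen purely for illustration:

\[
3n^2 + 4n + 2 \le 3n^2 + 4n^2 + 2n^2 = 9n^2 \quad \forall \ n \ge 1,
\]

so \(T(n) \in O(n^2)\) with \(c = 9\), \(n_0 = 1\); and \(T(n) \ge 3n^2\) gives \(T(n) \in \Omega(n^2)\) with \(c = 3\). Hence \(T(n) \in \Theta(n^2)\).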

The \(n^2\) term dominates as \(n\) grows larger, as the quick numeric check below illustrates.
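A minimal sketch in Python (variable names are illustrative) tabulating the ratio of \(T(n)\) to its leading term:

```python
# The ratio T(n) / (3 n^2) approaches 1 as n grows:
# lower-order terms become negligible.
for n in (10, 100, 1_000, 10_000):
    T = 3 * n**2 + 4 * n + 2
    print(f"n={n:>6}  T(n)={T:>12}  ratio={T / (3 * n**2):.5f}")
```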

In terms of the efficiency of algorithms, it is desirable to keep complexity low. Below are the common functions that complexities take, in increasing order. Anything below linear is considered good.

| Constant | Logarithmic | Linear | Quadratic | Polynomial | Exponential |
|---|---|---|---|---|---|
| \(\Theta(1)\) | \(\Theta(\log n)\) | \(\Theta(n)\) | \(\Theta(n^2)\) | \(\Theta(n^c)\) | \(2^{\Theta(n^c)}\) |

21.3 Interfaces

An interface provides the abstract, or theoretical, requirements for storing and operating on a certain type of data. For example, a tuple, list, or dictionary is chosen depending on the use case. The interface is the abstraction and generalization of such use cases; tuple, list, and dictionary are the implementations, i.e. the data structures.

Interfaces are also referred to as

  • Application programming interface (API): by developers
  • Abstract data type (ADT): by computer scientists

There are 2 main types of interfaces.

  • sequence
    • extrinsic order: order of items provided externally is preserved
    • special sequence types
      • stack: insert last and delete last (LIFO: last in first out)
      • queue: insert last and delete first (FIFO: first in first out)
      • deque (double-ended queue): insert and delete at both ends
  • mapping (also referred to as set, associative array)
    • intrinsic order: items are identified by unique keys; keys can optionally be kept in sorted order
    • types of mappings
      • set: items are the keys themselves
      • dictionary: items are associated with keys

21.3.1 Operations

Operations on major interfaces can be categorized into 3 basic types

  • container operations: operations on container itself
    • e.g. build, length
  • static operations: operations on elements that do not alter the container
    • e.g. search (query) operations on elements by index (sequence) or value (mapping)
  • dynamic operations: operations on elements that alter the container itself
    • e.g. inserting, deleting elements

Below tables summarize the common operation specifications for sequence and mapping interfaces.

21.3.1.1 Sequence interface

| Category | Method | Description |
|---|---|---|
| Container | build(I) | build a sequence from the items in iterable I |
| Container | len(x) | return the number of items |
| Static | iter_seq(x) | return the stored items one at a time, in sequence order |
| Static | get_at(i) | return the item at index i |
| Static | set_at(i, x) | replace the item at index i with x |
| Dynamic | insert_at(i, x) | add x as the i-th item |
| Dynamic | delete_at(i) | remove and return the i-th item |
| Dynamic | insert_first(x) | add x as the first item |
| Dynamic | delete_first() | remove and return the first item |
| Dynamic | insert_last(x) | add x as the last item |
| Dynamic | delete_last() | remove and return the last item |

21.3.1.2 Map interface

| Category | Method | Description |
|---|---|---|
| Container | build(I) | build a map from the items in iterable I |
| Container | len(x) | return the number of items |
| Static | find(k) | return the stored item with key k |
| Dynamic | insert(x) | add x with key x.key; replace the item if the key already exists |
| Dynamic | delete(k) | remove and return the item with key k |
| Order | iter_order() | return the stored items one by one, in key order |
| Order | find_min() | return the item with the smallest key |
| Order | find_max() | return the item with the largest key |
| Order | find_next(k) | return the item with the smallest key larger than k |
| Order | find_prev(k) | return the item with the largest key smaller than k |

21.4 Data structures

Data structures are the actual implementations of interfaces. There is no single data structure that solves all operations required in an interface efficiently. Different data structures solve a subset of operations efficiently.

Below are some basic categories of data structures with examples.

  • array based: static array, dynamic array
  • pointer based: linked list (with or without a tail pointer), doubly linked list (with or without a tail pointer)
  • hash table (array and pointer mixed)
  • tree based: binary search tree, AVL tree, …
  • graph based

21.4.1 Static array

Random access memory (RAM) can be viewed as one giant contiguous sequence of addressed memory slots.

A static array stores data in contiguous slots of equal size, the slot size depending on the size of the data elements. The order is extrinsic, i.e. the items are stored in the order provided.

The static array object then needs to store only the starting memory address and the length of the array.

  • static operations: constant time (\(\Theta(1)\))
    • memory address(i) = memory address(0) + i * item storage size
  • dynamic operations: linear time (\(\Theta(n)\))

Python tuples are approximately static arrays; one main difference is that tuples are immutable, i.e. modification operations are not supported and a new object is created on modification.

Below are tables showing the performance of a static array for sequence and map operations. Static arrays are efficient only for sequences with static operations.

For dynamic operations, in the worst case a new contiguous slot of free memory has to be found and the container rebuilt, hence these are \(O(n)\) operations.

Sequence interface operations - \(O(\cdot)\)

| Data Structure | build(X) | get_at(i), set_at(i) | insert_first(x), delete_first() | insert_last(x), delete_last() | insert_at(i, x), delete_at(i) |
|---|---|---|---|---|---|
| array | n | 1 | n | n | n |

Note: \(\cdot_{(a)}\) implies amortized, \(\cdot_{(e)}\) implies expected, h is the height of the tree

Map interface operations - \(O(\cdot)\)

| Data Structure | build(X) | find(k) | insert(k), delete(k) | find_min(), find_max() | find_prev(k), find_next(k) |
|---|---|---|---|---|---|
| array | n | n | n | n | n |

Note: \(\cdot_{(a)}\) implies amortized, \(\cdot_{(e)}\) implies expected, h is the height of the tree
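To make the constant-time index arithmetic and the linear-time rebuild concrete, here is a minimal static-array sketch in Python (illustrative only; Python has no true static array, and the class and method names simply follow the sequence interface above):

```python
class StaticArray:
    """Fixed-size array sketch; the size is chosen at build time."""

    def __init__(self, n):
        self.data = [None] * n           # emulates n contiguous slots

    def get_at(self, i):                 # O(1): address(0) + i * slot size
        return self.data[i]

    def set_at(self, i, x):              # O(1)
        self.data[i] = x

    def insert_at(self, i, x):           # O(n): allocate a larger array, copy
        out = StaticArray(len(self.data) + 1)
        out.data[:i] = self.data[:i]
        out.data[i] = x
        out.data[i + 1:] = self.data[i:]
        return out
```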

21.4.2 Linked list

21.4.2.1 Singly linked list without tail

  • random (non-contiguous) memory slots are allotted
  • the object stores
    • the head memory address
    • the length
  • each node stores
    • the item
    • the memory address of the next node

 

  • static operations: linear time (\(\Theta(n)\))
  • dynamic operations
    • insert/delete first: constant time (\(\Theta(1)\))
    • other: linear time (\(\Theta(n)\))

Sequence interface operations - \(O(\cdot)\)

| Data Structure | build(X) | get_at(i), set_at(i) | insert_first(x), delete_first() | insert_last(x), delete_last() | insert_at(i, x), delete_at(i) |
|---|---|---|---|---|---|
| linked list | n | n | 1 | n | n |

Note: \(\cdot_{(a)}\) implies amortized, \(\cdot_{(e)}\) implies expected, h is the height of the tree
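A minimal singly linked list sketch in Python (head pointer only; the names follow the interface above and are illustrative, not production code):

```python
class _Node:
    def __init__(self, item, nxt=None):
        self.item = item
        self.next = nxt

class LinkedList:
    def __init__(self):
        self.head = None
        self.size = 0

    def insert_first(self, x):           # O(1): relink the head
        self.head = _Node(x, self.head)
        self.size += 1

    def delete_first(self):              # O(1)
        node = self.head
        self.head = node.next
        self.size -= 1
        return node.item

    def get_at(self, i):                 # O(n): walk i links from the head
        node = self.head
        for _ in range(i):
            node = node.next
        return node.item
```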

21.4.2.2 Doubly linked list with tail

  • random (non-contiguous) memory slots are allotted
  • the object stores
    • the head memory address
    • the tail memory address
    • the length
  • each node stores
    • the item
    • the memory address of the previous node
    • the memory address of the next node
  • static operations: linear time (\(\Theta(n)\))
  • dynamic operations
    • insert/delete first: constant time (\(\Theta(1)\))
    • insert/delete last: constant time (\(\Theta(1)\))
    • other: linear time (\(\Theta(n)\))
  • good for implementing stacks and queues of unknown length
    • insert/delete first/last correspond to push/pop (stack) and enqueue/dequeue (queue), as the usage example after the table shows

Sequence interface operations - \(O(\cdot)\)

| Data Structure | build(X) | get_at(i), set_at(i) | insert_first(x), delete_first() | insert_last(x), delete_last() | insert_at(i, x), delete_at(i) |
|---|---|---|---|---|---|
| linked list | n | n | 1 | n | n |
| doubly linked list w/ tail | n | n | 1 | 1 | n |

Note: \(\cdot_{(a)}\) implies amortized, \(\cdot_{(e)}\) implies expected, h is the height of the tree
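Python's collections.deque provides exactly these constant-time end operations, making it a ready-made stack and queue:

```python
from collections import deque

d = deque()
d.append(1)         # insert_last: stack push / queue enqueue
d.append(2)
d.appendleft(0)     # insert_first
print(d.pop())      # delete_last:  stack pop (LIFO)     -> 2
print(d.popleft())  # delete_first: queue dequeue (FIFO) -> 0
```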

21.4.3 Dynamic array

  • Provides faster insert/delete first/last operations
  • Extra space is allotted at the beginning/end
  • The idea is to amortize the cost of operations by providing extra space
    • when the list is full, it is extended to a target fill ratio
    • when the list becomes sufficiently empty, it is shrunk to a target ratio
  • Python list is a dynamic array
    • inserting or deleting at the last position takes \(O(1)\) time (amortized over \(n\) operations), as the demonstration after the table shows
Sequence interface operations - \(O(\cdot)\)

| Data Structure | build(X) | get_at(i), set_at(i) | insert_first(x), delete_first() | insert_last(x), delete_last() | insert_at(i, x), delete_at(i) |
|---|---|---|---|---|---|
| dynamic array | n | 1 | \(1_{(a)}\) | \(1_{(a)}\) | n |

Note: \(\cdot_{(a)}\) implies amortized, \(\cdot_{(e)}\) implies expected, h is the height of the tree
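The over-allocation is observable in CPython (an implementation detail, shown only as a sketch): a list's memory footprint jumps occasionally as it grows, so most appends trigger no resize:

```python
import sys

xs = []
last = sys.getsizeof(xs)
for i in range(64):
    xs.append(i)
    size = sys.getsizeof(xs)
    if size != last:        # a re-allocation happened on this append
        print(f"len={len(xs):>3}  bytes={size}")
        last = size
```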

21.4.4 Hash tables

21.4.4.1 Overview

Hash tables are used to implement the set interface: sets and dictionaries.

  • Requirement: find items by key efficiently
    • arrays find an item by computing its memory address from the index
    • in the set interface the keys can be strings or other objects
    • searching an array by item value is not fast
  • Key idea: associate/map the keys with non-negative integers (indices)
    • integers map to memory addresses
    • given the key, find the integer associated with the key
    • given the integer, find the key associated with it
  • Keys can be numeric, strings or other objects
    • generally, they must be unique

21.4.4.2 Hashing numbers

Take the example of a phone book, where the requirements are

  • store names and phone numbers
  • search by phone number: hashing numbers
  • search by name: hashing strings

  • assume phone numbers are 15 digit numbers
  • \(u\) is the set of all possible 15 digit numbers
    • the size of the universe is \(|u| = 10^{15} = 1,000,000,000,000,000\)
    • it is not possible to store an array of this size
    • this makes a direct access array (\(|m| = |u|\)) impractical
  • let \(n\) be the set of numbers actually stored; \(|n|\) is comparatively small, e.g. \(|n| \le 1000\)
  • \(m\) is the reduced-size array that can actually be stored
    • choose \(|m|\) such that \(|n| \le |m|\)
    • the size of the array is \(|m|\); assume \(|m| = 1000\)
  • the hash function maps a key from \(u \to m\)
  • the hash table is the array of size \(|m|\) that stores items at their hashed indices
  • searching for any key takes \(\Theta(1)\) time, as the sketch below shows
    • convert the key to an array index using the hash function
    • access the returned index
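A minimal sketch of this scheme in Python (the hash function, names, and values are illustrative assumptions, not a canonical design):

```python
M = 1000                         # |m|: array size, chosen so that |n| <= |m|

def h(phone_number):
    return phone_number % M      # maps the 10^15 universe into M slots

table = [None] * M               # the hash table
key = 918_123_456_789_012        # a 15-digit number
table[h(key)] = ("Alice", key)   # insert: O(1)
print(table[h(key)])             # find:   O(1)
```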

21.4.4.3 Collisions and chaining

  • assume \(n = 4\) contacts have to be stored in an array of size \(|m| = 7\)
  • since \(|u| \gg |m|\) (\(|u| = 10^{15}\)) there will be collisions
    • collision: the hash function gives the same integer for multiple inputs, \(h(k_1) = h(k_2)\) where \(k_1 \ne k_2\)
    • this follows from the pigeonhole principle
  • there are 2 methods to resolve collisions (chaining is sketched below)
    • chaining: store collided items in a linked list at that slot
    • open addressing: store collided items in the next free slot
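A chaining sketch in Python with \(|m| = 7\) slots (plain lists stand in for the linked lists; all names are illustrative):

```python
M = 7
table = [[] for _ in range(M)]   # each slot holds a chain of (key, value) pairs

def insert(key, value):
    chain = table[key % M]
    for i, (k, _) in enumerate(chain):
        if k == key:             # key already present: replace
            chain[i] = (key, value)
            return
    chain.append((key, value))   # empty slot or collision: extend the chain

def find(key):
    for k, v in table[key % M]:
        if k == key:
            return v
    raise KeyError(key)
```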

21.4.4.4 Hashing strings

  • convert characters into integers using their ASCII or Unicode code points
  • there are many methods to combine the codes into a single integer, e.g. the rolling hash sketched below
  • the rest of the problem is then the same as hashing numbers
  • there are many good solutions available
  • to maintain the phone book by both name and number, use 2 hash tables: one hashing the numbers, the other hashing the strings
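One common combining method is a polynomial rolling hash; the base and table size below are illustrative choices, not canonical ones:

```python
def string_hash(s, m=1000, base=31):
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % m   # fold in each character's code point
    return h

print(string_hash("Alice"))            # a deterministic index in range(m)
```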

21.4.4.5 Key requirements for a good hash function

  • deterministic: always returns the same integer for a given key
  • fast to compute: \(\Theta(1)\)
  • keeps the array size (\(|m|\)) low to minimize space
  • keeps the number of collisions low

21.4.4.6 Use cases

  • Hash tables have revolutionized searching and can be found everywhere in technology
  • Sets and dictionaries are generally implemented using hash tables
    • Python dict and set
  • Language interpreters do fast lookups of keywords and names
  • Contact lists in phones
  • Web and other text searches use hashing
  • Applications use hash tables to look up settings from configuration files

21.4.5 Summary

21.4.5.1 Sequence interface

Sequence interface operations - \(O(\cdot)\)

| Data Structure | build(X) | get_at(i), set_at(i) | insert_first(x), delete_first() | insert_last(x), delete_last() | insert_at(i, x), delete_at(i) |
|---|---|---|---|---|---|
| array | n | 1 | n | n | n |
| linked list | n | n | 1 | n | n |
| doubly linked list w/ tail | n | n | 1 | 1 | n |
| dynamic array | n | 1 | \(1_{(a)}\) | \(1_{(a)}\) | n |
| binary tree | n | h | h | h | h |
| avl tree | n | log n | log n | log n | log n |

Note: \(\cdot_{(a)}\) implies amortized, \(\cdot_{(e)}\) implies expected, h is the height of the tree

21.4.5.2 Map interface

Map interface operations - \(O(\cdot)\)

| Data Structure | build(X) | find(k) | insert(k), delete(k) | find_min(), find_max() | find_prev(k), find_next(k) |
|---|---|---|---|---|---|
| array | n | n | n | n | n |
| sorted array | n log n | log n (binary search) | n | 1 | log n |
| direct access array | u | 1 | 1 | u | u |
| hash tables | \(n_{(e)}\) | \(1_{(e)}\) | \(1_{(e)(a)}\) | n | n |
| binary tree | n log n | h | h | h | h |
| avl tree | n log n | log n | log n | log n | log n |

Note: \(\cdot_{(a)}\) implies amortized, \(\cdot_{(e)}\) implies expected, h is the height of the tree

21.5 Algorithms

An algorithm is a procedure to solve a problem.

The study of algorithms is about finding correct and efficient procedures to solve problems.

Some classical algorithms are listed below along with their efficiency.

| Application | Name | Performance | Comments |
|---|---|---|---|
| Search | Linear search | \(O(n)\) | iterative |
| Search | Binary search | \(O(\log_2 n)\) | recursive; works on sorted arrays |
| Sort | Insertion sort | \(O(n^2)\) | in-place; stable |
| Sort | Selection sort | \(O(n^2)\) | in-place; not stable |
| Sort | Merge sort | \(O(n \log_2 n)\) | creates a new collection; recursive |

An algorithm is a solution for implementing an operation, such as search or sort, on a data structure.

An operation on a given data structure implementation can use any of the compatible algorithms, but not all will be efficient. For example, for sorting an array, merge sort is much faster than quadratic algorithms such as insertion or selection sort.
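As one starting point, here is a binary search sketch (written iteratively for brevity; the recursive form mentioned in the table halves the search range the same way):

```python
def binary_search(xs, target):
    """Return the index of target in the sorted list xs, or -1 if absent."""
    lo, hi = 0, len(xs) - 1
    while lo <= hi:                    # O(log n): range halves every step
        mid = (lo + hi) // 2
        if xs[mid] == target:
            return mid
        if xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(binary_search([2, 3, 5, 7, 11, 13], 7))   # -> 3
```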

The algorithms listed here can also serve as mini projects: research and implement each one. This is good practice in combining the basic building blocks from the previous sections.