21  DSA

21.1 Introduction

Data structures and algorithms, referred to as DSA, is an advanced topic in computer science.


At the macro level, data, and correspondingly the need for data analysis, is growing exponentially, with new use cases appearing in every domain. It is thus necessary to store and operate on data efficiently. Storage has become cheap, so efficiency/speed is now the more important concern.

Efficiency, in the sense of speed, is one of the most needed properties of a computer program. How storage and operations are managed is a major factor in a program's efficiency.

There are multiple ways to increase performance.

  • Hardware: faster CPUs, which are hitting physical limits
  • Program design: concurrency, multi-threading, multi-processing, async IO
  • DSA: has been instrumental in improving application performance

At the micro level, a program instructs the CPU to perform operations, each of which takes time.

  • load data into RAM
  • fetch data from RAM
  • operate on data
  • write data onto RAM

All these operations happen at very large scale even in simple programs, e.g. basic data analysis. Therefore, designing how the data is structured (data structures) and operated upon (algorithms) reduces the number of operations and improves performance.


Computer scientists study the topic rigorously, including the underlying mathematics, to come up with new data structures and algorithms. There are 2 main challenges

  • finding solutions, optimal in speed and design, to problems in the ever expanding field of data
  • proving and communicating the solutions

As a user, the problem is to match the use case with the right data structure, if one exists.

As an advanced user or developer, there might be a need to implement the required data structures.

Some languages, like Python, provide implementations of basic data structures by default, while others, like C, leave it to the user to implement them.

In Python, sequence types like list and tuple, and mapping types like set and dict, are end results of this field of study and are implemented and provided by default.

The choice of data structure determines which algorithms are available for the different operations it supports.

Choosing the data structure depends on the use case

  • how is the data to be structured?
  • which types of operations are to be performed?
  • how frequently?

This section of the course aims to introduce some of the core results in current use, along with the context and intuition behind the developments in the field. This should help

  • everyone
    • when using these concepts while writing programs
  • someone aiming to get into software development
    • by laying the foundation for advanced study of the topic

The study of data structures and algorithms can be summarized as below.

  • problem specification: interface
    • specification for structure of data
    • specification for operations needed to be performed on data
  • solution/implementation
    • data structure: implementation of how to store the data with given structure
    • algorithms: implementation of how to perform operations for a given data structure
  • measuring efficiency
    • asymptotic notation
    • computation model: word RAM model

21.1.1 Interface vs data structure

| Interface | Data structure & algorithms |
|---|---|
| An interface is a specification of the structure of data and of the operations that must be supported. | A data structure is an implementation that stores data with the specified structure. Algorithms for a given data structure are the implementations of the operations. |
| An interface is a specification. | A data structure with its algorithms is an implementation. |
| An interface is the problem definition. | A data structure with its algorithms is a solution. |
| There can be multiple data structures for the same interface, with different performance. | A data structure might implement multiple interfaces, with different performance. |

21.2 Measuring efficiency

There are 2 major costs involved while using a data structure

  • time taken to perform operations
  • memory (space) used on RAM

Measuring time and memory usage depends on 2 factors

  • computation model (hardware)
  • how the data structure uses this model

Typically time is more critical as memory is considered cheap in the context of present day computer hardware.

21.2.1 Computation model

The computational model abstracts hardware performance in general terms so that the behavior of different operations can be studied. Typically the word RAM model is used in theoretical studies; in practice behavior may differ.

The implication of the word RAM model's assumptions is that all elementary operations take a constant amount of time

  • read, write, delete at a RAM location
  • mathematical operations
  • logical operations

RAM stands for random access memory.

21.2.2 Measuring time

21.2.2.1 Context

Time complexity of an algorithm can be measured in different ways based on different contexts.

  • how is the time complexity measured?
    • measure the time to run the algorithm
    • measure the operations performed and how they grow with the size of the data structure
  • which time is captured?
    • best case time
    • average time
    • worst case time

Measuring time directly has the following disadvantages

  • it depends on the quality of the machine
    • tests can be run on various machines, but that is inefficient
    • testing a slow algorithm on a very fast machine leads to false conclusions
  • it depends on the size of the data
    • tests can be run on different data sizes, but it is difficult to test very large sizes

It is more useful to measure time in terms of the operations performed and then see how their count grows with the size of the data. This approach removes the dependency on the machine.

Using the best case time to decide time complexity can give false conclusions, as an algorithm that is fast on small or favorable inputs may perform slowly on large ones.

Using the average time gives more information, but it requires the probability distribution of inputs, i.e. which inputs occur more frequently. An accurate distribution is very difficult to obtain, hence this option is rarely feasible.

Worst case time is used because it guarantees a bound on performance and reduces noise when comparing the performance of different algorithms.

21.2.2.2 Solution

Time complexity is measured for worst case performance, using asymptotic notation for the number of operations performed as a function of the data size, usually denoted \(n\).

21.2.2.3 Asymptotic notation

Asymptotic notation is used to get an idea of asymptotic growth ignoring scaling factors and constants. More formally, asymptotic notation represents a set of functions.

Asymptotic growth describes how a function behaves as its underlying variable grows very large. For example, if the performance of an algorithm on a data structure takes the form \(T(n) = 3n^2\), the question is how \(T(n)\) grows as \(n \to \infty\), where \(T(n)\) is the time the algorithm takes and \(n\) is the size of the data structure.

Below is the set of asymptotic notations.

| Name | Notation | Description |
|---|---|---|
| Asymptotically Equal | \(f \sim g\) | same asymptotic growth, \(f(n)/g(n) \to 1\) |
| Big O | \(f \in O(g)\) | upper bound |
| Omega | \(f \in \Omega(g)\) | lower bound |
| Theta | \(f \in \Theta(g)\) | both upper and lower bound |
| Small o | \(f \in o(g)\) | strict upper bound, \(f\) grows strictly slower than \(g\) |
| Little omega | \(f \in \omega(g)\) | strict lower bound, \(f\) grows strictly faster than \(g\) |

Theta and Big O are the most commonly used notations in measuring complexity.

The notation is sometimes written \(f(n) = O(g(n))\), but it is more accurate to write \(f(n) \in O(g(n))\), as \(O(g(n))\) denotes a collection of functions.

Definition 21.1 (O Notation) A non-negative function \(f(n)\) is in \(O(g(n))\) if and only if there exist constants \(c > 0\) and \(n_0 \in \mathbb{R}\) such that \(f(n) \le c \cdot g(n) \quad \forall \ n > n_0\).

Definition 21.2 (Omega Notation) A non-negative function \(f(n)\) is in \(\Omega(g(n))\) if and only if there exist constants \(c > 0\) and \(n_0 \in \mathbb{R}\) such that \(f(n) \ge c \cdot g(n) \quad \forall \ n > n_0\).

Definition 21.3 (Theta Notation) A non-negative function \(f(n)\) is in \(\Theta(g(n))\) if and only if \(f(n) \in (O(g(n)) \cap \Omega(g(n)))\).

For example, if it is concluded that the number of operations in an algorithm is given by the function \(T(n) = 3n^2 + 4n + 2\), then \(T(n) \in \Theta(n^2)\). Theta notation allows stating complexity in simpler terms, ignoring scaling and constant factors.
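As a quick check of Definitions 21.1 and 21.2 for this example, with constants chosen purely for illustration:

\[
3n^2 + 4n + 2 \le 3n^2 + 4n^2 + 2n^2 = 9n^2 \quad \forall \ n \ge 1,
\]

so \(T(n) \in O(n^2)\) with \(c = 9\), \(n_0 = 1\); and \(T(n) \ge 3n^2\) gives \(T(n) \in \Omega(n^2)\) with \(c = 3\). Hence \(T(n) \in \Theta(n^2)\).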

The \(n^2\) term dominates as \(n\) grows larger, as the quick numeric check below illustrates.
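A minimal sketch in Python (variable names are illustrative) tabulating the ratio of \(T(n)\) to its leading term:

```python
# The ratio T(n) / (3 n^2) approaches 1 as n grows:
# lower-order terms become negligible.
for n in (10, 100, 1_000, 10_000):
    T = 3 * n**2 + 4 * n + 2
    print(f"n={n:>6}  T(n)={T:>12}  ratio={T / (3 * n**2):.5f}")
```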

In terms of the efficiency of algorithms, it is desirable to keep complexity low. Below are the common functions that complexities take, in increasing order. Anything below linear is considered good.

| Constant | Logarithmic | Linear | Quadratic | Polynomial | Exponential |
|---|---|---|---|---|---|
| \(\Theta(1)\) | \(\Theta(\log n)\) | \(\Theta(n)\) | \(\Theta(n^2)\) | \(\Theta(n^c)\) | \(2^{\Theta(n^c)}\) |

21.3 Interfaces

An interface provides the abstract, or theoretical, requirements for storing and operating on a certain type of data. For example, a tuple, list, or dictionary is chosen depending on the use case. The interface is the abstraction and generalization of such use cases; tuple, list, and dictionary are the implementations, i.e. the data structures.

Interfaces are also referred to as

  • Application programming interface (API): by developers
  • Abstract data type (ADT): by computer scientists

There are 2 main types of interfaces.

  • sequence
    • extrinsic order: order of items provided externally is preserved
    • special sequence types
      • stack: insert last and delete last (LIFO: last in first out)
      • queue: insert last and delete first (FIFO: first in first out)
      • deque (double-ended queue): insert and delete at both ends
  • mapping (also referred to as set, associative array)
    • intrinsic order: items are identified by unique keys; keys can optionally be kept in sorted order
    • types of mappings
      • set: items are the keys themselves
      • dictionary: items are associated with keys

21.3.1 Operations

Operations on major interfaces can be categorized into 3 basic types

  • container operations: operations on container itself
    • e.g. build, length
  • static operations: operations on elements that do not alter the container
    • e.g. search (query) operations on elements by index (sequence) or value (mapping)
  • dynamic operations: operations on elements that alter the container itself
    • e.g. inserting, deleting elements

Below tables summarize the common operation specifications for sequence and mapping interfaces.

21.3.1.1 Sequence interface

| Category | Method | Description |
|---|---|---|
| Container | build(I) | build a sequence from the items in iterable I |
| Container | len(x) | return the number of items |
| Static | iter_seq(x) | return the stored items one at a time, in sequence order |
| Static | get_at(i) | return the item at index i |
| Static | set_at(i, x) | replace the item at index i with x |
| Dynamic | insert_at(i, x) | add x as the i-th item |
| Dynamic | delete_at(i) | remove and return the i-th item |
| Dynamic | insert_first(x) | add x as the first item |
| Dynamic | delete_first() | remove and return the first item |
| Dynamic | insert_last(x) | add x as the last item |
| Dynamic | delete_last() | remove and return the last item |

21.3.1.2 Map interface

| Category | Method | Description |
|---|---|---|
| Container | build(I) | build a map from the items in iterable I |
| Container | len(x) | return the number of items |
| Static | find(k) | return the stored item with key k |
| Dynamic | insert(x) | add x with key x.key; replace the item if the key already exists |
| Dynamic | delete(k) | remove and return the item with key k |
| Order | iter_order() | return the stored items one by one, in key order |
| Order | find_min() | return the item with the smallest key |
| Order | find_max() | return the item with the largest key |
| Order | find_next(k) | return the item with the smallest key larger than k |
| Order | find_prev(k) | return the item with the largest key smaller than k |

21.4 Data structures

Data structures are the actual implementations of interfaces. There is no single data structure that solves all operations required in an interface efficiently. Different data structures solve a subset of operations efficiently.

Below are some basic categories of data structures with examples.

  • array based: static array, dynamic array
  • pointer based: linked list (with or without a tail pointer), doubly linked list (with or without a tail pointer)
  • hash table (array and pointer mixed)
  • tree based: binary search tree, AVL tree, …
  • graph based

21.4.1 Static array

Random access memory (RAM) can be viewed as one giant contiguous sequence of addressed memory slots.

A static array stores data in contiguous slots of equal size, the slot size depending on the size of the data elements. The order is extrinsic, i.e. the items are stored in the order provided.

The static array object then needs to store only the starting memory address and the length of the array.

  • static operations: constant time (\(\Theta(1)\))
    • memory address(i) = memory address(0) + i * item storage size
  • dynamic operations: linear time (\(\Theta(n)\))

Python tuples are approximately static arrays; one main difference is that tuples are immutable, i.e. modification operations are not supported and a new object is created on modification.

Below are tables showing the performance of a static array for sequence and map operations. Static arrays are efficient only for sequences with static operations.

For dynamic operations, in the worst case a new contiguous slot of free memory has to be found and the container rebuilt, hence these are \(O(n)\) operations.

Sequence interface operations - \(O(\cdot)\)

| Data Structure | build(X) | get_at(i), set_at(i) | insert_first(x), delete_first() | insert_last(x), delete_last() | insert_at(i, x), delete_at(i) |
|---|---|---|---|---|---|
| array | n | 1 | n | n | n |

Note: \(\cdot_{(a)}\) implies amortized, \(\cdot_{(e)}\) implies expected, h is the height of the tree

Map interface operations - \(O(\cdot)\)

| Data Structure | build(X) | find(k) | insert(k), delete(k) | find_min(), find_max() | find_prev(k), find_next(k) |
|---|---|---|---|---|---|
| array | n | n | n | n | n |

Note: \(\cdot_{(a)}\) implies amortized, \(\cdot_{(e)}\) implies expected, h is the height of the tree
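To make the constant-time index arithmetic and the linear-time rebuild concrete, here is a minimal static-array sketch in Python (illustrative only; Python has no true static array, and the class and method names simply follow the sequence interface above):

```python
class StaticArray:
    """Fixed-size array sketch; the size is chosen at build time."""

    def __init__(self, n):
        self.data = [None] * n           # emulates n contiguous slots

    def get_at(self, i):                 # O(1): address(0) + i * slot size
        return self.data[i]

    def set_at(self, i, x):              # O(1)
        self.data[i] = x

    def insert_at(self, i, x):           # O(n): allocate a larger array, copy
        out = StaticArray(len(self.data) + 1)
        out.data[:i] = self.data[:i]
        out.data[i] = x
        out.data[i + 1:] = self.data[i:]
        return out
```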

21.4.2 Linked list

21.4.2.1 Singly linked list without tail

  • random (non-contiguous) memory slots are allotted
  • the object stores
    • the head memory address
    • the length
  • each node stores
    • the item
    • the memory address of the next node

 

  • static operations: linear time (\(\Theta(n)\))
  • dynamic operations
    • insert/delete first: constant time (\(\Theta(1)\))
    • other: linear time (\(\Theta(n)\))

Sequence interface operations - \(O(\cdot)\)

| Data Structure | build(X) | get_at(i), set_at(i) | insert_first(x), delete_first() | insert_last(x), delete_last() | insert_at(i, x), delete_at(i) |
|---|---|---|---|---|---|
| linked list | n | n | 1 | n | n |

Note: \(\cdot_{(a)}\) implies amortized, \(\cdot_{(e)}\) implies expected, h is the height of the tree
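A minimal singly linked list sketch in Python (head pointer only; the names follow the interface above and are illustrative, not production code):

```python
class _Node:
    def __init__(self, item, nxt=None):
        self.item = item
        self.next = nxt

class LinkedList:
    def __init__(self):
        self.head = None
        self.size = 0

    def insert_first(self, x):           # O(1): relink the head
        self.head = _Node(x, self.head)
        self.size += 1

    def delete_first(self):              # O(1)
        node = self.head
        self.head = node.next
        self.size -= 1
        return node.item

    def get_at(self, i):                 # O(n): walk i links from the head
        node = self.head
        for _ in range(i):
            node = node.next
        return node.item
```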

21.4.2.2 Doubly linked list with tail

  • random (non-contiguous) memory slots are allotted
  • the object stores
    • the head memory address
    • the tail memory address
    • the length
  • each node stores
    • the item
    • the memory address of the previous node
    • the memory address of the next node
  • static operations: linear time (\(\Theta(n)\))
  • dynamic operations
    • insert/delete first: constant time (\(\Theta(1)\))
    • insert/delete last: constant time (\(\Theta(1)\))
    • other: linear time (\(\Theta(n)\))
  • good for implementing stacks and queues of unknown length
    • insert/delete first/last correspond to push/pop (stack) and enqueue/dequeue (queue), as the usage example after the table shows

Sequence interface operations - \(O(\cdot)\)

| Data Structure | build(X) | get_at(i), set_at(i) | insert_first(x), delete_first() | insert_last(x), delete_last() | insert_at(i, x), delete_at(i) |
|---|---|---|---|---|---|
| linked list | n | n | 1 | n | n |
| doubly linked list w/ tail | n | n | 1 | 1 | n |

Note: \(\cdot_{(a)}\) implies amortized, \(\cdot_{(e)}\) implies expected, h is the height of the tree
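Python's collections.deque provides exactly these constant-time end operations, making it a ready-made stack and queue:

```python
from collections import deque

d = deque()
d.append(1)         # insert_last: stack push / queue enqueue
d.append(2)
d.appendleft(0)     # insert_first
print(d.pop())      # delete_last:  stack pop (LIFO)     -> 2
print(d.popleft())  # delete_first: queue dequeue (FIFO) -> 0
```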

21.4.3 Dynamic array

  • Provides faster insert/delete first/last operations
  • Extra space is allotted at the beginning/end
  • The idea is to amortize the cost of operations by providing extra space
    • when the list is full, it is extended to a target fill ratio
    • when the list becomes sufficiently empty, it is shrunk to a target ratio
  • Python list is a dynamic array
    • inserting or deleting at the last position takes \(O(1)\) time (amortized over \(n\) operations), as the demonstration after the table shows
Sequence interface operations - \(O(\cdot)\)

| Data Structure | build(X) | get_at(i), set_at(i) | insert_first(x), delete_first() | insert_last(x), delete_last() | insert_at(i, x), delete_at(i) |
|---|---|---|---|---|---|
| dynamic array | n | 1 | \(1_{(a)}\) | \(1_{(a)}\) | n |

Note: \(\cdot_{(a)}\) implies amortized, \(\cdot_{(e)}\) implies expected, h is the height of the tree
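The over-allocation is observable in CPython (an implementation detail, shown only as a sketch): a list's memory footprint jumps occasionally as it grows, so most appends trigger no resize:

```python
import sys

xs = []
last = sys.getsizeof(xs)
for i in range(64):
    xs.append(i)
    size = sys.getsizeof(xs)
    if size != last:        # a re-allocation happened on this append
        print(f"len={len(xs):>3}  bytes={size}")
        last = size
```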

21.4.4 Hash tables

21.4.4.1 Overview

Hash tables are used to implement the set interface: sets and dictionaries.

  • Requirement: find items by key efficiently
    • arrays find an item by computing its memory address from the index
    • in the set interface the keys can be strings or other objects
    • searching an array by item value is not fast
  • Key idea: associate/map the keys with non-negative integers (indices)
    • integers map to memory addresses
    • given the key, find the integer associated with the key
    • given the integer, find the key associated with it
  • Keys can be numeric, strings or other objects
    • generally, they must be unique

21.4.4.2 Hashing numbers

Take the example of a phone book, where the requirements are

  • store names and phone numbers
  • search by phone number: hashing numbers
  • search by name: hashing strings

  • assume phone numbers are 15 digit numbers
  • \(u\) is the set of all possible 15 digit numbers
    • the size of the universe is \(|u| = 10^{15} = 1,000,000,000,000,000\)
    • it is not possible to store an array of this size
    • this makes a direct access array (\(|m| = |u|\)) impractical
  • let \(n\) be the set of numbers actually stored; \(|n|\) is comparatively small, e.g. \(|n| \le 1000\)
  • \(m\) is the reduced-size array that can actually be stored
    • choose \(|m|\) such that \(|n| \le |m|\)
    • the size of the array is \(|m|\); assume \(|m| = 1000\)
  • the hash function maps a key from \(u \to m\)
  • the hash table is the array of size \(|m|\) that stores items at their hashed indices
  • searching for any key takes \(\Theta(1)\) time, as the sketch below shows
    • convert the key to an array index using the hash function
    • access the returned index
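A minimal sketch of this scheme in Python (the hash function, names, and values are illustrative assumptions, not a canonical design):

```python
M = 1000                         # |m|: array size, chosen so that |n| <= |m|

def h(phone_number):
    return phone_number % M      # maps the 10^15 universe into M slots

table = [None] * M               # the hash table
key = 918_123_456_789_012        # a 15-digit number
table[h(key)] = ("Alice", key)   # insert: O(1)
print(table[h(key)])             # find:   O(1)
```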

21.4.4.3 Collisions and chaining

  • assume \(n = 4\) contacts have to be stored in an array of size \(|m| = 7\)
  • since \(|u| \gg |m|\) (\(|u| = 10^{15}\)) there will be collisions
    • collision: the hash function gives the same integer for multiple inputs, \(h(k_1) = h(k_2)\) where \(k_1 \ne k_2\)
    • this follows from the pigeonhole principle
  • there are 2 methods to resolve collisions (chaining is sketched below)
    • chaining: store collided items in a linked list at that slot
    • open addressing: store collided items in the next free slot
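A chaining sketch in Python with \(|m| = 7\) slots (plain lists stand in for the linked lists; all names are illustrative):

```python
M = 7
table = [[] for _ in range(M)]   # each slot holds a chain of (key, value) pairs

def insert(key, value):
    chain = table[key % M]
    for i, (k, _) in enumerate(chain):
        if k == key:             # key already present: replace
            chain[i] = (key, value)
            return
    chain.append((key, value))   # empty slot or collision: extend the chain

def find(key):
    for k, v in table[key % M]:
        if k == key:
            return v
    raise KeyError(key)
```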

21.4.4.4 Hashing strings

  • convert characters into integers using their ASCII or Unicode code points
  • there are many methods to combine the codes into a single integer, e.g. the rolling hash sketched below
  • the rest of the problem is then the same as hashing numbers
  • there are many good solutions available
  • to maintain the phone book by both name and number, use 2 hash tables: one hashing the numbers, the other hashing the strings
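One common combining method is a polynomial rolling hash; the base and table size below are illustrative choices, not canonical ones:

```python
def string_hash(s, m=1000, base=31):
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % m   # fold in each character's code point
    return h

print(string_hash("Alice"))            # a deterministic index in range(m)
```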

21.4.4.5 Key requirements for a good hash function

  • deterministic: always returns the same integer for a given key
  • fast to compute: \(\Theta(1)\)
  • keeps the array size (\(|m|\)) low to minimize space
  • keeps the number of collisions low

21.4.4.6 Use cases

  • Hash tables have revolutionized searching and can be found everywhere in technology
  • Sets and dictionaries are generally implemented using hash tables
    • Python dict and set
  • Language interpreters do fast lookups of keywords and names
  • Contact lists in phones
  • Web and other text searches use hashing
  • Applications use hash tables to look up settings from configuration files

21.4.5 Summary

21.4.5.1 Sequence interface

Sequence interface operations - \(O(\cdot)\)

| Data Structure | build(X) | get_at(i), set_at(i) | insert_first(x), delete_first() | insert_last(x), delete_last() | insert_at(i, x), delete_at(i) |
|---|---|---|---|---|---|
| array | n | 1 | n | n | n |
| linked list | n | n | 1 | n | n |
| doubly linked list w/ tail | n | n | 1 | 1 | n |
| dynamic array | n | 1 | \(1_{(a)}\) | \(1_{(a)}\) | n |
| binary tree | n | h | h | h | h |
| avl tree | n | log n | log n | log n | log n |

Note: \(\cdot_{(a)}\) implies amortized, \(\cdot_{(e)}\) implies expected, h is the height of the tree

21.4.5.2 Map interface

Map interface operations - \(O(\cdot)\)

| Data Structure | build(X) | find(k) | insert(k), delete(k) | find_min(), find_max() | find_prev(k), find_next(k) |
|---|---|---|---|---|---|
| array | n | n | n | n | n |
| sorted array | n log n | log n (binary search) | n | 1 | log n |
| direct access array | u | 1 | 1 | u | u |
| hash tables | \(n_{(e)}\) | \(1_{(e)}\) | \(1_{(e)(a)}\) | n | n |
| binary tree | n log n | h | h | h | h |
| avl tree | n log n | log n | log n | log n | log n |

Note: \(\cdot_{(a)}\) implies amortized, \(\cdot_{(e)}\) implies expected, h is the height of the tree

21.5 Algorithms

An algorithm is a procedure to solve a problem.

The study of algorithms is about finding correct and efficient procedures to solve problems.

Some classical algorithms are listed below along with their efficiency.

| Application | Name | Performance | Comments |
|---|---|---|---|
| Search | Linear search | \(O(n)\) | iterative |
| Search | Binary search | \(O(\log_2 n)\) | recursive; works on sorted arrays |
| Sort | Insertion sort | \(O(n^2)\) | in-place; stable |
| Sort | Selection sort | \(O(n^2)\) | in-place; not stable |
| Sort | Merge sort | \(O(n \log_2 n)\) | creates a new collection; recursive |

An algorithm is a solution for implementing an operation, such as search or sort, on a data structure.

An operation on a given data structure implementation can use any of the compatible algorithms, but not all will be efficient. For example, for sorting an array, merge sort is much faster than quadratic algorithms such as insertion or selection sort.
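As one starting point, here is a binary search sketch (written iteratively for brevity; the recursive form mentioned in the table halves the search range the same way):

```python
def binary_search(xs, target):
    """Return the index of target in the sorted list xs, or -1 if absent."""
    lo, hi = 0, len(xs) - 1
    while lo <= hi:                    # O(log n): range halves every step
        mid = (lo + hi) // 2
        if xs[mid] == target:
            return mid
        if xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(binary_search([2, 3, 5, 7, 11, 13], 7))   # -> 3
```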

The algorithms listed here can also serve as mini projects: research and implement each one. This is good practice in combining the basic building blocks from the previous sections.