
Data Structures, Big O and You

Data is at the heart of any non-trivial application, and while Ruby provides excellent Array and Hash classes with some serious syntax support, the standard library offers little beyond them.

When you choose a data structure that goes beyond the standard Hash and Array, the main consideration, and often the main motivation for switching, is the efficiency of the data structure. In this article, I'll talk about how we can compare data structures in terms of efficiency.

We'll use linked lists as an example.

Array Overview

For the purposes of this article, when I talk about an array, I'm referring to the Ruby implementation of an array, which is somewhat different from the implementation in many lower-level programming languages. First, it has no restrictions on either the type or the amount of data that can be inserted. Second, it's dynamically resizable, which means that operations that grow the array no longer (or rarely) require the expensive step of copying its contents into a new, larger array.
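
A quick illustration of both properties (the values here are arbitrary):

a = []            # no size or element type declared up front
a << 42           # append an Integer
a << "forty-two"  # ...and a String, in the same array
a << [:nested]    # ...and even another array
a.length          #=> 3; the array grew on demand as we appended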

All examples and timings in this article use Ruby 1.9.3, whose Array class is implemented natively (in C).

Complexity Classes


To understand why you'd prefer one data structure over another, it helps to be able to measure and categorize their performance as their input grows. We place algorithms with similar costs into complexity classes so we can compare them. For example, some algorithms take the same amount of time to run regardless of the input, some grow linearly with their input, and some grow exponentially.

When we talk about an algorithm's "running time," we rarely compare two algorithms using raw benchmarks, which are subject to differences in things like programming language, machine speed, and garbage collector activity. Instead, computer scientists look at the number of meaningful operations an algorithm must perform to complete.

What counts as a meaningful operation depends on the problem domain. When it comes to sorting algorithms, for example, a meaningful operation might be comparing one value to another.
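
For instance, we can make those comparisons visible by passing a block to Ruby's sort; the counter here is our own illustrative addition, not part of any sorting API:

comparisons = 0
[3, 1, 4, 1, 5, 9, 2, 6].sort do |a, b|
  comparisons += 1 # one meaningful operation per comparison
  a <=> b
end
puts comparisons   # grows roughly in proportion to n log n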

Algorithms also tend to have a constant cost: the part of the total cost that doesn't depend on the size of the input. When you're thinking about large data sets, you can usually ignore that cost, because for any non-trivial input the growth rate will quickly dwarf even the largest fixed cost.

For example, suppose an algorithm has a constant cost of 10 (say, in seconds) but takes input² time to run. For an input of 2, the constant makes up a significant portion of the 14 seconds it takes to execute. But what if your input is, say, 1 billion? Well, you'd have needed to start the process some time before the Big Bang, and in that context the constant becomes negligible.
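
As a toy sketch in Ruby (the cost model is hypothetical, chosen only to mirror the numbers above):

def total_seconds(n, constant = 10)
  constant + n**2 # fixed setup cost plus quadratic running time
end

total_seconds(2)             #=> 14; the constant is most of the cost
total_seconds(1_000_000_000) #=> 10**18 + 10; the constant is noise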

There are some very handy notations we can use when comparing algorithms. Big O notation expresses the worst-case behavior of a given algorithm. We might say that algorithm x is O(n), meaning that in the worst case, x performs 15 significant operations for an input of size 15. If x were instead O(n²), we'd be looking at 225 significant operations for the same input.
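
To make that concrete, here's a sketch of a linear search with a hypothetical operation counter bolted on. In the worst case (the target is last, or missing entirely) it performs n comparisons, which is exactly what O(n) captures:

def linear_search(array, target)
  checks = 0
  array.each do |item|
    checks += 1 # one significant operation
    return [item, checks] if item == target
  end
  [nil, checks]
end

p linear_search((1..15).to_a, 15) #=> [15, 15]: worst case, 15 checks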

Linked Lists


Linked lists, along with traditional arrays, are among the simplest data structures around. While an array is a contiguous set of memory slots with values inside them, a linked list is a set of nodes where each node holds a pointer to the next node.


Think of a 10-drawer dresser. With an array, the drawers are numbered 0 to 9, with some value inside each one. To find the item in drawer 2, you simply open that drawer and grab your item.

With a linked list, however, the drawer numbers don't matter. They could be random numbers, colors, or pictures of cats. All you know up front is which drawer is first in the list. To find the second item, you must open the first drawer and read the note inside telling you where the next drawer in the sequence is.

If we iterate through either the array or the linked list from start to finish, it takes O(n) to traverse the whole collection: for each item in the list, we open exactly one drawer.

The difference shows up when we try to find the item at index i. With an array, we can simply open drawer i and be done with it, giving random access a worst-case complexity of O(1). With a linked list, however, we have to work through the drawers one by one, following the pointers until we find our item. If the item we're looking for is at the end of the list, we have to perform n drawer opens, giving a complexity of O(n).


You may ask: why would we use a linked list, then? Well, when we need to insert an item somewhere in the middle of the list (and we already know which node it should come after), we can simply point the previous node at our new item, and point our new item at the node that previously followed it. That means the insertion takes no more than one drawer open, giving a complexity of O(1).

With an array, however, if we want to insert a new element at some position, we have to shift every element after that position down by one drawer. That means up to n drawer opens, giving a complexity of O(n).
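
You can see the shifting behavior (though not its cost) with Ruby's built-in Array#insert:

a = [:a, :b, :c, :d]
a.insert(1, :new) #=> [:a, :new, :b, :c, :d]
# everything from index 1 onward had to shift one slot to the right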

Let's look at a code sample with a simple linked list implementation, plus some benchmarks to see how it performs.

class LinkedListNode
  attr_accessor :next, :value

  def initialize(value)
    @value = value
  end
end

class LinkedList
  attr_reader :first, :last

  def initialize(starting_values = [])
    starting_values.each { |value| add(value) }
  end

  # Append to the end of the list: O(1), since we keep a tail pointer.
  def add(item)
    set_last_as LinkedListNode.new(item)
  end

  # Insert after the node at the given index: O(n), because of the lookup.
  def add_at(index, what)
    add_after(at(index), what)
  end

  # Insert after a node we already have in hand: O(1).
  def add_after(item, what)
    what = LinkedListNode.new(what)
    what.next = item.next if item.next
    item.next = what
    @last = what if item == @last # keep the tail pointer current
  end

  # Walk from the head, following pointers until we reach index: O(n).
  def at(index)
    i = 0
    each { |node| return node if i == index; i += 1 }
  end

  def each
    node = first
    while node
      yield node
      node = node.next
    end
  end

  def each_value
    each { |node| yield node.value }
  end

  protected

  def set_last_as(node)
    if first
      @last.next = node
      @last = node
    else
      @first = @last = node
    end
  end
end
require "benchmark"

puts 'Adding to the end'

puts 'Array'
time = Benchmark.measure do
  a = Array.new
  (0..10000).each do |i|
    a << i
  end
end
puts time

puts 'LinkedList'
time = Benchmark.measure do
  a = LinkedList.new
  (0..10000).each do |i|
    a.add i
  end
end
puts time

puts 'Adding lots of items to a few indexes'

puts 'Array'
a = (1..10000).to_a
time = Benchmark.measure do
  10.times do
    el = rand(1000)
    10000.times do
      a.insert(el, "hello world")
    end
  end
end
puts time

puts 'LinkedList'
a = LinkedList.new((1..10000).to_a)
time = Benchmark.measure do
  10.times do
    el = a.at(rand(1000))
    10000.times do
      a.add_after(el, "hello world")
    end
  end
end
puts time
#=> Output
#Adding to the end
#Array
# 0.000000 0.000000 0.000000 ( 0.003510)
#LinkedList
# 0.020000 0.000000 0.020000 ( 0.021374)
#Adding lots of items to a few indexes
#Array
# 3.420000 0.020000 3.440000 ( 3.441180)
#LinkedList
# 0.260000 0.000000 0.260000 ( 0.263655)

In the first test, we build each collection by appending to the end; both operations have O(n) total complexity. You'll notice that the Array version comfortably outperforms the linked list. That's because, as I mentioned earlier, we're comparing Ruby 1.9.3's native Array class against a pure-Ruby implementation that carries some performance overhead on every operation.

In the second test, we pick 10 random positions in each collection and add 10,000 items after each of those points. This example shows the strength of linked lists, and how the difference in complexity class for this operation translates into real differences in the benchmarks.

Wrapping Up

Hopefully, this article has given you a reasonable, if concise, introduction to the world of algorithmic complexity and data structures, and to how an operation's complexity class affects real-world performance. If you want to write applications that work with large amounts of data, it's vital to determine which operations you'll perform on that data most often, and to choose a data structure that performs those operations well. For example, our LinkedList class would be a great candidate if you plan to do a lot of inserting into the middle of a list rather than a lot of random access.
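
As a parting sketch, here's that pattern with the LinkedList class from above (the values are purely illustrative):

list = LinkedList.new([1, 2, 3])
node = list.at(1)                # O(n) to find the node once...
list.add_after(node, :surprise)  # ...but O(1) for each insertion after it
list.each_value { |v| puts v }   # prints 1, 2, surprise, 3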

If you have any questions, or if I've gotten something wrong, I'll be hanging out in the comments section.