I felt compelled to attempt an improvement on the existing resources on this topic because It took me a lot of time and frustration to find a resource that I felt explained it to a satisfactory and digestible degree.
A hash table is worth our time because it provides some alternatives to the humble array. In many cases, A hash table requires less time complexity than an array does.
In a typical scenario, adding a new item to a hash table is done in constant time relative to the data structure, but in polynomial time (or less) relative to the size of the key of the data. Considering the key of the data is usually short, we can comfortably call this constant time.
For those unfamiliar with Big O, I’ll provide a brief description right now since it is important for the motivation of this data structure.
Constant time or Big O([some number]) is an operation that takes a constant amount of time irrespective of the size of the data.
For example, if we had a pickup truck that could hold 2721.55 kilograms (or 3 tons), it would always take a fixed amount of time to transport anything less than 3 tons to the barn, because the size of the thing we are transporting is irrelevant to the time it takes to drive our truck to the barn. Driving to the barn always takes 3 minutes.
Polynomial-time, or Big O([n, where n is the size of the input]) is an operation that requires some time equivalent to some constant time multiplied by the size of the input.
For example, let’s say we have 100 watermelons, and we let our granddad borrow the truck for the day. We need to move the 100 watermelons to the shed, which is a 20-second walk. In order to complete this task, we will have to pick up a watermelon and walk to the shed 100 times. Therefore, the time to move out watermelons is in Polynomial-time or Big O(n * (20 seconds)) where n is the size of the input, which would be 100 in our watermelon scenario. This task would take 2000 seconds, or approximately 33 minutes assuming we don’t stop for refreshments.
Superpolynomial time, or Big O([some exponentiation relevant to n]) is some amount of time that is greater than a sum of constants.
A good way to think of this is as follows:
If we only have two watermelons, then add a third watermelon, the time required for the job increases by 20 seconds. If we had 100 watermelons and then add the 101st watermelon, the time required for the job still only increases by 20 seconds. This is polynomial-time because there is some constant amount of time that the job is increasing by for each unit of n.
A superpolynomial job requires progressively more time per input added. For example, if the 3rd watermelon added 20 seconds to the job, and the 101st watermelon added 5 days to the job, this would be a super polynomial task, but this is not the case.
Superpolynomial jobs are common in combinatorics. For example, if we wanted to make a computer that could play sudoku successfully, it would have to check a progressively higher number of collisions per cell added to the sudoku game, therefore every element of n would increase the time of the job by some amount higher than the previous element of n.
Moving on with our hash table motivation now that we understand the basics of time complexity:
A hash table is essentially an array with a hashing layer that makes certain operations more time-efficient.
Sedgewick, Robert, Kevin Wayne Section 3.4 of Algorithms 4th edition —
If keys are small integers, we can use an array to implement a symbol table
For those unaware, an array is an ordered list. It’s one of the most simple organizational conventions on the planet. For example, Let’s say we have a library that can hold 50,000 books. We put a number on each position in the library in a sequential manner like 1, 2, 3, 4, 5…50,000. Each book is assigned a number and is placed in its position. This is an array. If we want to read the 30,000th book we simply walk to the 30,000th position and select our book. This could be done in O(1) time assuming we know the 30,000th book is the one we want.
The problem with this is there are costly operations if we don’t know exactly how large the library will be or if the size of the library fluctuates, or if we don’t keep all of the book titles in our memory for the purpose of finding them.
If the librarian doesn’t know that harry potter is in the 43,485th position in the library the customer will have to go search through the library one book at a time which would cost them up to O(n) time where n is the number of books in the library.
Additionally, if our library is already full, and we need a book to be in the 102nd position we will have to allocate a 50,001st position and move the existing 102nd book to the 50,001st position. Or, if the positions of the books are important for some reason we might even have to push all books after 102 over one position. This would cost us up to O(n) time as well.
For example, if we had all of the national geographic magazines organized in publishing order, their order would be important. If national geographic published a special 2.5th episode edition, we would want that positioned after the 2nd episode, but the position of the other magazines would need to be preserved.
If we needed to remove a book from our library, and it was important to us to prevent our library from looking sparse, we would have to move all the other books down to fill in the gap. This would cost us up to O(n) time as well.
The primary motivation behind a hash table is it is capable of improving the cost in time complexity for some actions in some cases.
Let’s begin with potentially the greatest improvement on an array. Our librarian can’t possibly remember the position of all of the books, so rather than depending on the librarian to remember positions in the library we can make a hash function that maps the title of the book to a position in the library.
One such way to do this is to convert the letters of the book title into numbers and add them together. This is similar to an elementary checksum hash function.
There are 26 letters in the alphabet, so we can map them to the numbers in the set of 1…26 respectively. If a book had the title abc it would be mapped to position (1 + 2 + 3). Position 6.
This way, as long as we maintain the instructions to map the title name to a position, the customers will always be able to find the exact position of any book they know the title of in something close to O(1) time.
Similarly, as long as the array we’ve allocated to accept our hash-generated positions is large enough to incase the potential output values of our hash function, inserting and deleting from our hash table can be something very close to O(1) most of the time as well.
Well, This is a grand improvement indeed. It’s a data structure that we can delete, insert, and search in O(1) time assuming nothing weird happens.
However, if one book had the title abc and another book had the title cba they would both be positioned at 6. We can’t have two books at position 6, since each position can only hold one book. This is called a Collision.
One option is to place the second book mapped to position 6 at any open position near 6 that is still open. This is called Linear Probing. A hash table that uses linear probing might require some number higher than O(1) to find a book, but at least the hash function would tell you the general area where you can find the book you’re looking for.
Another option is to create a bucket for every group of positions and have the hash function map to the bucket rather than the exact position. A real-life example of this is how books are sorted alphabetically, or categorically or both in real libraries. So in real life, if a book began with the Letter “A” we know A is first in the alphabet and should map to the first section of the library. There are a bunch of books that start with the letter “A”, therefore, have collided, but at least our real-life “hash” function has directed us to the right section, which reduces our search operation from O(n) most of the time.
In programming we would just make every position of the hash table a smaller array, this serves the exact same function as sorting our library alphabetically. In effect, each smaller array is a “section” of our library.
But alas, there are problems with our alphabetically charged “hash” function. For example, the continents s, c, p, d are by far the most common starting letters in words. Therefore the buckets associated with these more commonly appearing letters will cause an asymmetrical distribution.
In hash tables, an asymmetrical distribution is bad because the more severe the asymmetry, the closer our search action will approach linear time complexity or O(n). The top ~10 most common constants in the English language could account for 80% of all book titles, therefore the larger our structure gets the greater the negative effect of asymmetrical distribution will be pronounced.
If the time complexity of searching in our structure is similar to an array, why use it all? No, we can do better.
Let this lead us into a brief description of Hash functions
A hash function is any method that translate some data into some smaller data. Most commonly it’s from some string into some number.
What I’ve described so far are elementary examples of “hash functions” that use some combination of tools to map a data’s key to the position containing its value. In our library, for example, let the key be the title of the book and the data be the contents of the book.
Well, How many hash functions are there? Trick question. There is an infinite number of hash functions. There are thousands commonly used and hundreds that have stood the test of time. Some are better for larger keys, some are better for certain types of data so on and so forth. Life is complex.
Another potential problem is since our hash table is essentially an allocated array with a hashing layer to map titles to positions… What if we have an amazing hash function with symmetrical distribution, and minimal collisions, and we want to maintain minimally sized “buckets”, and we run out of space in our hash table to meet all of these requirements because the amount of key-value pairs in the table exceeds the length of the allocated array?
In other words, we have a well-organized and distributed library with a maximum capacity of 50,000 books which is already holding 50,000 books and we’ve just received a shipment of 10,000 more books.
Well… First, we have to allocate more memory. In other words, install more shelves. However, Our hash function has been built to distribute symmetrically into a pool of 50,000. If we run the 10,000 new books through the hashing function it will not distribute evenly into the empty shelving, again this creates asymmetry.
One potential solution is to make a new hashing function that will evenly distribute the 10,000 new books strictly into the scope of the empty shelves, but the trade-off is any organizational scheme used in the rest of the library will be degraded.
For example, if we had used the alphabetical “hash function” for the rest of the library then we distribute 10,000 new books in isolation into the new shelving, we would be left with two sets of alphabetically distributed books. In this scenario, If somebody comes to the library with a book beginning with the letter “A”, they now have to search through two sections labeled “A”. This is an example of regression in our data structure and poses equivalent problems to collisions.
Well, Unfortunately, if we already have a hashing function that performs very well in terms of distribution the typical approach is to this problem is to pull all 50,000 books off the shelves, combine our 10,000 new books with our 50,000 old books, and then distribute all 60,000 books again into our allocated memory after increasing our shelf-space.
In programming terms, this means allocating a new array and running all 60,000 books through the hashing function to insert them into our new allocation. This is typically something close to O(n) time.
We can prevent the likelihood of this scenario.
The most common way to prevent the reallocation problem is to choose a size for our library that we are confident will cover any fluctuations in inventory size.
For example, let’s say we are making a library for a small school. We know from a sister school that the total number of books in the library has been some number between 80,000, and 130,000 for the past 10 years.
What we do is build shelving capable of holding 150,000 books. In other words, we allocate slightly more memory than we think we will need so we can comfortably account for any fluctuations without having to do a reallocation. If all goes well, our search time, insert time, and deletion time could all remain O(1) time complexity in perpetuity.
If this system can be maintained it will save a lot of time for the customers in the long run.