September 16, 2024


I temporarily moved to Berkeley, California, where I was the “science communicator in residence” at the Simons Institute, the world’s leading institute for collaborative research in theoretical computer science.

One nano-collaboration is today’s puzzle, told to me by a computer scientist at Microsoft whom I befriended over tea. It’s about data centers: those warehouses that contain endless rows of computers storing all our data.

One problem data centers face is the unreliability of physical machines. Hard drives fail all the time, and when they do, all their data can be lost. How do companies like Microsoft make sure they can recover data from failed hard drives? The solution to the puzzle below is essentially the answer to this question.

An obvious strategy that a data center can use to protect its machines from random failures is for each machine to have a duplicate. In this case, if a hard drive fails, you restore the data from the duplicate. However, this strategy is not used because it is very inefficient. If you have 100 machines, you will need another 100 duplicates. There are better ways, as you will hopefully conclude!

The disappearing boxes

You have 100 boxes. Each box contains a single number, and no two boxes contain the same number.

1. You are told that one of the boxes, chosen at random, will be removed. But before it is removed, you are given an extra box and allowed to put a single number in it. What number do you put in the extra box to ensure that you can recover the number from whichever box is removed?

2. You are told that two of the boxes, chosen at random, will be removed. But before they are removed, you are given two extra boxes and allowed to put one number in each of them. What (different) numbers do you put in these two boxes to ensure that you can recover the numbers from both removed boxes?

I’ll be back with the answers at 5pm UK time. In the meantime, NO SPOILERS. Please discuss your favorite hard drives instead.

The analogy here is that each box is a hard drive, the number in the box is the data, and removing a box is the failure of the hard drive. With one extra hard drive we are safe from the random failure of a single hard drive, and with two we are safe from the failure of two. It seems magical that we can protect so much information from random failures with minimal backups.
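For readers who would like to see the analogy in concrete terms (and who don't mind a strong hint), here is a minimal sketch of one well-known checksum idea. It is not necessarily the intended answer to the puzzle. It assumes the numbers are integers, that we know which positions went missing, and that only original boxes (not the extra ones) are removed. The function names `make_checks`, `recover_one` and `recover_two` are made up for illustration.

```python
def make_checks(values):
    """Return two check numbers: the plain sum and a position-weighted sum."""
    total = sum(values)
    weighted = sum((i + 1) * v for i, v in enumerate(values))
    return total, weighted


def recover_one(survivors, total):
    """Recover a single missing value.

    survivors maps position -> value for the boxes that are still present.
    """
    return total - sum(survivors.values())


def recover_two(survivors, i, j, total, weighted):
    """Recover the values at missing positions i and j (0-based, i != j)."""
    # What the two missing boxes contributed to each check number:
    d1 = total - sum(survivors.values())                            # a + b
    d2 = weighted - sum((k + 1) * v for k, v in survivors.items())  # (i+1)a + (j+1)b
    # Eliminate a: d2 - (i+1)*d1 == (j - i) * b, an exact integer division.
    b = (d2 - (i + 1) * d1) // (j - i)
    a = d1 - b
    return a, b


# Usage: 100 boxes with distinct integers, then boxes 5 and 77 vanish.
boxes = [3 * k + 7 for k in range(100)]
total, weighted = make_checks(boxes)
remaining = {k: v for k, v in enumerate(boxes) if k not in (5, 77)}
a, b = recover_two(remaining, 5, 77, total, weighted)
assert (a, b) == (boxes[5], boxes[77])
```

The point of the sketch is the storage cost: two extra numbers protect all 100, instead of 100 duplicates.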

The field of “error-correcting codes” is a large body of beautiful theory that answers questions such as how to reduce the number of machines needed to protect against random hard drive failures. And the theory works! Data centers never lose your data due to mechanical failure.

My collaborator on this puzzle was Sivakanth Gopi, a principal researcher at Microsoft. He said: “The magic of error-correcting codes allows us to build reliable systems using noisy and faulty components. Thanks to them, we can communicate with someone as far as the ends of our solar system and store billions of terabytes of data securely in the cloud. We can forget about the noise and complexity of this world and instead enjoy its beauty.”

I’ve been doing a puzzle here on alternate Mondays since 2015. I’m always on the lookout for great puzzles. If you want to suggest one, email me.


