Ternary Search: April 2013

Saturday, April 27, 2013

Blotto Games

I'm currently taking the Model Thinking class on Coursera, which is about applying mathematical models to real-world scenarios in order to derive interesting properties about how people or organizations make choices and why certain things happen. One model that was recently discussed in the class is Blotto games, which has applications in understanding presidential elections, terrorism, trials, hiring, etc. It has a nice mathematical formulation, which is why I decided to learn a bit more about it. The specific Blotto game I want to briefly discuss is as follows: there are two players in the game and they must allocate resources across three different "fronts." They each have an equal number of resources (assumed to be one without loss of generality), and whoever has more resources allocated to a front wins that front. The winner of the game is the one who wins two of the three fronts.

We can think of each player's allocation as a point in Euclidean space lying on the triangle given by $x + y + z = 1$ and $x, y, z \ge 0$. It has vertices at $(1, 0, 0)$, $(0, 1, 0)$, and $(0, 0, 1)$. If we visualize the plane of the triangle we get something like this:

Here, suppose the red dot is some allocation (or strategy). Then we can divide the rest of the triangle into two regions: the green region is the set of strategies which beat the red strategy and the blue region is the set of strategies which the red strategy beats. This follows from the fact that the strategies in the green region are closer to two of the triangle's vertices than the red strategy, and the opposite holds true for the blue strategies. From this picture we can immediately see that the regions close to the vertices are very bad, which is intuitively obvious because you are almost certainly going to lose the two fronts you have not allocated much to. It is also the case that any strategy has a "counter" strategy, so if the two players alternately picked pure strategies there would be no equilibrium.

It turns out, however, that there are mixed strategies for which you are guaranteed to win with probability at least $1/2$ regardless of what the other player chooses, so that allows for an equilibrium. To visualize one such strategy, consider the following picture:

The strategy involves only choosing points inside this hexagon, which has vertices at $(0, 1/3, 2/3)$ and permutations of that. It is not uniform, however, as the strategies near the edge of the hexagon are better than those inside. The exact distribution is constant along the edge of the hexagon, zero at the center, and increasing linearly from the center to the edge. This strategy, along with a number of others, can be proved "optimal" in the sense that your probability of winning is always at least $1/2$. So while none of these strategies is individually optimal, randomizing the choice makes it so that the other player cannot effectively counter your strategy. Since the game is symmetric, the existence of such a strategy solves the game; both players will opt to play one of these mixed strategies, and they will each of a $1/2$ chance of winning.

Mathematical games are often pretty interesting to think about, and even this simple Blotto game has unexpected properties and equilibria. Mixed strategies, in particular, are not very intuitive, yet quite valuable in understanding these games and sometimes even in understanding how the world works.

Sunday, April 21, 2013

Spanner TrueTime

Spanner is one of Google's recent developments in the area of large-scale distributed systems, and its design from the very beginning was intended for deployment at the global scale. Not only that, but because many consumers of systems similar to Spanner (e.g. BigTable and Megastore) wanted better consistency guarantees, the authors also built externally-consistent distributed transactions. This level of consistency is quite rare among the popular NoSQL databases out there, with many of them opting for eventual consistency, so it is quite refreshing. Moreover, Spanner uses schemas which consist of "semi-relational tables" (they require primary keys) and provides a "SQL-based query language," which is interesting given the direction that other distributed databases are generally heading. The technical paper for Spanner can be found here.

What appears to be the most revolutionary concept behind Spanner is how it deals with time. Managing time in distributed systems is a well-known difficulty and has been for many decades (see Leslie Lamport's paper), so it might not be surprising that the combination of strong consistency and global scale ends up focusing a great deal on time. Google's approach to solving this problem has two parts; the first is formalizing time uncertainty and working with it, rather than around it. They introduce TrueTime, which is an API that the machines in their system use to quantify their own uncertainty regarding the time. It has just three methods, which are TT.now(), TT.after(), and TT.before(). The first returns an interval that specifies the earliest and latest times that it could be right now, while the latter two allow you to query whether it is definitely after or before a point in time.

It turns out that by using TrueTime and a lot of Paxos, Spanner is able to guarantee a level of consistency that is stronger than that of many other distributed systems while still being very efficient. For example, you can implement snapshot isolation for read-only transactions and point-in-time reads by ensuring that all replicas can "safely" read at a certain timestamp (i.e. a later timestamp has been committed). And during transactional writes, the commit is delayed so that it occurs at a time known to be after the recorded timestamp, as specified by TrueTime, which prevents out-of-order writes. The details of how the authors manage all of this can be the found in the paper, but suffice it to say that TrueTime on top of Paxos is what allows all of these consistency guarantees to be made.

Lastly, it should be pointed out that with the Spanner design, if the clock skew of the individual machines is high, performance will suffer greatly. Commits on read-write transactions will be forced to block for long periods and thus limit latency and throughput. And so the second part of Spanner time management is in the hardware: GPS and atomic clocks. By using clocks that give you much better performance in terms of skew along with a custom clock synchronization protocol (instead of NTP), Spanner is able to maintain very low levels of uncertainty about the time. Their novel approach to time accounts for uncertainty so that they can reason formally about consistency but heavily leverages clock accuracy for performance. The fact that Google is able to deploy this type of system globally (it is powering their new advertising backend) speaks a lot to the affordability of accurate clocks today.

All in all, Spanner is a unique distributed system that brings together a set of ideas and even some hardware to solve a very hard problem. It will be interesting to see if the allure of a high-performance system that provides strong consistency starts drawing the attention of the NoSQL community and whether the TrueTime API becomes more common as a way to manage time uncertainty.

Wednesday, April 17, 2013

String.format

Most of the time when you are dealing with strings, you want to construct them dynamically from some variables, whether they are integers, objects, or strings themselves. You might end up with code that kind of looks like this:

To most people, this is not the nicest looking code, and it can be hard to parse while reading and easy to mess up while writing. Java offers an alternative that behaves like good old printf which is the String.format method. Instead of inlining the variables as in the above example, you can put format specifiers in their place and specify the variables at the end. This additionally has the benefit of allowing you to render strings in different ways (e.g. leading zeros). For example:

In many cases, this is preferable because it is easier to see what the string should look like. There are some disadvantages such as the increased difficulty in identifying which variable is associated with which format specifier, but as we can see that might be overcome through good naming conventions. Now that I've introduced (or reintroduced) to you this handy feature of Java, let me tell you why you should be wary of using it. The reason is, of course, performance. Let's take a look at a quick benchmark program:

This benchmark compares String.format to appending when there are between 0 and 10 format specifiers (all "%d" for simplicity). The output of the second run (to avoid the impact of JVM warm-up) is as follows, with the columns representing number of format specifiers, time taken by String.format, and time taken by appending, respectively:

Needless to say, the append method blows String.format out of the water, especially with a small number of format specifiers. As it turns out, String.format calls down into Java's regular expression library to parse the format string, and that has a lot of overhead, especially in simple cases. That's not to say String.format doesn't have its uses. The whole point of this discussion came from the fact that it can make your code more readable, which is a great thing and sometimes more important than performance. But if you ever need to manipulate strings in the innermost loop of your application, please just stick with appending.

Sunday, April 14, 2013

Distributed Matrix Computation

Anyone who has worked with MapReduce knows that, while it suits many distributed computations well, it is very difficult to implement matrix computations using the framework. It is possible to represent matrix-based algorithms such as PageRank using MapReduce (e.g. here) but it definitely isn't the most natural or efficient way. Moreover, PageRank is pretty simple as far as matrix computations go, and once you want to do things like Cholesky decompositions for solving systems of linear equations or singular value decompositions it gets very tricky. MapReduce was simply not built for these use cases, so there is nothing to fault it for, but since it is the standard go-to distributed computing framework people have tried to do just about everything with it. Different problems require different solutions, though, and the creators of MadLINQ, a system coming out of Microsoft Research, are trying to overcome the difficulty of implementing complex, distributed matrix algorithms with their new system.

The MadLINQ system shares a lot in common with its cousin DryadLINQ that it integrates with. They both have use the directed acyclic graph (DAG) model of computation where each node represents some unit of computation that depends on the outputs of some other nodes. For example, MapReduce itself can be represented in this model by a set of "map" nodes and a set of "reduce" nodes where an edge exists from each of the map nodes to each of the reduce nodes. The key then lies in how MadLINQ maps a matrix computation into a DAG that can be efficiently executed in parallel. To do this, the authors use the familiar tiling abstraction applied to many parallel matrix operations (e.g. tiled matrix multiplication) where the matrix is broken into disjoint, square sub-matrices to operate on. Each node in the DAG is then an optimized, parallel matrix operation on a set of input tiles from the original matrix or intermediate results. One nice thing is that they abstract away the underlying matrix library for the individual node computations so that improvements in that space (which many people work on) will result in improvements in MadLINQ.

One interesting optimization implemented in MadLINQ is what the authors call "fine-grained pipelining." In both MapReduce and DryadLINQ, a node in the DAG must wait for its predecessors to completely finish processing before reading all of their output and executing. This has two negative consequences: first is the fact that during periods where there is low parallelism in the DAG (you can only break down the units of execution so far before the overhead exceeds the benefits) the cluster will be idle, and second is the fact that network utilization will be more bursty because a node decides to read the entirety of its inputs at once. MadLINQ allows individual nodes to produce partial output and consume partial input, enabling pipelining at the sub-node level (hence the term "fine-grained"). This is particularly useful in the context of matrix computations because many of the individual nodes also execute tile-based algorithms which lend themselves to this granular pipelining. The authors went through quite a bit of trouble to optimize the system as much as possible to really "solve" the use case of distributed matrix computations.

MadLINQ really speaks to how hard problems in distributed systems are. MapReduce has been a huge step forward in the last 10 years, but it has its limitations, and as more and more people and companies are turning to large-scale computations and processing, they will need to invent new solutions to solve the wide array of problems out there. The fundamental design decisions that must be made when building a new distributed system inherently limit its applications in a way that, for example, modern processors are not limited. People (generally) do not design specific processors for their applications anymore, and one would expect distributed systems to eventually reach that point as well, but there is certainly a long road ahead.

Sunday, April 7, 2013

Encrypted VoIP

Voice over IP (VoIP) technology is extremely popular these days and as a result it is important to study the security of the protocols being used. While most such technologies use strong encryption to protect the data, there are often side channel attacks that can be exploited. This paper from a couple of years ago presents an avenue of attack that assumes very little about the adversary while still achieving nontrivial results. The authors describe a process that uses a machine learning techniques combined with computational linguistic knowledge and specific audio codec properties to reconstruct potential transcripts of encrypted conversations. It's particularly impressive because they are able to combine the research advancements of different fields and apply it to a problem that has serious real-world implications.

In their model, the adversary needs only access to the packet lengths of the VoIP call (easily attainable by sniffing the network, e.g. over wireless) and knowledge of the language of the call (assumed to be English). Amazingly, with just these two pieces of information, it is possible to perform a limited reconstruction of the transcript of the call; not enough to deem the protocols complete broken, but enough to warrant a much closer look at how voice data should be encrypted. The attack begins by training on a large corpus of voice data that is passed through the known speech codec to build a model of phoneme boundaries based on packet lengths. This is the most novel part as they use observations about the codec itself, in particular variable bitrate encoding, as the basis for the model. By considering the length of each packet in relation to the surrounding packets and using dynamic programming, they construct the highest-probability phoneme boundaries for an entire sequence of packets.

Following the identification of the boundaries is the hardest task: classifying a sequence of packet lengths as an actual phoneme. Again, the authors use a combination of the training data and context to determine the most likely sequence of phonemes. For example, having a full dictionary of possible phoneme bigrams and trigrams in the English language will greatly limit the space of possible outputs; using a frequency distribution can further aid in choosing certain phonemes over others. Once the phonemes are identified (usually with error), the attack identifies word boundaries and finally maps the phoneme sequences into actual English words. There is some novelty in their approach to these latter two stages as well, but as I understand it those are well-known problems in the speech recognition field. The result of all this is some transcriptions that looks like this (the best ones, naturally):

"Change involves the displacement of form." => "Codes involves the displacement of aim."
"The two artists exchanged autographs." => "The two artists instance attendants."
"Artiﬁcial intelligence is for real." => "Artiﬁcial intelligence is carry all."

Pretty impressive, as these sentences are spoken, encoded, encrypted, and then sent over the network, with the only observable factor being the resulting packet lengths. Being able to pick out words should be very worrisome for privacy, and even though the environment is controlled and some assumptions are made, the techniques presented are legitimately close to "breaking" the security of speech data encoded and sent over the network in this way.

This paper reinforces my thoughts on the sometimes misleading nature of "theoretical" security. Even using unbreakable encryption algorithms is not enough to provide security in the real world, as the possibility of side channel attacks is always present. Moreover, since every application has a different set of such attacks it is crucial to carefully design protocols to not be vulnerable, which is a significant burden. This topic of attacks based on packet lengths was also the basis of the StegoTorus project which I was a part of; we showed that the website you are visiting can often be identified based on packet lengths (when you cannot see the IP itself, e.g. over a proxy). Especially as Internet traffic becomes more dynamic and differentiated, patterns are emerging which can render encryption much less effective and may demonstrate the need for something more to protect privacy.

Analytics