Sunday, June 30, 2013

Dremel Data Model

It is common knowledge that analyzing large datasets efficiently can benefit greatly from column-oriented storage. That is, instead of storing all the data for a single row together like a traditional database would, separate the columns out and store all of the data for each column together (see the Wikipedia link for an example). The benefits of this are twofold: (1) queries that only access a subset of the columns can reduce the amount of data that has to be retrieved, and (2) compression algorithms often perform better on homogeneous data, e.g. a column with many values of the same type. As such, data stores such as Cassandra and HBase have become widely used when dealing with records that have a large number of columns. Given that both of these projects were based on the original ideas of Google's BigTable, it should be no surprise that Google has continued leveraging the power of column-oriented storage to deal with the scale of data that it processes. Dremel is a project that has been used at Google for a number of years now; it takes the idea of column-oriented storage, generalizes it, and then optimizes it in order to perform queries over tens to hundreds of terabytes of data in seconds.

The key to Dremel's performance is how the data is represented and consequently laid out on disk. First of all, they generalize having a bunch of columns per record to a hierarchical structure of nested columns that supports repeated and optional fields. That may sound familiar because it is exactly the format of protocol buffers, which is the language-independent binary representation of data that Google generally uses, so it is a natural choice for the type of data Dremel should support. Column-oriented storage is simple in the normal case of one (potentially optional) value per column per record since you can easily figure out which value corresponds to which record, but it gets trickier when you allow for the full protocol buffer specification. Consider the following example schema (a simplified version of what is presented in the paper):
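The schema itself seems to have been lost in formatting; the following reconstruction, adapted from the paper (with field names matching those used below), conveys the idea:

```protobuf
message Document {
  required int64 docId;
  repeated group name {
    repeated group language {
      required string code;
      optional string country;
    }
    optional string url;
  }
}
```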

This represents a web document, which has an ID and a set of names associated with it; each name is a URL that points to that document and the languages associated with the URL. The generalization of column-oriented storage to handle nested columns is that each leaf field in the hierarchy is stored separately. For example, all values of "docId" are stored together, and all values of "name.language.country" are stored together since both are leaf fields in the schema for a document. Let's look at two example records which we will assume are stored in the order presented:
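The two example records also did not survive; the pair below, adapted from the paper's example (the code and url values are only illustrative), is consistent with every value discussed in the rest of the post:

```
docId: 10
name
  language
    code: "en-us"
    country: "us"
  language
    code: "en"
  url: "http://A"
name
  url: "http://B"
name
  language
    code: "en-gb"
    country: "gb"

docId: 20
name
  url: "http://C"
```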

The "docId" field is simple and only needs to store the values "10" and "20" since it is a top-level, required field. The "name.language.country" field, on the other hand, will need to store the values "us", NULL, NULL, "gb", NULL. We will see why these NULL values are necessary shortly, but they are basically placeholders for potential values of "name.language.country."

It should be clear that because of the repeated and optional values we cannot only store the values above without additional metadata, since doing so would lead to the loss of record boundaries. To solve this, Dremel introduces two metadata fields known as the "repetition level" and the "definition level." These are logically attached to every column value in order to preserve record boundary information. The repetition level handles repeated fields and indicates the depth in the hierarchy that was repeated between the current column value and the previous one (0 is the root, meaning a new document). The definition level handles optional fields and indicates the depth in the hierarchy that actually exists for the current value; this is only relevant for NULL values. For the "name.language.country" field, Dremel would store the following:

 #    value    rep    def
 1    "us"      0      3
 2    NULL      2      2
 3    NULL      1      1
 4    "gb"      1      3
 5    NULL      0      1

In row 1 of the table, rep = 0 because it is a new record, and def = 3 because all 3 levels of the hierarchy ("name", "language", and "country") are defined, which is why we have a non-NULL value. In row 2, rep = 2 because we are repeating "name.language" when going from the previous row to this one, and def = 2 because only "name" and "language" are defined, so the value is NULL. In row 3, rep = 1 because we repeated at the "name" level, and def = 1 because only "name" is defined. In row 4, rep = 1 because we again repeated at the "name" level, and def = 3 because we have a real value. Lastly, row 5 has rep = 0 because it is a new record, and def = 1 because only "name" is defined. The meanings of the repetition and definition levels can be confusing at first, but if you understand how they are derived in the above table, you should be able to convince yourself that this is a lossless representation. Dremel then optimizes, down to the bit level, how much data actually needs to be stored, such as omitting repetition and definition levels whenever possible and using as few bits as necessary to store the values.
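To see that the representation really is lossless, here is a small Scala sketch (my own illustration, not Dremel's code) that recovers record boundaries for this column using only the repetition levels:

```scala
// Each column entry carries its value plus repetition and definition levels.
// rep == 0 marks the start of a new record.
case class Entry(value: Option[String], rep: Int, defn: Int)

val countryColumn = List(
  Entry(Some("us"), 0, 3),
  Entry(None,       2, 2),
  Entry(None,       1, 1),
  Entry(Some("gb"), 1, 3),
  Entry(None,       0, 1)
)

// Split the column back into per-record groups at rep == 0 boundaries.
def splitRecords(col: List[Entry]): List[List[Entry]] =
  col.foldLeft(List.empty[List[Entry]]) {
    case (acc, e) if e.rep == 0 => List(e) :: acc      // start a new record
    case (cur :: rest, e)       => (e :: cur) :: rest  // extend current record
    case (Nil, e)               => List(List(e))       // defensive: column should start at rep 0
  }.map(_.reverse).reverse

// Non-NULL countries per record:
splitRecords(countryColumn).map(_.flatMap(_.value))
// == List(List("us", "gb"), List())
```

Splitting at rep = 0 yields the two records, and keeping only fully-defined entries (def = 3) recovers exactly the non-NULL country values: "us" and "gb" in the first record, none in the second.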

It should be no surprise that Google's solution for interactive queries on huge datasets involves minimizing the amount of data that needs to be retrieved from disk, a concept that database engineers have employed for decades. Since the volume of data being collected today far outpaces the amount that can be read off of disk in a few seconds, there seem to be few options in the analytics space for efficient queries other than designing to reduce what needs to be read. But one can certainly imagine more complex queries that involve a large number of fields or even joining two fields that would be extremely valuable yet essentially impossible given the limitations of today's technology. This area seems ripe for additional work in pushing the boundaries of what types of queries are possible on the huge datasets that companies are collecting, but Dremel is definitely a solid approach that Google has derived a lot of value from.

Sunday, June 23, 2013

HashCode "Randomness"

The basis of this post is a very confusing bug I ran into recently involving what appeared to be nondeterministic behavior in a program that has no explicit randomness. Consider the following example:
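The code example appears to have been lost in formatting; a minimal reconstruction that exhibits the same behavior (the class and field names are my own):

```scala
import scala.collection.immutable.HashSet

// Note: Foo does NOT override equals() or hashCode().
class Foo(val x: Int)

// Build the same set three times and print its head each time.
for (_ <- 1 to 3) {
  val set = HashSet(new Foo(1), new Foo(2), new Foo(3))
  println(set.head.x)  // likely NOT the same number all three times
}
```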

It looks innocent enough, but when you run it there is a very good chance it will not print out the same number all three times. Set data structures generally do not give you any guarantees as to the ordering of the elements, but Scala's HashSet also does not have any explicit randomness, so the question remains as to why the head of the set would be different each time. Java veterans should be cringing by now at the fact that we are putting an object that does not override the equals() and hashCode() methods into a hash data structure, violating one of the basic rules of Java collections. And that is indeed the reason for the "nondeterministic" behavior we observe: the JVM's default hash code implementation can return basically anything (though it tries to return distinct values for distinct objects), so depending on the values of the hash codes for each instance of Foo that we create, the set may (and does) order them differently.

As it does with many confusing things in Java, Scala has ways of helping you out here. In the above code example, if we turn Foo into a case class, the same value will be printed every time. This is because case classes automatically have equals() and hashCode() implemented so you do not have to worry about doing it yourself, which is a good argument for making simple value wrapper classes always be case classes (not to mention the nice pattern matching functionality you get). For reference, the automatically generated hash code for case classes can be found here; it is an implementation of the MurmurHash algorithm using the hash codes for each of the fields in the class. So the hash code of the Foo case class is stable across different instances of the JVM as long as Scala does not change the implementation (which they could very well do).
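For contrast, here is a sketch of the case class fix; with it, the head of the set is stable:

```scala
import scala.collection.immutable.HashSet

// The compiler generates equals() and hashCode() from the fields.
case class Foo(x: Int)

for (_ <- 1 to 3) {
  val set = HashSet(Foo(1), Foo(2), Foo(3))
  println(set.head.x)  // the same number every time
}
```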

In the end, this turned out to be just a lesson in Java basics brought to light by some old, convoluted code that I came across. It does make me wonder whether the Java and/or Scala compilers could emit a warning when they detect objects being put into a hash data structure without overriding equals() and hashCode() because I am sure that would save a lot of headaches. There is definitely a limit to how much can be done through static analysis, but perhaps a combination of that and runtime checks could do the trick.

Thursday, June 13, 2013

Scala Closures and Vars

One of the interesting choices that the designers of Scala made is how to deal with accessing mutable local variables from closures. For example, consider the following snippet:
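The snippet itself was lost in formatting; a minimal reconstruction consistent with the discussion below:

```scala
def foo(): Unit = {
  var i = 0              // mutable local variable
  val f = () => i += 1   // closure f captures i
  f()
  println(i)             // prints 1
}

foo()
```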

The code is straightforward, and it is easy to see that the value 1 will be printed. What is interesting is that the foo method has a mutable local variable (var) i which is referenced from the closure f. For those of you familiar with Java, you'll know that the equivalent is prohibited; if you create an anonymous inner class in a method (which is essentially what closures are in Scala), you can only refer to final variables (vals in Scala) from the local scope. That is because any such class you create has an unknown lifetime since it lives on the heap, which means you cannot assume that the locally scoped variable will still be around for the lifetime of the class instance. While it may be a sensible restriction, it is also the source of much confusion for beginning Java programmers who are creating classes local to the current scope and cannot understand the error. As such, Scala lifts this restriction in order to simplify things for developers.

The problem that Java avoids is still there, though. Scala has to do something under the hood to ensure that running the closure always works even if the variable has gone out of scope. The solution is to move the variable to the heap and put it in a mutable wrapper, which in the case of integers is the IntRef class. So instead of creating an integer on the stack, the line var i = 0 actually creates an IntRef on the heap which closures then capture references to. All direct references to the variable (both reads and writes) are replaced with indirect references through the wrapper, which ensures that the local method as well as closures will all see each others' modifications. Easy enough, right?
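Roughly speaking, the compiler's rewrite looks like the following sketch (the real transformation happens in generated bytecode, but scala.runtime.IntRef is the actual wrapper class used):

```scala
import scala.runtime.IntRef

def foo(): Unit = {
  val i = IntRef.create(0)       // heap-allocated wrapper instead of a stack slot
  val f = () => { i.elem += 1 }  // the closure captures a reference to the wrapper
  f()
  println(i.elem)                // reads go through the wrapper too: prints 1
}
```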

Well, the designers of Java were probably smart enough to figure something like this out themselves, yet they did not choose to. One good reason is that shared mutable state is a very tricky thing in concurrent environments. Consider the following variation on the above code:
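That snippet was also lost in formatting; a reconstruction consistent with the discussion (using an explicit Runnable for clarity):

```scala
def foo(): Unit = {
  var i = 0
  val f = () => i += 1
  new Thread(new Runnable {
    def run(): Unit = f()  // the closure now runs on another thread
  }).start()
  println(i)               // race condition: may print 0 or 1
}
```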

Now instead of running f inline we spawn a thread that runs it. All sorts of alarms and warning bells should be going off in your head when you see this code, as there are now two threads potentially accessing an unsynchronized variable. And there is no way to determine whether 0 or 1 will be printed when you call foo since you have a race condition due to unsynchronized access to the variable i. Now you might think this example is contrived, but it could be that you pass the closure to some other method which, deep in the internals of its implementation, runs the closure on some other thread. Allowing closures to reference local variables throws away the invariant that you never need to synchronize access to local variables and essentially makes them as "dangerous" as instance variables. So while Scala simplifies some natural use cases, it has also potentially opened up a can of worms with the types of subtle concurrent access bugs that can now affect local variables.

Sunday, June 9, 2013

SVM Probabilities

In a previous post, I described the basics of support vector machines (SVMs) in the context of binary classification problems. The original formulation of SVMs allows only for pure classification, i.e. for a test instance we return only $+1$ or $-1$, rather than a probabilistic output. This is fine if your loss function is symmetric so false positives and false negatives are equally undesirable, but that is often not the case and you may be willing to trade one for the other. Having probabilistic outputs allows you to choose what point along the receiver operating characteristic (ROC) curve you want to be at, which is an important consideration in practical applications of machine learning. Fortunately, people have found ways of adapting the output of SVMs to produce reliable probability estimates; the one I will be discussing is known as Platt's method, developed by John Platt at Microsoft Research.

Recall that the output of SVM training is a predictor $\textbf{w} \in \mathbb{R}^n$ where classification is done by computing $h(\textbf{x}) = \text{sign}(\langle \textbf{w}, \textbf{x} \rangle)$ for some instance $\textbf{x}$. In that sense, the quantity $f(\textbf{x}) = \langle \textbf{w}, \textbf{x} \rangle$ is already some sort of continuous spectrum between the two classes (like a probability would be). Platt's method is simply a post-processing step after SVM training which fits a function to map the value of $f$ to probabilities. So the questions that remain are what kind of function should be fit and how should the fit be done? Platt observed that the class-conditional probability densities, i.e. $p(f | y = \pm 1)$, between the SVM margins appear to be exponential (see Figure 1 in the linked paper for a nice graph of this), which after applying Bayes' rule leads to fitting a parametrized sigmoid function of the form

$$P(y = 1 | f) = \frac{1}{1 + \exp(Af + B)}$$
The parameters $A$ and $B$ can then be trained using a number of optimization algorithms with either a hold-out set or cross-validation to maximize likelihood (or, as typically done, log-likelihood).
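As code, the post-processing step itself is tiny. A Scala sketch (the parameter values below are made up for illustration; in practice $A$ and $B$ come out of the fitting step):

```scala
// Map a raw SVM margin f(x) to a probability via Platt's sigmoid.
def plattProbability(f: Double, a: Double, b: Double): Double =
  1.0 / (1.0 + math.exp(a * f + b))

// With a = -2.0, b = 0.0 (illustrative values), the decision boundary
// f = 0 maps to probability 0.5, and larger margins map closer to 1.
plattProbability(0.0, -2.0, 0.0)  // 0.5
```

Note that fitted values of $A$ are typically negative, so that larger (more positive) margins map to probabilities closer to 1.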

There are additional details such as regularization, choice of algorithm for fitting, and numerical stability, but at a high level producing probability estimates from SVM output is just mapping the value of $f$ to a probability. This is convenient because it means the actual SVM implementation can remain untouched, requiring only an additional step at the end. Moreover, this functionality is provided in the libsvm library, which makes it easily accessible. In particular, I was able to leverage it to apply SVM training to the Kaggle contest I mentioned last time, which uses the AUC metric for scoring and thus requires ranking rather than simple binary classification.

Tuesday, June 4, 2013

Learning with Categorical Features

I recently found out about Kaggle, which is a really neat website that hosts machine learning contests in a similar style to TopCoder's marathon matches. You are given training data and asked to make predictions on a test set using essentially whatever tools you want. They have a live scoreboard and a forum associated with each contest so people can see how they are doing and collaborate with others. I think it is an awesome idea and an excellent way to put all of the machine learning theory that you learn about in classrooms to a practical test. The first contest I decided to take a look at was the Amazon Employee Access Challenge; the problem is, given information about an employee (such as their manager, role, and department) and a resource in question, to determine whether the employee should be granted access to that resource. For example, a software engineer needs access to the development servers while a salesperson does not.

Since there are a lot of readily available machine learning libraries and packages out there, most of the interesting things happen in the data preparation and feature selection stage. Software like libsvm and liblinear provide good implementations of algorithms so it is a matter of figuring out how to manipulate the data to be usable by them. In the case of the Amazon contest, the interesting thing is that all of the information provided about the employees is categorical; two different departments, for example, are not comparable numerically. The data provided is also already prepared in such a way that the features are all integers representing the different possible values (e.g. software engineers might have role 123), but we are given no further information about what those integers mean. In that sense, choosing the category values as the features makes no sense in the context of algorithms like support vector machines (SVM) or logistic regression. If they were given as strings, it would be even less obvious how to produce numerical features.

So how does one handle categorical features? It turns out that the most common method is to take the feature and turn it into N different binary features, one for each of the N categories. Then each instance has exactly one of the features set to 1 and the rest to 0. For example, if we had color as a feature and potential values of red, blue, and green, we would create three new binary features (is_red, is_blue, is_green) which correspond to what the original feature was set to. So red would be (1, 0, 0), blue would be (0, 1, 0), and green would be (0, 0, 1). In this way, there is a meaningful numerical relationship between 0 and 1 that SVM and logistic regression can handle. It is important to note, however, that if the number of potential categories is large, the number of resulting features also becomes very large. For the Amazon contest, there are eight categorical features provided, but after processing them in the described manner each instance has over 16,000 features. Each instance will only have eight of those features active, though, so they are very sparse, allowing many algorithms to still operate efficiently. Applying logistic regression to the new data with 16,000 features gives a pretty reasonable prediction score. Not bad!
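A quick Scala sketch of this expansion (illustrative code, not from an actual contest entry):

```scala
// One-hot encode a categorical column: one binary feature per distinct value.
def oneHotEncode[T](column: Seq[T]): Seq[Seq[Int]] = {
  val index = column.distinct.zipWithIndex.toMap  // category -> feature position
  column.map { v =>
    Seq.tabulate(index.size)(j => if (index(v) == j) 1 else 0)
  }
}

oneHotEncode(Seq("red", "blue", "green"))
// Seq(Seq(1, 0, 0), Seq(0, 1, 0), Seq(0, 0, 1))
```

In practice you would store only the index of the single active feature per original column, which is exactly the sparse representation that keeps the 16,000-feature instances tractable.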