This is a blog post rewritten from a presentation at NYC Machine Learning last week. It covers a library called Annoy that I built to help you perform (approximate) nearest neighbor queries in high-dimensional spaces. I have split it into several parts. This first part explains why vector models are useful, how to measure similarity, and why nearest neighbor queries are useful.

*Nearest neighbor* search is conceptually very simple. Given a set of points in some space (possibly with many dimensions), we want to find the *k* closest neighbors quickly.

This turns out to be very useful for a variety of applications. Before we go into exactly how nearest neighbor methods work, let’s talk a bit about vector models.

**Why vector models and nearest neighbors are useful**

Vector models are becoming more and more popular in a variety of applications. They have long been used in natural language processing, for example with LDA and PLSA (and, before that, TF-IDF in raw space). More recently there is a new generation of models: word2vec, RNNs, etc.

In collaborative filtering, vector models have been among the most popular methods going back to the Netflix Prize. The winning entry featured a huge ensemble in which vector models made up a large part.

The basic idea is to represent objects in a space where proximity means that two items are similar. If you’re using something like word2vec, it looks like this:

In this case, the similarity between words is determined by the angle between them. *Apple* and *banana* are close to each other, while *boat* is further away.

(Side note: much has been written about word2vec’s ability to do word analogies in vector space. This is a powerful demonstration of the structure of these vector spaces, but the idea of using vector spaces is an old one. Similarity is arguably the much more useful property.)

In the most basic form, the data is already represented as vectors. As an example, let’s look at one of the most standard datasets in machine learning: the MNIST handwritten digits dataset.

**Building an image search engine for handwritten digits**

The MNIST dataset contains 60,000 images of size 28×28, each showing a grayscale handwritten digit. One of the most basic ways to play with this dataset is to flatten each 28×28 array into a 784-dimensional vector. There is no machine learning involved in doing this, but I’ll come back to cooler things like neural networks and word2vec later.

Let’s define a distance function in this space. Let’s say the distance between two digits is the sum of the squared pixel differences. This is just the squared Euclidean distance (that is, the good old Pythagorean theorem).
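As a minimal sketch of this distance, assume each image is a 28×28 nested list of pixel intensities. The helper names `flatten` and `squared_distance` are mine, and the two toy images are made up; plain Python is used to keep the example dependency-free.

```python
def flatten(image):
    """Turn a 2-D grid of pixel values (a list of rows) into one flat vector."""
    return [pixel for row in image for pixel in row]


def squared_distance(a, b):
    """Sum of squared pixel differences, i.e. the squared Euclidean distance."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two toy 28x28 "images": one all black, one with a single bright pixel.
blank = [[0] * 28 for _ in range(28)]
dot = [[0] * 28 for _ in range(28)]
dot[14][14] = 255

u, v = flatten(blank), flatten(dot)
print(len(u))                  # 784
print(squared_distance(u, v))  # 65025, which is 255 squared
```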

This is useful because we can compute the distance between any two digits in the dataset.

This makes it possible to search for neighbors in this 784-dimensional space. Check out some of the samples below. The digit on the far left is the seed digit, and to its right are the 10 most similar images by pixel distance.

You can see that it sort of works. The digits are visually quite similar, although it is obvious to a human that some of the nearest neighbors are the wrong digit.

This was pretty nice and easy, but it is also an approach that doesn’t scale very well. What about bigger images? What about color images? And how do we judge similarity not just in terms of visual resemblance, but in terms of what a human would actually consider similar? This simple definition of “distance” leaves a lot of room for improvement.

**Dimensionality reduction**

A powerful approach that works across a wide range of domains is to take high-dimensional, complex items and project them down to a compact vector representation.

- Reduce the dimensionality from a large-dimensional space to a small-dimensional space (10-1000 dimensions)
- Use similarity in this space instead

Dimensionality reduction is a very powerful technique because it lets you take almost any object and turn it into a small, convenient vector representation in some space. This space is generally referred to as *latent*, because we don’t necessarily have any prior notion of what the axes are. What we care about is that *similar objects end up close to each other*. And what does similarity mean? In many cases, you can actually discover that from the data.

Now let’s talk about one approach to dimensionality reduction for images: deep convolutional neural networks. About a year ago I had a side project to classify food. This is a pretty ridiculous application, but the ultimate goal was to see if calorie content could be predicted from a photo, and the secondary goal was to learn how to use convolutional neural networks. I never ended up using this for anything and wasted a lot of money renting GPU instances on AWS, but it was fun.

To train the model, I downloaded 6 million photos from Yelp and Foursquare and trained a network very similar to the one described in this paper, using Theano.

The last layer of this model is a 1244-way multi-class output using softmax, so we are training this in a supervised way. The classes are words that occur in the description text, for example “spicy ramen” for the one above. The nice thing, though, is that there is a “bottleneck” layer just before the final layer. This is a 128-dimensional vector that gives us exactly what we need.

Using the neural network as an embedding function and cosine similarity as a metric (this is basically Euclidean distance, but with the vectors normalized first), we get some very cool nearest neighbors.
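The remark about cosine similarity and Euclidean distance can be made concrete. For unit-length vectors, the squared Euclidean distance equals 2 − 2·cos, so ranking neighbors by one is the same as ranking by the other. Here is a small sketch with made-up 3-dimensional vectors (the real embeddings here have 128 dimensions); the function names are my own.

```python
import math


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up vectors, purely for illustration.
a = [1.0, 2.0, 3.0]
b = [2.0, 0.5, 1.0]

lhs = squared_euclidean(normalize(a), normalize(b))
rhs = 2.0 - 2.0 * cosine_similarity(a, b)
print(math.isclose(lhs, rhs))  # True
```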

These similarities look pretty reasonable! The photo on the upper left is similar to a bunch of other french fries. The second row shows various white bowls with Asian food. Even more impressive, they are all at different scales and angles, and the pixel-by-pixel similarity is very low. The last row shows a bunch of desserts with a similar pattern of chocolate drizzled over them. We are dealing with a space that captures the characteristics of objects well.

So how do you find similar items? I’m not going to elaborate on dimensionality reduction; there are countless techniques you can read about. What I have spent more time thinking about is *how to search for neighbors in vector spaces*. In fact, Annoy is fast enough that finding the neighbors above takes only a few milliseconds. That’s why dimensionality reduction is so useful: it discovers high-level structure in the data and at the same time computes a compact representation of each item. This representation makes it easy to compute similarities and search for nearest neighbors.

**Vector methods in collaborative filtering**

Of course, dimensionality reduction is not only useful in computer vision. As mentioned earlier, it is very useful in natural language processing. Spotify makes extensive use of vector models for collaborative filtering. The idea is to project artists, users, tracks, and other objects into a low-dimensional space where similarities can be easily calculated and recommendations can be made. This is, in fact, what powers almost all of Spotify’s recommendations, in particular the recently launched Discover Weekly.

I’ve already put together some presentations on this, so if you’re interested, you should check out some of them:

**Exhaustive search as a baseline**

So how do you find similar items? Before we dive into how Annoy works, it’s worth looking at the baseline: *brute-force exhaustive search*. This means iterating over all possible items and computing the distance from each one to the query point.

word2vec actually comes with a tool for exhaustive search. Let’s see how it compares! Querying for “Chinese river” using the GoogleNews-vectors-negative300.bin dataset takes about **2 minutes 34 seconds** to output this:
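The exhaustive approach is easy to sketch. The version below is an illustration on made-up 2-D points, not the word2vec tool; `brute_force_neighbors` is a name I chose, and it uses squared Euclidean distance as the metric.

```python
import heapq


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_neighbors(items, query, k):
    """Return the ids of the k items closest to query.

    items maps an id to its vector. Every item is visited, so the cost of
    one query grows linearly with the number of items.
    """
    return heapq.nsmallest(
        k, items, key=lambda item_id: squared_distance(items[item_id], query)
    )


# Toy 2-D data, made up for illustration.
items = {
    "a": [0.0, 0.0],
    "b": [1.0, 0.0],
    "c": [0.0, 3.0],
    "d": [5.0, 5.0],
}
print(brute_force_neighbors(items, [0.9, 0.1], k=2))  # ['b', 'a']
```

The result is always exact, which is what makes it a useful baseline, but every query touches every vector.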

- Qiantang River
- Yangtze River
- Yangtze River
- lake
- River
- Creek
- Mekong_river
- Xiangjiang_River
- Beas_river
- Minjiang_River

I created a similar tool that uses Annoy (available on GitHub here). The first time you run it, it precomputes a large amount of data and may take a while. The second run, however, loads (mmaps) the Annoy index directly from disk into memory. This is very fast because it relies on the magic of the page cache. Let’s try it and search for “Chinese river”:

- Yangtze River
- Yangtze River
- River
- Creek
- Mekong_river
- Huangpu_River
- Ganges river
- Thu_Bon
- Yangtze River
- Yangtze River

Remarkably, this took **470 ms**, some of which is probably overhead for loading the Python interpreter and so on. This is roughly **300 times faster** than the exhaustive search provided by word2vec.
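For reference, here is roughly what the build-once, load-later workflow looks like with Annoy’s Python API. This is a sketch on random vectors rather than the word2vec tool itself. The file name, dimensionality, and tree count are arbitrary choices of mine, and it assumes the `annoy` package is installed.

```python
import os
import random
import tempfile


def build_and_query(dim=20, n_items=1000, n_trees=10, k=5):
    # Imported here so the dependency on the annoy package is explicit.
    from annoy import AnnoyIndex

    random.seed(0)
    path = os.path.join(tempfile.mkdtemp(), "example.ann")

    # First run: add every vector, build the trees, and write the index to disk.
    index = AnnoyIndex(dim, "angular")
    for i in range(n_items):
        index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])
    index.build(n_trees)
    index.save(path)

    # Later runs: load() memory-maps the file, so no rebuilding is needed.
    loaded = AnnoyIndex(dim, "angular")
    loaded.load(path)
    return loaded.get_nns_by_item(0, k)  # ids of k approximate neighbors of item 0


if __name__ == "__main__":
    print(build_and_query())
```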

Now, some of you have probably noticed that the results are slightly different. That’s because the A in Annoy stands for *approximate*. We deliberately trade off some accuracy in exchange for significant speed gains. It turns out that you can actually control this knob explicitly. If we tell Annoy to search through 100k nodes (I’ll come back to that later), it returns this result in about **2 seconds**:

- Qiantang River
- Yangtze River
- Yangtze River
- lake
- River
- Creek
- Mekong_river
- Xiangjiang_River
- Beas_river
- Minjiang_River

This turns out to be exactly the same as the exhaustive search, and still about **50 times faster**.
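One way to quantify this trade-off is to measure how many of the exact neighbors the approximate search recovers. Below is a minimal sketch; the function name `recall` and the item ids are hypothetical, and in practice the two lists would come from an exhaustive search and from Annoy. In Annoy’s Python API, the knob itself is the `search_k` argument of `get_nns_by_vector` and `get_nns_by_item`.

```python
def recall(approximate, exact):
    """Fraction of the exact neighbors that also appear in the approximate result."""
    if not exact:
        raise ValueError("exact result list must not be empty")
    return len(set(approximate) & set(exact)) / len(set(exact))


# Hypothetical item ids, for illustration only.
exact = [3, 17, 42, 8, 99]
fast = [17, 3, 8, 55, 61]      # quick search: some true neighbors are missed
thorough = [3, 17, 42, 8, 99]  # more nodes inspected: matches the exact result

print(recall(fast, exact))      # 0.6
print(recall(thorough, exact))  # 1.0
```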

**Other uses of nearest neighbors**

Finally, as a fun example of another use, nearest neighbors are also useful when dealing with physical space. In a previous blog post, I showed a world map of how long it takes to ping IP addresses from an apartment in New York.

This is a simple application of k-NN (k-nearest neighbors) regression, which I wrote about earlier on this blog. There is no dimensionality reduction here. Only 3D coordinates (latitude/longitude projected onto the unit sphere) are handled.
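To illustrate the idea, here is a small sketch of k-NN regression on the unit sphere. It is not the code behind the ping map: the function names are mine, the sample values are invented, and it uses brute-force search and a plain average of the k nearest values.

```python
import heapq
import math


def to_unit_sphere(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to a 3-D point on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


def knn_regress(samples, lat_deg, lon_deg, k):
    """Predict a value as the mean over the k nearest samples.

    samples is a list of (lat_deg, lon_deg, value) tuples.
    """
    query = to_unit_sphere(lat_deg, lon_deg)

    def dist(sample):
        point = to_unit_sphere(sample[0], sample[1])
        return sum((p - q) ** 2 for p, q in zip(point, query))

    nearest = heapq.nsmallest(k, samples, key=dist)
    return sum(value for _, _, value in nearest) / len(nearest)


# Made-up ping times in milliseconds, not real measurements.
samples = [
    (40.7, -74.0, 5.0),     # near New York
    (42.4, -71.1, 15.0),    # near Boston
    (51.5, -0.1, 80.0),     # near London
    (35.7, 139.7, 180.0),   # near Tokyo
]
print(knn_regress(samples, 41.0, -73.0, k=2))  # 10.0
```

Working in 3D avoids the wrap-around problems of raw latitude/longitude, since straight-line distance between points on the sphere grows with great-circle distance.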

In the next part of this series, we’ll take a closer look at how Annoy works. Stay tuned!
