Similarity Retrieval with Faiss

Keywords: Python Machine Learning NLP

1. The role of Faiss

The straightforward solution to TopK similarity retrieval is brute-force search: iterate over all vectors, compute the similarity of each one to the query, and keep the TopK. When the number of vectors is large, however, this becomes extremely time-consuming. Faiss addresses this problem well.
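To make the cost of brute-force search concrete, here is a minimal NumPy sketch of it. The function name brute_force_topk and the small demo arrays are illustrative assumptions, not part of Faiss; the work grows with the number of database vectors times the number of queries.

```python
import numpy as np

def brute_force_topk(xb, xq, k):
    """Return (distances, indices) of the k nearest database vectors for each query."""
    # Squared L2 distance for all pairs at once: ||q||^2 - 2 q.b + ||b||^2
    d2 = (xq ** 2).sum(axis=1, keepdims=True) - 2.0 * xq @ xb.T + (xb ** 2).sum(axis=1)
    d2 = np.maximum(d2, 0.0)  # clamp tiny negative values caused by rounding
    # argpartition selects the k smallest entries per row; then sort only those k
    idx = np.argpartition(d2, k - 1, axis=1)[:, :k]
    part = np.take_along_axis(d2, idx, axis=1)
    order = np.argsort(part, axis=1)
    return np.take_along_axis(part, order, axis=1), np.take_along_axis(idx, order, axis=1)

rng = np.random.default_rng(0)
xb_demo = rng.random((1000, 16), dtype=np.float32)   # database vectors
xq_demo = xb_demo[:3]                                # reuse three database vectors as queries
D_demo, I_demo = brute_force_topk(xb_demo, xq_demo, k=4)
print(I_demo[:, 0])   # each query's nearest neighbour is itself
```

Every query touches every database vector, which is exactly the cost that the approximate indexes in Faiss try to avoid.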

2. Introduction to Faiss

Faiss stands for Facebook AI Similarity Search. It is a library developed by Facebook's AI team for large-scale similarity retrieval. It is written in C++ with a Python interface, and can reach millisecond-level retrieval on indexes of around a billion vectors. Faiss wraps our own set of candidate vectors in an index, which speeds up TopK retrieval of similar vectors. Some index types can also be built on the GPU, which gives a further substantial speed-up.

3. Quick start

Using Faiss generally takes three steps:

  • step1: Construct the vector library. Vectors can be built by averaging word vectors, or by taking sentence vectors directly from a pre-trained model such as BERT.
import numpy as np
d = 64                                           # Vector Dimension
nb = 100000                                      # The amount of data in the index vector database
nq = 10000                                       # Number of queries to retrieve
np.random.seed(1234)             
xb = np.random.random((nb, d)).astype('float32')   # Vectors of the index vector library
xq = np.random.random((nq, d)).astype('float32')   # The query vector to be retrieved
  • step2: Build the index and add the vectors to it. Here we use the brute-force index IndexFlatL2, whose similarity measure is the L2 (Euclidean) distance. Note that Faiss reports squared L2 distances.
import faiss          
index = faiss.IndexFlatL2(d)    # Dimension d of vector must be specified when creating index
print(index.is_trained)         # True: this index type needs no training, vectors can be added directly
index.add(xb)                   # Add the database vectors to the index
print(index.ntotal)             # Total number of vectors in the index: 100000
  • step3: Retrieve the TopK most similar vectors for each query
k = 4                      # K in TopK
# xq is the matrix of query vectors. For each query, I holds the indices of its
# k nearest vectors and D holds the corresponding distances.
D, I = index.search(xq, k)
print(I[:5])
print(D[:5])

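The first matrix (I) holds the indices of the nearest vectors and the second matrix (D) holds the corresponding distances. The sample output below has 0, 1, 2, 3, 4 in the first column with a distance of zero to match, because the distance from a vector to itself is zero. Output of this form is what you get when the first five database vectors are themselves used as queries, e.g. index.search(xb[:5], k); with the random queries xq above, the values will differ.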

[[  0 393 363  78] 
 [  1 555 277 364] 
 [  2 304 101  13] 
 [  3 173  18 182] 
 [  4 288 370 531]]  
 
[[ 0.          7.17517328  7.2076292   7.25116253]  
 [ 0.          6.32356453  6.6845808   6.79994535]  
 [ 0.          5.79640865  6.39173603  7.28151226]  
 [ 0.          7.27790546  7.52798653  7.66284657]  
 [ 0.          6.76380348  7.29512024  7.36881447]]

tips 1
In practice, indexes are often built with the faiss.index_factory method, which supports almost all index types. The index in step 2 above can be built as follows:

dim, measure = 64, faiss.METRIC_L2
param = 'Flat'
index = faiss.index_factory(dim, param, measure)

dim: the dimension of the vectors
param: a string describing which type of index to build
measure: the distance measure. The two commonly used ones are Euclidean (L2) distance, faiss.METRIC_L2, and inner product, faiss.METRIC_INNER_PRODUCT. To rank by cosine similarity, normalize the vectors and use the inner product measure.
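The cosine-similarity remark can be checked numerically. The sketch below uses plain NumPy and a hypothetical helper l2_normalize to show that the inner product of L2-normalized vectors equals the cosine similarity of the originals, which is why an inner-product index over normalized vectors ranks by cosine.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((5, 8)).astype('float32')   # stand-in for query vectors
b = rng.random((7, 8)).astype('float32')   # stand-in for database vectors

def l2_normalize(x):
    """Scale each row to unit L2 length (returns a copy)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Cosine similarity computed from its definition
cosine = (a @ b.T) / (np.linalg.norm(a, axis=1)[:, None] * np.linalg.norm(b, axis=1)[None, :])
# Inner product of the normalized vectors: the quantity an inner-product index ranks by
inner = l2_normalize(a) @ l2_normalize(b).T

print(np.allclose(cosine, inner, atol=1e-5))
```

With Faiss itself, both the database vectors and the query vectors need the same normalization before add and search; the library provides faiss.normalize_L2, which normalizes a float32 matrix in place.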

tips 2
Some indexes can store integer IDs, so that each vector is added with an ID of your choosing. A query then returns the IDs and the similarities (or distances) of the similar vectors. If no IDs are given, vectors are numbered from 0 in the order they were added. IndexFlatL2 does not support custom IDs by itself, but this can be achieved with IDMap, as follows:

ids = [2,10, 100,...]
ids = np.array(ids, dtype='int64')             # Faiss expects 64-bit integer IDs
index = faiss.index_factory(768, "IDMap,Flat")
index.add_with_ids(save_embedding, ids)        # save_embedding is the vector library; ids gives one ID per vector

The following approach is equivalent: wrap an existing index in IndexIDMap

index = faiss.IndexFlatL2(d)
ids = np.arange(100000, 200000)  
index2 = faiss.IndexIDMap(index)
index2.add_with_ids(xb, ids)
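Conceptually, an ID map only translates the sequential positions returned by the underlying index into the IDs supplied at add time. The NumPy sketch below does that translation by hand; the arrays ids_demo and I_pos are made-up illustrative values, and -1 is treated as "no result", which is how Faiss marks a missing neighbour.

```python
import numpy as np

# Hypothetical external IDs, one per stored vector (position 0 -> ID 100000, and so on)
ids_demo = np.arange(100000, 100010, dtype='int64')
# Hypothetical search result: positions of the nearest vectors, -1 where nothing was found
I_pos = np.array([[0, 3, 9], [5, -1, -1]], dtype='int64')

# Look up the ID for every valid position and keep -1 for missing results
I_ids = np.where(I_pos >= 0, ids_demo[np.clip(I_pos, 0, None)], -1)
print(I_ids)
```

This manual lookup can be handy when the IDs live outside the index, for example as database keys.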

4. Advantages, disadvantages, and use cases of common Faiss indexes

4.1 Flat: brute-force retrieval

  • Advantages: The most accurate of all Faiss indexes, with the highest recall, bar none.
  • Disadvantages: Slow, with a large memory footprint
  • Usage: Small candidate sets (fewer than about 500,000 vectors) when memory is not tight
    Construction method:
dim, measure = 64, faiss.METRIC_L2
param =  'Flat'
index = faiss.index_factory(dim, param, measure)
print(index.is_trained)                            # True: Flat needs no training
index.add(xb)                                      # Add the vectors to the index
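Because Flat is exact, its results are a natural ground truth for measuring the recall of the other index types discussed here. A minimal sketch follows, assuming a hypothetical helper recall_at_k and made-up result matrices in place of real search output.

```python
import numpy as np

def recall_at_k(I_approx, I_exact):
    """Fraction of the exact top-k neighbours that the approximate result also returned."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(I_approx.tolist(), I_exact.tolist()))
    return hits / I_exact.size

I_exact_demo = np.array([[0, 5, 7], [1, 2, 9]])    # hypothetical ground truth from a Flat index
I_approx_demo = np.array([[0, 7, 3], [1, 2, 9]])   # hypothetical result from an approximate index
recall = recall_at_k(I_approx_demo, I_exact_demo)
print(recall)   # 5 of the 6 true neighbours were found
```

In practice the two matrices would be the I outputs of index.search on the same queries, once with a Flat index and once with the index under test.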

4.2 IVFx Flat: inverted-file brute-force retrieval

  • Advantages: Uses an inverted file (IVF) to narrow down the brute-force search, so it is faster than Flat
  • Disadvantages: Still not very fast, and recall drops compared with Flat
  • Usage: Same as Flat
  • Parameter: x in IVFx is the number of k-means cluster centers
    Construction method:
dim, measure = 64, faiss.METRIC_L2 
param =  'IVF100,Flat'                           # 100 k-means cluster centers
index = faiss.index_factory(dim, param, measure)
print(index.is_trained)                          # False: the inverted index must first be trained with k-means
index.train(xb)                                  # So train the index first, then add the vectors
index.add(xb)             
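The idea behind the inverted file can be sketched in NumPy: cluster the database with k-means, file each vector under its nearest centroid, and at query time scan only the lists of the closest centroids. All function names and sizes below are illustrative assumptions, and the k-means is a bare-bones version, not the one Faiss uses.

```python
import numpy as np

def sq_dists(x, y):
    """Pairwise squared L2 distances between rows of x and rows of y."""
    return ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)

def kmeans(x, n_clusters, n_iter=10, seed=0):
    """Plain Lloyd's k-means; returns the centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        assign = sq_dists(x, centroids).argmin(axis=1)
        for c in range(n_clusters):
            members = x[assign == c]
            if len(members):               # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids

def ivf_search(q, xb, centroids, lists, nprobe, k):
    """Scan only the nprobe inverted lists whose centroids are closest to q."""
    probe = sq_dists(q[None, :], centroids)[0].argsort()[:nprobe]
    cand = np.concatenate([lists[c] for c in probe])
    d = sq_dists(q[None, :], xb[cand])[0]
    order = d.argsort()[:k]
    return d[order], cand[order]

rng = np.random.default_rng(1)
xb_ivf = rng.random((2000, 8)).astype('float32')
centroids = kmeans(xb_ivf, n_clusters=20)                 # "train" step
assign = sq_dists(xb_ivf, centroids).argmin(axis=1)       # "add" step: file each vector
lists = [np.flatnonzero(assign == c) for c in range(20)]  # one inverted list per centroid

q = xb_ivf[42]
D_few, I_few = ivf_search(q, xb_ivf, centroids, lists, nprobe=1, k=4)
D_all, I_all = ivf_search(q, xb_ivf, centroids, lists, nprobe=20, k=4)  # probing every list is an exact search
print(I_few, I_all)
```

Probing fewer lists is faster but can miss true neighbours that fall in an unprobed cluster, which is where the recall loss comes from. In Faiss the number of probed lists is the nprobe attribute of an IVF index (default 1), so raising index.nprobe trades speed for recall.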

4.3 HNSWx: graph-based retrieval

  • Advantages: An improved graph-based retrieval method. Retrieval is fast, returning results within seconds even at the billion-vector scale, and recall is almost comparable to Flat, reaching about 97%. Search time grows only slowly with the number of vectors n (sub-linearly, roughly logarithmically), so the size of the candidate set has little effect on latency. Vectors can also be added in batches, which makes it well suited to online tasks that need millisecond-level responses.
  • Disadvantages: Building the index is extremely slow and uses a lot of memory (the most of any Faiss index, more than the original vectors themselves occupy)
  • Parameters: x in HNSWx is the maximum number of neighbors connected to each node when the graph is built. The larger x is, the more complex the graph and the more accurate the queries, but the slower the index construction. x can be any integer from 4 to 64.
  • Usage: When memory is not a concern and there is plenty of time to build the index

Construction method:

dim, measure = 64, faiss.METRIC_L2   
param =  'HNSW64' 
index = faiss.index_factory(dim, param, measure)  
print(index.is_trained)                          # The output is True at this time 
index.add(xb)

4.4 PQx: Product Quantization

  • Advantages: Product quantization improves on ordinary k-means: each vector is cut into x segments, and k-means is run on each segment separately. It is therefore fast, uses little memory, and has relatively high recall.
  • Disadvantages: Recall is much lower than with brute-force retrieval.
  • Usage: When memory is extremely scarce, fast retrieval is needed, and recall matters less
  • Parameter: x in PQx is the number of segments each vector is cut into, so the vector dimension must be divisible by x. The larger x is, the finer the segmentation and the higher the time cost.

Construction method:

dim, measure = 64, faiss.METRIC_L2 
param =  'PQ16' 
index = faiss.index_factory(dim, param, measure)
print(index.is_trained)                          # False: the PQ codebooks must first be trained with k-means
index.train(xb)                                  # So train the index first, then add the vectors
index.add(xb)   
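The segment-wise k-means described above can be sketched in NumPy as follows. This is an illustration of the idea only: the helper names, the tiny k-means, and the sizes (4 segments, 32 centroids per segment) are assumptions chosen to keep the example fast, whereas Faiss's PQx uses 8 bits, that is 256 centroids, per segment by default.

```python
import numpy as np

def sq_dists(x, y):
    """Pairwise squared L2 distances between rows of x and rows of y."""
    return ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)

def kmeans(x, n_clusters, n_iter=10, seed=0):
    """Plain Lloyd's k-means; returns the centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        assign = sq_dists(x, centroids).argmin(axis=1)
        for c in range(n_clusters):
            members = x[assign == c]
            if len(members):               # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids

def pq_train(x, m, ksub):
    """Train one codebook of ksub centroids for each of the m segments."""
    dsub = x.shape[1] // m
    return [kmeans(x[:, j * dsub:(j + 1) * dsub], ksub, seed=j) for j in range(m)]

def pq_encode(x, codebooks):
    """Replace each segment by the index of its nearest centroid."""
    dsub = x.shape[1] // len(codebooks)
    codes = [sq_dists(x[:, j * dsub:(j + 1) * dsub], cb).argmin(axis=1)
             for j, cb in enumerate(codebooks)]
    return np.stack(codes, axis=1).astype('uint8')

def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the selected centroids."""
    return np.hstack([cb[codes[:, j]] for j, cb in enumerate(codebooks)])

rng = np.random.default_rng(2)
x_pq = rng.random((2000, 16)).astype('float32')
m, ksub = 4, 32                     # 4 segments of 4 dimensions, 32 centroids each
assert x_pq.shape[1] % m == 0       # the dimension must be divisible by the number of segments
codebooks = pq_train(x_pq, m, ksub)
codes = pq_encode(x_pq, codebooks)
recon = pq_decode(codes, codebooks)
print(codes.shape, codes.dtype)     # 4 one-byte codes per vector instead of 16 float32 values
print(float(((x_pq - recon) ** 2).sum(axis=1).mean()))   # mean squared reconstruction error
```

The memory saving comes from storing only the small integer codes; the recall loss comes from the reconstruction error, since distances are computed against the approximated vectors.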

4.5 IVFxPQy: inverted product quantization

  • Advantages: Widely used in industry; every indicator is at an acceptable level
  • Disadvantages: It combines the strengths of several methods, and naturally inherits their weaknesses as well
  • Usage: Same as PQx
  • Parameter: In IVFxPQy, x and y have the same meaning as in IVFx and PQy above
    Construction method:
dim, measure = 64, faiss.METRIC_L2  
param =  'IVF100,PQ16'
index = faiss.index_factory(dim, param, measure)
print(index.is_trained)                          # False: the inverted index and the PQ codebooks require k-means training
index.train(xb)                                  # So train the index first, then add the vectors
index.add(xb)
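The memory remarks in this section can be turned into rough numbers. The sketch below uses approximate per-vector sizes (raw float32 storage for Flat, raw vector plus about 2*M 32-bit links for HNSW, M codes of 8 bits for PQ, plus a 64-bit ID for IVF lists). These are back-of-the-envelope assumptions that ignore codebooks, centroids, and other fixed overhead, and bytes_per_vector is a hypothetical helper, not a Faiss function.

```python
import math

def bytes_per_vector(kind, d, M=0, nbits=8):
    """Rough per-vector storage in bytes; ignores codebooks, centroids and other fixed overhead."""
    if kind == "Flat":
        return 4 * d                          # d float32 components
    if kind == "HNSW":
        return 4 * d + M * 2 * 4              # raw vector plus about 2*M int32 links in the base layer
    if kind == "PQ":
        return math.ceil(M * nbits / 8)       # M codes of nbits bits each
    if kind == "IVFPQ":
        return math.ceil(M * nbits / 8) + 8   # PQ code plus a 64-bit vector ID
    raise ValueError(kind)

d_demo, n_demo = 64, 100000
for kind, M in [("Flat", 0), ("HNSW", 64), ("PQ", 16), ("IVFPQ", 16)]:
    total_mb = bytes_per_vector(kind, d_demo, M) * n_demo / 1e6
    print(f"{kind:6s} {total_mb:8.1f} MB")
```

For the 64-dimensional, 100,000-vector example used throughout, this puts HNSW64 at roughly three times the raw vectors and PQ16 at about one sixteenth of them, consistent with the qualitative rankings given above.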

For more index types, see reference [3].

5. Using the GPU

Reference: Running on GPUs (Faiss wiki)

5.1 Using a single GPU

res = faiss.StandardGpuResources()  # Declare GPU resources
# Build a flat (CPU) index
index_flat = faiss.IndexFlatL2(d)
# Move the CPU index to GPU 0
gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, index_flat)
# The remaining steps are the same as on the CPU
gpu_index_flat.add(xb)         # add vectors to the index
print(gpu_index_flat.ntotal)
k = 4                          # we want to see 4 nearest neighbors
D, I = gpu_index_flat.search(xq, k)  # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])                  # neighbors of the 5 last queries

5.2 Using multiple GPUs

ngpus = faiss.get_num_gpus()
print("number of GPUs:", ngpus)

cpu_index = faiss.IndexFlatL2(d)
gpu_index = faiss.index_cpu_to_all_gpus(cpu_index) # build the index

gpu_index.add(xb)              # add vectors to the index
print(gpu_index.ntotal)

k = 4                          # we want to see 4 nearest neighbors
D, I = gpu_index.search(xq, k) # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])                  # neighbors of the 5 last queries

References

Introduction and examples of faiss
Faiss Process and Principle Analysis
Faiss Getting Started and Application Experience Record

Posted by hjunw on Sat, 04 Sep 2021 09:27:17 -0700