Faiss

0.faiss Mac m1 install源码安装？

master分支若不支持arm64，则下载使用下面 pr：matsui528:workaround-for-aarch64-gcc源码git clone -b branch_name url分支下载

https://github.com/facebookresearch/faiss/pull/1882

https://github.com/facebookresearch/faiss/wiki/Installing-Faiss#compiling-faiss-on-arm
$ brew install llvm
$ brew install swig
$ cd faiss
$ cmake -B build -DFAISS_ENABLE_GPU=OFF -DFAISS_ENABLE_C_API=ON -DBUILD_SHARED_LIBS=ON -DBUILD_TESTING=OFF -DFAISS_ENABLE_PYTHON=ON .
$ cmake  -B build -DCMAKE_CXX_COMPILER=clang++ -DFAISS_ENABLE_GPU=OFF  -DPython_EXECUTABLE=$(which python3) -DFAISS_OPT_LEVEL=generic -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=ON
$ make -C build -j faiss
$ make -C build -j swigfaiss
$ cd build/faiss/python/ && python3 setup.py build
$ export PYTHONPATH=$PWD/build/faiss/python/build/lib/

1.faiss add是增量增加吗？

2.faiss index训练是在每次add新数据，都需要重新训练吗？

https://github.com/facebookresearch/faiss/wiki

https://github.com/facebookresearch/faiss

推荐召回算法-6如何从十亿数据找出你喜欢的内容-向量检索技术

使用 Faiss 进行海量特征的相似度匹配

1.生成随机数据

Faiss处理固定维数d的向量集合，通常为 10 到 100 秒。这些集合可以存储在矩阵中。
我们假设行优先存储，即。向量编号 i 的第 j 个分量存储在矩阵的第 i 行、第 j 列中。 Faiss仅使用32位浮点矩阵。

xb用于数据库，它包含所有必须编入索引的向量，我们将在其中进行搜索。大小nb * d
xq用于查询向量，我们需要找到最近的邻居。大小nq * d，单个查询向量nq = 1

import faiss
import numpy as np

d = 64              # 向量维度
nb = 1000           # 数据库大小
nq = 10             # 待查询query大小
np.random.seed(42)  # 可复现

xb = np.random.random((nb, d)).astype('float32')  # 均匀分布，数据类型必须float32
xb[:, 0] += np.arange(nb) / 1000                  # 给向量的第0维（即，索引维度）增加一个小的偏移量

xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000

print(xb.shape, xq.shape)

(1000, 64) (10, 64)

2.创建索引并添加数据

Faiss是围绕Index对象构建的。它封装了一组数据库向量，并可选择对它们进行预处理以提高搜索效率。
索引有很多种，我们将使用最简单的版本，只对它们执行暴力L2距离搜索：IndexFlatL2。
所有索引在创建时都需要知道，操作向量的维数，即d。然后，大多数索引还需要一个训练阶段，以分析向量的分布。
对于全量暴力搜索IndexFlatL2，可以跳过这个操作。
Index创建和训练后，对其都可以执行两个操作：add和search。
index.add向索引添加向量数据，index.is_traned查看索引是否已经训练，index.ntotal索引向量数量
一些索引还可以存储与每个向量对应的整数ID（但不是IndexFlatL2）。如果未提供ID，则add仅使用向量序号作为ID。即，第一个向量为 0，第二个为 1，依此类推。

index = faiss.IndexFlatL2(d)    # 创建索引
print(index.is_trained)
index.add(xb)
print(index.ntotal)

True
1000

3.搜索

可以对索引执行的基本搜索操作是k-最近邻搜索，即。对于每个查询向量，在数据库中找到它的k个最近的邻居。
这个操作的结果可以方便地存储在一个大小为nq * k的整数矩阵中，其中第i行包含查询向量i的邻居的ID，按距离增加排序。除了这个矩阵之外，搜索操作还返回一个具有相应平方距离的nq * k浮点矩阵。(ID矩阵和距离矩阵)

k = 4
D, I = index.search(xb[:5], k)
print(I)
print(D)

[[  0 274 495 509]
 [  1 860 391 676]
 [  2 278 197 144]
 [  3 400 781 106]
 [  4 505  36 483]]
[[0.        6.9914494 7.450281  7.777972 ]
 [0.        7.223036  7.5612698 7.563117 ]
 [0.        6.3398957 7.187016  7.2561207]
 [0.        6.401422  6.5932446 7.007648 ]
 [0.        5.6738653 6.231701  6.7210436]]

D, I = index.search(xq[:5], k)
print(I)
print(D)

[[502  85 606 234]
 [444 699 636 436]
 [432 148 629 524]
 [629  38 515 945]
 [303  33 919 307]]
[[6.813346  6.9684916 7.0068817 7.0398655]
 [6.5172014 6.7739353 6.8253922 6.8459854]
 [7.1742516 8.173426  8.238693  8.29656  ]
 [6.794969  6.981325  7.088125  7.1704335]
 [5.7955394 6.356456  7.205101  7.23569  ]]

4.更快速搜索

nlist = 10
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
print(index.is_trained)
index.train(xb)
print(index.is_trained)

False
True

index.add_with_ids(xb, np.arange(1000, 2000))

index.search(xb[:1], 5)

(array([[0.       , 7.7962465, 8.162846 , 8.279927 , 8.418131 ]],
       dtype=float32),
 array([[1000, 1318, 1541, 1225, 1121]]))

index.add(xb)
# index.nprobe = 5    # 默认为1
D, I = index.search(xq[:5], k)
print(I)
print(D)

[[466 205 137 147]
 [436 371 220 606]
 [629 524 318 402]
 [515 945 733 891]
 [303 606 341 473]]
[[7.2691383 7.389188  7.9478083 7.973056 ]
 [6.8459854 6.9232197 7.1473436 7.42725  ]
 [8.238693  8.29656   8.312122  8.395692 ]
 [7.088125  7.1704335 7.4821615 7.6420956]
 [5.7955394 7.6942344 8.055542  8.17235  ]]

Comments

Faiss

1.生成随机数据

2.创建索引并添加数据

3.搜索

4.更快速搜索

Comments

Related Posts

Published

Category

Tags

Contact