Vector retrieval

Nearest neighbor algorithms

KNN (k-nearest neighbors)

ANN (approximate nearest neighbors)

LSH (locality-sensitive hashing)
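To make the difference between the three terms concrete, here is a minimal sketch in plain Python. It assumes cosine similarity as the metric and uses random-hyperplane hashing as one common LSH family; the function names, the dimension, and the number of hyperplanes are illustrative choices, not taken from any library. Exact KNN scores every stored vector, while the LSH variant only scores the vectors that share the query's hash bucket, so it is faster but may miss true neighbors.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_exact(query, vectors, k):
    """Exact KNN: score every vector, return indices of the k most similar."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine_similarity(query, vectors[i]),
                    reverse=True)
    return ranked[:k]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def build_lsh_index(vectors, hyperplanes):
    """Group vector indices into buckets keyed by their signature."""
    buckets = {}
    for i, v in enumerate(vectors):
        buckets.setdefault(lsh_signature(v, hyperplanes), []).append(i)
    return buckets


def knn_lsh(query, vectors, buckets, hyperplanes, k):
    """Approximate KNN: only score the vectors in the query's bucket."""
    candidates = buckets.get(lsh_signature(query, hyperplanes), [])
    ranked = sorted(candidates,
                    key=lambda i: cosine_similarity(query, vectors[i]),
                    reverse=True)
    return ranked[:k]


# Illustrative data: 1000 random 16-dimensional vectors, 6 random hyperplanes.
rng = random.Random(0)
dim, n_vectors, n_planes = 16, 1000, 6
vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_vectors)]
hyperplanes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
buckets = build_lsh_index(vectors, hyperplanes)

query = vectors[0]
exact = knn_exact(query, vectors, k=5)
approx = knn_lsh(query, vectors, buckets, hyperplanes, k=5)
```

With 6 hyperplanes there are at most 64 buckets, so the approximate search scores only a fraction of the data. Production systems typically use several hash tables to recover the neighbors a single table misses.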

Evaluating nearest neighbor search: speed, memory, accuracy
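Accuracy of an approximate search is commonly reported as recall@k: the fraction of the true top-k neighbors that the approximate result contains. The sketch below shows that measure alongside a simple wall-clock timer for speed. The helper names `recall_at_k` and `timed`, and the ID lists, are made up for illustration.

```python
import time


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


exact_ids = [3, 7, 1, 9]    # ground truth from an exact search (made-up IDs)
approx_ids = [3, 7, 1, 4]   # what a hypothetical ANN index returned
recall = recall_at_k(approx_ids, exact_ids)
print(recall)  # 0.75

# Any search function can be timed the same way; sorted() is a stand-in here.
result, seconds = timed(sorted, [5, 2, 8])
```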

Similarity search is the core of a vector database.

Faiss

https://github.com/facebookresearch/faiss

https://github.com/DataIntelligenceCrew/go-faiss

OpenAI

Getting OpenAI embeddings: https://github.com/sashabaranov/go-openai/blob/master/embeddings.go

Vector database

https://guangzhengli.com/blog/zh/vector-database/

Other

A word embedding is a word vector.

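Retrieve the relevant vectors first, then pass only the matching content to the model; this reduces the amount of irrelevant content the model has to process.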

Videos

YouTube

How to Choose a Vector Database in 2023: https://www.youtube.com/watch?v=aX_hdQEintc

IVF, HNSW

The Pinecone Vector Database System: https://www.youtube.com/watch?v=8LXotdzX_84

CMU MLDB: https://db.cs.cmu.edu/seminar2023/

Bilibili

向量数据库技术鉴赏 (An Appreciation of Vector Database Technology)

https://www.bilibili.com/video/BV1BM4y177Dk

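With 128 dimensions and a 32-bit float per dimension, one vector takes 128 × 32 = 4096 bits = 512 bytes, so 10 million vectors take 10,000,000 × 512 bytes = 5.12 × 10^9 bytes ≈ 4.77 GiB.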
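The same arithmetic as a small script, which makes it easy to try other dimensions or vector counts. The function name `index_memory_bytes` is illustrative, and the result covers raw vector storage only, not any index structure built on top.

```python
def index_memory_bytes(num_vectors, dim, bits_per_component=32):
    """Raw storage for num_vectors vectors of dim components each."""
    bytes_per_vector = dim * bits_per_component // 8
    return num_vectors * bytes_per_vector


bytes_per_vector = index_memory_bytes(1, 128)
total_bytes = index_memory_bytes(10_000_000, 128)

print(bytes_per_vector)               # 512
print(total_bytes)                    # 5120000000
print(f"{total_bytes / 2**30:.2f}")   # 4.77 (GiB)
```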

The curse of dimensionality

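Three dimensions matter for approximate nearest neighbor search: speed, quality, and memory overhead. Memory cost is felt by developers, while quality and speed are felt by users. Graph-based algorithms trade memory for the other two: NSW (Navigable Small World) and HNSW (Hierarchical Navigable Small World).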
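The graph idea can be sketched in a few lines. This is a simplified illustration, not the NSW algorithm as published: neighbors are chosen by brute force at insertion time (NSW uses a graph search there), the search is a single greedy walk from a fixed entry point, and there are no layers as in HNSW. The names `build_graph` and `greedy_search` and the parameter `m` are illustrative. The stored edges are the extra memory such an index pays for.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_graph(points, m):
    """Insert points one by one, linking each to its m nearest earlier points.

    Simplification: neighbors are found by brute force here.
    """
    neighbors = {i: set() for i in range(len(points))}
    for i in range(1, len(points)):
        nearest = sorted(range(i),
                         key=lambda j: distance(points[i], points[j]))[:m]
        for j in nearest:
            neighbors[i].add(j)
            neighbors[j].add(i)  # undirected edge
    return neighbors


def greedy_search(points, neighbors, query, entry=0):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    current_dist = distance(points[current], query)
    while True:
        best, best_dist = current, current_dist
        for n in neighbors[current]:
            d = distance(points[n], query)
            if d < best_dist:
                best, best_dist = n, d
        if best == current:
            # No neighbor is closer: a local minimum, returned as the answer.
            return current, current_dist
        current, current_dist = best, best_dist


rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(200)]
neighbors = build_graph(points, m=4)

query = (0.5, 0.5)
found, found_dist = greedy_search(points, neighbors, query)

# Every edge is stored in addition to the vectors themselves.
edge_count = sum(len(s) for s in neighbors.values()) // 2
```

The walk can stop at a local minimum that is not the true nearest neighbor, which is why the result is approximate. Restarting from several entry points or keeping a candidate list are common ways to improve quality at the cost of speed.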

OpenAI Embeddings and Vector Databases Crash Course

Data passed through an embedding model becomes an embedding.

LLMs and Prompt Engineering

Cloud vendors

Tencent OLAMA: https://cloud.tencent.com/document/product/1709/94948

Papers

Product quantization for nearest neighbor search: https://inria.hal.science/inria-00514462v2/document

Approximate nearest neighbor algorithm based on navigable small world graphs

Three and a half degrees of separation

Billion-scale similarity search with GPUs: the Faiss paper

向量检索算法一览与业界最新进展 (An Overview of Vector Retrieval Algorithms and the Latest Industry Progress)

https://km.woa.com/articles/show/595478

https://ann-benchmarks.com/index.html

ES

Keyword search vs. semantic search

https://www.elastic.co/what-is/ https://www.elastic.co/cn/what-is/

https://www.elastic.co/what-is/vector-search

https://www.elastic.co/what-is/semantic-search

https://dbdb.io/browse?tag=nearest-neighbor-search