使用向量搜索来搜索嵌入

本教程介绍如何使用 VECTOR_SEARCH 函数以及可选的向量索引对存储在 BigQuery 表中的嵌入执行相似性搜索

VECTOR_SEARCH 与向量索引搭配使用时,VECTOR_SEARCH 会使用近似最近邻方法来提高向量搜索性能,虽然降低召回率,但可返回更接近的结果。如果没有向量索引,VECTOR_SEARCH 会使用暴力搜索来衡量每条记录的距离。

所需权限

如需运行本教程,您需要拥有以下 Identity and Access Management (IAM) 权限:

  • 如需创建数据集,您需要拥有 bigquery.datasets.create 权限。
  • 如需创建表,您需要拥有以下权限:

    • bigquery.tables.create
    • bigquery.tables.updateData
    • bigquery.jobs.create
  • 如需创建向量索引,您需要拥有要创建索引的表的 bigquery.tables.createIndex 权限。

  • 如需删除向量索引,您需要拥有要删除索引的表的 bigquery.tables.deleteIndex 权限。

以下每个预定义的 IAM 角色都包含使用向量索引所需的权限:

  • BigQuery Data Owner (roles/bigquery.dataOwner)
  • BigQuery Data Editor (roles/bigquery.dataEditor)

费用

VECTOR_SEARCH 函数使用 BigQuery 计算价格。您需要使用按需价格或版本价格为相似性搜索付费。

  • 按需:您需要为在基本表、索引和搜索查询中扫描的字节数付费。
  • 版本价格:您需要为在预留版本中完成作业所需的槽付费。规模更大、更复杂的相似度计算会产生更多费用。

如需了解详情,请参阅 BigQuery 价格

准备工作

  1. In the Trusted Cloud console, on the project selector page, select or create a Trusted Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

    Go to project selector

  2. Verify that billing is enabled for your Trusted Cloud project.

  3. Enable the BigQuery API.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the API

创建数据集

创建 BigQuery 数据集:

  1. 在 Trusted Cloud 控制台中,前往 BigQuery 页面。

    转到 BigQuery 页面

  2. 探索器窗格中,点击您的项目名称。

  3. 点击 查看操作 > 创建数据集

    创建一个数据集来包含教程中使用的对象。

  4. 创建数据集 页面上,执行以下操作:

    • 数据集 ID 部分,输入 vector_search

    • 位置类型部分,选择多区域,然后选择 US (multiple regions in United States)(美国[美国的多个区域])。

      公共数据集存储在 US 多区域中。为简单起见,请将数据集存储在同一位置。

    • 保持其余默认设置不变,然后点击创建数据集

创建测试表

  1. 基于 Google 专利公共数据集的子集创建包含专利嵌入的 patents 表:

    CREATE TABLE vector_search.patents AS
    SELECT * FROM `patents-public-data.google_patents_research.publications`
    WHERE ARRAY_LENGTH(embedding_v1) > 0
     AND publication_number NOT IN ('KR-20180122872-A')
    LIMIT 1000000;
  2. 创建包含要查找最近邻的专利嵌入的 patents2 表:

    CREATE TABLE vector_search.patents2 AS
    SELECT * FROM `patents-public-data.google_patents_research.publications`
    WHERE publication_number = 'KR-20180122872-A';

创建向量索引

  1. patents 表的 embeddings_v1 列上创建 my_index 向量索引:

    CREATE OR REPLACE VECTOR INDEX my_index ON vector_search.patents(embedding_v1)
    STORING(publication_number, title)
    OPTIONS(distance_type='COSINE', index_type='IVF');
  2. 等待几分钟,以便向量索引创建完成,然后运行以下查询并确认 coverage_percentage 值为 100

    SELECT * FROM vector_search.INFORMATION_SCHEMA.VECTOR_INDEXES;

使用 VECTOR_SEARCH 函数和索引

创建并填充向量索引后,使用 VECTOR_SEARCH 函数在 patents2 表的 embedding_v1 列中查找嵌入的最近邻。此查询在搜索中使用向量索引,因此 VECTOR_SEARCH 使用近似最近邻方法查找嵌入的最近邻。

VECTOR_SEARCH 函数与索引搭配使用

SELECT query.publication_number AS query_publication_number,
  query.title AS query_title,
  base.publication_number AS base_publication_number,
  base.title AS base_title,
  distance
FROM
  VECTOR_SEARCH(
    TABLE vector_search.patents,
    'embedding_v1',
    TABLE vector_search.patents2,
    top_k => 5,
    distance_type => 'COSINE',
    options => '{"fraction_lists_to_search": 0.005}');

结果类似于以下内容:

+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
| query_publication_number |                         query_title                         | base_publication_number |                                                        base_title                                                        |      distance       |
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-106599080-B          | A kind of rapid generation for keeping away big vast transfer figure based on GIS                                        | 0.14471956347590609 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-114118544-A          | Urban waterlogging detection method and device                                                                           | 0.17472108931171348 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | KR-20200048143-A        | Method and system for mornitoring dry stream using unmanned aerial vehicle                                               | 0.17561990745619782 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | KR-101721695-B1         | Urban Climate Impact Assessment method of Reflecting Urban Planning Scenarios and Analysis System using the same         | 0.17696129365559843 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-109000731-B          | The experimental rig and method that research inlet for stom water chocking-up degree influences water discharged amount | 0.17902723269642917 |
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+

使用 VECTOR_SEARCH 函数和暴力破解

使用 VECTOR_SEARCH 函数在 patents2 表的 embedding_v1 列中查找嵌入的最近邻。此查询在搜索中不使用向量索引,因此 VECTOR_SEARCH 会查找嵌入的精确最近邻。

SELECT query.publication_number AS query_publication_number,
  query.title AS query_title,
  base.publication_number AS base_publication_number,
  base.title AS base_title,
  distance
FROM
  VECTOR_SEARCH(
    TABLE vector_search.patents,
    'embedding_v1',
    TABLE vector_search.patents2,
    top_k => 5,
    distance_type => 'COSINE',
    options => '{"use_brute_force":true}');

结果类似于以下内容:

+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
| query_publication_number |                         query_title                         | base_publication_number |                                                        base_title                                                        |      distance       |
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-106599080-B          | A kind of rapid generation for keeping away big vast transfer figure based on GIS                                        |  0.1447195634759062 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-114118544-A          | Urban waterlogging detection method and device                                                                           |  0.1747210893117136 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | KR-20200048143-A        | Method and system for mornitoring dry stream using unmanned aerial vehicle                                               | 0.17561990745619782 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | KR-101721695-B1         | Urban Climate Impact Assessment method of Reflecting Urban Planning Scenarios and Analysis System using the same         | 0.17696129365559843 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-109000731-B          | The experimental rig and method that research inlet for stom water chocking-up degree influences water discharged amount | 0.17902723269642928 |
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+

评估召回率

使用索引执行向量搜索时,它会返回近似结果,代价是召回率降低。您可以通过将向量搜索和索引返回的结果与向量搜索和暴力破解返回的结果进行比较来计算召回率。在此数据集中,publication_number 值唯一标识专利,因此用于比较。

WITH approx_results AS (
  SELECT query.publication_number AS query_publication_number,
    base.publication_number AS base_publication_number
  FROM
    VECTOR_SEARCH(
      TABLE vector_search.patents,
      'embedding_v1',
      TABLE vector_search.patents2,
      top_k => 5,
      distance_type => 'COSINE',
      options => '{"fraction_lists_to_search": 0.005}')
),
  exact_results AS (
  SELECT query.publication_number AS query_publication_number,
    base.publication_number AS base_publication_number
  FROM
    VECTOR_SEARCH(
      TABLE vector_search.patents,
      'embedding_v1',
      TABLE vector_search.patents2,
      top_k => 5,
      distance_type => 'COSINE',
      options => '{"use_brute_force":true}')
)

SELECT
  a.query_publication_number,
  SUM(CASE WHEN a.base_publication_number = e.base_publication_number THEN 1 ELSE 0 END) / 5 AS recall
FROM exact_results e LEFT JOIN approx_results a
  ON e.query_publication_number = a.query_publication_number
GROUP BY a.query_publication_number

如果召回率低于您的预期,您可以提高 fraction_lists_to_search 值,但缺点是延迟时间可能增加且资源使用率可能变高。如需调整向量搜索,您可以尝试使用不同的参数值多次运行 VECTOR_SEARCH,将结果保存到表中,然后比较结果。

清理

  1. In the Trusted Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.