ultralytics 8.0.236 dataset semantic & SQL search API (#7136)
Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com> Co-authored-by: Laughing-q <1182102784@qq.com>
parent 40a5c0abe7
commit aca8eb1fd4
27 changed files with 1749 additions and 192 deletions
297 docs/en/datasets/explorer/api.md Normal file
@@ -0,0 +1,297 @@
---
comments: true
description: Explore and analyze CV datasets with Ultralytics Explorer API, offering SQL, vector similarity, and semantic searches for efficient dataset insights.
keywords: Ultralytics Explorer API, Dataset Exploration, SQL Queries, Vector Similarity Search, Semantic Search, Embeddings Table, Image Similarity, Python API for Datasets, CV Dataset Analysis, LanceDB Integration
---

# Ultralytics Explorer API

## Introduction

The Explorer API is a Python API for exploring your datasets. It supports filtering and searching your dataset using SQL queries, vector similarity search and semantic search.

## Installation

Explorer depends on external libraries for some of its functionality. These are automatically installed on usage. To manually install these dependencies, use the following command:

```bash
pip install ultralytics[explorer]
```

## Usage

```python
from ultralytics import Explorer

# Create an Explorer object
explorer = Explorer(data='coco128.yaml', model='yolov8n.pt')

# Create embeddings for your dataset
explorer.create_embeddings_table()

# Search for similar images to a given image/images
dataframe = explorer.get_similar(img='path/to/image.jpg')

# Or search for similar images to a given index/indices
dataframe = explorer.get_similar(idx=0)
```

## 1. Similarity Search

Similarity search is a technique for finding images similar to a given image. It is based on the idea that similar images will have similar embeddings.
Once the embeddings table is built, you can run semantic search in any of the following ways:

- On a given index or list of indices in the dataset: `exp.get_similar(idx=[1, 10], limit=10)`
- On any image or list of images not in the dataset: `exp.get_similar(img=["path/to/img1", "path/to/img2"], limit=10)`

In case of multiple inputs, the aggregate of their embeddings is used.

You get a pandas dataframe with the `limit` number of most similar data points to the input, along with their distance in the embedding space. You can use this dataframe to perform further filtering.
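
As a rough sketch of the idea, not the Explorer internals: assuming the aggregate of multiple inputs is a simple mean of their embeddings, and using an illustrative `_distance` column name for the distances in the returned dataframe, the post-filtering step could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical aggregation of two query embeddings into one vector (assuming a mean)
emb1, emb2 = np.random.rand(256), np.random.rand(256)
query = (emb1 + emb2) / 2

# The search returns a dataframe including a distance column, which you can
# post-filter like any other pandas dataframe (column name illustrative only)
df = pd.DataFrame({"im_file": ["a.jpg", "b.jpg", "c.jpg"], "_distance": [0.12, 0.48, 0.91]})
close_matches = df[df["_distance"] < 0.5]
print(close_matches["im_file"].tolist())  # → ['a.jpg', 'b.jpg']
```

The embedding size, aggregation rule and column name above are assumptions for illustration; the dataframe filtering pattern is the part that carries over directly.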

!!! Example "Semantic Search"

    === "Using Images"

        ```python
        from ultralytics import Explorer

        # Create an Explorer object
        exp = Explorer(data='coco128.yaml', model='yolov8n.pt')
        exp.create_embeddings_table()

        similar = exp.get_similar(img='https://ultralytics.com/images/bus.jpg', limit=10)
        print(similar.head())

        # Search using multiple images; the aggregate of their embeddings is used
        similar = exp.get_similar(
            img=['https://ultralytics.com/images/bus.jpg',
                 'https://ultralytics.com/images/bus.jpg'],
            limit=10
        )
        print(similar.head())
        ```

    === "Using Dataset Indices"

        ```python
        from ultralytics import Explorer

        # Create an Explorer object
        exp = Explorer(data='coco128.yaml', model='yolov8n.pt')
        exp.create_embeddings_table()

        similar = exp.get_similar(idx=1, limit=10)
        print(similar.head())

        # Search using multiple indices
        similar = exp.get_similar(idx=[1, 10], limit=10)
        print(similar.head())
        ```

### Plotting Similar Images

You can also plot the similar images using the `plot_similar` method. This method takes the same arguments as `get_similar` and plots the similar images in a grid.

!!! Example "Plotting Similar Images"

    === "Using Images"

        ```python
        from ultralytics import Explorer

        # Create an Explorer object
        exp = Explorer(data='coco128.yaml', model='yolov8n.pt')
        exp.create_embeddings_table()

        plt = exp.plot_similar(img='https://ultralytics.com/images/bus.jpg', limit=10)
        plt.show()
        ```

    === "Using Dataset Indices"

        ```python
        from ultralytics import Explorer

        # Create an Explorer object
        exp = Explorer(data='coco128.yaml', model='yolov8n.pt')
        exp.create_embeddings_table()

        plt = exp.plot_similar(idx=1, limit=10)
        plt.show()
        ```

## 2. SQL Querying

You can run SQL queries on your dataset using the `sql_query` method. This method takes a SQL query as input and returns a pandas dataframe with the results.

!!! Example "SQL Query"

    ```python
    from ultralytics import Explorer

    # Create an Explorer object
    exp = Explorer(data='coco128.yaml', model='yolov8n.pt')
    exp.create_embeddings_table()

    df = exp.sql_query("WHERE labels LIKE '%person%' AND labels LIKE '%dog%'")
    print(df.head())
    ```

### Plotting SQL Query Results

You can also plot the results of a SQL query using the `plot_sql_query` method. This method takes the same arguments as `sql_query` and plots the results in a grid.

!!! Example "Plotting SQL Query Results"

    ```python
    from ultralytics import Explorer

    # Create an Explorer object
    exp = Explorer(data='coco128.yaml', model='yolov8n.pt')
    exp.create_embeddings_table()

    # Plot the images matching the query in a grid
    exp.plot_sql_query("WHERE labels LIKE '%person%' AND labels LIKE '%dog%'")
    ```

## 3. Working with Embeddings Table (Advanced)

You can also work with the embeddings table directly. Once the embeddings table is created, you can access it via the `Explorer.table` attribute.

!!! Tip

    Explorer works on [LanceDB](https://lancedb.github.io/lancedb/) tables internally. You can access this table directly, using the `Explorer.table` object, and run raw queries, push down pre- and post-filters, etc.

```python
from ultralytics import Explorer

exp = Explorer()
exp.create_embeddings_table()
table = exp.table
```

Here are some examples of what you can do with the table:

### Get Raw Embeddings

!!! Example

    ```python
    from ultralytics import Explorer

    exp = Explorer()
    exp.create_embeddings_table()
    table = exp.table

    embeddings = table.to_pandas()["vector"]
    print(embeddings)
    ```

### Advanced Querying with Pre- and Post-Filters

!!! Example

    ```python
    from ultralytics import Explorer

    exp = Explorer(model="yolov8n.pt")
    exp.create_embeddings_table()
    table = exp.table

    # Dummy embedding
    embedding = [i for i in range(256)]
    rs = table.search(embedding).metric("cosine").where("").limit(10)
    ```
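
The `.metric("cosine")` call above selects the distance metric used for the search. As a standalone plain-NumPy sketch of how the common metrics differ (this is not the Explorer or LanceDB API, just the underlying math):

```python
import numpy as np

# Two toy vectors at right angles to each other
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

# Euclidean (L2) distance
l2 = np.linalg.norm(a - b)

# Cosine distance: 1 minus the cosine similarity
cosine = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(l2, cosine)  # orthogonal vectors: l2 = sqrt(2), cosine distance = 1.0
```

Which metric is appropriate depends on whether the magnitude of your embeddings carries meaning; cosine distance ignores magnitude while L2 does not.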

### Create Vector Index

When using large datasets, you can also create a dedicated vector index for faster querying. This is done using the `create_index` method on the LanceDB table.

```python
table.create_index(num_partitions=..., num_sub_vectors=...)
```

Find more details on the types of vector indices available and their parameters [here](https://lancedb.github.io/lancedb/ann_indexes/#types-of-index). In the future, we will add support for creating vector indices directly from the Explorer API.

## 4. Embeddings Applications

You can use the embeddings table to perform a variety of exploratory analyses. Here are some examples:

### Similarity Index

Explorer comes with a `similarity_index` operation:

* It tries to estimate how similar each data point is to the rest of the dataset.
* It does that by counting how many image embeddings lie closer than `max_dist` to the current image in the generated embedding space, considering `top_k` similar images at a time.

It returns a pandas dataframe with the following columns:

* `idx`: Index of the image in the dataset
* `im_file`: Path to the image file
* `count`: Number of images in the dataset that are closer than `max_dist` to the current image
* `sim_im_files`: List of paths to the `count` similar images
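
The counting idea behind this operation can be sketched in plain NumPy (a simplified illustration only; it ignores the `top_k` batching the real operation performs):

```python
import numpy as np

# Simplified sketch: for each embedding, count how many other embeddings
# lie closer than max_dist in the embedding space
embeddings = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
max_dist = 0.5

# Pairwise Euclidean distances between all embeddings
dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)

# Subtract 1 to exclude each point's zero distance to itself
counts = (dists < max_dist).sum(axis=1) - 1
print(counts)  # first two points are close to each other; the third is isolated
```

A high `count` therefore flags near-duplicate-heavy regions of the dataset, while a count of zero flags outliers.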

!!! Tip

    For a given dataset, model, `max_dist` and `top_k`, the similarity index, once generated, will be reused. In case your dataset has changed, or you simply need to regenerate the similarity index, you can pass `force=True`.

!!! Example "Similarity Index"

    ```python
    from ultralytics import Explorer

    exp = Explorer()
    exp.create_embeddings_table()

    sim_idx = exp.similarity_index()
    ```

You can use the similarity index to build custom conditions for filtering the dataset. For example, the following code selects the images that are similar to more than 30 other images in the dataset:

```python
import numpy as np

sim_count = np.array(sim_idx["count"])
sim_idx['im_file'][sim_count > 30]
```

### Visualize Embedding Space

You can also visualize the embedding space using the plotting tool of your choice. Here is a simple example using matplotlib:

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Get the raw embeddings from the table, as shown above
embeddings = np.array(exp.table.to_pandas()["vector"].tolist())

# Reduce dimensions using PCA to 3 components for visualization in 3D
pca = PCA(n_components=3)
reduced_data = pca.fit_transform(embeddings)

# Create a 3D scatter plot using Matplotlib Axes3D
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')

# Scatter plot
ax.scatter(reduced_data[:, 0], reduced_data[:, 1], reduced_data[:, 2], alpha=0.5)
ax.set_title('3D Scatter Plot of Reduced 256-Dimensional Data (PCA)')
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')
ax.set_zlabel('Component 3')

plt.show()
```

Start creating your own CV dataset exploration reports using the Explorer API. For inspiration, check out the

# Apps Built Using Ultralytics Explorer

Try our GUI demo based on the Explorer API

# Coming Soon

- [ ] Merge specific labels from datasets. Example: import all `person` labels from COCO and `car` labels from Cityscapes
- [ ] Remove images that have a higher similarity index than the given threshold
- [ ] Automatically persist new datasets after merging/removing entries
- [ ] Advanced dataset visualizations
0 docs/en/datasets/explorer/dash.md Normal file

457 docs/en/datasets/explorer/explorer.ipynb Normal file
@@ -0,0 +1,457 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "aa923c26-81c8-4565-9277-1cb686e3702e",
   "metadata": {},
   "source": [
    "# VOC Exploration Example\n",
    "<div align=\"center\">\n",
    "\n",
    " <a href=\"https://ultralytics.com/yolov8\" target=\"_blank\">\n",
    " <img width=\"1024\" src=\"https://raw.githubusercontent.com/ultralytics/assets/main/yolov8/banner-yolov8.png\"></a>\n",
    "\n",
    " [中文](https://docs.ultralytics.com/zh/) | [한국어](https://docs.ultralytics.com/ko/) | [日本語](https://docs.ultralytics.com/ja/) | [Русский](https://docs.ultralytics.com/ru/) | [Deutsch](https://docs.ultralytics.com/de/) | [Français](https://docs.ultralytics.com/fr/) | [Español](https://docs.ultralytics.com/es/) | [Português](https://docs.ultralytics.com/pt/) | [हिन्दी](https://docs.ultralytics.com/hi/) | [العربية](https://docs.ultralytics.com/ar/)\n",
    "\n",
    " <a href=\"https://console.paperspace.com/github/ultralytics/ultralytics\"><img src=\"https://assets.paperspace.io/img/gradient-badge.svg\" alt=\"Run on Gradient\"/></a>\n",
    " <a href=\"https://colab.research.google.com/github/ultralytics/ultralytics/blob/main/examples/tutorial.ipynb\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"></a>\n",
    " <a href=\"https://www.kaggle.com/ultralytics/yolov8\"><img src=\"https://kaggle.com/static/images/open-in-kaggle.svg\" alt=\"Open In Kaggle\"></a>\n",
    "\n",
    "Welcome to the Ultralytics Explorer API notebook! This notebook serves as the starting point for exploring the various resources available to help you get started with using Ultralytics to explore your datasets using the power of semantic search. You get utilities out of the box that allow you to examine specific types of labels using vector search or even SQL queries.\n",
    "\n",
    "We hope that the resources in this notebook will help you get the most out of Ultralytics. Please browse the Explorer <a href=\"https://docs.ultralytics.com/\">Docs</a> for details, raise an issue on <a href=\"https://github.com/ultralytics/ultralytics\">GitHub</a> for support, and join our <a href=\"https://ultralytics.com/discord\">Discord</a> community for questions and discussions!\n",
    "\n",
    "Try `yolo explorer` powered by the Explorer API\n",
    "\n",
    "Simply `pip install ultralytics` and run `yolo explorer` in your terminal to run custom queries and semantic search on your datasets right inside your browser!\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2454d9ba-9db4-4b37-98e8-201ba285c92f",
   "metadata": {},
   "source": [
    "## Setup\n",
    "Pip install `ultralytics` and [dependencies](https://github.com/ultralytics/ultralytics/blob/main/pyproject.toml) and check software and hardware."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "433f3a4d-a914-42cb-b0b6-be84a84e5e41",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install ultralytics\n",
    "import ultralytics\n",
    "\n",
    "ultralytics.checks()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae602549-3419-4909-9f82-35cba515483f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from ultralytics import Explorer"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d8c06350-be8e-45cf-b3a6-b5017bbd943c",
   "metadata": {},
   "source": [
    "# Similarity search\n",
    "Utilize the power of vector similarity search to find the similar data points in your dataset along with their distance in the embedding space. Simply create an embeddings table for the given dataset-model pair. It is only needed once, and it is reused automatically.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "334619da-6deb-4b32-9fe0-74e0a79cee20",
   "metadata": {},
   "outputs": [],
   "source": [
    "exp = Explorer(\"VOC.yaml\", model=\"yolov8n.pt\")\n",
    "exp.create_embeddings_table()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6c5e42d-bc7e-4b4c-bde0-643072a2165d",
   "metadata": {},
   "source": [
    "Once the embeddings table is built, you can run semantic search in any of the following ways:\n",
    "- On a given index or list of indices in the dataset: `exp.get_similar(idx=[1, 10], limit=10)`\n",
    "- On any image or list of images not in the dataset: `exp.get_similar(img=[\"path/to/img1\", \"path/to/img2\"], limit=10)`\n",
    "\n",
    "In case of multiple inputs, the aggregate of their embeddings is used.\n",
    "\n",
    "You get a pandas dataframe with the `limit` number of most similar data points to the input, along with their distance in the embedding space. You can use this dataframe to perform further filtering.\n",
    "<img width=\"1120\" alt=\"Screenshot 2024-01-06 at 9 45 42 PM\" src=\"https://github.com/AyushExel/assets/assets/15766192/7742ac57-e22a-4cea-a0f9-2b2a257483c5\">\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b485f05b-d92d-42bc-8da7-5e361667b341",
   "metadata": {},
   "outputs": [],
   "source": [
    "similar = exp.get_similar(idx=1, limit=10)\n",
    "similar.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "acf4b489-2161-4176-a1fe-d1d067d8083d",
   "metadata": {},
   "source": [
    "You can also plot the similar samples directly using the `plot_similar` util\n",
    "<img width=\"689\" alt=\"Screenshot 2024-01-06 at 9 46 48 PM\" src=\"https://github.com/AyushExel/assets/assets/15766192/70e1a4c4-6c67-4664-b77a-ad27b1fba8f8\">\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9dbfe7d0-8613-4529-adb6-6e0632d7cce7",
   "metadata": {},
   "outputs": [],
   "source": [
    "exp.plot_similar(idx=6500, limit=20)\n",
    "# exp.plot_similar(idx=[100, 101], limit=10)  # Can also pass a list of idxs or imgs\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "260e09bf-4960-4089-a676-cb0e76ff3c0d",
   "metadata": {},
   "outputs": [],
   "source": [
    "exp.plot_similar(img=\"https://ultralytics.com/images/bus.jpg\", limit=10, labels=False)  # Can also pass any external images\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "faa0b7a7-6318-40e4-b0f4-45a8113bdc3a",
   "metadata": {},
   "source": [
    "<p>\n",
    "<img width=\"766\" alt=\"Screenshot 2024-01-06 at 10 05 10 PM\" src=\"https://github.com/AyushExel/assets/assets/15766192/faa9c544-d96b-4528-a2ea-95c5d8856744\">\n",
    "\n",
    "</p>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35315ae6-d827-40e4-8813-279f97a83b34",
   "metadata": {},
   "source": [
    "## 2. Run SQL queries on your Dataset!\n",
    "Sometimes you might want to investigate a certain type of entries in your dataset. For this, Explorer allows you to execute SQL queries.\n",
    "It accepts either of these formats:\n",
    "- Queries beginning with \"WHERE\" will automatically select all columns. This can be thought of as a shorthand query\n",
    "- You can also write full queries where you can specify which columns to select\n",
    "\n",
    "This can be used to investigate model performance and specific data points. For example:\n",
    "- let's say your model struggles on images that have humans and dogs. You can write a query like this to select the points that have at least 2 humans AND at least one dog.\n",
    "\n",
    "You can combine SQL queries and semantic search to filter down to a specific type of results\n",
    "<img width=\"994\" alt=\"Screenshot 2024-01-06 at 9 47 30 PM\" src=\"https://github.com/AyushExel/assets/assets/15766192/92bc3178-c151-4cd5-8007-c76178deb113\">\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8cd1072f-3100-4331-a0e3-4e2f6b1005bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "table = exp.sql_query(\"WHERE labels LIKE '%person, person%' AND labels LIKE '%dog%' LIMIT 10\")\n",
    "table"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "debf8a00-c9f6-448b-bd3b-454cf62f39ab",
   "metadata": {},
   "source": [
    "Just like similarity search, you also get a util to directly plot the SQL queries using `exp.plot_sql_query`\n",
    "<img width=\"771\" alt=\"Screenshot 2024-01-06 at 9 48 08 PM\" src=\"https://github.com/AyushExel/assets/assets/15766192/332f5acd-3a4e-462d-a281-5d5effd1886e\">\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18b977e7-d048-4b22-b8c4-084a03b04f23",
   "metadata": {},
   "outputs": [],
   "source": [
    "exp.plot_sql_query(\"WHERE labels LIKE '%person, person%' AND labels LIKE '%dog%' LIMIT 10\", labels=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f26804c5-840b-4fd1-987f-e362f29e3e06",
   "metadata": {},
   "source": [
    "## 3. Working with embeddings Table (Advanced)\n",
    "Explorer works on [LanceDB](https://lancedb.github.io/lancedb/) tables internally. You can access this table directly, using the `Explorer.table` object, and run raw queries, push down pre- and post-filters, etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea69260a-3407-40c9-9f42-8b34a6e6af7a",
   "metadata": {},
   "outputs": [],
   "source": [
    "table = exp.table\n",
    "table.schema"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "238db292-8610-40b3-9af7-dfd6be174892",
   "metadata": {},
   "source": [
    "### Run raw queries\n",
    "Vector search finds the nearest vectors in the database. In a recommendation system or a search engine, you can find products similar to the one you searched for. In LLM and other AI applications, each data point can be represented by embeddings generated from some model, and the search returns the most relevant features.\n",
    "\n",
    "A search in a high-dimensional vector space is to find the K-Nearest-Neighbors (KNN) of the query vector.\n",
    "\n",
    "### Metric\n",
    "In LanceDB, a metric is the way to describe the distance between a pair of vectors. Currently, it supports the following metrics:\n",
    "- L2\n",
    "- Cosine\n",
    "- Dot\n",
    "\n",
    "Explorer's similarity search uses L2 by default. You can run queries on tables directly, or use the lance format to build custom utilities to manage datasets. More details on available LanceDB table ops are in the [docs](https://lancedb.github.io/lancedb/)\n",
    "\n",
    "<img width=\"1015\" alt=\"Screenshot 2024-01-06 at 9 48 35 PM\" src=\"https://github.com/AyushExel/assets/assets/15766192/a2ccdaf3-8877-4f70-bf47-8a9bd2bb20c0\">\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d74430fe-5aee-45a1-8863-3f2c31338792",
   "metadata": {},
   "outputs": [],
   "source": [
    "dummy_img_embedding = [i for i in range(256)]\n",
    "table.search(dummy_img_embedding).limit(5).to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "587486b4-0d19-4214-b994-f032fb2e8eb5",
   "metadata": {},
   "source": [
    "### Inter-conversion to popular data formats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb2876ea-999b-4eba-96bc-c196ba02c41c",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = table.to_pandas()\n",
    "pa_table = table.to_arrow()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42659d63-ad76-49d6-8dfc-78d77278db72",
   "metadata": {},
   "source": [
    "### Work with Embeddings\n",
    "You can access the raw embeddings from the LanceDB table and analyse them. The image embeddings are stored in the column `vector`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66d69e9b-046e-41c8-80d7-c0ee40be3bca",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "embeddings = table.to_pandas()[\"vector\"].tolist()\n",
    "embeddings = np.array(embeddings)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8df0a49-9596-4399-954b-b8ae1fd7a602",
   "metadata": {},
   "source": [
    "### Scatterplot\n",
    "One of the preliminary steps in analysing embeddings is plotting them in 2D or 3D space via dimensionality reduction. Let's try an example\n",
    "\n",
    "<img width=\"646\" alt=\"Screenshot 2024-01-06 at 9 48 58 PM\" src=\"https://github.com/AyushExel/assets/assets/15766192/9e1da25c-face-4426-abc0-2f64a4e4952c\">\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d9a150e8-8092-41b3-82f8-2247f8187fc8",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install scikit-learn -q"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "196079c3-45a9-4325-81ab-af79a881e37a",
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import numpy as np\n",
    "from sklearn.decomposition import PCA\n",
    "import matplotlib.pyplot as plt\n",
    "from mpl_toolkits.mplot3d import Axes3D\n",
    "\n",
    "# Reduce dimensions using PCA to 3 components for visualization in 3D\n",
    "pca = PCA(n_components=3)\n",
    "reduced_data = pca.fit_transform(embeddings)\n",
    "\n",
    "# Create a 3D scatter plot using Matplotlib's Axes3D\n",
    "fig = plt.figure(figsize=(8, 6))\n",
    "ax = fig.add_subplot(111, projection='3d')\n",
    "\n",
    "# Scatter plot\n",
    "ax.scatter(reduced_data[:, 0], reduced_data[:, 1], reduced_data[:, 2], alpha=0.5)\n",
    "ax.set_title('3D Scatter Plot of Reduced 256-Dimensional Data (PCA)')\n",
    "ax.set_xlabel('Component 1')\n",
    "ax.set_ylabel('Component 2')\n",
    "ax.set_zlabel('Component 3')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c843c23-e3f2-490e-8d6c-212fa038a149",
   "metadata": {},
   "source": [
    "## 4. Similarity Index\n",
    "Here's a simple example of an operation powered by the embeddings table. Explorer comes with a `similarity_index` operation:\n",
    "* It tries to estimate how similar each data point is to the rest of the dataset.\n",
    "* It does that by counting how many image embeddings lie closer than `max_dist` to the current image in the generated embedding space, considering `top_k` similar images at a time.\n",
    "\n",
    "For a given dataset, model, `max_dist` and `top_k`, the similarity index, once generated, will be reused. In case your dataset has changed, or you simply need to regenerate the similarity index, you can pass `force=True`.\n",
    "Similar to vector and SQL search, this also comes with a util to directly plot it. Let's look at the plot first\n",
    "<img width=\"633\" alt=\"Screenshot 2024-01-06 at 9 49 36 PM\" src=\"https://github.com/AyushExel/assets/assets/15766192/96a9d984-4a72-4784-ace1-428676ee2bdd\">\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "953c2a5f-1b61-4acf-a8e4-ed08547dbafc",
   "metadata": {},
   "outputs": [],
   "source": [
    "exp.plot_similarity_index(max_dist=0.2, top_k=0.01)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28228a9a-b727-45b5-8ca7-8db662c0b937",
   "metadata": {},
   "source": [
    "Now let's look at the output of the operation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f4161aaa-20e6-4df0-8e87-d2293ee0530a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "sim_idx = exp.similarity_index(max_dist=0.2, top_k=0.01, force=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b01d5b1a-9adb-4c3c-a873-217c71527c8d",
   "metadata": {},
   "outputs": [],
   "source": [
    "sim_idx"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22b28e54-4fbb-400e-ad8c-7068cbba11c4",
   "metadata": {},
   "source": [
    "Let's create a query to see what data points have similarity count of more than 30 and plot images similar to them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "58d2557b-d401-43cf-937d-4f554c7bc808",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "sim_count = np.array(sim_idx[\"count\"])\n",
    "sim_idx['im_file'][sim_count > 30]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5ec8d76-271a-41ab-ac74-cf8c0084ba5e",
   "metadata": {},
   "source": [
    "You should see something like this\n",
    "<img width=\"897\" alt=\"Screenshot 2024-01-06 at 9 50 48 PM\" src=\"https://github.com/AyushExel/assets/assets/15766192/5d3f0e35-2ad4-4a67-8df7-3a4c17867b72\">\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a7b2ee3-9f35-48a2-9c38-38379516f4d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "exp.plot_similar(idx=[7146, 14035])  # Using avg embeddings of 2 images"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}

31 docs/en/datasets/explorer/index.md Normal file
@@ -0,0 +1,31 @@
---
comments: true
description: Discover the Ultralytics Explorer, a versatile tool and Python API for CV dataset exploration, enabling semantic search, SQL queries, and vector similarity searches.
keywords: Ultralytics Explorer, CV Dataset Tools, Semantic Search, SQL Dataset Queries, Vector Similarity, Python API, GUI Explorer, Dataset Analysis, YOLO Explorer, Data Insights
---

# Ultralytics Explorer

Ultralytics Explorer is a tool for exploring CV datasets using semantic search, SQL queries and vector similarity search. It is also a Python API for accessing the same functionality.

### Installation of optional dependencies

Explorer depends on external libraries for some of its functionality. These are automatically installed on usage. To manually install these dependencies, use the following command:

```bash
pip install ultralytics[explorer]
```

## GUI Explorer Usage

The GUI demo runs in your browser, allowing you to create embeddings for your dataset, search for similar images, run SQL queries and perform semantic search. It can be run using the following command:

```bash
yolo explorer
```

### Explorer API

This is a Python API for exploring your datasets. It also powers the GUI Explorer. You can use it to create your own exploratory notebooks or scripts to get insights into your datasets.

Learn more about the Explorer API [here](api.md).