Update README with all recent changes

- Project structure: add index_images_indb.py
- Architecture: fix schema names (VECTORS_USER/VECTOR), HNSW for all three
- Database schemas: separate sections for VECTORS_USER and VECTOR, photo storage differences
- Indexing scripts: three-way comparison table, measured avg times (12.1s/12.1s/13.6s)
- ORA-24816 workaround documented
- Performance comparison: real benchmark numbers, HNSW for in-DB, photo storage row
- Oracle in-DB section: HNSW index creation, index_images_indb.py for population
- Re-index section: add index_images_indb.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -40,8 +40,8 @@ ML library is loaded or called at search time.
|
|||||||
│ PostgreSQL 18 │ │ Oracle 26ai │ │ Oracle 26ai │
|
│ PostgreSQL 18 │ │ Oracle 26ai │ │ Oracle 26ai │
|
||||||
│ + pgvector 0.8.2 │ │ (version 23.26.1) │ │ (version 23.26.1) │
|
│ + pgvector 0.8.2 │ │ (version 23.26.1) │ │ (version 23.26.1) │
|
||||||
│ database: │ │ PDB: FREEPDB1 │ │ PDB: FREEPDB1 │
|
│ database: │ │ PDB: FREEPDB1 │ │ PDB: FREEPDB1 │
|
||||||
│ vectors_demo │ │ user: vectors_user │ │ schema: VECTOR │
|
│ vectors_demo │ │ schema: VECTORS_USER│ │ schema: VECTOR │
|
||||||
│ HNSW index │ │ HNSW index │ │ HNSW not needed │
|
│ HNSW index │ │ HNSW index │ │ HNSW index │
|
||||||
└────────┬─────────────┘ └──────────┬───────────┘ └──────────┬────────────┘
|
└────────┬─────────────┘ └──────────┬───────────┘ └──────────┬────────────┘
|
||||||
│ │ │
|
│ │ │
|
||||||
▼ ▼ │
|
▼ ▼ │
|
||||||
@@ -88,7 +88,8 @@ vector-search-demo/
|
|||||||
│ ├── .env # Oracle credentials, photo path
|
│ ├── .env # Oracle credentials, photo path
|
||||||
│ ├── db_oracle.py # Oracle connection factory
|
│ ├── db_oracle.py # Oracle connection factory
|
||||||
│ ├── embedder.py # CLIP model wrapper (identical to pgvector)
|
│ ├── embedder.py # CLIP model wrapper (identical to pgvector)
|
||||||
│ ├── index_images_oracle.py # One-time indexing script (Python embedding)
|
│ ├── index_images_oracle.py # One-time indexing script (Python embedding, VECTORS_USER)
|
||||||
|
│ ├── index_images_indb.py # One-time indexing script (in-DB embedding, VECTOR schema)
|
||||||
│ ├── main_oracle.py # FastAPI app — Python embedding (port 8001)
|
│ ├── main_oracle.py # FastAPI app — Python embedding (port 8001)
|
||||||
│ └── main_oracle_indb.py # FastAPI app — in-database embedding (port 8002)
|
│ └── main_oracle_indb.py # FastAPI app — in-database embedding (port 8002)
|
||||||
└── frontend/
|
└── frontend/
|
||||||
@@ -130,7 +131,7 @@ The `pgvector/pgvector:pg18` image includes pgvector pre-installed. See the
|
|||||||
| Container name | `oracle.free` |
|
| Container name | `oracle.free` |
|
||||||
| Host port | 37611 (mapped to 1521 inside container) |
|
| Host port | 37611 (mapped to 1521 inside container) |
|
||||||
| Pluggable Database | FREEPDB1 |
|
| Pluggable Database | FREEPDB1 |
|
||||||
| Schema users | `vectors_user`, `VECTOR` |
|
| Schema users | `VECTORS_USER`, `VECTOR` |
|
||||||
|
|
||||||
**Oracle vector memory** — the HNSW index is held entirely in the SGA's Vector
|
**Oracle vector memory** — the HNSW index is held entirely in the SGA's Vector
|
||||||
Memory Area. This is already configured:
|
Memory Area. This is already configured:
|
||||||
@@ -215,10 +216,11 @@ CREATE INDEX images_embedding_idx
|
|||||||
ON images USING hnsw (embedding vector_cosine_ops);
|
ON images USING hnsw (embedding vector_cosine_ops);
|
||||||
```
|
```
|
||||||
|
|
||||||
### Oracle 26ai
|
### Oracle 26ai — schema VECTORS_USER (Python embedding backend)
|
||||||
|
|
||||||
```sql
|
```sql
|
||||||
-- PDB: FREEPDB1, user: vectors_user
|
-- PDB: FREEPDB1, schema: VECTORS_USER
|
||||||
|
-- Photos stored as file paths on the app server filesystem
|
||||||
|
|
||||||
CREATE TABLE images (
|
CREATE TABLE images (
|
||||||
id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
|
id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
|
||||||
@@ -235,6 +237,36 @@ CREATE VECTOR INDEX images_embedding_idx
|
|||||||
PARAMETERS (type HNSW, neighbors 32, efconstruction 200);
|
PARAMETERS (type HNSW, neighbors 32, efconstruction 200);
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### Oracle 26ai — schema VECTOR (in-database embedding backend)
|
||||||
|
|
||||||
|
```sql
|
||||||
|
-- PDB: FREEPDB1, schema: VECTOR
|
||||||
|
-- Photos stored as BLOBs inside Oracle — no filesystem access at query time
|
||||||
|
|
||||||
|
CREATE TABLE foto_vektor (
|
||||||
|
id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
|
||||||
|
filename VARCHAR2(100),
|
||||||
|
foto BLOB, -- full JPEG stored in Oracle
|
||||||
|
foto_vek VECTOR -- embedding computed by CLIP_IMG ONNX model
|
||||||
|
);
|
||||||
|
|
||||||
|
CREATE VECTOR INDEX foto_vektor_idx
|
||||||
|
ON foto_vektor(foto_vek)
|
||||||
|
ORGANIZATION INMEMORY NEIGHBOR GRAPH
|
||||||
|
WITH DISTANCE COSINE
|
||||||
|
WITH TARGET ACCURACY 95
|
||||||
|
PARAMETERS (type HNSW, neighbors 32, efconstruction 200);
|
||||||
|
```
|
||||||
|
|
||||||
|
**Key difference between the two Oracle schemas:**
|
||||||
|
|
||||||
|
| Aspect | VECTORS_USER | VECTOR |
|
||||||
|
|---|---|---|
|
||||||
|
| Photo storage | File path (filesystem) | BLOB (inside Oracle) |
|
||||||
|
| Embedding at index time | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_IMG)` |
|
||||||
|
| Embedding at query time | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_TXT)` |
|
||||||
|
| Indexed by | `index_images_oracle.py` | `index_images_indb.py` |
|
||||||
|
|
||||||
**Key schema differences:**
|
**Key schema differences:**
|
||||||
|
|
||||||
| Aspect | PostgreSQL/pgvector | Oracle 26ai |
|
| Aspect | PostgreSQL/pgvector | Oracle 26ai |
|
||||||
@@ -268,21 +300,29 @@ Runs in **thin mode** — no Oracle Instant Client installation is required on t
|
|||||||
|
|
||||||
### Indexing scripts
|
### Indexing scripts
|
||||||
|
|
||||||
Both scripts are idempotent: they check for existing rows and skip already-indexed
|
All three scripts are idempotent: they check for existing rows and skip already-indexed
|
||||||
photos. Each photo is committed individually so a crash does not lose prior work.
|
photos. Each photo is committed individually so a crash does not lose prior work.
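
The skip-and-commit pattern described above can be sketched as follows. This is a minimal illustration, not the code of the actual scripts: it uses an in-memory SQLite table as a stand-in for the real databases, and the names `index_photos`, `fake_embed`, and the two-column `images` table are assumptions made for the example.

```python
import sqlite3


def index_photos(conn, filenames, embed):
    """Index each photo at most once; commit per photo so a crash keeps prior work."""
    indexed = 0
    for name in filenames:
        exists = conn.execute(
            "SELECT 1 FROM images WHERE filename = ?", (name,)
        ).fetchone()
        if exists:
            continue  # already indexed on an earlier run, skip it
        conn.execute(
            "INSERT INTO images (filename, embedding) VALUES (?, ?)",
            (name, embed(name)),
        )
        conn.commit()  # one commit per photo
        indexed += 1
    return indexed


# Stand-in database and embedder (hypothetical, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (filename TEXT PRIMARY KEY, embedding TEXT)")
fake_embed = lambda name: "0.0,0.0"  # placeholder instead of a CLIP vector

first_run = index_photos(conn, ["a.jpg", "b.jpg"], fake_embed)
second_run = index_photos(conn, ["a.jpg", "b.jpg", "c.jpg"], fake_embed)
```

Running the function twice over an overlapping file list only inserts the photos that are new on the second run, which is what makes a re-run after a crash safe.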
|
||||||
|
|
||||||
| | `index_images.py` | `index_images_oracle.py` |
|
| | `index_images.py` | `index_images_oracle.py` | `index_images_indb.py` |
|
||||||
|---|---|---|
|
|---|---|---|---|
|
||||||
| Run command | `python3 index_images.py` | `python3 index_images_oracle.py` |
|
| Schema | PostgreSQL `vectors_demo` | Oracle `VECTORS_USER` | Oracle `VECTOR` |
|
||||||
| Vector bind | Python `list` passed directly | `array.array("f", embedding)` required |
|
| Run command | `python3 index_images.py` | `python3 index_images_oracle.py` | `python3 index_images_indb.py` |
|
||||||
| Bind style | `%s` placeholders (psycopg2) | `:1`, `:2`, `:3` positional (oracledb) |
|
| Photo data sent | File path | File path | Full JPEG as BLOB |
|
||||||
| Runtime (116 photos, CPU) | ~26 seconds | ~16 seconds |
|
| Embedding | Python CLIP | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_IMG)` |
|
||||||
|
| Vector bind | Python `list` | `array.array("f", ...)` | Computed inside Oracle |
|
||||||
|
| Avg runtime (3 runs, CPU) | **12.1 s** | **12.1 s** | **13.6 s** |
|
||||||
|
|
||||||
**Why `array.array` for Oracle?**
|
**Why `array.array` for `index_images_oracle.py`?**
|
||||||
The `python-oracledb` driver does not accept a plain Python list for a `VECTOR`
|
The `python-oracledb` driver does not accept a plain Python list for a `VECTOR`
|
||||||
column. The data must be a Python `array.array` with typecode `"f"` (32-bit float),
|
column. The data must be a Python `array.array` with typecode `"f"` (32-bit float),
|
||||||
matching the `FLOAT32` declaration in the Oracle column type.
|
matching the `FLOAT32` declaration in the Oracle column type.
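
A minimal sketch of that conversion, using only the standard library. The helper name `to_float32_vector` and the three-element sample vector are assumptions for illustration; the real embeddings are 512-dimensional CLIP vectors.

```python
import array


def to_float32_vector(embedding):
    """Convert a sequence of Python floats into a 32-bit float array for binding."""
    return array.array("f", embedding)


# Stand-in for a CLIP embedding (hypothetical values).
embedding = [0.25, -0.5, 1.0]
vec = to_float32_vector(embedding)
print(vec.typecode, len(vec))
```

The resulting `vec` is what would be passed as the bind value for the `VECTOR` column in place of the plain list.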
|
||||||
|
|
||||||
|
**Why two SQL statements in `index_images_indb.py`?**
|
||||||
|
Oracle raises `ORA-24816` if a BLOB bind variable appears before another bind in the
|
||||||
|
same `VALUES` clause. The script works around this by inserting the BLOB first, then
|
||||||
|
updating the vector in a second statement — letting Oracle read the stored BLOB to
|
||||||
|
compute the embedding internally.
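
The two-statement workaround might look roughly like the sketch below. It is not the contents of `index_images_indb.py`: the function name `index_one_photo`, the lookup by `filename` (assumed unique here), and the exact SQL text are assumptions. The table, column, and model names come from the schema shown earlier, and `connection` is assumed to be a DB-API connection such as the one `python-oracledb` provides.

```python
# Statement 1: store the filename and the JPEG bytes as a BLOB.
INSERT_SQL = "INSERT INTO foto_vektor (filename, foto) VALUES (:1, :2)"

# Statement 2: let Oracle read the stored BLOB and compute the embedding.
UPDATE_SQL = (
    "UPDATE foto_vektor "
    "SET foto_vek = VECTOR_EMBEDDING(CLIP_IMG USING foto AS data) "
    "WHERE filename = :1"
)


def index_one_photo(connection, filename, jpeg_bytes):
    """Insert the BLOB first, then fill in the vector in a second statement."""
    cursor = connection.cursor()
    cursor.execute(INSERT_SQL, [filename, jpeg_bytes])
    cursor.execute(UPDATE_SQL, [filename])
    connection.commit()  # one commit per photo, as in the other scripts
```

Because the second statement references the `foto` column rather than a bind variable, the BLOB is never bound alongside another value in the same `VALUES` clause.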
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
### FastAPI applications
|
### FastAPI applications
|
||||||
@@ -470,16 +510,22 @@ podman cp oravector-demo/sql/setup_vector_schema.sql oracle.free:/tmp/
|
|||||||
podman exec oracle.free bash -c "sqlplus -s / as sysdba @/tmp/setup_vector_schema.sql"
|
podman exec oracle.free bash -c "sqlplus -s / as sysdba @/tmp/setup_vector_schema.sql"
|
||||||
```
|
```
|
||||||
|
|
||||||
**Populate `FOTO_VEKTOR`** with images and their vectors (run as VECTOR user in SQL):
|
**Add HNSW index** (after the table is created):
|
||||||
```sql
|
```bash
|
||||||
-- Example: insert one photo with its CLIP_IMG embedding
|
podman exec oracle.free bash -c "sqlplus -s 'vector/Vektor@localhost:1521/FREEPDB1' <<'EOF'
|
||||||
INSERT INTO vector.foto_vektor (filename, foto, foto_vek)
|
CREATE VECTOR INDEX foto_vektor_idx
|
||||||
VALUES (
|
ON VECTOR.FOTO_VEKTOR(foto_vek)
|
||||||
'photo.jpg',
|
ORGANIZATION INMEMORY NEIGHBOR GRAPH
|
||||||
TO_BLOB(BFILENAME('VEC_DUMP', 'photo.jpg')),
|
WITH DISTANCE COSINE WITH TARGET ACCURACY 95
|
||||||
VECTOR_EMBEDDING(CLIP_IMG USING TO_BLOB(BFILENAME('VEC_DUMP', 'photo.jpg')) AS data)
|
PARAMETERS (type HNSW, neighbors 32, efconstruction 200);
|
||||||
);
|
EXIT;
|
||||||
COMMIT;
|
EOF"
|
||||||
|
```
|
||||||
|
|
||||||
|
**Populate `FOTO_VEKTOR`** using the indexing script (reads JPEGs from `PHOTOS_DIR`,
|
||||||
|
sends them as BLOBs to Oracle, which computes embeddings via `VECTOR_EMBEDDING(CLIP_IMG)`):
|
||||||
|
```bash
|
||||||
|
cd oravector-demo/backend && python3 index_images_indb.py
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
@@ -519,11 +565,11 @@ cd oravector-demo/backend && uvicorn main_oracle_indb:app --host 0.0.0.0 --port
|
|||||||
# PostgreSQL
|
# PostgreSQL
|
||||||
cd pgvector-demo/backend && python3 index_images.py
|
cd pgvector-demo/backend && python3 index_images.py
|
||||||
|
|
||||||
# Oracle (Python embedding)
|
# Oracle VECTORS_USER (Python embedding)
|
||||||
cd oravector-demo/backend && python3 index_images_oracle.py
|
cd oravector-demo/backend && python3 index_images_oracle.py
|
||||||
|
|
||||||
# Oracle in-database: re-indexing is done in SQL directly
|
# Oracle VECTOR (in-database embedding)
|
||||||
# (the VECTOR schema's FOTO_VEKTOR table is managed by Oracle)
|
cd oravector-demo/backend && python3 index_images_indb.py
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
@@ -537,14 +583,15 @@ installation. The setup involved:
|
|||||||
1. Creating a `VECTOR` database user
|
1. Creating a `VECTOR` database user
|
||||||
2. Exporting CLIP (ViT-B/32) to ONNX format and loading the models via
|
2. Exporting CLIP (ViT-B/32) to ONNX format and loading the models via
|
||||||
`DBMS_VECTOR.LOAD_ONNX_MODEL`
|
`DBMS_VECTOR.LOAD_ONNX_MODEL`
|
||||||
3. Creating and populating the `FOTO_VEKTOR` table with images and their vectors
|
3. Creating the `FOTO_VEKTOR` table and HNSW index
|
||||||
|
4. Populating `FOTO_VEKTOR` using `index_images_indb.py`
|
||||||
|
|
||||||
The resulting models and table are:
|
The resulting models and table are:
|
||||||
|
|
||||||
| Object | Type | Input | Output | Purpose |
|
| Object | Type | Input | Output | Purpose |
|
||||||
|---|---|---|---|---|
|
|---|---|---|---|---|
|
||||||
| `VECTOR.CLIP_TXT` | ONNX model | `VARCHAR2` text | `VECTOR(512)` | Embed text queries |
|
| `VECTOR.CLIP_TXT` | ONNX model | `VARCHAR2` text | `VECTOR(512)` | Embed text queries at search time |
|
||||||
| `VECTOR.CLIP_IMG` | ONNX model | `BLOB` image | `VECTOR(512)` | Embed image data |
|
| `VECTOR.CLIP_IMG` | ONNX model | `BLOB` image | `VECTOR(512)` | Embed images at index time |
|
||||||
| `VECTOR.FOTO_VEKTOR` | Table | — | — | Stores filenames, image BLOBs, and vectors |
|
| `VECTOR.FOTO_VEKTOR` | Table | — | — | Stores filenames, image BLOBs, and vectors |
|
||||||
|
|
||||||
These are called with the `VECTOR_EMBEDDING()` SQL function. The table
|
These are called with the `VECTOR_EMBEDDING()` SQL function. The table
|
||||||
@@ -591,18 +638,20 @@ Measured on this installation (CPU only, no GPU):
|
|||||||
|
|
||||||
| Metric | PostgreSQL + pgvector | Oracle 26ai (Python embed) | Oracle 26ai (in-DB embed) |
|
| Metric | PostgreSQL + pgvector | Oracle 26ai (Python embed) | Oracle 26ai (in-DB embed) |
|
||||||
|---|---|---|---|
|
|---|---|---|---|
|
||||||
| Photos indexed | 116 | 116 | 116 (manually indexed) |
|
| Photos indexed | 116 | 116 | 116 |
|
||||||
| Indexing time | ~26 seconds | ~16 seconds | 0 (indexed separately by admin) |
|
| Avg indexing time (3 runs, CPU) | **12.1 s** | **12.1 s** | **13.6 s** |
|
||||||
| Index type | HNSW (on disk) | HNSW (in-memory) | Full table scan (116 rows) |
|
| Index type | HNSW (on disk) | HNSW (in-memory) | HNSW (in-memory) |
|
||||||
| Memory required | None | 512 MB SGA | 512 MB SGA |
|
| Memory required | None | 512 MB SGA | 512 MB SGA |
|
||||||
|
| Photo storage | File path (filesystem) | File path (filesystem) | BLOB (in Oracle) |
|
||||||
| Python CLIP at query time | Yes | Yes | **No** |
|
| Python CLIP at query time | Yes | Yes | **No** |
|
||||||
| Embedding location | Python process | Python process | Inside Oracle SQL |
|
| Embedding at index time | Python CLIP | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_IMG)` |
|
||||||
|
| Embedding at query time | Python CLIP | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_TXT)` |
|
||||||
| `VECTOR_EMBEDDING()` used | No | No | **Yes** |
|
| `VECTOR_EMBEDDING()` used | No | No | **Yes** |
|
||||||
|
| Oracle schema | — | `VECTORS_USER` | `VECTOR` |
|
||||||
|
|
||||||
Note: indexing time for backends 1 and 2 is dominated by CLIP inference (CPU),
|
Note: indexing time is dominated by CLIP inference for backends 1 and 2 (CPU, no GPU).
|
||||||
not database write speed. The in-database backend uses the manually loaded CLIP
|
Backend 3 is slightly slower because each photo is transferred as a full JPEG BLOB
|
||||||
models in the `VECTOR` schema; their indexing time is not measured here as it
|
to Oracle over the network before Oracle computes the embedding internally.
|
||||||
was performed separately by the administrator.
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|||||||