Update README with all recent changes

- Project structure: add index_images_indb.py
- Architecture: fix schema names (VECTORS_USER/VECTOR), HNSW for all three
- Database schemas: separate sections for VECTORS_USER and VECTOR, photo storage differences
- Indexing scripts: three-way comparison table, measured avg times (12.1s/12.1s/13.6s)
- ORA-24816 workaround documented
- Performance comparison: real benchmark numbers, HNSW for in-DB, photo storage row
- Oracle in-DB section: HNSW index creation, index_images_indb.py for population
- Re-index section: add index_images_indb.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -40,8 +40,8 @@ ML library is loaded or called at search time.
 │ PostgreSQL 18 │ │ Oracle 26ai │ │ Oracle 26ai │
 │ + pgvector 0.8.2 │ │ (version 23.26.1) │ │ (version 23.26.1) │
 │ database: │ │ PDB: FREEPDB1 │ │ PDB: FREEPDB1 │
-│ vectors_demo │ │ user: vectors_user │ │ schema: VECTOR │
-│ HNSW index │ │ HNSW index │ │ HNSW not needed │
+│ vectors_demo │ │ schema: VECTORS_USER│ │ schema: VECTOR │
+│ HNSW index │ │ HNSW index │ │ HNSW index │
 └────────┬─────────────┘ └──────────┬───────────┘ └──────────┬────────────┘
 │ │ │
 ▼ ▼ │
@@ -88,7 +88,8 @@ vector-search-demo/
 │ ├── .env # Oracle credentials, photo path
 │ ├── db_oracle.py # Oracle connection factory
 │ ├── embedder.py # CLIP model wrapper (identical to pgvector)
-│ ├── index_images_oracle.py # One-time indexing script (Python embedding)
+│ ├── index_images_oracle.py # One-time indexing script (Python embedding, VECTORS_USER)
+│ ├── index_images_indb.py # One-time indexing script (in-DB embedding, VECTOR schema)
 │ ├── main_oracle.py # FastAPI app — Python embedding (port 8001)
 │ └── main_oracle_indb.py # FastAPI app — in-database embedding (port 8002)
 └── frontend/
@@ -130,7 +131,7 @@ The `pgvector/pgvector:pg18` image includes pgvector pre-installed. See the
 | Container name | `oracle.free` |
 | Host port | 37611 (mapped to 1521 inside container) |
 | Pluggable Database | FREEPDB1 |
-| Schema users | `vectors_user`, `VECTOR` |
+| Schema users | `VECTORS_USER`, `VECTOR` |

 **Oracle vector memory** — the HNSW index is held entirely in the SGA's Vector
 Memory Area. This is already configured:
@@ -215,10 +216,11 @@ CREATE INDEX images_embedding_idx
 ON images USING hnsw (embedding vector_cosine_ops);
 ```

-### Oracle 26ai
+### Oracle 26ai — schema VECTORS_USER (Python embedding backend)

 ```sql
--- PDB: FREEPDB1, user: vectors_user
+-- PDB: FREEPDB1, schema: VECTORS_USER
+-- Photos stored as file paths on the app server filesystem

 CREATE TABLE images (
 id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
@@ -235,6 +237,36 @@ CREATE VECTOR INDEX images_embedding_idx
 PARAMETERS (type HNSW, neighbors 32, efconstruction 200);
 ```

+### Oracle 26ai — schema VECTOR (in-database embedding backend)
+
+```sql
+-- PDB: FREEPDB1, schema: VECTOR
+-- Photos stored as BLOBs inside Oracle — no filesystem access at query time
+
+CREATE TABLE foto_vektor (
+id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
+filename VARCHAR2(100),
+foto BLOB, -- full JPEG stored in Oracle
+foto_vek VECTOR -- embedding computed by CLIP_IMG ONNX model
+);
+
+CREATE VECTOR INDEX foto_vektor_idx
+ON foto_vektor(foto_vek)
+ORGANIZATION INMEMORY NEIGHBOR GRAPH
+WITH DISTANCE COSINE
+WITH TARGET ACCURACY 95
+PARAMETERS (type HNSW, neighbors 32, efconstruction 200);
+```
+
+**Key difference between the two Oracle schemas:**
+
+| Aspect | VECTORS_USER | VECTOR |
+|---|---|---|
+| Photo storage | File path (filesystem) | BLOB (inside Oracle) |
+| Embedding at index time | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_IMG)` |
+| Embedding at query time | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_TXT)` |
+| Indexed by | `index_images_oracle.py` | `index_images_indb.py` |
+
 **Key schema differences:**

 | Aspect | PostgreSQL/pgvector | Oracle 26ai |
@@ -268,21 +300,29 @@ Runs in **thin mode** — no Oracle Instant Client installation is required on t

 ### Indexing scripts

-Both scripts are idempotent: they check for existing rows and skip already-indexed
+All three scripts are idempotent: they check for existing rows and skip already-indexed
 photos. Each photo is committed individually so a crash does not lose prior work.

-| | `index_images.py` | `index_images_oracle.py` |
-|---|---|---|
-| Run command | `python3 index_images.py` | `python3 index_images_oracle.py` |
-| Vector bind | Python `list` passed directly | `array.array("f", embedding)` required |
-| Bind style | `%s` placeholders (psycopg2) | `:1`, `:2`, `:3` positional (oracledb) |
-| Runtime (116 photos, CPU) | ~26 seconds | ~16 seconds |
+| | `index_images.py` | `index_images_oracle.py` | `index_images_indb.py` |
+|---|---|---|---|
+| Schema | PostgreSQL `vectors_demo` | Oracle `VECTORS_USER` | Oracle `VECTOR` |
+| Run command | `python3 index_images.py` | `python3 index_images_oracle.py` | `python3 index_images_indb.py` |
+| Photo data sent | File path | File path | Full JPEG as BLOB |
+| Embedding | Python CLIP | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_IMG)` |
+| Vector bind | Python `list` | `array.array("f", ...)` | Computed inside Oracle |
+| Avg runtime (3 runs, CPU) | **12.1 s** | **12.1 s** | **13.6 s** |

-**Why `array.array` for Oracle?**
+**Why `array.array` for `index_images_oracle.py`?**
 The `python-oracledb` driver does not accept a plain Python list for a `VECTOR`
 column. The data must be a Python `array.array` with typecode `"f"` (32-bit float),
 matching the `FLOAT32` declaration in the Oracle column type.

+**Why two SQL statements in `index_images_indb.py`?**
+Oracle raises `ORA-24816` if a BLOB bind variable appears before another bind in the
+same `VALUES` clause. The script works around this by inserting the BLOB first, then
+updating the vector in a second statement — letting Oracle read the stored BLOB to
+compute the embedding internally.
+
 ---

 ### FastAPI applications
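The `array.array` requirement described in the hunk above can be illustrated with a minimal sketch. It assumes the embedding arrives as a plain Python list of 512 floats (the `VECTOR(512)` size stated elsewhere in this README). The helper name `to_oracle_vector` and the placeholder values are hypothetical and are not taken from `index_images_oracle.py`.

```python
import array


def to_oracle_vector(embedding):
    """Convert a sequence of floats to array.array with typecode 'f' (32-bit floats)."""
    return array.array("f", embedding)


# Placeholder values only; a real embedding would come from the CLIP model.
embedding = [0.25, -0.5, 0.125] + [0.0] * 509
vec = to_oracle_vector(embedding)

# typecode 'f' is the 32-bit float layout the README says the VECTOR column expects.
print(vec.typecode, len(vec))
```

The resulting object, rather than the original list, would then be used as the bind value for the `VECTOR` column.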
@@ -470,16 +510,22 @@ podman cp oravector-demo/sql/setup_vector_schema.sql oracle.free:/tmp/
 podman exec oracle.free bash -c "sqlplus -s / as sysdba @/tmp/setup_vector_schema.sql"
 ```

-**Populate `FOTO_VEKTOR`** with images and their vectors (run as VECTOR user in SQL):
-```sql
--- Example: insert one photo with its CLIP_IMG embedding
-INSERT INTO vector.foto_vektor (filename, foto, foto_vek)
-VALUES (
-'photo.jpg',
-TO_BLOB(BFILENAME('VEC_DUMP', 'photo.jpg')),
-VECTOR_EMBEDDING(CLIP_IMG USING TO_BLOB(BFILENAME('VEC_DUMP', 'photo.jpg')) AS data)
-);
-COMMIT;
+**Add HNSW index** (after the table is created):
+```bash
+podman exec oracle.free bash -c "sqlplus -s 'vector/Vektor@localhost:1521/FREEPDB1' <<'EOF'
+CREATE VECTOR INDEX foto_vektor_idx
+ON VECTOR.FOTO_VEKTOR(foto_vek)
+ORGANIZATION INMEMORY NEIGHBOR GRAPH
+WITH DISTANCE COSINE WITH TARGET ACCURACY 95
+PARAMETERS (type HNSW, neighbors 32, efconstruction 200);
+EXIT;
+EOF"
 ```

+**Populate `FOTO_VEKTOR`** using the indexing script (reads JPEGs from `PHOTOS_DIR`,
+sends them as BLOBs to Oracle, which computes embeddings via `VECTOR_EMBEDDING(CLIP_IMG)`):
+```bash
+cd oravector-demo/backend && python3 index_images_indb.py
+```
+
 ---
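The insert-then-update workaround for `ORA-24816` that this commit documents can be sketched as follows. The actual SQL in `index_images_indb.py` is not shown in this diff, so both statements below are assumptions built from the `foto_vektor` columns and the `VECTOR_EMBEDDING(CLIP_IMG USING ... AS data)` form that do appear in the README. The sketch also assumes filenames are unique. `RecordingConnection` is a hypothetical stand-in so the example runs without a database.

```python
# Assumed statements; the real SQL in index_images_indb.py is not shown in this diff.
INSERT_SQL = "INSERT INTO foto_vektor (filename, foto) VALUES (:1, :2)"
UPDATE_SQL = (
    "UPDATE foto_vektor "
    "SET foto_vek = VECTOR_EMBEDDING(CLIP_IMG USING foto AS data) "
    "WHERE filename = :1"
)


def index_one_photo(connection, filename, jpeg_bytes):
    """Store the JPEG first, then have Oracle embed the stored BLOB."""
    cursor = connection.cursor()
    # Statement 1: the BLOB is the last bind and no vector is bound at all.
    cursor.execute(INSERT_SQL, [filename, jpeg_bytes])
    # Statement 2: only the filename is bound; Oracle reads the BLOB column itself.
    cursor.execute(UPDATE_SQL, [filename])
    # One commit per photo, matching the crash-safety behaviour the README describes.
    connection.commit()


class RecordingConnection:
    """Hypothetical stand-in for a database connection; records calls only."""

    def __init__(self):
        self.statements = []
        self.commits = 0

    def cursor(self):
        return self

    def execute(self, sql, params):
        self.statements.append((sql, list(params)))

    def commit(self):
        self.commits += 1


conn = RecordingConnection()
index_one_photo(conn, "photo.jpg", b"\xff\xd8\xff\xe0 placeholder bytes")
for sql, params in conn.statements:
    print(sql.split()[0], len(params))
```

With a real `python-oracledb` connection in place of the stand-in, the same function shape would apply, though the bind style and key column used by the actual script may differ.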
@@ -519,11 +565,11 @@ cd oravector-demo/backend && uvicorn main_oracle_indb:app --host 0.0.0.0 --port
 # PostgreSQL
 cd pgvector-demo/backend && python3 index_images.py

-# Oracle (Python embedding)
+# Oracle VECTORS_USER (Python embedding)
 cd oravector-demo/backend && python3 index_images_oracle.py

-# Oracle in-database: re-indexing is done in SQL directly
-# (the VECTOR schema's FOTO_VEKTOR table is managed by Oracle)
+# Oracle VECTOR (in-database embedding)
+cd oravector-demo/backend && python3 index_images_indb.py
 ```

 ---
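Re-running the commands above relies on the behaviour the README attributes to all three scripts: skip photos that already have a row, and commit after each photo. A minimal sketch of that pattern follows. It uses the standard-library `sqlite3` module as a stand-in for PostgreSQL or Oracle, so the placeholder style, the `images` column names, and the `fake_embed` function are all assumptions for illustration.

```python
import sqlite3


def index_photos(conn, filenames, embed):
    """Insert only photos that are not yet indexed; commit after each one."""
    cur = conn.cursor()
    cur.execute("SELECT filename FROM images")
    already_indexed = {row[0] for row in cur.fetchall()}
    added = 0
    for name in filenames:
        if name in already_indexed:
            continue  # skip work done by an earlier run
        cur.execute(
            "INSERT INTO images (filename, embedding) VALUES (?, ?)",
            (name, embed(name)),
        )
        conn.commit()  # per-photo commit: a crash keeps everything indexed so far
        added += 1
    return added


def fake_embed(name):
    """Hypothetical placeholder; the real scripts compute a CLIP embedding."""
    return "0.0,0.0"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (filename TEXT PRIMARY KEY, embedding TEXT)")
first = index_photos(conn, ["a.jpg", "b.jpg"], fake_embed)
second = index_photos(conn, ["a.jpg", "b.jpg", "c.jpg"], fake_embed)
print(first, second)
```

The second call only inserts the photo that was missing, which is the property that makes a repeated run cheap.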
@@ -537,14 +583,15 @@ installation. The setup involved:
 1. Creating a `VECTOR` database user
 2. Exporting CLIP (ViT-B/32) to ONNX format and loading the models via
 `DBMS_VECTOR.LOAD_ONNX_MODEL`
-3. Creating and populating the `FOTO_VEKTOR` table with images and their vectors
+3. Creating the `FOTO_VEKTOR` table and HNSW index
+4. Populating `FOTO_VEKTOR` using `index_images_indb.py`

 The resulting models and table are:

 | Object | Type | Input | Output | Purpose |
 |---|---|---|---|---|
-| `VECTOR.CLIP_TXT` | ONNX model | `VARCHAR2` text | `VECTOR(512)` | Embed text queries |
-| `VECTOR.CLIP_IMG` | ONNX model | `BLOB` image | `VECTOR(512)` | Embed image data |
+| `VECTOR.CLIP_TXT` | ONNX model | `VARCHAR2` text | `VECTOR(512)` | Embed text queries at search time |
+| `VECTOR.CLIP_IMG` | ONNX model | `BLOB` image | `VECTOR(512)` | Embed images at index time |
 | `VECTOR.FOTO_VEKTOR` | Table | — | — | Stores filenames, image BLOBs, and vectors |

 These are called with the `VECTOR_EMBEDDING()` SQL function. The table
@@ -591,18 +638,20 @@ Measured on this installation (CPU only, no GPU):

 | Metric | PostgreSQL + pgvector | Oracle 26ai (Python embed) | Oracle 26ai (in-DB embed) |
 |---|---|---|---|
-| Photos indexed | 116 | 116 | 116 (manually indexed) |
-| Indexing time | ~26 seconds | ~16 seconds | 0 (indexed separately by admin) |
-| Index type | HNSW (on disk) | HNSW (in-memory) | Full table scan (116 rows) |
+| Photos indexed | 116 | 116 | 116 |
+| Avg indexing time (3 runs, CPU) | **12.1 s** | **12.1 s** | **13.6 s** |
+| Index type | HNSW (on disk) | HNSW (in-memory) | HNSW (in-memory) |
 | Memory required | None | 512 MB SGA | 512 MB SGA |
+| Photo storage | File path (filesystem) | File path (filesystem) | BLOB (in Oracle) |
 | Python CLIP at query time | Yes | Yes | **No** |
-| Embedding location | Python process | Python process | Inside Oracle SQL |
+| Embedding at index time | Python CLIP | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_IMG)` |
+| Embedding at query time | Python CLIP | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_TXT)` |
 | `VECTOR_EMBEDDING()` used | No | No | **Yes** |
+| Oracle schema | — | `VECTORS_USER` | `VECTOR` |

-Note: indexing time for backends 1 and 2 is dominated by CLIP inference (CPU),
-not database write speed. The in-database backend uses the manually loaded CLIP
-models in the `VECTOR` schema; their indexing time is not measured here as it
-was performed separately by the administrator.
+Note: indexing time is dominated by CLIP inference for backends 1 and 2 (CPU, no GPU).
+Backend 3 is slightly slower because each photo is transferred as a full JPEG BLOB
+to Oracle over the network before Oracle computes the embedding internally.

 ---
