Update README with all recent changes

- Project structure: add index_images_indb.py
- Architecture: fix schema names (VECTORS_USER/VECTOR), HNSW for all three
- Database schemas: separate sections for VECTORS_USER and VECTOR, photo storage differences
- Indexing scripts: three-way comparison table, measured avg times (12.1s/12.1s/13.6s)
- ORA-24816 workaround documented
- Performance comparison: real benchmark numbers, HNSW for in-DB, photo storage row
- Oracle in-DB section: HNSW index creation, index_images_indb.py for population
- Re-index section: add index_images_indb.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 11:17:27 +02:00
parent 3ef43019be
commit f2869d2e01
+87 -38
@@ -40,8 +40,8 @@ ML library is loaded or called at search time.
 │ PostgreSQL 18 │ │ Oracle 26ai │ │ Oracle 26ai │
 │ + pgvector 0.8.2 │ │ (version 23.26.1) │ │ (version 23.26.1) │
 │ database: │ │ PDB: FREEPDB1 │ │ PDB: FREEPDB1 │
-│ vectors_demo │ │ user: vectors_user │ │ schema: VECTOR │
-│ HNSW index │ │ HNSW index │ │ HNSW not needed │
+│ vectors_demo │ │ schema: VECTORS_USER│ │ schema: VECTOR │
+│ HNSW index │ │ HNSW index │ │ HNSW index │
 └────────┬─────────────┘ └──────────┬───────────┘ └──────────┬────────────┘
 │ │ │
 ▼ ▼ │
@@ -88,7 +88,8 @@ vector-search-demo/
 │ ├── .env # Oracle credentials, photo path
 │ ├── db_oracle.py # Oracle connection factory
 │ ├── embedder.py # CLIP model wrapper (identical to pgvector)
-│ ├── index_images_oracle.py # One-time indexing script (Python embedding)
+│ ├── index_images_oracle.py # One-time indexing script (Python embedding, VECTORS_USER)
+│ ├── index_images_indb.py # One-time indexing script (in-DB embedding, VECTOR schema)
 │ ├── main_oracle.py # FastAPI app — Python embedding (port 8001)
 │ └── main_oracle_indb.py # FastAPI app — in-database embedding (port 8002)
 └── frontend/
@@ -130,7 +131,7 @@ The `pgvector/pgvector:pg18` image includes pgvector pre-installed. See the
 | Container name | `oracle.free` |
 | Host port | 37611 (mapped to 1521 inside container) |
 | Pluggable Database | FREEPDB1 |
-| Schema users | `vectors_user`, `VECTOR` |
+| Schema users | `VECTORS_USER`, `VECTOR` |
 **Oracle vector memory** — the HNSW index is held entirely in the SGA's Vector
 Memory Area. This is already configured:
@@ -215,10 +216,11 @@ CREATE INDEX images_embedding_idx
 ON images USING hnsw (embedding vector_cosine_ops);
 ```
-### Oracle 26ai
+### Oracle 26ai — schema VECTORS_USER (Python embedding backend)
 ```sql
--- PDB: FREEPDB1, user: vectors_user
+-- PDB: FREEPDB1, schema: VECTORS_USER
+-- Photos stored as file paths on the app server filesystem
 CREATE TABLE images (
 id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
@@ -235,6 +237,36 @@ CREATE VECTOR INDEX images_embedding_idx
 PARAMETERS (type HNSW, neighbors 32, efconstruction 200);
 ```
+### Oracle 26ai — schema VECTOR (in-database embedding backend)
+```sql
+-- PDB: FREEPDB1, schema: VECTOR
+-- Photos stored as BLOBs inside Oracle — no filesystem access at query time
+CREATE TABLE foto_vektor (
+id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
+filename VARCHAR2(100),
+foto BLOB, -- full JPEG stored in Oracle
+foto_vek VECTOR -- embedding computed by CLIP_IMG ONNX model
+);
+CREATE VECTOR INDEX foto_vektor_idx
+ON foto_vektor(foto_vek)
+ORGANIZATION INMEMORY NEIGHBOR GRAPH
+WITH DISTANCE COSINE
+WITH TARGET ACCURACY 95
+PARAMETERS (type HNSW, neighbors 32, efconstruction 200);
+```
+**Key difference between the two Oracle schemas:**
+| Aspect | VECTORS_USER | VECTOR |
+|---|---|---|
+| Photo storage | File path (filesystem) | BLOB (inside Oracle) |
+| Embedding at index time | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_IMG)` |
+| Embedding at query time | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_TXT)` |
+| Indexed by | `index_images_oracle.py` | `index_images_indb.py` |
 **Key schema differences:**
 | Aspect | PostgreSQL/pgvector | Oracle 26ai |
@@ -268,21 +300,29 @@ Runs in **thin mode** — no Oracle Instant Client installation is required on t
 ### Indexing scripts
-Both scripts are idempotent: they check for existing rows and skip already-indexed
+All three scripts are idempotent: they check for existing rows and skip already-indexed
 photos. Each photo is committed individually so a crash does not lose prior work.
-| | `index_images.py` | `index_images_oracle.py` |
-|---|---|---|
-| Run command | `python3 index_images.py` | `python3 index_images_oracle.py` |
-| Vector bind | Python `list` passed directly | `array.array("f", embedding)` required |
-| Bind style | `%s` placeholders (psycopg2) | `:1`, `:2`, `:3` positional (oracledb) |
-| Runtime (116 photos, CPU) | ~26 seconds | ~16 seconds |
+| | `index_images.py` | `index_images_oracle.py` | `index_images_indb.py` |
+|---|---|---|---|
+| Schema | PostgreSQL `vectors_demo` | Oracle `VECTORS_USER` | Oracle `VECTOR` |
+| Run command | `python3 index_images.py` | `python3 index_images_oracle.py` | `python3 index_images_indb.py` |
+| Photo data sent | File path | File path | Full JPEG as BLOB |
+| Embedding | Python CLIP | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_IMG)` |
+| Vector bind | Python `list` | `array.array("f", ...)` | Computed inside Oracle |
+| Avg runtime (3 runs, CPU) | **12.1 s** | **12.1 s** | **13.6 s** |
-**Why `array.array` for Oracle?**
+**Why `array.array` for `index_images_oracle.py`?**
 The `python-oracledb` driver does not accept a plain Python list for a `VECTOR`
 column. The data must be a Python `array.array` with typecode `"f"` (32-bit float),
 matching the `FLOAT32` declaration in the Oracle column type.
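The conversion described in the `array.array` note can be sketched roughly as follows. This is an illustration, not code taken from `index_images_oracle.py`: the helper name `to_oracle_vector` is hypothetical, and the 512-value length check assumes the CLIP ViT-B/32 embedding size. The resulting array would then be passed as a bind value to `cursor.execute`, which is not shown here.

```python
import array

# Assumption: CLIP ViT-B/32 embeddings have 512 dimensions.
EMBEDDING_DIM = 512


def to_oracle_vector(embedding):
    """Convert a sequence of floats into a float32 array.array (hypothetical helper)."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} values, got {len(embedding)}")
    # Typecode "f" stores every value as a 32-bit C float.
    return array.array("f", embedding)


vec = to_oracle_vector([0.25] * EMBEDDING_DIM)
print(vec.typecode, len(vec))
```

Validating the length before the bind is a design choice of this sketch: it surfaces a wrong-sized embedding in Python rather than as a database error.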
+**Why two SQL statements in `index_images_indb.py`?**
+Oracle raises `ORA-24816` if a BLOB bind variable appears before another bind in the
+same `VALUES` clause. The script works around this by inserting the BLOB first, then
+updating the vector in a second statement — letting Oracle read the stored BLOB to
+compute the embedding internally.
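A minimal sketch of the two-statement pattern in the ORA-24816 note, written against a python-oracledb style connection. It is not the actual body of `index_images_indb.py`: the function name `index_photo`, the exact SQL text, and the `WHERE filename = :1` lookup (which assumes filenames are unique) are all assumptions made for illustration.

```python
# Assumed SQL: the real statements in index_images_indb.py may differ.
INSERT_SQL = "INSERT INTO foto_vektor (filename, foto) VALUES (:1, :2)"
UPDATE_SQL = (
    "UPDATE foto_vektor "
    "SET foto_vek = VECTOR_EMBEDDING(CLIP_IMG USING foto AS data) "
    "WHERE filename = :1"  # assumes filenames are unique
)


def index_photo(connection, filename, jpeg_bytes):
    """Store one photo, then let Oracle embed the stored BLOB (hypothetical helper)."""
    with connection.cursor() as cursor:
        # Statement 1: insert the row with the BLOB as the last bind; foto_vek stays NULL.
        cursor.execute(INSERT_SQL, [filename, jpeg_bytes])
        # Statement 2: Oracle reads the stored BLOB and computes the vector itself.
        cursor.execute(UPDATE_SQL, [filename])
    # One commit per photo, so a crash does not lose earlier work.
    connection.commit()
```

With a real connection this might be called as `index_photo(conn, path.name, path.read_bytes())` for each JPEG in `PHOTOS_DIR`.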
 ---
 ### FastAPI applications
@@ -470,16 +510,22 @@ podman cp oravector-demo/sql/setup_vector_schema.sql oracle.free:/tmp/
 podman exec oracle.free bash -c "sqlplus -s / as sysdba @/tmp/setup_vector_schema.sql"
 ```
-**Populate `FOTO_VEKTOR`** with images and their vectors (run as VECTOR user in SQL):
-```sql
--- Example: insert one photo with its CLIP_IMG embedding
-INSERT INTO vector.foto_vektor (filename, foto, foto_vek)
-VALUES (
-'photo.jpg',
-TO_BLOB(BFILENAME('VEC_DUMP', 'photo.jpg')),
-VECTOR_EMBEDDING(CLIP_IMG USING TO_BLOB(BFILENAME('VEC_DUMP', 'photo.jpg')) AS data)
-);
-COMMIT;
+**Add HNSW index** (after the table is created):
+```bash
+podman exec oracle.free bash -c "sqlplus -s 'vector/Vektor@localhost:1521/FREEPDB1' <<'EOF'
+CREATE VECTOR INDEX foto_vektor_idx
+ON VECTOR.FOTO_VEKTOR(foto_vek)
+ORGANIZATION INMEMORY NEIGHBOR GRAPH
+WITH DISTANCE COSINE WITH TARGET ACCURACY 95
+PARAMETERS (type HNSW, neighbors 32, efconstruction 200);
+EXIT;
+EOF"
+```
+**Populate `FOTO_VEKTOR`** using the indexing script (reads JPEGs from `PHOTOS_DIR`,
+sends them as BLOBs to Oracle, which computes embeddings via `VECTOR_EMBEDDING(CLIP_IMG)`):
+```bash
+cd oravector-demo/backend && python3 index_images_indb.py
 ```
 ---
@@ -519,11 +565,11 @@ cd oravector-demo/backend && uvicorn main_oracle_indb:app --host 0.0.0.0 --port
 # PostgreSQL
 cd pgvector-demo/backend && python3 index_images.py
-# Oracle (Python embedding)
+# Oracle VECTORS_USER (Python embedding)
 cd oravector-demo/backend && python3 index_images_oracle.py
-# Oracle in-database: re-indexing is done in SQL directly
-# (the VECTOR schema's FOTO_VEKTOR table is managed by Oracle)
+# Oracle VECTOR (in-database embedding)
+cd oravector-demo/backend && python3 index_images_indb.py
 ```
 ---
@@ -537,14 +583,15 @@ installation. The setup involved:
 1. Creating a `VECTOR` database user
 2. Exporting CLIP (ViT-B/32) to ONNX format and loading the models via
 `DBMS_VECTOR.LOAD_ONNX_MODEL`
-3. Creating and populating the `FOTO_VEKTOR` table with images and their vectors
+3. Creating the `FOTO_VEKTOR` table and HNSW index
+4. Populating `FOTO_VEKTOR` using `index_images_indb.py`
 The resulting models and table are:
 | Object | Type | Input | Output | Purpose |
 |---|---|---|---|---|
-| `VECTOR.CLIP_TXT` | ONNX model | `VARCHAR2` text | `VECTOR(512)` | Embed text queries |
-| `VECTOR.CLIP_IMG` | ONNX model | `BLOB` image | `VECTOR(512)` | Embed image data |
+| `VECTOR.CLIP_TXT` | ONNX model | `VARCHAR2` text | `VECTOR(512)` | Embed text queries at search time |
+| `VECTOR.CLIP_IMG` | ONNX model | `BLOB` image | `VECTOR(512)` | Embed images at index time |
 | `VECTOR.FOTO_VEKTOR` | Table | — | — | Stores filenames, image BLOBs, and vectors |
 These are called with the `VECTOR_EMBEDDING()` SQL function. The table
@@ -591,18 +638,20 @@ Measured on this installation (CPU only, no GPU):
 | Metric | PostgreSQL + pgvector | Oracle 26ai (Python embed) | Oracle 26ai (in-DB embed) |
 |---|---|---|---|
-| Photos indexed | 116 | 116 | 116 (manually indexed) |
-| Indexing time | ~26 seconds | ~16 seconds | 0 (indexed separately by admin) |
-| Index type | HNSW (on disk) | HNSW (in-memory) | Full table scan (116 rows) |
+| Photos indexed | 116 | 116 | 116 |
+| Avg indexing time (3 runs, CPU) | **12.1 s** | **12.1 s** | **13.6 s** |
+| Index type | HNSW (on disk) | HNSW (in-memory) | HNSW (in-memory) |
 | Memory required | None | 512 MB SGA | 512 MB SGA |
+| Photo storage | File path (filesystem) | File path (filesystem) | BLOB (in Oracle) |
 | Python CLIP at query time | Yes | Yes | **No** |
-| Embedding location | Python process | Python process | Inside Oracle SQL |
+| Embedding at index time | Python CLIP | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_IMG)` |
+| Embedding at query time | Python CLIP | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_TXT)` |
 | `VECTOR_EMBEDDING()` used | No | No | **Yes** |
+| Oracle schema | — | `VECTORS_USER` | `VECTOR` |
-Note: indexing time for backends 1 and 2 is dominated by CLIP inference (CPU),
-not database write speed. The in-database backend uses the manually loaded CLIP
-models in the `VECTOR` schema; their indexing time is not measured here as it
-was performed separately by the administrator.
+Note: indexing time is dominated by CLIP inference for backends 1 and 2 (CPU, no GPU).
+Backend 3 is slightly slower because each photo is transferred as a full JPEG BLOB
+to Oracle over the network before Oracle computes the embedding internally.
 ---