Update README with all recent changes

- Project structure: add index_images_indb.py
- Architecture: fix schema names (VECTORS_USER/VECTOR), HNSW for all three
- Database schemas: separate sections for VECTORS_USER and VECTOR, photo storage differences
- Indexing scripts: three-way comparison table, measured avg times (12.1s/12.1s/13.6s)
- ORA-24816 workaround documented
- Performance comparison: real benchmark numbers, HNSW for in-DB, photo storage row
- Oracle in-DB section: HNSW index creation, index_images_indb.py for population
- Re-index section: add index_images_indb.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 11:17:27 +02:00
parent 3ef43019be
commit f2869d2e01
+87 -38
@@ -40,8 +40,8 @@ ML library is loaded or called at search time.
│ PostgreSQL 18 │ │ Oracle 26ai │ │ Oracle 26ai │
│ + pgvector 0.8.2 │ │ (version 23.26.1) │ │ (version 23.26.1) │
│ database: │ │ PDB: FREEPDB1 │ │ PDB: FREEPDB1 │
│ vectors_demo │ │ user: vectors_user │ │ schema: VECTOR │
│ HNSW index │ │ HNSW index │ │ HNSW not needed
│ vectors_demo │ │ schema: VECTORS_USER│ │ schema: VECTOR │
│ HNSW index │ │ HNSW index │ │ HNSW index
└────────┬─────────────┘ └──────────┬───────────┘ └──────────┬────────────┘
│ │ │
▼ ▼ │
@@ -88,7 +88,8 @@ vector-search-demo/
│ ├── .env # Oracle credentials, photo path
│ ├── db_oracle.py # Oracle connection factory
│ ├── embedder.py # CLIP model wrapper (identical to pgvector)
│ ├── index_images_oracle.py # One-time indexing script (Python embedding)
│ ├── index_images_oracle.py # One-time indexing script (Python embedding, VECTORS_USER)
│ ├── index_images_indb.py # One-time indexing script (in-DB embedding, VECTOR schema)
│ ├── main_oracle.py # FastAPI app — Python embedding (port 8001)
│ └── main_oracle_indb.py # FastAPI app — in-database embedding (port 8002)
└── frontend/
@@ -130,7 +131,7 @@ The `pgvector/pgvector:pg18` image includes pgvector pre-installed. See the
| Container name | `oracle.free` |
| Host port | 37611 (mapped to 1521 inside container) |
| Pluggable Database | FREEPDB1 |
| Schema users | `vectors_user`, `VECTOR` |
| Schema users | `VECTORS_USER`, `VECTOR` |
**Oracle vector memory** — the HNSW index is held entirely in the SGA's Vector
Memory Area. This is already configured:
@@ -215,10 +216,11 @@ CREATE INDEX images_embedding_idx
ON images USING hnsw (embedding vector_cosine_ops);
```
### Oracle 26ai
### Oracle 26ai — schema VECTORS_USER (Python embedding backend)
```sql
-- PDB: FREEPDB1, user: vectors_user
-- PDB: FREEPDB1, schema: VECTORS_USER
-- Photos stored as file paths on the app server filesystem
CREATE TABLE images (
id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
@@ -235,6 +237,36 @@ CREATE VECTOR INDEX images_embedding_idx
PARAMETERS (type HNSW, neighbors 32, efconstruction 200);
```
### Oracle 26ai — schema VECTOR (in-database embedding backend)
```sql
-- PDB: FREEPDB1, schema: VECTOR
-- Photos stored as BLOBs inside Oracle — no filesystem access at query time
CREATE TABLE foto_vektor (
id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
filename VARCHAR2(100),
foto BLOB, -- full JPEG stored in Oracle
foto_vek VECTOR -- embedding computed by CLIP_IMG ONNX model
);
CREATE VECTOR INDEX foto_vektor_idx
ON foto_vektor(foto_vek)
ORGANIZATION INMEMORY NEIGHBOR GRAPH
WITH DISTANCE COSINE
WITH TARGET ACCURACY 95
PARAMETERS (type HNSW, neighbors 32, efconstruction 200);
```
**Key difference between the two Oracle schemas:**
| Aspect | VECTORS_USER | VECTOR |
|---|---|---|
| Photo storage | File path (filesystem) | BLOB (inside Oracle) |
| Embedding at index time | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_IMG)` |
| Embedding at query time | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_TXT)` |
| Indexed by | `index_images_oracle.py` | `index_images_indb.py` |
**Key schema differences:**
| Aspect | PostgreSQL/pgvector | Oracle 26ai |
@@ -268,21 +300,29 @@ Runs in **thin mode** — no Oracle Instant Client installation is required on t
### Indexing scripts
Both scripts are idempotent: they check for existing rows and skip already-indexed
All three scripts are idempotent: they check for existing rows and skip already-indexed
photos. Each photo is committed individually so a crash does not lose prior work.
| | `index_images.py` | `index_images_oracle.py` |
|---|---|---|
| Run command | `python3 index_images.py` | `python3 index_images_oracle.py` |
| Vector bind | Python `list` passed directly | `array.array("f", embedding)` required |
| Bind style | `%s` placeholders (psycopg2) | `:1`, `:2`, `:3` positional (oracledb) |
| Runtime (116 photos, CPU) | ~26 seconds | ~16 seconds |
| | `index_images.py` | `index_images_oracle.py` | `index_images_indb.py` |
|---|---|---|---|
| Schema | PostgreSQL `vectors_demo` | Oracle `VECTORS_USER` | Oracle `VECTOR` |
| Run command | `python3 index_images.py` | `python3 index_images_oracle.py` | `python3 index_images_indb.py` |
| Photo data sent | File path | File path | Full JPEG as BLOB |
| Embedding | Python CLIP | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_IMG)` |
| Vector bind | Python `list` | `array.array("f", ...)` | Computed inside Oracle |
| Avg runtime (3 runs, CPU) | **12.1 s** | **12.1 s** | **13.6 s** |
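The skip-if-present, commit-per-photo behaviour described above can be sketched roughly as follows. This is an illustration only, not the scripts' actual code: `index_photos`, `insert_one`, and `commit` are hypothetical names, and an in-memory set stands in for the "existing rows" lookup.

```python
def index_photos(filenames, already_indexed, insert_one, commit):
    """Index only photos that are not stored yet, committing after each one."""
    added = []
    for name in filenames:
        if name in already_indexed:
            continue  # already indexed on a previous run: skip it
        insert_one(name)
        commit()  # per-photo commit: a later crash keeps the earlier rows
        already_indexed.add(name)
        added.append(name)
    return added


# Stand-ins for the database: a set of known filenames and a list of "rows".
store = {"a.jpg"}
rows = []
commits = []

first = index_photos(["a.jpg", "b.jpg", "c.jpg"], store, rows.append,
                     lambda: commits.append(1))
second = index_photos(["a.jpg", "b.jpg", "c.jpg"], store, rows.append,
                      lambda: commits.append(1))
print(first, second)
```

Running the function a second time over the same list adds nothing, which is the property that makes a re-run after a crash safe.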
**Why `array.array` for Oracle?**
**Why `array.array` for `index_images_oracle.py`?**
The `python-oracledb` driver does not accept a plain Python list for a `VECTOR`
column. The data must be a Python `array.array` with typecode `"f"` (32-bit float),
matching the `FLOAT32` declaration in the Oracle column type.
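A minimal sketch of that conversion, assuming the embedder returns a plain list of Python floats. `to_oracle_vector` is a hypothetical helper name, not a function from the repository:

```python
import array


def to_oracle_vector(embedding):
    """Convert a sequence of floats into a 32-bit float array for binding."""
    # Typecode "f" is a C float, matching a FLOAT32 vector column.
    return array.array("f", embedding)


vec = to_oracle_vector([0.25, -0.5, 1.0])
print(vec.typecode, vec.itemsize, len(vec))
```

The resulting `array.array` is what would be passed as the bind value in place of the list.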
**Why two SQL statements in `index_images_indb.py`?**
Oracle raises `ORA-24816` if a BLOB bind variable appears before another bind in the
same `VALUES` clause. The script works around this by inserting the BLOB first, then
updating the vector in a second statement — letting Oracle read the stored BLOB to
compute the embedding internally.
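The two-statement workaround might look roughly like this. The SQL text, the positional binds, and the choice to match the row by `filename` (which assumes filenames are unique) are assumptions for illustration, not copied from `index_images_indb.py`. `RecordingCursor` is a stand-in so the sketch runs without an Oracle connection.

```python
def insert_photo_two_step(cursor, filename, jpeg_bytes):
    """Insert the BLOB first, then compute the embedding in a second statement."""
    # Step 1: store the filename and raw JPEG; the vector column stays NULL.
    cursor.execute(
        "INSERT INTO foto_vektor (filename, foto) VALUES (:1, :2)",
        [filename, jpeg_bytes],
    )
    # Step 2: Oracle reads the stored BLOB and computes the embedding itself.
    cursor.execute(
        "UPDATE foto_vektor "
        "SET foto_vek = VECTOR_EMBEDDING(CLIP_IMG USING foto AS data) "
        "WHERE filename = :1",
        [filename],
    )


class RecordingCursor:
    """Hypothetical stand-in for a database cursor; it only records calls."""

    def __init__(self):
        self.calls = []

    def execute(self, sql, params):
        self.calls.append((sql, params))


cur = RecordingCursor()
insert_photo_two_step(cur, "photo.jpg", b"\xff\xd8\xff")
for sql, _ in cur.calls:
    print(sql.split()[0])
```

With a real cursor, each call to the function would be followed by a commit, in line with the per-photo commit behaviour of the indexing scripts.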
---
### FastAPI applications
@@ -470,16 +510,22 @@ podman cp oravector-demo/sql/setup_vector_schema.sql oracle.free:/tmp/
podman exec oracle.free bash -c "sqlplus -s / as sysdba @/tmp/setup_vector_schema.sql"
```
**Populate `FOTO_VEKTOR`** with images and their vectors (run as VECTOR user in SQL):
```sql
-- Example: insert one photo with its CLIP_IMG embedding
INSERT INTO vector.foto_vektor (filename, foto, foto_vek)
VALUES (
'photo.jpg',
TO_BLOB(BFILENAME('VEC_DUMP', 'photo.jpg')),
VECTOR_EMBEDDING(CLIP_IMG USING TO_BLOB(BFILENAME('VEC_DUMP', 'photo.jpg')) AS data)
);
COMMIT;
**Add HNSW index** (after the table is created):
```bash
podman exec oracle.free bash -c "sqlplus -s 'vector/Vektor@localhost:1521/FREEPDB1' <<'EOF'
CREATE VECTOR INDEX foto_vektor_idx
ON VECTOR.FOTO_VEKTOR(foto_vek)
ORGANIZATION INMEMORY NEIGHBOR GRAPH
WITH DISTANCE COSINE WITH TARGET ACCURACY 95
PARAMETERS (type HNSW, neighbors 32, efconstruction 200);
EXIT;
EOF"
```
**Populate `FOTO_VEKTOR`** using the indexing script (reads JPEGs from `PHOTOS_DIR`,
sends them as BLOBs to Oracle, which computes embeddings via `VECTOR_EMBEDDING(CLIP_IMG)`):
```bash
cd oravector-demo/backend && python3 index_images_indb.py
```
---
@@ -519,11 +565,11 @@ cd oravector-demo/backend && uvicorn main_oracle_indb:app --host 0.0.0.0 --port
# PostgreSQL
cd pgvector-demo/backend && python3 index_images.py
# Oracle (Python embedding)
# Oracle VECTORS_USER (Python embedding)
cd oravector-demo/backend && python3 index_images_oracle.py
# Oracle in-database: re-indexing is done in SQL directly
# (the VECTOR schema's FOTO_VEKTOR table is managed by Oracle)
# Oracle VECTOR (in-database embedding)
cd oravector-demo/backend && python3 index_images_indb.py
```
---
@@ -537,14 +583,15 @@ installation. The setup involved:
1. Creating a `VECTOR` database user
2. Exporting CLIP (ViT-B/32) to ONNX format and loading the models via
`DBMS_VECTOR.LOAD_ONNX_MODEL`
3. Creating and populating the `FOTO_VEKTOR` table with images and their vectors
3. Creating the `FOTO_VEKTOR` table and HNSW index
4. Populating `FOTO_VEKTOR` using `index_images_indb.py`
The resulting models and table are:
| Object | Type | Input | Output | Purpose |
|---|---|---|---|---|
| `VECTOR.CLIP_TXT` | ONNX model | `VARCHAR2` text | `VECTOR(512)` | Embed text queries |
| `VECTOR.CLIP_IMG` | ONNX model | `BLOB` image | `VECTOR(512)` | Embed image data |
| `VECTOR.CLIP_TXT` | ONNX model | `VARCHAR2` text | `VECTOR(512)` | Embed text queries at search time |
| `VECTOR.CLIP_IMG` | ONNX model | `BLOB` image | `VECTOR(512)` | Embed images at index time |
| `VECTOR.FOTO_VEKTOR` | Table | — | — | Stores filenames, image BLOBs, and vectors |
These are called with the `VECTOR_EMBEDDING()` SQL function. The table
@@ -591,18 +638,20 @@ Measured on this installation (CPU only, no GPU):
| Metric | PostgreSQL + pgvector | Oracle 26ai (Python embed) | Oracle 26ai (in-DB embed) |
|---|---|---|---|
| Photos indexed | 116 | 116 | 116 (manually indexed) |
| Indexing time | ~26 seconds | ~16 seconds | 0 (indexed separately by admin) |
| Index type | HNSW (on disk) | HNSW (in-memory) | Full table scan (116 rows) |
| Photos indexed | 116 | 116 | 116 |
| Avg indexing time (3 runs, CPU) | **12.1 s** | **12.1 s** | **13.6 s** |
| Index type | HNSW (on disk) | HNSW (in-memory) | HNSW (in-memory) |
| Memory required | None | 512 MB SGA | 512 MB SGA |
| Photo storage | File path (filesystem) | File path (filesystem) | BLOB (in Oracle) |
| Python CLIP at query time | Yes | Yes | **No** |
| Embedding location | Python process | Python process | Inside Oracle SQL |
| Embedding at index time | Python CLIP | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_IMG)` |
| Embedding at query time | Python CLIP | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_TXT)` |
| `VECTOR_EMBEDDING()` used | No | No | **Yes** |
| Oracle schema | — | `VECTORS_USER` | `VECTOR` |
Note: indexing time for backends 1 and 2 is dominated by CLIP inference (CPU),
not database write speed. The in-database backend uses the manually loaded CLIP
models in the `VECTOR` schema; their indexing time is not measured here as it
was performed separately by the administrator.
Note: indexing time is dominated by CLIP inference for backends 1 and 2 (CPU, no GPU).
Backend 3 is slightly slower because each photo is transferred as a full JPEG BLOB
to Oracle over the network before Oracle computes the embedding internally.
---