diff --git a/README.md b/README.md index d722104..135a1aa 100644 --- a/README.md +++ b/README.md @@ -40,8 +40,8 @@ ML library is loaded or called at search time. │ PostgreSQL 18 │ │ Oracle 26ai │ │ Oracle 26ai │ │ + pgvector 0.8.2 │ │ (version 23.26.1) │ │ (version 23.26.1) │ │ database: │ │ PDB: FREEPDB1 │ │ PDB: FREEPDB1 │ - │ vectors_demo │ │ user: vectors_user │ │ schema: VECTOR │ - │ HNSW index │ │ HNSW index │ │ HNSW not needed │ + │ vectors_demo │ │ schema: VECTORS_USER│ │ schema: VECTOR │ + │ HNSW index │ │ HNSW index │ │ HNSW index │ └────────┬─────────────┘ └──────────┬───────────┘ └──────────┬────────────┘ │ │ │ ▼ ▼ │ @@ -88,7 +88,8 @@ vector-search-demo/ │ ├── .env # Oracle credentials, photo path │ ├── db_oracle.py # Oracle connection factory │ ├── embedder.py # CLIP model wrapper (identical to pgvector) - │ ├── index_images_oracle.py # One-time indexing script (Python embedding) + │ ├── index_images_oracle.py # One-time indexing script (Python embedding, VECTORS_USER) + │ ├── index_images_indb.py # One-time indexing script (in-DB embedding, VECTOR schema) │ ├── main_oracle.py # FastAPI app — Python embedding (port 8001) │ └── main_oracle_indb.py # FastAPI app — in-database embedding (port 8002) └── frontend/ @@ -130,7 +131,7 @@ The `pgvector/pgvector:pg18` image includes pgvector pre-installed. See the | Container name | `oracle.free` | | Host port | 37611 (mapped to 1521 inside container) | | Pluggable Database | FREEPDB1 | -| Schema users | `vectors_user`, `VECTOR` | +| Schema users | `VECTORS_USER`, `VECTOR` | **Oracle vector memory** — the HNSW index is held entirely in the SGA's Vector Memory Area. 
This is already configured: @@ -215,10 +216,11 @@ CREATE INDEX images_embedding_idx ON images USING hnsw (embedding vector_cosine_ops); ``` -### Oracle 26ai +### Oracle 26ai — schema VECTORS_USER (Python embedding backend) ```sql --- PDB: FREEPDB1, user: vectors_user +-- PDB: FREEPDB1, schema: VECTORS_USER +-- Photos stored as file paths on the app server filesystem CREATE TABLE images ( id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY, @@ -235,6 +237,36 @@ CREATE VECTOR INDEX images_embedding_idx PARAMETERS (type HNSW, neighbors 32, efconstruction 200); ``` +### Oracle 26ai — schema VECTOR (in-database embedding backend) + +```sql +-- PDB: FREEPDB1, schema: VECTOR +-- Photos stored as BLOBs inside Oracle — no filesystem access at query time + +CREATE TABLE foto_vektor ( + id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY, + filename VARCHAR2(100), + foto BLOB, -- full JPEG stored in Oracle + foto_vek VECTOR -- embedding computed by CLIP_IMG ONNX model +); + +CREATE VECTOR INDEX foto_vektor_idx + ON foto_vektor(foto_vek) + ORGANIZATION INMEMORY NEIGHBOR GRAPH + WITH DISTANCE COSINE + WITH TARGET ACCURACY 95 + PARAMETERS (type HNSW, neighbors 32, efconstruction 200); +``` + +**Key difference between the two Oracle schemas:** + +| Aspect | VECTORS_USER | VECTOR | +|---|---|---| +| Photo storage | File path (filesystem) | BLOB (inside Oracle) | +| Embedding at index time | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_IMG)` | +| Embedding at query time | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_TXT)` | +| Indexed by | `index_images_oracle.py` | `index_images_indb.py` | + **Key schema differences:** | Aspect | PostgreSQL/pgvector | Oracle 26ai | @@ -268,21 +300,29 @@ Runs in **thin mode** — no Oracle Instant Client installation is required on t ### Indexing scripts -Both scripts are idempotent: they check for existing rows and skip already-indexed +All three scripts are idempotent: they check for existing rows and skip already-indexed photos. 
Each photo is committed individually so a crash does not lose prior work. -| | `index_images.py` | `index_images_oracle.py` | -|---|---|---| -| Run command | `python3 index_images.py` | `python3 index_images_oracle.py` | -| Vector bind | Python `list` passed directly | `array.array("f", embedding)` required | -| Bind style | `%s` placeholders (psycopg2) | `:1`, `:2`, `:3` positional (oracledb) | -| Runtime (116 photos, CPU) | ~26 seconds | ~16 seconds | +| | `index_images.py` | `index_images_oracle.py` | `index_images_indb.py` | +|---|---|---|---| +| Schema | PostgreSQL `vectors_demo` | Oracle `VECTORS_USER` | Oracle `VECTOR` | +| Run command | `python3 index_images.py` | `python3 index_images_oracle.py` | `python3 index_images_indb.py` | +| Photo data sent | File path | File path | Full JPEG as BLOB | +| Embedding | Python CLIP | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_IMG)` | +| Vector bind | Python `list` | `array.array("f", ...)` | Computed inside Oracle | +| Avg runtime (3 runs, CPU) | **12.1 s** | **12.1 s** | **13.6 s** | -**Why `array.array` for Oracle?** +**Why `array.array` for `index_images_oracle.py`?** The `python-oracledb` driver does not accept a plain Python list for a `VECTOR` column. The data must be a Python `array.array` with typecode `"f"` (32-bit float), matching the `FLOAT32` declaration in the Oracle column type. +**Why two SQL statements in `index_images_indb.py`?** +Oracle raises `ORA-24816` if a BLOB bind variable appears before another bind in the +same `VALUES` clause. The script works around this by inserting the BLOB first, then +updating the vector in a second statement — letting Oracle read the stored BLOB to +compute the embedding internally. 
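The two binding notes above can be sketched as follows. This is a minimal illustration, not the code from `index_images_oracle.py` or `index_images_indb.py`: the helper names `to_float32_array` and `insert_photo_indb` are hypothetical, the SQL text is an assumption based on the `foto_vektor` table shown earlier, and matching rows by `filename` assumes filenames are unique. The `cursor` argument is expected to behave like a `python-oracledb` cursor.

```python
import array
from typing import Any, Sequence

# Assumed SQL, modelled on the foto_vektor table definition in this README.
# Statement 1 stores only the filename and the JPEG bytes (the BLOB bind is last).
INSERT_SQL = "INSERT INTO foto_vektor (filename, foto) VALUES (:1, :2)"

# Statement 2 lets Oracle read the stored BLOB and compute the embedding itself,
# so no BLOB bind is needed alongside any other bind variable.
UPDATE_SQL = (
    "UPDATE foto_vektor "
    "SET foto_vek = VECTOR_EMBEDDING(CLIP_IMG USING foto AS data) "
    "WHERE filename = :1"
)


def to_float32_array(embedding: Sequence[float]) -> array.array:
    """Convert a plain Python sequence into a 32-bit float array for a VECTOR bind."""
    return array.array("f", embedding)


def insert_photo_indb(cursor: Any, filename: str, jpeg_bytes: bytes) -> None:
    """Insert the BLOB first, then compute the vector inside Oracle (hypothetical helper)."""
    cursor.execute(INSERT_SQL, [filename, jpeg_bytes])
    cursor.execute(UPDATE_SQL, [filename])
```

With a real connection, a caller would invoke `insert_photo_indb(cursor, name, data)` and then commit once per photo, in line with the per-photo commit behaviour described above. For the Python-embedding path, the result of `to_float32_array(embedding)` is the value passed as the vector bind.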
+ --- ### FastAPI applications @@ -470,16 +510,22 @@ podman cp oravector-demo/sql/setup_vector_schema.sql oracle.free:/tmp/ podman exec oracle.free bash -c "sqlplus -s / as sysdba @/tmp/setup_vector_schema.sql" ``` -**Populate `FOTO_VEKTOR`** with images and their vectors (run as VECTOR user in SQL): -```sql --- Example: insert one photo with its CLIP_IMG embedding -INSERT INTO vector.foto_vektor (filename, foto, foto_vek) -VALUES ( - 'photo.jpg', - TO_BLOB(BFILENAME('VEC_DUMP', 'photo.jpg')), - VECTOR_EMBEDDING(CLIP_IMG USING TO_BLOB(BFILENAME('VEC_DUMP', 'photo.jpg')) AS data) -); -COMMIT; +**Add HNSW index** (after the table is created): +```bash +podman exec oracle.free bash -c "sqlplus -s 'vector/Vektor@localhost:1521/FREEPDB1' <<'EOF' +CREATE VECTOR INDEX foto_vektor_idx + ON VECTOR.FOTO_VEKTOR(foto_vek) + ORGANIZATION INMEMORY NEIGHBOR GRAPH + WITH DISTANCE COSINE WITH TARGET ACCURACY 95 + PARAMETERS (type HNSW, neighbors 32, efconstruction 200); +EXIT; +EOF" +``` + +**Populate `FOTO_VEKTOR`** using the indexing script (reads JPEGs from `PHOTOS_DIR`, +sends them as BLOBs to Oracle, which computes embeddings via `VECTOR_EMBEDDING(CLIP_IMG)`): +```bash +cd oravector-demo/backend && python3 index_images_indb.py ``` --- @@ -519,11 +565,11 @@ cd oravector-demo/backend && uvicorn main_oracle_indb:app --host 0.0.0.0 --port # PostgreSQL cd pgvector-demo/backend && python3 index_images.py -# Oracle (Python embedding) +# Oracle VECTORS_USER (Python embedding) cd oravector-demo/backend && python3 index_images_oracle.py -# Oracle in-database: re-indexing is done in SQL directly -# (the VECTOR schema's FOTO_VEKTOR table is managed by Oracle) +# Oracle VECTOR (in-database embedding) +cd oravector-demo/backend && python3 index_images_indb.py ``` --- @@ -537,14 +583,15 @@ installation. The setup involved: 1. Creating a `VECTOR` database user 2. Exporting CLIP (ViT-B/32) to ONNX format and loading the models via `DBMS_VECTOR.LOAD_ONNX_MODEL` -3. 
Creating and populating the `FOTO_VEKTOR` table with images and their vectors +3. Creating the `FOTO_VEKTOR` table and HNSW index +4. Populating `FOTO_VEKTOR` using `index_images_indb.py` The resulting models and table are: | Object | Type | Input | Output | Purpose | |---|---|---|---|---| -| `VECTOR.CLIP_TXT` | ONNX model | `VARCHAR2` text | `VECTOR(512)` | Embed text queries | -| `VECTOR.CLIP_IMG` | ONNX model | `BLOB` image | `VECTOR(512)` | Embed image data | +| `VECTOR.CLIP_TXT` | ONNX model | `VARCHAR2` text | `VECTOR(512)` | Embed text queries at search time | +| `VECTOR.CLIP_IMG` | ONNX model | `BLOB` image | `VECTOR(512)` | Embed images at index time | | `VECTOR.FOTO_VEKTOR` | Table | — | — | Stores filenames, image BLOBs, and vectors | These are called with the `VECTOR_EMBEDDING()` SQL function. The table @@ -591,18 +638,20 @@ Measured on this installation (CPU only, no GPU): | Metric | PostgreSQL + pgvector | Oracle 26ai (Python embed) | Oracle 26ai (in-DB embed) | |---|---|---|---| -| Photos indexed | 116 | 116 | 116 (manually indexed) | -| Indexing time | ~26 seconds | ~16 seconds | 0 (indexed separately by admin) | -| Index type | HNSW (on disk) | HNSW (in-memory) | Full table scan (116 rows) | +| Photos indexed | 116 | 116 | 116 | +| Avg indexing time (3 runs, CPU) | **12.1 s** | **12.1 s** | **13.6 s** | +| Index type | HNSW (on disk) | HNSW (in-memory) | HNSW (in-memory) | | Memory required | None | 512 MB SGA | 512 MB SGA | +| Photo storage | File path (filesystem) | File path (filesystem) | BLOB (in Oracle) | | Python CLIP at query time | Yes | Yes | **No** | -| Embedding location | Python process | Python process | Inside Oracle SQL | +| Embedding at index time | Python CLIP | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_IMG)` | +| Embedding at query time | Python CLIP | Python CLIP | Oracle `VECTOR_EMBEDDING(CLIP_TXT)` | | `VECTOR_EMBEDDING()` used | No | No | **Yes** | +| Oracle schema | — | `VECTORS_USER` | `VECTOR` | -Note: indexing time for 
backends 1 and 2 is dominated by CLIP inference (CPU), -not database write speed. The in-database backend uses the manually loaded CLIP -models in the `VECTOR` schema; their indexing time is not measured here as it -was performed separately by the administrator. +Note: indexing time is dominated by CLIP inference for backends 1 and 2 (CPU, no GPU). +Backend 3 is slightly slower because each photo is transferred as a full JPEG BLOB +to Oracle over the network before Oracle computes the embedding internally. ---