Initial implementation of pgvector and Oracle 26ai vector search demo

Three FastAPI backends comparing PostgreSQL/pgvector and Oracle 26ai for semantic image search using CLIP embeddings: Python-side embedding for both databases, plus Oracle in-database embedding via VECTOR_EMBEDDING(CLIP_TXT).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 11:33:16 +02:00
commit 66f7db40b0
15 changed files with 1347 additions and 0 deletions
@@ -0,0 +1,466 @@
+# Vector Image Search — PostgreSQL/pgvector vs Oracle 26ai
+
+A comparative demo that vectorizes JPEG photos using the CLIP neural network model
+and stores the embeddings in two different databases: **PostgreSQL with pgvector**
+and **Oracle AI Database 26ai**. Users search the photo collection by typing
+plain-text keywords such as "trees" or "water" and receive results ranked by
+semantic similarity.
+
+Three backends are implemented, demonstrating two fundamental approaches to vector
+embedding:
+
+| Backend | Port | Embedding location | Model |
+|---|---|---|---|
+| PostgreSQL + pgvector | 8000 | Python (external) | sentence-transformers CLIP |
+| Oracle 26ai (Python embedding) | 8001 | Python (external) | sentence-transformers CLIP |
+| Oracle 26ai (in-database embedding) | 8002 | Inside Oracle SQL | CLIP_TXT ONNX model loaded into Oracle |
+
+The key architectural difference: in the third backend, the text query is embedded
+**inside a SQL statement** using Oracle's `VECTOR_EMBEDDING()` function — no Python
+ML library is loaded or called at search time.
+
+---
+
+## Architecture overview
+
+```
+                        115 JPEG photos
+                              │
+                              ▼
+              ┌───────────────────────────────┐
+              │   CLIP model (clip-ViT-B-32)  │
+              │   sentence-transformers lib   │
+              │   → 512-dimensional float vec │
+              └──────────────┬────────────────┘
+                             │
+              ┌──────────────┴──────────────┐
+              │                             │
+              ▼                             ▼
+  ┌──────────────────────┐    ┌──────────────────────┐    ┌───────────────────────┐
+  │  PostgreSQL 16       │    │  Oracle 26ai         │    │  Oracle 26ai          │
+  │  + pgvector 0.6.0    │    │  (version 23.26.1)   │    │  (version 23.26.1)    │
+  │  database:           │    │  PDB: FREEPDB1       │    │  PDB: FREEPDB1        │
+  │  vectors_demo        │    │  user: vectors_user  │    │  schema: VECTOR       │
+  │  HNSW index          │    │  HNSW index          │    │  HNSW not needed      │
+  └────────┬─────────────┘    └──────────┬───────────┘    └──────────┬────────────┘
+           │                             │                            │
+           ▼                             ▼                            │
+  Python CLIP encode          Python CLIP encode          Text stays in Oracle SQL
+  (search query)              (search query)              VECTOR_EMBEDDING(CLIP_TXT
+                                                          USING :q AS data)
+           │                             │                            │
+           ▼                             ▼                            ▼
+  ┌──────────────┐             ┌──────────────┐             ┌──────────────────┐
+  │  FastAPI     │             │  FastAPI     │             │  FastAPI         │
+  │  main.py     │             │  main_oracle │             │  main_oracle_    │
+  │  port 8000   │             │  port 8001   │             │  indb.py         │
+  └──────┬───────┘             └──────┬───────┘             │  port 8002       │
+         │                            │                     └────────┬─────────┘
+         ▼                            ▼                              ▼
+  frontend/index.html       frontend/index.html         frontend/index_indb.html
+  (badge: pgvector)         (badge: Oracle 26ai)        (badge: Oracle In-DB)
+```
+
+---
+
+## Project structure
+
+```
+pgvector-demo/
+├── backend/
+│   ├── .env               # PostgreSQL credentials, photo path
+│   ├── db.py              # PostgreSQL connection factory
+│   ├── embedder.py        # CLIP model wrapper
+│   ├── index_images.py    # One-time indexing script
+│   └── main.py            # FastAPI app (port 8000)
+└── frontend/
+    └── index.html         # Search UI
+
+oravector-demo/
+├── backend/
+│   ├── .env                     # Oracle credentials, photo path
+│   ├── db_oracle.py             # Oracle connection factory (vectors_user)
+│   ├── embedder.py              # CLIP model wrapper (identical to pgvector)
+│   ├── index_images_oracle.py   # One-time indexing script (Python embedding)
+│   ├── main_oracle.py           # FastAPI app — Python embedding (port 8001)
+│   └── main_oracle_indb.py      # FastAPI app — in-database embedding (port 8002)
+└── frontend/
+    ├── index.html               # Search UI (Oracle 26ai, Python embedding)
+    └── index_indb.html          # Search UI (Oracle 26ai, in-database embedding)
+```
+
+---
+
+## System components installed
+
+### Operating system packages
+
+| Package | Version | Purpose |
+|---|---|---|
+| PostgreSQL | 16.13 (Ubuntu) | Relational database |
+| postgresql-16-pgvector | 0.6.0 | Vector data type and indexes for PostgreSQL |
+| Python | 3.12.3 | Runtime for all backend code |
+| Podman | — | Container runtime for Oracle 26ai |
+
+**PostgreSQL pgvector installation:**
+```bash
+sudo apt install postgresql-16-pgvector
+```
+
+**pgvector extension activation** (requires superuser, run once per database):
+```bash
+sudo -u postgres psql -d vectors_demo -c "CREATE EXTENSION vector;"
+```
+
+### Oracle 26ai (Podman container)
+
+| Property | Value |
+|---|---|
+| Product | Oracle AI Database 26ai Free |
+| Version | 23.26.1.0.0 |
+| Container name | `oracle.free` |
+| Host port | 37611 (mapped to 1521 inside container) |
+| Pluggable Database | FREEPDB1 |
+| Schema user | `vectors_user` |
+
+**Oracle vector memory:** The HNSW index is held entirely in the SGA's Vector
+Memory Area. Its size is written to the SPFILE and takes effect only after the
+database is restarted:
+
+```sql
+-- Connect as SYSDBA to service FREE (CDB root)
+ALTER SYSTEM SET vector_memory_size = 512M SCOPE=SPFILE;
+```
+
+Then restart Oracle inside the container:
+```bash
+podman exec oracle.free bash -c "sqlplus -s / as sysdba <<'EOF'
+SHUTDOWN ABORT;
+EXIT;
+EOF"
+
+podman exec oracle.free bash -c "sqlplus -s / as sysdba <<'EOF'
+STARTUP;
+EXIT;
+EOF"
+```
+
+After restart, the SGA confirms: `Vector Memory Area: 536870912 bytes (512 MB)`.
+
+### Python packages
+
+| Package | Version | Used by | Purpose |
+|---|---|---|---|
+| `sentence-transformers` | 5.3.0 | both | CLIP model loading and inference |
+| `torch` | 2.11.0 | both | Neural network runtime for CLIP |
+| `Pillow` | 10.2.0 | both | JPEG loading and colour conversion |
+| `fastapi` | 0.135.2 | both | REST API framework |
+| `uvicorn` | 0.42.0 | both | ASGI server |
+| `python-dotenv` | 1.0.1 | both | `.env` file support |
+| `psycopg2-binary` | 2.9.11 | pgvector only | PostgreSQL driver |
+| `oracledb` | 3.4.2 | Oracle only | Oracle driver (thin mode, no client libs needed) |
+
+**Install all packages:**
+```bash
+pip3 install fastapi uvicorn psycopg2-binary oracledb sentence-transformers \
+             Pillow python-dotenv --break-system-packages
+```
+
+---
+
+## Vectorization
+
+### Model: CLIP (clip-ViT-B-32)
+
+CLIP (Contrastive Language–Image Pretraining) is a neural network model developed
+by OpenAI. It was trained on hundreds of millions of image–text pairs and maps both
+images and text into the **same 512-dimensional vector space**. This enables
+searching images by plain-text query without any manual labelling or tagging.
+
+| Property | Value |
+|---|---|
+| Architecture | Vision Transformer ViT-B/32 |
+| Output dimension | 512 floats |
+| Similarity metric | Cosine similarity |
+| Weights source | Hugging Face Hub: `sentence-transformers/clip-ViT-B-32` |
+| Downloaded to | `~/.cache/huggingface/hub/` on first run |
+
+**Why cosine similarity?** CLIP vectors have varying magnitudes. Cosine similarity
+normalises for magnitude and measures only the direction — the angle between two
+vectors — which reliably captures semantic relatedness regardless of vector scale.
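+
+A minimal sketch of that property in plain Python (no CLIP involved; the vectors
+are made-up three-dimensional examples, and `cosine_similarity` is an illustrative
+helper, not a function from this repository):
+
+```python
+import math
+
+
+def cosine_similarity(a, b):
+    """Return the cosine of the angle between two equal-length vectors."""
+    dot = sum(x * y for x, y in zip(a, b))
+    norm_a = math.sqrt(sum(x * x for x in a))
+    norm_b = math.sqrt(sum(y * y for y in b))
+    return dot / (norm_a * norm_b)
+
+
+v = [1.0, 2.0, 3.0]
+scaled = [10.0 * x for x in v]   # same direction, ten times the magnitude
+orthogonal = [3.0, 0.0, -1.0]    # dot product with v is zero
+
+same_direction = cosine_similarity(v, scaled)
+unrelated = cosine_similarity(v, orthogonal)
+print(round(same_direction, 6), round(unrelated, 6))
+```
+
+Scaling a vector leaves its score against the original at 1, while the
+perpendicular vector scores 0, which is why magnitude differences between
+embeddings do not disturb the ranking.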
+
+The `embedder.py` module is identical in both projects. It lazily loads the model
+on first call and exposes two functions:
+
+| Function | Input | Output |
+|---|---|---|
+| `embed_image(path)` | Filesystem path to a JPEG | `list[float]` — 512 values |
+| `embed_text(text)` | Plain-text query string | `list[float]` — 512 values |
+
+At search time, the text query is embedded into the same vector space as the photos.
+The database then finds the photos whose vectors point in the most similar direction.
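+
+The source of `embedder.py` is not reproduced in this README. A plausible minimal
+version, assuming the `SentenceTransformer("clip-ViT-B-32")` loader and its
+`encode()` method, might look like the sketch below. The names `get_model` and
+`_model` are assumptions made for illustration.
+
+```python
+_model = None  # populated on the first call to get_model()
+
+
+def get_model():
+    """Load the CLIP model once and reuse it afterwards."""
+    global _model
+    if _model is None:
+        # Imported lazily because pulling in torch and the weights is slow.
+        from sentence_transformers import SentenceTransformer
+        _model = SentenceTransformer("clip-ViT-B-32")
+    return _model
+
+
+def embed_image(path):
+    """Return the embedding of the JPEG at `path` as a list of floats."""
+    from PIL import Image
+    with Image.open(path) as img:
+        vector = get_model().encode(img.convert("RGB"))
+    return [float(x) for x in vector]
+
+
+def embed_text(text):
+    """Return the embedding of a plain-text query as a list of floats."""
+    return [float(x) for x in get_model().encode(text)]
+```
+
+Deferring the imports keeps module import cheap; the model weights are only
+read when the first image or query is actually embedded.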
+
+---
+
+## Database schemas
+
+### PostgreSQL + pgvector
+
+```sql
+-- database: vectors_demo  (PostgreSQL 16)
+CREATE EXTENSION vector;        -- pgvector 0.6.0
+
+CREATE TABLE images (
+    id        SERIAL PRIMARY KEY,
+    filename  TEXT NOT NULL UNIQUE,
+    filepath  TEXT NOT NULL,
+    embedding vector(512)        -- pgvector type, 512 dimensions
+);
+
+CREATE INDEX images_embedding_idx
+    ON images USING hnsw (embedding vector_cosine_ops);
+```
+
+### Oracle 26ai
+
+```sql
+-- PDB: FREEPDB1, user: vectors_user
+
+CREATE TABLE images (
+    id        NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
+    filename  VARCHAR2(255) NOT NULL UNIQUE,
+    filepath  VARCHAR2(1000) NOT NULL,
+    embedding VECTOR(512, FLOAT32)   -- native Oracle type, typed at definition
+);
+
+CREATE VECTOR INDEX images_embedding_idx
+    ON images(embedding)
+    ORGANIZATION INMEMORY NEIGHBOR GRAPH   -- HNSW (in-memory)
+    WITH DISTANCE COSINE
+    WITH TARGET ACCURACY 95
+    PARAMETERS (type HNSW, neighbors 32, efconstruction 200);
+```
+
+**Key schema differences:**
+
+| Aspect | PostgreSQL/pgvector | Oracle 26ai |
+|---|---|---|
+| Extension needed | `CREATE EXTENSION vector` | Built-in, no extension |
+| Vector column | `vector(512)` — dimension only | `VECTOR(512, FLOAT32)` — dimension + element type |
+| Primary key | `SERIAL` (auto-increment) | `NUMBER GENERATED ALWAYS AS IDENTITY` |
+| Text columns | `TEXT` (unlimited) | `VARCHAR2(n)` (length required) |
+| HNSW syntax | `USING hnsw (col vector_cosine_ops)` | `ORGANIZATION INMEMORY NEIGHBOR GRAPH` |
+| IVF syntax | `USING ivfflat (col vector_cosine_ops)` | `ORGANIZATION NEIGHBOR PARTITIONS` |
+| Accuracy target | Implicit (set via index params) | `WITH TARGET ACCURACY 95` (explicit %) |
+| Memory prereq | None | `vector_memory_size > 0` in SGA |
+
+---
+
+## Backend modules
+
+### Connection factories
+
+**`db.py` (PostgreSQL):**
+Reads `DB_HOST`, `DB_PORT`, `DB_NAME`, `DB_USER`, `DB_PASSWORD` from `.env` and
+returns a `psycopg2` connection.
+
+**`db_oracle.py` (Oracle):**
+Reads `ORA_HOST`, `ORA_PORT`, `ORA_SERVICE`, `ORA_USER`, `ORA_PASSWORD` from `.env`
+and returns an `oracledb` connection. The DSN is assembled as `host:port/service`.
+Runs in **thin mode** — no Oracle Instant Client installation is required on the host.
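+
+As a rough sketch of the Oracle factory (the function names are assumptions, and
+the environment is assumed to be already populated from `.env`, for example by
+`python-dotenv`), the DSN assembly and connection call could look like this:
+
+```python
+import os
+
+
+def build_dsn(env=os.environ):
+    """Assemble the host:port/service string described above."""
+    return f"{env['ORA_HOST']}:{env['ORA_PORT']}/{env['ORA_SERVICE']}"
+
+
+def get_connection(env=os.environ):
+    """Open a connection; python-oracledb uses thin mode by default."""
+    import oracledb  # imported here so build_dsn() works without the driver
+    return oracledb.connect(
+        user=env["ORA_USER"],
+        password=env["ORA_PASSWORD"],
+        dsn=build_dsn(env),
+    )
+
+
+# Illustrative values: the host name is an assumption, port and service
+# are the ones listed for the container in this README.
+example = {"ORA_HOST": "localhost", "ORA_PORT": "37611", "ORA_SERVICE": "FREEPDB1"}
+print(build_dsn(example))  # localhost:37611/FREEPDB1
+```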
+
+---
+
+### Indexing scripts
+
+Both scripts are idempotent: they check for existing rows and skip already-indexed
+photos. Each photo is committed individually so a crash does not lose prior work.
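+
+The skip-and-commit pattern can be sketched with an in-memory `sqlite3` database
+standing in for PostgreSQL or Oracle. The table layout, the `index_photos` helper,
+and the fake embedding function are simplifications for illustration and are not
+taken from the actual scripts.
+
+```python
+import sqlite3
+
+
+def index_photos(conn, filenames, embed):
+    """Insert one row per not-yet-indexed photo; return how many were added."""
+    added = 0
+    for name in filenames:
+        exists = conn.execute(
+            "SELECT 1 FROM images WHERE filename = ?", (name,)
+        ).fetchone()
+        if exists:
+            continue  # already indexed, nothing to do
+        conn.execute(
+            "INSERT INTO images (filename, embedding) VALUES (?, ?)",
+            (name, repr(embed(name))),
+        )
+        conn.commit()  # one commit per photo, so a crash keeps earlier rows
+        added += 1
+    return added
+
+
+conn = sqlite3.connect(":memory:")
+conn.execute("CREATE TABLE images (filename TEXT PRIMARY KEY, embedding TEXT)")
+
+
+def fake_embed(name):
+    return [0.0] * 512  # placeholder for the CLIP embedding
+
+
+first_run = index_photos(conn, ["a.jpg", "b.jpg"], fake_embed)
+second_run = index_photos(conn, ["a.jpg", "b.jpg", "c.jpg"], fake_embed)
+print(first_run, second_run)
+```
+
+Running the function a second time only adds the photo that was not present
+before, which is the behaviour that makes re-indexing safe to repeat.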
+
+| | `index_images.py` | `index_images_oracle.py` |
+|---|---|---|
+| Run command | `python3 index_images.py` | `python3 index_images_oracle.py` |
+| Vector bind | Python `list` passed directly | `array.array("f", embedding)` required |
+| Bind style | `%s` placeholders (psycopg2) | `:1`, `:2`, `:3` positional (oracledb) |
+| Runtime (115 photos, CPU) | **26 seconds** | **16 seconds** |
+
+**Why `array.array` for Oracle?**
+The `python-oracledb` driver does not accept a plain Python list for a `VECTOR`
+column. The data must be a Python `array.array` with typecode `"f"` (32-bit float),
+matching the `FLOAT32` declaration in the Oracle column type.
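+
+A short sketch of that conversion (the helper name `to_oracle_vector` is an
+assumption; the commented `execute` call shows how such a value might be bound
+with the positional style listed in the table above):
+
+```python
+import array
+
+
+def to_oracle_vector(embedding):
+    """Pack a list of Python floats into 32-bit floats for a FLOAT32 column."""
+    return array.array("f", embedding)
+
+
+vec = to_oracle_vector([0.25, -0.5, 1.0])
+print(vec.typecode, len(vec), list(vec))
+
+# Hypothetical bind, assuming an open python-oracledb cursor:
+# cursor.execute(
+#     "INSERT INTO images (filename, filepath, embedding) VALUES (:1, :2, :3)",
+#     [filename, filepath, vec],
+# )
+```
+
+Python floats are 64-bit, so the conversion narrows each value to 32 bits;
+values that are not exactly representable in `FLOAT32` are rounded.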
+
+---
+
+### FastAPI applications
+
+Both Python-embedding apps expose identical endpoints on different ports:
+
+| Endpoint | Description |
+|---|---|
+| `GET /search?q=<text>&limit=<n>` | Embed query, run nearest-neighbour search, return ranked results |
+| `GET /stats` | Return count of indexed photos |
+| `GET /photos/<filename>` | Serve original JPEG from the photos directory |
+
+**Search query comparison:**
+
+PostgreSQL (`main.py`, port 8000):
+```sql
+SELECT filename, 1 - (embedding <=> $1::vector) AS score
+FROM images
+ORDER BY embedding <=> $1::vector
+LIMIT $2
+```
+
+Oracle 26ai (`main_oracle.py`, port 8001):
+```sql
+SELECT filename,
+       1 - VECTOR_DISTANCE(embedding, :vec, COSINE) AS score
+FROM images
+ORDER BY VECTOR_DISTANCE(embedding, :vec, COSINE)
+FETCH FIRST :lim ROWS ONLY
+```
+
+**Key query differences:**
+
+| Aspect | PostgreSQL/pgvector | Oracle 26ai |
+|---|---|---|
+| Distance operator | `<=>` (cosine distance operator) | `VECTOR_DISTANCE(col, val, COSINE)` |
+| Cast required | `$1::vector` — explicit cast | No cast, column type is enforced |
+| Top-N clause | `LIMIT n` | `FETCH FIRST n ROWS ONLY` |
+| Bind style | `%s` / `%(name)s` placeholders in psycopg2 (written as `$1`, `$2` above) | `:name` named binds (dict) |
+| Repeated param | A named `%(name)s` can appear multiple times; each positional `%s` needs its own value | Same `:name` can appear multiple times; positional `:1` cannot be reused |
+| Score formula | `1 - (embedding <=> val)` | `1 - VECTOR_DISTANCE(...)` |
+
+In both cases `1 − distance` converts cosine distance (0 = identical) into a
+similarity score (1.0 = identical), displayed as a percentage in the frontend.
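+
+The conversion itself is a one-liner. The sketch below uses a hypothetical helper
+name, and the rounding to one decimal place is an assumption about presentation:
+
+```python
+def similarity_percent(distance):
+    """Convert a cosine distance into a percentage similarity score."""
+    return round((1.0 - distance) * 100.0, 1)
+
+
+for distance in (0.0, 0.25, 1.0):
+    print(distance, "->", similarity_percent(distance))
+```
+
+Cosine distance can exceed 1 for vectors pointing in opposing directions, in
+which case this score becomes negative; whether the frontend clamps such
+values is not stated in this README.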
+
+---
+
+## Frontend
+
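+The frontends are single HTML files with no build step; open them directly in a
+browser. They share the same layout and differ in badge label and API base URL.
+
+| | pgvector frontend | Oracle 26ai frontend | Oracle in-database frontend |
+|---|---|---|---|
+| File | `pgvector-demo/frontend/index.html` | `oravector-demo/frontend/index.html` | `oravector-demo/frontend/index_indb.html` |
+| Badge label | pgvector | Oracle 26ai | Oracle In-DB |
+| API base URL | `http://localhost:8000` | `http://localhost:8001` | `http://localhost:8002` |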
+
+Features: search box, Enter-key support, suggestion chips (trees, water, people,
+buildings, sky, street, night, cars), result grid with thumbnails and similarity
+scores in percent.
+
+---
+
+## Running the applications
+
+**Start PostgreSQL backend** (Python embedding):
+```bash
+cd pgvector-demo/backend
+uvicorn main:app --host 0.0.0.0 --port 8000
+```
+
+**Start Oracle backend — Python embedding:**
+```bash
+cd oravector-demo/backend
+uvicorn main_oracle:app --host 0.0.0.0 --port 8001
+```
+
+**Start Oracle backend — in-database embedding:**
+```bash
+cd oravector-demo/backend
+uvicorn main_oracle_indb:app --host 0.0.0.0 --port 8002
+```
+
+Open the matching `frontend/index.html` (ports 8000/8001) or
+`frontend/index_indb.html` (port 8002) in a browser. All three can run
+simultaneously.
+
+**Re-index after adding photos:**
+```bash
+# PostgreSQL
+cd pgvector-demo/backend && python3 index_images.py
+
+# Oracle (Python embedding)
+cd oravector-demo/backend && python3 index_images_oracle.py
+
+# Oracle in-database: re-indexing is done in SQL directly
+# (the VECTOR schema's FOTO_VEKTOR table is managed by Oracle)
+```
+
+---
+
+## Oracle in-database embedding
+
+The `VECTOR` schema, its ONNX models, and the `FOTO_VEKTOR` table were manually
+set up by the administrator — they are **not** part of a standard Oracle 26ai
+installation. The setup involved:
+
+1. Creating a `VECTOR` database user
+2. Exporting CLIP (ViT-B/32) to ONNX format and loading the models via
+   `DBMS_VECTOR.LOAD_ONNX_MODEL`
+3. Creating and populating the `FOTO_VEKTOR` table with images and their vectors
+
+The resulting models and table are:
+
+| Object | Type | Input | Output | Purpose |
+|---|---|---|---|---|
+| `VECTOR.CLIP_TXT` | ONNX model | `VARCHAR2` text | `VECTOR(512)` | Embed text queries |
+| `VECTOR.CLIP_IMG` | ONNX model | `BLOB` image | `VECTOR(512)` | Embed image data |
+| `VECTOR.FOTO_VEKTOR` | Table | — | — | Stores filenames, image BLOBs, and vectors |
+
+The two models are invoked through the `VECTOR_EMBEDDING()` SQL function. The
+table `VECTOR.FOTO_VEKTOR` stores images as BLOBs alongside their
+CLIP_IMG-computed embeddings.
+
+**The complete in-database search query:**
+```sql
+SELECT filename,
+       1 - VECTOR_DISTANCE(
+               foto_vek,
+               VECTOR_EMBEDDING(CLIP_TXT USING :q AS data),
+               COSINE
+           ) AS score
+FROM VECTOR.FOTO_VEKTOR
+ORDER BY VECTOR_DISTANCE(
+             foto_vek,
+             VECTOR_EMBEDDING(CLIP_TXT USING :q AS data),
+             COSINE
+         )
+FETCH FIRST 12 ROWS ONLY
+```
+
+The Python FastAPI backend (`main_oracle_indb.py`) passes only the raw text string
+to Oracle via a bind variable `:q`. Oracle tokenizes the text, runs the CLIP_TXT
+ONNX model internally, produces the 512-dim vector, and performs the similarity
+search — all within one SQL statement. No Python ML library is involved at
+query time.
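+
+On the Python side this could reduce to a single `execute` call. The sketch below
+is an assumption about how `main_oracle_indb.py` might be structured, not its
+actual code; `search` and `StubCursor` are illustrative names, and the stub only
+stands in for a real python-oracledb cursor so the example runs without a database.
+
+```python
+SEARCH_SQL = """
+SELECT filename,
+       1 - VECTOR_DISTANCE(
+               foto_vek,
+               VECTOR_EMBEDDING(CLIP_TXT USING :q AS data),
+               COSINE
+           ) AS score
+FROM VECTOR.FOTO_VEKTOR
+ORDER BY VECTOR_DISTANCE(
+             foto_vek,
+             VECTOR_EMBEDDING(CLIP_TXT USING :q AS data),
+             COSINE
+         )
+FETCH FIRST 12 ROWS ONLY
+"""
+
+
+def search(cursor, query_text):
+    """Send only the raw text; the database embeds it and ranks the rows."""
+    cursor.execute(SEARCH_SQL, {"q": query_text})  # :q is bound once by name
+    return [{"filename": f, "score": float(s)} for f, s in cursor.fetchall()]
+
+
+class StubCursor:
+    """Stand-in for a database cursor, returning made-up rows."""
+
+    def execute(self, sql, params):
+        self.sql, self.params = sql, params
+
+    def fetchall(self):
+        return [("example_1.jpg", 0.31), ("example_2.jpg", 0.27)]
+
+
+stub = StubCursor()
+results = search(stub, "trees")
+print(stub.params, results[0]["filename"])
+```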
+
+**Why the standard CLIP export cannot be loaded as-is:**
+Oracle's `DBMS_VECTOR.LOAD_ONNX_MODEL` requires the model's ONNX graph to use
+`input_ids` in a single `Gather` node (embedding lookup only). CLIP's standard
+export additionally feeds `input_ids` into an `ArgMax` node for EOS-token pooling,
+which Oracle's validator rejects. The manually loaded CLIP_TXT model in the
+`VECTOR` schema uses CLS-token pooling (position 0) instead, which produces a
+simpler graph that Oracle accepts. The cosine similarity between the EOS-pooling
+and CLS-pooling variants is ~0.70, so the in-database text vectors are not
+identical to those produced by the Python backends.
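+
+The difference between the two pooling strategies can be illustrated in plain
+Python. This is a toy sketch: the hidden states are made-up numbers, the helper
+names are assumptions, and the token ids are only meant to mimic CLIP's
+convention that the end-of-text token has the highest id in the vocabulary.
+
+```python
+def eos_pool(hidden_states, input_ids):
+    """Pick the hidden state at the first position holding the largest token id."""
+    eos_position = max(range(len(input_ids)), key=input_ids.__getitem__)  # argmax
+    return hidden_states[eos_position]
+
+
+def cls_pool(hidden_states, input_ids):
+    """Pick the hidden state at position 0; input_ids is not consulted."""
+    return hidden_states[0]
+
+
+# Illustrative sequence: start token, two word tokens, end token, padding.
+input_ids = [49406, 1000, 2000, 49407, 49407, 49407]
+hidden_states = [[float(i)] * 4 for i in range(len(input_ids))]
+
+print(eos_pool(hidden_states, input_ids))  # state at position 3
+print(cls_pool(hidden_states, input_ids))  # state at position 0
+```
+
+Only `eos_pool` reads `input_ids` a second time, which corresponds to the extra
+`ArgMax` use of that input described above.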
+
+---
+
+## Performance comparison
+
+Measured on this installation (CPU only, no GPU):
+
+| Metric | PostgreSQL + pgvector | Oracle 26ai (Python embed) | Oracle 26ai (in-DB embed) |
+|---|---|---|---|
+| Photos indexed | 115 | 115 | 116 (manually indexed) |
+| Indexing time | 26 seconds | 16 seconds | Not measured (indexed separately by admin) |
+| Index type | HNSW (on disk) | HNSW (in-memory) | Full table scan (116 rows) |
+| Memory required | None | 512 MB SGA | 512 MB SGA |
+| Python CLIP at query time | Yes | Yes | **No** |
+| Embedding location | Python process | Python process | Inside Oracle SQL |
+| `VECTOR_EMBEDDING()` used | No | No | **Yes** |
+
+Note: indexing time for backends 1 and 2 is dominated by CLIP inference (CPU),
+not database write speed. The in-database backend uses the manually loaded CLIP
+models in the `VECTOR` schema; their indexing time is not measured here as it
+was performed separately by the administrator.