Build a movie recommendation app with ScyllaDB

This tutorial shows you how to build a vector search application with ScyllaDB.

You’ll build a simple movie recommendation app that takes a text input from the user and performs vector search to recommend a movie to watch.

Source code is available on GitHub.

Prerequisites

Install Python requirements

Create and activate a new Python virtual environment (you can use virtualenv, Poetry, venv or any other environment management library):
```
virtualenv env && source env/bin/activate
```
Install requirements:
```
pip install scylla-driver pydantic sentence-transformers streamlit
```
This installs:
- ScyllaDB Python driver: to connect to and query ScyllaDB
- Pydantic: to validate data and handle objects
- Sentence Transformers: to create embeddings from text
- Streamlit: to build a simple UI

Set up ScyllaDB as a vector store

Create a new ScyllaDB Cloud instance with vector search enabled.

Create config.py and add your database connection details (host, port, username, password, and datacenter):

```
SCYLLADB_CONFIG = {
    "host": "node-0.aws-us-east-1.xxxxxxxxxxx.clusters.scylla.cloud",
    "port": 9042,  # integer port number
    "username": "scylla",
    "password": "passwd",
    "datacenter": "AWS_US_EAST_1"
}
```
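Hard-coded credentials are fine for a quick test, but you may prefer to keep them out of source control. The sketch below is one possible variation of config.py that reads the same five settings from environment variables. The variable names (`SCYLLADB_HOST` and so on), the fallback values, and the `load_config` helper are assumptions made for this example, not part of the tutorial's code.

```python
import os

def load_config(env=os.environ) -> dict:
    """Build the connection settings from environment variables.

    The variable names and defaults are hypothetical; adjust them to your setup.
    """
    return {
        "host": env.get("SCYLLADB_HOST", "localhost"),
        "port": int(env.get("SCYLLADB_PORT", "9042")),  # convert to an integer
        "username": env.get("SCYLLADB_USERNAME", "scylla"),
        "password": env.get("SCYLLADB_PASSWORD", ""),
        "datacenter": env.get("SCYLLADB_DATACENTER", "AWS_US_EAST_1"),
    }

SCYLLADB_CONFIG = load_config()
```

Passing the environment as a parameter keeps the function easy to exercise with a plain dictionary.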

Create a helper module called scylladb.py to insert data and query results from ScyllaDB:

```
from typing import Optional

from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
from cassandra.auth import PlainTextAuthProvider
from cassandra.query import dict_factory
import config

class ScyllaClient():
    def __init__(self, keyspace: Optional[str] = None):
        self.cluster = self._get_cluster(config.SCYLLADB_CONFIG)
        if keyspace:
            self.session = self.cluster.connect(keyspace)
        else:
            self.session = self.cluster.connect()

    # Context manager support: the cluster is shut down when the "with" block exits.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.shutdown()

    def shutdown(self):
        self.cluster.shutdown()

    def _get_cluster(self, config: dict) -> Cluster:
        # Route requests to replicas in the local datacenter and return rows as dicts.
        profile = ExecutionProfile(
            load_balancing_policy=TokenAwarePolicy(
                DCAwareRoundRobinPolicy(local_dc=config["datacenter"])
            ),
            row_factory=dict_factory,
        )
        return Cluster(
            execution_profiles={EXEC_PROFILE_DEFAULT: profile},
            contact_points=[config["host"]],
            port=int(config["port"]),
            auth_provider=PlainTextAuthProvider(username=config["username"],
                                                password=config["password"]))

    def print_metadata(self):
        for host in self.cluster.metadata.all_hosts():
            print(f"Datacenter: {host.datacenter}; Host: {host.address}; Rack: {host.rack}")

    def get_session(self):
        return self.session

    def insert_data(self, table, data: dict):
        # Table and column names are interpolated into the query text,
        # so they must come from trusted code. Values are bound as parameters.
        columns = list(data.keys())
        values = list(data.values())
        insert_query = f"""
        INSERT INTO {table} ({','.join(columns)})
        VALUES ({','.join(['%s'] * len(columns))});
        """
        self.session.execute(insert_query, values)

    def query_data(self, query, params=None):
        # Avoid a mutable default argument; None means "no parameters".
        rows = self.session.execute(query, params)
        return rows.all()
```
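To make the behaviour of `insert_data` concrete, the sketch below isolates the query-building step as a standalone function that needs no database connection. The name `build_insert_query` is hypothetical and not part of the tutorial's code; it only shows how the column names and `%s` placeholders are assembled while the values travel separately as parameters.

```python
def build_insert_query(table: str, data: dict) -> tuple[str, list]:
    """Hypothetical helper mirroring insert_data: return the CQL text and its parameters."""
    columns = list(data.keys())
    # One %s placeholder per column; the driver binds the values at execution time.
    placeholders = ", ".join(["%s"] * len(columns))
    query = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders});"
    return query, list(data.values())

query, params = build_insert_query("recommend.movies", {"id": 1, "title": "Example"})
print(query)
print(params)
```

Because the table and column names end up in the query text itself, they should only ever come from your own code, never from user input.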

Create schema.cql. This script creates a keyspace, a table for movies, and a vector index for similarity search in ScyllaDB:

```
CREATE KEYSPACE recommend WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': '3'}
AND TABLETS = {'enabled': 'false'};

CREATE TABLE recommend.movies (
    id INT,
    release_date TIMESTAMP,
    title TEXT,
    tagline TEXT,
    genre TEXT,
    imdb_id TEXT,
    poster_url TEXT,
    plot TEXT,
    plot_embedding VECTOR<FLOAT, 384>,
    PRIMARY KEY (id)
) WITH cdc = {'enabled': 'true'};

CREATE INDEX IF NOT EXISTS ann_index ON recommend.movies(plot_embedding)
USING 'vector_index'
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
```
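The index uses `DOT_PRODUCT` as its similarity function. For vectors of length 1, the dot product equals cosine similarity, so this choice fits well if your embedding model returns normalized vectors (an assumption worth verifying for the model you use). The pure-Python sketch below illustrates the relationship with made-up 2-dimensional vectors; it is independent of ScyllaDB.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up example vectors, normalized to unit length.
a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])

# For unit-length vectors the two measures agree.
print(math.isclose(dot(a, b), cosine(a, b)))
```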

Create the migrate.py script and run it:

```
import os
from scylladb import ScyllaClient

client = ScyllaClient()
session = client.get_session()

def absolute_file_path(relative_file_path):
    current_dir = os.path.dirname(os.path.abspath(__file__))
    return os.path.join(current_dir, relative_file_path)

print("Creating keyspace and tables...")
with open(absolute_file_path("schema.cql"), "r") as file:
    for query in file.read().split(";"):
        # Skip empty or whitespace-only chunks, such as the one after the last ";".
        if query.strip():
            session.execute(query)
print("Migration completed.")

client.shutdown()
```
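Splitting a script on `;` usually leaves a trailing chunk that contains only whitespace, and that chunk should not be sent to the database. The sketch below shows the splitting step in isolation, using a made-up two-statement script; `split_statements` is a hypothetical helper name.

```python
def split_statements(script: str) -> list[str]:
    """Split a CQL script on ';' and drop chunks that are empty or whitespace-only."""
    return [chunk.strip() for chunk in script.split(";") if chunk.strip()]

# Made-up script for illustration; note the blank line and the trailing newline.
example = (
    "CREATE KEYSPACE demo WITH replication = "
    "{'class': 'NetworkTopologyStrategy', 'replication_factor': '3'};\n"
    "\n"
    "CREATE TABLE demo.items (id INT PRIMARY KEY);\n"
)

statements = split_statements(example)
for statement in statements:
    print(statement)
```

This naive split is enough for a schema file like the one above, but it would break on a statement whose string literal contains a semicolon.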

Build the vector search module

In this step, you’ll build a simple Python module that finds similar movies based on the input text using ScyllaDB Vector Search.

ScyllaDB acts as both persistent storage for your embeddings and an efficient vector search engine.

Create a new Pydantic model in a new file called models.py:

```
from pydantic import BaseModel
from datetime import datetime
from typing import Optional

class Movie(BaseModel):
    id: int
    title: Optional[str] = None
    release_date: Optional[datetime] = None
    tagline: Optional[str] = None
    genre: Optional[str] = None
    poster_url: Optional[str] = None
    imdb_id: Optional[str] = None
    plot: Optional[str] = None
    plot_embedding: Optional[list[float]] = None
```

Create the text embedding module, embedding_creator.py:

```
from sentence_transformers import SentenceTransformer

class EmbeddingCreator:
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        self.embedding_model = SentenceTransformer(model_name, device='cpu')

    def create_embedding(self, text: str) -> list[float]:
        """
        Get an embedding for a single text input using SentenceTransformer.
        Returns the embedding vector.
        """
        return self.embedding_model.encode(text).tolist()
```
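The `plot_embedding` column is declared as `VECTOR<FLOAT, 384>`, so every embedding written to it needs exactly 384 elements. The sketch below shows one way to check this before inserting. `check_embedding` and `EXPECTED_DIM` are hypothetical names, and the vector used here is a placeholder rather than real model output.

```python
EXPECTED_DIM = 384  # must match VECTOR<FLOAT, 384> in schema.cql

def check_embedding(embedding: list[float], expected_dim: int = EXPECTED_DIM) -> list[float]:
    """Return the embedding unchanged, or raise if its length does not match the schema."""
    if len(embedding) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(embedding)}")
    return embedding

# Placeholder vector standing in for the output of create_embedding().
placeholder = [0.0] * 384
check_embedding(placeholder)
```

A check like this is most useful if you swap in a different embedding model, because the vector size depends on the model.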

Create a module that recommends similar movies using vector search. Call it recommender.py:

```
from scylladb import ScyllaClient
from embedding_creator import EmbeddingCreator
from models import Movie

class MovieRecommender:
    def __init__(self):
        self.scylla_client = ScyllaClient()
        self.embedding_creator = EmbeddingCreator("all-MiniLM-L6-v2")

    def similar_movies(self, user_query: str, top_k=5) -> list[Movie]:
        # Embed the user's text with the same model used for the stored plots.
        user_query_embedding = self.embedding_creator.create_embedding(user_query)
        db_query = """SELECT *
                    FROM recommend.movies
                    ORDER BY plot_embedding ANN OF %s LIMIT %s;
                """
        values = [user_query_embedding, top_k]
        # Reuse the client created in __init__ instead of opening a new connection per query.
        results = self.scylla_client.query_data(db_query, values)
        return [Movie(**row) for row in results]
```
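Conceptually, `ORDER BY plot_embedding ANN OF ... LIMIT ...` returns the rows whose vectors are most similar to the query vector, using an index so that it does not have to compare against every row. The sketch below shows the exact, brute-force version of that idea in plain Python with made-up 2-dimensional vectors. It illustrates the ranking only and makes no claim about how ScyllaDB implements its index; `top_k_by_dot_product` is a hypothetical name.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k_by_dot_product(query: list[float], items: dict[str, list[float]], k: int) -> list[str]:
    """Rank every item by its dot product with the query and return the k best names."""
    ranked = sorted(items, key=lambda name: dot(query, items[name]), reverse=True)
    return ranked[:k]

# Made-up vectors; real plot embeddings have 384 dimensions.
movies = {
    "space": [1.0, 0.0],
    "romance": [0.0, 1.0],
    "space-romance": [0.6, 0.8],
}

print(top_k_by_dot_product([1.0, 0.0], movies, k=2))
```

The brute-force version scans every vector on each query, which is the cost a vector index is meant to avoid on large tables.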

Build the Streamlit UI

At this point, you have all the building blocks needed to power a movie recommendation app. Let’s build a user interface with Streamlit:

Create app.py:

```
import streamlit as st
from recommender import MovieRecommender
from models import Movie

recommender = MovieRecommender()

st.set_page_config(
    page_title="Movie Recommender",
    page_icon="🎬",
    layout="wide"
)

# Header
st.title("🎬 Movie recommendation")
st.subheader("ScyllaDB Vector Search DEMO")
st.markdown("Source code: https://github.com/scylladb/vector-search-examples/tree/main/movie-recommendation")

# Input area
col1, col2 = st.columns([3, 1])
with col1:
    user_query = st.text_input("What kind of movie are you looking for?", placeholder="e.g. time travelling")
with col2:
    top_k = st.number_input("Number of recommendations", min_value=3, max_value=15, value=4, step=1)

search_button = st.button("Get Recommendations", width="stretch")

def show_poster(poster: str) -> None:
    # Render the poster image, or a caption when the movie has no poster path.
    if poster:
        base_url = "https://image.tmdb.org/t/p/original"
        url = f"{base_url}{poster}"
        st.image(url, width="content")
    else:
        st.caption("Poster not found")

def display_best_match(best_match: Movie):
    movie_poster = best_match.poster_url
    col1, col2 = st.columns([1, 2])
    with col1:
        show_poster(movie_poster)
    with col2:
        st.markdown(f"### {best_match.title}")
        # The plot is optional in the model, so guard against None.
        plot = best_match.plot or ""
        st.write(plot[:500] + ("..." if len(plot) > 500 else ""))

def display_more_recommendations(movies: list[Movie]):
    # The caller already removed the best match, so show every movie passed in.
    cols = st.columns(3)
    for i, movie in enumerate(movies):
        with cols[i % 3]:
            poster = movie.poster_url
            show_poster(poster)
            st.write(movie.title)


def display_search_results():
    with st.spinner("🔍 Searching for recommendations..."):
        movies = recommender.similar_movies(user_query, top_k)
        if movies:
            st.subheader("⭐ Best Match")
            best_match = movies[0]
            display_best_match(best_match)
            st.divider()

            st.subheader("🎥 More Recommendations")
            rest_of_the_movies = movies[1:]
            display_more_recommendations(rest_of_the_movies)
        else:
            st.error("❌ No similar movies found.")

if search_button:
    if not user_query:
        st.warning("⚠️ Please enter a movie to get recommendations.")
    else:
        try:
            display_search_results()
        except Exception as e:
            st.error(f"⚠️ Error: {str(e)}")
```

Run the Streamlit app:

```
streamlit run app.py
```

Insert sample data

Now that ScyllaDB is properly set up and your vector search module and Streamlit app are running smoothly, let’s insert some sample data (100k movies from this dataset) so you can start exploring your app.

Download the sample CSV file from GitHub:

```
wget https://github.com/scylladb/vector-search-examples/raw/refs/heads/main/movie-recommendation/data/movies_sample.csv
```

Create a new file called ingest.py:

```
import csv
from datetime import datetime
from scylladb import ScyllaClient
from embedding_creator import EmbeddingCreator

class MovieLoader:
    def __init__(self):
        self.scylla_client = ScyllaClient()
        self.embedding_creator = EmbeddingCreator("all-MiniLM-L6-v2")

    def create_embedding(self, text: str) -> list[float]:
        return self.embedding_creator.create_embedding(text)

    def ingest_csv(self, csv_file, table_name):
        # Reuse the client created in __init__; it is shut down when the block exits.
        with self.scylla_client as client:
            with open(csv_file, encoding="utf-8") as f:
                reader = csv.DictReader(f)
                for row in reader:
                    # Map CSV column names to the table's column names.
                    data = {
                        "id": int(row["id"]),
                        "release_date": datetime.strptime(row["release_date"], "%Y-%m-%d"),
                        "title": row["title"],
                        "tagline": row["tagline"],
                        "genre": row["genres"],
                        "poster_url": row["poster_path"],
                        "imdb_id": row["imdb_id"],
                        "plot": row["overview"],
                        "plot_embedding": self.create_embedding(row["overview"]),
                    }
                    client.insert_data(table_name, data)


if __name__ == "__main__":
    CSV_FILE = "movies_sample.csv"
    loader = MovieLoader()
    print("⏳ Ingestion started...")
    loader.ingest_csv(CSV_FILE, "recommend.movies")
    print(f"✅ Finished ingesting {CSV_FILE}")
```
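CSV files can contain rows with missing fields, and `datetime.strptime` raises `ValueError` on an empty string. One option is to convert empty values to `None` before building the row. The sketch below works under that assumption: `parse_release_date` is a hypothetical helper, and the inline CSV is made-up data that reuses a few of the column names from the loader.

```python
import csv
import io
from datetime import datetime
from typing import Optional

def parse_release_date(value: str) -> Optional[datetime]:
    """Parse a YYYY-MM-DD date, returning None for an empty field."""
    value = value.strip()
    return datetime.strptime(value, "%Y-%m-%d") if value else None

# Made-up CSV content; the second row has no release date.
sample = io.StringIO("id,release_date,title\n1,1999-03-31,First\n2,,Second\n")

rows = [
    {
        "id": int(record["id"]),
        "release_date": parse_release_date(record["release_date"]),
        "title": record["title"],
    }
    for record in csv.DictReader(sample)
]

for row in rows:
    print(row)
```

A `None` value fits the `Movie` model, where `release_date` is declared as optional.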

Run the script to populate the database with movies:

```
python ingest.py
```

While the script runs, it prints:

```
⏳ Ingestion started...
```

The complete application is available on GitHub.
