# MySQL + IMDB Dataset Guide
The optional machine-learning workflow enriches recommendations with IMDB metadata (genres, ratings, cast). Rather than embedding the TSV dumps directly in the repository, EMBRACE relies on a local MySQL instance that you control. The helper scripts under `scripts/mysql/` orchestrate the full lifecycle: provisioning the container, creating the tables, and loading the TSV data.
## When you need MySQL
You only need this stack if you plan to run `PreProcess.imdb_get_encoded_genres()` or any downstream task that requires one-hot encoded IMDB genres. The preprocessing step queries MySQL for titles and genres, aggregates them, and writes a cached Parquet/PKL file under `data/cache/` for reuse.
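The sketch below mirrors that query-aggregate-cache flow. The table and column names (`titles`, `genres`, `tconst`, `genre`) and the credentials are assumptions for illustration; the authoritative version is the `PreProcess` method itself.

```python
# Hedged sketch of the query -> one-hot -> cache flow; the table/column
# names and credentials are assumptions, not the project's confirmed schema.
import mysql.connector
import pandas as pd

conn = mysql.connector.connect(
    host="127.0.0.1", user="embrace", password="change-me", database="imdb"
)
# One row per (title, genre) pair.
pairs = pd.read_sql(
    "SELECT t.tconst, g.genre FROM titles t JOIN genres g ON g.tconst = t.tconst",
    conn,
)
conn.close()

# Pivot to a binary title x genre matrix and cache it for reuse.
ohe = pd.crosstab(pairs["tconst"], pairs["genre"]).clip(upper=1)
ohe.to_parquet("data/cache/imdb_genres_ohe.parquet")
```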
## Directory layout
```
scripts/mysql/
├── Dockerfile
├── docker-compose.yml
├── .env.example
├── README.md
├── imdb_create-schema.py
├── imdb_load-from-tsv.py
├── imdb_reset.py
└── init_scripts/
    └── set_usr_perms.sh
```
`Dockerfile` builds a thin wrapper around the official MySQL image. `docker-compose.yml` wires up storage, the `.env` credentials, and a bind mount that exposes the raw TSV dumps at `/import` inside the container. `imdb_create-schema.py` and `imdb_load-from-tsv.py` are the two primary scripts for first-time setup. `imdb_reset.py` truncates all tables so you can perform a clean reload without dropping the database.
## Prerequisites
- Download the IMDB TSV dumps from https://datasets.imdbws.com/ and extract them somewhere on your host machine (or script the download; see the sketch after this list).
- Set up credentials by copying both `.env` templates and filling in strong passwords:
  ```sh
  cp .env.example .env
  cp scripts/mysql/.env.example scripts/mysql/.env
  ```
- Export `IMDB_DATA_DIR` before starting Docker Compose so the TSV directory mounts at `/import` inside the container:
  ```sh
  export IMDB_DATA_DIR=/absolute/path/to/imdb-tsv
  ```
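If you would rather script the first prerequisite than click through a browser, here is a small sketch. The archive names are IMDB's published dataset file names, and the destination path is simply the directory you export as `IMDB_DATA_DIR`:

```python
# Fetch and decompress a subset of the IMDB dumps into the TSV directory.
import gzip
import shutil
import urllib.request
from pathlib import Path

DUMPS = ["title.basics.tsv.gz", "title.ratings.tsv.gz", "name.basics.tsv.gz"]
dest = Path("/absolute/path/to/imdb-tsv")  # becomes IMDB_DATA_DIR above
dest.mkdir(parents=True, exist_ok=True)

for name in DUMPS:
    archive = dest / name
    urllib.request.urlretrieve(f"https://datasets.imdbws.com/{name}", archive)
    with gzip.open(archive, "rb") as src, open(dest / name[:-3], "wb") as out:
        shutil.copyfileobj(src, out)  # title.ratings.tsv.gz -> title.ratings.tsv
```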
## Bootstrapping the database
- Build and start MySQL

  ```sh
  cd scripts/mysql
  docker compose up -d --build
  ```

  The compose file sets `--secure-file-priv=/import` and `--local-infile=ON`, allowing the loader scripts to stream TSVs directly into staging tables. A quick way to confirm both flags took effect is shown in the sketch after this list.
- Grant file permissions (first run only)

  ```sh
  ./init_scripts/set_usr_perms.sh
  ```

  This wrapper runs `GRANT FILE` for the application user using the root credentials you supplied.
- Create the schema

  ```sh
  python3 imdb_create-schema.py
  ```

  The script connects via `MySQLConnector`, ensures the database exists, and creates the target tables (`titles`, `genres`, `directors`, `writers`, `episodes`, `roles`, `ratings`, `persons`) along with the staging tables used during data loads.
- Load TSV data

  ```sh
  python3 imdb_load-from-tsv.py
  ```

  Each block wraps a table load in a timing context, bulk-imports the TSV into a staging table (where required), and then inserts the normalized rows into the destination table; the sketch after this list illustrates the pattern. Expect this step to take several minutes depending on disk speed.
- Verify row counts

  After the loader finishes, you can run quick sanity checks:

  ```sh
  docker compose exec db mysql -u${MYSQL_USER} -p${MYSQL_PASSWORD} ${MYSQL_DATABASE} \
    -e "SELECT TABLE_NAME, TABLE_ROWS FROM information_schema.tables WHERE TABLE_SCHEMA='${MYSQL_DATABASE}' ORDER BY TABLE_ROWS DESC;"
  ```
- Shut down (optional)

  ```sh
  docker compose down
  ```
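The sketch below ties the steps above together: first a check, assuming `mysql-connector-python` and placeholder credentials, that the two server flags are active, then the staging-table pattern that `imdb_load-from-tsv.py` applies per table. The staging and destination names (`stage_ratings`, `ratings`) and their columns are illustrative assumptions, not the project's confirmed schema.

```python
# Sketch only: verify the server flags, then replay the per-table load pattern.
# Credentials, table names, and column names are assumptions.
import time

import mysql.connector

conn = mysql.connector.connect(
    host="127.0.0.1", user="embrace", password="change-me", database="imdb"
)
cur = conn.cursor()

# 1. Confirm the flags set by docker-compose.yml.
for var in ("secure_file_priv", "local_infile"):
    cur.execute("SHOW VARIABLES LIKE %s", (var,))
    print(cur.fetchone())  # expect ('secure_file_priv', '/import'), ('local_infile', 'ON')

# 2. Per-table pattern: time the load, bulk-import into staging, then normalize.
start = time.perf_counter()
cur.execute("""
    LOAD DATA INFILE '/import/title.ratings.tsv'
    INTO TABLE stage_ratings
    FIELDS TERMINATED BY '\\t'
    IGNORE 1 LINES
""")  # server-side read; /import is the directory whitelisted by --secure-file-priv
cur.execute("""
    INSERT INTO ratings (tconst, average_rating, num_votes)
    SELECT tconst, averageRating, numVotes FROM stage_ratings
""")
conn.commit()
print(f"ratings loaded in {time.perf_counter() - start:.1f}s")
conn.close()
```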
## Maintenance workflow
- Refresh data: rerun `imdb_load-from-tsv.py` after downloading new IMDB dumps. Run `imdb_reset.py` beforehand if you want to truncate the tables instead of appending.
- Cache management: delete or refresh `data/cache/imdb_genres_ohe.parquet` if you need to rebuild the encoded dataset after a reload.
- Connection reuse: the Python code instantiates `MySQLConnector` with `scripts/mysql/.env` to discover credentials (see the sketch below), so keep that file updated if hosts or ports change.
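For credential discovery, here is a hypothetical illustration of the pattern, assuming `python-dotenv` and the variable names used in the verification step above (`MYSQL_USER`, `MYSQL_PASSWORD`, `MYSQL_DATABASE`); the project's `MySQLConnector` may well differ:

```python
# Hypothetical credential discovery from scripts/mysql/.env; shown only to
# illustrate the pattern, not the actual MySQLConnector implementation.
import os

import mysql.connector
from dotenv import load_dotenv

load_dotenv("scripts/mysql/.env")
conn = mysql.connector.connect(
    host=os.getenv("MYSQL_HOST", "127.0.0.1"),  # MYSQL_HOST is an assumed variable
    user=os.getenv("MYSQL_USER"),
    password=os.getenv("MYSQL_PASSWORD"),
    database=os.getenv("MYSQL_DATABASE"),
)
conn.close()
```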
With these pieces in place, `PreProcess.imdb_get_encoded_genres()` can connect to MySQL, fetch the latest title/genre associations, and generate the sparse one-hot vectors used by the k-NN example in `src/main.py`.
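To close the loop, here is a minimal sketch, assuming `scikit-learn` and the cached Parquet file from the preprocessing step, of how the sparse one-hot vectors can feed a k-NN lookup; the authoritative example is `src/main.py` itself:

```python
# k-NN over one-hot genre vectors; a sketch, not the code in src/main.py.
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

ohe = pd.read_parquet("data/cache/imdb_genres_ohe.parquet")  # titles x genres
X = csr_matrix(ohe.to_numpy())  # sparse one-hot matrix

knn = NearestNeighbors(n_neighbors=6, metric="cosine").fit(X)
_, indices = knn.kneighbors(X[:1])  # query with the first title
print(ohe.index[indices[0]])        # itself plus its 5 nearest neighbors
```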