Manga recommender 01: Infrastructure and scraping

What do I want to say in this blog post?

This is one part in a series
I want to show the work I've done
Context: how I initially wanted to design the scraper and where it ended up.
Goals for the project
- Want it to be hostable off of a single process (embedded duckdb)
Share learnings about scraping, duckdb, ducklake, postgres that are probably obvious but were not clear until I was actually in the weeds.
- Scraping the Jikan API (easier than MAL, but some annoying issues!)
- MERGE INTO can be debilitatingly slow.
- My stupid constraints left me without FOR UPDATE SKIP LOCKED
- No data inlining on the duckdb version I was running -> way too many files
- Part of me wishes I just stuck with plain old postgres.
- Assumptions really go out the window once you have a significant amount of lag (e.g. multi-hop transaction trying to get user off the queue)

This is part 1 in a series (hopefully!) of posts about exploring recommendation algorithms. I sometimes get into manga-reading phases and haven't been satisfied with the recommendations I've been able to find. My objectives are to learn about recommendation algorithms, and to have something deployed and released to the world.

This first part is going to talk about the data.

Existing datasets

There are some existing datasets out there, but for one reason or another, they didn't fit the requirements I had in mind.

Truong-Binh Duong's Manga & Anime Dataset 2024:

Not manga-specific
The manga data contained is lacking:
- Only the top 10,000 entries
- No semantic data, like reviews, synopses, etc.

Andreu Vall Hernandez's MyAnimeList Anime and Manga Datasets:

Also not manga-specific
The manga data is similarly high-level

In general, I wanted the data to have a significant amount of text, in the form of synopses, reviews, recommendations, etc. My thought is that this will provide an avenue to use embeddings for recommendations. I also wanted to make use of character data, as my favorite part of a story is often the characters.

MyAnimeList and Jikan

MyAnimeList is a website with tons of data on anime and manga. It contains not only "factual" data (like titles, authors, publication range, publishers, etc.) but is also a hub where enthusiasts post reviews and share recommendations. I think it's a great starting point for my project. It's not very easy to scrape the site though -- the public API is very lacklustre, and getting an API key involves an approval process. Much of the data is rendered server-side before being sent to the client, so there are few endpoints with convenient JSON responses to parse.

Enter Jikan. This is an unofficial API that mirrors the data on MAL with a more developer-friendly interface. I tried briefly to self-host the indexer but couldn't get it working, so I figured I'd just scrape Jikan directly. Adhering to the rate limits (~1 rps) has been difficult at times.
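
Staying under a limit like that mostly comes down to spacing requests out. Here is a minimal sketch in Python of one way to do it, assuming a fixed minimum gap of one second between calls. The class name and its parameters are made up for illustration; this is not the scraper's actual code.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Blocks so that consecutive calls are at least `interval` seconds apart."""

    def __init__(
        self,
        interval: float = 1.0,  # assumption: ~1 request per second
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None  # None until the first call

    def wait(self) -> float:
        """Sleep until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        # Reserve the next slot relative to when this request goes out.
        self._next_allowed = now + self.interval
        return delay
```

Calling `limiter.wait()` right before every HTTP request keeps a single process under the limit. It does nothing to coordinate several processes sharing the same limit, and the clock and sleep functions are injectable only so the behaviour can be checked without really sleeping.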

Project constraints

Some constraints I had for this project are:

I didn't want to pay extra money for anything (e.g., paying for an RDS instance that I didn't already have!)
I want to be able to publish the data in a clean and simple way. I had recently read the Ducklake article about Frozen Ducklakes so this approach was top of mind.
I wanted development for "local" and "prod" to be as simple (parsimonious?) as possible -- a minimum of differences between "local" code and "prod" code.
The final deployed project needs to be dead simple. To me, that means as little infrastructure as possible. I'm going to try and embed everything into a single process if I can.

Initial architecture

Usually, there wouldn't be too much to think of in terms of architecture. Generate a list of targets, queue them up, and go through the queue until you're done. Put stuff in a database. Easy?

Well, yes, and this is more or less what I started with.
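
As a rough illustration of that loop, here is a minimal sketch in Python. It uses the standard-library `sqlite3` module purely as a stand-in for an embedded database, and the table names, column names, and the `fetch` callback are all hypothetical.

```python
import sqlite3
from typing import Callable, Iterable


def run_scrape(
    conn: sqlite3.Connection,
    targets: Iterable[int],
    fetch: Callable[[int], str],
) -> int:
    """Queue every target, then fetch and store them one at a time.

    Returns the number of targets processed in this run.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        "manga_id INTEGER PRIMARY KEY, done INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS manga ("
        "manga_id INTEGER PRIMARY KEY, payload TEXT NOT NULL)"
    )
    # Re-queueing a known target is a no-op, so restarts are safe.
    conn.executemany(
        "INSERT OR IGNORE INTO queue (manga_id) VALUES (?)",
        [(t,) for t in targets],
    )
    conn.commit()

    processed = 0
    while True:
        row = conn.execute(
            "SELECT manga_id FROM queue WHERE done = 0 ORDER BY manga_id LIMIT 1"
        ).fetchone()
        if row is None:
            break  # queue drained
        manga_id = row[0]
        payload = fetch(manga_id)  # e.g. one API call, returning raw JSON text
        conn.execute(
            "INSERT OR REPLACE INTO manga (manga_id, payload) VALUES (?, ?)",
            (manga_id, payload),
        )
        conn.execute("UPDATE queue SET done = 1 WHERE manga_id = ?", (manga_id,))
        conn.commit()  # store the result and mark it done together
        processed += 1
    return processed
```

Because the queue lives in the same database as the results, a crashed run can simply be started again and it picks up where it left off.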

I wrote a script in F# to do this, as I've been liking the language as of late. I used DuckDB for the database. For the API, I initially used Hawaii to generate an F# client but ran into some annoying issues. I ended up switching to SwaggerProvider and I found that the generated client fit my expectations a lot better. One issue I had with this type provider was that it was really slow to compile and bogged down my LSP. This was mostly alleviated by moving the client to a separate project.

This worked well, but I thought ... "What if I could double or triple the throughput? I have a few machines to use." In hindsight, I think the right move would have been to host a Postgres instance and call it a day, but I've been known to be distracted by new and shiny objects. And this is a project for me, so why not?

Multi-scraper ducklake

Ducklake is a data lake format that I've found pretty easy to set up. At its simplest, you don't even need a catalogue server (which has always turned me off from Iceberg). Unfortunately, a multi-writer setup needs coordination, and that necessitates a catalogue service. So I could have gone with Iceberg, but I really liked the experience with DuckDB and saw no reason to switch.

First, I set up a postgres server. I have access to an account with the UW Computer Science Club. I feel like this is the membership that keeps on giving. If you're a UW student I highly recommend getting a membership. The account does come with limitations, though: you don't have root access, and your home folder's storage space is capped at 4GB, although there is an NFS-mounted scratch space. I set up my postgres instance with some Nix and Tailscale shenanigans which might be better suited for a different article.

I already have an S3 bucket with Wasabi. I pay $7/month/TB of storage, and since I haven't hit the first terabyte, using this service doesn't incur any marginal costs.
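
For reference, pointing DuckDB at a setup like this comes down to a single ATTACH statement. Below is a minimal sketch in Python that builds it, assuming the `ducklake:postgres:` prefix and `DATA_PATH` option as described in the DuckLake documentation. The helper name, connection string, bucket path, and alias are placeholders, not real values.

```python
def ducklake_attach_sql(catalog_dsn: str, data_path: str, alias: str = "lake") -> str:
    """Build an ATTACH statement for a DuckLake with a Postgres catalogue.

    catalog_dsn: libpq-style connection string for the catalogue database.
    data_path:   where the data files live, e.g. an s3:// prefix.
    alias:       name the lake is attached under inside DuckDB.
    """
    if not alias.isidentifier():
        raise ValueError(f"alias must be a plain identifier, got {alias!r}")

    def quote(value: str) -> str:
        # SQL string literal: wrap in single quotes, double any inside.
        return "'" + value.replace("'", "''") + "'"

    target = quote("ducklake:postgres:" + catalog_dsn)
    return f"ATTACH {target} AS {alias} (DATA_PATH {quote(data_path)});"


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(
        ducklake_attach_sql(
            "dbname=ducklake_catalog host=localhost",
            "s3://my-bucket/manga/",
        )
    )
```

The `ducklake` and `postgres` extensions need to be installed before this will work, and the S3 credentials go into a DuckDB secret, which isn't shown here.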

So my infrastructure was sorted; I just needed to implement it. Some goals for switching to this more convoluted stack:

I wanted to increase throughput
I wanted to be able to query the database while writers were writing: when using an embedded duckdb database, I had to stop the writing process to be able to query the database (e.g. getting an idea of progress!).

This required rewriting a lot of my queries. In particular, ducklake comes with these limitations:

No primary keys or indexes: I had to remove indexes and PK-constraint DDLs from all my tables. So naturally, all of my INSERT OR REPLACE ... no longer worked as there were no primary keys to check against. The closest approach would be MERGE INTO (foreshadowing)
Also no foreign keys, so forget about those
The version of DuckDB I was using (1.4.1?) did not have inlining for postgres catalogues (more foreshadowing)
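
To make the INSERT OR REPLACE to MERGE INTO rewrite concrete, here is a minimal sketch in Python that generates an upsert-style MERGE statement from a list of key columns. It emits the standard SQL form of MERGE and assumes DuckDB accepts it as written. The function name and the table and column names in the usage note are hypothetical, and identifiers are interpolated unescaped, so they must come from trusted code.

```python
from typing import Sequence


def merge_upsert_sql(
    target: str,
    source: str,
    keys: Sequence[str],
    values: Sequence[str],
) -> str:
    """Build a MERGE INTO statement that upserts `source` rows into `target`.

    keys:   columns that identify a row (what the primary key used to do).
    values: the remaining columns to overwrite on a match.
    """
    if not keys:
        raise ValueError("at least one key column is required")

    # Without a primary key, the match condition has to be spelled out.
    on = " AND ".join(f"{target}.{k} = s.{k}" for k in keys)
    columns = [*keys, *values]
    insert_cols = ", ".join(columns)
    insert_vals = ", ".join(f"s.{c}" for c in columns)

    lines = [
        f"MERGE INTO {target}",
        f"USING {source} AS s",
        f"ON ({on})",
    ]
    if values:
        sets = ", ".join(f"{v} = s.{v}" for v in values)
        lines.append(f"WHEN MATCHED THEN UPDATE SET {sets}")
    lines.append(
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )
    return "\n".join(lines) + ";"
```

For example, `merge_upsert_sql("manga", "staging", ["manga_id"], ["title", "synopsis"])` produces a statement that updates `title` and `synopsis` where `manga_id` already exists and inserts the row otherwise. Unlike INSERT OR REPLACE, nothing enforces uniqueness here: if `staging` itself contains duplicate keys, that has to be handled before the merge.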

Jikan issues

I ran into some issues with the Jikan API. Nothing incredibly show-stopping, but definitely annoying. I think it also made my code a lot uglier, having to put in safeguards for all these edge cases that I didn't think would happen. This isn't a dig at the Jikan team (in fact, I'm overall very happy with the API); I'm just documenting my experiences.