What do I want to say in this blog post?
MERGE INTO can be debilitatingly slow.
FOR UPDATE SKIP LOCKED
This is part 1 in a series (hopefully!) of posts exploring recommendation algorithms. I sometimes get into manga-reading phases and haven't been super satisfied with the recommendations I've been able to find. My objectives are to learn about recommendation algorithms and to have something deployed and released to the world.
This first part is going to talk about the data.
There are some existing datasets out there, but for one reason or another, they didn't fit the requirements I had in mind.
Truong-Binh Duong's Manga & Anime Dataset 2024:
In general, I wanted the data to have a significant amount of text, in the form of synopses, reviews, recommendations, etc. My thought is that this will provide an avenue to use embeddings for recommendations. I also wanted to make use of character data, as my favorite part of a story is often the characters.
MyAnimeList (MAL) is a website with tons of data on anime and manga. It contains not only "factual" data (like titles, authors, publication range, publishers, etc.) but is also a hub where enthusiasts post reviews and share recommendations. I think it's a great starting point for my project. It's not very easy to scrape the site though -- the public API is very lacklustre, and getting an API key involves an approval process. Much of the data is rendered server-side before being sent to the client, so there are few endpoints with convenient JSON responses to parse.
Enter Jikan. This is an unofficial API that mirrors the data on MAL with a more developer-friendly interface. I tried briefly to self-host the indexer but couldn't get it working, so I figured I'd just scrape Jikan directly. Adhering to the rate limits (~1 rps) has been difficult at times.
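To make the ~1 rps constraint concrete, here is a minimal sketch of a client-side throttle. It is written in Python purely for illustration (the scraper itself is F#), and `Throttle`, `fetch_all`, and the `fetch` callback are hypothetical names, not part of Jikan or any library. The clock and sleep functions are injectable so the spacing logic can be checked without actually waiting.

```python
import time


class Throttle:
    """Spaces out calls so that at most `rate_per_sec` start per second."""

    def __init__(self, rate_per_sec=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate_per_sec
        self.clock = clock        # injectable, so tests need no real waiting
        self.sleep = sleep
        self.next_allowed = 0.0   # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval


def fetch_all(urls, fetch, throttle):
    """Call the caller-supplied `fetch(url)` once per URL, throttled."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

In a real scraper, `fetch` would wrap the HTTP call. A throttle like this only controls when requests start, so slow responses or server-side limits could still call for retries and backoff on top of it.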
Some constraints I had for this project are:
Usually, there wouldn't be too much to think of in terms of architecture. Generate a list of targets, queue them up, and go through the queue until you're done. Put stuff in a database. Easy?
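As a rough sketch of that loop, assuming Python and the standard-library `sqlite3` module as a stand-in for the database: the `crawl` function, the `fetch` callback, and the `manga` table are hypothetical names chosen for illustration, not the actual script.

```python
import json
import sqlite3
from collections import deque


def crawl(targets, fetch, conn):
    """Drain a queue of target IDs, storing each fetched payload as JSON."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS manga (id INTEGER PRIMARY KEY, payload TEXT)"
    )
    queue = deque(targets)
    failed = []
    while queue:
        target = queue.popleft()
        try:
            payload = fetch(target)
        except Exception:
            failed.append(target)  # remember it so it can be retried later
            continue
        # INSERT OR REPLACE uses the primary key to overwrite an existing row.
        conn.execute(
            "INSERT OR REPLACE INTO manga (id, payload) VALUES (?, ?)",
            (target, json.dumps(payload)),
        )
        conn.commit()
    return failed
```

Committing after every row is slow but means an interrupted run loses at most one item. The returned list of failed targets can be fed back in as the next run's queue.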
Well, yes, and this is more or less what I started with.
I wrote a script in F# to do this, as I've been liking the language as of late. I used DuckDB for the database. For the API, I initially used Hawaii to generate an F# client but ran into some annoying issues. I ended up switching to SwaggerProvider and I found that the generated client fit my expectations a lot better. One issue I had with this type provider was that it was really slow to compile and bogged down my LSP. This was mostly alleviated by moving the client to a separate project.
This worked well, but I thought... "What if I could double or triple the throughput? I have a few machines to use." In hindsight, I think the right move would have been to host a Postgres instance and call it a day, but I've been known to be distracted by new and shiny objects. And this is a project for me, so why not?
DuckLake is a data lake format that I've found pretty easy to set up. At its simplest, you don't even need a catalogue server (the need for one has always turned me off from Iceberg). Unfortunately, a multi-writer setup needs coordination, and that necessitates a catalogue service. So I could have gone with Iceberg, but I really liked the experience with DuckDB and saw no reason to switch.
First, I set up a Postgres server. I have access to an account with the UW Computer Science Club. I feel like this is the membership that keeps on giving. If you're a UW student, I highly recommend getting a membership. The account does come with limitations: you don't have root access, and your home folder's storage space is capped at 4 GB, though there is an NFS-mounted scratch space. I set up my Postgres instance with some Nix and Tailscale shenanigans which might be better suited for a different article.
I already have an S3 bucket with Wasabi. I pay $7/month/TB of storage, and since I haven't hit the first terabyte, using this service doesn't incur any marginal costs.
With my infrastructure sorted, I just needed to implement it. Some goals for switching to this more convoluted stack:
This required rewriting a lot of my queries. In particular, DuckLake comes with these limitations:
INSERT OR REPLACE ... no longer worked as there were no primary keys to check against. The closest approach would be MERGE INTO (foreshadowing).
I ran into some issues with the Jikan API. Nothing incredibly show-stopping, but definitely annoying. I think it also made my code a lot uglier, having to put in safeguards for all these edge cases that I didn't think would happen. This isn't a dig at the Jikan team (in fact, I'm overall very happy with the API); I'm just documenting my experiences.
Other blog post outlines?