Xyon (xyon) wrote,

More Movie Database Stuff

So, after I add a movie into the personal_movies table I like to get the IMDb id's for the cast. The only reason I really do it now is to have a cloud capability; originally when I was trying to repair truncation put in place by the JMDb import I was doing it so I would be able to look up the actor quickly.

Regardless of the reson, I have this process to import the IDs, but it requires that I know who to import. I only ever bothered with the one that checks for them all, not one that checked for a particular movie:
select distinct p.person_id, p.imdb_name from personal_movies pm inner join movies m on m.imdb_id=pm.imdb_id inner join movie_credits mc on mc.movie_id=m.movie_id inner join persons p on p.person_id=mc.person_id where p.imdb_id is null and mc.credit_type=1

That was taking a really long time (~5 minutes), so I used MySQL's explain to show what it was doing with the query, and since I knew some assumptions about the data I knew that it was joining it wrong:
select straight_join distinct p.person_id, p.imdb_name from personal_movies pm inner join movies m on m.imdb_id=pm.imdb_id inner join movie_credits mc on mc.movie_id=m.movie_id inner join persons p on p.person_id=mc.person_id where p.imdb_id is null and mc.credit_type=1 (straight_join tells the query builder to not try to re-order the join). This brought the time down a bit made me think it brought down the time, but what really happened is the data was just in-cache, it had no real statistical difference.

So I left it at that for a while; until the update time started to annoy me again. Read through some more of the MySQL reference manual, and read about dirty indicies and how to fix them with optimize table (rebuild the indicies). This actually did make an improvement; the same query now ran in about 1.1 minutes with stale cache. However, explain still showed that a temporary table was being created to work with persons; and nowhere could I find anything that suggested how to avoid that, just that it was a bad thing.

Now we come to earlier today. I'm working on a version that does it for just a particular movie, so that when I add in a movie it automagically gets the missing person ID entries and everything works well. So I had pretty much the same query, but also filtering on m.movie_id (or m.imdb_id depending on how it was encountered). The problem? The filtered query, which was using 1/500 the space (didn't have everything from personal_movies, which is presently around 500 entries) was taking ~1.1 minutes as well. That struck me as really odd, since I can pull up the movie cast nearly instantaneously.
Query 1: select p.imdb_name, p.person_id from persons p natural join movie_credits mc where mc.movie_id=id: ~0.2 seconds
Query 2: select p.imdb_name, p.person_id from persons p natural join movie_credits mc where mc.movie_id=id and p.imdb_id is null: ~1.1 minutes.

Now, wait a second. Query 2 is clearly a subset of query 1. Query 1 returned (let's say) 30 rows; checking for null in that space isn't hard at all.

This is when it hits me. What I'm telling MySQL to do in my mind and what I'm telling MySQL to do in my query are not the same thing. In my mind the second query is a filter on the first, but it's not. Here's what MySQL thinks I told it:
"Find all persons entries where the imdb_id is null that have corresponding movie_credits entries that have a movie_id of id" And here's what I think I told it: "Find all the persons entries that have movie_credits entries linking them to movie id; but only show me the ones that have a null imdb_id"

The "using temporary" that explain was warning me about was this: MySQL was building a temporary table with only persons entries with a null imdb_id and using that in the join. So the first step was making a temporary table and copying in 979568 out of 999704 rows. That's a lot of data being copied. I thought that it was making a temporary table of the query without the null filter, then null filtering. But, alas, no.

My fault lie in telling MySQL to pre-filter the data. It erroneously assumed that I knew what I was telling it, and it went on about its merry way to do just that. In fact, it assured me, via explain, that really, only 4 rows needed to be examined after the whole using temporary thing was done (this bothered me, and the straight_join told me ~5000; but that seemed more reasonable, which is why I left in the straight_join). What I really wanted to do is perform the 0.2 second operation, cull out some data, and be happy; possibly taking a massive 0.3 seconds.

You've all quit caring now, but here's what the magical fix was:
select distinct p.person_id, p.imdb_name, p.imdb_id from sixteen.personal_movies natural join movies m natural join movie_credits mc inner join persons p on p.person_id=mc.person_id where mc.credit_type=1 having imdb_id is null

So now the extreme database novices are asking "having?, what's having?"; and those who are in the know are asking "isn't having the thing you use when you want to do a where on the result of an aggregate function (max, sum, count, etc)?"; and those who are really in the know are thinking "Of course, that's what you said in your English description: perform the easy query then post-filter it".

where is the keyword that novices use to do everything, until they want to do something like select name, count(*) as cnt from foo where cnt > 10 group by name and find they can't. Then they do select name, count(*) as cnt from foo group by name having cnt > 10; then remember to use having as the filter on an aggregate. The next thing they should do is remember why they use it on an aggregate, and why they -have- to use the alias (as opposed to in the where, where you can't use the alias): having operates on the dataset returned by everything before it.

So the new query, the one with having, operates in near-zero time; even across all movies. In fact, the time to launch perl, parse the code, invoke mysql, have mysql read the queryfile, have mysql write the (empty) resultset, have the perl program write a few status messages, and it to exit is statistically 1 second.

Normal: ~5 minutes once personal_movies broke about 400 rows
straight_join: This may have been saving some trivial amount of time, but it wasn't much. Call it ~4 minutes 58 seconds.
optimize table: This saved a bunch of time, since movie_credits has some 4.2 million entries (since my web display pages never incorporated anything but actors (credit_type=1) I no longer build the linking entries for anything else, otherwise this would be much bigger) and the indicies were scattered.: ~1 minute 6 seconds.
having: Since MySQL can only use one index per table, and the index it decided was better to use was person_id (for the credits join) it had to filter out the imdb_id some other way. Since I told it to pre-filter it had to create a temporary table which was big enough to exceed its ram allotment, which made disk I/O necessary. Using having made it be a post-filter, which let it do all the pre-filtering on indicies; and the post-filtering on a "small" dataset.: ~1 s (I ran it about 15 times just now, and they seemed to alternate between 0.40s and 0.41s)

So, while your individual mileage may vary; remember that there's more than one way to write a query to give back the exact same data. If you're really good you'll remember all of them and try all of them before commiting any one of them to your process.
  • Post a new comment


    default userpic

    Your reply will be screened

    Your IP address will be recorded 

    When you submit the form an invisible reCAPTCHA check will be performed.
    You must follow the Privacy Policy and Google Terms of use.