Xyon (xyon) wrote,

IMDb parsing blues

So I've been trying to write an import utility for the IMDb files that they publish (http://www.imdb.com/interfaces).

I started with movies.list; that was no problem.

actors.list... My first parsing run was at 2:20, it's now 4:20, and I'm only getting 2% in (83,000 lines successfully read -- max was 780,000 successfully read).

I guess I started looking at a bad place... because this was my reference example:
2 Dope, Shaggy      Backyard Wrestling 2: There Goes the Neighborhood (2004) (VG)  (voice)  [Himself]  <19>
            Big Money Hustlas (2000) (V)  [Sugar Bear]  <2>
            Bowling Balls (2004) (V)  [Shaggy]  <2>
            Juggalo Championship Wrestling Volume 1 (2000) (V)  (as Shaggy 2 Dope of Insane Clown Posse)  [Himself]
            Summerslam (1998) (V)  [Himself]
            WCW Road Wild '99 (1999) (V)  [Himself]
            "Raw Is War" (1997)  [Himself (1997)]
            "Sunday Night Heat" (1998)  [Himself (1998-1999)]
            "WCW Monday Nitro" (1995)  [Himself (1999-2000)]  <90>
            "WCW Saturday Night" (1991)  [Himself (1998-1999)]
            "WCW Thunder" (1998)  [Himself (1998-1999)]
            "WCW Worldwide Wrestling" (1991)  [Himself (1998-1999)]

So, what can we learn?

LastName, FirstName(tab)Title(space-space)(\(annotation data\))* \[CharacterName] <CreditPosition>? (I'm going to admit that I missed (voice) initially, so I thought the only thing that could come after (space-space) was (as AKAName).)
Successive entries:
(tabs)Title(space-space)(\(annotation data\))* \[CharacterName] <CreditPosition>?
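
For illustration, here is a minimal Python sketch of that first-guess grammar. It is not the importer described in this post; the names FIRST_GUESS and parse_line are made up for the example, and it assumes the actor name (or the leading tabs on a continuation line) ends at the last tab and that the remaining fields are separated by exactly two spaces.

```python
import re

# First guess: Title  (annotation)*  [CharacterName]  <CreditPosition>?
FIRST_GUESS = re.compile(
    r"^(?P<title>.+?)"                # shortest title that lets the rest match
    r"(?P<notes>(?:  \([^()]*\))*)"   # zero or more flat (annotation) groups
    r"  \[(?P<character>[^\]]*)\]"    # [CharacterName], required in this guess
    r"(?:  <(?P<position>\d+)>)?$"    # optional <CreditPosition>
)

def parse_line(line):
    """Return (actor_name_or_None, fields_or_None) for one data line."""
    # Everything before the last tab is the name column (empty on
    # continuation lines); everything after it is the credit.
    name, _, credit = line.rstrip("\n").rpartition("\t")
    match = FIRST_GUESS.match(credit)
    if match is None:
        return (name.strip() or None), None
    fields = match.groupdict()
    # Turn the raw "  (a)  (b)" run into a list of annotation strings.
    fields["notes"] = re.findall(r"\(([^()]*)\)", fields["notes"])
    return (name.strip() or None), fields
```

A line that does not fit the guess comes back with None for its fields, which is how the surprises further down would show up.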

Okay, well, all's going well..

Data line 7:
'El Francés', José  Alma gitana (1996)  <45>

Oh, no character name. Make that be optional.

Data line 11:
            "Querida Concha" (1992)

No character name... nothing after the title at all.

Data line 186918:
Bear, Robert        Abby Singer (2003)  JD  <8>

Character name doesn't have square brackets??
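
One way to cope with all three of those surprises is to stop matching a fixed pattern and instead classify each field by how it starts. The sketch below is only an illustration under the same double-space assumption as the grammar above; classify_fields is a hypothetical helper, and it deliberately keeps annotations as raw text because it does not try to understand nesting.

```python
import re

def classify_fields(credit):
    """Split a credit on runs of 2+ spaces and label each piece."""
    fields = re.split(r" {2,}", credit.strip())
    result = {"title": fields[0], "notes": [], "character": None, "position": None}
    for field in fields[1:]:
        if field.startswith("<") and field.endswith(">") and field[1:-1].isdigit():
            result["position"] = field[1:-1]
        elif field.startswith("[") and field.endswith("]"):
            result["character"] = field[1:-1]
        elif field.startswith("("):
            # Kept verbatim, parentheses included; no nesting analysis here.
            result["notes"].append(field)
        else:
            # Bare text such as "JD": treat it as an unbracketed character name.
            result["character"] = field
    return result
```

With this, a title with nothing after it, a credit position with no character, and an unbracketed character name all parse without special cases. Whether bare text really is always a character name is a guess based on the one Abby Singer line.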

Data line 781540:
Into the Breach: 'Saving Private Ryan' (1998) (V)  (as Capt. Dale Dye, USMC (Ret.))  [Himself]  <13>

Does anyone see what I didn't like about this? I had been shady and not written a depth-aware parser, so the "as" section read as "as Capt. Dale Dye, USMC (Ret." and then there was an illegal character under the cursor. (There was also one with a ']' in the character name; it happens often. I thought I'd lost the example, but here it is.)

Data line 3049:
            Life and Legend of Bruce Lee (1973)  (archive footage) (uncredited)  [Hakim [from "Game of Death"]]

Actually, 3049 didn't break because of the "]]"; it broke because (archive footage) is followed by (uncredited). I hadn't seen any line with two sets of annotations at this point.

So I just wrote the depth-aware parser, and made sure to include this as a special case, and here's where it first went off:
Data line 82358:
            Hombre despiadado, Un (1991)  (as Raúl Araiza (II)
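
For reference, a depth-aware scan of the kind described here can be sketched in a few lines of Python. This is an illustration, not the parser from this post: split_groups and PAIRS are made-up names, it is meant to run on the text after the title, and it simply skips anything outside brackets (so a bare "JD" would need separate handling).

```python
PAIRS = {"(": ")", "[": "]", "<": ">"}

def split_groups(tail):
    """Return [(opener, inner_text), ...] for each top-level bracketed group."""
    groups = []
    i = 0
    while i < len(tail):
        opener = tail[i]
        if opener not in PAIRS:
            i += 1  # whitespace or bare text between groups
            continue
        closer = PAIRS[opener]
        depth = 0
        for j in range(i, len(tail)):
            if tail[j] == opener:
                depth += 1
            elif tail[j] == closer:
                depth -= 1
                if depth == 0:
                    break  # found the closer that matches tail[i]
        else:
            # Ran off the end of the line with the group still open.
            raise ValueError(f"unbalanced {opener!r} in {tail!r}")
        groups.append((opener, tail[i + 1:j]))
        i = j + 1
    return groups
```

On the Dale Dye line this yields the whole "as Capt. Dale Dye, USMC (Ret.)" annotation as one group, and on line 3049 it yields two annotations plus the nested character name. The tail of line 82358 raises ValueError, because its parentheses never balance; no amount of depth tracking fixes that, so a line like that has to be logged or special-cased.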

IM conversation I was having while doing this...
tekman: Well, they're almost certainly machine generated, so they've got to be consistent, you just have to figure out exactly what format they are consistent to.
xyon: well, I made it over 180,000 lines before I ran into that one
xyon: and the JD is his character name, which is usually in []'s
tekman: That's bizarre.
tekman: How can that be true?
tekman: I mean, surely whatever program was generating these text files didn't just decide to leave out the brackets.
xyon: I checked his name on IMDb, he's listed as JD in Abby Singer
tekman: Yeah, I know, I just looked.

At this point I'm thinking that they're hand-written.