I started with movies.list; that was no problem.
Then actors.list... My first parsing run started at 2:20, it's now 4:20, and the current run only gets 2% of the way in (83,000 lines read successfully -- the best run so far read 780,000).
I guess I started looking in a bad place... because this was my reference example:
2 Dope, Shaggy	Backyard Wrestling 2: There Goes the Neighborhood (2004) (VG) (voice) [Himself] <19>
	Big Money Hustlas (2000) (V) [Sugar Bear] <2>
	Bowling Balls (2004) (V) [Shaggy] <2>
	Juggalo Championship Wrestling Volume 1 (2000) (V) (as Shaggy 2 Dope of Insane Clown Posse) [Himself]
	Summerslam (1998) (V) [Himself]
	WCW Road Wild '99 (1999) (V) [Himself]
	"Raw Is War" (1997) [Himself (1997)]
	"Sunday Night Heat" (1998) [Himself (1998-1999)]
	"WCW Monday Nitro" (1995) [Himself (1999-2000)] <90>
	"WCW Saturday Night" (1991) [Himself (1998-1999)]
	"WCW Thunder" (1998) [Himself (1998-1999)]
	"WCW Worldwide Wrestling" (1991) [Himself (1998-1999)]
So, what can we learn?
(tabs)Title(space-space)(\(annotation data\))* \[CharacterName\] <CreditPosition>?
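For illustration, that first guess could be sketched as a Python regular expression roughly like the one below. This is a sketch under assumptions, not the parser from this post: the name `CREDIT_RE` and its group names are made up, it only looks at the credit part of a line (after the actor's name), it cuts the title off at a plain four-digit "(year)", and it accepts any whitespace between fields because the exact spacing may not have survived being pasted here.

```python
import re

# First-guess pattern: title, zero or more flat "(...)" annotations,
# a REQUIRED "[character]" and an optional "<position>".
# Hypothetical sketch; the year handling is a simplifying assumption.
CREDIT_RE = re.compile(
    r"""
    ^\t*                             # leading tabs on continuation lines
    (?P<title>.+?\(\d{4}\))          # title, cut off after its "(year)"
    (?P<notes>(?:\s+\([^()]*\))*)    # annotations, with no nesting allowed
    \s+\[(?P<role>[^\]]*)\]          # character name in square brackets
    (?:\s+<(?P<pos>\d+)>)?           # optional credit position
    \s*$
    """,
    re.VERBOSE,
)

m = CREDIT_RE.match("Big Money Hustlas (2000) (V) [Sugar Bear] <2>")
print(m.group("title"), "|", m.group("role"), "|", m.group("pos"))

# A credit with no "[character]" part is rejected by this first guess.
print(CREDIT_RE.match("Alma gitana (1996) <45>"))  # None
```

A pattern this strict is handy mainly because it fails loudly on the first line that doesn't fit.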
Okay, well, all's going well...
Data line 7:
'El Francés', José Alma gitana (1996) <45>
Oh, no character name. Make that optional.
Data line 11:
"Querida Concha" (1992)
No character name... nothing after the title at all.
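Relaxing the pattern so that everything after the title is optional might look something like this in Python. Again, this is only a sketch: `RELAXED_RE` and `parse_credit` are hypothetical names, the input is assumed to be just the credit part of the line (without the actor's name), and the title is assumed to end at a four-digit "(year)".

```python
import re

# Same idea as a strict title/annotations/[role]/<position> pattern,
# but the role and the position are now both optional.
RELAXED_RE = re.compile(
    r"""
    ^\t*
    (?P<title>.+?\(\d{4}\))          # title, cut off after its "(year)"
    (?P<notes>(?:\s+\([^()]*\))*)    # zero or more flat annotations
    (?:\s+\[(?P<role>[^\]]*)\])?     # optional "[character]"
    (?:\s+<(?P<pos>\d+)>)?           # optional "<position>"
    \s*$
    """,
    re.VERBOSE,
)


def parse_credit(credit):
    """Return a dict of fields, or None if the credit does not fit."""
    m = RELAXED_RE.match(credit)
    if m is None:
        return None
    pos = m.group("pos")
    return {
        "title": m.group("title"),
        "role": m.group("role"),  # None when the brackets are absent
        "position": int(pos) if pos is not None else None,
    }


print(parse_credit("Alma gitana (1996) <45>"))
print(parse_credit('"Querida Concha" (1992)'))
```

Returning `None` for anything that still doesn't fit keeps the odd lines visible instead of silently mangling them.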
Data line 186918:
Bear, Robert Abby Singer (2003) JD <8>
Character name doesn't have square brackets??
Data line 781540:
Into the Breach: 'Saving Private Ryan' (1998) (V) (as Capt. Dale Dye, USMC (Ret.)) [Himself] <13>
Does anyone see what I didn't like about this? I had been shady and not written a depth-aware parser, so the "as" section read as "as Capt. Dale Dye, USMC (Ret." and then it had an illegal character under the cursor. (There was also one with a ']' in the character name, but I can't find the right example now... it happens often.) Ah, data line 3049:
Life and Legend of Bruce Lee (1973) (archive footage) (uncredited) [Hakim [from "Game of Death"]]
Actually, 3049 didn't break because of "]]"; it broke because of (archive footage) followed by (uncredited). I hadn't seen any with two sets of annotations at this point.
So I just wrote the depth-aware parser, and made sure to include this as a special case, and here's where it first went off:
Data line 82358:
Hombre despiadado, Un (1991) (as Raúl Araiza (II)
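To make "depth-aware" concrete, here is a minimal Python sketch of the idea: count nesting depth while scanning a group, so an inner "(Ret.)" or "[from ...]" doesn't end the outer group early. The function names `read_group` and `parse_tail` are hypothetical, the input is assumed to be only the part of the credit that follows the title's "(year)", and the sketch simply rejects an unbalanced line such as the one above; it is not the parser from this post.

```python
def read_group(text, start, open_ch, close_ch):
    """Return (inner_text, index_after_close) for the group at text[start].

    Nesting depth is tracked, so "(as X (Ret.))" is read as ONE group.
    Raises ValueError if the group is never closed.
    """
    if text[start] != open_ch:
        raise ValueError(f"expected {open_ch!r} at offset {start}")
    depth = 0
    for i in range(start, len(text)):
        if text[i] == open_ch:
            depth += 1
        elif text[i] == close_ch:
            depth -= 1
            if depth == 0:
                return text[start + 1:i], i + 1
    raise ValueError(f"unbalanced {open_ch!r} opened at offset {start}")


def parse_tail(tail):
    """Split the text after the title into (notes, role, position)."""
    notes, role, pos = [], None, None
    i = 0
    while i < len(tail):
        ch = tail[i]
        if ch.isspace():
            i += 1
        elif ch == "(":
            note, i = read_group(tail, i, "(", ")")
            notes.append(note)  # any number of annotations is allowed
        elif ch == "[":
            role, i = read_group(tail, i, "[", "]")
        elif ch == "<":
            end = tail.index(">", i)  # ValueError if ">" is missing
            pos = int(tail[i + 1:end])
            i = end + 1
        else:
            raise ValueError(f"unexpected {ch!r} at offset {i}")
    return notes, role, pos


print(parse_tail("(V) (as Capt. Dale Dye, USMC (Ret.)) [Himself] <13>"))
print(parse_tail('(archive footage) (uncredited) [Hakim [from "Game of Death"]]'))

try:
    parse_tail("(as Raúl Araiza (II)")
except ValueError as err:
    print("rejected:", err)
```

Note that a bare character name with no brackets at all, like the "JD" line, would still be rejected here as an unexpected character.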
Here's the IM conversation I was having while doing this:
tekman: Well, they're almost certainly machine generated, so they've got to be consistent, you just have to figure out exactly what format they are consistent to.
xyon: well, I made it over 180,000 lines before I ran into that one
xyon: and the JD is his character name, which is usually in []'s
tekman: That's bizarre.
tekman: How can that be true?
tekman: I mean, surely whatever program was generating these text files didn't just decide to leave out the brackets.
xyon: I checked his name on IMDb, he's listed as JD in Abby Singer
tekman: Yeah, I know, I just looked.
At this point I'm thinking that they're hand-written.