Doom9's Forum - View Single Post - SubExtractor

Tappen · 20th March 2012, 23:00

Ah, I finally see the problem. This is the same as the issue with l (lowercase L) and I (capital i) having the same bit pattern in many subtitle fonts, which makes accurate matching impossible. I added the entire spellcheck step just to solve that issue.

I'll have to add an option to the spellcheck step to discriminate between i and ¡ to fix this. I suppose the rule is that if the character is neither at the beginning of a word nor just after a ¿ at the beginning of a word, I can assume it's an i (eye) and not an inverted exclamation point. Otherwise I'll have to ask, and build up a dictionary of words that really begin with i. Quite a bit of work, but I'll see what I can do.
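
To make the rule concrete, here is a rough sketch of how I read it, in Python rather than anything taken from SubExtractor itself. Everything in it is an assumption for illustration: the function name `resolve_glyphs`, the idea that the OCR pass has emitted ¡ for every ambiguous glyph, and the `words_starting_with_i` set standing in for the dictionary that would be built up by asking the user.

```python
def resolve_glyphs(text, words_starting_with_i, placeholder="¡"):
    """Decide for each ambiguous glyph whether it is 'i' or '¡'.

    Assumption: every ambiguous glyph in `text` is written as `placeholder`.
    `words_starting_with_i` is a hypothetical set of lowercase words that
    are known to begin with i.
    """
    out = []
    for k, ch in enumerate(text):
        if ch != placeholder:
            out.append(ch)
            continue
        prev = out[-1] if out else ""
        if prev.isalpha():
            # Mid-word: an inverted exclamation point cannot appear here.
            out.append("i")
            continue
        # Word start (possibly right after '¿'): gather the rest of the word.
        # Ambiguous glyphs further along are mid-word, so they count as 'i'.
        j = k + 1
        while j < len(text) and (text[j] == placeholder or text[j].isalpha()):
            j += 1
        candidate = "i" + text[k + 1:j].replace(placeholder, "i")
        if candidate.lower() in words_starting_with_i:
            out.append("i")
        else:
            # Not a known i-word: keep '¡' (a real tool would ask here).
            out.append(placeholder)
    return "".join(out)


known = {"increíble", "invierno"}
print(resolve_glyphs("¿¡ncreíble? ¡S¡! ¡¡nv¡erno!", known))
```

The fallback to ¡ for unknown words is where the "ask the user" step would go; each answer would add a word to the dictionary so the same question is not asked twice.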

For now, I'd remove the training and, when you next run the OCR, choose i (eye) rather than the inverted exclamation point. There are likely more of the former than the latter, which makes cleanup easier.
