Real OCR possible? [Archive]

kagoru

18th April 2002, 20:42

I have been wondering for a long time whether it is possible to use REAL OCR when ripping subtitles. There are many nice programs out there that are really good at reading scanned material and converting it to editable text.
If one could use those programs to OCR DVD subtitles, you would not only save a lot of time but most likely also make 99% fewer spelling mistakes. The best thing would be if we could use only the engine of those OCR programs and create a front-end that would make ripping subtitles a breeze. Put a spell checker in there and you would have an ultimate subtitle machine that is many times more successful than SubRIP. Has anyone ever had similar thoughts?

gabest

19th April 2002, 07:03

Yeah, it would be pretty easy to make a real OCR prog for DVD subtitles. Since matching letters are all identical down to the last pixel, we only need to define a HUGE list of different features (center of weight, width-height ratio, ...), store them in a database and use it to look up the text. The only problem is that making that kind of db needs many ppl, their DVD discs, and somebody who will collect and merge the results.
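A minimal sketch of the feature-lookup idea described above, in Python. The glyph representation (rows of '#' and '.'), the choice of features, and the names `glyph_features`, `learn` and `lookup` are all assumptions made for illustration, not part of any existing tool. Because identical letters are pixel-identical, an exact dictionary lookup on the feature tuple is enough here; width and height are stored separately, which also fixes the width-height ratio.

```python
from typing import Dict, List, Optional, Tuple

Glyph = List[str]  # hypothetical format: rows of '#' (ink) and '.' (background)
Features = Tuple[int, int, int, float, float]


def glyph_features(glyph: Glyph) -> Features:
    """Return (width, height, ink pixel count, centroid x, centroid y)."""
    height = len(glyph)
    width = len(glyph[0]) if glyph else 0
    ink = [(x, y)
           for y, row in enumerate(glyph)
           for x, ch in enumerate(row)
           if ch == "#"]
    if not ink:
        return (width, height, 0, 0.0, 0.0)
    # "Center of weight": the mean position of all ink pixels.
    cx = round(sum(x for x, _ in ink) / len(ink), 3)
    cy = round(sum(y for _, y in ink) / len(ink), 3)
    return (width, height, len(ink), cx, cy)


def learn(db: Dict[Features, str], glyph: Glyph, char: str) -> None:
    """Store the features of a glyph a user has identified by hand."""
    db[glyph_features(glyph)] = char


def lookup(db: Dict[Features, str], glyph: Glyph) -> Optional[str]:
    """Return the known character for this glyph, or None if unseen."""
    return db.get(glyph_features(glyph))
```

Two different letters could in principle share the same feature tuple, so a real database would need more features (or the full bitmap) to stay unambiguous.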

kagoru

19th April 2002, 09:41

I think you didn't get what I meant, or I didn't express myself clearly ;-).
Each subtitle "frame" is stored as a bitmap. Therefore you could extract that picture, open it with FineReader, for example, and let FineReader do all the work. You wouldn't need any kind of database at all!

ppera2

22nd April 2002, 15:58

Good idea... However, I think it's not so simple.
There are about 1000-1500 subpics in one movie, each with its own timing.

So someone would need to write a special program that sends each subpic to the OCR, gets the text back, and puts it all together with the timings.
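The "compose all with timings" step could look roughly like the sketch below. It assumes the subpics have already been extracted as (start, end, bitmap) tuples with times in milliseconds, and that the OCR step is some callable passed in as `ocr`; both are assumptions for illustration, as are the function names. The output follows the SubRip (.srt) text layout.

```python
from typing import Callable, Iterable, Tuple

Subpic = Tuple[int, int, bytes]  # hypothetical: (start_ms, end_ms, bitmap data)


def format_timestamp(ms: int) -> str:
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def build_srt(subpics: Iterable[Subpic], ocr: Callable[[bytes], str]) -> str:
    """Run OCR on every subpic and join the results with their timings."""
    blocks = []
    for start_ms, end_ms, bitmap in subpics:
        text = ocr(bitmap).strip()
        if not text:
            continue  # skip subpics where the OCR found no text
        index = len(blocks) + 1  # SRT entries are numbered from 1
        timing = f"{format_timestamp(start_ms)} --> {format_timestamp(end_ms)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    # A blank line separates consecutive entries.
    return "\n".join(blocks)
```

Keeping the OCR behind a plain callable means the same loop works whether the text comes from an external program or from a built-in recognizer.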

kagoru

22nd April 2002, 22:03

I figured that it wouldn't be that simple. You're right. We need to extract the picture and send it to FineReader, or whatever program converts it to text, and have it send the text back to our program. It would be very simple if there was an OCR program that worked with command line parameters and returned the ASCII text. The timings shouldn't be a problem; we just add them to the text we converted.

Does anybody know if there is a good open-source OCR program, or an OCR program that accepts command line parameters?
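If such a command-line OCR program existed, calling it from a front-end could be as small as the sketch below. The command itself is deliberately left as a parameter: no particular OCR program or set of flags is assumed, only that the program takes an image path as its last argument and prints the recognized text to standard output. The function name `ocr_via_command` is made up for illustration.

```python
import subprocess
import sys
from typing import List


def ocr_via_command(command: List[str], image_path: str,
                    timeout: float = 30.0) -> str:
    """Run an external OCR command on one image and return its stdout.

    Assumption: the program accepts the image path as its final argument
    and writes the recognized text to standard output.
    """
    result = subprocess.run(
        command + [image_path],
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode the output to str
        timeout=timeout,
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.strip()


# Stand-in "OCR program" for trying the wrapper without a real engine:
# a Python one-liner that just echoes the path it was given.
FAKE_OCR = [sys.executable, "-c",
            "import sys; print('text from ' + sys.argv[1])"]
```

With the stand-in, `ocr_via_command(FAKE_OCR, "frame0001.bmp")` returns the echoed string; swapping in a real engine would only mean changing the command list.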