SubExtractor - New Sub Ocr App [Archive] - Page 7

MokrySedeS

30th August 2012, 21:10

Hi Tappen. I've encountered an error with this (http://www.sendspace.com/file/rsjml0) file. In subtitle #1524 the text "Listen!!!" is recognized as "Listen"..!"
Also e.g. in #84 "tt" is recognized as "lt".

Betsy25

1st September 2012, 21:07

I don't know which timings are the correct ones. For every movie I've processed with both DvdSubExtractor and SubRip, the two return different timings for the same content, sometimes differing by up to half a second. Which one actually produces the "correct" timings? :helpful:

Tappen

4th September 2012, 21:42

MokrySedeS: I was able to reproduce the problem with tt being recognized as lt. I'm not sure it's fixable, though. The text in this sup file is strangely small for a Blu-ray and the cross on the t is just small enough to be fuzzy logic'ed away by the code that automatically looks to split unrecognized characters in 2.

I wasn't able to reproduce the Listen!!! being recognized as Listen"..! problem. I think you have a " character in your database that I don't have. Again I can see how this would happen but can't think of a way to fix it that wouldn't create other recognition problems. After all !! could really be an exact match to a double quote above 2 periods in some other subtitle's font.

These are both errors, but I can't think how to fix them. Thank you for bringing them to my attention; I'll try to come up with ideas in the future.
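
For anyone curious why the "!!" case is ambiguous, here is a minimal sketch in Python. It is not SubExtractor's code: the tiny bitmap, the `blobs` flood fill and the `group` helper are all made up for illustration. It only shows that the same four ink blobs can be read column by column (two "!" characters) or row by row (a double quote sitting above two periods).

```python
from collections import deque

def blobs(rows):
    """Return bounding boxes (top, left, bottom, right) of 4-connected '#' regions."""
    height, width = len(rows), len(rows[0])
    seen = set()
    boxes = []
    for y in range(height):
        for x in range(width):
            if rows[y][x] != "#" or (y, x) in seen:
                continue
            # Flood-fill one blob and track its bounding box.
            queue = deque([(y, x)])
            seen.add((y, x))
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and rows[ny][nx] == "#" and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def group(boxes, key):
    """Group blobs that share the same extent along one axis (a deliberately crude rule)."""
    groups = {}
    for box in boxes:
        groups.setdefault(key(box), []).append(box)
    return [sorted(groups[k]) for k in sorted(groups)]

# Hypothetical 3x4 pixel rendering of "!!" in a very small font.
TINY = [
    "#.#",
    "#.#",
    "...",
    "#.#",
]

boxes = blobs(TINY)
by_column = group(boxes, key=lambda b: (b[1], b[3]))  # stem + dot per column: "!" "!"
by_row = group(boxes, key=lambda b: (b[0], b[2]))     # two stems, then two dots: '"' ".."
```

Both groupings cover exactly the same pixels, so a matcher that already knows a double quote of that size has no purely geometric reason to prefer one reading over the other.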

Tappen

4th September 2012, 21:57

Betsy25: One of the reasons I wrote SubExtractor is because the timings were wrong on a bunch of SubRip IDX files I had made. There is no perfect timing for many DVDs, which are made up of discrete pieces (cells, programs, program chains) of video+audio+subtitles designed to be concatenated together at the whim of a hardware DVD player. Within a cell and most program chains (made up of 1 or more cells) there are reliable timestamps, but once you have more than 1 program chain making up a movie - or the program chain has timing discontinuities - exactly how to keep the time continuous is an art more than a science. Watch how various video player software deals with MPEG PS files (.mpg usually) and you'll see how timing is pretty random.

Short answer: I spent a lot of time trying to match the timestamps SubExtractor creates to what you get when you just simply stick all the video and audio pieces together for the main program(s) of a DVD. I think it's more reliable than SubRip. If you find where that's not true let me know the DVD and I'll see if there's a problem I can fix.
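
To make the "stick all the pieces together" idea concrete, here is a minimal sketch in Python. It is an assumption about the general approach, not SubExtractor's actual algorithm: the `Cell` class, its field names and the sample numbers are invented for illustration. Each piece keeps its own clock, and a subtitle's time on the joined timeline is the total length of the pieces before it plus the subtitle's offset inside its own piece.

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    """One piece of the movie with its own, possibly discontinuous, clock (hypothetical)."""
    first_pts: float                 # timestamp of the piece's first frame, in seconds
    duration: float                  # playback length of the piece, in seconds
    sub_pts: list = field(default_factory=list)  # subtitle start times on the same clock

def flatten(cells):
    """Map every subtitle start time onto one continuous timeline."""
    timeline = []
    offset = 0.0                     # total length of all pieces seen so far
    for cell in cells:
        for pts in cell.sub_pts:
            timeline.append(offset + (pts - cell.first_pts))
        offset += cell.duration
    return timeline

# Two pieces whose clocks both restart at 0.5 s (a timing discontinuity).
cells = [
    Cell(first_pts=0.5, duration=600.0, sub_pts=[10.5, 300.5]),
    Cell(first_pts=0.5, duration=300.0, sub_pts=[5.5]),
]
continuous = flatten(cells)
```

A tool that instead trusts the raw timestamps across the join, or measures piece lengths slightly differently, will drift from this result, which is one plausible source of the half-second disagreements mentioned above.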

Betsy25

6th September 2012, 02:09

Thanks Tappen. Guess that's just one more reason to stick with SubExtractor.:)

MokrySedeS

10th September 2012, 09:09

Tappen, could you add Polish characters to "Advanced Word Spacing Adjustment"?
"ą" in particular is causing me problems and I can't fix it.
Also, a very annoying thing: SubExtractor steals focus when it finds an unrecognized character. I screwed up my OCR just now while writing this post and had to hunt down the error. An option to disable focus stealing would be nice.

Tappen

10th September 2012, 20:34

MokrySedeS: OK I'll look into both of these issues.

MokrySedeS

10th September 2012, 22:20

Thanks in advance :bow:

deco20

11th September 2012, 07:51

Tappen, is it possible to create an editor for OcrMap.bin? Once you make a mistake and don't notice it right after the OCR process, you have to delete this file and start collecting characters again.

Tappen

12th September 2012, 21:37

deco20: you can run the same OCR again and then delete the mistakes when it gets to the end of the file using the "Review and Correct OCR Matches" button. SubExtractor will recognize that the file name is the same and correctly edit the part of the database that was created when you ran the OCR the first time on the file.

I don't think I could write an editor big enough to let you find mistakes in the full database: there are just too many characters. Maybe if I let you choose from the list of movie names and brought up the "Review and Correct OCR Matches" dialog without making you go through the OCR again. I'll think about adding it as a feature.

Tappen

30th September 2012, 23:43

Version 1029 is out on http://subextractor.codeplex.com/ with the following fixes - mostly requests from this thread:

Feature: Added option to make i and ¡ characters movie-specific for improved OCR on Spanish subs (Special Characters tab in Options)
Feature: Allow switch to Word Spacing dialog directly from Spell Check dialog
Fix: Added more default word spacings for accented characters
Fix: Changed Word Spacing dialog to show all OCR'd characters in current sub
Fix: Removed application focus grab during OCR
Fix: Tightened HD subs fuzzy logic to reduce false matches in small characters
Fix: Improved Arrow key selection during OCR

errantkkn

25th October 2012, 14:32

Tappen, could you add a feature to SubExtractor along the lines of: press the keys, then Enter, then OCR? In my language I have to press two or three keys for each letter (e.g. "e" + "6" + "3" = "ể"). Thank you.

Tappen

26th October 2012, 01:20

errantkkn: Would a button that controlled whether an "Enter" key is required to finish a character fix your problem? By default the first character would be used like it is today, but if this new button was in the toggled state SubExtractor would wait for the Enter key before reading the character in the text box.

We could make a hotkey like F2 toggle the state as well so it would be easy to switch modes if you have a section of Latin characters in the subtitle.
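
The proposed behaviour can be sketched as a small state machine. This Python sketch is illustration only, not the program's code: the `MatchInput` class, its `press` method and the one-entry `COMPOSE` table (standing in for the operating system's input method, using the "e" + "6" + "3" example from the request above) are all hypothetical.

```python
# Stand-in for the input method that turns key sequences into one letter.
COMPOSE = {"e63": "ể"}

class MatchInput:
    """Collects keystrokes for one OCR match, with an optional wait-for-Enter mode."""

    def __init__(self):
        self.wait_for_enter = False  # default: first character is used immediately
        self.buffer = ""

    def press(self, key):
        """Handle one key; return the committed text, or None if nothing is committed yet."""
        if key == "F2":              # hotkey toggles the mode and discards pending keys
            self.wait_for_enter = not self.wait_for_enter
            self.buffer = ""
            return None
        if key == "Enter":
            pending, self.buffer = self.buffer, ""
            return COMPOSE.get(pending, pending) or None
        if not self.wait_for_enter:
            return key               # today's behaviour: commit on the first character
        self.buffer += key           # toggled mode: keep collecting until Enter
        return None

box = MatchInput()
latin = box.press("a")               # committed at once in the default mode
box.press("F2")                      # switch to wait-for-Enter mode
steps = [box.press(k) for k in ("e", "6", "3")]
composed = box.press("Enter")        # only now is the character committed
```

Pressing F2 again returns to the default mode, so a run of Latin characters in the middle of a subtitle can still be entered with a single keystroke each.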

errantkkn

27th October 2012, 04:22

GrofLuigi

29th October 2012, 18:10

Hi Tappen,

First time user here. The program hangs when given an unprotected DVD ripped with DVD Decrypter using the option "File Splitting: None". I create all my rips this way so that I have only 1 VOB and 1 IFO file per DVD; most of media players treat it as DVD if I open the Ifo file, so I know it works. I can also rip subs from them with other sub rippers.

GL

Tappen

29th October 2012, 23:50

GrofLuigi: this is just something that I never considered. It's not allowed on real discs of course so I had no reason to think someone would extend the IFO file format this way. Maybe I can look into it. DVD Decrypter stopped development 8 years ago and doesn't work on many discs these days so it's hard to get too interested in supporting it. Also, SubExtractor does support idx/sub files.

GrofLuigi

30th October 2012, 01:57

Tappen, this was not a negative comment, just an FYI. I wanted to try your program and maybe give some comments/suggestions/praise, but I couldn't even start. I will try with DVD discs later.

Most other tools parse this kind of IFO file correctly (and yes, a few also choke or act buggy, but don't crash), so I've been doing it this way forever. :p

Examples of tools that parse and OCR these properly: as old as SubRip and as new as Subtitle Edit. Playing properly: VideoLAN. Playing with bugs (not always able to select/display subtitles): MPC-HC.

*Edit: It started working with 1 GB VOBs. The produced file is excellent; I see no errors in it! This is a very good program! :eek:

GL

Chetwood

30th October 2012, 13:47

GrofLuigi wrote: "most of media players treat it as DVD if I open the Ifo file"
Usually that's only needed for standalone players, because software players handle regular VOBs just fine.

errantkkn

30th October 2012, 19:46

Hi Tappen,
I have a question: could I back up the OCR library so I can bring it to another computer without typing everything again? And how?

Thanks.

GrofLuigi

30th October 2012, 21:19

Another FYI, for anyone interested: It worked with .sub/idx extracted with VsRip from those non-split VOBs, but didn't accept the .sup extracted with PgcDemux.

The result of OCR was again outstanding.

GL

Tappen

31st October 2012, 00:41

errantkkn: On the bottom of the first page of options there's a checkbox labeled "Use Program Exe Location". Check this and hit OK. This will move the OCR database to the directory where you unzipped the install file. You can then copy this OcrMap.bin file to other machines. Either rename it to OcrMapOrig.bin and substitute it into the zip file before you install, or copy it on top of the OcrMap.bin file on the other machine AFTER unzipping, running the program for the first time, and checking "Use Program Exe Location" on the new machine.
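
If you move the database often, the copy step can be scripted. The following Python sketch assumes "Use Program Exe Location" is already checked, so OcrMap.bin sits next to the exe; the `copy_ocr_database` helper and the folder names in the demo are made up for illustration, and only the two file names come from the steps above.

```python
import shutil
import tempfile
from pathlib import Path

def copy_ocr_database(program_dir, target_dir, as_original=False):
    """Copy OcrMap.bin out of the program folder; optionally name the copy OcrMapOrig.bin."""
    source = Path(program_dir) / "OcrMap.bin"
    if not source.is_file():
        raise FileNotFoundError(f"no OcrMap.bin found in {program_dir}")
    name = "OcrMapOrig.bin" if as_original else "OcrMap.bin"
    target = Path(target_dir) / name
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)     # copy2 also preserves the file's timestamps
    return target

# Demo with throwaway folders standing in for the two machines.
with tempfile.TemporaryDirectory() as tmp:
    old_machine = Path(tmp) / "old_machine"
    old_machine.mkdir()
    (old_machine / "OcrMap.bin").write_bytes(b"demo database")
    copied = copy_ocr_database(old_machine, Path(tmp) / "usb_stick")
    copied_name = copied.name
    copied_ok = copied.read_bytes() == b"demo database"
```

Pass `as_original=True` when the copy is meant to replace OcrMapOrig.bin inside the zip before installing; use the default when overwriting OcrMap.bin on a machine where the program has already been run once.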

speedoflight

1st November 2012, 02:21

Hi, I'm new here and I just discovered this great OCR tool.

I just have one "tiny" problem. My language is Spanish (well, not exactly: it's Castilian, which is not the same as universal Spanish, but it has the same alphabet anyway), and in some subtitles (I don't know the reason, but it happened in 2 of the 4 subtitles I tried) I have the problem with the "i" and "¡" characters, even though I'm using the latest version of the program. So far that's the only problem I've encountered, but when it happens I need to use another OCR program, because sometimes it's almost impossible to correct every "¡" to "i" or vice versa. By the way, I'm using SubExtractor to OCR Blu-ray subtitles.

I don't know if there are still problems with that fix, or if it's not yet 100% fixed.

Oh, another thing: it would be great to be able to make the preview screen bigger (or to maximize the window, or something like that). I work on a high-resolution PC screen and the preview is so small that I can't see anything at all.

Anyway, thanks for this OCR tool, the best I've tried so far.

Tappen

1st November 2012, 06:19

errantkkn: I've put a test release up on CodePlex (http://subextractor.codeplex.com/releases/view/97076) with the option to wait for an Enter key before taking OCR matches. It is toggled with the check box or F2. Please try it out.

Tappen

1st November 2012, 06:21

speedoflight: Did you check the "i and ¡ per movie" checkbox in Options? The fix for this issue isn't always on as it would slow down other users so I made it an option.

speedoflight

1st November 2012, 14:15

Tappen wrote: "speedoflight: Did you check the 'i and ¡ per movie' checkbox in Options? The fix for this issue isn't always on as it would slow down other users so I made it an option."

Not exactly. I had another version of the program before I downloaded the latest one posted in this thread; it had the option implemented, and it was checked, but it still didn't work.

So I discovered this thread and downloaded the latest version from the link in the first post, thinking that maybe I had the wrong one. It turns out the version I downloaded from here doesn't have the option that the other one had. I assumed the fix was built into the program and there was no option anymore.

Is there supposed to be an option anyway?

Thanks in advance (and by the way, sorry for my cute English ^^).

errantkkn

1st November 2012, 17:31

Perfect. Thanks Tappen.

Tappen

1st November 2012, 18:54

speedoflight: The only release that has this fix is "Release 1029", which is currently what you get from the big download button from the main page of the project (http://subextractor.codeplex.com/). Once you unzip this download, run the exe, and in Options, on the "Special Characters" page, check the option "Check for i and ¡ per movie (Spanish Specific Fix)" before you OCR.

Also, if you'd send me the sup files that are giving a problem I can try to work on this myself. I'll send you a private message.

speedoflight

1st November 2012, 19:43

Yes, that's the one I tried, and I'm still having issues with "i" and "¡". I don't have any sub I'm working on right now, so I'll send you one the next time one gives me trouble =). But one thing is for sure: the version of the program I'm using is 1.0.29, the one downloaded from the CodePlex page.

Thank u.

fantasmanegro

14th November 2012, 16:59

First try. Just to say, I really like the app and the results. Does anybody know how to parse just an idx/sub file?

Tappen

14th November 2012, 21:41

fantasmanegro: On the menu "Jump to" -> "Load Subtitle File" I think it's named (will be different next release so I can't look it up right now). Click Browse and open the IDX file (can multi-select if you want). OCR, Spelling and Save steps as usual.

fantasmanegro

15th November 2012, 14:24

thank you very much!

Tappen

20th November 2012, 01:07

Thanks for the great comments, TheRancher. I agree with you about the Cons and am working on it.

"Program window is kind of big": This is a UI style that I prefer (not burying features in menus) but it gives things a cluttered look. People have compared my form designs to 747 cockpits. I'd like to simplify but I'm worried that any feature not on the main window will never be found.

"Lack of localization": On the todo list some day soon.

"You have to extract all the content from the DVD": Not really true. On the "Re-Encode Tracks" dialog by default the "Create Movie File(s)", "Create Subtitle Data File(s)" and "Create DgIndex Files(s)" options are all checked. If you just want to OCR the subtitles and are not going to re-encode the video and audio just choose "Create Subtitle Data File(s)" and de-select the other 2.

"Unknown application name.": Yep this is a source of confusion. After I added BluRay support leaving DVD in the name seemed misleading. I'm trying to move to just SubExtractor but it takes a while.

Chetwood

20th November 2012, 07:08

Is there an option to skip an entire subtitle when batch OCR'ing?

Tappen

21st November 2012, 01:31

The "Review and Correct OCR Matches" dialog looks at the entire database but only lists the matches that were used in OCRing the current movie. Deleting a match (or split) will apply to the current and all future movies. If you find a mistake in a sub after exiting SubExtractor you can run the OCR again on that movie's bin/idx/sup file and then use this feature to fix the mistake in the database so it doesn't appear in future movies.
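
As a rough picture of that behaviour, here is a minimal Python sketch. The data layout is an assumption made for illustration (the real OcrMap.bin format is not described in this thread), and `matches_for_review` and `delete_match` are hypothetical names: one shared database, a per-movie set of the match IDs that the OCR run actually used, and a delete that removes the entry for everyone.

```python
def matches_for_review(database, used_ids):
    """List only the database entries that the current movie's OCR run relied on."""
    return {match_id: database[match_id] for match_id in sorted(used_ids) if match_id in database}

def delete_match(database, match_id):
    """Remove a bad match from the shared database, so future movies no longer see it."""
    database.pop(match_id, None)

# Hypothetical shared database: match ID -> recognized text.
database = {1: "t", 2: "l", 3: "lt", 4: "e"}
used_by_this_movie = {2, 3}          # IDs recorded while OCRing the current file

review_list = matches_for_review(database, used_by_this_movie)
delete_match(database, 3)            # the wrong "lt" match is gone for every later movie
after_delete = matches_for_review(database, used_by_this_movie)
```

Re-running the OCR on the same file rebuilds the used-ID set, which is why the mistake can still be found later without browsing the whole database.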

Hard-subbed video files are a whole different issue. Separating the subtitles from the background would be at least as difficult as all the work done so far on this project. So, sorry, there are no plans to start work on that.

Tappen

21st November 2012, 01:34

Chetwood: there currently isn't a way to skip a file when doing batch OCR. Maybe in the "Jump To" menu I could put an item to "End OCR and Skip to Next" if there is a next file to go to (part of a batch).

Tappen

22nd November 2012, 00:06

"That means if I lose a DVD and .bin file or I don't know on which DVD I made the mistakes, I won't be able to fix them? I know I am exaggerating, but it would be nice if you could preview the whole database and eventually edit some wrong characters, apart from showing the subtitles in the current movie. A mistake can easily be made without even knowing it."

Most letters have over 1000 matches, some over 2000 just in the default database. If I showed them all you'd probably never find an error. It just seems like a pointless feature without applying a filter such as per-movie to narrow down the list.

speedoflight

6th December 2012, 14:57

Hi. About the Spanish problem with the ¡ and i characters: I tried another OCR tool (the one that comes with Subtitle Edit), and even though the tool itself is awful, once I changed the color palette it recognized ¡ and i perfectly. The only problem is that it's one of the slowest OCR tools I've tried (I spent about 5 hours and didn't finish half of the movie...).

I tried the same thing in DvdSubExtractor, but changing the palette didn't work. I'm mentioning this to see if there's something we can do about it. It's very weird, since this is the only OCR tool I've tried that works fine in most cases. SubRip doesn't work with the subs I'm trying, and other OCR tools either can't open idx files or give awful results.

Cheers.

Tappen

7th December 2012, 02:02

speedoflight: I've been looking at the problem. I haven't given up. Someone sent me a Bluray subtitle where I reproduced it, and if you could send me a sub/idx with the same problem it could help me come up with an answer.

speedoflight

7th December 2012, 02:45

I already sent you a PM days ago with the subtitle, and you said you would look into it but that it would be hard to fix. It was in sub/idx format...

deco20

7th December 2012, 12:32

1) In the "Choose Tracks" step there's a "Subtitle Tracks" listbox. Its scrollbar doesn't work because the control is disabled. If there are more subtitle tracks than fit, I can't see whether the track I want to OCR is there.

2) In the "Spelling and Spacing" step, once spellchecking is complete after OCR, there's no way to undo a mistake made in the last word.

3) I don't use the "Create movie file" and "Create DgIndex file" options, so maybe you could add a setting that lets the user choose which kinds of files the program should create, so I don't have to uncheck them every time.

Tappen

8th December 2012, 00:32

deco20:
1) I always save all the subtitle tracks so there's no reason to select them. It's working as designed.
2) I'll look into it. I suspect you're right though.
3) Good point. I should make those checkboxes remembered from movie to movie.

deco20

8th December 2012, 08:05

Tappen wrote: "1) I always save all the subtitle tracks so there's no reason to select them. It's working as designed."

I wasn't writing about selecting, but about simply scrolling the listbox. Right now it's disabled. Sometimes I don't know whether my language is there, and I want to be sure it exists.

pball

8th December 2012, 21:58

Firstly, I love your program. I've done a bunch of encoding and subs have always been a pain and your program makes it much better.

There are just a few things I've noticed while OCRing this sup file.

Would it be possible to let us edit matches we have entered? I've hit the wrong key a few times, and it'd be nice to be able to change the matched character instead of just deleting it.

Also, after making changes to the manual matches, could there be an option not to rescan from the beginning? I screwed up more than a few times, and it'd be nice to wait and rescan after doing a complete first run.

I did a spell check of the subs I just ripped and other than having f in place of some g's (which was probably my bad) it did quite a good job. Very few I and L mistakes.

One last thing: would it be possible to have user-adjustable sensitivity for the OCR? I had to enter around 10 different matches for just about every letter, and sometimes the sub characters can't have had more than a few pixels' difference.

speedoflight

9th December 2012, 19:24

I agree with pball's first point, but thanks to this program's fast OCR engine, you usually don't have to wait long before you encounter the required match again.

I agree with the second point as well.

About the last point, I think the sensitivity of the program is just amazing. The problem is that VobSub subtitles, for example, are a pain no matter what OCR program you use. I'm in the same situation: sometimes I can't use the program because it doesn't recognize the "¡" and "i" characters of my language's alphabet. Maybe more sensitivity, as you say, could solve this, but I don't think it's that easy. And it's quite normal to need to enter the same character over and over again, especially with VobSub subtitles; it's happened to me a lot of times. Still, this program is incredibly fast. With other programs you need as much as an hour for one OCR run (and even more with VobSub subtitles, lol); with this one you need only 10 minutes or not much more... amazing. It would be perfect if the problem I'm having weren't there =)

pball

10th December 2012, 03:41

I just want to add that the first thing I used this to OCR was a Blu-ray sub file and, as I said before, I had to enter many duplicates. But just now I did an older DVD with LOADS of signs and I was amazed at how well it did. Not only did I have to enter only 20-30 or so characters, but (something I didn't realize at first) it locates signs. The last time I attempted to OCR these subs with SubRip I had to practically type the whole thing out, and the end format was horrible with all the signs and such.

speedoflight

10th December 2012, 21:38

I noticed something that seems to be a little bug. When I save the subtitles in .ASS format, the program always produces some duplicate lines that add the position the subtitles should have on the screen, instead of just adding the position to the proper lines.

When there are only 4 or 5 such lines it's easy to fix, but when there are 50, it's a frustrating task. In those cases I have to save in .SRT because fixing them wastes too much time. But of course I prefer the ASS format with the correct positions on the screen.

I should add that it doesn't matter whether I deactivate the program's default subtitle position correction or the other options; the result is always the same.

Am I doing something wrong?
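
Until this is looked at, a small script can do the cleanup. This Python sketch rests on an assumption about what the duplicates look like, namely two Dialogue events with the same start time, end time and text, where only one carries a \pos override tag; if the saved files differ from that, it will need adjusting. The `drop_duplicate_events` function is a made-up helper, not part of SubExtractor.

```python
import re

OVERRIDE_TAGS = re.compile(r"\{[^}]*\}")   # ASS override blocks such as {\pos(640,100)}

def drop_duplicate_events(lines):
    """Keep one Dialogue line per (start, end, text), preferring the copy that has \\pos."""
    kept = []
    index_of = {}                    # (start, end, plain text) -> position in `kept`
    for line in lines:
        fields = line.split(",", 9)  # the 10th field is the text and may contain commas
        if not line.startswith("Dialogue:") or len(fields) < 10:
            kept.append(line)        # headers, styles and anything unexpected pass through
            continue
        key = (fields[1], fields[2], OVERRIDE_TAGS.sub("", fields[9]).strip())
        if key not in index_of:
            index_of[key] = len(kept)
            kept.append(line)
        elif "\\pos(" in line and "\\pos(" not in kept[index_of[key]]:
            kept[index_of[key]] = line   # swap in the positioned copy
    return kept

sample = [
    "[Events]",
    "Dialogue: 0,0:00:01.00,0:00:03.00,Default,,0,0,0,,Hello, there",
    "Dialogue: 0,0:00:01.00,0:00:03.00,Default,,0,0,0,,{\\pos(640,100)}Hello, there",
    "Dialogue: 0,0:00:04.00,0:00:05.00,Default,,0,0,0,,Bye",
]
cleaned = drop_duplicate_events(sample)
```

To use it on a real file, read the lines, pass them through the function and write the result to a new file, keeping the original until the output has been checked in a player.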