25th September 2011, 23:54 | #1 | Link |
Registered User
Join Date: Dec 2006
Posts: 196
|
SubExtractor - New Sub Ocr App
I've released an app to extract subs from DVDs (non-encrypted, on a hard drive) and convert them to Advanced Substation Alpha or SRT format. It can also convert sup (PGS) and sub/idx formats to the same. I wrote this because I hate the blocky, too-high-on-the-screen look of regular DVD subtitles and wanted to re-encode my DVD collection as h264/aac/assa in an mkv container.
http://subextractor.codeplex.com/
It's a wizard-style app that lets you pick program chains, angles, audio and subtitle tracks from a DVD folder and create mpg, d2v and bin (my own data format, similar to sub/idx combined) files for each. DGIndex is used to help line up the subs with the video, since DVD programs often have discontinuities that mess up sync. The mpg and d2v files it creates are great for further re-encoding of DVDs to h264 using a tool like MeGui.
The OCR is pretty basic, just exact pattern matching of the characters. The starting OCR database is good though, so most DVDs should require manual matching of just a few characters. Some characters like i, l, I, '.', and o must be manually matched for every DVD since they produce a lot of false positives. Some Bluray sup files can be tedious to OCR because the Bluray authors used scaled-up fonts, which means there end up being 5 or more bit patterns to match for each character. Persistence pays off if you get one of those files, so just keep matching.
The line and word layout functions are pretty sophisticated and should give good results unless the characters are very unusual (vertical or upside-down text is bad). Last edited by Tappen; 28th December 2012 at 21:18. |
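For readers wondering what "exact pattern matching" means in practice, here is a minimal sketch in Python. It is not SubExtractor's actual code: the bitmap representation, the `recognize` function and the tiny 3x3 glyphs are all assumptions made for illustration. The idea is that each glyph bitmap is looked up in a database of known bitmaps, and anything that differs by even one pixel is handed back for manual matching.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: one string per pixel row, '#' = ink, '.' = background.
Bitmap = Tuple[str, ...]


def recognize(glyphs: List[Bitmap], known: Dict[Bitmap, str]) -> Tuple[str, List[int]]:
    """Return the recognized text plus the indexes of glyphs that need manual matching."""
    text = []
    unmatched = []
    for index, glyph in enumerate(glyphs):
        char = known.get(glyph)  # exact match only: one differing pixel is a miss
        if char is None:
            unmatched.append(index)
            text.append("?")  # placeholder until the user supplies the character
        else:
            text.append(char)
    return "".join(text), unmatched


# Made-up 3x3 bitmaps standing in for real subtitle glyphs.
GLYPH_T = ("###", ".#.", ".#.")
GLYPH_L = ("#..", "#..", "###")
GLYPH_SCALED_T = ("###", ".#.", ".##")  # same letter, slightly different rendering

database = {GLYPH_T: "T", GLYPH_L: "L"}
text, unmatched = recognize([GLYPH_T, GLYPH_L, GLYPH_SCALED_T], database)
```

Once the user matches the third glyph by hand (`database[GLYPH_SCALED_T] = "T"`), the same call recognizes all three. This also illustrates why scaled fonts are tedious: every slightly different rendering of a letter needs its own database entry.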
27th September 2011, 01:59 | #3 | Link |
Registered User
Join Date: Dec 2006
Posts: 196
|
Yes it can also export to srt, though of course that's a much more limited format (no colors, positioning, etc).
Also, the first 3 steps of the wizard are kind of like an easier-to-use version of ifoedit: they produce an mpg (mpeg-2 program stream) file of just the angles and tracks from the DVD you want to re-encode. |
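To illustrate what gets lost when going from ass to srt, here is a rough Python sketch. It assumes the usual ASS conventions (override tags inside `{...}` blocks, `\N` for a line break); the function name `ass_text_to_srt` is made up, and this is not how SubExtractor itself does the export.

```python
import re

# ASS override blocks such as {\pos(320,40)} or {\c&H00FFFF&} carry position and color.
OVERRIDE_BLOCK = re.compile(r"\{[^}]*\}")


def ass_text_to_srt(text: str) -> str:
    """Drop ASS override blocks and turn ASS line breaks into real newlines."""
    plain = OVERRIDE_BLOCK.sub("", text)
    return plain.replace("\\N", "\n").replace("\\n", "\n")


line = ass_text_to_srt(r"{\pos(320,40)\c&H00FFFF&}Hello\NWorld")
```

The positioning and color information is simply thrown away, which is the limitation mentioned above.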
27th September 2011, 08:41 | #5 | Link |
Telewhining
Join Date: Mar 2010
Posts: 272
|
I ran Ice Age 3 through it and I must say, it was painless. Worked extremely quick and I can't find any OCR errors. This is definitely my favorite subtitle OCR utility! Well done!
A few ideas:
1) Being able to type the text instead of clicking it would be nice, but not a huge deal as the recognition is excellent.
2) My default "save" directory was in the "My Videos" folder. It would probably be easier if it defaulted to the current working directory.
3) The other issue is that on some subtitles the alignment is a little off. Not a huge deal, but it would be nice if there was a feature that allowed you to "align" text blocks to the same left-side position. Here's an example:
edit: also the ability to set the OCR bin file to the program directory for portable use. Last edited by nibus; 27th September 2011 at 08:53. |
27th September 2011, 12:01 | #6 | Link | |
Registered User
Join Date: Jan 2006
Location: Athens, Greece
Posts: 1,518
|
Quote:
Also, neither .ass file can be loaded in Aegisub. I get "error processing line: style: blah blah blah". Samples: http://www.mediafire.com/?eh78xxcdoc9siw0 Finally: what about subtitles in languages other than English? How do I insert foreign letters? |
|
27th September 2011, 13:06 | #7 | Link |
Registered User
Join Date: Dec 2006
Posts: 196
|
nautilus7: I'll look into your issues, thanks for the source files. srt output is a feature I didn't work on much, so I'm not surprised I missed some things. They should be easy fixes though.
nibus: good suggestions.
1. I've thought about adding an "enter matching text manually" textbox myself. Hopefully I can do it without messing up the flow of the OCR.
2. I worry that the files will be installed in a directory where the user doesn't have write access without Windows bringing up a UAC dialog, so I went with My Videos. Maybe I should check whether the current directory is writable and make that the default if so.
3. I have an option to "Exactly Position every Line" when creating ass files, which turns off the processing that allows text which is centered and in the lower third of the screen to use the default position of ass renderers. But that doesn't solve the left alignment problem. I could add a "left-align" checkbox, but then all text would be left-aligned and would probably (since the source and destination have different widths) look bad in a different way. |
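The "check if the current directory is writable" idea from point 2 could look roughly like this in Python. This is only a sketch of the approach, not what the app does; `is_writable` and `default_output_dir` are made-up names. It probes by actually creating a temporary file, since permission flags alone can be misleading.

```python
import tempfile
from pathlib import Path


def is_writable(directory: Path) -> bool:
    """Try to create (and automatically delete) a temporary file in the directory."""
    try:
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:  # covers missing directories and permission errors
        return False


def default_output_dir(current: Path, fallback: Path) -> Path:
    """Prefer the current directory, otherwise fall back (e.g. to My Videos)."""
    return current if is_writable(current) else fallback
```

One caveat: a probe like this only shows that a write succeeded. It cannot tell whether Windows redirected the file to some other location behind the scenes.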
27th September 2011, 16:44 | #10 | Link |
Registered User
Join Date: Dec 2006
Posts: 196
|
nibus:
I added the ability to enter the OCR character match manually in 1008. See how you like the UI. I couldn't figure out how to deal reliably with UAC in Windows Vista and 7 so that the output and OcrMap directories default to the current app directory when it's sensible to do so. If you install (copy) into a "Program Files" sub-directory on those operating systems, Windows silently moves files created by the program somewhere else that is very hard for the user to find. So I haven't changed the default directories yet. |
27th September 2011, 16:58 | #12 | Link |
Registered User
Join Date: Jan 2006
Location: Athens, Greece
Posts: 1,518
|
Ok, I see. But let me say this: your program is currently the only OCR program in active development that can read Blu-ray subs (the other is suprip, but it is dead), and as of the 2nd public release it can output perfect English subs, at least, which is something suprip still can't do... So I think response will go high! :P
|
28th September 2011, 16:13 | #13 | Link | |
Registered User
Join Date: Sep 2008
Posts: 365
|
Quote:
|
|
28th September 2011, 16:51 | #14 | Link |
Registered User
Join Date: Jan 2006
Location: Athens, Greece
Posts: 1,518
|
Nice! Wasn't aware of this. I'll have a look.
@Tappen a few suggestions: Almost every time SubExtractor finds a ." or ," letter combination in italic writing, it puts a space between the two characters. Maybe some optimization can be done there so the user doesn't have to fix the space with the "advanced word spacing" feature. Also, sometimes an unwanted space is placed after 1 (also in italic). |
28th September 2011, 18:39 | #15 | Link |
Registered User
Join Date: Dec 2006
Posts: 196
|
The accuracy of the space detection depends on the font kerning, and so is different for every font the disc subtitle authors use. I haven't found . , or 1 characters in italics to have a lot of problems with the samples I've OCR'd, but it's fair to say that it's very rare for there to be a space in front of . or , and very common to have a space after the same, so maybe I'll tilt the base adjustments by 1 pixel in that direction.
How much do you have to change the left and right adjustments in "advanced word spacing" around those characters to fix the problem? |
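A rough Python sketch of how per-character spacing adjustments like these could work. Everything here is assumed for illustration (the `BASE_GAP` value, the adjustment tables, the glyph tuples); it is not the actual SubExtractor spacing code. A gap between two glyphs counts as a space when it reaches a threshold, and the threshold is nudged by 1 pixel around periods and commas.

```python
from typing import Dict, List, Tuple

Glyph = Tuple[str, int, int]  # (character, left edge x, right edge x) in pixels

BASE_GAP = 6  # hypothetical: gaps this wide or wider count as a space

# Hypothetical per-character tweaks, in pixels. A positive "before" value makes a
# space in front of the character less likely; a negative "after" value makes a
# space following the character more likely.
BEFORE_ADJUST: Dict[str, int] = {".": 1, ",": 1}
AFTER_ADJUST: Dict[str, int] = {".": -1, ",": -1}


def join_with_spaces(glyphs: List[Glyph]) -> str:
    """Rebuild text from positioned glyphs, inserting spaces where gaps are wide."""
    out = []
    for i, (char, left, _right) in enumerate(glyphs):
        if i > 0:
            prev_char, _prev_left, prev_right = glyphs[i - 1]
            gap = left - prev_right
            threshold = (BASE_GAP
                         + BEFORE_ADJUST.get(char, 0)
                         + AFTER_ADJUST.get(prev_char, 0))
            if gap >= threshold:
                out.append(" ")
        out.append(char)
    return "".join(out)


sample = join_with_spaces([("a", 0, 8), (".", 14, 16), ("b", 21, 29)])
```

With these made-up numbers, the 6 pixel gap before the period no longer counts as a space, while the 5 pixel gap after it does. Because real fonts have different kerning, any fixed values like these will suit some discs and not others.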
29th September 2011, 00:38 | #17 | Link |
Registered User
Join Date: Dec 2006
Posts: 196
|
nautilus7: I don't see the problems when I run your .sup files. I don't see any extra spaces in front of periods or commas, or after 1s. Did you accidentally un-check "1080p Adjustments" on the Create Subtitles page? I notice that your *.ass files have the DVD (480p) default margins and font sizes instead of the Bluray (doubled) values.
|
29th September 2011, 00:55 | #18 | Link |
Registered User
Join Date: Jan 2006
Location: Athens, Greece
Posts: 1,518
|
In the eng.sup file I sent you, you can see the following:
Space between . and " in italic writing.
Space after 1 in italic writing.
In the watchmen.dc.eng.sup file I sent you, you can see:
Space between . and " in italic writing.
Space between , and " in italic writing.
Last edited by nautilus7; 29th September 2011 at 00:58. |
29th September 2011, 01:16 | #19 | Link |
Registered User
Join Date: Dec 2006
Posts: 196
|
I see. I think the problem is with " rather than . or , for the first issue. Not much I can do as many subtitle fonts (whatever the Bluray or DVD authors used to generate the bitmaps that I'm OCRing, not the font we're using in the output files) have tighter spacing around " and 1 italic characters than we're seeing here. Fixing your problem would probably break a bunch of other sup files. I'm just going to have to admit that I can't do perfect word spacing. Personally I usually run the subs I produce through the Aegisub spell checker to catch and fix any repeated errors. It would be great to be perfect but I don't think it's going to happen.
One thing I'm considering is that I've seen quite a few errors with numbers. I might auto-adjust the spacing rules so that 2 numbers next to each other can't have a space in between. It's a really visually jarring error that may be worth some extra work to avoid. I've also considered a rule where I automatically add a space before the 1st, 3rd, etc. double-quotes, and remove any space after them, and do the reverse for the 2nd, 4th etc. double-quotes. But sometimes quotes don't work exactly like that - they're continued from the previous subtitle and the order is reversed. I'd hate to deliberately mess up those cases. Last edited by Tappen; 29th September 2011 at 02:13. |
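The two rules being considered here can be sketched as text post-processing in Python. This is an illustration of the idea only, with made-up function names, operating on already-recognized text rather than on pixel gaps as the app would. The closing-quote rule below only adds a space when a letter or digit follows, which is an extra assumption so that punctuation after a quote is left alone.

```python
import re


def join_digits(text: str) -> str:
    """Remove a single space that sits directly between two digits."""
    return re.sub(r"(?<=\d) (?=\d)", "", text)


def fix_quote_spacing(text: str) -> str:
    """Treat the 1st, 3rd, ... double quotes as opening and the 2nd, 4th, ... as closing."""
    parts = text.split('"')
    last = len(parts) - 1
    fixed = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            fixed.append(part.strip())  # inside quotes: no padding next to the quotes
            continue
        part = part.strip()
        if part and i > 0 and part[0].isalnum():
            part = " " + part  # space after a closing quote
        if part and i < last:
            part = part + " "  # space before an opening quote
        fixed.append(part)
    return '"'.join(fixed)


digits = join_digits("It was 1 984.")
quoted = fix_quote_spacing('He said" hi there "to me')
```

The quote rule assumes every subtitle starts outside a quotation. A subtitle whose first double quote actually closes a quotation continued from the previous subtitle would have its spacing applied the wrong way round, which is exactly the risk described above.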
2nd October 2011, 11:26 | #20 | Link |
Registered User
Join Date: Sep 2006
Posts: 2,197
|
Would it be possible to change the order in which the program asks for characters to be OCR'ed, so that it sticks to horizontal lines?
E.g. when a subtitle consists of two or more lines, all characters from the words of the first line should be asked for first, and only then the characters from the next line. At the moment the program keeps going along a vertical axis, and this is quite irritating. Also, the program seems to halt when I choose a character from the Windows character map which is not among the few characters your program presents on screen (it does not crash, but I cannot seem to proceed unless I undo that choice and choose one of the characters suggested in your list; in this particular case it's the 'em dash').
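The requested ordering (line by line, then left to right) could be sketched like this in Python. The box format and the `reading_order` function are assumptions for illustration and say nothing about how the program stores its characters internally. Boxes are grouped into a text line when their vertical center falls inside the line's current vertical extent.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height) in pixels


def reading_order(boxes: List[Box]) -> List[Box]:
    """Group boxes into text lines by vertical position, then sort each line left to right."""
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[1]):  # visit boxes from top to bottom
        _, top, _, height = box
        center = top + height / 2
        for line in lines:
            line_top = min(b[1] for b in line)
            line_bottom = max(b[1] + b[3] for b in line)
            if line_top <= center <= line_bottom:
                line.append(box)  # belongs to an existing line
                break
        else:
            lines.append([box])  # starts a new line
    ordered: List[Box] = []
    for line in lines:
        ordered.extend(sorted(line, key=lambda b: b[0]))
    return ordered


# Two lines of two characters each, given in column-by-column order.
column_order = [(0, 0, 8, 10), (0, 20, 8, 10), (10, 2, 8, 8), (10, 22, 8, 8)]
row_order = reading_order(column_order)
```

In `row_order`, both characters of the top line come before either character of the second line.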
__________________
Laptop Lenovo Legion 5 17IMH05: i5-10300H, 16 GB Ram, NVIDIA GTX 1650 Ti (+ Intel UHD 630), Windows 10 x64, madVR (x64), MPC-HC (x64), LAV Filter (x64), XySubfilter (x64) (K-lite codec pack) Last edited by Thunderbolt8; 2nd October 2011 at 11:31. |