View Full Version : SubExtractor - New Sub Ocr App
Thunderbolt8
12th December 2011, 21:15
you mean the rule has to be the same in both cases?
dunno, you'd have to manually correct one of these cases anyway then. just by numbers, the case where the single hyphen is wrong happens way more often than a double hyphen occurs in a movie. but in subtitleworkshop, fixing the single hyphen is just one click, no matter how many of them there are.
not sure what's with the double hyphen, afaik it would be accidentally "fixed" automatically in the same way, so you'd have to "re"correct that one manually afterwards. not sure what happens though when the check for "-" in subtitles with one line is ticked for automatic correction as well: if one of the two hyphens is removed, then you most likely won't find that line again after that. so you'd have to look out carefully for which settings are enabled.
anyway, when saving directly to .ass while keeping exact position settings, not sure if subtitleworkshop can be used then without breaking stuff like position or font properties. in that case, automatically adding the space after the single hyphen should mean less manual fixing than not adding it, because the case of two hyphens in a row happens less frequently.
so I'd say that vote goes for automatically adding the space between hyphen and the beginning of the rest of the line.
in case I misunderstood and the question was whether I generally think that a space belongs between the one hyphen and the rest of the line or not: yes, it does.
Tappen
13th December 2011, 01:13
Your last paragraph is right: I was asking what you generally prefer as a style. I'm in agreement, so next release I'm going to force a space after any hyphen that starts a line anywhere on the screen, and make sure the hyphen is italicized the same as the next (non-space) character in the line. I don't think it's worth an option; this way just looks better.
What about double hyphens and em dash characters that start a line? Force no space after them? Or maybe 1 space after double hyphens but no space after an em dash (wide hyphen)?
Thunderbolt8
13th December 2011, 06:29
em dash and double hyphen should be treated the same way, because sometimes one of them is used in place of the other (depending on the specific subtitle file), but both serve the same function in the end.
which means at the end of a line both are possible: a space between the last word/letter and the em dash/double hyphen, or none. this is down to how the subtitle track was created and imho it can be kept the way it is.
for the beginning, there's usually only the case with no space, as it indicates an abrupt change of speaker, or that we only hear what a speaker has to say from the middle of his sentence.
Chetwood
13th December 2011, 07:25
Concerning hyphens/dashes, I'd think having an option to keep as close to the original would be nice. Most subs I've seen do have a dash followed by space but some don't and they are still readable cause they usually have a decent font that allows for this.
Concerning folders: my output folder is set to c:\test and "Store sup/idx file outputs in source directory" is checked. However, when I open a Vobsub in e:\movies it does save back to e:\movies but next time I open Subextractor it still wants to open from c:\test instead of e:\movies.
Tappen
13th December 2011, 08:14
Chetwood:
Since we're changing fonts when we do the OCR I think adding spaces after line-beginning hyphens is a good idea. Certainly if people keep the Tahoma font I have as the default for SubExtractor a space looks much better than no space.
I see the problem with directories. I don't remember the directory when you stop and re-start SubExtractor, just while it's running. Wouldn't be hard to remember between restarts as well I suppose.
Tappen
14th December 2011, 02:25
Various fixes are in the test release: http://subextractor.codeplex.com/releases/view/78727
Thunderbolt8
14th December 2011, 23:46
regarding the latest version and this "program will always add a space after a hyphen or em dash that begins a line": the em dash is basically like a -- (double hyphen; some tracks use the em dash, some use a double hyphen), so if possible, there shouldn't be a space at the beginning of a line if an em dash is used, because there wouldn't be a space for a double hyphen either.
Tappen
15th December 2011, 00:47
OK I'll take your word for it, Thunderbolt8. I'll change it so there's always a space after a single hyphen but not after double hyphen or em dash.
Thunderbolt8
15th December 2011, 00:59
thanks
Tappen
15th December 2011, 07:15
http://subextractor.codeplex.com/releases/view/78808
has the next test release code.
Confucio's Post-OCR bug fix is in (issue 757 on codeplex). No forced space added when text changes from italic to normal or vice versa.
Leading double hyphens and em dashes now will have no spaces after them, while single hyphens will always have a space.
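A minimal sketch of the spacing rule described above, written in Python rather than the program's own C#. The function name is hypothetical, and stripping existing spaces after a leading double hyphen or em dash is an assumption about what "no spaces after them" means in practice. Italic handling is left out.

```python
import re

def normalize_leading_dash(line: str) -> str:
    """Apply the leading-dash spacing rule to one subtitle line (illustrative only)."""
    # Leading double hyphen or em dash (U+2014): no space after it.
    match = re.match(r"(--|\u2014)[ ]*", line)
    if match:
        return match.group(1) + line[match.end():]
    # Leading single hyphen: exactly one space after it.
    match = re.match(r"-[ ]*", line)
    if match:
        rest = line[match.end():]
        return "- " + rest if rest else "-"
    # Anything else is left untouched.
    return line
```

The double-hyphen check has to run first, otherwise "--" would be treated as a single hyphen followed by text.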
Confucio
19th December 2011, 16:32
Thanks Tappen. Release 1024 works perfectly regarding the space on italics.
JoseFina54
27th December 2011, 13:21
:thanks:
Tappen
22nd January 2012, 21:40
1025 is out: A minor fix for italic/non-italic spacing, a new option for Create Subtitle that keeps the original line breaks in all cases, more tooltip help text.
Next up is more documentation: I'm thinking 2 tutorials for DVDs using either Handbrake or MeGui, and 1 tutorial for Blurays with Eac3to and MeGui
Also I'd like a proper install program. Some users are asking for at least an automatically created desktop icon.
Thunderbolt8
22nd January 2012, 23:16
I don't have much time atm to spend on blu-rays/subs, but will continue to report errors if I run into any.
thanks for the new update
Chetwood
23rd January 2012, 10:34
Also I'd like a proper install program. Some users are asking for at least an automatically created desktop icon.
Personally I hate automatically installed icons. You should give Inno Setup (http://www.jrsoftware.org/isinfo.php) a try which I used for the German version of DVD Shrink. It's free and you can set a lot of options that people can choose from during installation like desktop icon, quick launch icon, etc. Cool thing is, you can check them as default so they get installed automatically but people can also simply uncheck them if they don't need em.
And what about tutorials on Handbrake and MeGUI? There are a lot of tutorials already covering how to import subs ripped/converted with an external program, or am I missing something? Please include info on how to batch rip several subs (not necessarily from the same TV show). I'm still not entirely sure why some chars already OCR'ed for the German stream of a VobSub have to be re-OCRed for the English stream of the very same sub despite having the same font/color/size.
ben_franklin
22nd February 2012, 03:18
Just used it. Awesome job Tappen!!! Thank you very much for this app!
wilfried
22nd February 2012, 10:12
Thanks for a great program!
However, in 1025 (and 1024) there is a bug.
If you work on a dvd which doesn't have a language code set on the sub track, you get an exception on the last page and you can't save the ass file.
Exception thrown at
Illegal characters in path.
mscorlib
at System.IO.Path.CheckInvalidPathChars(String path)
at System.IO.Path.Combine(String path1, String path2)
at DvdSubExtractor.CreateSubtitleFileStep.subtitleStyle_SelectedIndexChanged(Object sender, EventArgs e)
at DvdSubExtractor.CreateSubtitleFileStep.Initialize(ExtractData data)
at DvdSubExtractor.SubWizard.LoadCurrentStep()
at DvdSubExtractor.SubWizard.nextButton_Click(Object sender, EventArgs e)
Tappen
23rd February 2012, 00:48
OK wilfried, an easy fix. I'll release a new version soon.
ben_franklin
25th February 2012, 00:53
Apparently I spoke too soon. After doing several subs almost effortlessly I tried to do the subs for "black mask" bluray. They don't seem to be any different than other subs, yet I have to do identification on letters in every single sentence..... :(
Chetwood
25th February 2012, 08:39
I've seen something like this mentioned on a German forum too, can't remember the movie though. Does changing the palette make a difference?
Tappen
26th February 2012, 20:46
Apparently I spoke too soon. After doing several subs almost effortlessly I tried to do the subs for "black mask" bluray. They don't seem to be any different than other subs, yet I have to do identification on letters in every single sentence..... :(
My OCR fu is still pretty basic I'm afraid. If whatever authoring tool the bluray disc creators used for subtitles produces inconsistent characters (due to scaling most likely) then my OCR step will require a lot of manual work. This is why I spent so much time getting the manual entry user interface to be easy and efficient.
Basically I look for an exact match of the pixels between the characters in the database and the test character on the screen. If it's a bluray sup file I also shrink both the database and the test character by a factor of 3 in the x and y dimensions to try to find an approximate match. This results in 9 smaller patterns for each match and 9 for each test character (the shrinking can be done 9 different ways with different starting positions) for a total of 81 chances at a match. But there's nothing more complicated than that going on - I don't in any way understand the shape of the characters.
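The matching described above could be sketched roughly like this in Python. This is not the program's actual code: glyphs are assumed to be tuples of 0/1 rows, all function names are made up, and the rule that a shrunken cell is "on" when more than half of its pixels are on is an assumption, since the post does not say how the shrinking decides.

```python
from itertools import product

def shrink(bitmap, dx, dy, factor=3):
    """Downsample a 0/1 bitmap by `factor`, with the grid shifted by (dx, dy)."""
    height = len(bitmap)
    width = len(bitmap[0]) if bitmap else 0

    def pixel(x, y):
        # Coordinates are in the shifted frame; outside the glyph counts as blank.
        x, y = x - dx, y - dy
        return bitmap[y][x] if 0 <= x < width and 0 <= y < height else 0

    cols = -(-(width + dx) // factor)   # ceiling division
    rows = -(-(height + dy) // factor)
    shrunk = []
    for r in range(rows):
        row = []
        for c in range(cols):
            on = sum(pixel(c * factor + i, r * factor + j)
                     for i in range(factor) for j in range(factor))
            # Assumed rule: a cell is on when more than half of its pixels are on.
            row.append(1 if 2 * on > factor * factor else 0)
        shrunk.append(tuple(row))
    return tuple(shrunk)

def shrunk_variants(bitmap, factor=3):
    """All distinct shrunken patterns, one per starting offset (up to 9 for factor 3)."""
    return {shrink(bitmap, dx, dy, factor)
            for dx, dy in product(range(factor), repeat=2)}

def matches(known, test, allow_approximate):
    """Exact pixel match first; optionally fall back to the shrunken comparison."""
    if known == test:
        return True
    if not allow_approximate:
        return False
    # Up to 9 x 9 = 81 pairings: any shared shrunken pattern counts as a match.
    return not shrunk_variants(known).isdisjoint(shrunk_variants(test))
```

With `allow_approximate=True` (the bluray case), two glyphs that differ by a stray pixel or two can still match, which is the point of the shrinking step.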
nibus
12th March 2012, 08:19
I've had a weird issue where the program detects all "i" characters with "¡" (the upside down !). I can't figure out how to fix this. Is there a way to manually set a character assignment?
Tappen
12th March 2012, 21:34
When doing OCR there's a button in the bottom right labelled "Review and Correct OCR Matches". Press that.
Open the OCR Training drop-down list and find the bad match between i and the upside-down ! and press "Remove a Training".
Pressing "Remove all Trainings for this Character" would also work. Assuming you don't OCR Spanish much it would probably prevent future problems since it would eliminate all matches for the upside-down ! character currently in the database.
nibus
13th March 2012, 17:07
I actually tried that, but strangely there is no listing for the letter i. I also tried starting from a brand new OcrMap.bin file.
Tappen
14th March 2012, 20:49
If you delete your OcrMap.bin file the program will re-initialize it with the OcrMapOrig.bin file so that might explain why the 2nd fix didn't work.
The bad match would be for the upside down "!" character, not for "i". That character is probably at the beginning or end of the list - outside of the alphabet.
nibus
15th March 2012, 06:16
Strangely, the upside down "!" is matched correctly - I think it's the regular letter i that is being seen as the upside down "!". But like I said there is no letter i (lower case) listed, except in italics. I guess I could always just do a search and replace.
http://dl.dropbox.com/u/5637223/Clipboard02.jpg
http://dl.dropbox.com/u/5637223/Clipboard03.jpg
Tappen
17th March 2012, 07:04
Can't you just remove the training error (highlighted in the 2nd picture) and your problem is fixed? I don't understand what the problem is.
nibus
17th March 2012, 08:32
The training in that second screenshot isn't an error - it's correctly identified the upside down ! mark. So removing it has no effect on the incorrect training of matching a lowercase letter i with the same character. I would remove the training of the letter i with the upside down ! but it is not listed, as shown in the first screenshot.
Tappen
20th March 2012, 23:00
Ah, I finally see the problem. This is the same as the issue with l and I having the same bit pattern in many subtitle fonts making accurate matching impossible. I added the entire spellcheck step just to solve that issue.
I'll have to make an option in the spellcheck step to discriminate between i and ¡ to fix this. I suppose the rule is that if it's not at the beginning of a word, or just after a ¿ at the beginning of a word, I can assume it's an i (eye) and not an inverted exclamation point. Otherwise I'll have to ask and build up a dictionary of words that really begin with i. Quite a bit of work, but I'll see what I can do.
For now, I'd remove the training and when you next run the OCR choose i (eye) and not the inverted exclamation because there are likely more of the former than the latter making cleanup easier.
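A rough sketch of the rule proposed above, in Python. It assumes the OCR step has emitted ¡ for every ambiguous glyph and that a set of known i-words is available; the function name and the word list are hypothetical, and a real version would ask the user rather than silently keep ¡.

```python
import re

def resolve_inverted_exclamation(text, i_words):
    """Decide per occurrence whether an OCR'ed '¡' is really a lowercase i."""
    out = []
    for pos, ch in enumerate(text):
        if ch != "¡":
            out.append(ch)
            continue
        prev = out[-1] if out else ""
        # Not at the beginning of a word, or right after '¿': assume the letter i.
        if prev.isalpha() or prev == "¿":
            out.append("i")
            continue
        # Word-initial: only accept i if the resulting word is in the dictionary.
        rest = re.match(r"(?:[^\W\d_]|¡)*", text[pos + 1:]).group(0)
        candidate = ("i" + rest).replace("¡", "i").lower()
        out.append("i" if candidate in i_words else "¡")
    return "".join(out)
```

For example, with `i_words = {"is", "it"}`, the input `"Th¡s ¡s ¡t"` comes back as `"This is it"`, while `"¡Hola!"` is left alone because "ihola" is not a known word.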
aMvEL
22nd March 2012, 10:07
Any chance for a possibility to select more than one language at a time when ripping, or at least being able to filter out or prioritize languages from the Subtitle track selection list?
It would speed up the ripping for me, since I usually rip English and Norwegian subtitles. The problem is usually that the Norwegian subtitle track is near the bottom of the list which makes it tiresome when ripping several seasons of tv-series, seeing as how I need to scroll down the subtitle track list every time.
Tappen
22nd March 2012, 21:46
aMvEL: you guys and your batch processing: always surprises me what my customers want to do. But this is a reasonable request, and shouldn't be too difficult, so I'll see what I can do.
I suppose some sort of option that lets you choose 2 languages to put at the top of the sorted subtitle track list would be a start. Also the list currently only shows 5 items. Perhaps I can change the layout of the dialog to make the list taller and allow more to be visible without scrolling.
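The "preferred languages first" idea could look something like this Python sketch. Tracks are assumed to be (index, language code) pairs and the function name is made up; the only real trick is that Python's sort is stable, so non-preferred tracks keep their disc order.

```python
def order_tracks(tracks, preferred):
    """Put tracks in the preferred languages first, keeping disc order otherwise."""
    rank = {lang: position for position, lang in enumerate(preferred)}
    # Non-preferred languages all share the same rank, so the stable sort
    # leaves them in their original order.
    return sorted(tracks, key=lambda track: rank.get(track[1], len(preferred)))
```

Calling `order_tracks([(0, "en"), (1, "fr"), (2, "de"), (3, "no")], ["en", "no"])` would move the Norwegian track up to second place.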
aMvEL
22nd March 2012, 22:08
:) It wouldn't have been as much of a problem if I only rip movies, but since I rip mostly tv-series with a lot of episodes, I'd like to make it as easy as possible.
Both of your suggested solutions seem excellent to me, as they would simplify things a lot. :)
Chetwood
23rd March 2012, 08:12
Same here. There could be some tweaking done to streamline the process but I'd thought to hold back with suggestions till more pressing issues (like proper vertical positioning when OCRing to ASS or changing palettes to counter blurred outlines) are solved.
Slasher
27th March 2012, 22:17
Hi Tappen,
Thanks again for all your work.
I want to point out Chetwood's request about proper vertical positioning when exporting to ass. Other than this issue the app worked fine for me.
Tappen
28th March 2012, 00:04
Could you guys explain again what exactly you want to change with vertical positioning that isn't done when the "Keep Source Lines and Positions" option is set on the "Create Subtitle File" page? Do you just want a "left-align" or "center-align" option for the ASS tags?
I have to say the reason I wrote this app in the first place was because I didn't like the vertical positioning of DVD subtitles (too high on the screen, with too many line-breaks on 16:9 film) so for me allowing the ASS rendering software to place and line-break the subtitles wherever it wants is the main reason I use my own program. This is why it's hard for me to understand other points of view on the subject and you have to spell it out repeatedly and in simple language.
Chetwood
28th March 2012, 17:36
It's like I wrote in my last mails to you: I want word-wrap and horizontal/vertical positioning to be identical to the Vobsub's:
http://www.dvdshrink.info/chetwood/stuff/tgw-vobsub.jpg
But when outputting to ASS and having 'Keep Source Line Breaks' selected it looks like this:
http://www.dvdshrink.info/chetwood/stuff/tgw-ass.jpg
The SRT has the proper horizontal placement but vertical is off and thus blocking credits:
http://www.dvdshrink.info/chetwood/stuff/tgw-srt.jpg
Tappen
28th March 2012, 18:16
You have to understand that ASS won't use identical fonts to the DVD subtitles so the width of a line of text won't be the same.
This means you have to choose between left-aligned and center-aligned for all text. The pictures above are left-aligned so their centers don't match. I can put in an option to center-align but then any text that is left-aligned will look strange. Early versions of SubExtractor worked that way and people reported it as a bug.
I'm sorry to say you'll never be completely happy with any solution that a computer can produce.
Slasher
28th March 2012, 23:33
I have to say the reason I wrote this app in the first place was because I didn't like the vertical positioning of DVD subtitles (too high on the screen, with too many line-breaks on 16:9 film).
Let me better explain my issue. When ocring a bluray subtitle and outputting to ass with the "Keep Source Lines and Positions" option enabled, the resulting ass subtitle is too high on the screen compared to the original bluray subtitle position.
I think this is linked to the fact that when calculating the positions the program assumes 1080p video. But I want these positions adapted to 720p. Could you do that? Maybe offer an option like scaling positions to a preset set of resolutions or maybe custom resolutions? I already tried the "resample resolution" option in Aegisub but with no effect, the subtitle stays the same, even though it changes the values for positions.
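As a stopgap, the rescaling asked for here can be sketched as a small Python post-processing step on the saved .ass file. This is not a SubExtractor feature: the function name and the default resolutions are assumptions, and it only touches `\pos(x,y)` override tags.

```python
import re

# Matches an ASS \pos(x,y) override tag with integer or decimal coordinates.
POS_TAG = re.compile(r"\\pos\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")

def rescale_pos(line, src=(1920, 1080), dst=(1280, 720)):
    """Rescale every \\pos tag in one line from the src to the dst resolution."""
    def replace(match):
        x = round(float(match.group(1)) * dst[0] / src[0])
        y = round(float(match.group(2)) * dst[1] / src[1])
        return f"\\pos({x},{y})"
    return POS_TAG.sub(replace, line)
```

Note that ASS coordinates are relative to the PlayResX/PlayResY values in the [Script Info] header, so the header has to agree with whatever coordinate space the tags end up in; if it still says 1080 after rescaling, the subtitles will land in the wrong place again.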
Chetwood
29th March 2012, 07:24
I'm sorry to say you'll never be completely happy with any solution that a computer can produce.
I get that. Since the Vobsub does not provide info on what font is used (and even if it did), it's not certain that an ASS set to that font would display identically cause there's no way of knowing how the standalone/software player will render the font.
Still, AFAIK many Vobsubs do not use the whole screen and position it at coordinates 0,0 but have a bitmap as small as the rendered text and position this accordingly. Could Subextractor translate this position info to the ASS? If not, it should default to horizontally centered items since the overwhelming majority of the subs I've seen so far are centered.
I've seen very few DVDs that place color-coded items off-center close to the person speaking and only one TV show that does this so far (and this only on the US DVD, the English sub of the German DVD has centered subs only):
http://www.dvdshrink.info/chetwood/stuff/m-vobsub.jpg
which looks like this in ASS:
http://www.dvdshrink.info/chetwood/stuff/m-ass.jpg
and like this in SRT:
http://www.dvdshrink.info/chetwood/stuff/m-srt.jpg
I agree with Slasher that scaling options would be awesome. When saving to ASS, the save dialogue would offer to pop up a preview window where the longest item is displayed and people could change size, color and position. If that is too much work, please add the option to center-align the subs (a mouseover would explain the implications so people would not report this as a bug again). Thanks!
Tappen
29th March 2012, 21:18
SDH subtitles (for the deaf and hard of hearing) position the text near the speaker and left-align. A lot of people use this type of subtitle for various reasons and they were the ones complaining about the center-aligned text. I have a lot of sympathy for their viewpoint so I'll have to add centered as an option and leave left as the default.
In the DVD screenshots above the font used to create the DVD subtitle bitmaps was clearly unusually tall and narrow. If you changed the font used by SubExtractor to something like Arial Narrow instead of Tahoma it would probably line up better. Just a tip.
Thunderbolt8
30th March 2012, 20:36
Could you guys explain again what exactly you want to change with vertical positioning that isn't done when the "Keep Source Lines and Positions" option is set on the "Create Subtitle File" page? Do you just want a "left-align" or "center-align" option for the ASS tags?
I have to say the reason I wrote this app in the first place was because I didn't like the vertical positioning of DVD subtitles (too high on the screen, with too many line-breaks on 16:9 film) so for me allowing the ASS rendering software to place and line-break the subtitles wherever it wants is the main reason I use my own program. This is why it's hard for me to understand other points of view on the subject and you have to spell it out repeatedly and in simple language.
Not completely sure whether I understand it correctly, but I also agree that DVD subtitles are positioned too high on screen and have too many line breaks. There shouldn't be more than 2 lines on the screen, if avoidable.
as for the left or center align option, the standard is center align, right? That's fine with me. Don't like left align, unless it's necessary (well, it looks better then) in case of SDH titles spread over the screen for different speakers.
Tappen
30th March 2012, 23:01
The standard is currently to use the left-aligned tag in ASS. This is because there tend not to be many positioned text entries, and even fewer that are 2 or more lines tall, where the choice of left or center alignment is noticeable. The exception is SDH subtitle tracks where left-alignment is essential so they got priority.
I'm thinking that rather than trying to add another option I'd just find text that is centered and near the top of the screen and use the ASS center-align tag for those entries and continue with left-aligned tags for the rest. I already have working code to find text that is centered and near the bottom of the screen after all.
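The per-entry decision described above might be sketched like this in Python. The thresholds (3% tolerance, top and bottom thirds) and the 720x480 default frame are purely illustrative assumptions, not values from the program. The return value is the number used in the ASS `\an` tag, where 7 is top-left, 8 is top-center and 2 is bottom-center.

```python
def choose_alignment(left, top, width, height,
                     screen_w=720, screen_h=480, tolerance=0.03):
    """Pick an ASS \\an alignment number from a subtitle bitmap's rectangle."""
    # How far the bitmap's horizontal center is from the screen's center.
    center_offset = abs((left + width / 2) - screen_w / 2)
    centered = center_offset <= tolerance * screen_w
    middle_y = top + height / 2
    if centered and middle_y < screen_h / 3:
        return 8   # centered near the top: top-center
    if centered and middle_y > screen_h * 2 / 3:
        return 2   # centered near the bottom: bottom-center
    return 7       # everything else stays left-aligned (anchored top-left)
```

Positioned SDH lines, which are rarely centered on the screen, would fall through to the left-aligned case and keep their current behaviour.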
Chetwood
31st March 2012, 10:12
SDH subtitles (for the deaf and hard of hearing) position the text near the speaker and left-align.
Not on German DVDs.
In the DVD screenshots above the font used to create the DVD subtitle bitmaps was clearly unusually tall and narrow. If you changed the font used by SubExtractor to something like Arial Narrow instead of Tahoma it would probably line up better.
Right. But since fiddling in ASS is inconvenient it would be nice if we could change these settings from within Subextractor like this (http://www.dvdshrink.info/chetwood/stuff/subcextractorgui.png) (this mockup shows a maximum of settings, basics like font and position would be fine).
I'm thinking that rather than trying to add another option I'd just find text that is centered and near the top of the screen and use the ASS center-align tag for those entries and continue with left-aligned tags for the rest.
Can it be both? Cause as you wrote earlier, depending on font and font size even left-aligned subs may look terrible. I mean, you already have this coded as well so adding another option (again see linked screenshot) might not be too much work?
Tappen
31st March 2012, 21:13
This is going too far into Dvd-specific (non-Bluray) options for my taste. I just want my subs to look good compared to the ugly, blocky Vobsub bitmaps on an HD monitor with a minimum of effort. Perfect layout and accuracy has never been the goal for me, sorry, and I'm not interested in doing the work needed to get there.
Some sort of centered vs. left-aligned option is coming in the next release. I'm just trying to determine if there's an intermediate state where an intelligent decision can be made instead of forcing all text to be either left-aligned or centered.
Betsy25
16th April 2012, 18:03
After struggling and extremely bad sub recognition experiences with SubtitleEdit, this tool simply ROCKS !
it took me 10 minutes to teach it some few things, and now it works wonders ! :)
1) Just a tiny feature request : Is it possible for the last step (Save As....) to make it default to the actual DVD location instead of the default "library" location please ?
2) Is it normal that ignored characters end up being trained as a blank space ?
If I press "ignore" for a line of characters at the start of the movie, they end up in the "Training" list from the "Review and correct OCR Matches" window, trained as blank space ?
Thanks for the brilliantly clever tool.
Tappen
17th April 2012, 00:15
Where would the actual DVD location be? I suppose the VIDEO_TS folder is the most reasonable, or maybe the directory 1 up from that? It's not a hard option to code just hard to design.
Yes, ignored characters are trained as mapping to no character. This allows you to specify patterns which will have no output. However, like o, l, I and a few others these patterns only apply to the current DVD so don't worry about it messing up the OCR of other DVDs. If instead you just don't want to try to split a big lump of stuff for OCR - happens sometimes - I'd recommend using a Greek character so you can find it in the output file easily and insert what you actually want to appear in the subs.
Betsy25
17th April 2012, 20:42
Thanks for the clear reply Tappen, I'm just a little lost about which training rules are global and which are source specific.
Tappen
17th April 2012, 21:01
I move training rules from global to source specific when I find they cause errors in my own OCR'ing. It's pretty unscientific but that's what has happened. Characters l, I, period, commas, apostrophes were obvious, but I moved o and O into the source-specific category only when I found a few DVDs where mistakes were being made because of trainings from other DVDs.
I could try to indicate which characters are in which category but I don't think it would help people much: they'd have to follow the same process anyway. It's only folks like you who are curious and interested enough to want to know how things work who are affected. The current list (from the source code) is:
public static readonly char[] MovieSpecificChars = new char[] { '1', 'l', 'I', '.', '\'', ',', '-', '—', '_', '\\', '/', '|', 'o', 'O', '°', OcrCharacter.UnmatchedValue };
(OcrCharacter.UnmatchedValue is what you get with the "Ignore" button)
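To make the global vs. source-specific split concrete, here is a small Python sketch of how such a lookup could behave. The class, its method names and the empty-string stand-in for OcrCharacter.UnmatchedValue are all assumptions for illustration; only the character list comes from the line quoted above.

```python
# Characters whose trainings only apply to the disc they were trained on
# (copied from the quoted MovieSpecificChars list).
MOVIE_SPECIFIC_CHARS = set("1lI.',-\u2014_\\/|oO\u00b0")
UNMATCHED = ""  # hypothetical stand-in for OcrCharacter.UnmatchedValue ("Ignore")

class TrainingStore:
    """Keeps global trainings and per-source trainings in separate maps."""

    def __init__(self):
        self.global_map = {}
        self.per_source = {}

    def train(self, source_id, pattern, char):
        if char in MOVIE_SPECIFIC_CHARS or char == UNMATCHED:
            # Ambiguous or ignored glyphs are remembered for this source only.
            self.per_source.setdefault(source_id, {})[pattern] = char
        else:
            self.global_map[pattern] = char

    def lookup(self, source_id, pattern):
        local = self.per_source.get(source_id, {})
        if pattern in local:
            return local[pattern]
        return self.global_map.get(pattern)  # None when nothing is trained
```

So a pattern trained as "l" on one DVD would not be offered as a match on the next one, while a pattern trained as "e" would.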
aMvEL
30th April 2012, 15:35
I have a slightly bad vobsub where, with the default palette, the dot above the letter 'å' doesn't show, so the letter is recognized as 'a'.
If I change the palette to 1,2 it detects it just fine, however it always changes back to palette 1 (default). Is there anything I could do, like force a certain palette for the entire subtitle?
I've added a sample of the first episode. The nordic languages nor, dan, sve, are the ones with trouble.
http://dl.dropbox.com/u/2914045/FAMILY_GUY_S1_D1%20Track%201.zip
Tappen
1st May 2012, 00:01
aMvEL, you need to hit "Start Over for the Whole Movie" and change the palette to 1,2 from the first subtitle. The program will stay with 1,2 from then on (with 1 exception in the middle where you have to manually choose 1,2 again in my test).
The problem is that the characters on this subtitle are unusually fat and the program thinks palette 1 is more likely to be correct because of the average number of pixels per character. Once a few of the characters are in the OCR database it no longer depends on the size of the characters to choose the palette but bases the decision on how many of the characters can be immediately identified.
This is a hard problem to fix in general and I'd likely break more subtitles than I fix if I changed the algorithm for palette-choosing now so I'm not sure whether I'll change anything for the next release.
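The two-stage palette choice explained above could be approximated by the following Python sketch. Everything here is assumed for illustration: glyphs are modelled as frozensets of pixel coordinates, `typical_pixels` is a made-up reference value, and the real program may weigh things quite differently.

```python
def choose_palette(glyphs_by_palette, known_patterns, typical_pixels=60):
    """Pick a palette: by recognized glyphs if any, else by average glyph size."""
    def recognized(palette):
        return sum(1 for glyph in glyphs_by_palette[palette]
                   if glyph in known_patterns)

    def size_error(palette):
        glyphs = glyphs_by_palette[palette]
        if not glyphs:
            return float("inf")
        average = sum(len(glyph) for glyph in glyphs) / len(glyphs)
        return abs(average - typical_pixels)

    # Once the database recognizes something, trust recognition counts.
    if any(recognized(palette) for palette in glyphs_by_palette):
        return max(glyphs_by_palette, key=recognized)
    # Otherwise fall back to how plausible the average pixel count looks.
    return min(glyphs_by_palette, key=size_error)
```

This also shows why unusually fat characters cause trouble: until something is in the database, only the size guess is available, and it can favour the wrong palette.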