View Full Version : SubExtractor - New Sub Ocr App


Tappen
25th November 2011, 21:57
All good ideas Slasher. I'll see how many I can implement next release.

As to when that will happen: had a forced 2 week break from development recently but should be back at it this weekend.

Thunderbolt8
26th November 2011, 23:01
would it be possible to adjust the spacing of the sentence-final quotation mark " in combination with italics? I realise correct spacing is hard to achieve and there will always be problems with some words (especially in italics), but with the sentence-final quotation mark in italics there's very often, if not almost always, a space between the full stop or exclamation mark at the end of the sentence and the quotation mark, like:

blaa. "

since this happens so frequently, maybe it's possible to fix this independently of other changes? maybe something like a rule to always bind a sentence-final " directly to the last character of the line?

Thunderbolt8
27th November 2011, 15:06
here's another interesting one: http://www.mediafire.com/?ytequ63d67dvm3q


- Oh, Mr. Blume, this is
my chapel partner, Dirk Calloway.
- [ Guggenheim Whistles ]

-->

- Oh, Mr. Blume, this is
my chapel partner, Dirk Calloway.
-


not sure why both hyphens remain, whether it's because this is a 3-line centred subtitle or because there's always a space between the SDH brackets [] and the first and last letters inside them in this subtitle file.

Tappen
27th November 2011, 19:10
I just need to remove the entire line if all that's left is hyphens and whitespace. Same problem as you showed in your post on 12th November 2011, 12:36
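
A minimal sketch of the clean-up described above, in Python. This is not SubExtractor's code; the function name is hypothetical and the input is assumed to be the lines of one subtitle after SDH text has already been stripped.

```python
def drop_empty_dialogue_lines(lines):
    """Drop lines that are empty or contain only hyphens and whitespace."""
    # str.strip with a character set removes spaces, tabs and hyphens from
    # both ends; if nothing is left, the line carried no dialogue.
    return [line for line in lines if line.strip(" \t-") != ""]
```

Applied to the Rushmore example, the trailing "- " left behind by the removed [ Guggenheim Whistles ] text would be dropped while the two dialogue lines stay.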

Tappen
29th November 2011, 05:35
Slasher's list of features is in release 1021.
Thunderbolt8's fixes are in as well, except the case where the hyphen is differently italic from the text which follows.

nautilus7
29th November 2011, 12:48
Thanks!

aMvEL
29th November 2011, 17:29
Is there anything to be done about spacing in italic? I've stumbled upon some subtitles where most of the sentences in italic are without spaces.
I'd like to be able to adjust the spacing only in italic-sentences during or after OCR, or any other solution really...

Except that minor issue, this app has really improved my OCR and workflow, nice job Tappen :)

Tappen
29th November 2011, 17:57
aMvEL: on the last page (Create Subtitles) there's a button that says "Adjust Word Spacing" which allows you to adjust the adjustments made around characters. If you select the Italics button you'll only be changing the italics spacing. Once you select italics, then the character, you can add (if you're missing spaces) or subtract from the default adjustment to the left and right of various characters.

However, if you're missing all spaces, the problem isn't with the individual character adjustments. The program creates a histogram of the size of the spaces between characters (2 histograms, one normal and one italic) in pixels and tries to find 2 peaks in the graph. The first peak is when the characters are next to each other, the 2nd peak is when they are separated by a space. Then it runs through the OCR'd lines and inserts spaces where the separation is closer to the 2nd peak than the first. I've tweaked this algorithm a number of times, but when the sample size is small (italics, typically) there can still be problems accurately finding the 2nd peak. Can you put up a sample file which shows the problem somewhere for me to look at?
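
The two-peak idea described above can be sketched roughly as follows. This is an illustration only, not SubExtractor's algorithm: the function names, the way the peaks are picked, and the minimum separation of 3 pixels are all assumptions.

```python
from collections import Counter

def find_gap_peaks(gaps):
    """Return (adjacent_peak, space_peak) gap widths in pixels.

    Assumption: the first peak is the most common gap width overall, and the
    second peak is the most common width that is clearly wider than it.
    """
    if not gaps:
        return None, None
    hist = Counter(gaps)
    first = max(hist, key=lambda w: (hist[w], -w))
    # Hypothetical minimum separation between the two peaks: 3 pixels.
    wider = {w: n for w, n in hist.items() if w >= first + 3}
    if not wider:
        return first, None  # no second peak found (e.g. tiny italic sample)
    second = max(wider, key=lambda w: (wider[w], -w))
    return first, second

def insert_spaces(chars, gaps, peaks):
    """Join chars, adding a space where the gap is nearer the second peak."""
    if not chars:
        return ""
    first, second = peaks
    out = [chars[0]]
    for ch, gap in zip(chars[1:], gaps):
        if second is not None and abs(gap - second) < abs(gap - first):
            out.append(" ")
        out.append(ch)
    return "".join(out)
```

With a small sample, as with italics, `find_gap_peaks` may find no second peak at all, which is the "missing all spaces" case: no gap is ever closer to a peak that does not exist.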

aMvEL
30th November 2011, 00:55
Actually I somehow hadn't noticed that button on the program, so I'm a little embarrassed right now...:P
I've gotten it tweaked now to fix almost all my space-related errors... I run it through a spell-check afterwards to fix the remaining errors.

Thunderbolt8
30th November 2011, 01:09
thanks.

any chance of implementing the fix for the sentence-final " in italics in the future? because that one being right is the exception rather than the rule.

Tappen
30th November 2011, 02:10
I'm thinking about implementing a rule, something like: If there are an even number of quotation marks in the lines of a block of text, remove any spaces after the odd numbered ones and before the even numbered ones. Seems safe enough. I think it would fix many more mistakes than it creates.

Thunderbolt8: could you try the new release I've just put out - named "Manual Entry Improvement Test". It also has a slight change in italic double quote spacing, so see if it fixes your problem. It's at http://subextractor.codeplex.com/releases/view/77807
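
A rough sketch of the even-quote rule proposed above, assuming plain straight double quotes and a block given as a single string. The function name is hypothetical and this is not taken from SubExtractor's source.

```python
def tighten_quotes(block):
    """Apply the rule only when the block holds an even number of double quotes."""
    if block.count('"') % 2 != 0:
        return block  # odd count: leave the block untouched
    out = []
    opening = True  # the 1st, 3rd, ... quote opens; the 2nd, 4th, ... closes
    i, n = 0, len(block)
    while i < n:
        ch = block[i]
        if ch != '"':
            out.append(ch)
            i += 1
            continue
        if opening:
            out.append(ch)
            i += 1
            while i < n and block[i] == " ":
                i += 1  # drop spaces after an odd-numbered quote
        else:
            while out and out[-1] == " ":
                out.pop()  # drop spaces before an even-numbered quote
            out.append(ch)
            i += 1
        opening = not opening
    return "".join(out)
```

Blocks with an odd number of quotes are returned unchanged, which is what makes the rule "safe enough": it only acts when the quotes pair up.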

Thunderbolt8
30th November 2011, 23:38
looks good so far. with a track in which every single final italicised " was wrong, each of them is correct now.

will report back in case I should encounter problems

Tappen
30th November 2011, 23:44
If anyone else reading this wants to try out the new Manual Entry user interface in the test release (http://subextractor.codeplex.com/releases/view/77807) go ahead and let me know how it feels.

Basically you can select the OCR match just by typing the character - you don't have to hit the Enter key any longer - and can switch to Italic and back just by a quick tap (press and release) of either Alt key. I find it's much faster than clicking on the glyph in the Character Selector box with the mouse.

Thunderbolt8
30th November 2011, 23:50
one thing I noticed: the letters newly added with this version are bigger in the OCR training review table than all the others. this is quite irritating, especially because before you could scroll fast through the whole list and spot any mistake, because a wrongly identified character sticks out notably, even when scrolling very fast. that is not possible now, because all these new additions with a different size create an obstacle for the eyes. spotting mistakes would take way longer that way.

is there any way to revert this back to normal or to have all characters of the same letter with the same size again?

Thunderbolt8
1st December 2011, 01:00
that one is fixed:

- Rushmore.
- [ Whispers ]
- Shh.

-->

- Rushmore.
- Shh.


but that one still remains:


- Oh, Mr. Blume, this is
my chapel partner, Dirk Calloway.
- [ Guggenheim Whistles ]

-->

- Oh, Mr. Blume, this is
my chapel partner, Dirk Calloway.
-


similar/same as this one:


- But the fools first.
<i>- [Seagull Squawking]</i>

-->

- But the fools first.


just saying, because you said only that one hyphen italic thing remained unfixed. not sure if you covered this one somewhere, am losing track with so much stuff :p and it's definitely getting less :D

Chetwood
1st December 2011, 09:06
go ahead and let me know how it feels.
I love it! What I would like though, is an additional notifier that italics is ON in the window where the current item is listed. Maybe atop the window or better yet, have the usual black frame around the window turn blue when italics is on. Cause now that I don't have to use the mouse anymore, I can focus on that window alone. Having to look down below all chars just to see whether the italic checkbox is checked or not, takes time.

http://img528.imageshack.us/img528/2104/italicson.th.png (http://imageshack.us/photo/my-images/528/italicson.png/)

Tappen
1st December 2011, 17:19
Thunderbolt8: I'm not getting that error. In the sup file you uploaded for srt with SDH removed I see

47
00:04:05,622 --> 00:04:07,081
- Thank you.
- Hello.

48
00:04:07,165 --> 00:04:10,668
- Oh, Mr. Blume, this is
my chapel partner, Dirk Calloway.

49
00:04:10,752 --> 00:04:13,087
Nice to meet you, Dirk.

And for ASS I get:

Dialogue: 0,0:04:05.62,0:04:07.08,Dialogue1,Unknown,0000,0000,0000,,- Thank you.\N- Hello.
Dialogue: 0,0:04:07.16,0:04:10.66,Dialogue1,Unknown,0000,0000,0000,,- Oh, Mr. Blume, this is my chapel partner, Dirk Calloway.
Dialogue: 0,0:04:10.75,0:04:13.08,Dialogue1,Unknown,0000,0000,0000,,Nice to meet you, Dirk.

Neither of which has the extra line with the lonely hyphen that you list. What am I doing wrong?

Thunderbolt8
1st December 2011, 18:24
sorry, I copy pasted that wrong,

- Oh, Mr. Blume, this is
my chapel partner, Dirk Calloway.
-

the last hyphen is actually not there any more. but anyway, what I meant by posting this is that the other hyphen is still there :p not sure if it's safe to fix, though, since both lines belong to the same speaker.

Tappen
1st December 2011, 18:28
Ah I see the issue: I never remove hyphens, only add them. I don't think I'll change that behavior for now.

Tappen
1st December 2011, 21:56
one thing I noticed: the letters newly added with this version are bigger in the OCR training review table than all the others. this is quite irritating, especially because before you could scroll fast through the whole list and spot any mistake, because a wrongly identified character sticks out notably, even when scrolling very fast. that is not possible now, because all these new additions with a different size create an obstacle for the eyes. spotting mistakes would take way longer that way.

is there any way to revert this back to normal or to have all characters of the same letter with the same size again?

I don't see this. I added some new OCR trainings from an sup file I was part-way through and they all look about the same size in the OCR review table. I checked the code and there's been no change in that dialog except for adding the "Remove all trainings for this character" button since July.

Tappen
1st December 2011, 23:27
I love it! What I would like though, is an additional notifier that italics is ON in the window where the current item is listed. Maybe atop the window or better yet, have the usual black frame around the window turn blue when italics is on. Cause now that I don't have to use the mouse anymore, I can focus on that window alone. Having to look down below all chars just to see whether the italic checkbox is checked or not, takes time.


A 2nd test build of the new Manual UI: http://subextractor.codeplex.com/releases/view/77913 with an indicator to the right of the Ocr box. Also now the Backspace key works as Undo.

Thunderbolt8
2nd December 2011, 01:47
I don't see this. I added some new OCR trainings from an sup file I was part-way through and they all look about the same size in the OCR review table. I checked the code and there's been no change in that dialog except for adding the "Remove all trainings for this character" button since July.

http://thumbnails38.imagebam.com/16231/3e905c162301999.jpg (http://www.imagebam.com/image/3e905c162301999)

http://thumbnails46.imagebam.com/16231/14a769162302012.jpg (http://www.imagebam.com/image/14a769162302012)

here, in pic 1 one B is bigger than the others and in pic 2, two Cs. that's quite irritating when scrolling fast through the list looking for mistakes.

(haven't checked lower-case letters or italics)

Tappen
2nd December 2011, 02:14
I think it's just that your subtitle file uses two different sized fonts. Often signs are drawn differently from dialogue. The characters on the left side of the list aren't scaled: they're exactly as I found them in the Sup or Sub bitmaps. I just draw the items in the drop-down list at whatever size I find them so you can see the exact pixel pattern.

With DVDs the difference can be huge - some text is twice or 3x as big as others - but if you've only been working with Sup files you might have gotten the impression that the subtitle authors only use one font per movie since that's pretty common on Bluray.

On another note - I had some good success yesterday with a new "fuzzy logic" OCR method. It seems to match 80-95% of the characters that are now treated as different because they've got a few different pixels on the edges. Hopefully by Monday I can finally solve the problem of Sup files where you have to make 1000s of OCR matches because the characters have been stretched or shrunk during Bluray authoring.
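
The post does not say how the new "fuzzy logic" method works. Purely as an illustration of tolerating "a few different pixels", here is a sketch that accepts a trained glyph when it differs from the candidate in at most `max_diff` pixels. The bitmap representation (lists of equal-length strings), the function names and the threshold are all assumptions, and this sketch does not handle stretched or shrunk characters.

```python
def pixel_difference(a, b):
    """Count differing pixels between two equal-sized glyph bitmaps."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return None  # different sizes: not comparable in this sketch
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def fuzzy_match(glyph, trained, max_diff=2):
    """Return the label of the closest trained glyph within max_diff pixels, else None."""
    best_label, best_diff = None, max_diff + 1
    for label, bitmap in trained.items():
        diff = pixel_difference(glyph, bitmap)
        if diff is not None and diff < best_diff:
            best_label, best_diff = label, diff
    return best_label
```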

Chetwood
2nd December 2011, 08:19
Thanks for implementing this, Tappen but I'm still in favour of a differently coloured bar around the window or something. Cause the way it is right now, I can only see the notifier in the corner of my eye and |||| is quite similar to |||| (<- should be italics).

Tappen
2nd December 2011, 19:20
Well this is why I didn't release this feature immediately. Thanks for the feedback Chetwood, I'll think about it some more.

Thunderbolt8
2nd December 2011, 22:23
has that sentence-final " in italics change been (accidentally) reverted again in 1021c? all the " at the end of italicised sentences have a space between them and the last letter of the sentence again.

Chetwood
4th December 2011, 14:17
Some more observations on 1.0.2.1c:

To enter [ and ] on a German keyboard you gotta press ALTGr (which is right of the space bar) plus 8, 9 respectively. This of course interferes with ALT for toggling italics. How about using space bar as the toggle instead?


When reviewing and correcting OCR Matches I see all chars listed, but not [ or ], despite me having entered them in this particular session.


I got a 'Character Matched but Off-Baseline' message when trying to enter a question mark in italics where only the upper half was recognized. Clicking on the dot and then typing ? did not work. I had to ignore the char. (Weird, after deleting OcrMapOrig.bin and doing the whole sub from scratch, the ? was recognized properly)


IIRC ASS subs allow for vertical positioning with the MarginV style tag. Would be nice if SubExtractor would translate the positions from VobSubs so credits do not overlap with subs.


How do I delete all trainings for a sub, wasn't it possible pre 1.0.21? When checking the finished sub, I noticed several errors so I opened the original VobSub again (selecting 'previous step' was impossible since I'd already closed SubExtractor). So I reopened it to start the training on the whole sub but since all chars were already trained, it was zapping through all items without me being able to interrupt this. And when it finished I also could not select to redo the training for the complete sub. Please put this feature back in. Thx.

In case you need someone to do the German translation of the GUI, count me in.

Thunderbolt8
4th December 2011, 17:47
maybe only using the left alt key could already do it

Tappen
5th December 2011, 21:36
Some more observations on 1.0.2.1c:

To enter [ and ] on a German keyboard you gotta press ALTGr (which is right of the space bar) plus 8, 9 respectively. This of course interferes with ALT for toggling italics. How about using space bar as the toggle instead?


When reviewing and correcting OCR Matches I see all chars listed, but not [ or ], despite me having entered them in this particular session.


I got a 'Character Matched but Off-Baseline' message when trying to enter a question mark in italics where only the upper half was recognized. Clicking on the dot and then typing ? did not work. I had to ignore the char. (Weird, after deleting OcrMapOrig.bin and doing the whole sub from scratch, the ? was recognized properly)


IIRC ASS subs allow for vertical positioning with the MarginV style tag. Would be nice if SubExtractor would translate the positions from VobSubs so credits do not overlap with subs.


How do I delete all trainings for a sub, wasn't it possible pre 1.0.21? When checking the finished sub, I noticed several errors so I opened the original VobSub again (selecting 'previous step' was impossible since I'd already closed SubExtractor). So I reopened it to start the training on the whole sub but since all chars were already trained, it was zapping through all items without me being able to interrupt this. And when it finished I also could not select to redo the training for the complete sub. Please put this feature back in. Thx.

In case you need someone to do the German translation of the GUI, count me in.


I'll switch the Italics toggle to the Space bar. Much safer.

I don't know why [ and ] aren't showing. I've never seen that and have no idea why but I'll look into it.

There's currently a problem with lines that only contain matched characters that aren't trusted for baseline identification (- is most common) showing the "Ignore Baseline" message when it really should just move on to identifying the rest of the characters on the line. Should be fixed in next release. Just hit "Ignore Baseline" for now - it won't affect the final result (the character isn't ignored, only its contribution to defining the lines).

So you want the program to notice when positioned subs are low enough on the screen that they would conflict with the default dialogue subs and move the dialogue up for those cases? Good idea for a future feature.

You can open the Review OCR dialogue and just spam the "Remove all Trainings for this Character" button till the list is empty. Putting the "Restart OCR for this Movie" button on the main window was too dangerous. I can add it back into the Review OCR dialogue if you think we need it.

Tappen
5th December 2011, 23:47
has that sentence-final " in italics change been (accidentally) reverted again in 1021c? all the " at the end of italicised sentences have a space between them and the last letter of the sentence again.

I don't think I adjusted it back. Check the "Adjust Word Spacing" page you can get to from the "Create Subtitle File" page. Select Italic, then select the " character. I changed the default left adjustment from "-1" to "-2". Your subtitle might need it to be "-3" or "-4" if the tails on the quotes are really long and/or stretch far to the left.

Chetwood
6th December 2011, 08:15
I don't know why [ and ] aren't showing. I've never seen that and have no idea why but I'll look into it.
Do you have an FTP or something? Whenever I stumble on something like this, I could upload the sub and a text file explaining the issue. Using Megaupload is too much of a hassle for small files like these.

So you want the program to notice when positioned subs are low enough on the screen that they would conflict with the default dialogue subs and move the dialogue up for those cases?
Actually I'd like it to recognize Vobsub item position. Usually you have two lines at the bottom of the screen. Often however, a few items at the beginning of a TV show are displayed at the top of the screen so they do not overlap the show's credits displayed at the bottom. The info on vertical position must be encoded either in the idx or the sub (BDSup2sub displays this info). But even if it weren't, since it's an OCR tool you should be able to determine vertical position and copy over this info to the ASS file.

Putting the "Restart OCR for this Movie" button on the main window was too dangerous.
Was it? How about adding 'are you sure?' and a checkbox to never ask this question again? In any case, I do think it's necessary so please at least add it back into the Review OCR dialogue. Thx.

Tappen
6th December 2011, 10:23
A new test build is up, with the new OCR and better Manual Entry features implemented (changed to Space bar for Italics, with blue indicators when on).
Also the OCR Review dialog has "Remove All ..." buttons to allow a full restart on a movie. I hate "Are you sure?" buttons.

http://subextractor.codeplex.com/releases/view/78152

Thunderbolt8
6th December 2011, 10:42
I don't think I adjusted it back. Check the "Adjust Word Spacing" page you can get to from the "Create Subtitle File" page. Select Italic, then select the " character. I changed the default left adjustment from "-1" to "-2". Your subtitle might need it to be "-3" or "-4" if the tails on the quotes are really long and/or stretch far to the left.

the night of the hunter subs need -9 which seems to be quite a lot.
I guess just as good would be to search for . " (full stop, space, quotation mark; or ! or ? instead of the full stop) and replace it with ." in your subtitle editor.

Tappen
6th December 2011, 20:20
the night of the hunter subs need -9 which seems to be quite a lot.
I guess just as good would be to search for . " (full stop, space, quotation mark; or ! or ? instead of the full stop) and replace it with ." in your subtitle editor.

OK, I'll look into making a special rule to do this. I think the problem is that there are left double-quotes and right double-quotes in the subs but I only allow you to choose the non-left-or-right version during OCR. The right double-quotes in italics have a REALLY long tail which messes up the spacing.

It's not as easy as you'd think because different languages have different rules for quotes.

Thunderbolt8
6th December 2011, 21:27
thanks for trying ;)

Tappen
7th December 2011, 05:37
Thunderbolt8: I added a simple "don't put a space after . ? or ! and before italic double-quotes" rule in the next release.
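
For anyone who wants to apply a similar fix to an existing SRT file in the meantime, here is a small regular-expression sketch. It is an assumption-laden stand-in, not the rule as implemented in SubExtractor: it only touches a double quote that ends a line (optionally followed by a closing </i> tag), and it does not check whether the text is actually italic.

```python
import re

# A run of spaces between . ? or ! and a double quote that ends the line,
# optionally followed by a closing </i> tag.
_FINAL_QUOTE = re.compile(r'([.?!]) +"(?=(?:</i>)?[ \t]*$)', re.MULTILINE)

def fix_final_quote(text):
    """Remove the space before a line-final double quote after . ? or !"""
    return _FINAL_QUOTE.sub(r'\1"', text)
```

Restricting the match to the end of a line avoids gluing an opening quote to the previous sentence, as in: He said. "Hello."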

Chetwood: I'll have to look into the issue with vertical position on IDX/Sub files. Must be some part of the file format I am missing.

Chetwood
7th December 2011, 07:09
Have you seen this (http://sam.zoy.org/writings/dvd/subtitles/)? VobSub layout should be similar. I'm not exactly sure how it's implemented. If the item is as large as the screen (720x576 on PAL) and all but the text is transparent, you'd need to OCR to determine position. However, if they use bitmaps only as large as two lines of text, they would have to store vertical position information somewhere. Probably in the header of each bitmap in the sub file?

Tappen
7th December 2011, 20:09
Have you seen this (http://sam.zoy.org/writings/dvd/subtitles/)? VobSub layout should be similar. I'm not exactly sure how it's implemented. If the item is as large as the screen (720x576 on PAL) and all but the text is transparent, you'd need to OCR to determine position. However, if they use bitmaps only as large as two lines of text, they would have to store vertical position information somewhere. Probably in the header of each bitmap in the sub file?

Yes that's the normal DVD subtitle format. If the bitmap is smaller than the video size and has an x,y origin other than (0,0) I definitely use it during OCR and in ASS file creation.

On the Choose Subtitle page there's a checkbox "Scale Image" that toggles between showing just the subtitle rectangle bitmap (unchecked) and placing the bitmap in its correct origin and size in the full video window (checked) (the checkbox doesn't change the final output, it's for user convenience only). I've tried a few IDX files and when the rectangle is not at the bottom middle of the screen I definitely pick it up and use the {\an4\pos(x,y)} tag to position the lines in ASS output files. So I'm not sure what the problem is you're seeing.
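
As an illustration of the {\an4\pos(x,y)} tag mentioned above, here is a sketch that builds the text field of a positioned ASS line. The helper names are hypothetical, and how SubExtractor derives x and y from the bitmap origin is not stated in the thread, so the coordinates are simply passed in.

```python
def position_tag(x, y, alignment=4):
    r"""Build an ASS override tag such as {\an4\pos(120,60)}."""
    return "{\\an%d\\pos(%d,%d)}" % (alignment, x, y)

def positioned_text(lines, x, y):
    r"""Prefix subtitle text with a position tag, joining lines with ASS \N."""
    return position_tag(x, y) + "\\N".join(lines)
```

In ASS, \an4 selects middle-left alignment, so \pos(x,y) places the left edge, vertically centred, at that point.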

Tappen
8th December 2011, 05:00
I've put up what I think is the final test before release of the new Manual Entry and Fuzzy logic OCR features.

http://subextractor.codeplex.com/releases/view/78306

Currently you can toggle between Italics and Normal for Manual Entry using the Space bar. When using the mouse and the character selection box getting italics still requires holding the Ctrl key down. I'm thinking I should combine these and use just the Space bar for both entry methods to avoid confusion. Opinions?

Chetwood
8th December 2011, 10:18
Yes, please use the Space bar for both entry methods.
So I'm not sure what the problem is you're seeing.
Well, I didn't get them to display properly in VLC. Now, after some more tests with your new version, vertical positioning works just fine, even without me having checked 'Exactly Position Every Line' (what does it do?). Sorry.

BTW, marking the characters window with blue stripes as an italics indicator is an improvement over the old |||-method but looks a little tame to me. Why not go all out and make the whole window framed 2 pixels wide? Also, how about adding a color selection dialogue (not necessarily as complex as in Subtitle Creator) for the ASS format? Green subs look kinda tacky ;)

Do you plan on adding (customizable) shortcuts and batch functionality for future releases? When ripping entire TV shows, every unnecessary click saved would speed up the process considerably.

Tappen
8th December 2011, 19:01
I'm thinking of making the Space bar toggle both entry methods, but still allow the Ctrl key to temporarily reverse the character selection box. That way anyone used to using the mouse won't have to re-learn their workflow.

Normally I look for text that's centered in the bottom third of the video window and don't use positioning tags on it in ASS. That means it's displayed in the standard position for dialogue text by the ASS renderer in your video player rather than where the DVD author put it. 'Exactly Position Every Line' puts a positioning tag on every entry so it exactly matches the DVD. Usually this puts dialogue text too high on the screen because DVDs have to compensate for TV overscan. This option is normally useful if the subtitles are SDH (for the deaf and hard of hearing) where the dialogue is positioned all over the screen based on who is speaking. It can be very distracting if some of that text is moved to the lower, default position because the speaker happens to be in the middle of the screen.
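
The decision described above could be sketched like this. The function name, the rectangle layout and the 5% centring tolerance are assumptions made for the example; the thread only says "centered in the bottom third".

```python
def needs_position_tag(rect, video_size, exact=False, tolerance=0.05):
    """Decide whether a subtitle needs an explicit ASS positioning tag.

    rect is (x, y, width, height) of the subtitle bitmap in video pixels.
    """
    if exact:
        return True  # mimics 'Exactly Position Every Line'
    x, y, w, h = rect
    video_w, video_h = video_size
    in_bottom_third = y >= video_h * 2 / 3
    # Hypothetical tolerance: centre within 5% of the video width.
    centred = abs((x + w / 2) - video_w / 2) <= video_w * tolerance
    return not (in_bottom_third and centred)
```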

I'll add blue horizontal bars to the top and bottom edges of the window when you're in italics as well. Should be clear enough.

The color of the subs is by default whatever the DVD used. Typically white or yellow but occasionally yes, a horrible green. On the "Subtitles" page in Options you can over-ride that. Choose "Use Custom Color" to set the color of all text, or "DVD Colors Except Default Text" will over-ride just the color of the text that is in the lower center part of the screen (any text that has a positioning tag will still use the DVD colors). This third option can give you a nice, consistent look for your dialogue but allow translated signs and text at the top of the screen to use the DVD colors. The only issue with this is that sometimes the DVD uses 2 different colors to indicate 2 different speakers. I add hyphens to the dialogue to indicate the speaker change if you've overridden the default colors.

Tappen
8th December 2011, 21:10
Do you plan on adding (customizable) shortcuts and batch functionality for future releases? When ripping entire TV shows, every unnecessary click saved would speed up the process considerably.

It is possible to select multiple idx or sup files from a directory. The button on the last page labelled "OCR Next Encoded Title" will then be available to quickly go through the list. Also, if you name the idx files consistently for a batch ("TV Show Season 1 Disc 1.idx", "TV Show Season 2 Disc 3.idx") the program won't ask for iolI,.' characters to be re-matched.

Tappen
10th December 2011, 20:05
For the next stage in the project I'm thinking of doing a documentation pass. Some better tooltips over various options and controls, also a couple tutorials: 1 for DVDs and 1 for Sup/Idx files. And a proper Installer rather than just a zip file.

I still want to allow multiple character sets/databases and support full localization/internationalization of the program but I feel those should be done after the more obviously missing English pieces are complete.

Chetwood
11th December 2011, 11:27
I really appreciate the way you're going with this. Adding these features will make SubExtractor more useful and like I said before, I'd be happy to do the German localization of the GUI. However, I think first and foremost the OCR capabilities should take priority. There are still some things that need ironing out:


Since you don't support multiple streams in a VobSub yet, I'm stripping them down to one with VobSubstrip before opening them in SubExtractor. Still, they do stem from the same sub and thus have size, colors, font, positioning in common. So how come that when I enter all chars in the German stream, some of those have to be entered again when OCRing the English stream? It's particularly strange that these are usually simple non-italic chars like , - O . ' o l


The leading dash that marks two people speaking in one item sometimes translates to dash with a space behind (or a dash that has a space behind translates to dash only):

- Du brauchst Geld für diesen Monat?
- Ja.

is displayed in BDSUP2SUB exactly like this (dash + space) but is OCR'ed as

- Du brauchst Geld für diesen Monat?
-Ja.

It mostly happens with German subs but I can't tell from looking at the bitmap sub what causes this (as usual I can upload demo files, if you like).



The current beta does not seem to remember the last dir opened.


Please add an option to match the SRT/ASS filename to the original, dropping additional info like 'T1 Deutsch (German) Wide'.


What exactly is the meaning of '1080p Font Adjustments (2x Options Values)'?


Also, if you name the idx files consistently for a batch ("TV Show Season 1 Disc 1.idx", "TV Show Season 2 Disc 3.idx") the program won't ask for iolI,.' characters to be re-matched.
Are you referring to the OCR match process or the spellchecking as the last step before saving? Cause this dialogue pops up even if the files are named consistently.

BTW, is it possible to use another font in this spellchecking window? Cause with the current monospaced font the capital I and the lowercase l look almost identical and are hard to differentiate in words like INITIATIVE, which get 16 options listed (please check attached image). Thx.


http://img819.imageshack.us/img819/6742/spellcheck.th.png (http://imageshack.us/photo/my-images/819/spellcheck.png/)

Thunderbolt8
11th December 2011, 13:51
missing out the space after a dash happens quite often during OCR, but can be corrected easily via subtitleworkshop (though the downside is that sometimes a space is then added after a dash in the middle of a sentence, e.g. I-I didn't know. -> I- I didn't know.)

Tappen
11th December 2011, 18:32
Chetwood:

I support multiple language streams in a VobSub idx. I just tried it and it works fine. If I take off the extra text (e.g. T1 English Wide) in the filename then each stream will over-write the previous if you try to do more than 1. You can just use "Save As" now and change the filename yourself.

Spacing before/after hyphens is more error-prone than other characters. I don't do anything special with it currently but perhaps I should. The easiest would be to make an option that lets you choose whether all hyphens at the start of a line have 0 or 1 spaces after them (it might be wrong but at least it's consistent). UPDATE: see my next forum post

The current release (1.0.2.3) remembers both the "Choose Subtitles" directory and "Save As" directory correctly.

1080p font adjustments means I double the font size, vertical and horizontal margins specified in Options when outputting an ASS file.
For example:
Style: Dialogue1,Tahoma,32,&H0000FFDB,&H000000FF,&H1F000000,&HC7000000,0,0,0,0,100,100,0,0,1,1.4,1.7,2,80,80,15,1
becomes:
Style: Dialogue1,Tahoma,64,&H0000FFDB,&H000000FF,&H1F000000,&HC7000000,0,0,0,0,100,100,0,0,1,1.4,1.7,2,160,160,30,1
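
The doubling shown above can be reproduced with a few lines of Python. This is only a sketch: the function name is made up, it assumes the standard ASS V4+ Style field order (Fontsize is the 3rd field, MarginL, MarginR and MarginV are the 20th to 22nd), and it assumes those four values are whole numbers.

```python
def double_for_1080p(style_line):
    """Double Fontsize, MarginL, MarginR and MarginV in an ASS 'Style:' line."""
    prefix, _, value = style_line.partition(":")
    fields = value.strip().split(",")
    for index in (2, 19, 20, 21):  # Fontsize, MarginL, MarginR, MarginV
        fields[index] = str(int(fields[index]) * 2)
    return prefix + ": " + ",".join(fields)
```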

If you name files similarly (i.e. same except for numbers and symbols) the error-prone OCR characters (iIl1oO etc.) won't stop the OCR process as they do for differently named files. I saw many cases of these characters being mistaken during OCR before I put this in. Multiple subtitle streams in a single bin or idx file never ask for you to repeat the error-prone characters. Also the spell-check step is completely unrelated to this and remembers its word database for all movies (except for choosing between AI and Al)
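
One way to read "same except for numbers and symbols" is to compare file names after stripping everything that is not a letter. The sketch below does exactly that; the helper names are hypothetical, and treating spaces as symbols and ignoring the extension are assumptions on top of what the post says.

```python
import os
import re

def batch_key(path):
    """Reduce a file name to its letters only (digits, spaces, symbols removed)."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return re.sub(r"[\W\d_]+", "", stem)

def same_batch(a, b):
    """True if two subtitle files would count as similarly named."""
    return batch_key(a) == batch_key(b)
```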

If you can find a font that distinguishes between l and I better than the one I use in the "l & I Spellcheck" step let me know. I chose it because it was clearer than the Arial, Tahoma or Microsoft Sans Serif I use elsewhere.

Tappen
12th December 2011, 02:04
I could force all lines that begin with - (hyphen) to never have a space after the hyphen. It seems the safest way to achieve consistency. Em dash (long hyphen) would be treated similarly. I would italicize the hyphen to match the next character.

I could also force a space after a hyphen at the start of a line if that's more to people's liking. Probably best to force no space after an Em dash even so since it's such a wide character.

I'd prefer not to add another hard-to-explain option to the program and just make a decision (always no space or always 1 space) that will make the output consistent. Personally I'd prefer forcing 1 space even if the OCR doesn't find one because I think that looks better.

Opinions?

Chetwood
12th December 2011, 07:57
missing out the space after a dash happens quite often during OCR
Weird. Given the nice and simple shape of the dash, I'd never have expected it to cause any trouble. I use UltraEdit with regular expressions to replace it only at the beginning of a sentence.

I'd prefer not to add another hard-to-explain option to the program and just make a decision (always no space or always 1 space) that will make the output consistent.
Well, I'm more of a 'lots of options' guy. Granted, I do not have to code this stuff. Still, adding an option in the settings menu (not on the ripping page, so you need to make a conscious decision to fiddle with it) with a reasonable default setting would be cool:

When OCRing lines beginning with hyphens:
(°) try to recognize spaces behind and add them accordingly
( ) always add space behind hyphen
( ) always strip space behind hyphen
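
The three choices above could be sketched roughly like this. It is only an illustration of the proposal, not anything in SubExtractor: the mode names `keep`, `space` and `strip` and the function `normalize_leading_hyphen` are invented here. Lines starting with two hyphens are deliberately left untouched, and the em dash is treated the same as a hyphen.

```python
import re

# A single leading hyphen (not followed by a second one) or an em dash,
# plus whatever spaces or tabs the OCR put after it.
_LEADING_DASH = re.compile(r"^(-(?!-)|\u2014)[ \t]*")

MODES = ("keep", "space", "strip")


def normalize_leading_hyphen(line: str, mode: str = "keep") -> str:
    """Apply one of the three proposed behaviours to a subtitle line.

    keep  -> leave the spacing exactly as the OCR recognised it
    space -> always exactly one space after the leading hyphen
    strip -> never a space after the leading hyphen
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "keep":
        return line
    match = _LEADING_DASH.match(line)
    if match is None:
        return line
    rest = line[match.end():]
    if not rest:
        # Nothing but the dash: there is no text to space it against.
        return line
    separator = " " if mode == "space" else ""
    return match.group(1) + separator + rest
```

For example, `normalize_leading_hyphen("-Hello", "space")` gives `- Hello`, and `normalize_leading_hyphen("-  Hello", "strip")` gives `-Hello`.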

I support multiple language streams in a VobSub idx. I just tried it and it works fine.
I just tried it with 1.0.2.3 and several subs I ripped with VSrip, all of which result in an error message: 'subtitle file out of date or corrupted'. It happens with any VobSub that has more than one stream.

If I take off the extra text (e.g. T1 English Wide) in the filename then each stream will over-write the previous if you try to do more than 1. You can just use "Save As" now and change the filename yourself.
Well, it's your tool but I still think it would be best for you to adhere to best practices where the default is the other way round: you save under the same name and if you want to name it differently, you use "Save As". Also, when a file is to be overwritten, you get a window asking you about it where the default is 'no' rather than having the save dialogue default to overwrite.

You know, I'm always considering batch ripping and I don't know how much effort it would be to change functionality later on. I'd also like to be able to write to both ASS and SRT because some TVs' media players only render SRTs.

1080p font adjustments means I double the font size, vertical and horizontal margins specified in Options when outputting an ASS file.
That was my guess, though I wonder whether it might be better to leave any adjustments to the player. Standalones these days often come with options to alter size and color for playback. I'm not familiar with the ASS format but when (un)checking this setting I got:

unchecked
Style: Dialogue1,Tahoma,32,&H00E6FFE6,&H000000FF,&H1F000000,&HC7000000,0,0,0,0,100,100,0,0,1,1.4,1.7,2,80,80,15,1
Style: Dialogue2,Tahoma,62,&H00E6FFE6,&H000000FF,&H1F000000,&HC7000000,0,0,0,0,100,100,0,0,1,1.4,1.7,2,80,80,15,1

checked
Style: Dialogue1,Tahoma,64,&H00E6FFE6,&H000000FF,&H1F000000,&HC7000000,0,0,0,0,100,100,0,0,1,2.8,3.4,2,160,160,30,1
Style: Dialogue2,Tahoma,62,&H00E6FFE6,&H000000FF,&H1F000000,&HC7000000,0,0,0,0,100,100,0,0,1,2.8,3.4,2,160,160,30,1

Does this setting apply only to Dialogue1?

If you name files similarly (i.e. same except for numbers and symbols) the error-prone OCR characters (iIl1oO etc.) won't stop the OCR process as they do for differently named files. I saw many cases of these characters being mistaken during OCR before I put this in.
Mmh, so does this result in a higher chance of missing error-prone characters? Or are they assumed correct as they were verified on the first sub of the batch?

If you can find a font that distinguishes between l and I better than the one I use in the "l & I Spellcheck" step let me know.
Probably some font with serifs? Gonna try to find one.

Thunderbolt8
12th December 2011, 19:30
I could force all lines that begin with - (hyphen) to never have a space after the hyphen. It seems the safest way to achieve consistency. Em dash (long hyphen) would be treated similarly. I would italicize the hyphen to match the next character.
you'd have to distinguish this from lines beginning with 2 hyphens -- though, as occurs after a scene switch when someone is already talking, or when the subs constantly switch back and forth between a conversation/one person talking and background speech of e.g. a TV:

00:16:34: When did he say he wants to come over?

00:16:36: --weather will be hot and cloudy
with some sunny spells in the morning and afternoon

(^^maybe also be italicised)

(dunno if that still happens, but there's also the case in which a line ends up with a double hyphen -- after SHD removal. at least this used to happen in former versions).

Tappen
12th December 2011, 19:48
Thunderbolt: yes, I'd have to watch for this case and apply the rule after 2 hyphens the same as after 1.
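
A rough sketch of what "the same rule after 2 hyphens as after 1" might look like with a regular expression. This is an illustration under assumptions, not the program's code: `force_dash_spacing` is a made-up name, and runs of three or more hyphens are simply left alone.

```python
import re

# One or two hyphens at the start of a line, any spaces after them,
# and then a real character (not whitespace, not yet another hyphen).
_DASH_PREFIX = re.compile(r"^(-{1,2})[ \t]*(?=[^\s-])")


def force_dash_spacing(subtitle: str, space: bool = True) -> str:
    """Make spacing after a leading '-' or '--' consistent on every line.

    space=True  -> exactly one space after the hyphen(s)
    space=False -> no space after the hyphen(s)
    """
    separator = " " if space else ""
    fixed = [
        _DASH_PREFIX.sub(lambda m: m.group(1) + separator, line)
        for line in subtitle.split("\n")
    ]
    return "\n".join(fixed)
```

The double hyphen is matched as one unit, so `--weather` becomes `-- weather` (or stays `--weather` with `space=False`) instead of being split into `- -weather`.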

Some of the problem isn't in the OCR, since there are cases where I add hyphens but don't currently check the spacing on other lines in the sub to try to match it.

In any case, do you have a preference? There's no reason for us to keep the subtitle author's choice in this case; it's purely a style decision, not content.