SubExtractor - New Sub Ocr App [Archive] - Page 2

Thunderbolt8

10th October 2011, 11:05

is it possible to adjust positions when SDH stuff gets removed, but only for those subs which have 3 lines & are positioned all over the screen in .ass format? positioning looks a bit strange here after SDH removal, it seems

http://thumbnails37.imagebam.com/15334/0ce862153330427.jpg (http://www.imagebam.com/image/0ce862153330427)

Original:

OFFICER 3: Where'd you get
the beauty scar,
tough guy? Eating pussy?

so not only the first line would need to be fixed in positioning, also the 2nd line is a little bit off for some reason.

but as said, that would only apply to the kind of subs for which I don't have another choice than to keep them this way instead of converting to 2 centred lines in .srt/.ass (unless I want to spend hours on that). not sure if that's possible

Tappen

10th October 2011, 16:42

I've been thinking that I need to do a little better with left-aligned blocks of text. Currently all positioned text in ASS format is centered (at the same point as the original is centered) so depending on what your font looks like compared to the font originally used to generate the SUP bitmaps the left edge will get ragged. Removing some SDH text from a line can make this a lot worse of course since there's less text centered on the same spot.

So a fix would be to identify left aligned blocks of text and create the ASS tags accordingly. Should be possible to implement pretty soon.
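
A minimal sketch of what such a detection could look like, written in Python for illustration rather than taken from SubExtractor's code. It assumes each text line comes with a bounding box from the bitmap; the function name, the box layout and the 4-pixel tolerance are all made up for this example. The ASS tags themselves (\an7 for top-left, \an8 for top-centre, \pos) are standard.

```python
def ass_position_tag(boxes, tolerance=4):
    """Pick an ASS override tag for a block of positioned text.

    boxes: one (left, top, right, bottom) tuple per text line, in video pixels.
    tolerance: hypothetical slack, in pixels, for "same left edge".
    """
    lefts = [b[0] for b in boxes]
    centers = [(b[0] + b[2]) / 2 for b in boxes]
    top = min(b[1] for b in boxes)

    # Left-aligned block: the left edges line up but the centres do not.
    left_aligned = (
        len(boxes) > 1
        and max(lefts) - min(lefts) <= tolerance
        and max(centers) - min(centers) > tolerance
    )
    if left_aligned:
        # \an7 anchors \pos at the top-left corner of the text.
        return f"{{\\an7\\pos({min(lefts)},{top})}}"

    # Otherwise keep centring on the original centre point (\an8 = top-centre).
    center_x = round(sum(centers) / len(centers))
    return f"{{\\an8\\pos({center_x},{top})}}"


print(ass_position_tag([(100, 50, 400, 80), (100, 90, 300, 120)]))
```

With the left edge as the anchor, a different font width or a shortened line (after SDH removal) no longer shifts where the text starts.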

Thunderbolt8

10th October 2011, 17:21

it looks the same with 1080p adjustment unticked btw.

aMvEL

10th October 2011, 17:50

Very nice OCR-tool, Tappen ... It produces the best results of all the applications I've tested (For my use anyway..)

However, is there any way to batch-process idx/sub ripping? With my collection of TV series, I usually rip the subtitles for all episodes in vobsub format when I rip the episodes.
So when I want to convert them to .srt, it becomes a time-consuming process, especially since I need to open every single idx/sub file and re-identify obscure characters like I, l, o and punctuation. I assume that is because when I choose a new idx file it gets processed like a new movie.

It would be nice if I could open all idx/sub-files for a season of a series, and process them in the same ocr-process.

Thunderbolt8

10th October 2011, 19:07

^^that would be a good idea in case of series or different parts of movies which all use the same subs. for all 6 star wars parts, I had to identify more than 200 characters for the letter 'o' -.-

Tappen

10th October 2011, 20:31

aMvEL and Thunderbolt8:

If the input file name is similar it is treated as the same file and the problem characters (o, l, I, etc.) are not OCR'd again.
By similar I mean take the filename without extension, remove all symbols, numbers, hyphens and underscores, then make what's left lower-case.
For example "TV show Season 1 disc 2 epi 3" will match "TV Show Season 3 Disc 5 Epi 17"
So if you are reasonably consistent in file naming you won't have to re-OCR the problem characters.
You can also temporarily rename the files before inputting them (Starwars 1, Starwars 2, etc.) if you know it's going to be a pain and they use the same subtitle authoring fonts. Definitely easier than clicking 'o' 200 times.
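
The matching rule described above can be sketched in a few lines of Python. This is an illustration of the rule as stated in the post, not SubExtractor's actual code; the function name is made up, and dropping spaces along with the symbols is an assumption (it does not change the outcome for the examples given).

```python
import os


def series_key(path):
    """Reduce a file name to the key used to decide 'same subtitle source'.

    Steps from the post: drop the extension, drop symbols, numbers,
    hyphens and underscores, then lower-case what is left.
    Assumption: whitespace is dropped too.
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    return "".join(ch for ch in stem if ch.isalpha()).lower()


a = series_key("TV show Season 1 disc 2 epi 3.idx")
b = series_key("TV Show Season 3 Disc 5 Epi 17.idx")
print(a == b)  # both reduce to the same letters-only key
```

Renaming files to "Starwars 1", "Starwars 2" and so on works for the same reason: every name reduces to the same key.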

I'll look into multi-selection of Sup/IDX files so you get a list. There's already a button on the "Create Subtitle File" page to "OCR Next Encoded Title" but it's only hooked up for multiple tracks on a single DVD that you've run through my parser to create my custom "bin" file format. Should be a pretty easy and useful feature to add.

The "1080p Adjustments" just doubles the size, border, shadow and margins of the font chosen for ASS file creation. Nothing else.
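
As a rough illustration of what that checkbox amounts to (a Python sketch with made-up field names, loosely modelled on the Fontsize, Outline, Shadow and Margin fields of an ASS style; not the program's own code):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AssStyle:
    # Hypothetical subset of an ASS style line.
    font_size: float
    outline: float
    shadow: float
    margin_l: int
    margin_r: int
    margin_v: int


def adjust_for_1080p(style, factor=2):
    """Double size, border, shadow and margins; nothing else changes."""
    return replace(
        style,
        font_size=style.font_size * factor,
        outline=style.outline * factor,
        shadow=style.shadow * factor,
        margin_l=style.margin_l * factor,
        margin_r=style.margin_r * factor,
        margin_v=style.margin_v * factor,
    )


print(adjust_for_1080p(AssStyle(20, 1, 1, 10, 10, 15)))
```

Since only the style is scaled, the \pos coordinates of positioned text are untouched, which fits the observation that the positioning looks the same with the box unticked.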

Tappen

10th October 2011, 20:43

aMvEL:

I support IDX/SUB format but I never use it myself for DVDs. When I do I get the Subs out of sync with video about 25% of the time.

The problem is the way DVDs are written. Often there are discontinuous sections (chapters, cells) of video and the gaps have to be estimated by whatever program is extracting the subtitles to try to match whatever a program like Handbrake does to combine the chapters or cells into a single stream.

On the other hand, if you run SubExtractor on a DVD from the beginning you'll get mpg/d2v/ac3 files which are ready for re-encoding in a program like MeGUI or Handbrake, but also an srt/ass file which syncs to the mpg file exactly because the same program is doing the tricky appending of cells for both video and subtitles. Getting this perfect was my main motivation for creating this program: I wanted to re-encode my 1000+ DVD collection and not have to hand-adjust every subtitle track.

Tappen

11th October 2011, 02:34

Version 1015 now left-aligns instead of center-aligning positioned text in ASS files, and allows multi-select on the Choose Subtitles step to allow batch processing.

Give it a try

aMvEL

11th October 2011, 06:41

The multi-selection of idx-files is working well, however I still have to re-OCR l/I/o and punctuation... which would imply that there still is a need for similar filenames?
I should think that when you choose "OCR next encoded title" you would automatically re-use all ocr from this subtitle/subtitle-selection?

Either way it's an improvement :)

Tappen

11th October 2011, 16:50

I'm worried that many people would use multi-selection just to load up a directory full of unrelated movies they want to OCR. The mess that happens when those problem characters are confused is almost impossible to sort out manually and you end up clearing the OCR database of a large number of matches for a bunch of movies. I could put up a dialog box asking if you want to treat all the files as sharing the same subtitle style and OCR dataset, but that's a question most people wouldn't understand.

Anyway, I think the reliability of the OCR is the best feature of SubExtractor, so I don't want to compromise it. I'll try to think of another way to solve this: one idea is to leave things as they stand but if I find, during OCR, that someone has matched 3 problem characters (alphabetic, not punctuation) in common with another movie I automatically pre-populate the OCR database with the rest of the matches from the other movie. That should cut down on the number of clicks needed per file by more than half.
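
To make the idea concrete, here is a small Python sketch of that pre-population rule. Everything in it is hypothetical: the OCR database is modelled as a plain dict from a glyph identifier to the chosen character, and the function name and threshold parameter are invented for the example.

```python
def maybe_prepopulate(current, others, threshold=3):
    """Borrow another movie's matches once enough alphabetic ones agree.

    current: dict glyph_id -> character chosen so far for this movie.
    others:  dict movie_name -> that movie's glyph_id -> character dict.
    Returns the name of the movie that was borrowed from, or None.
    """
    for name, other in others.items():
        shared = [
            glyph
            for glyph, ch in current.items()
            if ch.isalpha() and other.get(glyph) == ch  # punctuation is ignored
        ]
        if len(shared) >= threshold:
            for glyph, ch in other.items():
                current.setdefault(glyph, ch)  # never overwrite the user's own choice
            return name
    return None


current = {"g1": "o", "g2": "l", "g3": "I", "g9": "."}
others = {"Movie A": {"g1": "o", "g2": "l", "g3": "I", "g4": "e", "g9": ","}}
print(maybe_prepopulate(current, others), current)
```

Requiring agreement on alphabetic glyphs before borrowing anything is what keeps unrelated movies from polluting each other's matches.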

aMvEL

11th October 2011, 18:41

Yes, I see how that could become a problem ....

You could add it as an option selectable for advanced users though, or something to that effect... But as you say, you shouldn't compromise the reliability of the OCR-process.

EDIT:
I seem to have mistaken an upper-case 'Z' for a lower-case 'z' somewhere during an OCR. How can I remove it from the dictionary, without deleting everything?

Tappen

15th October 2011, 02:11

Open one of the files that resulted in the problem and let the OCR run to completion normally. Then hit "Review and Correct OCR Matches" button in bottom right. Open the "OCR Training" drop-down list and select the Z that's the problem and hit "Remove a Training". If you're not sure which one it is delete all the Zs in the list one at a time. When you hit "Done" the OCR will restart and you can choose the correct matches this time around. This will fix it for all future movies (or ones you re-run) as well.

Thunderbolt8

15th October 2011, 14:17

in subtitles in which SDH stuff goes over two lines, those lines remain. this is not a problem, but when trying to search for such lines via the bracket symbols () while "exact position every line" is also ticked, the search finds those symbols in every line, because they're part of the positioning.

so is there maybe another way to look for those () symbols in this situation? because having to look through 4000 lines manually could take quite a bit of time.

Thunderbolt8

15th October 2011, 16:34

sometimes there's a strange mixup with I and l right in the middle of a sub. I and l are recognized fine, but from a certain point on, I is mistaken for l in many words (and also the other way round). after deleting and rechecking all OCR'ed I and l chars, there's no mistake to be seen. I'm wondering why this is. do I and l share the same character from some point on? why not before? or is the same character being used for both I and l? and why then the decision to go for I and not l?

Tappen

15th October 2011, 19:22

Thunderbolt8 Question 1: I haven't seen SDH stuff that goes over 2 lines. It might be possible to remove this text in SubExtractor by looking for an unmatched '(' or '[' on 1 line and an unmatched ')' or ']' on the next line but I'd need a sample to test with. I can't change the ASS tag syntax to help you with this though, \pos(x,y) is how it has to be.
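
A sketch of the unmatched-bracket idea in Python, for illustration only. It is not tested against real subtitle files, the function names are invented, and the sample lines in the usage example are made up.

```python
PAIRS = {"(": ")", "[": "]"}


def _split_span(line, nxt):
    """Return (head, tail) if an SDH span opens on `line` and closes on `nxt`."""
    for opener, closer in PAIRS.items():
        start = line.rfind(opener)
        end = nxt.find(closer)
        if start == -1 or end == -1:
            continue
        if closer in line[start:] or opener in nxt[:end]:
            continue  # the bracket is already matched on its own line
        return line[:start].rstrip(), nxt[end + 1:].lstrip()
    return None


def strip_two_line_sdh(lines):
    """Drop SDH text that starts on one line and ends on the next."""
    out, i = [], 0
    while i < len(lines):
        span = _split_span(lines[i], lines[i + 1]) if i + 1 < len(lines) else None
        if span is None:
            out.append(lines[i])
            i += 1
        else:
            out.extend(part for part in span if part)  # keep any real dialogue around it
            i += 2
    return out


print(strip_two_line_sdh(["(DOOR SLAMS", "IN THE DISTANCE)", "Who's there?"]))
```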

Thunderbolt8 Question 2: I just assume I and l are the same character during OCR (because the problem is so common) and sort them out in the spell-checking step if there's any doubt. By doubt I mean unless it's in the middle or end of a word and there are other reliably lower-case characters on either side, in which case I can safely assume it's an 'l' and not an 'I'. If you have a bunch of words with l and I mixed up in the final output I'd guess there are some incorrect words in the "l and I Spelling" word list (3rd tab in Options). Can you find these words, remove them in Options, and re-run the OCR for the messed up subs? If there's still a problem please put a sample on a file sharing site and send me a link so I can find and fix the bug.
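
The "safe to assume it's an l" rule can be sketched like this in Python. It is one reading of the description above, not the program's code: "reliably lower-case" is taken to mean any lower-case letter other than l itself, and a letter at the end of a word only needs a reliable neighbour on its left. The function names are invented.

```python
def is_reliably_lower(ch):
    """A lower-case letter that cannot itself be a misread 'I'."""
    return ch.islower() and ch != "l"


def guess_ambiguous(word, i):
    """Return 'l' when the I/l glyph at word[i] is safely an 'l', else None.

    None means "in doubt": leave it to the l and I spelling list or ask the user.
    """
    if word[i] not in "Il":
        raise ValueError("position does not hold an I/l glyph")
    if i == 0:
        return None  # word-initial: could be a capital I
    before_ok = is_reliably_lower(word[i - 1])
    after_ok = i == len(word) - 1 or is_reliably_lower(word[i + 1])
    return "l" if before_ok and after_ok else None


print(guess_ambiguous("heIp", 2), guess_ambiguous("Iove", 0))
```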

Thunderbolt8

15th October 2011, 19:54

found out that problem 1 is actually easy to solve, simply by searching for }( instead of just (
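
The same trick as a regular expression, in case someone wants to script the search (a Python sketch; the Dialogue lines in the example are made up, only the {\pos(x,y)} tag format comes from the thread):

```python
import re

# A bracket that directly follows the closing brace of an override block,
# so the "(" inside \pos(x,y) itself is never matched.
SDH_AFTER_TAGS = re.compile(r"\}\s*[(\[]")


def lines_with_leading_sdh(ass_lines):
    """Return 1-based numbers of lines whose text starts with ( or [."""
    return [n for n, line in enumerate(ass_lines, 1) if SDH_AFTER_TAGS.search(line)]


sample = [
    "Dialogue: 0,0:00:01.00,0:00:03.00,Default,,0,0,0,,{\\pos(320,400)}Where'd you get",
    "Dialogue: 0,0:00:04.00,0:00:06.00,Default,,0,0,0,,{\\pos(320,440)}(Loudspeakers) 'Martinez.'",
]
print(lines_with_leading_sdh(sample))
```

Like the manual search, this only catches brackets at the very start of the text.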

I have never used the spell checking option in this tool. does it work on a case-by-case basis, or like the OCRing process, where some words can be saved & won't turn up again next time?

Tappen

15th October 2011, 22:32

The spell checking is only there to fix the l vs. I problem. It's not at all a full spell-check with a real dictionary, so it tends to go very fast. You should be using it as part of every OCR, then do a real spell check with another program afterwards if you want. Basically all it does is ask which is the right spelling for words that have l or I in them where the choice isn't obvious. Typically that's just a word or 2 per movie since I've already entered over 1000 words in the list that the database starts with.

Thunderbolt8

16th October 2011, 02:02

but those spell checking changes I make at this stage for I & l get added to the database, yes? otherwise, I could do it in aegisub just as well.

Tappen

16th October 2011, 02:08

Yes, your choices get added to the database so it won't ask about the same word twice. Unlike Aegisub the l & I stage will auto-correct the words without stopping if they're in the database.

Also the "l & I" tab in Options is there if you make a mistake and need to correct it.

I also use Aegisub to spell-check afterwards. That's the reason I put the "Edit Subtitle File" button on the Create Subtitle page: if I find there's a consistent spacing error around one or more of the characters I can close the Aegisub window, go to the "Advanced Word Spacing" page to tweak things, hit "Previous", re-Create the sub file and open Aegisub again for another try in just a couple of clicks.

I should mention there's 1 exception: the words Al and AI are both quite common so the choice you pick only applies for a single movie and doesn't go into the database. Just hope you don't get a movie about artificial intelligence that also has a guy named "Al" in it, haha.
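
For illustration, the behaviour described in this post could be modelled roughly as below. This is a Python sketch with invented names; how SubExtractor actually stores the word list is not stated in the thread.

```python
def normalise(word):
    """Collapse I and l so both OCR readings share one lookup key."""
    return word.replace("I", "l")


class LiSpelling:
    # "Al" and "AI" collapse to the same key and are only remembered per movie.
    PER_MOVIE_ONLY = {normalise("Al")}

    def __init__(self, database=None):
        self.database = dict(database or {})  # persists across movies
        self.movie_choices = {}               # reset for every movie

    def start_movie(self):
        self.movie_choices.clear()

    def lookup(self, word):
        """Return the stored spelling, or None if the user must be asked."""
        key = normalise(word)
        return self.movie_choices.get(key) or self.database.get(key)

    def remember(self, word_as_ocrd, chosen):
        key = normalise(word_as_ocrd)
        target = self.movie_choices if key in self.PER_MOVIE_ONLY else self.database
        target[key] = chosen


s = LiSpelling()
s.remember("lllegal", "Illegal")
print(s.lookup("IIIegal"))
```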

Thunderbolt8

16th October 2011, 13:08

some examples don't seem to get picked up by the spellchecker though, e.g. 'l'm --> I'm, with ' at the beginning as a sign of speech etc. and misspelled with l instead of I. or 'lllogical or 'l've

got a subtitle example here which is

-C'mon!
- (Loudspeakers) 'Martinez.'

which gets changed to

- C'mon!
'Martinez.'

with SDH removal, missing out the '-' at the beginning of the 2nd line.

and a funny one:

- We're just gonna wheel right by 'em (!)
- We gonna try brother.

with SDH removal, the (!) gets removed :D

Tappen

16th October 2011, 17:41

Spellchecker issues:

I should really drop the ' and spell-check the rest of the word normally. Fixed in 1016.

SDH issues:

The first two are already fixed - I found them in my own testing. Fixed in 1016.

The 3rd issue is what I'd expect. Actually I think in this case the (!) really is SDH text. It's signalling an emotional tone of voice that someone deaf or hard-of-hearing wouldn't catch. Maybe I should replace a (!) or (?) with a . (period) if there are other lines that end with a period?
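
A quick Python sketch of how that replacement might behave, using the example from the post above. It is only one possible reading of the idea: the function name is made up, and the extra guard against doubling up punctuation is an assumption, not something discussed in the thread.

```python
import re

# A trailing (!) or (?) tone marker, with any whitespace around it.
MARKER = re.compile(r"\s*\((?:!|\?)\)\s*$")


def replace_tone_marker(lines):
    """Drop a trailing (!) or (?) and add a period if the subtitle uses periods."""
    uses_periods = any(line.rstrip().endswith(".") for line in lines)
    out = []
    for line in lines:
        stripped = MARKER.sub("", line)
        if stripped != line:
            # Assumption: do not add a period after existing end punctuation.
            if uses_periods and not stripped.endswith((".", "!", "?")):
                stripped += "."
            line = stripped
        out.append(line)
    return out


print(replace_tone_marker([
    "- We're just gonna wheel right by 'em (!)",
    "- We gonna try brother.",
]))
```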

Thunderbolt8

16th October 2011, 18:13

guess that would be ok

btw. what happens if a character is present twice, in your built in character map and also in that one stored in the user dir? because your file gets updated as well with every new version, so there is no internal comparison of those databases in this situation. which one is then being used in case a character is present in both databases?

Tappen

16th October 2011, 18:24

I only use my database if you don't have one already. There is no merging, just a file copy if I find you have no starting database at all, so you only get my database on a first install or if you deleted your database by hand.
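
In other words, something along these lines (a Python sketch of the copy-if-missing logic; the function name and both paths are placeholders, not the program's real file locations):

```python
import os
import shutil


def ensure_user_database(bundled_path, user_path):
    """Copy the shipped database only when the user has none. No merging.

    Returns True if a copy was made, False if the user's file was left alone.
    """
    if os.path.exists(user_path):
        return False  # the user's own training always wins
    os.makedirs(os.path.dirname(user_path) or ".", exist_ok=True)
    shutil.copyfile(bundled_path, user_path)
    return True
```

So a character present in both files is never in conflict: once a user database exists, the bundled one is not consulted at all.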

1 thing - I messed up the initial 1016 release - forgot something. So re-download if you got it between 5 and 15 minutes before the time of this message. The real 1016 is attached to changeset 10586.

loekverhees

18th October 2011, 21:02

This is by far the best subtitle OCR program I've ever used! Thanks a lot Tappen! Though I found one thing that was quite annoying: I used the 'Manually Enter Character' feature because typing is much faster than clicking the right character. But every time I typed the correct character, I had to grab my mouse and click the 'Normal' button. It would be much, much easier and faster if one could simply hit the Return key instead of clicking the 'Normal' button all the time. Normally there are far fewer italic subtitles, so this shouldn't be a big problem (if one wants to enter italic characters, one just needs to use the mouse again to click the 'Italic' button). Maybe even better would be a check box that one can tick if the characters are italic (because most of the time there are multiple italic characters in a row, and when the characters go back to normal, one simply needs to untick the check box again).

Thunderbolt8

18th October 2011, 21:45

clicking the button or return key should be the same. if you have your right hand on your mouse which is hovering above the normal button, you only need to click

left hand for the corresponding letter, right hand does only do the clicks. same as pressing enter.

Tappen

19th October 2011, 15:12

loekverhees: When the keyboard focus is in the "Manual Enter" textbox I should be able to make whichever button was last chosen the "Windows Default" button, either Normal or Italic. That means it will have a thicker black border around it and be pressed automatically by Windows whenever the Enter key is hit without changing the focus. It's a good suggestion and I will look into it.

Thunderbolt8 is also right though: I can type characters with my left hand and keep my right hand on the mouse pretty fast. Personally, since I've used it so much, I have the character selector layout memorized and can play "whack-a-letter" very fast now, much faster than manual entry. But regardless, I'll look into adding that feature.

Thunderbolt8

19th October 2011, 23:03

what is the scale image option actually good for?

Tappen

20th October 2011, 02:21

All the subs in DVDs and Blurays are bitmaps in various forms, and they have different sizes. Sometimes a subtitle bitmap fills the whole video frame, with most of the pixels transparent, and sometimes it's just a small rectangle in one part of the video. Depends on the authoring tool the disc creator used.

With "Scale ..." unchecked, you see just the bitmap, scaled up or down to fill the orange window that takes up most of that step. The subs can look really distorted if their bounding rectangle is very small or has a strange aspect ratio.

When you check "Scale..." I show the subtitle as it would appear on the final video, at the correct x and y coordinates, and scaled as if the orange window was the full video size.

I prefer it checked. If there's something I think went wrong I like to compare what I see on the video player with what I see in SubExtractor in an apples to apples comparison. But it's up to you. The checked or unchecked state is remembered like an option, but doesn't change the OCR or final output steps in the slightest.
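
As a rough picture of the two display modes (a Python sketch; the function name and argument layout are invented, and the unchecked case is assumed to stretch the bitmap over the whole window, which would explain the distortion mentioned above):

```python
def preview_rect(bitmap_size, origin, video_size, window_size, scale_checked):
    """Return (x, y, width, height) of the subtitle inside the preview window.

    bitmap_size: (w, h) of the subtitle bitmap in video pixels.
    origin:      (x, y) of the bitmap's top-left corner on the video.
    """
    win_w, win_h = window_size
    if not scale_checked:
        # Assumption: the bitmap alone is stretched to fill the window.
        return (0, 0, win_w, win_h)
    vid_w, vid_h = video_size
    sx, sy = win_w / vid_w, win_h / vid_h
    bmp_w, bmp_h = bitmap_size
    x, y = origin
    # Treat the window as a scaled-down copy of the full video frame.
    return (x * sx, y * sy, bmp_w * sx, bmp_h * sy)


print(preview_rect((200, 50), (860, 900), (1920, 1080), (960, 540), True))
```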

Tappen

20th October 2011, 04:41

loekverhees: I changed manual entry to work better with the Enter key in 1017. Give it a try and let me know what you think.

loekverhees

22nd October 2011, 13:28

I have tried the 1.0.1.8 version and it works perfectly! Now I can OCR the subtitles really quickly (as I'm used to typing with 10 fingers). Thank you!

sl1pkn07

22nd October 2011, 18:40

Is it possible to make a version for Linux?

greetings

Tappen

22nd October 2011, 19:53

I could try compiling it in Mono (the Linux C# development environment). Or someone who has experience in Mono would probably be able to do it pretty quickly. Anyone interested in helping send me a note.

Since Windows 7 came out I've sort of lost interest in Linux on the desktop and don't have a machine or even a virtual image of Ubuntu to build and test with.

Thunderbolt8

23rd October 2011, 13:04

another small thing related to SDH removal:

(GUNSHOT)
LESTER: Lotte, no!

gets changed to

-Lotte, no!

while it should be just

Lotte, no!

because the first SDH line contains only SDH as a sound indication, but no actual person speaking. so would it be possible to add this kind of recognition as well, or would this break something with the other kind of SDH removal and '-' adding if someone does indeed speak, indicated by (), [] or : ?

and another thing:

Oh!
LOTTE: Oh, God.

gets changed to

Oh!
Oh, God.

while it should be

Oh!
- Oh, God.

I can see why the '-' is not added, because there's no such indication for the first line being another speaker. so the question is whether it would be possible to handle this case as well without breaking anything, or, if it would break something, whether it could be considered to just add another '-' to the first line as well?

sample: http://www.mediafire.com/?zeda517adw3uz2d

nautilus7

23rd October 2011, 13:17

while it should be

Oh!
- Oh, God.

Is this correct? I believe it should be:

- Oh!
- Oh, God.

to indicate 2 different persons are speaking.

Thunderbolt8

23rd October 2011, 13:23

the above is indeed correct (at least according to the SUP file), because the first person spoke some more lines before this change in dialogue. so by watching that scene you'd be able to tell that the first line was another speaker, but maybe not from seeing that picture only. some subs do indeed have this kind of presentation. but if it's not possible to differentiate between this and other cases of SDH removal and '-' adding (as said above, I edited my post), then maybe changing it automatically to the other case could be a solution.

nautilus7

23rd October 2011, 13:52

Yes, I've seen the case you refer to, but I still think that the correct way to present 2 different speakers is to put a "-" for each one anyway.

Tappen

23rd October 2011, 14:26

Certainly if a line is entirely removed it shouldn't be counted in considering whether there's more than 1 speaker. I'll fix that.

The 2nd case is more interesting. Does a line without any SDH text count as a different speaker if there's another line anywhere on the subtitle which does have some SDH and wasn't completely removed? I suspect it does.

I'll change the code and run some of the test cases and see what the results are before I check in a change.
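
One possible rule covering both cases, sketched in Python. This is a guess at a workable rule, not what SubExtractor does: the regular expressions, the function name and the choice to only add a dash to lines that lost a speaker label are all assumptions made for the example.

```python
import re

SOUND = re.compile(r"\([^)]*\)|\[[^\]]*\]")              # (GUNSHOT), [ Whimpering ]
SPEAKER = re.compile(r"^\s*-?\s*[A-Z][A-Z0-9 .']*:\s*")  # LESTER:, OFFICER 3:


def remove_sdh(lines):
    """Strip SDH text; lines that vanish entirely do not count as speakers."""
    survivors = []
    for line in lines:
        text = SOUND.sub("", line)
        text, labelled = SPEAKER.subn("", text)
        text = text.strip()
        if text:
            survivors.append((text, bool(labelled)))
    if len(survivors) < 2:
        return [text for text, _ in survivors]  # a single speaker needs no dash
    return [
        "- " + text if labelled and not text.startswith("-") else text
        for text, labelled in survivors
    ]


print(remove_sdh(["(GUNSHOT)", "LESTER: Lotte, no!"]))
print(remove_sdh(["Oh!", "LOTTE: Oh, God."]))
```

On the two samples above this yields a bare "Lotte, no!" and a dash only on "Oh, God.", which is the output asked for; whether it holds up on the other test cases is a separate question.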

Thunderbolt8

23rd October 2011, 15:12

Yes, I've seen the case you refer to, but I still think that the correct way to present 2 different speakers is to put a "-" for each one anyway.
in general I wouldn't mind, but I don't agree in all cases

e.g. when the first line is spoken by an outside narrator and maybe also presented in italics, then imho it wouldn't look as fitting with the '-' as without (can't seem to center the 2nd line here though)

...and the princess ran home as fast as she could
- Mother!

imho this fits better than

- ...and the princess ran home as fast as she could
- Mother!

because in this situation the narrator is not a person inside the story and the '-' expresses a more immediate presence to me.

nautilus7

23rd October 2011, 18:44

Maybe you're right. I won't insist. It's a minor issue anyway.

Thunderbolt8

23rd October 2011, 19:25

I think the most important thing is to keep things working. so if it can be implemented like this, fine. but if not, then I also won't mind having it changed as suggested.

Thunderbolt8

23rd October 2011, 20:50

moar: http://www.mediafire.com/?ugbgtacinxlx3b8

- ♪ Is mighty chilly♪
- [ Whimpering Continues ]

gets changed to

- ♪ Is mighty chilly♪

while it should be

♪ Is mighty chilly♪

;)

edit: might be the same case as in #85 though -.-

Tappen

23rd October 2011, 23:21

- ♪ Is mighty chilly♪
- [ Whimpering Continues ]

gets changed to

- ♪ Is mighty chilly♪

while it should be

♪ Is mighty chilly♪

This just seems like the subtitle creators are trying to make my life hard. If there's a - in the original text I don't think I can reliably remove it without causing more problems than I solve.

Thunderbolt8

23rd October 2011, 23:30

have you stored all those recent samples I uploaded?

when you have a sample for each different case we had so far and note down the line in which the typical feature of each sample occurs, then it should be rather easy to test all your samples after each change and you can see if anything breaks.

if there is anything you cannot implement without breaking other stuff, maybe it's a good idea to collect all those cases with examples in a separate post, so that a user always knows what he has to look out for by himself if he wants to reliably get rid of SDH.

Tappen

24th October 2011, 07:27

I do have all the test cases. It's sort of amazing how you guys keep finding and documenting some pretty rare bugs in the SDH removal. When I'm having a particularly clear-headed day this week I'll try to improve the multiple speaker code. I also need to consider how much of the code is common between SRT and ASS creation, since positioning on the screen can indicate multiple speakers in ASS format.

Honestly I'm probably more interested in adding 2 new features at this point: 1. multiple character sets (Greek, Cyrillic, User Custom 1 and 2 is my first thought) along with multiple OCR databases and 2. localization (easy UI translation to other languages) to the program.

nautilus7

24th October 2011, 11:40

Looking forward to these. You can count me in for Greek translation.

aMvEL

24th October 2011, 16:32

I'm trying to OCR a vobsub, but I get an error: "No Subtitle Found".
It is working when using SubRip.

Example: http://www.mediafire.com/?bl3brwq58umv5be

Tappen

25th October 2011, 01:29

aMvEL: your idx file had an extra piece of data in every subtitle timestamp line that I wasn't parsing correctly (because I'd never seen it before). 1019 fixes the problem.

Chetwood

25th October 2011, 06:31

Yes, I've seen the case you refer to, but I still think that the correct way to present 2 different speakers is to put a "-" for each one anyway.
Apparently the professional sub studios see it differently. I've seen various official DVD subs lately that do make this distinction only on line 2 when the new person starts to talk. There however seems to be no consensus about whether a blank should follow the dash or not. Some subs have them, some don't.