Old 11th October 2011, 06:41   #61  |  Link
aMvEL
Registered User
 
Join Date: Dec 2008
Posts: 30
The multi-selection of idx-files is working well, however I still have to re-OCR l/I/o and punctuation... which would imply that there is still a need for similar filenames?
I should think that when you choose "OCR next encoded title" you would automatically re-use all OCR from this subtitle/subtitle-selection?

Either way, it's an improvement.
Old 11th October 2011, 16:50   #62  |  Link
Tappen
Registered User
 
Join Date: Dec 2006
Posts: 196
I'm worried that many people would use multi-selection just to load up a directory full of unrelated movies they want to OCR. The mess that happens when those problem characters are confused is almost impossible to sort out manually and you end up clearing the OCR database of a large number of matches for a bunch of movies. I could put up a dialog box asking if you want to treat all the files as sharing the same subtitle style and OCR dataset, but that's a question most people wouldn't understand.

Anyway, I think the reliability of the OCR is the best feature of SubExtractor, so I don't want to compromise it. I'll try to think of another way to solve this: one idea is to leave things as they stand but if I find, during OCR, that someone has matched 3 problem characters (alphabetic, not punctuation) in common with another movie I automatically pre-populate the OCR database with the rest of the matches from the other movie. That should cut down on the number of clicks needed per file by more than half.
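
A rough sketch of the pre-population idea described above, not SubExtractor's actual code. It assumes each movie's OCR data is a plain dict from a glyph id to the matched character; the names `PROBLEM_CHARS`, `shared_problem_matches` and `maybe_prepopulate` are made up for illustration, and the set of problem characters is a guess based on the l/I/o mentioned earlier in the thread.

```python
# Hypothetical set of alphabetic "problem" characters (assumption, not from SubExtractor).
PROBLEM_CHARS = set("lIoO")


def shared_problem_matches(current, other):
    """Count problem glyphs that both movies matched to the same character."""
    return sum(
        1
        for glyph, char in current.items()
        if char in PROBLEM_CHARS and other.get(glyph) == char
    )


def maybe_prepopulate(current, other, threshold=3):
    """Copy the other movie's matches once enough problem characters agree."""
    if shared_problem_matches(current, other) < threshold:
        return dict(current)
    merged = dict(other)
    merged.update(current)  # choices already made for the current movie win
    return merged
```

With a threshold of 3, two agreeing problem characters change nothing, while a third one pulls in the remaining matches (punctuation included) from the other movie.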

Last edited by Tappen; 11th October 2011 at 16:57.
Old 11th October 2011, 18:41   #63  |  Link
aMvEL
Registered User
 
Join Date: Dec 2008
Posts: 30
Yes, I see how that could become a problem ....

You could add it as an option selectable for advanced users though, or something to that effect... But as you say, you shouldn't compromise the reliability of the OCR-process.

EDIT:
I seem to have mistaken an upper-case 'Z' for a lower-case 'z' somewhere during an OCR. How can I remove it from the dictionary without deleting everything?

Last edited by aMvEL; 14th October 2011 at 23:47. Reason: added question
Old 15th October 2011, 02:11   #64  |  Link
Tappen
Registered User
 
Join Date: Dec 2006
Posts: 196
Open one of the files that resulted in the problem and let the OCR run to completion normally. Then hit the "Review and Correct OCR Matches" button in the bottom right. Open the "OCR Training" drop-down list, select the Z that's the problem and hit "Remove a Training". If you're not sure which one it is, delete all the Zs in the list one at a time. When you hit "Done" the OCR will restart and you can choose the correct matches this time around. This will fix it for all future movies (or ones you re-run) as well.
Old 15th October 2011, 14:17   #65  |  Link
Thunderbolt8
Registered User
 
Join Date: Sep 2006
Posts: 2,144
in case of subtitles in which SDH stuff goes over two lines, those lines remain. this is not a problem, but when searching for such lines with the bracket symbols () while "exact position every line" is also ticked, the search finds those symbols in every line, because they're part of the positioning tag.

so is there maybe another way to look for those () symbols in this situation? because having to look through 4000 lines manually could take quite a bit of time.
__________________
Laptop Acer Aspire V3-772g: i7-4202MQ, 8GB Ram, NVIDIA GTX 760M (+ Intel HD 4600), Windows 8.1 x64, madVR (x64), MPC-HC (x64), LAV Filter (x64), XySubfilter (x64)
Old 15th October 2011, 16:34   #66  |  Link
Thunderbolt8
Registered User
 
Join Date: Sep 2006
Posts: 2,144
sometimes there's a strange mix-up of I and l right in the middle of a sub. I and l are recognized fine at first, but from some point on, I is mistaken for l in many words (and also the other way round). after deleting and rechecking all the OCR'ed I and l chars, there's no mistake to be found. I'm wondering why this is. do I and l share the same character from that point on? why not before? or is the same character being used for both I and l? and if so, why does it decide on I and not l?
Old 15th October 2011, 19:22   #67  |  Link
Tappen
Registered User
 
Join Date: Dec 2006
Posts: 196
Thunderbolt8 Question 1: I haven't seen SDH stuff that goes over 2 lines. It might be possible to remove this text in SubExtractor by looking for an unmatched '(' or '[' on 1 line and an unmatched ')' or ']' on the next line but I'd need a sample to test with. I can't change the ASS tag syntax to help you with this though, \pos(x,y) is how it has to be.
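
To make the two-line idea concrete, here is a minimal sketch of how an unmatched opening bracket on one line and an unmatched closing bracket on the next could be detected and cut out. It is only an illustration under my own assumptions (plain text lines with the ASS tags already stripped, and a made-up function name `strip_two_line_sdh`), not what SubExtractor does.

```python
# Bracket pairs that can enclose SDH text.
PAIRS = {"(": ")", "[": "]"}


def strip_two_line_sdh(first, second):
    """Remove SDH text that opens on `first` and closes on `second`."""
    for open_ch, close_ch in PAIRS.items():
        start = first.rfind(open_ch)
        end = second.find(close_ch)
        # The opener must have no closer after it on the first line.
        opened = start != -1 and close_ch not in first[start:]
        # The closer must have no opener before it on the second line.
        closed = end != -1 and open_ch not in second[:end]
        if opened and closed:
            return first[:start].rstrip(), second[end + 1:].lstrip()
    return first, second
```

Lines whose brackets are already balanced on a single line are returned unchanged, so this pass could run after the normal single-line SDH removal.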

Thunderbolt8 Question 2: I just assume I and l are the same character during OCR (because the problem is so common) and sort them out in the spell-checking step if there's any doubt. By doubt I mean unless it's in the middle or end of a word and there are other reliably lower-case characters on either side, in which case I can safely assume it's an 'l' and not an 'I'. If you have a bunch of words with l and I mixed up in the final output I'd guess there are some incorrect words in the "l and I Spelling" word list (3rd tab in Options). Can you find these words, remove them in Options, and re-run the OCR for the messed up subs? If there's still a problem please put a sample on a file sharing site and send me a link so I can find and fix the bug.
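
The "no doubt" rule described above could look roughly like the sketch below. The function names are hypothetical and the exact rule in SubExtractor may differ; here a character counts as a safe 'l' only when it is not the first letter and every neighbour it has is a lower-case letter other than l itself.

```python
AMBIGUOUS = {"I", "l"}


def is_reliable_lower(ch):
    """Lower-case letters other than 'l' are never confused with 'I'."""
    return ch.islower() and ch not in AMBIGUOUS


def resolve_word(word):
    """Return (word with safe fixes applied, True if any doubt remains)."""
    chars = list(word)
    doubtful = False
    for i, ch in enumerate(chars):
        if ch not in AMBIGUOUS:
            continue
        left = word[i - 1] if i > 0 else ""
        right = word[i + 1] if i + 1 < len(word) else ""
        if left and is_reliable_lower(left) and (not right or is_reliable_lower(right)):
            chars[i] = "l"  # middle or end of a word, lower-case neighbours
        else:
            doubtful = True  # left for the word list or the user
    return "".join(chars), doubtful
```

Words that come back flagged as doubtful are the ones that would then be looked up in the "l and I Spelling" list or shown to the user.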

Last edited by Tappen; 15th October 2011 at 19:25.
Old 15th October 2011, 19:54   #68  |  Link
Thunderbolt8
Registered User
 
Join Date: Sep 2006
Posts: 2,144
found out that problem 1 is actually easy to solve, simply by searching for }( instead of just (
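
for anyone who prefers a script over the editor's search box, the same idea can be sketched in a few lines: drop the {...} override blocks first, then look for brackets in what is left. this is just an illustration; the sample lines and the \pos values are made up, and `has_bracket_in_text` is a hypothetical helper name.

```python
import re

# An ASS override block such as {\pos(960,1000)}.
OVERRIDE_BLOCK = re.compile(r"\{[^}]*\}")


def has_bracket_in_text(text):
    """True if the visible subtitle text (tags removed) contains ( or )."""
    visible = OVERRIDE_BLOCK.sub("", text)
    return "(" in visible or ")" in visible


lines = [
    r"{\pos(960,1000)}(Loudspeakers) 'Martinez.'",
    r"{\pos(960,1000)}We gonna try brother.",
]
hits = [line for line in lines if has_bracket_in_text(line)]
```

unlike searching for }( this also catches brackets that appear in the middle of a line.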

I have never used the spell checking option in this tool. does it work on a case-by-case basis, or like the OCRing process, where some words can be saved & won't turn up again next time?
Old 15th October 2011, 22:32   #69  |  Link
Tappen
Registered User
 
Join Date: Dec 2006
Posts: 196
The spell checking is only there to fix the l vs. I problem. It's not at all a full spell-check with a real dictionary, so it tends to go very fast. You should be using it as part of every OCR, then do a real spell check with another program afterwards if you want. Basically all it does is ask which is the right spelling for words that have l or I in them where the choice isn't obvious. Typically that's just a word or 2 per movie since I've already entered over 1000 words in the list that the database starts with.
Old 16th October 2011, 02:02   #70  |  Link
Thunderbolt8
Registered User
 
Join Date: Sep 2006
Posts: 2,144
but those spell checking changes I make at this stage for I & l get added to the database, yes? otherwise, I could do it in aegisub just as well.

Last edited by Thunderbolt8; 16th October 2011 at 02:04.
Old 16th October 2011, 02:08   #71  |  Link
Tappen
Registered User
 
Join Date: Dec 2006
Posts: 196
Yes, your choices get added to the database so it won't ask about the same word twice. Unlike Aegisub the l & I stage will auto-correct the words without stopping if they're in the database.

Also the "l & I" tab in Options is there if you make a mistake and need to correct it.

I also use Aegisub to spell-check afterwards. That's the reason I put the "Edit Subtitle File" button on the Create Subtitle page: if I find there's a consistent spacing error around one or more of the characters I can close the Aegisub window, go to the "Advanced Word Spacing" page to tweak things, hit "Previous", re-Create the sub file and open Aegisub again for another try in just a couple of clicks.

I should mention there's 1 exception: the words Al and AI are both quite common so the choice you pick only applies for a single movie and doesn't go into the database. Just hope you don't get a movie about artificial intelligence that also has a guy named "Al" in it, haha.
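
As a rough picture of that behaviour (remembered choices are applied silently, except Al/AI which only lasts for one movie), here is a small sketch. The class and method names are invented for illustration and the real database format is not shown in this thread.

```python
def normalize(word):
    """Treat I and l as the same character, as the OCR step does."""
    return word.replace("I", "l")


class SpellingChoices:
    def __init__(self, database=None):
        self.database = dict(database or {})  # persists across movies
        self.movie_only = {}                  # cleared for every new movie

    def lookup(self, word):
        """Return the remembered spelling, or None if the user must be asked."""
        key = normalize(word)
        return self.movie_only.get(key) or self.database.get(key)

    def remember(self, word, chosen):
        key = normalize(word)
        if key == "Al":  # Al vs. AI: both are common, so keep it per movie
            self.movie_only[key] = chosen
        else:
            self.database[key] = chosen

    def start_new_movie(self):
        self.movie_only.clear()
```

A choice for "heIp" survives `start_new_movie()`, while a choice between Al and AI is forgotten and asked again for the next movie.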

Last edited by Tappen; 16th October 2011 at 02:19.
Old 16th October 2011, 13:08   #72  |  Link
Thunderbolt8
Registered User
 
Join Date: Sep 2006
Posts: 2,144
some examples don't seem to get picked up by the spellchecker though, e.g. 'l'm --> 'I'm, with ' at the beginning as a sign of speech etc. and misspelled with l instead of I. or 'lllogical or 'l've



got a subtitle example here which is

-C'mon!
- (Loudspeakers) 'Martinez.'

which gets changed to

- C'mon!
'Martinez.'

with SDH removal, missing out the '-' at the beginning of the 2nd line.


and a funny one:

- We're just gonna wheel right by 'em (!)
- We gonna try brother.

with SDH removal, the (!) gets removed

Last edited by Thunderbolt8; 16th October 2011 at 13:21.
Old 16th October 2011, 17:41   #73  |  Link
Tappen
Registered User
 
Join Date: Dec 2006
Posts: 196
Spellchecker issues:

I should really drop the ' and spell-check the rest of the word normally. Fixed in 1016.

SDH issues:

The first two are already fixed - I found them in my own testing. Fixed in 1016.

The 3rd issue is what I'd expect. Actually I think in this case the (!) really is SDH text. It's signalling an emotional tone of voice that someone deaf or hard-of-hearing wouldn't catch. Maybe I should replace a (!) or (?) with a . (period) if there are other lines that end with a period?
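
For illustration only, the two SDH behaviours discussed here (keeping the leading '-' and optionally turning a trailing (!) or (?) into a period) could be sketched like this. The regular expressions and the `remove_sdh` name are assumptions for the example, not the 1016 code.

```python
import re

SDH = re.compile(r"\s*[\(\[][^\)\]]*[\)\]]\s*")   # (text) or [text]
TONE = re.compile(r"\s*\((?:!|\?)\)\s*$")         # trailing (!) or (?)
DASH = re.compile(r"\s*-\s*")                     # leading speaker dash


def remove_sdh(line, others_end_with_period=False):
    """Strip SDH text but keep the leading dash of a dialogue line."""
    dash = DASH.match(line)
    body = line[dash.end():] if dash else line
    # A tone marker becomes a period only if the sub style uses periods.
    body = TONE.sub("." if others_end_with_period else "", body)
    body = SDH.sub(" ", body).strip()
    if not body:
        return ""
    return ("- " if dash else "") + body
```

With this sketch the Martinez line keeps its dash, and the "wheel right by 'em (!)" line ends in a period when `others_end_with_period` is set.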

Last edited by Tappen; 16th October 2011 at 18:10.
Old 16th October 2011, 18:13   #74  |  Link
Thunderbolt8
Registered User
 
Join Date: Sep 2006
Posts: 2,144
guess that would be ok

btw. what happens if a character is present twice, in your built-in character map and also in the one stored in the user dir? your file gets updated with every new version as well, so I assume there is no internal comparison of those databases. which one is used if a character is present in both databases?

Last edited by Thunderbolt8; 16th October 2011 at 18:17.
Old 16th October 2011, 18:24   #75  |  Link
Tappen
Registered User
 
Join Date: Dec 2006
Posts: 196
I only use my database if you don't have one already. There is no merging, just a file copy if I find you have no starting database at all, so you only get my database on a first install or if you deleted your database by hand.
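
In other words the logic is "copy if missing, otherwise leave alone". A minimal sketch of that, with hypothetical paths and function name:

```python
import shutil
from pathlib import Path


def ensure_user_database(bundled: Path, user: Path) -> bool:
    """Copy the shipped database only when the user has none. No merging."""
    if user.exists():
        return False  # the user's own database is always kept as-is
    user.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(bundled, user)
    return True
```

So a character that exists in both files is never a conflict: once a user database exists, the shipped one is simply not read.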

1 thing - I messed up the initial 1016 release - forgot something. So re-download if you got it between 5 and 15 minutes before the time of this message. The real 1016 is attached to changeset 10586.
Old 18th October 2011, 21:02   #76  |  Link
loekverhees
Registered User
 
Join Date: Sep 2005
Location: Holland
Posts: 86
This is by far the best subtitle OCR program I've ever used! Thanks a lot Tappen! Though I found one thing that was quite annoying: I used the 'Manually Enter Character' feature because typing is much faster than clicking the right character. But every time I typed the correct character, I had to grab my mouse and click the 'Normal' button. It would be much easier and faster if one could simply hit the Return key instead of clicking the 'Normal' button all the time. Normally, there are far fewer italic subtitles, so this shouldn't be a big problem (if one wants to enter italic characters, one just needs to use the mouse again to click the 'Italic' button). Maybe even better would be a check box that one can select if the characters are italic (because most of the time there are multiple italic characters in a row, and when the characters go back to normal, one simply needs to uncheck that check box again).
Old 18th October 2011, 21:45   #77  |  Link
Thunderbolt8
Registered User
 
Join Date: Sep 2006
Posts: 2,144
clicking the button or hitting the return key should be about the same. if you keep your right hand on the mouse with the pointer hovering over the Normal button, you only need to click.

left hand types the corresponding letter, right hand only does the clicks. same as pressing enter.
Old 19th October 2011, 15:12   #78  |  Link
Tappen
Registered User
 
Join Date: Dec 2006
Posts: 196
loekverhees: When the keyboard focus is in the "Manual Enter" textbox I should be able to make whichever button was last chosen the "Windows Default" button, either Normal or Italic. That means it will have a thicker black border around it and be pressed automatically by Windows whenever the Enter key is hit without changing the focus. It's a good suggestion and I will look into it.

Thunderbolt8 is also right though: I can type characters with my left hand and keep my right hand on the mouse pretty fast. Personally, since I've used it so much, I have the character selector layout memorized and can play "whack-a-letter" very fast now, much faster than manual entry. But regardless, I'll look into adding that feature.
Old 19th October 2011, 23:03   #79  |  Link
Thunderbolt8
Registered User
 
Join Date: Sep 2006
Posts: 2,144
what is the scale image option actually good for?
Old 20th October 2011, 02:21   #80  |  Link
Tappen
Registered User
 
Join Date: Dec 2006
Posts: 196
All the subs in DVDs and Blurays are bitmaps in various forms, and they have different sizes. Sometimes a subtitle bitmap fills the whole video frame, with most of the pixels transparent, and sometimes it's just a small rectangle in one part of the video. Depends on the authoring tool the disc creator used.

With "Scale ..." unchecked, you see just the bitmap, scaled up or down to fill the orange window that takes up most of that step. The subs can look really distorted if their bounding rectangle is very small or has a strange aspect ratio.

When you check "Scale..." I show the subtitle as it would appear on the final video, at the correct x and y coordinates, and scaled as if the orange window was the full video size.

I prefer it checked. If there's something I think went wrong I like to compare what I see on the video player with what I see in SubExtractor in an apples to apples comparison. But it's up to you. The checked or unchecked state is remembered like an option, but doesn't change the OCR or final output steps in the slightest.
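
To put numbers on the difference, here is a small sketch of where the subtitle would be drawn in the preview window in each mode. It assumes the bitmap comes with an (x, y, width, height) rectangle in video pixels and that the unchecked mode simply stretches to the whole window; `preview_rect` is a made-up name for the example.

```python
def preview_rect(bitmap, video, window, scale_to_video):
    """Return (x, y, w, h) in window pixels for the subtitle bitmap.

    bitmap: (x, y, w, h) of the subtitle in video pixels
    video:  (width, height) of the full video frame
    window: (width, height) of the preview window
    """
    bx, by, bw, bh = bitmap
    vw, vh = video
    ww, wh = window
    if not scale_to_video:
        # Unchecked: the bitmap alone is stretched over the whole window.
        return (0, 0, ww, wh)
    # Checked: the window stands in for the full video frame.
    sx, sy = ww / vw, wh / vh
    return (bx * sx, by * sy, bw * sx, bh * sy)
```

For a 960x120 subtitle at (480, 900) on a 1920x1080 video shown in a 960x540 window, the checked mode draws it at (240, 450) with size 480x60, while the unchecked mode stretches it to the full 960x540.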

Last edited by Tappen; 20th October 2011 at 02:54.