SubExtractor - New Sub Ocr App [Archive] - Page 6

Mug Funky

7th May 2012, 08:44

i wonder whether it'd spoil the layout formatting logic if top titles could behave in a similar way to bottom titles?

regular subs are detected as centre-aligned and have the "pos" info ditched, and this works wonderfully. however, a lot of DVDs have subtitles in the top of frame, either for intro song translation, or if action is at the extreme bottom of screen and the subs move to the top to not hide the action.

would it be possible to extend the centre-align behaviour to the top part of the screen?

am i making sense? i'm afraid English is my first language, and i'm still bloody awful at it.

thanks!

Tappen

7th May 2012, 23:20

Mug Funky. I believe that I understand your question, and can tell you that this change will be in the next release.

It was requested by other users as well that I notice when subtitles are centered on the screen and use the \an5 tag (middle-center) instead of the \an4 tag to position those lines of text.
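As a rough illustration of the idea, here is a minimal Python sketch that picks a centered alignment tag from a subtitle's bounding box. The function name, the box layout, the 5% centering tolerance and the split of the frame into thirds are all assumptions made for this example, not SubExtractor's actual logic; only the \an2, \an5 and \an8 codes (bottom-, middle- and top-center) are standard ASS alignment values.

```python
def alignment_tag(box, frame_w, frame_h, tolerance=0.05):
    """Pick a centered ASS alignment tag for a subtitle bounding box.

    box is (left, top, right, bottom) in pixels, y growing downward.
    Returns None when the text is not horizontally centered, in which
    case explicit position info would have to be kept instead.
    The tolerance and the thirds split are illustrative assumptions.
    """
    left, top, right, bottom = box
    center_x = (left + right) / 2
    if abs(center_x - frame_w / 2) > tolerance * frame_w:
        return None
    center_y = (top + bottom) / 2
    if center_y < frame_h / 3:
        return r"{\an8}"  # top-center
    if center_y < 2 * frame_h / 3:
        return r"{\an5}"  # middle-center
    return r"{\an2}"      # bottom-center
```

For a 720x480 frame, a centered line near the top edge would get the top-center tag, and a line hugging the left edge would return None.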

Tappen

7th May 2012, 23:23

aMvEL, "alt gr" isn't supposed to trigger italics. I'll have to grab a Euro keyboard from somewhere and figure out how to turn that off. Ctrl is supposed to temporarily turn on italics, and space toggles the base state, but alt-gr isn't supposed to do anything. I'm guessing that I'm getting a false positive on my check if Ctrl is down in this case.

Mug Funky

8th May 2012, 01:22

excellent! i believe you've just made the closest thing to a perfect sub OCR utility that the world has ever seen :)

[edit]

one thing that might be of use is the ability to run "no good character" matches through the spell check so we can manually enter what they're supposed to be without influencing the character database. i've noticed in some cases square brackets can be spotted as Is and Ls, and wasn't able to make it behave. perhaps a checkbox to not store the entered character as a match? not sure how it should be done, but there's a few approaches one can take.

Tappen

10th May 2012, 01:42

Mug Funky: If I understand your question correctly, I believe you might need to use the "Split in 2" function to deal with characters which are connected instead of just marking them as "no good character".

loekverhees

18th May 2012, 12:01

I have subtitles with little chunks around it, see the image below. Every time I have to press 'Different Palette', for every subtitle. Can the sensitivity be changed manually, so it ignores these little pieces around the letters? I think it is some kind of shadow effect.

http://puu.sh/vndM.png

Tappen

18th May 2012, 18:08

loekverhees: I thought I had taken care of all the situations like this but I guess not. The more common problem is with shading that wraps entirely around each letter, which I do handle pretty well now. This is a weird one. The program sees 2 palettes as containing the same number of recognizable characters, but one has some extra pixels which it thinks are unknown characters that need to be identified. aMvEL's problem a few posts up is an example of how sometimes characters really are drawn in 2 colors and the program is already too insensitive for some subtitles so I can't reduce the sensitivity even more.

I'm going to try to add a learning feature to manual palette choices made by the user so that the program won't require you make the same adjustments again and again.

loekverhees

19th May 2012, 09:08

Thanks for your reply. Or just an option to force the program to use a certain palette for the entire subtitle, as aMvEL mentioned. In my case, palette 1 is the right one. However, it just automatically defaults again to palette 1,2 all the time (even if I choose palette 1 for the first subtitle).

Tappen

20th May 2012, 00:37

Palettes are a little tricky so I'd worry about adding such an option. The colors can change in the middle of a subtitle track: 1,2 might be the letters and shading in one sub, then the color of index 2 changes and it's used for some independent text on the next sub while 1,3 are used for the main letters and shading. If you had forced it to be just palette 1 you'd miss the index 2 colored text (and possibly never realize it was gone from the output file).

I suppose this is rare enough that adding an option to force the palette wouldn't hurt. If I change the "Next Palette" button to NOT automatically begin OCR'ing and add 2 more buttons: "Apply" and "Apply for Movie" underneath it this should provide the function you need. I've always worried a little about OCR automatically starting after the "Next Palette" button anyway.

loekverhees

20th May 2012, 11:32

Okay, sounds great ;-). Thanks!

cheer

26th May 2012, 05:59

Betsy25

26th May 2012, 11:48

OK, no questions or problems, just a note to say...this is a FANTASTIC app. I'd been using SubRip on my DVDs for years, and since it was good enough, I never bothered looking for another. Then today I stumbled in here and...wow. You've just made my sub-ripping life so much easier.

Thank you so much.

That's exactly what I've been thinking. So cleverly programmed, minimalistic yet so powerful. I've since never looked back at SubRip. :)

Tappen

27th May 2012, 05:39

1.0.2.7 is out with a few fixes and features:

Fix: out-of-memory exception when reading DVDs with very large (over 1GB) cells
Fix: AltGr key toggling Italics during OCR
Feature: use centered alignment SSA tag for centered text in the upper part of the frame
Feature: increased number of subtitle tracks visible in Choose Subtitles step listbox
Feature: allow change of palette for entire movie

The memory use issue is one that I hope nobody has a problem with: I recently bought some unusual DVDs that I wanted to encode with 3GB+ cells in them and was running out of memory space in 32-bit Windows and out of physical memory in 64-bit during the SubExtractor encode step. (Cells are the smallest kind of "chunk" of audio/video defined in a DVD IFO file and I'd never seen one over 1GB before so it was quite a surprise.) Anyway, now any cell over 500MB uses a temp file to hold the extra data while loading and deletes the file when saving is complete. This change might have the added benefit of not slowing down your computer quite as much during this step.
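A minimal Python sketch of the general technique (keep data in memory until it passes a size threshold, then move it to a temporary file that is removed on close). It uses the standard library's tempfile.SpooledTemporaryFile; the function name load_cell and the chunked input are hypothetical. Note one difference from the description above: SpooledTemporaryFile moves the whole buffer to disk once the threshold is crossed, not just the excess, and SubExtractor itself is a .NET program, so this is only an analogy.

```python
import tempfile

# 500MB, the threshold quoted in the post above.
SPILL_THRESHOLD = 500 * 1024 * 1024


def load_cell(chunks, threshold=SPILL_THRESHOLD):
    """Collect the byte chunks of one cell into a file-like buffer.

    The buffer lives in RAM while it is smaller than `threshold` bytes
    and is transparently rolled over to a temporary file beyond that.
    The temporary file is deleted when the buffer is closed.
    """
    buf = tempfile.SpooledTemporaryFile(max_size=threshold)
    for chunk in chunks:
        buf.write(chunk)
    buf.seek(0)  # rewind so the caller can read from the start
    return buf
```

The caller reads from the returned buffer as from any binary file and closes it when saving is complete.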

sneaker_ger

29th May 2012, 14:26

http://www.abload.de/img/split41bjc.png
How to split these?

MokrySedeS

29th May 2012, 15:57

Hit "Split in 2" (obviously) --> paint "f" with your cursor --> Save split --> program will ask you to recognize the remaining part, which you left black --> select the dot --> hit "i"

sneaker_ger

29th May 2012, 16:11

Thank you, it did indeed work. It didn't ask me to manually enter f nor i, so I assumed something went wrong. I had manually typed in a special char for this combination and just searched and replaced all occurrences with a text editor by hand later. But now that I just followed your advice, the resulting file came out bit-identical to my manually created one, so Subtitle Extractor did its work just fine after all.
Maybe it could be changed to first click on the i-dot and then click split in 2, kinda feels more intuitive. If it were two i characters (or something like "i?") you wouldn't really be able to correctly split them.

MokrySedeS

29th May 2012, 16:22

Maybe it could be changed to first click on the i-dot and then click split in 2, kinda feels more intuitive.

+1

It's even more confusing when 2 characters from first and second line are picked together, like this:

http://i49.tinypic.com/2lmqeya.png

I wasn't sure if it's gonna get recognized as comma or something... but it actually worked as it should :D

Tappen

29th May 2012, 23:56

I know splitting can be non-intuitive sometimes but I can't think of how to improve it without adding a lot of clicks.

SubExtractor does know how to auto-split many 2-character combinations but it requires manual intervention for ones that could easily be errors: for example it used to sometimes split m into r and n if the m had a pattern that hadn't been trained. Now it requires you to manually enter any splits containing the letter r (as well as i, l, I, most punctuation and all accents).

Don't forget that painting the character green can be made easier using the left-mouse click which acts as a "paint-bucket" like you see in drawing programs. Just mouse paint along the joint between the 2 sections then left-click to fill the rest. (Right-click resets all painting)
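The "paint-bucket" behaviour can be pictured as an ordinary flood fill. The Python sketch below is an assumption about how such a fill could work, not the program's code: pixels are 0 for background, 1 for unpainted character pixels and 2 for painted ones, and the fill spreads through 4-connected unpainted pixels until it hits background or the line already painted along the joint.

```python
from collections import deque


def bucket_fill(grid, start, paint=2):
    """Flood-fill unpainted character pixels (value 1) starting at `start`.

    grid is a list of lists holding 0 (background), 1 (character,
    unpainted) or `paint` (already painted). The fill is 4-connected and
    stops at background and at already painted pixels, so a stroke
    painted along the joint acts as a barrier. Modifies grid in place.
    """
    rows, cols = len(grid), len(grid[0])
    r0, c0 = start
    if grid[r0][c0] != 1:
        return grid
    grid[r0][c0] = paint
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 1:
                grid[nr][nc] = paint
                queue.append((nr, nc))
    return grid
```

With a one-pixel painted stroke between two halves of a glyph, clicking in one half paints that half only and leaves the other untouched.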

MokrySedeS

30th May 2012, 08:24

it used to sometimes split m into r and n if the m had a pattern that hadn't been trained.

Still does. It happened to me a couple of days ago with this (http://www.sendspace.com/file/rmonf1) file.
Luckily for me this particular BD had 4 subtitle tracks in my language and this issue didn't occur in this one (http://www.sendspace.com/file/ac9mbq).

Tappen

30th May 2012, 20:45

I tried both those subs and didn't see the problem using version 1.0.2.7. It might be that you've trained more Polish specific characters (ń in particular) that led to the problem. I'll add something to 1.0.2.8 that hopefully will catch this.

Yes, that was the problem. ń needed to be tested for as well as n & ñ. Fixed in 1.0.2.8.

mood

6th June 2012, 21:55

This is the best subtitle extraction and OCR application.

Thanks for this fantastic software ;)

loekverhees

26th June 2012, 20:22

Feature: allow change of palette for entire movie

Great! I tested the new version and it works perfectly! Thanks ;-).

pandv2

26th July 2012, 01:34

Two suggestions, for your consideration:

- When only one part of a two-part char is recognized (as in i, ñ or á), extend the selection with the keyboard (maybe up arrow if the lower part is the recognized part, and down arrow in the opposite case). Maybe it could also be automatic: if the key pressed is a two-part char, add the part not selected.

- When I remove a training, don't return to the beginning of the list; move to the item after the deleted one. I am checking a lot of trainings (hundreds) and I need to search for my place each time I see an error and delete it.

Thanks.

Tappen

26th July 2012, 04:46

pandv2: Both good suggestions.

1. I thought of the first one already and am in the process of implementing it: Up arrow will add the nearest fragment whose center is above the top of the currently selected fragment, Down arrow the nearest fragment below.

2. I'll change the training removal dialog to use large Listboxes instead of Comboboxes so this isn't a problem. I've been bothered by this behavior as well.
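The arrow-key selection rule in point 1 can be sketched in Python as follows. The Fragment type, the function names and the use of Euclidean distance between centers as the measure of "nearest" are assumptions for illustration; the post only specifies that the candidate's center must lie above the top (or below the bottom) of the selected fragment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Bounding box of one connected blob of pixels, y growing downward."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def center(self):
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)


def _nearest(selected, candidates):
    # "Nearest" is measured between centers here (an assumption).
    if not candidates:
        return None
    sx, sy = selected.center
    return min(
        candidates,
        key=lambda f: (f.center[0] - sx) ** 2 + (f.center[1] - sy) ** 2,
    )


def nearest_above(selected, others):
    """Up arrow: nearest fragment whose center is above selected's top."""
    return _nearest(selected, [f for f in others if f.center[1] < selected.top])


def nearest_below(selected, others):
    """Down arrow: nearest fragment whose center is below selected's bottom."""
    return _nearest(selected, [f for f in others if f.center[1] > selected.bottom])
```

For the stem of an "i", nearest_above would pick the dot directly over it rather than a dot belonging to a letter further along the line.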

Tappen

1st August 2012, 07:45

1.0.2.8 is out with a fix and some features:

Fix: handle padding bytes in IDX files
Feature: added "last character matched" feedback after making OCR matches
Feature: added support for Thai language accents and punctuation
Feature: added left/right/up/down key functions during OCR matching to allow selection of nearby fragments (dots, accents) without using the mouse
Feature: changed the OCR Match Review dialog to have less annoying list scrolling and selection behavior

masster64

6th August 2012, 21:45

I've trained SubExtractor v1.028 for 113 lines out of 1601, and it still asks to identify almost all letters of the alphabet.
This is getting very boring...

pandv2

7th August 2012, 01:48

Hello,

2 questions or suggestions.

Is the new version pre-trained? I ask because if I delete all the trainings for this movie, the app doesn't ask for all the chars. It confuses í with i (accented i with letter i), and ¡ with i (inverted exclamation mark with letter i).

Sometimes a sub file contains a lot of variants of each letter (differing only in a few pixels on the border). I thought about how to resolve this, and maybe reducing the letters to a one-pixel-wide edge before comparing could work, because the differences are mainly in the outer border.

At the moment I am resolving the confusions by passing the result through a spell checker.

A few more test results: if I delete all the trainings for this movie, press OK, and then return to the training editor, the list is not empty.

Thanks.

Tappen

7th August 2012, 19:19

masster64: Sorry, there are some subtitles that just aren't consistent enough to be trained. The disc authors create the subtitle bitmaps in all sorts of different ways and some just don't OCR well (at least how SubExtractor does OCR).

pandv2: If a character was trained on another movie and matched on the current movie and then you delete all trainings the match still exists in the database (linked to the other movie). When you hit OK the match will be found again if it's near the start of the subtitles (before SubExtractor pauses for the first unmatched character) and not one of the i, o, I, l, O, etc. characters that don't share trainings between movies.

If you really want to start from scratch you need to delete both your OcrMap.bin file and the OcrMapOrig.bin file that comes in the zip package.

There is a problem with Blu-ray subtitles and errors on small characters or fragments of characters like accents. This started when I added the fuzzy logic OCR feature. I've seen it confuse periods (.) and commas (,) myself. I'm working on a fix, though it will inevitably mean there will be fewer automatic matches and more typing.

Chetwood

8th August 2012, 06:00

This is getting very boring...
Complain to the morons who were too stupid to properly author their subs. SubExtractor has the best OCR routines ever.

I'm working on a fix, though it will inevitably mean there will be fewer automatic matches and more typing.
Maybe you could add an option so people could switch between routines per subtitle? Kinda like switching between palettes.

Thunderbolt8

8th August 2012, 08:39

I've trained SubExtractor v1.028 for 113 lines out of 1601, and it still asks to identify almost all letters of the alphabet.
This is getting very boring...
there are some discs which are simply a PITA e.g. the blu-ray of the tree of life gave me nightmares.

masster64

8th August 2012, 13:31

masster64: Sorry, there are some subtitles that just aren't consistent enough to be trained. The disc authors create the subtitle bitmaps in all sorts of different ways and some just don't OCR well (at least how SubExtractor does OCR).
But there is a solution to that. A more relaxed 'recognition success' algorithm. Let's say 100% is a pixel per pixel recognition consistency. Give us a slider to allow using lower rates of success and all will be easier.

pandv2

8th August 2012, 23:45

About the fuzzy logic OCR feature.

Maybe you could allow it to be disabled (at the beginning of the movie training) for all chars or for a user-selected list of chars. Then, once it's trained in the differences between (for example) i, í, ì, ¡ the user can reactivate it. After this, the fuzzy logic needs to be applied after the strict logic, or a metric defined to find the nearest match.

I deleted OcrMap.bin and OcrMapOrig.bin, and retried the failing subs. It's a lot better now. The only error now is the confusion between i and ¡ (the vowel i and the inverted exclamation mark), and it no longer happens all the time. Maybe a feature to search for a specific badly OCR'd word (such as iHola!) and retrain it would be useful.

Tappen

9th August 2012, 00:46

pandv2: The fuzzy logic is only applied to Bluray sups so if your problem is with a DVD sub then there's something different causing the problem (let me know here right away if the problem is with DVDs).

Assuming it's with a Bluray subtitle, first let me explain that the fuzzy logic is just downscaling: each character or part of a character is shrunk by 3x in each dimension (there's actually 9 ways to do this so I end up with 9 mini characters). If any one of these 9 is an exact match of a mini character from the database then the OCR match is completed just as if the full-sized characters exactly matched.

During testing this caused too many bad matches like what you're seeing on small fragments, so I set the downscaling to 2x if the initial character size is 9 or less pixels on each side (so then there's 4 mini-match possibilities = much less chance of a bad match). Unfortunately this isn't sensitive enough and I should have set it at something like 10 pixels on 1 side and 16 pixels on the other to make the matches fully reliable even though it would be less "fuzzy" and find fewer matches. You can't change this on the fly because it requires re-downscaling all the characters in the database (would take a minute or more on the fastest machine given the 1000s of characters in the typical database). This is why a slider isn't possible.

I'll try making this change and putting a test version up on Codeplex for you. If it fixes your problem I'll do a real release with the change.

In case other people are nervous about the change, the fuzzy logic only helps about 1/3 of the bluray subtitles in my experience - the other 2/3 are created with authoring tools where the characters match perfectly. So this change will mean more typing on just the 1/3 that are dodgy to begin with, and those are the subtitles that are likely to see mismatched character problems, so the trade-off seems reasonable.
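The downscaling match described in this post can be sketched in Python like this. Several details are assumptions, since the post does not spell them out: the reduction is done by point sampling (taking every 3rd pixel at each of the 3x3 possible offsets, which yields the 9 mini characters), all mini variants of a trained character are stored in the database, and bitmaps are lists of 0/1 rows. All names are hypothetical.

```python
def choose_factor(bitmap, small_side=9):
    """Use 2x for small glyphs (9 pixels or fewer on each side), else 3x."""
    height, width = len(bitmap), len(bitmap[0])
    return 2 if height <= small_side and width <= small_side else 3


def mini_variants(bitmap, factor):
    """Return the factor*factor point-sampled reductions of a 0/1 bitmap."""
    variants = []
    for dy in range(factor):
        for dx in range(factor):
            mini = tuple(tuple(row[dx::factor]) for row in bitmap[dy::factor])
            variants.append(mini)
    return variants


def train(bitmap, char, database):
    """Store every mini variant of a trained glyph (assumed storage scheme)."""
    for mini in mini_variants(bitmap, choose_factor(bitmap)):
        database.setdefault(mini, char)


def fuzzy_match(bitmap, database):
    """Return the trained character if any mini variant matches exactly."""
    for mini in mini_variants(bitmap, choose_factor(bitmap)):
        if mini in database:
            return database[mini]
    return None
```

Because the stored keys depend on the downscale factor, changing the factor or the size cutoff means recomputing every entry, which is consistent with the explanation above of why the setting cannot be adjusted on the fly.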

pandv2

10th August 2012, 13:51

Thanks, and yes, these are sup files extracted straight from a bluray.

But now I know the reason for the problems with a sub+idx converted from a bluray. I think in this case the fuzzy logic doesn't get applied.

masster64

10th August 2012, 15:57

@Tappen
any reply to my suggestion?

Tappen

10th August 2012, 16:37

masster64: sorry but if you changed how much "fuzzy logic" is being used you'd have to re-compute the database. So it would take a minute or more for every click of the slider. I don't think anyone wants that. The only simple change is an on/off switch.

pandv2: There is a "size:" definition near the top of most idx files. If the horizontal resolution (the first number) is greater than 1400 then the fuzzy logic code is used.
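A small Python sketch of the check described above: read the "size:" line of an idx file and compare the first number against 1400. The function names and the regular expression are assumptions for this example; only the "size:" line and the 1400 threshold come from the post.

```python
import re

# Matches lines such as "size: 1920x1080"; commented lines ("# ...") are skipped
# because the pattern is anchored at the start of the line.
SIZE_LINE = re.compile(r"^\s*size:\s*(\d+)\s*x\s*(\d+)", re.IGNORECASE)


def idx_horizontal_resolution(idx_text):
    """Return the width from the first "size:" line, or None if absent."""
    for line in idx_text.splitlines():
        match = SIZE_LINE.match(line)
        if match:
            return int(match.group(1))
    return None


def uses_fuzzy_logic(idx_text, threshold=1400):
    """True when the horizontal resolution is greater than the threshold."""
    width = idx_horizontal_resolution(idx_text)
    return width is not None and width > threshold
```

Under this rule an idx with "size: 1280x720" would not trigger the fuzzy logic, while one with "size: 1920x1080" would.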

Tappen

11th August 2012, 22:42

I've created a release - Tighter Fuzzy Logic 1 - on Codeplex. Please try it out and see if it fixes the recognition problems on High Def subtitles.

http://subextractor.codeplex.com/releases/view/92608

mscsmyrpln

13th August 2012, 10:22

Very impressive. I've wanted a tool like this for a long time. Thank you!

rbauer

13th August 2012, 14:37

@Tappen
If possible, could you make It portable?

Now It writes to (Win7x64-standard account):

HKEY_USERS\S-1-5-21-3668210609-2088346886-2201473197-1001\Software\Microsoft\Windows\CurrentVersion\Explorer\FileExts\.idx\
HKEY_USERS\S-1-5-21-3668210609-2088346886-2201473197-1001\Software\Microsoft\Windows\CurrentVersion\Explorer\FileExts\.idx\OpenWithList\

c:\Users\<USER>\AppData\Local\DvdSubExtractor\
c:\Users\<USER>\AppData\Local\DvdSubExtractor\DvdSubExtractor.exe_Url_anvjrhxtsierb2tr4i2lphhngd2nuvhe\
c:\Users\<USER>\AppData\Local\DvdSubExtractor\DvdSubExtractor.exe_Url_anvjrhxtsierb2tr4i2lphhngd2nuvhe\1.0.1.3\user.config

c:\Documents and Settings\<USER>\AppData\Local\DvdSubExtractor\
c:\Documents and Settings\<USER>\AppData\Local\DvdSubExtractor\DvdSubExtractor.exe_Url_anvjrhxtsierb2tr4i2lphhngd2nuvhe\
c:\Documents and Settings\<USER>\AppData\Local\DvdSubExtractor\DvdSubExtractor.exe_Url_anvjrhxtsierb2tr4i2lphhngd2nuvhe\1.0.1.3\user.config

Many thanks

Tappen

13th August 2012, 19:32

rbauer: I think you just need to go into Options and check "Use Program Exe Location" to make the OCR database portable. Most other options you're likely to change depend on the machine directories so I'm not sure it can be made fully portable. In any event I just use the default .Net Settings system and I'm not sure how to customize that.

I don't register the .idx extension - you must have done that yourself.

rbauer

13th August 2012, 20:38

rbauer: I think you just need to go into Options and check "Use Program Exe Location" to make the OCR database portable.
Unfortunately that option ("Use Program Exe Location") is for OCR Data File ("OcrMap.bin") location only.

User settings (user.config) are still written to
%appdata%\DvdSubExtractor\DvdSubExtractor.exe_Url_anvjrhxtsierb2tr4i2lphhngd2nuvhe\1.0.1.3\user.config

Thanks

Tappen

13th August 2012, 21:00

rbauer: sorry, further portability isn't going to happen. It's too much trouble to support all the possible Windows versions which have different file system access rights if I tried to write my own settings storage.

I'll stick with the standard .Net settings for now. On the bright side if you are on an Active Directory Domain these settings should persist across machines.

rbauer

14th August 2012, 11:40

pandv2

18th August 2012, 14:33

I've got time to test your Tighter Fuzzy Logic 1. I uncompressed it in another directory and executed from it.

The results are similar to those obtained by deleting OcrMap.bin and OcrMapOrig.bin, but I needed to enter fewer chars in general (though the , char more often).

It still confuses only i and ¡ (the vowel and the inverted exclamation mark). The rest is OK.

In the meantime I noticed a few other things:

- In the spelling phase, it's not possible to select the correct option for some Roman numerals (such as III).

- The idx file from the bluray is from a rip to 720p resolution. It has a horizontal resolution of 1280, so the fuzzy logic doesn't get activated in this case.

Thanks.

Tappen

20th August 2012, 17:49

pandv2: I think the only way to discriminate between i and ¡ is to make these characters movie-specific. If I allow all the database matches to be searched it's just too likely that an i from some other movie matches the ¡ in the current movie. I don't think the slightly lower baseline position of the ¡ character is reliable enough to allow SubExtractor to tell the difference. This means you'll have to re-enter i and ¡ matches every movie like you currently have to for o, O, l, I, etc.

I can add this as an option for people working with languages that contain the ¡ character (and turn it on automatically if I know the subtitle is Spanish). I'll put a build up on Codeplex with this option if you want to test it, or just let me know here what you think of the idea.
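The movie-specific lookup can be pictured with the Python sketch below. The database layout (a glyph key mapped to a list of character and movie pairs), the names, and the exact contents of the movie-specific set are assumptions; the posts only say that characters such as o, O, l, I do not share trainings between movies and that i and ¡ could be treated the same way.

```python
# Characters whose trainings are only trusted within the same movie.
# The exact set is illustrative; the posts list "o, O, l, I, etc."
MOVIE_SPECIFIC = frozenset("oOlIi¡")


def lookup(database, glyph, current_movie):
    """Return the trained character for a glyph, or None.

    database maps a glyph key to a list of (char, movie) trainings.
    A training from another movie is accepted only when its character
    is not in MOVIE_SPECIFIC.
    """
    for char, movie in database.get(glyph, []):
        if movie == current_movie or char not in MOVIE_SPECIFIC:
            return char
    return None
```

With this rule an "i" trained on one movie is never matched against a glyph in another movie, so it has to be entered again, while an "e" trained elsewhere is still reused.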

pandv2

21st August 2012, 16:28

I can add this as an option for people working with languages that contain the ¡ character (and turn it on automatically if I know the subtitle is Spanish). I'll put a build up on Codeplex with this option if you want to test it, or just let me know here what you think of the idea.

For me it's ok.

Another solution is to add it to the I-L discrimination phase. The inverted exclamation mark can only appear at the beginning of a word, never in the middle. And it normally precedes an uppercase letter.

Tappen

21st August 2012, 18:58

Adding to I-l is a good idea, but the reason this feature works is that you eventually build up a database of words where the choice of letter is questionable (starts with l, mostly) which isn't too big. A similar database for i and ¡ discrimination would have to contain all the words which could start a sentence. Not as useful since it would have to be huge and there'd be a lot of false positives (words which are also a word when prefixed by the letter i).

I could replace all words that start with i followed by an upper-case letter with ¡ automatically during the Spellcheck step. Hard to think of any legitimate cases of this pattern. On the whole I think per-movie i and ¡ ocr is the best solution.
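The automatic replacement mentioned in the last paragraph could look roughly like this in Python. It is a sketch under assumptions (the function name and the use of a regular expression are not from the thread): a word-initial lowercase i immediately followed by an uppercase letter is rewritten as ¡. Brand-style words such as "iPhone" would be rewritten too, so a real implementation might want an exception list.

```python
import re

# A lowercase "i" at the start of a word, followed by any word character.
_WORD_INITIAL_I = re.compile(r"\bi(\w)")


def fix_inverted_exclamation(text):
    """Replace word-initial "i" + uppercase letter with "¡" + that letter."""

    def replace(match):
        following = match.group(1)
        if following.isupper():
            return "¡" + following
        return match.group(0)  # leave ordinary words such as "idea" alone

    return _WORD_INITIAL_I.sub(replace, text)
```

Using str.isupper() rather than a fixed A-Z range means accented capitals such as Á are handled as well.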