SubExtractor - New Sub Ocr App [Archive] - Page 8

errantkkn

19th December 2012, 08:17

It's a hardsubbed video, not a video that contains a subtitle stream.

sneaker_ger

19th December 2012, 08:31

@errantkkn:
Hard-subbed video files are a whole different issue. Separating the subtitles from the background would be at least as difficult as all the work done so far on this project. So, sorry, there are no plans to start work on that.

speedoflight

20th December 2012, 22:48

Quote: "I don't understand the problem, speedoflight. Could you copy a line or 2 of an ass sub with the problem highlighted for me?"

I didn't explain it too well anyway. Here is a screenshot of what the program does:

http://imageshack.us/a/img341/5729/clipboard01rk.jpg

As you can see, it's very weird. I tried opening the subtitles with subtitle editor and Aegisub, both with the same result.

speedoflight

25th December 2012, 21:36

Now I'm having a lot of trouble with "I" and "L": the program doesn't OCR these characters correctly on VobSub subtitles. I'm getting more bugs with every subtitle I try...

An IP BreAKDoWN

31st December 2012, 20:52

I'm trying to OCR some subs that have underlining in them. It would be nice if SubExtractor had an underline option. If you need an example, here is the sup (http://www.filefactory.com/file/5q4hotorvknl/).

Tappen

1st January 2013, 19:35

speedoflight: ASS Subtitles can have overlapping, non-sequential time-stamps. This is normal and expected.
Only SRT subs must be non-overlapping. So what you highlighted isn't a bug.
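
To illustrate the difference, here is a minimal Python sketch (not SubExtractor's code) that reports which cues in a list overlap in time. The `Cue` tuple layout and the `find_overlaps` name are assumptions made for this example. Going by the rule above, an SRT export would need the result to be empty, while an ASS file may legitimately contain overlapping or out-of-order cues.

```python
from typing import List, Tuple

# Hypothetical cue layout for this sketch: (start_seconds, end_seconds, text)
Cue = Tuple[float, float, str]


def find_overlaps(cues: List[Cue]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, of cues whose time ranges intersect."""
    # Work on indices sorted by start time so non-sequential input is handled.
    order = sorted(range(len(cues)), key=lambda i: cues[i][0])
    overlaps = []
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            if cues[j][0] >= cues[i][1]:
                break  # sorted by start, so no later cue can overlap cue i
            overlaps.append((min(i, j), max(i, j)))
    return overlaps


cues = [(0.0, 2.0, "first"), (1.5, 3.0, "second"), (3.0, 4.0, "third")]
print(find_overlaps(cues))
```

A cue that starts exactly when the previous one ends is treated as non-overlapping here; that boundary choice is also an assumption.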

If you go through the spell-check step of the program it should disambiguate I and lower L characters for you. They are often the same pattern so can only be told apart after the OCR.

An IP Breakdown: Underlining is on the list of issues I'd like to fix (http://subextractor.codeplex.com/workitem/625). Thanks for the subtitle example. For now you can deal with it by Splitting the underline from the characters and Ignoring the underline pieces but I agree it isn't ideal.

radigast

5th January 2013, 06:30

Your program is by far the best out there. However, with this (http://www.sendspace.com/file/csq9ra) BluRay subtitle set, ! is recognized as either '. or l. I reset all characters and ensured that, when prompted, each letter was reprogrammed correctly. I never received a prompt to OCR !. I'm not sure if this is something on my end or whether it is a bug that can be fixed in the image database within the program. Suprip can recognize the ! with no problems, so proper recognition is possible. Again, love the program. I'm hoping this can be sorted out!

Thanks again!

deco20

5th January 2013, 14:45

Strange thing: I always have to type chars like "l", "o", "O" and special characters like ",", ".", "-" again, although I already typed them earlier during OCR of the same font.

speedoflight

6th January 2013, 03:51

I know about ASS subtitles. But the thing is, those lines should not be overlapping each other. They should be regular lines, one after another. If I check the original subtitle while watching the Blu-ray, there is no overlapping in them, so I don't understand why the program does that. And what the hell, the screenshot I posted shows the text section, not the ASS code. You can't see anything related to those overlapped lines in the ASS code. It's just the program that writes that for some reason.

Besides, I tried the OCR tool of Subtitle Edit, for example, and it gives me a good ASS subtitle without those overlapped lines. Well, not exactly: it gives me a subtitle that I can save as .ASS once I'm done with the OCR.

I already tried the spell-check many times for "I" and "L" without result. I deleted all the subtitle OCR images and started from the beginning, with the same result.

Needless to say, there's a similar problem with "¡" and "i": it doesn't matter if I delete everything and start from the beginning, even with a different palette; the program will not recognize those characters.

nautilus7

18th January 2013, 19:11

Hi, I have some problems with Greek letters. SubExtractor splits some letters even when I tell it not to, and this makes OCR impossible. See the screenshots below.

http://t.imgbox.com/abwXHoct.jpg (http://imgbox.com/abwXHoct) http://t.imgbox.com/abuySDWh.jpg (http://imgbox.com/abuySDWh) http://t.imgbox.com/adr3DfTv.jpg (http://imgbox.com/adr3DfTv)

The letters are "Γ" (Greek capital gamma), "Η" (Greek capital eta) and "Π" (Greek capital pi), either normal or italic. Why are these specific splits introduced again even after I remove them? I have deleted the map files (the one that comes with the program and the other in the /users folder) with no change.

This happens in several PGS files I have; one is here: http://www.sendspace.com/file/kgd5po

Tappen

18th January 2013, 19:39

nautilus7: There is an automatic split function which always runs on unknown characters. If both the left and right halves of the split make letters, neither letter is in the list of known bad auto-split characters, and the two together don't make a pair known to be bad (rn = m, for example), we don't stop and ask; we just do the split. I'll need to add an option that lets users add characters to this list, so they can fix this problem for their own language's characters.
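
As a rough illustration of the rule described above, the following Python sketch encodes one possible reading of it. It is not SubExtractor's implementation: the names `BAD_SPLIT_CHARS`, `BAD_SPLIT_PAIRS` and `should_auto_split`, and the contents of both collections, are assumptions chosen for this example.

```python
from typing import Optional

# Hypothetical lists for illustration only; the real program keeps its own.
# Latin l, I, i plus Greek capital and small iota (U+0399, U+03B9).
BAD_SPLIT_CHARS = set("lIi\u0399\u03b9")
# Pairs that usually come from wrongly cutting one glyph, e.g. "rn" from "m".
BAD_SPLIT_PAIRS = {("r", "n")}


def should_auto_split(left: Optional[str], right: Optional[str]) -> bool:
    """Decide whether a candidate split is applied without asking the user.

    `left` and `right` are the characters recognized for each half of the
    split, or None when a half matches no known character.
    """
    if left is None or right is None:
        return False  # one half is not a known letter
    if left in BAD_SPLIT_CHARS or right in BAD_SPLIT_CHARS:
        return False  # a half looks like a typical bad-split result
    if (left, right) in BAD_SPLIT_PAIRS:
        return False  # the pair is known to come from a single glyph
    return True


print(should_auto_split("a", "b"))            # two ordinary letters
print(should_auto_split("r", "n"))            # probably a mis-cut "m"
print(should_auto_split("\u0393", "\u0399"))  # gamma + iota, e.g. from a cut capital pi
```

Under this reading, adding a language's narrow letters to the first list is what stops unwanted splits such as a capital pi being cut into gamma and iota.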

nautilus7

18th January 2013, 20:06

Thanks for the lightning-fast response. I understand the logic and would welcome such an option. I have to say, though, that this "problem" didn't occur in the past, at least with Greek characters. I've used SubExtractor to OCR a lot of subtitles with 100% success over the previous few years. If I recall correctly, automated splits didn't take place on Greek characters at that time. Is that a recent change in the split logic?

nautilus7

18th January 2013, 20:16

Also, I've noticed that some letters (Greek and English) are split in a bizarre way. Is there anything you can improve here?

http://t.imgbox.com/abfqrxlE.jpg (http://imgbox.com/abfqrxlE)

Tappen

20th January 2013, 04:18

Nautilus7: Removing Γ Ι ι Ξ Π from the auto-split lists will probably solve your problems. Actually, just Ι and ι will solve all the problems you've shown me so far, I think. Splits using Iota give the same bad results as the Latin I and i characters, to no one's great surprise. I'll make the change in the next release (soon).

nautilus7

20th January 2013, 12:22

Hi, there are a few more letters.

Greek "π" (lower case pi) is detected as a double "τ" (lower case tau), because the horizontal line in "π" can extend past the two vertical lines, and then the similarity is strong.

Also, "Θ" (capital theta) is detected as "O" with a "-" after it.

"Ξ" didn't give me any problems whatsoever, but I guess removing it won't do any harm.

nautilus7

20th January 2013, 15:23

Tappen, I got this exception log yesterday, but I can't remember what I did to cause it.

Exception thrown at 20/1/2013 5:12 πμ
Parameter is not valid.
System.Drawing
at System.Drawing.Bitmap..ctor(Int32 width, Int32 height, PixelFormat format)
at DvdSubOcr.BlockViewer.OnSizeChanged(EventArgs e)
at System.Windows.Forms.Control.UpdateBounds(Int32 x, Int32 y, Int32 width, Int32 height, Int32 clientWidth, Int32 clientHeight)
at System.Windows.Forms.Control.UpdateBounds()
at System.Windows.Forms.Control.WmWindowPosChanged(Message& m)
at System.Windows.Forms.Control.WndProc(Message& m)
at System.Windows.Forms.NativeWindow.Callback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)
Exception thrown at 20/1/2013 5:12 πμ
Parameter is not valid.
System.Drawing
at System.Drawing.Graphics.FromImage(Image image)
at DvdSubOcr.BlockViewer.RedrawBlockPicture()
at DvdSubOcr.BlockViewer.UpdateBlockPicture(BlockEncode block, IEnumerable`1 otherSelectedBlocks, IEnumerable`1 allEncodes)
at DvdSubExtractor.OcrBlocksStep.LoadCurrentSubtitleData()
at DvdSubExtractor.OcrBlocksStep.FindNextOcr()
at DvdSubExtractor.OcrBlocksStep.reviewButton_Click(Object sender, EventArgs e)
at System.Windows.Forms.Button.OnMouseUp(MouseEventArgs mevent)
at System.Windows.Forms.Control.WmMouseUp(Message& m, MouseButtons button, Int32 clicks)
at System.Windows.Forms.Control.WndProc(Message& m)
at System.Windows.Forms.ButtonBase.WndProc(Message& m)
at System.Windows.Forms.Button.WndProc(Message& m)
at System.Windows.Forms.NativeWindow.Callback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)

Tappen

20th January 2013, 16:55

OK will look into it.

Overdrive80

22nd January 2013, 20:44

Hi, thanks for your magic app. When I select "Split by cell" and then want to go back to "Split by chapter", the button is disabled. I don't know if it's a bug, but I'm reporting it. Thanks.

nautilus7

24th January 2013, 00:19

Tappen, I saw you made some changes to the source repo. Are those the ones you mentioned above? Will you compile a binary? It seems that my installation of VS 2012 lacks some components and cannot compile the source code.

Tappen

25th January 2013, 22:41

This weekend Nautilus7 (Jan 26-27)

Tappen

25th January 2013, 22:48

Overdrive80: A chapter is made up of 1 or more whole cells. So once you split by cells there are no more chapters.

You can go back to the first page where you Browse for the DVD folder and hit the "Reload" button if you make a mistake splitting a track up.

technical note: There are really only cells on a DVD. What I call a new chapter is when a cell starts with a different, non-continuous, time-stamp (internal numbering of video/audio/subtitle/system packets) from the previous cell. Many but not all DVDs follow this convention.
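
The convention in the technical note can be sketched in a few lines of Python. This is an illustration only, not the program's code: the `Cell` layout (first and last time-stamp of a cell), the `group_cells_into_chapters` name and the `tolerance` parameter are assumptions introduced for the example.

```python
from typing import List, Tuple

# Hypothetical cell layout for this sketch: (first_timestamp, last_timestamp)
Cell = Tuple[int, int]


def group_cells_into_chapters(cells: List[Cell], tolerance: int = 0) -> List[List[Cell]]:
    """Group consecutive cells; a new chapter starts when time-stamps jump."""
    chapters: List[List[Cell]] = []
    for cell in cells:
        start, _end = cell
        previous_end = chapters[-1][-1][1] if chapters else None
        if previous_end is not None and abs(start - previous_end) <= tolerance:
            chapters[-1].append(cell)  # continuous with the previous cell
        else:
            chapters.append([cell])    # discontinuity: begin a new chapter
    return chapters


cells = [(0, 100), (100, 250), (0, 80), (80, 90)]
print(group_cells_into_chapters(cells))
```

Because every chapter is a list of whole cells, splitting by cell discards the grouping, which matches the behaviour described in the reply above.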

nautilus7

25th January 2013, 22:48

Looking forward to it! Thank you.

nautilus7

27th January 2013, 02:54

Hi Tappen, thanks for the updated version. It's an improvement over the previous version regarding wrongly split Greek characters, but there are still some issues.

Greek lower case pi ("π") is still split into two lower case taus ("τ"). Maybe you should put "τ" in the "cheap split symbols" or "very cheap split characters" lists as well? Or won't that do anything?
There is also a case where Greek capital pi ("Π") is split into "Γ" and "Ι" (gamma and iota respectively), even though it is in the lists mentioned above (I read your code: lines 52 and 53 in the subconstants.cs file).

52 public const string CheapSplitSymbols = "\'\".,:-—_|„!ΓΠπΘГҐПӨг";
53 public const string VeryCheapSplitCharacters = "lIÌÍÎÏiìíîïj][ΙιΞІЇӀ";

Here is a screen with examples:

http://t.imgbox.com/adx7blWU.jpg (http://imgbox.com/adx7blWU)

nautilus7

27th January 2013, 03:30

There are also some cases where greek capital theta ("Θ") is still detected as "O-".

rizu

27th January 2013, 09:03

edit->Nevermind found the italic checking :)

Betsy25

27th January 2013, 09:12

@Tappen,

just have a little question. I ran a Blu-ray .sup file through the extractor and made the necessary inputs to make it through. However after I had the file saved as a *.srt, I noticed SubExtractor made quite a lot of "mistakes" by not seeing the <space> before a lot of words starting with a "j" (when on an italic line). I tried redoing the file but now it doesn't prompt anymore for any unrecognized items and runs through the 100% at once.

So, in short, I would like to delete the "database" file so it will prompt for corrections once again. However, I do not wish to delete my "learned words" database, which is already fairly complete for the Dutch language.

Which file should I delete, please?

Tappen

27th January 2013, 18:18

nautilus7: I'm sorry I made a mistake with this fix for you. This is what comes of taking a break from the code for a couple of months. Of course I should have added the characters into the CheapSplitSymbols list that are the RESULT of the split, not the source. If the code knew what the character was before it was split then it wouldn't do the split in the first place. I'll fix this with a new version today.

Betsy25: The database is the OcrMap.bin file. You can see its location on the first tab of Options. However, the database isn't your problem: you need to go to "Advanced Word Spacing" which is accessible from either the Spell-Check page or the File Save page (the last). Select lower case "j", and increase the "Normal Left" and/or "Italic" left values until the spaces to the left of the "j" characters in the samples look good. These adjustment values will last till you shut down SubExtractor.

J and j, as well as y and a couple of other characters are pretty common spacing errors. I've played with the defaults to improve the results in the majority of cases but many subtitles still draw these characters with either too long or too short tails and throw off the calculations of word spacing.
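
To show why a per-character adjustment helps, here is a small Python sketch of the idea. It is an assumption about how such spacing could be computed, not the program's actual calculation: the `Glyph` layout, the `build_line` function and the pixel values are all made up for the example. The tail of an italic "j" reaches under the previous letter, so the measured gap before it is smaller than the visual gap; adding a "left" value for "j" compensates.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical glyph layout for this sketch: (char, left_x, right_x, is_italic)
Glyph = Tuple[str, int, int, bool]


def build_line(
    glyphs: List[Glyph],
    space_threshold: int,
    normal_left: Optional[Dict[str, int]] = None,
    italic_left: Optional[Dict[str, int]] = None,
) -> str:
    """Join glyphs into text, inserting a space where the gap is wide enough."""
    normal_left = normal_left or {}
    italic_left = italic_left or {}
    out: List[str] = []
    prev_right: Optional[int] = None
    for char, left, right, italic in glyphs:
        if prev_right is not None:
            # Per-character correction added to the measured gap before `char`.
            adjust = (italic_left if italic else normal_left).get(char, 0)
            if (left - prev_right) + adjust >= space_threshold:
                out.append(" ")
        out.append(char)
        prev_right = right
    return "".join(out)


# Italic Dutch "ik jou": the gap before "j" measures only 4 pixels.
glyphs = [("i", 0, 4, True), ("k", 6, 12, True), ("j", 16, 20, True),
          ("o", 22, 28, True), ("u", 30, 36, True)]
print(build_line(glyphs, space_threshold=6))
print(build_line(glyphs, space_threshold=6, italic_left={"j": 3}))
```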

nautilus7

27th January 2013, 18:28

Tappen, thanks! Looking forward to the fix for this great OCR application.

Betsy25

27th January 2013, 19:02

Thanks for the help, Tappen. I found and increased the "default before" space for the italic j, and now everything appears fine. HOWEVER, I cannot go to the next step: the Save As window is greyed out?

EDIT: Sorry, I found out I had to click "Previous Step". Everything is working fine in this fantastic converter.

speedoflight

27th January 2013, 20:27

Tappen, did you look into the .ASS problem from my post?

Well, it isn't much trouble anyway: I just save as SRT instead and then convert it to ASS. But the problem with "I" and "L" and with "i" and "¡" continues. Impossible to fix. Suprip makes perfectly converted subtitles, with no problem on these characters. That's the OCR tool I'm using now, since (sadly) SubExtractor doesn't seem to be able to handle it.

Betsy25

27th January 2013, 21:59

Perhaps just a little feature request ?

Could it be possible to have a DEL hotkey in the Options/Special Characters screen, so we don't have to constantly do -Select Item -> press "Remove from list" ?

I, for example, only have Dutch subtitles, and a lot of their words clash with the default English words. Cleaning that up now takes an extreme amount of clicking and selecting, while a DEL hotkey would speed this up quite a bit. Ideally, it would be great to be able to navigate the list using the arrow keys and use the DEL key to delete items.

73ChargerFan

28th January 2013, 00:17

Nice app, thanks. I was trying to review a SUP track from the BD director's cut of Dark City, which crashed every other program I tried. With SubExtractor I could see that the images were animated pictures (cue cards with text, but expanding) that I could ignore.

Thank You!

Tappen

28th January 2013, 02:19

Betsy25: any reason you wouldn't want just a "Remove All" button? You could then train the program in just the Dutch spelling words.

I found the initial words during the OCR of the English subtitles in my personal DVD/Bluray library. I didn't use a dictionary so there's nothing special about this list.

Tappen

28th January 2013, 04:56

speedoflight: I think I have an idea what's happening in your case with "i" and "¡": the best fit character isn't being chosen sometimes when OCRing Bluray subtitles due to an error in the algorithm. It's pretty complicated to fix and I've not found the time to sit down and work on it yet.

I still haven't reproduced the 'I' and 'L' problem. My guess is that it's somehow related to being a Spanish subtitle because others aren't seeing the problem.

Betsy25

28th January 2013, 09:13

A "Remove All" button would be really handy for non-English people like myself! :)

Tappen

28th January 2013, 23:04

nautilus7: Please try out release 31b and let me know which Greek characters are still being split incorrectly and into which sub-characters. Thanks.

nautilus7

28th January 2013, 23:56

Hi, I am already testing for 1 hour... :p

It seems that most problems are gone now. Only "Θ" (capital theta) is being split into "O" and "-" in some cases. Actually, this is not detected as a split, and it never was. Maybe that's because the horizontal inner bar is not attached to the outer circle of the letter but has some space around it. That causes theta to be OCRed as "O-" without any indication of it being split up.

If I come across any other issue I'll let you know.

May I ask for a feature as well? In the "Correct OCR matches" window, where you can review all matches for the current subtitle stream, it would be great if a window popped up and asked for confirmation before saving any changes. I accidentally hit the "remove all ocr matches for this movie" button and had to start over.

Tappen

29th January 2013, 00:05

I hate confirmation dialogs in general but in this case you have a point.

nautilus7

29th January 2013, 00:20

You could also put a check box on the options page to choose whether to ask for confirmation or not. This way the pop-up won't be shown to users who don't want it.

Tappen

29th January 2013, 00:23

The "Θ" problem will involve some special code to detect. Not difficult to do, though, so I should be able to write it fairly quickly. I'm glad you told me it wasn't an auto-split problem because that was making my head hurt.
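
One way such special-case detection might look is sketched below in Python. This is a guess at an approach, not the fix that was actually written: the `Box` layout, the `contains` and `merge_theta` helpers and the sample coordinates are assumptions. The idea is that a "-" whose bounding box lies entirely inside the preceding "O" is the inner bar of a theta rather than a separate dash.

```python
from typing import List, Tuple

# Hypothetical bounding box for this sketch: (left, top, right, bottom)
Box = Tuple[int, int, int, int]
Piece = Tuple[str, Box]

THETA = "\u0398"                # Greek capital theta
ROUND_LETTERS = ("O", "\u039f")  # Latin O and Greek capital omicron


def contains(outer: Box, inner: Box) -> bool:
    """True when `inner` lies completely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def merge_theta(pieces: List[Piece]) -> List[Piece]:
    """Replace an O followed by a dash drawn inside it with a single theta."""
    result: List[Piece] = []
    i = 0
    while i < len(pieces):
        char, box = pieces[i]
        if (char in ROUND_LETTERS and i + 1 < len(pieces)
                and pieces[i + 1][0] == "-" and contains(box, pieces[i + 1][1])):
            result.append((THETA, box))  # keep the outer box for the merged glyph
            i += 2
        else:
            result.append(pieces[i])
            i += 1
    return result


pieces = [("O", (0, 0, 10, 12)), ("-", (3, 5, 7, 7)), ("A", (12, 0, 20, 12))]
print("".join(char for char, _ in merge_theta(pieces)))
```

A dash that sits to the right of the O, outside its box, is left alone, so ordinary text such as "O-" in a hyphenated word would not be rewritten.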

nautilus7

29th January 2013, 00:35

Yeah, although when I first reported this specific character problem (on the previous page) I wrote "it's been detected as..." and not "been split up to...", I thought I should be clearer now. :p

Anyway, glad to hear that it is easy to fix.

Thunderbolt8

3rd February 2013, 23:28

I've got a bit of trouble with this file: http://www.sendspace.com/file/42kue9

Apparently some "l" (small letter L) are misrecognized as 1 or I. But I can't seem to delete those letters from the training list of that movie, because no small letter "l" is listed there (after k comes m). I also can't change this during the spell-check stage, because if, for example, the word villain is recognized as "vi1Iain", then spell-check will only give me suggestions for the first three characters, vi1, but not for the complete word. And words with only one wrong letter "l" are not even offered to me for spell-checking.

I already deleted all trainings for 1, I, L, i, l and also their italic variants from that movie's training list, but it didn't change anything.

Tappen

4th February 2013, 21:51

Thunderbolt8: It doesn't matter if a letter is trained as l (lower L) or I (upper i). The program treats them as the same internally and sorts out which is which in the spell-check stage because they're so often identical.

There is a problem when 1 (one) has the same pattern as l or I, however. I've seen this myself. Currently there's no way to fix this problem and the spell-check step will ignore the 1 characters. I think in the few cases where this came up for me, I restarted the OCR and trained the 1 character as I, then edited the final file by hand, because 1s are pretty rare.
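
The post-OCR disambiguation described here can be pictured with the following Python sketch. It is only an illustration under stated assumptions, not the program's spell-check: the `resolve_ambiguous` name and the tiny word set are hypothetical. Every "I" or "l" in a word is treated as interchangeable, and the variants that appear in a word list are kept.

```python
from itertools import product
from typing import List, Set


def resolve_ambiguous(word: str, dictionary: Set[str], ambiguous: str = "Il") -> List[str]:
    """Return the dictionary words obtainable by swapping I/l in `word`."""
    positions = [i for i, c in enumerate(word) if c in ambiguous]
    matches: List[str] = []
    # Try every assignment of the ambiguous characters (fine for short words).
    for combo in product(ambiguous, repeat=len(positions)):
        chars = list(word)
        for pos, replacement in zip(positions, combo):
            chars[pos] = replacement
        candidate = "".join(chars)
        if candidate in dictionary:
            matches.append(candidate)
    return matches


words = {"villain", "Illness"}
print(resolve_ambiguous("viIIain", words))  # both L's were read as capital I
print(resolve_ambiguous("lllness", words))  # capital I was read as lower L
print(resolve_ambiguous("vi1Iain", words))  # the digit 1 is not in the ambiguous set
```

The last call shows the limitation mentioned above: a "1" is outside the interchangeable set, so no candidate matches and the word would have to be corrected by hand.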

Betsy25

4th February 2013, 23:46

Tappen, is adding a "Remove All" button in the Options/Special Characters window on the to-do list, please?

Tappen

5th February 2013, 01:47

Betsy25: Yes I already implemented it. I just put up a minor version build 1031c for you.

Overdrive80

5th February 2013, 05:12

@Tappen: Would it be possible to rip hardsubs with your software?

Tappen

5th February 2013, 05:30

Overdrive80: It would take a huge amount of work, sorry. The OCR algorithm I'm using doesn't work at all well on anything but soft subs so I'd have to start over.

nautilus7

5th February 2013, 13:45

Hi, Tappen, any news regarding the special code for Greek upper case Theta ("Θ") detection?

Betsy25

6th February 2013, 01:42

Perfect! Thank you so much for implementing the "Remove All" button.

Thunderbolt8

6th February 2013, 12:41

@Tappen: I think with the .srt subtitle format it wouldn't be much of a problem. But in this case it's a file for which I need to retain the subtitles at their original positions via .ass, and therefore I have lots of 1s in the position information of every line and also in the italic markers -.-