SubExtractor - New Sub Ocr App [Archive] - Page 9

Betsy25

9th February 2013, 05:39

I have problems with the "Monty Python - The Meaning of Life" DVD.

Suddenly the subtitle times become negative for the rest of the movie :
345
00:20:54,422 --> 00:20:56,698
kwamen er overal kinderen.

346
00:-11:-56,-386 --> 00:-11:-51,-994
DE ZIN VAN 'T LEVEN
DEEL II - GROEI EN ONDERWIJS

347
00:-11:-51,-106 --> 00:-11:-46,-953
"En zagen zij andermaal de kamelen
voor het derde uur.

348
00:-11:-45,-666 --> 00:-11:-42,-195
"En zo gingen de Midianieten
op weg naar Ram Gilead...

349
00:-11:-41,-746 --> 00:-11:-39,-948
"in Kadesh Bilgemeth...

etc...

When I'm in the "Select subtitle data file" page, I can enter 346 in the Index field, and there it shows the correct timing, but when the file gets saved after recognition, the timings are negative from sub 346 onwards...

Please, what could be wrong?
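
To find where the timings first go wrong in a saved file, the .srt can be scanned for cues whose time fields carry a minus sign. This is only a sketch and not part of SubExtractor; the function name `find_negative_cues` and the embedded sample are made up for illustration.

```python
import re

# Matches an SRT timing line, tolerating a minus sign in front of any field,
# e.g. "00:-11:-56,-386 --> 00:-11:-51,-994".
TIMING = re.compile(
    r"^(-?\d+):(-?\d+):(-?\d+),(-?\d+) --> (-?\d+):(-?\d+):(-?\d+),(-?\d+)$"
)

def find_negative_cues(srt_text):
    """Return the index numbers of cues whose timing line has a negative field."""
    bad = []
    lines = srt_text.splitlines()
    for i, line in enumerate(lines):
        m = TIMING.match(line.strip())
        if m and any(field.startswith("-") for field in m.groups()):
            # In SRT the cue index sits on the line just above the timing line.
            index_line = lines[i - 1].strip() if i > 0 else ""
            bad.append(int(index_line) if index_line.isdigit() else None)
    return bad

sample = """345
00:20:54,422 --> 00:20:56,698
kwamen er overal kinderen.

346
00:-11:-56,-386 --> 00:-11:-51,-994
DE ZIN VAN 'T LEVEN
"""
print(find_negative_cues(sample))
```

The first number it reports is the cue where the source timing most likely jumped.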

Tappen

10th February 2013, 21:33

nautilus7: sorry no news yet - busy at my real work
Betsy25: you'll have to send me the source sup or idx/sub file to look at - I'll send you a private message

nautilus7

10th February 2013, 23:13

No problem. Real life is priority.

Sm3n2

21st February 2013, 20:21

Hi,
Your program is very good, but one thing is missing. Are you thinking of adding .SON support? I extract subtitles from TS to a SON file (with .bmp images) and OCR them using Subtitle Edit. If your program could do this, it would be great.
Cheerz.

Thunderbolt8

26th February 2013, 20:06

got a subtitle file in which the letter "P" is automatically split into "F>". I can remove this split in the Review and Correct OCR Matches dialog, but it automatically gets recognized as such again, so there's no way for me to actually change this within the program.

would it be possible in the future to add the option to define any character at any line of a subtitle file? then these problems shouldn't really matter much any more, because you could simply tell "P" to be recognized as character "P".

Tappen

27th February 2013, 17:33

Thunderbolt8: Try getting rid of the letter > (untrain all instances of it) to fix this particular problem. I'm not sure what you mean by "define any character at any line of a subtitle file". One thing I can do is allow you to turn off auto-splitting temporarily, but you'd have to notice that there's a problem for that to work.

PowerGamer

28th February 2013, 20:24

Can you add an option to produce

00:01:02,981 --> 00:01:05,108
<i>RALPH: My passion bubbles
very near the surface,</i>

instead of

00:01:02,981 --> 00:01:05,108
<i>RALPH: My passion bubbles</i>
<i>very near the surface,</i>

when outputting in srt format?

Tappen

28th February 2013, 21:37

PowerGamer: I think the first should probably be the default. I'll look into it. I hate adding options like this that are hard to understand.

nautilus7

28th February 2013, 22:43

What if the 1st line is written in italics and the 2nd in regular letters?

Tappen

3rd March 2013, 06:35

nautilus7: I just meant that if the last character on 1 line and the first character on the next line are both italic, I shouldn't close then open the italic tag. It shouldn't make a difference but there are lots of funny renderers in the world.
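
The rule described here can be sketched as a post-processing step on the text of a single cue: when one line ends with a closing italic tag and the next line begins with an opening one, both are dropped so a single italic span covers both lines. This is an illustration only, not SubExtractor's actual code; `merge_italics_across_lines` is a made-up name.

```python
import re

def merge_italics_across_lines(cue_text):
    """Collapse '</i>' + newline + '<i>' so one italic span covers both lines."""
    return re.sub(r"</i>(\r?\n)<i>", r"\1", cue_text)

per_line = "<i>RALPH: My passion bubbles</i>\n<i>very near the surface,</i>"
print(merge_italics_across_lines(per_line))
```

A cue whose first line is italic and whose second line is regular has no `<i>` at the start of the second line, so it is left untouched.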

Thunderbolt8

3rd March 2013, 16:30

Thunderbolt8: Try getting rid of the letter > (untrain all instances of it) to fix this particular problem. I'm not sure what you mean by "define any character at any line of a subtitle file".

what I mean is basically this: on the first screen, where you can still look through all lines manually before starting the OCR process, pick any line and take it to the OCR table of the next step, then click on any character of the original .sup file and manually assign a character from the table to it.

or something like that ;)

Tappen

6th March 2013, 07:05

nautilus7: please try the test release I put up on Codeplex for your Theta issue

Tappen

6th March 2013, 07:06

Thunderbolt8: I just got rid of <> and () from the auto-splittable character list. It's safer and I don't think it will hurt people much if at all.

nautilus7

6th March 2013, 11:12

nautilus7: please try the test release I put up on Codeplex for your Theta issue

I will later today. Thanks.

nautilus7

6th March 2013, 23:02

Tappen, it works fine now with the subs i tested. If i come across any other issue, i will report. Thanks.

Tappen

9th March 2013, 23:46

speedoflight: I've made progress on the accuracy of the HD OCR algorithm. i and ¡ characters are recognized properly in the cases I've tested. The release is 1032b on Codeplex.

nautilus7

10th March 2013, 00:57

I guess this will be helpful for the greek giota ( ί ) as well. Right?

Tappen

10th March 2013, 01:12

nautilus7: Probably will help. I'll add it in 1 extra place to be sure. Also I'll change the option text to read "Spanish or Greek". The latest is 1032c

nautilus7

10th March 2013, 12:29

Thanks. As always, I'll let you know when I come across any issues.

Tappen

11th March 2013, 05:56

Thunderbolt8: You might want to try 1032c for the various subs you've found with errors (I'm thinking of the 1 recognition problem in particular). It may be fixed.

I didn't get the problem file from Sendspace in time to test it myself

Chetwood

17th March 2013, 10:41

Just OCRed a new batch with 1.0.3.1 and I cannot undo the last item that was spellchecked?

Tappen

17th March 2013, 17:50

Chetwood, That's a bug that's been around forever and I've never bothered to fix. Maybe now that someone other than me has found it...

Johnny_B_E-Work

25th March 2013, 09:28

Hello, could you please add a way to manually enter a string of letters? Sometimes, no matter what I try, SubExtractor will not split correctly or select the correct segments of particular letters. Ignoring such a symbol is not an option, because if the same segment is part of any other letter, it is ignored too. If there is another way to solve this, I could not find it...

I hoped it would work this way: I hit F2 (Enter mode), select with keyboard arrows the section of text (for example the whole word), manually type in the letters, hit Enter = unfortunately this leads to training of the first letter typed only...

Thunderbolt8

28th March 2013, 20:25

got a file here in which the hyphen "-" still remains after removing SHD stuff when [] is used for brackets instead of ().
http://www.sendspace.com/file/lbj4ez

e.g.

- blaa. blabla
- [shd stuff]

-->

- blaa. blabla

should be easy to fix though I guess.
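
A rough sketch of the behaviour being asked for, assuming the desired result is the remaining line without its dialogue hyphen and that [] should be treated like (). This is not how SubExtractor is implemented; `strip_shd` is a hypothetical helper.

```python
import re

# Bracketed hearing-impaired text in either (...) or [...] form.
BRACKETED = re.compile(r"\([^)]*\)|\[[^\]]*\]")

def strip_shd(cue_text):
    """Remove bracketed SHD text; drop a leftover dialogue hyphen if one line remains."""
    kept = []
    for line in cue_text.splitlines():
        cleaned = BRACKETED.sub("", line).strip()
        # Skip lines that are now empty or consist of a bare hyphen.
        if cleaned in ("", "-"):
            continue
        kept.append(cleaned)
    # A leading hyphen marks a change of speaker; with one line left it is redundant.
    if len(kept) == 1 and kept[0].startswith("-"):
        kept[0] = kept[0].lstrip("-").strip()
    return "\n".join(kept)

print(strip_shd("- blaa. blabla\n- [shd stuff]"))
```

Cues where two spoken lines remain keep their hyphens, since they still mark two speakers.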

Tappen

29th March 2013, 00:08

Johnny_B_E-Work - I'm pretty sure it's possible to use the current split function to solve your problems, but how you use it is not intuitive at all. (Adding multi-character matching is just too hard with the code as it is).

A couple of hints for this (admittedly too difficult to use) feature:

1. If you're getting a lot of characters that need to be split you should try a different Palette. Sometimes the shadows or highlights around characters are selected by mistake resulting in characters merging.

2. Select the largest part of the character you need to split before hitting the Split button. The Split feature will work on the biggest chunk selected, not the 1st.

3. Tricky characters running together will sometimes require you to split multiple small pieces off a big chunk one at a time, hitting the Split button again and again to do so, then connecting pieces back together to do the match. If this is happening be sure to double-check if reason 1. is the basic problem.

Tappen

29th March 2013, 00:10

Thunderbolt8: I'm not sure I ever remove a hyphen if it is in the original subs and part of a line that I'm keeping. I'll check what's happening though.

Johnny_B_E-Work

29th March 2013, 03:46

I already figured out how to use the split function, and in most scenarios it works fine. The problem I sometimes have is with particular special letters of my native language (Czech) like " ď " and " ť ", and believe me, no matter what I try I am unable to achieve satisfactory results (I can provide some examples if you are interested).
It's a shame that a multi-letter entry function cannot be implemented (like old SubRip has) - sometimes it is really much easier just to type in the letters manually and move on. If this could be solved somehow then your program would be absolutely perfect (already MUCH faster than SubRip, more user friendly and all in all great to work with)

Chetwood

29th March 2013, 08:24

Chetwood, That's a bug that's been around forever and I've never bothered to fix.
I usually don't run into this but this time was the end of a long batch of subs and I clicked on the wrong word so it'd be nice to be able to correct it too.

Tappen

29th March 2013, 21:54

Johnny_B_E-Work: Please send me a bin or sup file with the problems, using the Private Message system here or on Codeplex, so I can see for myself. I might be able to see a way to solve things that is easier to implement.

Ghitulescu

16th May 2013, 08:59

According to your web page, you plan an installer. I believe people that need to work with subtitles have enough experience to unpack a ZIP ... haven't they? :)
Is the Italic thing solved?
What about musical characters? They must be Unicode, as none of them are in ANSI (some are in the extended ANSI code pages) ...

Thunderbolt8

2nd June 2013, 11:52

got a case here in which

Bangkok:
good-time city,

is recognized as SHD and therefore "Bangkok:" is removed if SHD removal is checked.

maybe it would be useful only to consider these cases as SHD if the trigger "Xxxxx:" is not the only thing a line consists of? Or are there examples in which such cases (e.g. "Name:" and the blablalbla in the next line) are really SHD subs?

Tappen

6th June 2013, 04:14

There are definitely cases where the speaker name: is on one line and the text is on another I'm afraid.

SJX

11th June 2013, 16:28

My first post... I'm in the process of converting my DVDs to H.264 compressed files. I wanted soft subtitles, so I started using SubExtractor instead of encoding hard subs with Freemake. BUT, I'm unable to produce any good subtitle files and I cannot figure out what is wrong. I have tested 5 DVDs now. The most common problem is that the sub timestamp keeps resetting during the movie. For instance, on this DVD, "Umur":

155
00:02:55,560 --> 00:02:58,155
En mie tarvitse. Sie saat sen.

156
00:00:00,800 --> 00:00:03,679
Umur kävi luonani.

(The timestamp is reset after 3 minutes, and again after 5 minutes:)

218
00:04:42,280 --> 00:04:45,398
Minä rakastan sitä miestä.

219
00:00:03,200 --> 00:00:06,398
Lähetin Umurille kaktuksen,
en ruusua.

The result is a bunch of overlapping subtitles from various parts of the movie. The resets usually seem to happen at 3-5 minute intervals, but one title had 10-minute intervals.

Chetwood

12th June 2013, 05:57

What tool do you use to encode to which container? Usually there's no need to extract subs.

SJX

12th June 2013, 07:35

What tool do you use to encode to which container? Usually there's no need to extract subs.

I don't see how the encoding is related to subtitle extraction, but:
- For encoded (hard) subs, I use Freemake all the way.
- For soft subtitles, i.e. subtitle files that can be turned on/off, I use DVDFab to rip, SubExtractor to OCR the subs, and Freemake to convert.
For the final format I usually use MP4 (H.264/AAC), since my Sony TV and PS3 are able to play those directly.

Tappen

12th June 2013, 17:20

Are you using DVDFab to convert to an unencrypted VIDEO_TS folder or to go to a single main movie file? If it's the 2nd choice, that explains the problem. For accurate subs, SubExtractor should be used to convert from the VOB files to an mpg file (which can be used in Handbrake or whatever) and a bin file (SubExtractor sub format) which can be OCR'd.

SUB/IDX files created from DVDFab main movie rips will have bad timing regardless of which OCR program you use. DVDs have chapter breaks which reset the time to 0 inside them, and unfortunately DVDFab does not clean this up when it rips to a single file.
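
The resets described above show up in the output as cues that start earlier than the cue before them. A minimal sketch for spotting them in an .srt file follows; it only reports the jumps and does not try to repair them, and `find_time_resets` is a made-up name.

```python
import re

# Start time of an SRT cue, e.g. "00:02:55,560 --> 00:02:58,155".
START = re.compile(r"^(\d+):(\d+):(\d+),(\d+) -->")

def find_time_resets(srt_text):
    """Return (previous_ms, current_ms) pairs where a cue starts before the previous one."""
    resets = []
    previous = None
    for line in srt_text.splitlines():
        m = START.match(line.strip())
        if not m:
            continue
        h, mnt, s, ms = (int(g) for g in m.groups())
        current = ((h * 60 + mnt) * 60 + s) * 1000 + ms
        if previous is not None and current < previous:
            resets.append((previous, current))
        previous = current
    return resets

sample = """155
00:02:55,560 --> 00:02:58,155
En mie tarvitse. Sie saat sen.

156
00:00:00,800 --> 00:00:03,679
Umur kävi luonani.
"""
print(find_time_resets(sample))
```

If this reports anything, the rip is the likely culprit, and re-extracting from the VOB files as described is probably a better fix than shifting the timings afterwards.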

Chetwood

13th June 2013, 05:50

MMh, you're sure about that? Cause so far I had no timing problems when ripping main movie only. In fact, Fengtao stated on the forums that the only way to be sure all protections are gone (if not they may cause out of sync subs) is to rip the main movie only rather than the complete disc.

SJX

13th June 2013, 07:08

Are you using DVDFab to convert to an unencrypted VIDEO_TS folder or to go to a single main movie file?
I extracted only the main movie to the hard drive, not the extras. But of course it resulted in an unencrypted VIDEO_TS folder.

But I think I found the reason. I was careless about encoding the mpg and the other files, and really didn't understand the point until you explained. I had not always deleted the old mpg and other files, and sometimes even skipped the mpg encoding. Now I managed to get subs without the timestamp issue.

It would be nice to have a "delete aux. files" button in SubExtractor for use after a successful extraction, since those files are no longer needed.

Johnny_B_E-Work

14th June 2013, 03:17

I extracted subtitles from the THX 1138 DVD and I get weird timings in the output SRT file, which makes it unusable:

278
00:06:19,027 --> 00:06:21,861
...and speed it up by four times...

279
00:06:22,027 --> 00:06:27,421
...you'll hear "Stabat Mater"
by Pergolesi.

280
00:-11:-12,-022 --> 00:-11:-09,-666
I was so afraid.

281
00:-11:-05,-462 --> 00:-11:-03,-584
So alone.

282
00:-10:-55,-782 --> 00:-10:-53,-313
I wanted to touch you...

...and so on until the very end:

1280
-07:-07:-57,-139 --> -07:-07:-53,-589
We only want to help you.

1281
-07:-07:-52,-979 --> -07:-07:-49,-781
This is your last chance.

Tappen

14th June 2013, 21:47

Johnny_B_E-Work: Are you OCRing from a SUB/IDX file combo? If so, can you open the IDX file in a text editor like Notepad and see if the problem is there? Otherwise let us know - maybe there are some other suggestions.
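
For long files the same IDX check can be scripted instead of done by eye. The sketch below assumes the usual VobSub layout of `timestamp: HH:MM:SS:mmm, filepos: ...` entries and lists the lines where the time goes backwards; the sample values and the name `idx_backward_jumps` are invented for illustration.

```python
import re

# VobSub index entries look like: "timestamp: 00:06:22:027, filepos: 000001000"
ENTRY = re.compile(r"^timestamp:\s*(\d+):(\d+):(\d+):(\d+),\s*filepos:")

def idx_backward_jumps(idx_text):
    """Return 1-based line numbers of entries that start before the previous entry."""
    jumps = []
    previous = None
    for number, line in enumerate(idx_text.splitlines(), start=1):
        m = ENTRY.match(line.strip())
        if not m:
            continue
        h, mnt, s, ms = (int(g) for g in m.groups())
        current = ((h * 60 + mnt) * 60 + s) * 1000 + ms
        if previous is not None and current < previous:
            jumps.append(number)
        previous = current
    return jumps

sample = """# VobSub index file, v7 (do not modify this line!)
timestamp: 00:06:19:027, filepos: 000000000
timestamp: 00:06:22:027, filepos: 000001000
timestamp: 00:00:01:500, filepos: 000002000
"""
print(idx_backward_jumps(sample))
```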

Johnny_B_E-Work

15th June 2013, 00:13

Are you OCRing from a SUB/IDX file combo?

No, directly from the DVD, i.e. the *.bin file was created by SubExtractor.

Johnny_B_E-Work

15th June 2013, 00:39

I think I figured it out. I originally extracted (the longest) track 1 which - as I just discovered - is not the feature film only but the feature PLUS some behind-the-scenes clips which are played during the movie, i.e. they are inserted on the fly. Hence the first 6 minutes were just fine; after that the first extra scene was inserted and the timing got screwed.
Now I extracted track 2 (shorter, feature only) and there it seems to work just fine.

Thunderbolt8

28th July 2013, 21:56

is it actually possible to adjust the spacing for italic characters? the list apparently only shows non-italic characters and all line examples at the left side are also non-italic ones.

Weirdo

1st August 2013, 11:23

Great application, thanks! Doing a quick test now, and it seems to have problems with the medium DPI setting (125%), several windows/dialogs are cut off.

Tappen

3rd August 2013, 17:16

Thunderbolt8: There are 2 radio buttons to select this: one labelled Normal and the other Italic - roughly in the middle top of the Word Spacing page.

Weirdo: It looks like the Options dialog has the most problems; I should probably make it bigger. I didn't see any issues with the main pages.

rhaz

9th August 2013, 17:42

Hi. Very nice tool indeed. It worked fine with the subtitles from one DVD, but now I am having problems with the subtitles from another DVD: it gives incomplete characters.

For example, for the character 'š' it gives me only 's' after OCR in some words, but not all. Why? I am sure I selected it correctly when OCR'ing (also I cannot find how to delete the saved characters from the tool's memory so I could try to train the character again and see if this gets fixed).

Also, there's a misspelled word, 'hilite', in a couple of places in the tool. I think it should be 'highlight'.

rhaz

9th August 2013, 18:04

Nevermind. Fixed it in 'Review and Correct OCR Matches'.

Tappen

11th August 2013, 00:41

rhaz: I use the spelling hilite because I'm a programmer and that's the way it's spelled in most software APIs. It's usually done that way because it's shorter and therefore requires less valuable screen space. Just a personal quirk, sorry.

Also - if you didn't notice, you can use the arrow keys to select the multiple pieces making up a character like š. I was surprised when I added that feature how much faster it made even English OCR (ij!: characters).

rhaz

16th August 2013, 15:13

Ok other problem.

http://i.imgur.com/lKDgs10.gif

It is the letter 'ą' followed by ',', so it's 'ą,', but as you can see split doesn't work, because the pieces don't touch or something. I tried many ways of splitting it, but as you can see it gives me just 'a' when splitting, not 'ą' or 'ą,'.

So if I select it as 'a', the word will be misspelled, and that would probably affect the rest of the words having the same characters.

rhaz

17th August 2013, 20:41

Another problem. I get an error with the subs from one DVD. I tried extracting the subs with various tools like MeGUI, VSRip and your tool, but your tool doesn't even let me extract the subs; the buttons are grayed out.

http://i.imgur.com/Ar4Vtl8.jpg

The subs play fine on the DVD though. I don't get what the problem is. It works fine in SubtitleEdit, but I prefer your tool because it is waaay more accurate, is less work and saves tons of time compared to SubtitleEdit.

The subs are attached: http://www72.zippyshare.com/v/75761929/file.html