SubExtractor - New Sub Ocr App


nautilus7
25th October 2011, 21:46
I've also seen professional studio subs that use 2 dashes to distinguish each speaker in that case.

Tappen
26th October 2011, 01:15
I've seen hyphens, em dashes, and double hyphens all used to separate speakers. Sometimes, though it's not common, the first speaker has no indicator and only the 2nd and any others further down the screen do.

It's also common for speakers to be separated by positioning of the subtitles on the screen, so this is much more of an SRT conversion issue than ASS conversion where positioning can be preserved.

Interesting problems. I think I'm going to try to preserve the choices of the original subtitle authors where possible, but use a single hyphen the rest of the time to indicate multiple speakers after SDH removal. No one will be perfectly happy but things should work out pretty well.

Thunderbolt8
26th October 2011, 06:28
but use a single hyphen the rest of the time to indicate multiple speakers after SDH removal. No one will be perfectly happy but things should work out pretty well.
Please give an example of how you mean it to look; I'm not sure what you mean. IMHO the one-hyphen style should only be used when there really is one hyphen in the original subtitle layout, if possible. Otherwise, the two-hyphen style for two lines that we have so far is better. I find that doing your own one-hyphen layout can sometimes look a bit strange compared to what the studios would do. They seem to have given more thought to when it is logical to put one hyphen, two, or none. That wouldn't be possible here.

Tappen
26th October 2011, 13:26
The current "Change of speaker prefix" I use is a single "-". All I'm saying is that I'm going to keep this the same. If there's a different prefix in the subs already I leave it alone.

nautilus7
28th October 2011, 20:41
version 1019 has some issues with SDH removal...

143
00:15:34,893 --> 00:15:37,062
[ALARM CONTINUES SCREECHING
IN DISTANCE]


isn't removed at all, probably because the brackets are not on the same line.

Thunderbolt8
28th October 2011, 23:00
That is the same for all versions so far. Tappen said he hasn't thought of a way to remove SDH text that spans two lines (without risking breaking something).
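
For what it's worth, a two-line description like the one nautilus7 posted can be caught by matching brackets over the whole cue text instead of line by line. This is only a minimal Python sketch, not SubExtractor's code: the name `strip_multiline_sdh` is made up for illustration, and it assumes brackets never appear in real dialogue, which is exactly the "might break something" risk.

```python
import re

# Hypothetical pattern: a [ ... ] or ( ... ) pair with no nested brackets.
# A negated character class also matches newlines, so the closing bracket
# may sit on a later line of the same cue.
SDH_PATTERN = re.compile(r"\[[^\[\]]*\]|\([^()]*\)")


def strip_multiline_sdh(cue_text: str) -> str:
    """Remove bracketed descriptions from one cue's text, across line breaks."""
    stripped = SDH_PATTERN.sub("", cue_text)
    # Drop lines that the removal left empty.
    lines = [line.strip() for line in stripped.splitlines()]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(repr(strip_multiline_sdh("[ALARM CONTINUES SCREECHING\nIN DISTANCE]")))
    print(repr(strip_multiline_sdh("(HORN HONKING)\nHi, John.")))
```

A cue that contains nothing but the description comes back empty, so the caller would still have to decide whether to drop that cue altogether.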

nautilus7
28th October 2011, 23:17
Ah, ok, missed that.

mindbomb
29th October 2011, 00:10
neat.

Thunderbolt8
29th October 2011, 16:51
Would it be possible to stop the '-' hyphens from being added during SDH removal when both .ass output and "exactly position every line" are ticked?

In that case the different on-screen positions of the lines already indicate that more than 1 person is speaking at the moment. The additional hyphens are then superfluous and look strange with that kind of subtitle (I am referring to subtitles of up to 3 lines which are positioned just about anywhere on screen).

if others have a different opinion on this, then at least having the option for this would be nice

Thunderbolt8
30th October 2011, 00:41
http://www.mediafire.com/?yujdlgn86d157uk

IMHO it would be useful if lines that begin with '--' didn't get the additional hyphen added during SDH removal. Currently

MAN 1 [ON RADIO]:
<i>--supported Senator Eagleman.</i>

gets changed to

<i>---supported Senator Eagleman.</i>

while that 2nd line would look just fine the way it was:

<i>--supported Senator Eagleman.</i>



now I am not sure, are there cases in which another speaker is indicated first, like

man 1: blabla
<i>--supported Senator Eagleman.</i>

which normally would get changed to (? just a guess, havent seen such a case yet)

- blabla
<i>- --supported Senator Eagleman.</i>

but that looks somehow strange. after the proposed change, it would look like

- blabla
<i>--supported Senator Eagleman.</i>

That would look a bit strange as well. But maybe that situation doesn't really occur? As far as I can remember, a -- at the beginning of a line is usually only combined with other lines that don't start with a hyphen. I might be wrong, though.
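
The "leave lines that already start with a dash alone" idea could be sketched like this in Python. It is an illustration only: `add_speaker_prefix` is a hypothetical helper, and it only looks past a single leading `<i>` tag.

```python
def add_speaker_prefix(line: str, prefix: str = "-") -> str:
    """Add a change-of-speaker prefix unless the text already starts with a dash."""
    # Look past a leading italic tag so "<i>--text" counts as starting with "--".
    tag = "<i>" if line.startswith("<i>") else ""
    body = line[len(tag):]
    # A hyphen, double hyphen, or em dash is treated as the author's own
    # marker and left untouched.
    if body.startswith(("-", "\u2014")):
        return line
    return f"{tag}{prefix}{body}"


if __name__ == "__main__":
    print(add_speaker_prefix("<i>--supported Senator Eagleman.</i>"))
    print(add_speaker_prefix("Hi, John."))
```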

Thunderbolt8
30th October 2011, 01:41
http://www.mediafire.com/?zz19kis7hcfcxe7

some inconsistency:

(HORN HONKING)
GIRL: Hi, John.

gets changed to

-Hi, John. (hyphen superfluous, but we know that problem already)

or

GIRL: <i>Can I wizz</i>
<i>on you, Wolfman?</i>
(SOFT ROMANTIC SONG PLAYING)

gets changed to

<i>-Can I wizz</i>
<i>on you, Wolfman?</i> (SRT, 1 hyphen)

{\an4\pos(377,737)}{\i1}-Can I wizz{\i0}
{\an4\pos(377,817)}{\i1}-on you, Wolfman?{\i0} (ASS, 2 hyphens; seems to depend on whether you tick exact position... or not; then the SDH line () gets inserted above that dialogue instead of below)



while

Hi, John.
JOHN: Not too good, huh?

gets changed to

Hi, John.
Not too good, huh?

instead of

Hi, John.
- Not too good, huh?

Here the hyphen is actually missing (which wouldn't be bad for .ass in combination with exact positioning of every line, as proposed some posts ago, but is bad for .srt and/or when not using exact positioning of every line).

Chetwood
30th October 2011, 10:46
Why change that at all and not simply convert what's in there?

Also some more issues/feature requests:

I'm having trouble OCRing the word "figures" in italics. I can only split the word in 2 instead of 3 parts and I'm not automatically asked for a second split.

http://img97.imageshack.us/img97/1424/splitbj.th.png (http://imageshack.us/photo/my-images/97/splitbj.png/)


I can't save SRTs ripped from Vobsubs anywhere except c:\Users\Chetwood\videos. "Store Sup File Outputs in Source Directory" only applies to SUP but not Vobsub?


I can't change the "OCR data file location"


please add an option to save to ANSI instead of UTF (a lot of standalones have problems with the latter)


Being able to drag and drop a sub file onto the program window would be cool


Thanks!

Thunderbolt8
30th October 2011, 12:51
Why change that at all and not simply convert what's in there?
Because these problems occur in combination with SDH removal.

Chetwood
31st October 2011, 06:33
Right. Apparently I overlooked this. I'm still annoyed, though, that the authoring people don't just add a separate stream for this. Should be a piece of cake for them.

Tappen
2nd November 2011, 23:25
Chetwood:

Just split multiple times if you have to. So in the case you shared highlight the "dot" of the i and complete split, start split again and get the bottom of the i. (Sorry I'm making people think like a programmer rather than a normal human in this case but my long estimate of the time it'd take to code it to work the proper way makes it a low priority issue)

I'll make the option to save in same directory apply to sup and idx/sub next release.

Sorry I don't yet allow moving the OcrMap.bin file location. I show the location so people can back up or move it to new machines manually. I'm thinking about how to allow this to be changed safely. Different versions of Windows have really different default program data locations and security around writing files. I don't want to spend a lot of time on error handling on such a minor feature.

Good idea to allow option to save to ANSI SRT as well as UTF. I'll try to add it soon. Whatever ANSI codepage the Windows UI Culture is currently running should be ok.
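
As a rough illustration of the ANSI idea (a Python sketch, not SubExtractor's code; `encode_for_ansi` is a hypothetical helper): encode with the system's preferred code page, which on Windows is typically the ANSI code page, and substitute any character that code page cannot represent.

```python
import locale
from typing import Optional


def encode_for_ansi(text: str, codepage: Optional[str] = None) -> bytes:
    """Encode subtitle text for players that cannot read UTF-8."""
    # Fall back to the system's preferred encoding when no code page is given.
    codepage = codepage or locale.getpreferredencoding(False)
    # errors="replace" turns unrepresentable characters into "?" instead of
    # raising UnicodeEncodeError.
    return text.encode(codepage, errors="replace")


if __name__ == "__main__":
    print(encode_for_ansi("Grüße", "cp1252"))
    print(encode_for_ansi("♪ la", "cp1252"))
```

The catch is that anything outside the code page (music notes, some quotes) is lost, so UTF-8 would probably have to stay the default.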

Drag'n'drop files. Yeah maybe.

Tappen
2nd November 2011, 23:28
Thunderbolt8: I'll look into the SDH errors soon. One thing I'm definitely going to change is to remove the added hyphens when saving to ASS if the 2 lines aren't part of the same block (in terms of position on the screen).

Thunderbolt8
3rd November 2011, 00:14
how do you plan to find out whether the 2 lines are part of the same block? by distance of letters and lines?

Usually, even when 2 people are standing next to each other, there's enough space to indicate 2 different speakers. But sometimes, for example when one person is speaking off-screen or standing behind another speaker, the 2 or 3 lines of speech on screen can be so close to each other that it's easy to mistake them all for a single speaker's. In such cases the lines of one speaker are often differentiated from the other's by being italicized. So maybe italics could also be a criterion for deciding whether lines belong to the same block in these situations, where narrowing the distance criterion too far wouldn't help.

Tappen
3rd November 2011, 02:02
I already break characters into rectangular blocks and OCR them separately in the code. The rule is something like "within 4 normal characters' width left or right, or 2 normal characters' height up or down, means it's in the same block".

If there's an error, you'll see an extra couple of hyphens occasionally. Add too many rules and it'll just make the code unfixable AND unreliable. So we'll go with what I've already got for blocks.
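
A proximity rule like the one described above could look roughly like this in Python. It is a sketch under assumptions, not the actual SubExtractor code: rectangles are `(left, top, right, bottom)` tuples, "within" is read as "horizontal gap at most 4 character widths and vertical gap at most 2 character heights", and all names are made up.

```python
from typing import Dict, List, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def gap(a_lo: int, a_hi: int, b_lo: int, b_hi: int) -> int:
    """Distance between two 1-D intervals; 0 when they touch or overlap."""
    return max(0, max(a_lo, b_lo) - min(a_hi, b_hi))


def same_block(a: Rect, b: Rect, char_w: int, char_h: int) -> bool:
    """Assumed rule: close enough horizontally AND vertically."""
    h_gap = gap(a[0], a[2], b[0], b[2])
    v_gap = gap(a[1], a[3], b[1], b[3])
    return h_gap <= 4 * char_w and v_gap <= 2 * char_h


def group_blocks(rects: List[Rect], char_w: int, char_h: int) -> List[List[Rect]]:
    """Merge rectangles into blocks with a small union-find."""
    parent = list(range(len(rects)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            if same_block(rects[i], rects[j], char_w, char_h):
                parent[find(i)] = find(j)

    groups: Dict[int, List[Rect]] = {}
    for i, rect in enumerate(rects):
        groups.setdefault(find(i), []).append(rect)
    return list(groups.values())


if __name__ == "__main__":
    lines = [(0, 0, 100, 20), (0, 30, 100, 50), (600, 0, 700, 20)]
    for block in group_blocks(lines, char_w=20, char_h=20):
        print(block)
```

With one block per speaker position, the added hyphens would only be needed when two speakers end up inside the same block.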

Chetwood
3rd November 2011, 14:19
Just split multiple times if you have to. So in the case you shared highlight the "dot" of the i and complete split, start split again and get the bottom of the i. (Sorry I'm making people think like a programmer rather than a normal human in this case but my long estimate of the time it'd take to code it to work the proper way makes it a low priority issue)
MMh, gonna retry this next time it occurs. IIRC splitting it once made the item not appear again so another split was impossible. Same goes for three letters "erj" recognized as one.

I get it that you want to minimize any potential troubleshooting for users but I'd really appreciate being able to select the OcrMap.bin file location myself. I don't trust 'c:\users' or 'My documents' so I put all important files into a folder that I backup regularly. Maybe you could pop up a short message ("all changes at your own risk!") when someone tries to deviate from the default location and be done with it.

BTW, I had another char not recognized, it had low double quotes Germans often use and looked like this: ,,e''. I had to manually fix it cause Subextractor would not accept it. Thanks again.

Tappen
3rd November 2011, 16:44
There's an automated attempt to split every unknown character: if there's a perfect split SubExtractor won't stop and ask, it'll just do it. So sometimes even 3 sections joined together require only 1 split.

I can add the low double quotes to the character selection box if it's a common occurrence in German. I think there's an empty spot right now. Let me see if I can find the unicode character point. I'll have to make it work like double quotes I guess.

Tappen
6th November 2011, 17:36
1.0.2.0 is out with some fixes for Chetwood and Thunderbolt8

Thunderbolt8
6th November 2011, 18:28
thanks

Thunderbolt8
6th November 2011, 19:24
is there anything you can do to improve character recognition with this file here? http://www.mediafire.com/?75hlrz4g7dmzsz9

It's the worst one I've ever encountered (and also the first one using this program). So far I'm barely halfway through and already have about 150 different characters for 'o' :/

Tappen
6th November 2011, 21:05
I've seen a few of those myself, and just gave up and used another program. I have some ideas on how to improve the OCR (it's issue #1 on codeplex) to allow for slight variations on the letters but haven't implemented it yet. It'll probably take a few months (of my spare time) to do.

Thunderbolt8
6th November 2011, 21:16
also resorted to another program to do them, but then I thought maybe it was nice to have those letters in my character library in case a similar BD turns up and I hopefully might be able to use some chars from that.

nautilus7
6th November 2011, 23:47
Thanks for the new version.

Tappen, would you consider adding basic subtitle syncing capabilities like framerate change and time delay to both bitmap and text based formats?

Tappen
7th November 2011, 00:24
Well, there already is a "Subtitle Offset (ms)" field on the Create Subtitle step which is the time delay you're looking for.

I have a 25->24fps conversion already in the code but only visible in Debug builds since I thought only I would use it. Are you looking for a generic Numerator/Denominator type framerate change option?
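
A generic numerator/denominator change plus an offset can be sketched on SRT timestamps like this (Python, purely illustrative; `retime` and its parameters are hypothetical names). It assumes the subtitles were timed for 25 fps playback and the same frames are now played at 24 fps, so every timestamp is multiplied by 25/24.

```python
import re
from fractions import Fraction

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def to_ms(timestamp: str) -> int:
    """Convert an SRT timestamp (HH:MM:SS,mmm) to milliseconds."""
    match = TIMESTAMP.fullmatch(timestamp)
    if match is None:
        raise ValueError(f"not an SRT timestamp: {timestamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def from_ms(total: int) -> str:
    """Convert milliseconds back to an SRT timestamp, clamping at zero."""
    total = max(0, total)
    hours, rem = divmod(total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def retime(timestamp: str, src_fps=Fraction(25), dst_fps=Fraction(24),
           offset_ms: int = 0) -> str:
    """Scale a timestamp by src_fps/dst_fps, then add a fixed offset."""
    scaled = to_ms(timestamp) * Fraction(src_fps) / Fraction(dst_fps)
    return from_ms(round(scaled) + offset_ms)


if __name__ == "__main__":
    print(retime("00:00:24,000"))
    print(retime("00:15:34,893", 1, 1, offset_ms=500))
```

Using exact fractions keeps rates such as 24000/1001 from picking up rounding drift over a whole movie.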

nautilus7
7th November 2011, 01:17
I've seen the offset you refer to, but I was thinking of more flexible functionality, like synchronizing already OCR'ed subtitles, or just synchronizing PGS subs without OCR'ing them.

If by "generic Numerator/Denominator type framerate change option" you mean the user will be able to type whatever frame rate he wants, something like this would be great:

http://i40.tinypic.com/9t0h91.png

Tappen
7th November 2011, 03:00
I'm going to try to improve the OCR as my first priority. I might be able to put in a framerate conversion quickly before that though. Synchronizing functions will have to wait till after the new OCR and alternate character sets are in.

Chetwood
7th November 2011, 07:55
1.0.2.0 is out with some fixes for Chetwood and Thunderbolt8
As my usual way of sayin thanks I'm gonna add some more feature requests ;)

Please make your tool remember folders! When I rip subs with VSrip, I keep English and German in one Vobsub. When I open it in Subextractor with some files I get the message "Subtitle file out of date or corrupted" (BTW, is there a way to copy the whole text of such an error message?). Subextractor seems to be picky cause playback of these Vobsubs is just fine and it also opens in Subtitle Creator, Vobsubstrip, etc.

Anyway, I split the sub to de/en and now I can open it in Subextractor. However, when one language is finished I have to click all the way down to the folder to open the next file. Having Subextractor remember the folder would be nice.

Also, I'm not quite sure where you save the OCR results when I correct or enter any chars. Like I said, I'm ripping two languages from the same TV show, both of which have the same font and colours. So if I enter some chars not automatically recognized like , o l O how come I have to enter them again when doing the second language? In case you need some demo files just let me know.

Thunderbolt8
7th November 2011, 10:44
AFAIK, unless the names of the subtitle files are similar and you open both during the same session, you'd have to enter those letters again.

Tappen
7th November 2011, 16:56
Chetwood: I don't support multiple streams in 1 Vobsub file because I've never seen what the file format looks like. I'll add support for it. There's just a lot I still don't know about subtitles.

Also, I set the starting directory to the default output directory for the open file dialog. I should probably only do that the first time - after that I'll let the system remember the last location.

Chetwood
8th November 2011, 09:44
Nice. In case you wanna create some test file just reauthor some DVD in DVD Shrink. Please remember that you do need to select all subtitle streams, cause otherwise the stream order will be messed up in tools like VSrip.

Thunderbolt8
8th November 2011, 21:23
got another thingie: (tree.sup)

unmodified .ass output (actually before and after looks the same here for .srt output, only that the 2nd line is already split into two separate lines there)


{\i1}- [Patrons Chattering]{\i0}\N- Hope it's not gonna take as long as last time.


With SDH removal and exact positioning of every line, this gets changed to:


{\i1}-{\i0}
- Hope it's not gonna take
-as long as last time.


while it should be ideally:


Hope it's not gonna take
as long as last time.


Or maybe at least something which involves only 1 hyphen, instead of that one line getting split up as if there were different speakers. Actually, one hyphen or 3 doesn't matter; they have to be removed manually afterwards anyway. So maybe the ideal solution is possible without breaking stuff, or not :P

Tappen
9th November 2011, 02:25
I'll look at that Thunderbolt8, but I'd rather get to work on other features right now than another round of SDH fixes.

Thunderbolt8
9th November 2011, 12:38
When clicking the "start over with this movie" button, it seems that all the characters OCRed during the same session get deleted. I did one movie and wanted to do the other subtitle track afterwards (the only difference in filename was the track number), but then I stopped partway, changed my mind, and clicked that button in case there were some mistakes, because I didn't feel like reviewing the characters for a track I no longer wanted to do. When I checked back on the first track, I noticed that I had to enter again every character I had already done during the first run.

So if this is really the case, it would be nice if recognized characters from different subtitle tracks could be told apart within one session. Otherwise, you either waste time checking a track you no longer feel like doing, or you discard the characters OCRed for other tracks during the same session.

Tappen
10th November 2011, 00:50
The Start Over button does really do that. It should probably be called something different. If you want to change your mind in the middle of an OCR just click the Previous menu item to get out. The Start Over button is there in case you need to clear all the OCR matches for the movie for some reason, like your cat walked over the keyboard or something.

Thunderbolt8
10th November 2011, 00:58
That is what I mean, but it seems to delete not only all the matches for the current movie, but those for all the movies already recognized earlier in the whole session.

When I press the Previous menu button before I finish OCRing a movie, will the characters recognized up to then get deleted?

Tappen
10th November 2011, 01:53
Previous saves everything done so far on the movie, so that's what you want.

Restart clears all the OCR matches for the current movie, even if they were manually made in another movie and just were found again and used in the current movie by the OCR engine. I really should put a warning message when people press it. I should maybe even remove the button from the Release version of the product since it's hard to see where it'd be useful for anyone but a developer.

Thunderbolt8
10th November 2011, 14:32
so if I understand correctly that restart process could remove more characters for a movie than you actually ocred during a session, right? so if a movie i.e. consists of 20 characters done just now and another 20 already taken from your database, then all 40 chars will get deleted and not only those 20 you did this session?

Tappen
11th November 2011, 05:34
Correct.

Thunderbolt8
11th November 2011, 12:19
eh then I suggest to remove this button completely and have the program point to a better solution instead please :p

Thunderbolt8
12th November 2011, 21:36
got another one for your SDH list (actually similar to the last one) :D (.srt) http://www.mediafire.com/?ugnugrbvxue6uds


- There's close to $10,000. Where?
<i>- [Siren Wailing]</i>

gets changed to:

- There's close to $10,000. Where?
<i>-</i>

when it's supposed to be:

There's close to $10,000. Where?



and another strange one:

- Well, what can we do, Mother?
- I thought if you went and talked to him —
you know, another man.

Here a wrong hyphen actually gets added:

- Well, what can we do, Mother?
- I thought if you went and talked to him —
-you know, another man.

Thunderbolt8
12th November 2011, 22:17
another strange thing (.srt): http://www.mediafire.com/?9se3uk3cej12d0j


BOSUN: All hands, check equipment.
MAN: Let's go.

gets changed to

All hands, check equipment.
- Let's go.

even though it should be

- All hands, check equipment.
- Let's go.


Did something break? In 1019 it's still fine. Maybe it's restricted to this single file?
This one is rather crucial for me. If it's the same in all such situations with other subtitles, I guess I'll revert to 1019 for the time being.

Tappen
12th November 2011, 22:30
Things were just changed around quite a bit. Some things fixed and others broken.

- There's close to $10,000. Where?
<i>- [Siren Wailing]</i>
What kind of subtitle author puts a hyphen in front of a sound effect? Seems stupid, but I guess I have to take that possibility into account.

BOSUN: All hands, check equipment.
MAN: Let's go.
This last one is by design: I don't start adding hyphens at the start of lines until I'm sure there's a reason to going top to bottom (so no hyphen on the first line of a group). This avoids some problems though maybe looks a little funny to some people. But it's one of the standard ways to indicate different speakers.
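
The "no hyphen on the first line of a group" behaviour can be sketched as follows (Python, illustrative only). The speaker-label pattern and the name `remove_speaker_labels` are assumptions, and only simple labels such as `BOSUN:` are handled, not ones like `MAN 1 [ON RADIO]:`.

```python
import re
from typing import List

# Assumed label shape: upper-case letters, digits and spaces, then a colon.
SPEAKER = re.compile(r"^[A-Z][A-Z0-9 ]*:\s*")


def remove_speaker_labels(lines: List[str], prefix: str = "- ") -> List[str]:
    """Strip speaker labels; prefix every labelled line except the first line."""
    result = []
    for index, line in enumerate(lines):
        text, removed = SPEAKER.subn("", line)
        # Going top to bottom, a hyphen is only added once a label shows up
        # below the first line of the group.
        if removed and index > 0:
            text = prefix + text
        result.append(text)
    return result


if __name__ == "__main__":
    print(remove_speaker_labels(["BOSUN: All hands, check equipment.", "MAN: Let's go."]))
    print(remove_speaker_labels(["Hi, John.", "JOHN: Not too good, huh?"]))
```

Changing the `index > 0` condition to prefix the first labelled line as well would give the two-hyphen layout instead.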

Thunderbolt8
13th November 2011, 02:03
If possible, IMHO it should rather add a hyphen. There are some subtitles which only or mostly work with a bottom hyphen, but as already said somewhere in this thread, this seems to work well only in the context of the lines spoken before, e.g. whether there is a change of speaker or not. IMHO it could turn out a bit confusing when trying to do a 1-hyphen subtitle track by oneself.

I'd vote in favour of staying with the original choice of the subtitle track creators, meaning that if speakers are named with a : at the beginning of each line, then there should be 2 hyphens. Otherwise, they could have made the SDH indication for only one line.

BOSUN: All hands, check equipment.
MAN: Let's go.

should imho stay

- All hands, check equipment.
- Let's go.

while if the original SDH choice were

All hands, check equipment.
MAN: Let's go.

then it would be converted with one hyphen to

All hands, check equipment.
- Let's go.


BTW, could you give an example of what would break when doing the 2 hyphens again? Maybe it's only comparatively minor stuff.

In general, I think it would be good to have a list which gives examples of which kinds of lines create which problems during SDH removal. It would make it easier to know what to look out for when checking SDH removal, and could also help with discussing which problem situations can be regarded as comparatively unimportant.

Thunderbolt8
14th November 2011, 14:00
another one, rather special as it seems: http://www.mediafire.com/?ic08cto87qi5gco

- Sal-adin.
<i>- (belches) </i>Gibberish.

-->

- Sal-adin.
<i>-</i>Gibberish.

This seems to be right according to the SDH removal rules, but it would be nice if the hyphen could be moved out of the italicized field and placed directly in front of the spoken part, because having only the hyphen italicized but not the rest of the line looks rather weird.

so --->

- Sal-adin.
- Gibberish.
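
As a post-processing step, that cleanup might be sketched like this in Python (an illustration only; `move_hyphen_out_of_italics` is a made-up name, and it only covers SRT-style `<i>` tags at the start of a line):

```python
import re

# An italic pair that contains nothing but a hyphen, at the start of a line.
ITALIC_HYPHEN = re.compile(r"^<i>\s*-\s*</i>\s*")


def move_hyphen_out_of_italics(line: str) -> str:
    """Turn '<i>-</i>text' into '- text'; drop the line if only the hyphen is left."""
    match = ITALIC_HYPHEN.match(line)
    if match is None:
        return line
    rest = line[match.end():]
    # A leftover "<i>-</i>" with no dialogue after it is just debris.
    return f"- {rest}" if rest else ""


if __name__ == "__main__":
    print(move_hyphen_out_of_italics("<i>-</i>Gibberish."))
    print(repr(move_hyphen_out_of_italics("<i>-</i>")))
```

It does not touch the hyphen on the remaining line in the earlier [Siren Wailing] case; that would still need its own rule.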

Slasher
18th November 2011, 14:09
Hi Tappen, I just wanted to say "thank you" for the wonderful app!

Also, I would like to see some minor changes/features in the upcoming versions:
* the ability to specify a path for the OCR data file (for example I want it to be the program path)
* the ability to "save as" the subtitles or to be able to change their name when saving
* the app should remember the path of the last opened file, it's better when having multiple subtitles to process
* resizeable window, it can make it easier for the user when doing the OCR
* mouseover tooltip help for the options not described in the vertical right help bar (or add this information to the help bar)

nautilus7
18th November 2011, 16:38
I agree with all the requests. Remembering the last file position would be a very nice addition and i was thinking to ask for this also.

Chetwood
19th November 2011, 07:46
Me too. That's why I already requested some of them. Reading previous posts of a thread can be helpful.