Subtitle Edit [Archive] - Doom9's Forum

Nikse555

8th October 2011, 12:19

Subtitle Edit 4.0.7 is now out
https://github.com/SubtitleEdit/subtitleedit

SE is an open source (C#) subtitle editor with main focus on creating/editing/sync'ing/adjusting/fixing subtitles, but SE can also import and ocr vobsub and blu-ray image based subtitles (even from matroska/mp4 files), and DVB sub + teletext from .ts files.
SE supports 300+ subtitle formats - let me know if you need more ;)
Can create/edit blu-ray sup and bdn xml files.

Available for Windows and Linux

mastrboy

9th October 2011, 21:43

:) thanks...

Ghitulescu

10th October 2011, 06:59

It looks promising.
I've never used it before, so I'd like to ask how well SE32 handles DVD subtitles (SUP): retiming, syncing (to other/pre-existing SUP), bitmap editing, etc.?

Nikse555

10th October 2011, 20:09

It looks promising.
I've never used it before, so I'd like to ask how well SE32 handles DVD subtitles (SUP): retiming, syncing (to other/pre-existing SUP), bitmap editing, etc.?

Not too well :(
SE can read and ocr dvds/vobsub/blu-ray sup + a few more image based formats - but the only image based format SE can write is bdn xml/png.

(the blu-ray sup code is converted from 0xdeadbeef's java code for BDSup2Sub)

StainlessS

16th October 2011, 03:36

Crash during "Fix Common Errors".

Crash Report & Mi2.srt here:-

EDIT: Link removed.

Nikse555

16th October 2011, 20:33

Crash during "Fix Common Errors".

Crash Report & Mi2.srt here:-

http://www.mediafire.com/?4z8obiqskr8ikui

Hi StainlessS!

Thx for reporting this :)
Fixed here: http://www.nikse.dk/SubtitleEdit.zip

StainlessS

20th October 2011, 10:46

Thanks Nikse555, got something else to keep you busy.

SubTitle Edit 3.2.2, Build 25663

Crash during Spell Check (HunSpell, don't know if it's the same error as previously reported in the other thread)

Crash Report & DWL.srt here:-
http://www.mediafire.com/?6pltpi72a82lz52

MajorX

21st October 2011, 02:19

Thanks Nikse555 for new version of SE.
I have a problem with OCR, please help. With 3.2, OCR of VobSub & Blu-ray sup files worked perfectly, but after uninstalling it and installing the new version 3.2.2, OCR no longer works: it only shows orange lines, no text. :(

Here is a sample of the *.sup subtitles, could you please check them?

http://www.mediafire.com/?zqml8hbrcy6jqgt

Nikse555

22nd October 2011, 07:42

Crash during Spell Check (HunSpell, don't know if it's the same error as previously reported in the other thread)

Looks like it's still Hunspell suggest!
I could not re-create this error on my Win7 machine, but I've tried to fix it here (by running suggestions in a separate thread): http://www.nikse.dk/SubtitleEdit.zip
Any better?

Nikse555

22nd October 2011, 18:52

StainlessS

23rd October 2011, 19:30

Sorry for the delay, Nikse555,
Before I tried your update, I had to download the srt from MediaFire as I did not
keep a verbatim copy. I tried it with the original faulting build 25663, and it did
not fault. Tried this several times, no fault. Got the version srt that I kept,
(probably spell checked via other means) and checked that, same thing, no
fault. Have not ripped any other subs since then (I think) and made no changes
to the setup. I guess it will have to remain a mystery. :confused:

EDIT: Also tried with build 13726, no fault (earlier build No ???).

Chetwood

24th October 2011, 06:51

What do I do to OCR German subs? I've downloaded a German Tesseract package and unpacked it to the program dir but to no avail. I pretty much have to type every word?

Nikse555

24th October 2011, 07:36

Sorry for the delay, Nikse555,
...
Tried this several times, no fault.
...
I guess it will have to remain a mystery. :confused:

Yep, the nhunspell "suggest-method" is not entirely stable

What do I do to OCR German subs? I've downloaded a German Tesseract package and unpacked it to the program dir but to no avail. I pretty much have to type every word?
The German tesseract package should be unpacked to Tesseract\tessdata. Unpacked the file is called deu.traineddata.
Do choose Tesseract as OCR method (not image compare)
And if you're lazy just get this version with German dictionaries included: http://subtitleedit.googlecode.com/files/SE322DE.zip
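
A quick way to confirm that a language file landed where SE looks for it is to check from a script. This is only a minimal sketch in Python, not part of Subtitle Edit: it assumes the Tesseract\tessdata layout described above, and the function name and the example install path are made up for illustration.

```python
from pathlib import Path


def missing_traineddata(se_dir, languages):
    """Return the language codes whose <code>.traineddata file is absent.

    se_dir is the Subtitle Edit folder; the Tesseract/tessdata sub-folder
    is the location mentioned in the post above.
    """
    tessdata = Path(se_dir) / "Tesseract" / "tessdata"
    return [lang for lang in languages
            if not (tessdata / f"{lang}.traineddata").is_file()]


# Hypothetical usage (the install path is an assumption, adjust as needed):
#   missing_traineddata(r"C:\Program Files\Subtitle Edit", ["deu"])
# An empty list means deu.traineddata was found.
```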

Chetwood

25th October 2011, 07:55

Mm, I had unpacked it to Subtitle Edit\tesseract\tessdata but ok, your de package is fine, thanks. It also works pretty well, however some events described in parentheses for the hearing impaired are recognized with mixed case, like

(KEucH†) instead of (KEUCHT)
(I_AcH†) instead of (LACHT).

Also, the small t is recognized as a small l, which messes up a lot of items and can only be fixed manually. These new words don't even exist in the German language, so shouldn't spellchecking kick in with "prompt for unknown words" being checked? Then the distance between two words ending with r and starting with j is not recognized. Instead of "aber jetzt" it reads "aberjetzt". What can I do to improve this? Thanks.

Nikse555

25th October 2011, 18:04

These new words don't even exist in the German language, so shouldn't spellchecking kick in with "prompt for unknown words" being checked?
Yes... problems with loading German dictionary should be fixed here: http://www.nikse.dk/SubtitleEdit.zip
(Tesseract should also be a bit faster now)

Then the distance between two words ending with r and starting with j is not recognized. Instead of "aber jetzt" it reads "aberjetzt". What can I do to improve this? Thanks.
When you press "change all" or "use always" the correction is remembered in OcrFixReplacelist.xml...

xekon

25th October 2011, 22:00

This actually works really well! Almost all of the text is right on, and the GUI guides you through smoothly when it needs a fix.

I have a rather strange auto fix though (some kind of error or bug): http://i1208.photobucket.com/albums/cc361/xekon/weirds.png

If you want the .SUP that caused this error to occur give me an email address I can send the file to. (Upon further testing this weird error only happens if the "Try MS MODI OCR for unknown words" checkbox is checked, If I un-check it then this strange substitution does not happen.)

The only OCR error that I get that does not get automatically corrected is the letter "k" being detected as "l<" and not being auto corrected: http://i1208.photobucket.com/albums/cc361/xekon/k.png

I notice similar errors in OCR but they DO get auto corrected like "I\/ly" -> "My"

is there a way I can add "l<" to be autocorrected to "k" ?

also a setting in the options panel to disable "Try MS MODI OCR for unknown words" by default would be handy, then I wouldn't have to uncheck it every subtitle I load

I have also had "d" being detected as "ol" pretty often; then the spell checker doesn't recognize the word, so I edit it manually and change the "ol" to "d"

like the word worried, gets detected as worrieol

Nikse555

26th October 2011, 10:03

...a setting in the options panel to disable "Try MS MODI OCR for unknown words" by default

OK, this setting is now remembered - but I've also improved the check for when to use MODI, so do try to keep it on.

Tesseract (new 3.01 version) now runs in its own thread, so it should be a bit faster too.

Link to new version:
http://www.nikse.dk/SubtitleEdit.zip

How is it working?

Chetwood

26th October 2011, 14:22

problems with loading German dictionary should be fixed here: http://www.nikse.dk/SubtitleEdit.zip
Mh, this file contains no German dictionary so I copied the one from the de.zip over.

When you press "change all" or "use always" the correction is remembered in OcrFixReplacelist.xml...
The German Umlaut ü (u with two dots above) is often recognized as two i's: ii. Since it's common in several words, how do I replace it for all of them? Thx.

Nikse555

26th October 2011, 14:52

The German Umlaut ü (u with two dots above) is often recognized as two i's: ii. Since it's common in several words, how do I replace it for all of them? Thx.

You can edit [Subtitle edit folder]\Dictionaries\deu_OCRFixReplaceList.xml - add a new WordPart under PartialWords:
<PartialWords>
...
<WordPart from="ii" to="ü" />
</PartialWords>
SE will now look for correctly spelled words, where "ii" is replaced with "ü".

You can also take a look at "eng_OCRFixReplaceList.xml".
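
For anyone who would rather script the edit than do it by hand, here is a minimal sketch in Python. It is not part of Subtitle Edit. It relies only on what is shown above (a PartialWords element holding WordPart entries with from/to attributes); the function name is made up, and the rest of the file's layout is not assumed. Back up the file first, since ElementTree rewrites it and drops any XML comments.

```python
import xml.etree.ElementTree as ET


def add_word_part(xml_path, wrong, right):
    """Add <WordPart from=wrong to=right /> under <PartialWords>.

    Returns True if an entry was added, False if a rule for `wrong`
    already exists (the existing rule is left untouched).
    """
    tree = ET.parse(xml_path)
    root = tree.getroot()
    # PartialWords may be the root itself or sit anywhere below it.
    partial = root if root.tag == "PartialWords" else root.find(".//PartialWords")
    if partial is None:
        raise ValueError("no <PartialWords> element found")
    for part in partial.findall("WordPart"):
        if part.get("from") == wrong:
            return False
    ET.SubElement(partial, "WordPart", {"from": wrong, "to": right})
    tree.write(xml_path, encoding="utf-8", xml_declaration=True)
    return True


# Hypothetical usage, with the path placeholder from the post above:
#   add_word_part(r"[Subtitle edit folder]\Dictionaries\deu_OCRFixReplaceList.xml",
#                 "ii", "ü")
```

The same call would cover the "l<" to "k" case asked about earlier in the thread: ElementTree escapes the "<" as &lt; when writing the attribute, which is easy to forget when editing by hand.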

chainring

26th October 2011, 23:41

Just wanted to stop in here and say thank you for this awesome tool. I love loading up a .sup, letting OCR rip through and having minimal work to correct errors. I can get through an entire movie in 20 minutes.

MajorX

27th October 2011, 03:23

xekon

28th October 2011, 22:33

I have another feature request, or maybe you know of a configuration file I can edit so that a replacement is always performed.

I would like to replace ’ with ' because ’ shows up very weirdly (last word is supposed to be: didn't ):

http://i1208.photobucket.com/albums/cc361/xekon/didnt.png

Nikse555

29th October 2011, 05:44

@MajorX: It's hard to say why without the actual sub... The ocr window has a check box for whether to use time codes from the .idx file or from the .sub file.

@xekon: Works here in latest version: http://www.nikse.dk/SubtitleEdit.zip

xekon

29th October 2011, 08:01

WOW! You weren't kidding about it going faster! Just did a couple more episodes and it's zooming through the lines much faster!

edit: odd new bug:

I'm sorry!

was detected as:

I.m
s
0
r
rY
!

http://i1208.photobucket.com/albums/cc361/xekon/imsorry.png

Anakunda

29th October 2011, 15:33

Hello there!
I'm having trouble with OCR. Recognizing from SUP format, I tried both methods and both have significant inaccuracies:
In the pattern comparison mode, the engine totally ignores differences between the letters 'i' and 'l', and between 'c', 'o' and 'e'. All the letters are assigned the character that was assigned at the first occurrence of one of the letters from the "same" group. For example, the 1st subtitle contains the word "more"; the wizard stops at o and I assign it o. When it passes over e, it does not ask again for a letter, even though that's the 1st "e" in the subtitles, and assigns it 'o' automatically. That's very bad. I don't know if that's a result of some auto corrections made by SE, but it seems to get wrongly assigned even if I turn off all the auto corrections on the right side.
That's about the character comparison method. Tesseract seems to work better but has considerable flaws too:
Some characters are auto uppercased even if they are lowercase in the source matrix; this especially concerns 's', 'z', 'c' and 'a'. All occurrences of these letters seem to be uppercased, regardless of case in the original matrix, if they stand as a standalone letter or the 1st letter in a word. All s, z, c and a's are kept lowercase if in the middle of a word.
Please give me some suggestions to make at least one of the methods functional, so that most words are recognized properly and don't need correcting by the spell checker. The uppercase problem doesn't even seem repairable by spell checker processing!
Thank U !

xekon

29th October 2011, 17:19

I have another feature request, could we have a checkbox to omit all <i> </i> tags, they are being used for only half lines when the whole line is italic, they are also being used when there are no italic lines at all.

Right now after I rip a sub I am going through and doing find/replace to delete them all, but it would be great to have that as a feature in Subtitle Edit.
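
Until there is such a checkbox, the find/replace step can be scripted. A minimal sketch in Python (not part of Subtitle Edit; the function name is made up), assuming the tags appear literally as <i> and </i> in the saved .srt text:

```python
import re

# Matches an opening or closing italic tag, in either letter case.
ITALIC_TAG = re.compile(r"</?i>", re.IGNORECASE)


def strip_italics(srt_text):
    """Remove every <i> and </i> tag, leaving the rest of the text as is."""
    return ITALIC_TAG.sub("", srt_text)


# Example: strip_italics("<i>Hello</i> there") returns "Hello there"
```

Note that this removes all italics, including lines that really were italic in the source images.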

very often !! gets detected as ll

Is this something that can be fixed? or is there something I can do to help with the detection of exclamation points? or do I have to wait till tesseract is updated?

EDIT: on a side note, whatever you did for MS MODI OCR seems to have worked. and it definitely does help!

here is an example of the ll instead of " or !!
http://i1208.photobucket.com/albums/cc361/xekon/1.png

http://i1208.photobucket.com/albums/cc361/xekon/2.png

xekon

30th October 2011, 09:27

OMG OMG OMG! The programmer in me has just thought of a VERY COOL feature you could add!

call it a visual tool for super fast comparison. (OCR can only get so good, and if you want to verify perfect subs, this is a good way to do it.)

The goal should always be perfect OCR on the first sweep, but visually checking the subs afterwards is just to verify, and the quicker you can do that the better.

Let me know what you think of this idea, I am sure it would actually be something that would be pretty fun to program.

Please let me know what you think because i think it would be AWESOME!

I am drawing an illustration in Photoshop now.

EDIT: ok to illustrate my idea... OCR a .SUP file. then use the arrow key to go down line by line, reading the text, and then looking at the image to compare and see that they are the same.

Now, that is not exactly quick, the brain has to think more, it has to remember more, and your eyes have to move and focus on more than one area, below is my idea:

Basically, use an OpenGL or DirectX library that can overlay text, or any library that looks like it will work to overlay text with transparency. And size the text to roughly overlay the SUB image with about 50-60% transparency. The letters don't have to line up perfectly; anywhere close will allow you to tell quickly, with just a glance, whether the sub and text match visually. (Basically you read the sub line ONLY once, and your brain looks for discrepancies as you do it, versus reading two or three times, moving your eye between locations, and having to remember and hope you remember correctly.)

I think for somebody that visually checks the OCR for their subs, this would probably speed up the process for them 200%+

see how easy it is to see that they match:
http://i1208.photobucket.com/albums/cc361/xekon/idea.jpg

here is one that passed the OCR, but is incorrect:
http://i1208.photobucket.com/albums/cc361/xekon/wrong1.jpg

here is another one that passed the OCR, but is incorrect (depending on the library used you could even apply a border/stroke to the outside of the letters)
http://i1208.photobucket.com/albums/cc361/xekon/wrong2.jpg

here is another, there is probably one that passes through the ocr, green light and all, in every episode, you just have to look carefully (you might even be able to adjust the thickness of the characters, so that they usually fall within the bounds of the SUB image character outlines):
http://i1208.photobucket.com/albums/cc361/xekon/wrong3.jpg

Chetwood

31st October 2011, 07:56

Nikse555

31st October 2011, 19:20

Thanks Nikse555 :)
I have a timing problem with some subtitles.
When I use OCR: first I extract the subtitles (*.VOBSUB) from the video and then use them in OCR. It shows a start time & end time problem, e.g. if the original sub has
Start Time --> End Time 00:00:13,097 --> 00:00:19,185
OCR shows 00:00:13,097 --> 00:00:17,185
but if I use the subtitles directly from the video, it shows the correct start time & end time in OCR.

My guess would be that the application you ripped the vobsub with did not use time codes from the mkv container, but rather used the time codes in the sub file itself (the time codes in idx and sub file are exactly alike).

xekon

31st October 2011, 19:43

Nikse555 please let me know what you think of my idea; if it's not something you're interested in, then I will try adding it. I just noticed Subtitle Edit is open source.

Could I please have a copy of the source code that is as current as: http://www.nikse.dk/SubtitleEdit.zip

the one on code.google.com is October 14.

xekon

31st October 2011, 19:47

Looks impressive but I think it's overkill. Why not simply have a small window showing the item and an editable text window below that shows the OCRed text? In case they don't match, simply alter the text and move on to the next item.

Subtitle Edit has very accurate OCR results. There are usually only 1-3 wrong subs out of 300 lines. That is quite impressive. So generally you won't need to do much editing, only verifying. The method I posted is the quickest way that I can think of to scan through entire sub files after the OCR and visually verify.

Nikse555

31st October 2011, 20:49

Hi Anakunda!

...
In the pattern comparison mode, the engine totally ignores differences between the letters 'i' and 'l', and between 'c', 'o' and 'e'. All the letters are assigned the character that was assigned at the first occurrence of one of the letters from the "same" group. For example, the 1st subtitle contains the word "more"; the wizard stops at o and I assign it o. When it passes over e, it does not ask again for a letter, even though that's the 1st "e" in the subtitles, and assigns it 'o' automatically. That's very bad.

Yes, this is true. I've tried to improve it a bit here: http://www.nikse.dk/SubtitleEdit.zip
A work-around is to right-click on the offending line in the list view, and choose "Inspect compare matches for current image" - here you can choose "Add better match" to correct mistakes.
(my image compare code is a bit slow for blu-ray images...)

Tesseract seems to work better but has considerable flaws too:
Some characters are auto uppercased even if they are lowercase in the source matrix; this especially concerns 's', 'z', 'c' and 'a'. All occurrences of these letters seem to be uppercased, regardless of case in the original matrix, if they stand as a standalone letter or the 1st letter in a word. All s, z, c and a's are kept lowercase if in the middle of a word.

Is this still the case in latest version?
If yes, could you provide a test file + a few line numbers?

Nikse555

1st November 2011, 08:53

...
call it a visual tool for super fast comparison. (OCR can only get so good, and if you want to verify perfect subs, this is a good way to do it.)
...

Another way to proof read would be to right click in the list view - and choose "Save all images with html index...". This displays a web page with all images + ocr'ed text if available. In latest version, this also shows text with background color.

MajorX

2nd November 2011, 03:03

Hi Nikse555
can u plzz check this *.SUP file...i get only strange symbols with OCR.

http://www.mediafire.com/?aoy66c5ue9mbah9

xekon

2nd November 2011, 03:08

MajorX I tried your file with Nikse555's latest version here: http://www.nikse.dk/SubtitleEdit.zip

I also got lots of symbols if I had "Try MS MODI OCR for unknown words" unchecked.

but if you use the MS MODI OCR it detects all of them just fine :)

give it a shot.

PS: I wonder if that subtitle file has ever had its resolution resized.... the letters are really bad quality.

MajorX

2nd November 2011, 07:36

I try with this version but i can't enable MS MODI OCR ...can u tell how can i do this.

http://img266.imageshack.us/img266/1182/74087558.png

kypec

2nd November 2011, 10:50

I try with this version but i can't enable MS MODI OCR ...can u tell how can i do this.
You must have some Microsoft Office libraries installed for this to work IIRC...

Nikse555

2nd November 2011, 22:30

Hi Nikse555
can u plzz check this *.SUP file...i get only strange symbols with OCR.

Thx for the file :)
This font doesn't look blu-ray like but seems clear enough. Resizing did not help, but changing the font color to white seems to help, so this is included in the latest version, which should handle your sup better: http://www.nikse.dk/SubtitleEdit.zip

MajorX

3rd November 2011, 06:06

Thx for the file :)
This font doesn't look blu-ray like but seems clear enough. Resizing did not help, but changing the font color to white seems to help, so this is included in the latest version, which should handle your sup better: http://www.nikse.dk/SubtitleEdit.zip

Thanks Nikse555....working perfectly. :) :)

Nikse555

13th January 2012, 15:17

Subtitle Edit 3.2.3 is now finally out with lots of minor improvements and fixes!

Change log
New: Added Brazilian Portuguese - thx XXXXXXXXXX
New: Added Italian language file - thx Maff
New: Added Portuguese (Portugal) language file - thx Ricardo Perdigão
New: Added Japanese language file - thx Nardog
New: Added Spanish language file - thx m2s
New: Support for subtitle format AvidCaption - thx Laszlo
New: Support for F4 subtitle formats - thx Fred
New: Export to Blu-ray sup format
Improved: Updated Tesseract to 3.01. Now includes (some) italic detection + adds support for Arabic, Hebrew, Hindi and Thai
Improved: Undo improved so it also works for textbox + redo (Ctrl+Y)
Improved: Many new configurable shortcuts (e.g. for fullscreen video player)
Improved: OCR tweaked a bit + BluRay sup files are processed faster
Improved: TextBox with current subtitle now shows cursor position - thx Leszek
Improved: Subtitle format PAC much improved - thx Peter
Improved: Subtitle format FCP Xml improved - thx Ulrik
Improved: Subtitle format D-Cinema improved - thx Karam
Improved: Splitting of lines - Thx Trottel
Improved: Auto break lines - thx Majid
Improved: Some fixes for Fix common errors/Remove text for HI - thx Majid
Improved: Optimized Fix Common Errors
Improved: DirectShow can now also play audio-only files
Fixed: Crash when setting Options - thx karmazyn
Fixed: Crash in set color (or set font) - thx LEO33
Fixed: Crash/freeze when loading large subtitle files - thx Ulrik
Fixed: Bug when clicking in list view while running ocr - thx sialivi
Fixed: De-selecting text in textbox via single click - thx XhmikosR
Fixed: Possible crash in spell check + German dictionary should work
Fixed: Missing save/load of a fix common errors setting - thx menes
Fixed: Removed Microsoft translate as it's useless with new quotas
Fixed: Milliseconds in timed text - thx Calle
Fixed: Names with spaces now works in spell check - thx Dr. jackson
Fixed: Do not use frame rate if it's zero (audio files) - thx dixie.fever
Fixed: Possible crash when saving xml files - thx Peter

http://code.google.com/p/subtitleedit/downloads/list

kalehrl

13th January 2012, 16:01

Thank you Nikse.
This is the best and most complete subtitle editor ever.

tonyymmao

22nd January 2012, 01:06

this really is good, i just have one question though, is it possible to add a border option for the text, cuz i've seen some videos with image based subs have different colour borders like ass.

Nikse555

22nd January 2012, 09:46

this really is good, i just have one question though, is it possible to add a border option for the text, cuz i've seen some videos with image based subs have different colour borders like ass.
For which subtitle format?

SE allows for styles in .ass files, but otherwise SE does not offer styles other than italic, bold, and font color in SubRip/MicroDvd files.

Latest SE test version should be able to export VobSub (and bd sup with correct timestamps) via File -> Export.
Could anybody verify that it works?
http://www.nikse.dk/SubtitleEdit.zip (or get the C# code from svn and compile it yourself)

Chetwood

23rd January 2012, 14:52

I've set SE to ANSI as the default encoding type, OCR'ed a Vobsub and saved it to SRT. When I open it again in SE with "Autodetect ANSI encoding" checked, all German Umlauts are messed up. When unchecked the umlauts are ok.
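
Messed-up umlauts of this kind usually mean the bytes were written with one encoding and read back with another. A minimal sketch in Python of what that looks like, assuming "ANSI" here means Windows-1252 (cp1252) and the other candidate is UTF-8; which of the two directions actually happened to the file above is not known, and the function name is made up:

```python
def misread(text, saved_as, opened_as):
    """Encode text with one encoding, then decode the bytes with another.

    Undecodable bytes are shown as U+FFFD, the replacement character.
    """
    return text.encode(saved_as).decode(opened_as, errors="replace")


# UTF-8 bytes opened as ANSI: each umlaut turns into two characters.
#   misread("für", "utf-8", "cp1252")  ->  "fÃ¼r"
# ANSI bytes opened as UTF-8: the umlaut becomes a replacement character.
#   misread("für", "cp1252", "utf-8")  ->  "f\ufffdr"
```

Comparing the garbage on screen with these two patterns gives a hint about which side guessed the encoding wrong.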

When exporting to VobSub on default settings the sub is far less readable than the original VobSub. Gonna pm you some files.

OtonVM

4th April 2013, 09:59

This is an amazing piece of software, thank you for making it!

I have found a slight problem with FAB subtitles in Encore.
I export from srt to fab so I can keep italics and such. What I get is an image like this (tinypic converted from tiff to jpeg removing the alpha channel so black is actually transparent):
http://i45.tinypic.com/4u8ggj.jpg
with its script:
IMAGE002.tiff 00:03:01:01 00:03:03:18 123 417 596 465
and this is the result:
http://i48.tinypic.com/2upwrnp.png
notice the line above the text.

What I have to do is lower (I think it's lower) the image by 1px:
IMAGE002.tiff 00:03:01:01 00:03:03:18 123 416 596 465
so I can get this:
http://i48.tinypic.com/2dig4g2.png

One line is not a problem but it happens multiple times, seemingly at random.
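
If the same 1px change has to be made on many lines, it can be scripted. A rough sketch in Python, not a Subtitle Edit feature: it assumes each script line has exactly the seven space-separated fields shown above (file name without spaces, two time codes, four numbers) and changes the second of the four numbers, as in the manual fix. The function name is made up, and the script cannot tell which lines need the fix, so that choice stays manual.

```python
def shift_second_number(line, delta=-1):
    """Add delta to the second of the four trailing numbers in a script line."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    fields[4] = str(int(fields[4]) + delta)
    return " ".join(fields)


# Example with the line from the post:
#   shift_second_number("IMAGE002.tiff 00:03:01:01 00:03:03:18 123 417 596 465")
# returns "IMAGE002.tiff 00:03:01:01 00:03:03:18 123 416 596 465"
```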

Nikse555

8th May 2013, 08:04

Latest version of SE (installer version only) uses the roaming profile. On my computer it's C:\Users\Nikse\AppData\Roaming\Subtitle Edit\Tesseract\tessdata

If you edit your Settings.xml file (C:\Users\<username>\AppData\Roaming\Subtitle Edit\Settings.xml) and change <ShowBetaStuff>False</ShowBetaStuff> to <ShowBetaStuff>True</ShowBetaStuff> you will get a "..." button that can download/install Tesseract languages for you in the ocr window.
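
The same change can be made from a script instead of a text editor. A minimal sketch in Python, assuming only what is stated above (a ShowBetaStuff element whose text is False or True); where the element sits inside Settings.xml is not assumed, and the function name is made up. Close SE and back up the file before trying it.

```python
import xml.etree.ElementTree as ET


def enable_beta_stuff(settings_path):
    """Set every <ShowBetaStuff> element to True; return how many changed."""
    tree = ET.parse(settings_path)
    changed = 0
    # iter() walks the whole tree, so the element's nesting does not matter.
    for node in tree.getroot().iter("ShowBetaStuff"):
        if node.text != "True":
            node.text = "True"
            changed += 1
    if changed:
        tree.write(settings_path, encoding="utf-8", xml_declaration=True)
    return changed


# Hypothetical usage, with the path pattern from the post above:
#   enable_beta_stuff(r"C:\Users\<username>\AppData\Roaming\Subtitle Edit\Settings.xml")
```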

@OtonVM: Sorry about not replying sooner - do you still need this?