Old 16th January 2022, 15:19   #1521  |  Link
VoodooFX
Banana User
 
Join Date: Sep 2008
Posts: 985
SE 3.6.4 fails to download any spell-checking dictionaries; I tried English and a few random ones.
Old 19th January 2022, 21:35   #1522  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
Quote:
Originally Posted by VoodooFX
SE 3.6.4 fails to download any spell-checking dictionaries; I tried English and a few random ones.
I've just tested all spell check dictionary downloads... all work fine now, so it must have been something temporary - or some firewall issue.


@Janusz: Perhaps it's better with external images/files?

Last edited by Nikse555; 20th January 2022 at 07:21.
Old 20th January 2022, 01:08   #1523  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
Quote:
Originally Posted by Nikse555
@Janusz: Perhaps it's better with external images/files?
The problem with the rules:
<WordPart from="W" to="W " />
<WordPart from="w" to="w " />
is explained. The Polish dictionary contains the unused word "wżyć", and for this reason the word "wżyciu" was not split into the two words "w życiu" (in life). Sorry for the confusion.
As for the change from a lowercase to a capital letter at the beginning of a paragraph, or from (".) to (...) at the end, the topic is still relevant. Once I have prepared proper examples, I will come back to the matter.
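The interaction described above can be pictured with a small Python sketch. It is only an illustration, not Subtitle Edit's actual code: the function name `apply_word_part_rules`, the rule list and the toy dictionaries are hypothetical, and it assumes that partial-word rules are tried only on words the spell checker rejects.

```python
def apply_word_part_rules(word, rules, dictionary):
    """Try partial-word replacements only on words the dictionary rejects."""
    if word.lower() in dictionary:
        # Known word: the rules are never tried (the "wżyciu" case).
        return word
    for old, new in rules:
        if old not in word:
            continue
        candidate = word.replace(old, new, 1)
        # Accept the candidate only if every resulting piece is a known word.
        if all(part.lower() in dictionary for part in candidate.split()):
            return candidate
    return word


rules = [("W", "W "), ("w", "w ")]
clean_dictionary = {"w", "życiu"}
# Assumption: the spell checker accepts "wżyciu" because of the entry "wżyć".
polluted_dictionary = clean_dictionary | {"wżyciu"}

print(apply_word_part_rules("wżyciu", rules, clean_dictionary))     # w życiu
print(apply_word_part_rules("wżyciu", rules, polluted_dictionary))  # wżyciu
```

With the unwanted entry in the dictionary the word counts as correct, so the `w` rule never gets a chance to split it.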
__________________
Sorry for my mistakes - I'm using a translator.
Old 20th January 2022, 22:14   #1524  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
@Nikse555
Here are examples of how enabling "Fix common OCR ..." affects the text produced by OCR.
Sample files to download.
  1. The first lowercase letter of the text where OCR started or resumed is replaced with the corresponding uppercase letter.
    As you can see in the 4th line at the bottom, the letter "l" is an exception: it was replaced with "I", which turned "let" into the unrecognized word "Iet" and put it on the error list.
    I wrote about this some time ago, and so did @tormento, when we got "L" instead of "I".
    This does not happen when "l" is not the beginning of the text but only the beginning of a new line within the text. (See line #1 in the same example.)
  2. If the line on which OCR was started or resumed correctly ends with ". it is changed to ..., but not always, as you can see in the second example at the bottom.
    See your comment in "OcrFixEngine.cs": "// lines ending with ". Should often end at ... (of no other quotes exists near by)"
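To make point 2 concrete, here is a rough Python sketch of the heuristic that the quoted comment describes. It is not the code from "OcrFixEngine.cs"; the function name `fix_trailing_quote_dot` and the `previous_lines` parameter are hypothetical, and how much surrounding text the real engine inspects is an assumption.

```python
def fix_trailing_quote_dot(line, previous_lines=()):
    """Replace a trailing '".' with '...' unless another quote is nearby."""
    if not line.endswith('".'):
        return line
    body = line[:-2]
    # Assumption: "nearby" means the rest of this line plus any lines passed in.
    context = " ".join(previous_lines) + " " + body
    if '"' in context:
        # A quote was opened earlier, so '".' is probably a real closing quote.
        return line
    return body + "..."


print(fix_trailing_quote_dot('and then he left".'))
# and then he left...
print(fix_trailing_quote_dot('and then he left".', previous_lines=['"It was late,']))
# and then he left".
```

If the engine only sees the single line on which OCR was restarted, the opening quote from an earlier line is invisible to it, which would explain why a correct ". can still be turned into ... in that situation.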
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 21st January 2022 at 15:10.
Old 21st January 2022, 18:21   #1525  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@Janusz: I did not understand the "w" issue.
I cannot reproduce the two other issues here with the default dictionaries.
Old 21st January 2022, 22:53   #1526  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
Quote:
Originally Posted by Nikse555
@Janusz: I did not understand the "w" issue.
I wanted to split "wżyciu" (inlife) into the two words "w życiu" (in life), and it didn't work because, as it turned out, "wżyciu" is accepted by the dictionary, so the rule <WordPart from="w" to="w " /> is not applied in this case.
Quote:
The two other issues I do not have here with default dictionaries.
After the first scan of the whole text, press [Start OCR] on the 2nd, 3rd, and 4th lines separately, and then once more on the last (fourth) line, and you will see what I mean.
In the second example I mixed up the order of the characters: the line ends with ." where it should end with ". so there we will not get ... in place of ".
This is especially frustrating when you create a rule to fix an error on a specific line and it works for that line, but after scanning the whole text you find that it doesn't work.
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 21st January 2022 at 23:12.
Old 25th January 2022, 03:45   #1527  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
As a follow-up to my previous post, I attach a new image describing the shortcomings of the text correction applied after OCR when the option [Settings/Tools/Fix common errors - also use hard-coded rules] is enabled.

My program version: 3.6.4 NEXT, beta 388. Contents of the Dictionaries directory: only the standard English and Polish dictionaries; I deleted all the other files.
The contents of the zip file:
- ivon.source.srt - source file, used to create the sup and for comparison with the OCR result,
- ivon.source.sup - the actual subtitle file,
- ivon_60.12.8.131.250.nocr - character database - please set threshold = 131,
- ivon.d-on_f-on.srt - OCR result without any correction.
My OCR settings are as shown in the picture.

Files to download



Observations:
  1. Lines 1 and 4 - A paragraph should start with a capital letter, so replacing a lowercase letter with its uppercase equivalent makes sense. Unconditionally replacing "l" with "I", without first confirming that the new word exists in the dictionary, does not. As a bonus, on line 8 such a substitution did give the correct word at the beginning of the paragraph.
  2. Subsequent words and errors (not all of them) suggest that elsewhere a leading "l" is replaced with "I" only after the new word has been confirmed in the dictionary (all such cases are 'Iran'). Otherwise the word is left unchanged (london, LOndon, and lran).
  3. Remaining words and errors: in line 7, the unfortunate change of (".), the end of a paragraph, to (...), the continuation style, left us with lran instead of Iran.
  4. Lines 2, 4, 5, 6, and 7: where a new line is not preceded by (.), the word at the start of the new line was left unchanged.

In addition to nOCR, I also checked:
  1. Tesseract 3.02 - without success - in key places I got "|" instead of "l".
  2. Binary image compare - character matching gives a very good OCR result, but there are still words left for manual correction. It was nice that the All Fixes list showed line 6 with ". changed to ...

Conclusions:
  1. In English and Polish, replacing a single-letter word "l" with "I" makes sense at the beginning of a paragraph or sentence, but changing "i" to "I" should not produce "I" in the middle of a sentence, between words written in lowercase.
  2. Changing a word-initial "l" to "I" strictly on the basis of the dictionary - yes.
  3. Why is [Auto fix names where only casing differs] not working? The words london and LOndon should have been fixed automatically, but they were not. If we try to add "london" via [Add to names], "London" is suggested - nothing easier than clicking OK - but don't bother, it won't work (???). If we try "LOndon", enter "London" and click OK, it works. For the last one, "lran", enter "Iran" instead of the suggested "Lran" and click OK - it works.
  4. Nothing more can be fixed without _OCRFixReplaceList.xml.
    You can of course turn off the [Settings/Tools/Fix common errors - also use hard-coded rules] option, but then we lose a lot of nice things, so is it worth it?
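The behaviour argued for in conclusions 1 and 2 could look roughly like the Python sketch below. It is a proposal in code form, not what Subtitle Edit currently does; `fix_leading_l`, `capitalize_paragraph_start` and the three-word dictionary are hypothetical.

```python
def fix_leading_l(word, dictionary):
    """Turn a leading 'l' into 'I' only if the dictionary confirms the result."""
    if not word.startswith("l"):
        return word
    if word in dictionary or word.capitalize() in dictionary:
        # Already a known word (possibly with wrong casing): leave the 'l' alone.
        return word
    candidate = "I" + word[1:]
    return candidate if candidate in dictionary else word


def capitalize_paragraph_start(text, dictionary):
    """Capitalize the first word of a paragraph without blindly forcing l -> I."""
    first, separator, rest = text.partition(" ")
    fixed = fix_leading_l(first, dictionary)
    if fixed == first:
        # No dictionary-backed l -> I fix applied: plain capitalization.
        fixed = first[:1].upper() + first[1:]
    return fixed + separator + rest


dictionary = {"Iran", "let", "London"}
print(capitalize_paragraph_start("let me go", dictionary))       # Let me go
print(capitalize_paragraph_start("lran is far", dictionary))     # Iran is far
print(capitalize_paragraph_start("london calling", dictionary))  # London calling
```

The point of the dictionary check is that "let" stays "Let" instead of becoming "Iet", while "lran" is still repaired to "Iran".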
__________________
Sorry for my mistakes - I'm using a translator.
Old 25th January 2022, 03:54   #1528  |  Link
iKron
Registered User
 
Join Date: May 2021
Posts: 16
I am using the "Subtitle Edit" app to convert PGS to srt. In this example there is the word "ANNIE",

but the app reads it as "ANN IH". In the inspect compare matches option, how can I remove the space between "ANN" and "IH"? Any help, please?

https://i.imgur.com/cuRJKjl.png
Old 25th January 2022, 05:12   #1529  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
@iKron
1. Increase the number of pixels that counts as a space by 1 or 2 and check whether the gap disappears. If that is not enough, add more.
2. You have assigned the character H to the image of E. Select the H entry and, in the text field, change the character assigned to the E image to E.
__________________
Sorry for my mistakes - I'm using a translator.
Old 25th January 2022, 05:26   #1530  |  Link
iKron
Registered User
 
Join Date: May 2021
Posts: 16
@Janusz thank you. Setting the number of pixels for a space to 10 worked fine. Now I have another problem.

There is a space between two words, but they get merged. Is there any way to add the space? Please check the screenshot.

The last word was converted by OCR to "ofAbed". It is supposed to be "of Abed". It worked fine when I used a pixel space of 8.

https://i.imgur.com/xh6igwk.png
Old 25th January 2022, 08:06   #1531  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
@iKron
In this case, decreasing the space value will separate the words.
Additionally, enable the [Try to guess unknown words] option, because the corrections may already be on the list of suggestions.
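The trade-off between the two settings can be shown with a toy Python model of how a pixel threshold turns character boxes into words. The function `group_into_words`, the glyph coordinates and the `>=` comparison are all made up for illustration; the real OCR code may work differently.

```python
def group_into_words(glyphs, min_space_pixels):
    """Join characters, inserting a space where the horizontal gap is wide enough.

    glyphs: list of (char, left_x, right_x) tuples sorted from left to right.
    """
    text = ""
    previous_right = None
    for char, left, right in glyphs:
        if previous_right is not None and left - previous_right >= min_space_pixels:
            text += " "
        text += char
        previous_right = right
    return text


# Hypothetical boxes: both samples contain one 9-pixel gap.
annie = [("A", 0, 12), ("N", 14, 26), ("N", 28, 40), ("I", 49, 53), ("E", 55, 65)]
of_abed = [("o", 0, 8), ("f", 10, 15), ("A", 24, 36), ("b", 38, 46), ("e", 48, 56), ("d", 58, 66)]

for threshold in (8, 10):
    print(threshold, group_into_words(annie, threshold), "|", group_into_words(of_abed, threshold))
# 8 ANN IE | of Abed
# 10 ANNIE | ofAbed
```

When a wide gap inside one word is as large as a narrow gap between two words, no single threshold handles both, which is why the guess-unknown-words option is a useful second line of defence.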
__________________
Sorry for my mistakes - I'm using a translator.
Old 25th January 2022, 18:25   #1532  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@Janusz: thx for the files - I've tried to improve the ocr fix engine here: https://github.com/SubtitleEdit/subt...leEditBeta.zip
Better?
Old 25th January 2022, 18:53   #1533  |  Link
iKron
Registered User
 
Join Date: May 2021
Posts: 16
@Janusz thank you for the help. I am really new to Subtitle Edit, so I am looking for a few suggestions. Is it a wise idea to add unknown words to the "user dictionary", like "BFFs" here?
https://i.imgur.com/znA1MO1.png

What is the difference between "add to name/noise list" and "add to user dictionary"?

And is there any way to disable this? Whenever I finish with Subtitle Edit, a popup box appears.

https://i.imgur.com/N1FKczy.png

Last edited by iKron; 25th January 2022 at 20:45.
Old 25th January 2022, 23:41   #1534  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
Quote:
Originally Posted by Nikse555
@Janusz: thx for the files - I've tried to improve the ocr fix engine here: https://github.com/SubtitleEdit/subt...leEditBeta.zip
Better?
Thanks for the fix. Increasing the allowed distance between the opening and closing quotation marks will have a good effect. Quotes longer than 3-4 lines are much rarer.

While looking for a way to recover lost characters quickly and reliably, I ran into an error in [Tools/Fix common error]: checking the [Add missing quotes (")] option does not make the list of fixes show lines that contain a single (").
This can be checked in the current stable or beta version using our example.
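Until that is fixed, lines with an unmatched quote can be located outside the program. A minimal Python sketch, assuming the subtitle text is already available as a list of strings; `lines_with_single_quote` is a hypothetical helper, not part of Subtitle Edit.

```python
def lines_with_single_quote(lines):
    """Return (line_number, text) for every line with an odd number of '"'."""
    return [
        (number, text)
        for number, text in enumerate(lines, start=1)
        if text.count('"') % 2 == 1
    ]


sample = ['"Hello," she said.', 'He said "wait.', 'No quotes here.']
print(lines_with_single_quote(sample))  # [(2, 'He said "wait.')]
```

An odd count is only a hint: a quote that legitimately spans several subtitles will also be reported.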

@iKron
  1. Download the files from my post above and follow the description of what to do with them, and you will see how "add to name/noise list" works.
  2. "add to user dictionary" does the same but without case sensitivity, i.e. adding "london" will make the program accept "london", "LOndon" and "London" as correct words.
  3. Pop-up - this is a suggestion by the author of Subtitle Edit, Mr. Nikse555, with whom I am talking above, that you install the video player he recommends. Once you do that, you won't see this box again.
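The difference described in points 1 and 2 can be pictured as two lookups with different case handling. This Python sketch is based only on the description above, not on Subtitle Edit's source; `is_known`, `fix_name_casing` and the word sets are hypothetical.

```python
def is_known(word, names, user_words):
    """Names must match exactly; user dictionary words match in any casing."""
    if word in names:
        return True
    return word.lower() in {entry.lower() for entry in user_words}


def fix_name_casing(word, names):
    """Return the listed name when the word differs from it only in casing."""
    for name in names:
        if word != name and word.lower() == name.lower():
            return name
    return word


names = {"London"}
user_words = {"bffs"}

print(is_known("london", names, user_words))  # False
print(is_known("BFFs", names, user_words))    # True
print(fix_name_casing("LOndon", names))       # London
```

Under this model, adding a proper noun to the names list lets wrongly cased variants be corrected, while adding it to the user dictionary would silently accept them.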
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 26th January 2022 at 00:16.
Old 26th January 2022, 02:01   #1535  |  Link
iKron
Registered User
 
Join Date: May 2021
Posts: 16
Thank you so much, Janusz. Two more questions.

When I converted a subtitle via nOCR I got a popup box with the options "Foreground" and "NOT foreground". What is the difference between "Foreground" and "NOT foreground"?

Also, what is the difference between "OCR via nOCR" and "Binary image compare"?

Lastly, is the Tesseract method good? Which method is best for converting the sub?
Old 26th January 2022, 08:14   #1536  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
@iKron
  1. What version of the program are you using? I have not seen such a window and I do not know which window you mean. I can only guess that it is about hiding the main program window during OCR.
  2. The basic difference between nOCR and Binary image compare is the method of detecting a character from its image and the way it is stored in the character database. nOCR is scalable. Binary image compare can also use the nOCR character database.
  3. Tesseract is good, but slower, and it generates a lot of errors. The basic version of SE contains the appropriate files for error correction, so the choice of OCR engine is up to the user.
  4. By "sub" I understand DVD subtitles; for those, any method is good. I prefer nOCR because of its speed. I use Tesseract when the subtitle font is decorative or very exotic.
__________________
Sorry for my mistakes - I'm using a translator.
Old 29th January 2022, 11:00   #1537  |  Link
tormento
Acid fr0g
 
Join Date: May 2002
Location: Italy
Posts: 2,542
Quote:
Originally Posted by Nikse555
thx for the files
I have issues with a left/right hearing impaired sup file.

It splits the sentences between the left and right side according to which actor is talking.

I know that asking you to support {\an*} would lead to excessive programming work, as you already stated.

What would be useful is to fix subtitles with more than 3 lines by removing the CR only where there are commas or spaces, and not where there are full stops or capital letters.

Just try to OCR it and fix common errors and you will see what I mean: it mixes dialogue from different actors.

The least I can ask is that the rule not behave in a dumb way. Some manual work will await me after that anyway.

Perhaps you could introduce some "special" characters to mark the left and right side, giving us an easier job with this kind of sup file.
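The requested rule could be sketched in Python like this. It is only one reading of the request, not existing Subtitle Edit behaviour; `merge_continuation_lines` and `_continues` are hypothetical names, and the set of sentence-ending characters is an assumption.

```python
def _continues(previous, current):
    """True when `current` looks like a continuation of `previous`."""
    if previous.endswith((".", "!", "?")):
        return False  # the previous line is a finished sentence
    return not current[:1].isupper()  # a capital letter suggests a new speaker


def merge_continuation_lines(lines):
    """Remove a line break only where the next line continues the sentence."""
    merged = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if merged and _continues(merged[-1], line):
            merged[-1] += " " + line
        else:
            merged.append(line)
    return merged


subtitle = ["I told you,", "we should leave.", "Not yet.", "Wait for me,", "please."]
print(merge_continuation_lines(subtitle))
# ['I told you, we should leave.', 'Not yet.', 'Wait for me, please.']
```

Lines that end a sentence or are followed by a capital letter keep their break, so text from different speakers is not run together.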
__________________
@turment on Telegram
Old 10th February 2022, 03:07   #1538  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
@Nikse
SE does not detect a missing <WholeWords> section in _OCRFixReplaceList_User.xml
After installing the program, the file "_OCRFixReplaceList_User.xml" is created the first time you use [Add pair to OCR replace list] during Import/OCR ..., or [Add pair] to the [OCR fix list] via Settings/Word lists.
If we deliberately remove the <WholeWords> section from it for some reason and then forget about it, the program does not recreate the missing section, yet it still lets you add new word pairs, which are then not saved anywhere.
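A defensive way to add a pair, recreating the section when it is missing, might look like the Python sketch below. It uses the standard `xml.etree.ElementTree` module; the root element name `OCRFixReplaceList` and the child element name `Word` are assumptions about the file layout, and `add_whole_word_pair` is a hypothetical helper, not Subtitle Edit code.

```python
import xml.etree.ElementTree as ET


def add_whole_word_pair(xml_text, wrong, right):
    """Add a from/to pair, creating <WholeWords> first if it does not exist."""
    root = ET.fromstring(xml_text)
    section = root.find("WholeWords")
    if section is None:
        # The case from the report: the section was deleted by hand.
        section = ET.SubElement(root, "WholeWords")
    ET.SubElement(section, "Word", {"from": wrong, "to": right})
    return ET.tostring(root, encoding="unicode")


# Assumed layout of a user file whose <WholeWords> section was removed.
damaged = "<OCRFixReplaceList><PartialWords /></OCRFixReplaceList>"
repaired = add_whole_word_pair(damaged, "lran", "Iran")
print(repaired)
```

The check for `None` is the whole fix: without it the new pair has no parent element to be written into.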
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 10th February 2022 at 03:16.
Old 10th February 2022, 10:11   #1539  |  Link
Newrone
Registered User
 
Join Date: Sep 2009
Posts: 1
Hi,
Is it possible to move the video forwards or backwards frame-by-frame in SubEdit, as it is in Aegisub?
I couldn't find any reference to it and it is sometimes essential to avoid "flashing" subtitles on scene changes.
Old 12th February 2022, 10:28   #1540  |  Link
Sakura-chan
Registered User
 
Join Date: Sep 2010
Posts: 34
Hi,

So I was trying to export some subs to SUP with Subtitle Edit. But no matter what font or style I choose, lines come out horribly misaligned. See:

https://i.ibb.co/TgyRm7P/Untitled.png

(Image as link because it's wide and breaks the forum layout)

For no apparent reason, lines randomly appear higher or lower. The first window shows the desired height, the one at which most lines are shown. You can see the other three at varying heights. Double line, italics, caps: none of it seems to matter, and it makes no sense.

How do you make the bottom line in every picture appear at the same height? :-/



P.S.: I've tried some more. Depending on the font, more or fewer lines come out aligned. For example, Times New Roman is the most consistent, but a few lines are still too high or too low. Even if it worked, it's a horrible font for subs though.

Edit 2: Some shitty fonts, like Tempus Sans ITC, seem perfectly in line. I scrolled through a lot of sub-pictures and they look pixel perfect. Ofc, it's an even more horrible font for subs.

Seems it's a matter of having just the right font? Why can't it work with Arial? Weird.