Subtitle Edit [Archive] - Page 31


Nikse555

3rd December 2021, 12:39

SE 3.6.4 is out: https://github.com/SubtitleEdit/subtitleedit/releases

Fixes issue with blu-ray sup palette (thx to Master Yoda) + fixes an issue with "Set start and offset the rest" where first selected line would not change + support for Tesseract OCR 5.00 final.
Change log: https://raw.githubusercontent.com/SubtitleEdit/subtitleedit/master/Changelog.txt

tormento

4th December 2021, 11:32

support for Tesseract OCR 5.00 final
There are many dll that come with windows builds, are all of them needed?

Is there a way to speed it up?

Nikse555

4th December 2021, 12:11

There are many dll that come with windows builds, are all of them needed?

Is there a way to speed it up?

I really don't know if all of the Tesseract dlls are needed.

If some C++ experts read this, then perhaps they will know if e.g. static linking with single exe file will make Tesseract faster to load?
If SE used the dll instead of calling "tesseract.exe" for each image, that would be faster - but using the dll had some problems last time I tested it (some years back).
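
To make the overhead concrete, here is a minimal Python sketch of the one-process-per-image approach. It is not SE's actual code (SE is written in C#), the helper names are made up for the example, and it assumes a `tesseract` executable is on the PATH.

```python
import subprocess
from typing import List


def build_tesseract_command(image_path: str, lang: str = "eng", psm: int = 6) -> List[str]:
    """Build the command line for OCR'ing one image (hypothetical helper)."""
    # "stdout" as the output base makes tesseract print the text instead of writing a file.
    # "--psm" is the spelling used by Tesseract 4 and 5; the 3.x series uses "-psm".
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_images(image_paths: List[str], lang: str = "eng") -> List[str]:
    """Start one tesseract process per image; the start-up cost is paid every time."""
    results = []
    for path in image_paths:
        completed = subprocess.run(
            build_tesseract_command(path, lang),
            capture_output=True,
            text=True,
            check=True,
        )
        results.append(completed.stdout.strip())
    return results
```

With a few thousand subtitle images, every call has to start the process and load the language data again, which is presumably where most of the time goes. Loading the library once would avoid that.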

tormento

9th December 2021, 10:56

using the dll had some problems last time I tested it
Please try again. :)

While decoupling of OCR hard rules hasn't happened yet, would you please add:

English

'II can't exist, it should always be 'll

Italian

Io, Ia, Ii can't exist as single word unless a . is before it or it's on a new line — It should be lo, la, li
Ià, Iì can't exist at all as single word — It should be là, lì
I' can't exist at all as single word — It should be l'
II can't exist as single word unless a . is before it or it's on a new line — It should be Il

I will update the list upon necessity :)

Janusz

12th December 2021, 23:35

Here is one way to successfully fix "l" to "I" and "I" to "l" in the OCR process.

Files to download: (https://www.mediafire.com/file/tzlkpaz529vzli9/italian_lo.zip/file)
it.test.txt - random text from the Italian website, so probably flawless (https://www.ilsole24ore.com/),
it.test.0.srt - Italian text converted to srt file (it will be used to compare the OCR result),
it.test.sup - the .sup file contains both the letters "l" and "I",
it.test.i.sup - the .sup file does not contain "l" ("l" has been replaced with "I"),
it.test.nocr - character database contains both "l" and "I",
ita_OCRFixReplaceList.xml - this file does all the work.
The files should be placed in the correct directories.
A few words about the settings in the program:
Option / Settings / Tools:
Fix common OCR errors ... - on,
Auto fix names where ... - on / off, Also fix names via ... - on / off (for this test, no difference).
Import / OCR:
OCR method - OCR via nOCR, No of pixels is space - 10 (11 starts to connect words)
Max wrong pixels - 8, Constants italic - off, Line split min ... - Auto, Language - it.test
Dictionary - Italian, Prompt for unknown words - off, Try to guess unknown words - on
Binary image compare threshold - 200.

I will leave the result of the comparison without comment. Everyone can judge it for themselves.
Whether the method I used will give an equally good result in another language - I do not know.
In Polish, English, Italian (as you can see) - yes. I'm not saying that this way of solving the "l" and "i" problems is perfect.
My character databases do not contain "I" so I can say that the problem with "l" does not apply to me, regardless of the language used.
What about "I"? I think 99% or more is done by this one RegEx.

@tormento:
II can't exist as single word unless a. is before it or it's on a new line - It should be Il
What about: I II III IIII IV V etc? (IIII is also correct).

@Nikse:
From the number and type of differences (4) it can be seen that Fix common OCR errors ... requires some fine-tuning ('' double accent - 3 errors).
Using this method, I omit the lack of correction for words shorter than 5 characters in Fix common OCR errors ...
Thanks for this great software.

tormento

13th December 2021, 09:27

What about: I II III IIII IV V etc? (IIII is also correct).
I prefer to have rare wrong roman numbers than frequent wrong I* spellings. And, no, IIII is not correct. Someone uses it but it's not.

Janusz

13th December 2021, 11:32

I prefer to have rare wrong roman numbers than frequent wrong I* spellings.
We are of one mind on this matter.

And, no, IIII is not correct. Someone uses it but it's not.
As for IIII - I have a different opinion (https://it.wikipedia.org/wiki/Sistema_di_numerazione_romano). Just because it's not common doesn't mean it's wrong.

tormento

13th December 2021, 11:43

As for IIII - I have a different opinion (https://it.wikipedia.org/wiki/Sistema_di_numerazione_romano). Just because it's not common doesn't mean it's wrong.
I did five years of Latin at an Italian lyceum. Believe me, it's a vernacular notation rather than a correct Latin numeral, much like the typical and unfortunate American habit of conjugating irregular past-tense verbs with -ed.

Nikse555

19th December 2021, 13:10

English

'II can't exist, it should always be 'll

The English issue should be fixed in this commit:
https://github.com/SubtitleEdit/subtitleedit/commit/dd27e5fe3dd2610ffa10907f4d83dbc541e5b27f

The Italian corrections I've tried to fix in this commit: https://github.com/SubtitleEdit/subtitleedit/commit/7ace6453550fb2915898eb324d56c93f0ddfe746

SE beta updated: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.6.4/SubtitleEditBeta.zip

The regular expressions for Italian can be tested here:
http://regexstorm.net/tester?p=%28%5b%5cp%7bLl%7d%2c%5d+%29%28I%29%28%5boai%5d%5b%2c+%5c.%5d%29&i=Io%2c+Io%2c+and+Ia%2c+and+Ii%2c+and+Ia.+can%27t+exist+as+single+word%0d%0a&r=%241l%243

http://regexstorm.net/tester?p=%5cb%28I%29%28%5b%c3%a0%c3%ac%5d%7c%27%5b+%5cr%5cn%5d%29%5cb&i=I%c3%a0%2c+I%c3%a0%2c+I%c3%ac+can%27t+exist+at+all+as+single+word%0d%0aI%27+can%27t+exist+at+all+as+single+word&r=l%242

http://regexstorm.net/tester?p=%28%5b%5cp%7bLl%7d%2c%5d+%29%28II%29%5cb&i=II+can%27t+exist+as+single+word+II+unless+a+.+is+before+it+or+it%27s+on+a+new+line&r=%241Il
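
For anyone who wants to try the three patterns outside the tester, here is a rough Python adaptation. It is only a sketch: Python's built-in `re` module has no `\p{Ll}`, so the `LOWER` character class is an approximation I substituted, the apostrophe case (I' to l') is left out, and the function name is made up.

```python
import re

# Python's built-in "re" has no \p{Ll}; this character class is a rough stand-in (assumption).
LOWER = "a-zà-öø-ÿ"

ITALIAN_RULES = [
    # "Io", "Ia", "Ii" after a lowercase letter or comma -> "lo", "la", "li"
    (re.compile(rf"([{LOWER},] )I([oai][, .])"), r"\1l\2"),
    # "Ià", "Iì" as a whole word -> "là", "lì"
    (re.compile(r"\bI([àì])\b"), r"l\1"),
    # "II" after a lowercase letter or comma -> "Il"
    (re.compile(rf"([{LOWER},] )II\b"), r"\1Il"),
]


def fix_italian_ocr(text: str) -> str:
    """Apply each rule in order and return the corrected text."""
    for pattern, replacement in ITALIAN_RULES:
        text = pattern.sub(replacement, text)
    return text
```

Note that the last rule also turns "vedi capitolo II." into "vedi capitolo Il.", which is exactly the Roman numeral trade-off discussed earlier in the thread.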

tormento

20th December 2021, 07:39

The regular expressions for Italian can be tested here:
The "Io ", "Ia " and "Ii " are missing.

locotus

24th December 2021, 17:59

That'|| keep you around and I don't think
you'll be doing any card tricks either.

That error survives OCR with Tesseract 3.02, spelling correction and fix common errors.
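
A minimal Python sketch of a rule that would catch both this case and the 'II case reported earlier. This is my own illustration, not the rule SE ships, and the function name is hypothetical.

```python
import re

# An apostrophe preceded by a letter and followed by "II" or "||",
# as long as no further letter follows.
BROKEN_LL = re.compile(r"(?<=[A-Za-z])'(?:II|\|\|)(?![A-Za-z])")


def fix_apostrophe_ll(text: str) -> str:
    """Replace the OCR misreadings 'II and '|| with 'll."""
    return BROKEN_LL.sub("'ll", text)
```

A correct "you'll" is left alone because the pattern only matches capital I or the pipe character after the apostrophe.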

Merry Christmas to all.

GCRaistlin

30th December 2021, 20:39

Please add a keyboard shortcut for [ ] Auto submit on first char (make a letter of it underscored) in OCR - Manual image to text window.

UPD: sorry, missed that it is already there.

UPD2: the underscored letter isn't working when non-English keyboard layout is active. It is expected but still not handy. Can you please change the elements' order in this window so as Shift-Tab in Character(s) as text field would move the focus to [ ] Auto submit on first char?

GCRaistlin

31st December 2021, 01:14

Bug:

During OCRing, we're submitting the incorrect character for a glyph (а instead of АЯ):
https://i.ibb.co/ZN00CBx/01.jpg (https://ibb.co/ZN00CBx)
Pressing Edit last: a button to remove the incorrect database record:
https://i.ibb.co/M5QrvZR/02.jpg (https://ibb.co/M5QrvZR)
Deleting the incorrect record by pressing Delete:
https://i.ibb.co/0rxRr2j/03.jpg (https://ibb.co/0rxRr2j)
Note the selected area:
https://i.ibb.co/zmS1Prb/04.jpg (https://ibb.co/zmS1Prb)

Nikse555

31st December 2021, 21:41

Beta updated: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.6.4/SubtitleEditBeta.zip

@tormento: The "Io ", "Ia " and "Ii " should work.

@locotus: "That'||" and words like that should also work better.

@GCRaistlin: Ctrl+I and Ctrl+A should now toggle check boxes. Also improved tab stop a little. Could not re-create the other issues.

Also, SE now uses a new "fix-words-without-spaces" word list - should improve OCR'ing e.g. italic text - see more here: https://github.com/SubtitleEdit/subtitleedit/discussions/5616

Happy New Year :)

Janusz

1st January 2022, 23:01

@GCRaistlin
Bug:
During OCRing, we're submitting the incorrect character for a glyph (а instead of АЯ): ...

This is not a bug in the program.
OCR ran correctly up to the point where it stopped, i.e. at the unrecognized character "7",
so "a" is displayed in place of "АЯ" because that is how it was processed and saved as a result of your mistake.
Deleting the entry from the character database at this point will not affect text that has already been processed.
Only when you press [Start OCR] again for this image will you be asked, at the place where "a" was deleted, to reassign the character(s) for "АЯ",
and the corrected text will be displayed only after the entire line (the whole image) has been processed.
That's how it works.

@Nikse555
@GCRaistlin: Ctrl + I and Ctrl + A should now toggle check boxes. Also improved tab stop a little. Could not re-create the other issues.

CTRL + A - This is a bad idea. This is the default Windows shortcut for selecting everything in a document or window. In this case, selecting the text in the [Character (s) as text] box also sets [Auto submit on first char].

*** Happy New Year ***

Nikse555

2nd January 2022, 14:10

@Janusz: "Ctrl + A" for "auto-submit first char" is now "Ctrl+F". Thx, nice catch :)
I've added a Polish word split list from 44 subtitle files - do give it a test (via "Fix common errors" - "Fix common OCR error" with lines-without-spaces) - "fix-words-without-spaces" word list - should improve OCR'ing e.g. italic text - see more here: https://github.com/SubtitleEdit/subtitleedit/discussions/5616

Janusz

2nd January 2022, 17:11

@Nikse555
I have known about this discussion and have been following it for some time, so my dictionary already contains over 26,500 words. For a sup file generated from the words of this dictionary in random order, it works very well, with one "but".
I had to remove all single letters that are words in Polish (a, i, o, u, w, z) or abbreviations of words (l, m, C), because in normal text containing unknown words (e.g. untranslated surnames or proper names) the single letters caused an unknown word to be split into smaller one- or two-letter words that match the Polish dictionary.
This new functionality will certainly be useful where italics or fonts of different sizes are used that differ significantly in the spacing between words.
I will, of course, check your dictionary.
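
To show the single-letter problem described above, here is a small Python sketch of dictionary-based splitting. It is a generic fewest-pieces segmentation, not SE's actual algorithm, and the word sets are toy data.

```python
from typing import List, Optional, Set


def split_concatenated(token: str, words: Set[str]) -> Optional[List[str]]:
    """Return the split of token into the fewest dictionary words, or None if there is none."""
    best = {0: []}  # best[i] = shortest list of words covering token[:i]
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if start in best and piece.lower() in words:
                candidate = best[start] + [piece]
                if end not in best or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best.get(len(token))


# Toy word list: "i", "w", "o" and "na" are all real one- and two-letter Polish words.
with_single_letters = {"i", "w", "o", "na", "życiu"}
print(split_concatenated("wżyciu", with_single_letters))  # the wanted split
print(split_concatenated("Iwona", with_single_letters))   # a name chopped into pieces
```

With the single letters in the list, the name "Iwona" is cut into "I w o na". Without them it is left alone, but then "wżyciu" can no longer be split either, so only the really necessary letters should stay.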

Janusz

3rd January 2022, 18:05

I do not know what to call it. In any case, the situation is as follows.

Image to download: (https://forum.doom9.org/attachment.php?attachmentid=17993&stc=1&d=1641227311)

I was doing OCR and encountered something like this:

https://forum.doom9.org/attachment.php?attachmentid=17994&stc=1&d=1641227077

OCR misread "Ź" - you need to fix it, so:

https://forum.doom9.org/attachment.php?attachmentid=17995&stc=1&d=1641227077

Surprise. Line read almost flawlessly and here the same "?".
What is it about? I press [Add].

https://forum.doom9.org/attachment.php?attachmentid=17996&stc=1&d=1641227077

Really, I don't know what to type. @Nikse555, save a father of a large family.
I would like to add that the Preview window shows the correct division of the image into the upper and lower text.

I also tried binary comparison.
First OCR start - ok - flawless, everything in its place,
second time [Start OCR] - I just had the program asking to add "Z".
With Tesseract 3.02 - standard - almost fine, but you can press [Start OCR] at will - the result is always the same.
And here is probably a bug in the program. With Tesseract 3.02 I went back to binary comparison and after pressing [Start OCR] the program crashed. This has been the case several times.
After a reboot, the program no longer hangs, but after another [Start OCR] letters "c" and "Ź" disappear. Although everything is in place in the [Inspect ...] window.

Nikse555

8th January 2022, 22:26

@Janusz: Perhaps you could use a file host?
So I should remove the single letter from the Polish word-split-list? If you generate one, I'll be happy to include it too.

Janusz

10th January 2022, 14:37

@Nikse
I have just sent pol_WordSplitList.txt to your email.
The dictionary (385 kB, 39256 words) was generated with len<5 = 20 and len>=5 = 10.
With the "default 11 and 5", there are not many more words and the file grows twice as large.
I have left only the really necessary individual letters in it. I have also removed a few 3-letter words that are missing from the dictionary I use (pl_PL.dic from 1 December 2021).
Because the rules I have used so far for splitting concatenated words work fine with the optimal [No of pixels is space], the effect of pol_WordSplitList.txt
is visible only once that value is exceeded by 2 or more pixels. It works very well and can deal with clumps of 3 or even 4 words.
The number of errors depends on how far [No of pixels is space] is from the optimal value.
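
One possible reading of "len<5 = 20 and len>=5 = 10" is a minimum number of occurrences per word length: short words must be seen at least 20 times, longer ones at least 10. Under that assumption (it is not confirmed in this thread), generating such a list could look like this Python sketch. The function and parameter names are made up.

```python
import re
from collections import Counter
from typing import Iterable, List


def build_word_split_list(lines: Iterable[str], min_count_short: int = 20,
                          min_count_long: int = 10, short_below: int = 5) -> List[str]:
    """Keep words that occur often enough; short words need more occurrences."""
    counts = Counter()
    for line in lines:
        # Runs of letters only (no digits or underscores), lowercased.
        counts.update(w.lower() for w in re.findall(r"[^\W\d_]+", line))
    kept = []
    for word, count in counts.items():
        needed = min_count_short if len(word) < short_below else min_count_long
        if count >= needed:
            kept.append(word)
    return sorted(kept)
```

Lower thresholds let more rare words in, which makes the list larger and raises the chance of unwanted splits.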

A few remarks on the operation of the program:

The rule in pol_OCRFixReplaceList.xml does not work for:
<WordPart from = "W" to = "W " />
<WordPart from = "w" to = "w " />
if "w" or "W" is followed by "ż". For other characters, it's OK. See the picture below: lines 1, 3 and 4 (ok) and 2 (wrong).

https://forum.doom9.org/attachment.php?attachmentid=18001&stc=1&d=1641820056

Regarding the use of rigid rules - it concerns the Polish language, and perhaps other ones as well. Already during OCR, I miss:

removing whitespace, if present, before the characters ! ? : ; , and adding whitespace after them if it is missing (lines 5 and 6).
no formatting for "." - that's probably good - at least line 9 looks correct.
since it is impossible to determine whether a single (") on a line opens or closes the quotation marks, or whether its space belongs
on the left or the right side, replacing (".) with (...) is in my opinion a mistake (lines 13 and 17).
is the unconditional change of (i) to (I) already at the OCR stage in line 8 justified, if it does not happen after (.) and (...) in line 6?
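
The first point of the list above (whitespace around ! ? : ; ,) could be sketched in Python like this. It is only an illustration of the requested rule, not something SE does, and the function name is hypothetical.

```python
import re


def normalize_punctuation_spacing(text: str) -> str:
    """Remove spaces before ! ? : ; , and add one after when a letter follows."""
    # Drop spaces or tabs that precede one of the marks.
    text = re.sub(r"[ \t]+([!?:;,])", r"\1", text)
    # Add a space after the mark only when a letter follows directly,
    # so times like 12:30 are left alone.
    text = re.sub(r"([!?:;,])(?=[^\W\d_])", r"\1 ", text)
    return text
```

Requiring a letter after the mark is a deliberate choice here: it keeps numbers such as "12:30" or "1,5" intact, at the price of not fixing a missing space before a digit.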

Thanks for your work.

VoodooFX

16th January 2022, 15:19

SE 3.6.4 fails to download any spell-checking dictionaries, tried English and few random ones.

Nikse555

19th January 2022, 21:35

SE 3.6.4 fails to download any spell-checking dictionaries, tried English and few random ones.

I've just tested all spell check dictionary downloads... all work fine now, so it must have been something temporary - or a firewall issue.

@Janusz: Perhaps it's better with external images/files?

Janusz

20th January 2022, 01:08

@Janusz: Perhaps it's better with external images/files?
The problem with the rules:
<WordPart from = "W" to = "W " />
<WordPart from = "w" to = "w " />
explained. The Polish dictionary contains the unused word "wżyć" and for this reason the word "wżyciu" has not been split into two words "w życiu" (in life). Sorry for the confusion.
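
The behaviour described here, a WordPart rule being skipped because the spell checker already accepts the word, can be modelled with a short Python sketch. This is a simplification based only on the description above: the real rule may match anywhere in a word, and `is_known_word` stands in for the dictionary lookup.

```python
from typing import Callable


def apply_word_part_rule(word: str, rule_from: str, rule_to: str,
                         is_known_word: Callable[[str], bool]) -> str:
    """Apply a WordPart-style replacement only to words the spell checker rejects."""
    if is_known_word(word):
        return word  # accepted by the dictionary, so the rule is skipped
    if word.startswith(rule_from):
        return rule_to + word[len(rule_from):]
    return word


# The dictionary accepts "wżyciu" as a form of "wżyć", so nothing happens.
print(apply_word_part_rule("wżyciu", "w", "w ", {"wżyciu"}.__contains__))
# Without that entry the rule fires and the words are separated.
print(apply_word_part_rule("wżyciu", "w", "w ", set().__contains__))
```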
As for the change from lower case to capital letter at the beginning of the paragraph, or (".) to (...) at the end, the topic is relevant. If I prepare the examples properly, I will come back to the matter.

Janusz

20th January 2022, 22:14

@Nikse555
Here are examples of how enabling "Fix common OCR ..." affects our text received during OCR.
Sample files to download. (https://www.mediafire.com/file/f1tew34xxzrojom/2022-01-20_examples.sup.zip/file)

The first lowercase letter in the text where OCR started or resumed is replaced with the corresponding uppercase letter.
As you can see in the 4th line at the bottom, this does not apply to the letter "l", which has been replaced with an "I" which made the word "Iet" (let) unrecognized and placed in the bug list.
I wrote about this some time ago, and so did @tormento, when instead of "I" we got "L".
This is not the case when "l" is not the beginning of the text, but the beginning of a new line in the text. (See line # 1 in the same example.)
If the line on which OCR was started or resumed correctly ends with ". - this is changed to ..., but not always, as you can see in the second example at the bottom.
Your comment: "// lines ending with ". Should often end at ... (of no other quotes exists near by)" in "OcrFixEngine.cs"

Nikse555

21st January 2022, 18:21

@Janusz: I did not understand the "w" issue.
The two other issues I do not have here with default dictionaries.

Janusz

21st January 2022, 22:53

@Janusz: I did not understand the "w" issue.
I wanted to split the phrase "wżyciu" (inlife) into two words "w życiu" (in life) and it didn't work because, as it turned out, the phrase "wżyciu" (inlife) is in the dictionary so the rule <WordPart from = "w" to = "w " /> will not work in this case.
The two other issues I do not have here with default dictionaries.
After the first scan of all the text, press [Start OCR] on the 2nd, 3rd, and 4th lines separately and additionally on the last (fourth) line again and you will see what I mean.
In the second example, I made a mistake with the order of the characters: it is ." and it should be ". so there we will not get ... in place of ".
This is especially frustrating when you create a rule to fix a bug on a specific line and it works for that line, and after scanning all the text you find it doesn't work.

Janusz

25th January 2022, 03:45

In addition to the previous post, I attach a new image with a description of the imperfections of text correction after OCR after enabling the option [Settings/Tools/Fix common errors - also use hard-coded rules].

My program version: 3.6.4 NEXT, beta 388. The contents of the Dictionaries directory: apart from the standard English and Polish dictionaries, I have deleted the remaining files.
The contents of the zip file:
- ivon.source.srt - source file - used to create sup - used for comparison with the OCR result,
- ivon.source.sup - proper file with subtitles,
- ivon_60.12.8.131.250.nocr - character base - please set threshold = 131,
- ivon.d-on_f-on.srt - OCR result without any correction.
My OCR settings as in the picture.

Files to download (https://www.mediafire.com/file/8d8918uvbeatxky/2020.01.25_ivonOCR.zip/file)

https://forum.doom9.org/attachment.php?attachmentid=18015&stc=1&d=1643078254

Observations:
Lines 1 and 4 - While the beginning of a paragraph should start with a capital letter, and replacing lowercase letters with their uppercase equivalents makes sense, unconditionally replacing "l" with "I" without first confirming that the new word exists in the dictionary does not. As a bonus, on line 8 such a substitution did give the correct word at the beginning of the paragraph.
Subsequent words and errors (not all of them) suggest that the first letter "l" of a word is replaced with "I" only after the new word has been confirmed in the dictionary (all of these are "Iran"). Otherwise, the word is left unchanged (london, LOndon, and lran).
Remaining words and errors: in line 7, the unfortunate change of (".) (end of paragraph) to (...) (continuation style) meant that instead of Iran we have lran.
Lines 2, 4, 5, 6, and 7: for a new line not preceded by (.), the word on the new line was kept unchanged.

In addition to nOCR, I also checked:
Tesseract 3.02 - without success - in key places instead of "l" I got "|".
Binary image compare - while character matching can give a very good OCR result, there are still words left for manual correction. It was nice that the All Fixes list included an entry for line 6 saying that ". was changed to ....

Conclusions:
While in English and Polish replacing a single letter "l" with "I" makes sense at the beginning of a paragraph or sentence, replacing "i" with "I" should not put an "I" in the middle of a sentence between words written in lowercase.
Changing the word starting with "l" to "I" - literally - based on the dictionary - yes.
Why is [Auto fix names where only casing differs] not working? The words london and LOndon should have been fixed automatically, but that did not happen. Let's try adding "london" to the [Add to names] list: enter "London" and click OK - nothing could be easier - but it won't work. Now let's try "LOndon": enter "London" and OK - it works. Finally "lran" (instead of "Lran"): enter "Iran" and OK - it works.
Cannot fix more without _OCRFixReplaceList.xml.
I mean, you can turn off the [Settings/Tools/Fix common errors - also use hard-coded rules] option, but then we will lose a lot of nice things, so let's ask ourselves is it worth it?
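
The dictionary-confirmed replacement argued for in the conclusions can be sketched in a few lines of Python. This is only an illustration of the idea, not SE's OcrFixEngine, and the word set is toy data.

```python
from typing import Set


def fix_leading_l_or_i(word: str, dictionary: Set[str]) -> str:
    """Swap a leading l/I only when the word is unknown and the swapped form is known."""
    if not word or word in dictionary:
        return word
    swapped = {"l": "I", "I": "l"}.get(word[0])
    if swapped is None:
        return word
    candidate = swapped + word[1:]
    return candidate if candidate in dictionary else word


toy_dictionary = {"Iran", "let", "London"}
for sample in ["lran", "Iet", "let", "london"]:
    print(sample, "->", fix_leading_l_or_i(sample, toy_dictionary))
```

Words such as "london", where only the casing is wrong, are deliberately left alone here. That is the job of the name list.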

iKron

25th January 2022, 03:54

I am using the "Subtitle Edit" app to convert PGS to srt. In this example there is the word "ANNIE",

but the app reads it as "ANN IH". In the inspect compare matches option, how can I remove the space between "ANN" and "IH"? Any help, please?

https://i.imgur.com/cuRJKjl.png

Janusz

25th January 2022, 05:12

@iKron
1. Increase the number of pixels by 1 or 2 and check if the gap disappears. If not enough, add more.
2. You have assigned the character H to the image of E. Select that entry and change the assigned text from H to E in the text field.

iKron

25th January 2022, 05:26

@Janusz thank you. Setting the number of pixels for a space to 10 worked fine. I have another problem.

There is a space between two words, but they are merged. Is there any way to add the space? Please check the screenshot.

The last word was OCR'd as "ofAbed"; it is supposed to be "of Abed". It worked fine when I used a pixel space of 8.

https://i.imgur.com/xh6igwk.png

Janusz

25th January 2022, 08:06

@iKron
In this case, decreasing the space will separate the words.
Additionally, enable the [Try to guess unknown words] option, because there may already be corrections on the list of suggestions.
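
Why one value of [No of pixels is space] cannot suit both cases is easy to see in a small Python sketch. The glyph coordinates are invented for the example, and whether SE compares with "greater than or equal" or "greater than" is an assumption.

```python
from typing import List, Tuple

Glyph = Tuple[str, int, int]  # (character, left x, right x) in pixels


def join_glyphs(glyphs: List[Glyph], pixels_is_space: int) -> str:
    """Insert a space wherever the gap to the previous glyph reaches the threshold."""
    text = ""
    previous_right = None
    for char, left, right in glyphs:
        if previous_right is not None and left - previous_right >= pixels_is_space:
            text += " "
        text += char
        previous_right = right
    return text


# Hypothetical glyph boxes: both lines contain one gap of 9 pixels.
annie = [("A", 0, 12), ("N", 14, 26), ("N", 28, 40), ("I", 49, 53), ("E", 55, 65)]
of_abed = [("o", 0, 10), ("f", 12, 18), ("A", 27, 40), ("b", 42, 50), ("e", 52, 60), ("d", 62, 70)]
for threshold in (8, 10):
    print(threshold, join_glyphs(annie, threshold), "|", join_glyphs(of_abed, threshold))
```

When a wide gap inside a word is as large as a narrow gap between two words, every threshold gets one of them wrong, and the rest has to come from the dictionary-based corrections.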

Nikse555

25th January 2022, 18:25

@Janusz: thx for the files - I've tried to improve the ocr fix engine here: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.6.4/SubtitleEditBeta.zip
Better?

iKron

25th January 2022, 18:53

@Janusz thank you for the input. I am really new to Subtitle Edit and am looking for a few suggestions. Is it a wise idea to add unknown words such as "BFFs" to the "user dictionary"?
https://i.imgur.com/znA1MO1.png

What is the difference between "add to name/noise list" and "add to user dictionary"?

And is there any way to disable this option? Whenever I finish in Subtitle Edit, a pop-up box appears.

https://i.imgur.com/N1FKczy.png

Janusz

25th January 2022, 23:41

@Janusz: thx for the files - I've tried to improve the ocr fix engine here: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.6.4/SubtitleEditBeta.zip
Better?

Thanks for the fix. Increasing the distance between the opening and closing quotation marks will have a good effect. Long quotes happen much less often than 3-4 lines.

While looking for a way to recover lost characters quickly and reliably, I ran into an error in [Tools/Fix common error]: checking the [Add missing quotes (")] option will not cause the list to be corrected to show lines with a single (").
This can be checked in the current stable or beta version on our example.

@iKron
Download the files from my post above and follow the description there, and you will see how "add to name/noise list" works.
"add to user dictionary" is the same but without case sensitivity, i.e. adding "london" will cause the program to accept "london", "LOndon" and "London" as correct words.
The pop-up is a suggestion from the author of Subtitle Edit, Nikse555, with whom I am talking above, that you install the video player he recommends. Once you do, you won't see this box again.
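
The difference between the two lists, as described above, comes down to case sensitivity. A minimal Python sketch of that description follows. It has not been checked against SE's source, and the names are made up.

```python
from typing import Set


def is_accepted(word: str, names: Set[str], user_words: Set[str]) -> bool:
    """Names must match exactly; user dictionary words match regardless of case."""
    if word in names:
        return True
    return word.lower() in {w.lower() for w in user_words}


# Name list: only the exact spelling is accepted.
print([is_accepted(w, {"London"}, set()) for w in ("London", "london", "LOndon")])
# User dictionary: every casing is accepted.
print([is_accepted(w, set(), {"london"}) for w in ("London", "london", "LOndon")])
```

So a proper name belongs in the name list, where a wrong casing can still be flagged, while the user dictionary suits ordinary words.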

iKron

26th January 2022, 02:01

Thank you so much, Janusz. Two more questions:

When I converted subtitles via nOCR I got a pop-up box with the options "Foreground" and "NOT foreground". What is the difference between them?

Also, what is the difference between "OCR via nOCR" and "Binary image compare"?

Lastly, is the Tesseract method good? Which method is best for converting the sub?

Janusz

26th January 2022, 08:14

@iKron

What version of the program are you using? I have not seen such a window and I do not know which window you mean. I can only guess that it is about hiding the main program window during OCR.
The basic difference between nOCR and binary image compare is the method of detecting a character from its image and the way it is stored in the character database. nOCR is scaled. Binary image compare can use the nOCR character database.
Tesseract is good, but slower, and it generates a lot of errors. The basic version of SE contains the appropriate files for error correction, so the choice of OCR engine is up to the user.
sub - I take this to mean DVD subtitles; any method is good. I prefer nOCR because of its speed. I use Tesseract when the subtitle font is decorative or very exotic.

tormento

29th January 2022, 11:00

thx for the files
I have issues with a left/right hearing impaired sup file (https://www.mediafire.com/file/wegimhujcv7yll9/manchester_PID_1200_eng.zip/file).

It splits the sentences on left and right side according to the talking actor.

I know that asking you to support {\an*} would lead to excessive programming work, as you already stated.

What would be useful is to fix subtitles with more than 3 lines by removing the CR only where there are commas or spaces, not full stops or capital letters.

Just try to OCR it and fix common errors and you will see what I mean: it mixes dialogue between different actors.

The least I can ask is that the rule not behave in a dumb way. After that, some manual work will await me. :p

Perhaps you could introduce some "special" characters to recognize the left and right side, giving us an easy job with this kind of sup file.

Janusz

10th February 2022, 03:07

@Nikse
SE does not recognize missing <WholeWords> section in _OCRFixReplaceList_User.xml
After installing the program, the first time you use [Add pair to OCR replace list] during Import/OCR ... or via Settings/Word lists [Add pair] to [OCR fix list], the file "_OCRFixReplaceList_User.xml" is created.
If we deliberately remove the <WholeWords> section from it for some reason and forget about it, the program will not create the missing section, allowing you to add new pairs of words that will not be saved anywhere.
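
The missing check amounts to "create the section if it is not there before adding the pair". Here is a Python sketch with the standard library. The root element name `OCRFixReplaceList` and the `Word` element with `from`/`to` attributes are my assumptions about the file layout. Only `WholeWords` is named in this thread, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def add_whole_word_pair(root: ET.Element, wrong: str, right: str) -> ET.Element:
    """Add a replace pair, creating the <WholeWords> section first if it is missing."""
    section = root.find("WholeWords")
    if section is None:
        section = ET.SubElement(root, "WholeWords")
    return ET.SubElement(section, "Word", {"from": wrong, "to": right})


# A user file from which the <WholeWords> section was deleted (assumed layout).
root = ET.fromstring("<OCRFixReplaceList><PartialWords /></OCRFixReplaceList>")
add_whole_word_pair(root, "Iet", "let")
print(ET.tostring(root, encoding="unicode"))
```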

Newrone

10th February 2022, 10:11

Hi,
Is it possible to move the video forwards or backwards frame-by-frame in SubEdit, as it is in Aegisub?
I couldn't find any reference to it and it is sometimes essential to avoid "flashing" subtitles on scene changes.

Sakura-chan

12th February 2022, 10:28

Hi,

So I was trying to export some subs to SUP with Subtitle Edit (https://www.videohelp.com/software/Subtitle-Edit). But no matter what font or style I choose, lines come out horribly misaligned. See:

https://i.ibb.co/TgyRm7P/Untitled.png

(Image as link because it's wide and breaks the forum layout)

Without apparent sense lines randomly appear higher or lower. The first window shows the desired height, the one the most lines are shown at. You can see the other three at varying heights. Double line, italics, caps, it seems it doesn't matter, it makes no sense.

How do you make the bottom line in every picture appear at the same height? :-/

P.S.: I've tried some more. Depending on the font, more or less number of lines are shown aligned. For example Times New Roman is the most consistent, still a few lines are too high or low. Even if it worked it's a horrible font for subs though.

Edit 2: Some shitty fonts, like Tempus Sans ITC, seem perfectly in line. I scrolled through a lot of sub-pictures and they look pixel perfect. Ofc, it's an even more horrible font for subs.

Seems it's a matter of having just the right font? Why can't it work with Arial? Weird.

von Suppé

12th February 2022, 15:41

I can confirm "bobbing" subtitles in Arial to SUP. I think a bug sneaked in, Nikse. You were always keen on text appearing as if written on the same line when exporting to image-based subs.

Nikse555

27th February 2022, 16:49

@Sakura-chan / von Suppé:
Thx for the info - can you still re-create this issue with latest beta?
https://github.com/SubtitleEdit/subtitleedit/releases/download/3.6.4/SubtitleEditBeta.zip

If yes, please link to .sup file with the problem :)

Janusz

28th February 2022, 00:01

@Nikse555
The issue persists in the latest 509 beta.
I sent a sample sup file to your email.

Nikse555

28th February 2022, 01:48

@Janusz: thx for the .sup file example - is this better? https://github.com/SubtitleEdit/subtitleedit/releases/download/3.6.4/SubtitleEditBeta.zip

Nikse555

28th February 2022, 07:41

Janusz

28th February 2022, 10:08

Originally Posted by Janusz:
@Nikse
SE does not recognize missing <WholeWords> section in _OCRFixReplaceList_User.xml ...
This bug has been fixed in beta 511.

However, there is still a shift in subtitle images.
While in the case of the Calibri font, the height of the images for subtitles now differs by 1 pixel for a text consisting of one or two lines, for the Arial font it is already 10 pixels.
You can check this for e.g. lines 14 and 15 in the uploaded file.
Użyję lateksu za 6 dolarów
ze sklepu z kostiumami na Halloween.

Żona zapomniała powiedzieć,
że dziś przychodzi rzeczoznawca.
Another thing with Arial is that changing the "Font size" parameter sets "Line height" to a value less than "Font size". This does not prevent the subtitle image from being displayed correctly, but we will have a problem with OCR of such an image when it contains accented letters, e.g. ŃĆŹŻÓŚ

von Suppé

28th February 2022, 11:18

I can confirm "bobbing subtitles" issue has been solved.
Thanks, Nikse.

Janusz

28th February 2022, 18:30

@Nikse555
SE 3.6.4 next, beta 513 crashes on startup.

Janusz

1st March 2022, 07:39

@Nikse555

I can confirm "bobbing subtitles" issue has been solved.
Thanks, Nikse.

Not exactly like that.
Export the word "ibuprom" to sup.
The text exported with Arial will be shifted up by 10 pixels compared to Calibri.

It's also Arial. As you can see, the text looks good here.
https://forum.doom9.org/attachment.php?attachmentid=18039&stc=1&d=1646138513

von Suppé

1st March 2022, 13:45

Not exactly like that.
Export the word "ibuprom" to sup.
The text exported with Arial will be shifted up by 10 pixels compared to Calibri.

I usually don't compare between two different fonts.

But in font Arial the word ibuprom certainly is raised compared to other Arial text.
I think using the letter "p" makes the difference in Arial. Type "iburom" and it's okay.

In Calibri the word ibuprom seems the same height as other text in Calibri. My earlier quick & dirty tests must have been without "p", I suppose.
So, I think you're right. It seems not 100% fixed yet.

Thanks for the heads-up, Janusz.

[EDIT] Mmm this gets weirder. A line with "up" shows correctly. Also "apple". And "upr" gets raised again. This seems erratic.