Subtitle Edit 4.0.4 - Page 76

Nikse555 · 3rd December 2021, 12:39

SE 3.6.4 is out: https://github.com/SubtitleEdit/subtitleedit/releases

Fixes an issue with the Blu-ray sup palette (thx to Master Yoda), fixes an issue with "Set start and offset the rest" where the first selected line would not change, and adds support for Tesseract OCR 5.00 final.
Change log: https://raw.githubusercontent.com/Su.../Changelog.txt

tormento · 4th December 2021, 11:32

Quote:

Originally Posted by Nikse555

support for Tesseract OCR 5.00 final

Many DLLs come with the Windows builds. Are all of them needed?

Is there a way to speed it up?

Nikse555 · 4th December 2021, 12:11

Quote:

Originally Posted by tormento

Many DLLs come with the Windows builds. Are all of them needed?

Is there a way to speed it up?

I really don't know if all of the Tesseract dlls are needed.

If any C++ experts read this, perhaps they know whether e.g. static linking into a single exe file would make Tesseract faster to load?
If SE used the DLL instead of calling "tesseract.exe" for each image, that would be faster, but using the DLL had some problems last time I tested it (some years back).

tormento · 9th December 2021, 10:56

Quote:

Originally Posted by Nikse555

using the dll had some problems last time I tested it

Please try again.

While decoupling of OCR hard rules hasn't happened yet, would you please add:

English

'II can't exist, it should always be 'll

Italian

Io, Ia, Ii can't exist as a single word unless a . is before it or it's on a new line. It should be lo, la, li
Ià, Iì can't exist at all as a single word. It should be là, lì
I' can't exist at all as a single word. It should be l'
II can't exist as a single word unless a . is before it or it's on a new line. It should be Il

I will update the list as needed.
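
The rules requested above can be sketched as regular expressions. This is a minimal Python sketch, not Subtitle Edit's actual code: the function names fix_english and fix_italian are made up for illustration, "unless a . is before it or it's on a new line" is read here as "preceded on the same line by a character other than a full stop, then one space", and the II rule is left out because a standalone II can also be a Roman numeral.

```python
import re

# English: "'II" after a letter is assumed to be a misread "'ll" (you'II -> you'll).
EN_LL = re.compile(r"(?<=[A-Za-z])'II\b")

# Italian: Io / Ia / Ii as a whole word in mid-sentence. The lookbehind demands
# "<char that is neither whitespace nor a full stop><space>", so matches at the
# start of a line or right after ". " are left alone.
IT_MID_SENTENCE = re.compile(r"(?<=[^\s.] )I([oai])\b")

# Italian: Ià / Iì are never valid as a whole word.
IT_ACCENTED = re.compile(r"\bI([àì])\b")

# Italian: I' directly before a letter is assumed to be a misread l'.
# Only safe for Italian text (English "I'm" would also match).
IT_ELISION = re.compile(r"\bI'(?=\w)")


def fix_english(text):
    """Apply the English 'II -> 'll rule."""
    return EN_LL.sub("'ll", text)


def fix_italian(text):
    """Apply the three Italian I -> l rules in sequence."""
    text = IT_MID_SENTENCE.sub(r"l\1", text)
    text = IT_ACCENTED.sub(r"l\1", text)
    text = IT_ELISION.sub("l'", text)
    return text


print(fix_english("That'II keep you around"))
print(fix_italian("Non Io so. Io vado Ià con I'amico."))
```

The second call keeps the sentence-initial "Io" and changes only the mid-sentence cases.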

Janusz · 12th December 2021, 23:35

Here is one way to successfully fix "l" to "I" and "I" to "l" in the OCR process.

Files to download:
it.test.txt - random text from the Italian website, so probably flawless (h__ps: //www.ilsole24ore.com/),
it.test.0.srt - Italian text converted to srt file (it will be used to compare the OCR result),
it.test.sup - the .sup file contains both the letters "l" and "I",
it.test.i.sup - the .sup file does not contain "l" ("l" has been replaced with "I"),
it.test.nocr - character database contains both "l" and "I",
ita_OCRFixReplaceList.xml - this file does all the work.
The files should be placed in the correct directories.
A few words about the settings in the program:
Option / Settings / Tools:
Fix common OCR errors ... - on,
Auto fix names where ... - on / off, Also fix names via ... - on / off (for this test, no difference).
Import / OCR:
OCR method - OCR via nOCR, No of pixels is space - 10 (11 starts to connect words)
Max wrong pixels - 8, Constants italic - off, Line split min ... - Auto, Language - it.test
Dictionary - Italian, Prompt for unknown words - off, Try to guess unknown words - on
Binary image compare threshold - 200.

I will leave the result of the comparison without comment. Everyone can judge it for themselves.
Whether the method I used will give an equally good result in another language - I do not know.
In Polish, English, Italian (as you can see) - yes. I'm not saying that this way of solving the "l" and "i" problems is perfect.
My character databases do not contain "I" so I can say that the problem with "l" does not apply to me, regardless of the language used.
What about "I"? I think 99% or more is done by this one RegEx.

@tormento:

Quote:

II can't exist as single word unless a . is before it or it's on a new line - It should be Il

What about: I II III IIII IV V etc? (IIII is also correct).

@Nikse:
From the number and type of differences (4) it can be seen that Fix common OCR errors ... requires some fine-tuning ('' double accent - 3 errors).
Using this method, I work around the lack of correction for words shorter than 5 characters in Fix common OCR errors ...
Thanks for this great software.

tormento · 13th December 2021, 09:27

Quote:

Originally Posted by Janusz

What about: I II III IIII IV V etc? (IIII is also correct).

I prefer rare wrong Roman numerals to frequent wrong I* spellings. And, no, IIII is not correct. Some people use it, but it's not.

Janusz · 13th December 2021, 11:32

Quote:

Originally Posted by tormento

I prefer rare wrong Roman numerals to frequent wrong I* spellings.

We are of the same opinion on this matter.

Quote:

And, no, IIII is not correct. Some people use it, but it's not.

As for IIII - I have a different opinion. Just because it's not common doesn't mean it's wrong.

tormento · 13th December 2021, 11:43

Quote:

Originally Posted by Janusz

As for IIII - I have a different opinion. Just because it's not common doesn't mean it's wrong.

I did five years of Latin at an Italian lyceum. Believe me, it's a vernacular notation rather than a correct Latin numeral, much like the typical and unfortunate American habit of conjugating irregular past-tense verbs with -ed.

Nikse555 · 19th December 2021, 13:10

Quote:

Originally Posted by tormento

English

'II can't exist, it should always be 'll

The English issue should be fixed in this commit:
https://github.com/SubtitleEdit/subt...83dbc541e5b27f

The Italian corrections I've tried to fix in this commit: https://github.com/SubtitleEdit/subt...56c93f0ddfe746

SE beta updated: https://github.com/SubtitleEdit/subt...leEditBeta.zip

The regular expressions for Italian can be tested here:
http://regexstorm.net/tester?p=%28%5...0a&r=%241l%243

http://regexstorm.net/tester?p=%5cb%...e+word&r=l%242

http://regexstorm.net/tester?p=%28%5...+line&r=%241Il

tormento · 20th December 2021, 07:39

Quote:

Originally Posted by Nikse555

The regular expressions for Italian can be tested here:

The "Io ", "Ia " and "Ii " are missing.

locotus · 24th December 2021, 17:59

That'|| keep you around and I don't think
you'll be doing any card tricks either.

That error survives OCR with Tesseract 3.02, spelling correction and fix common errors.
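
One possible post-OCR rule for this case, as a minimal Python sketch: two pipe characters directly after a letter and an apostrophe are assumed to be a misread "ll". The function name fix_pipe_ll is hypothetical, and this is not how Subtitle Edit handles it.

```python
import re

# Two pipes right after "<letter>'" are assumed to be a misread "ll".
PIPES_AFTER_APOSTROPHE = re.compile(r"(?<=[A-Za-z]')\|\|")


def fix_pipe_ll(text):
    """Replace '|| with 'll when it follows a letter; leave other pipes alone."""
    return PIPES_AFTER_APOSTROPHE.sub("ll", text)


print(fix_pipe_ll("That'|| keep you around and I don't think"))
```

Pipes that do not follow a letter and an apostrophe, such as a standalone "||", are not touched.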

Merry Christmas to all.

GCRaistlin · 30th December 2021, 20:39

Please add a keyboard shortcut for [ ] Auto submit on first char (underline one of its letters) in the OCR - Manual image to text window.

UPD: sorry, missed that it is already there.

UPD2: the underlined letter doesn't work when a non-English keyboard layout is active. That is expected, but still not handy. Could you please change the order of the elements in this window so that Shift-Tab in the Character(s) as text field moves the focus to [ ] Auto submit on first char?

GCRaistlin · 31st December 2021, 01:14

Bug:

During OCRing, we're submitting the incorrect character for a glyph (а instead of АЯ):
Pressing Edit last: a button to remove the incorrect database record:
Deleting the incorrect record by pressing Delete:
Note the selected area:

Nikse555 · 31st December 2021, 21:41

Beta updated: https://github.com/SubtitleEdit/subt...leEditBeta.zip

@tormento: The "Io ", "Ia " and "Ii " should work.

@locotus: "That'||" and words like that should also work better.

@GCRaistlin: Ctrl+I and Ctrl+A should now toggle check boxes. Also improved tab stop a little. Could not re-create the other issues.

Also, SE now uses a new "fix-words-without-spaces" word list - should improve OCR'ing e.g. italic text - see more here: https://github.com/SubtitleEdit/subt...scussions/5616

Happy New Year

Janusz · 1st January 2022, 23:01

@GCRaistlin

Quote:

Originally Posted by GCRaistlin

Bug:

During OCRing, we're submitting the incorrect character for a glyph (а instead of АЯ): ...

This is not a bug in the program.
OCR was performed correctly up to the point where it stopped, i.e. at the unrecognized character "7",
so "a" is displayed in place of "АЯ" because that is how it was processed and saved due to your error.
Changing the content of the character database at this point by means of Delete will not affect the previously processed text.
Only when you press [Start OCR] again for this image will you be asked, at the place where "a" was deleted, to reassign the character(s) for "АЯ",
and the new, correct screen content will be displayed only after the entire line (the whole picture) has been processed.
That's how it works.

@Nikse555

Quote:

@GCRaistlin: Ctrl + I and Ctrl + A should now toggle check boxes. Also improved tab stop a little. Could not re-create the other issues.

CTRL + A - This is a bad idea. This is the default Windows shortcut for selecting everything in a document or window. In this case, selecting the text in the [Character (s) as text] box also sets [Auto submit on first char].

*** Happy New Year ***

Nikse555 · 2nd January 2022, 14:10

@Janusz: "Ctrl + A" for "auto-submit first char" is now "Ctrl+F". Thx, nice catch

I've added a Polish word split list built from 44 subtitle files. Do give it a test (via "Fix common errors" - "Fix common OCR error" with lines-without-spaces). The "fix-words-without-spaces" word list should improve OCR'ing of e.g. italic text. See more here: https://github.com/SubtitleEdit/subt...scussions/5616

Janusz · 2nd January 2022, 17:11

@Nikse555
I know this discussion and have been following it for some time, so my dictionary already contains over 26,500 words. For a sup file generated from this dictionary's words in random order, it works very well, with one "but".
I had to remove all single letters that are words in Polish (a, i, o, u, w, z) or abbreviations of words (l, m, C), because in normal text containing unknown words (e.g. untranslated surnames, proper names) the single letters caused an unknown word to be split into smaller one- or two-letter words that match the Polish dictionary.
This new functionality will certainly be useful where italics or fonts of different sizes are used whose spacing between words differs significantly.
I will, of course, check your dictionary.
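
The effect of single-letter entries described above can be reproduced with a toy splitter. This is a minimal Python sketch under stated assumptions: it is not Subtitle Edit's algorithm, the function name split_into_words is hypothetical, and the tiny word sets and the token "izao" (standing in for an unknown name) are made up for illustration.

```python
def split_into_words(token, dictionary):
    """Return dictionary words that concatenate to token, or None if impossible."""
    # best[i] holds the shortest known segmentation of token[:i], or None.
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in dictionary:
                candidate = best[start] + [piece]
                # Prefer the segmentation with the fewest pieces.
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(token)]


words = {"to", "jest", "dom"}                   # toy word list (assumption)
single_letters = {"a", "i", "o", "u", "w", "z"}

# Glued-together known words are split either way.
print(split_into_words("tojestdom", words))

# An unknown token stays intact without single letters in the list ...
print(split_into_words("izao", words))
# ... but is chopped into one-letter "words" once they are included.
print(split_into_words("izao", words | single_letters))
```

With the single letters in the list, any token made up only of those letters has a valid segmentation, which is why removing them protects unknown names.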

Janusz · 3rd January 2022, 18:05

I do not know what to call it. In any case, the situation is as follows.

Image to download:

I was doing OCR and encountered something like this:

OCR misread "Ź" - you need to fix it, so:

Surprise. Line read almost flawlessly and here the same "?".
What is it about? I press [Add].

Really, I don't know what to type. @Nikse555, save a father of a large family.
I would like to add that the Preview window shows the correct division of the image into the upper and lower text.

I also tried binary comparison.
First OCR start - ok - flawless, everything in its place,
second time [Start OCR] - I just had the program asking to add "Z".
With Tesseract 3.02 - standard - almost fine, but you can press [Start OCR] at will - the result is always the same.
And here is probably a bug in the program. With Tesseract 3.02 I went back to binary comparison and after pressing [Start OCR] the program crashed. This has been the case several times.
After a reboot, the program no longer hangs, but after another [Start OCR] letters "c" and "Ź" disappear. Although everything is in place in the [Inspect ...] window.

Nikse555 · 8th January 2022, 22:26

@Janusz: Perhaps you could use a file host?
So I should remove the single letters from the Polish word-split-list? If you generate one, I'll be happy to include it too.

Janusz · 10th January 2022, 14:37

@Nikse
I have just sent pol_WordSplitList.txt to your email.
A dictionary (385 kB, 39256 words) was generated for len<5 = 20 and len>=5 = 10.
With "default 11 and 5", there are not many more words and the file grows twice as large.
I have left only the really necessary individual letters in it. I have also removed a few 3-letter words that are missing from the dictionary I use (pl_PL.dic from 1 December 2021).
Because the rules I have used so far for splitting concatenated words work fine with the optimal [No of pixels is space], the effect of pol_WordSplitList.txt is visible only once that value is exceeded by 2 or more pixels. It works very well; it can deal with clumps of 3 or even 4 words.
The number of errors depends on how far [No of pixels is space] is from the optimal value.

A few remarks on the operation of the program:

The rule in pol_OCRFixReplaceList.xml does not work for:
<WordPart from = "W" to = "W " />
<WordPart from = "w" to = "w " />
if "w" or "W" is followed by "ż". For other characters, it's OK. See the picture below: lines 1, 3 and 4 (ok) and 2 (wrong).
Regarding the use of rigid rules: this concerns the Polish language, and perhaps other ones as well. Already during OCR, I miss:
- removing whitespace, if any, before the characters ! ? : ; , and adding whitespace after them if there is none (lines 5 and 6).
- no formatting for "." - that's probably good - at least line 9 looks correct.
- while it is impossible to determine whether a single character (") in a single line opens or closes the quotation marks, and hence whether the space belongs on its left or right side, replacing (".) with (...) is in my opinion a mistake (lines 13 and 17).
- is the unconditional change of (i) to (I) already at the OCR stage in line 8 right, if it does not happen after (.) and (...) in line 6?
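
The first item in the list above could look roughly like this. It is a minimal Python sketch, not an existing Subtitle Edit rule: the function name normalize_punctuation_spacing is hypothetical, and a space is added only when a letter follows directly, which is an assumption made here so that times such as 12:30 are left alone.

```python
import re

# Whitespace (spaces or tabs) directly before ! ? : ; ,
SPACE_BEFORE = re.compile(r"[ \t]+([!?:;,])")

# One of ! ? : ; , directly followed by a letter ([^\W\d_] = any Unicode letter).
MISSING_SPACE_AFTER = re.compile(r"([!?:;,])(?=[^\W\d_])")


def normalize_punctuation_spacing(line):
    """Remove whitespace before ! ? : ; , and add a space after them if a letter follows."""
    line = SPACE_BEFORE.sub(r"\1", line)
    line = MISSING_SPACE_AFTER.sub(r"\1 ", line)
    return line


print(normalize_punctuation_spacing("Co ?Nie wiem ,ale"))
print(normalize_punctuation_spacing("Jest 12:30"))
```

Languages that put a space before ? or ! by convention would need this rule switched off.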

Thanks for your work.
