Old 3rd December 2021, 12:39   #1501  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
SE 3.6.4 is out: https://github.com/SubtitleEdit/subtitleedit/releases

Fixes issue with blu-ray sup palette (thx to Master Yoda) + fixes an issue with "Set start and offset the rest" where first selected line would not change + support for Tesseract OCR 5.00 final.
Change log: https://raw.githubusercontent.com/Su.../Changelog.txt

Last edited by Nikse555; 3rd December 2021 at 14:08.
Old 4th December 2021, 11:32   #1502  |  Link
tormento
Acid fr0g
 
Join Date: May 2002
Location: Italy
Posts: 2,582
Quote:
Originally Posted by Nikse555 View Post
support for Tesseract OCR 5.00 final
Many DLLs come with the Windows builds. Are all of them needed?

Is there a way to speed it up?
__________________
@turment on Telegram

Last edited by tormento; 4th December 2021 at 11:42.
Old 4th December 2021, 12:11   #1503  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
Quote:
Originally Posted by tormento View Post
Many DLLs come with the Windows builds. Are all of them needed?

Is there a way to speed it up?

I really don't know if all of the Tesseract dlls are needed.

If some C++ experts read this, then perhaps they will know if e.g. static linking with single exe file will make Tesseract faster to load?
If SE used the dll instead of calling "tesseract.exe" for each image, that would be faster - but using the dll had some problems last time I tested it (some years back).
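
To make the per-image cost concrete, here is a minimal Python sketch (not SE's actual C# code) of the "one tesseract.exe process per image" approach. It assumes the usual command-line form `tesseract <image> stdout -l <lang>`; the function names and the injectable `runner` parameter are made up for illustration.

```python
import subprocess


def build_tesseract_command(image_path, language="eng"):
    # Assumed CLI form: tesseract <image> stdout -l <lang>
    # ("stdout" asks Tesseract to print the recognized text).
    return ["tesseract", image_path, "stdout", "-l", language]


def run_command(command):
    # Starts a brand-new process; this startup (loading DLLs and
    # language data) is paid again for every single image.
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout


def ocr_images(image_paths, language="eng", runner=run_command):
    # One process launch per subtitle image. `runner` is a hypothetical
    # hook so the loop can be exercised without Tesseract installed.
    return [runner(build_tesseract_command(path, language)).strip()
            for path in image_paths]
```

With a few thousand subtitle images, the launch overhead is repeated a few thousand times, which is why keeping the engine loaded in-process (the dll route) would likely be faster.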
Old 9th December 2021, 10:56   #1504  |  Link
tormento
Acid fr0g
 
Join Date: May 2002
Location: Italy
Posts: 2,582
Quote:
Originally Posted by Nikse555 View Post
using the dll had some problems last time I tested it
Please try again.

While decoupling of OCR hard rules hasn't happened yet, would you please add:

English
  • 'II can't exist, it should always be 'll
Italian
  • Io, Ia, Ii can't exist as single word unless a . is before it or it's on a new line — It should be lo, la, li
  • Ià, Iì can't exist at all as single word — It should be là, lì
  • I' can't exist at all as single word — It should be l'
  • II can't exist as single word unless a . is before it or it's on a new line — It should be Il
I will update the list as needed.
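
The rules listed above can be sketched as regular expressions. This is only an illustration in Python, not the rules that ended up in SE; the helper names are invented, and "sentence start" is simplified to "start of text, start of a line, or right after a period".

```python
import re


def _starts_sentence(match):
    # True when the match is at the start of the text, at the start of
    # a line, or directly after a "." (spaces in between are ignored).
    before = match.string[:match.start()].rstrip(" ")
    return before == "" or before.endswith(("\n", "."))


def _unless_sentence_start(replacement):
    # Keep the original word at a sentence start, otherwise replace it.
    def repl(match):
        return match.group(0) if _starts_sentence(match) else replacement(match)
    return repl


def fix_english(text):
    # 'II after a word character is always 'll (That'II -> That'll).
    return re.sub(r"(?<=\w)'II\b", "'ll", text)


def fix_italian(text):
    text = re.sub(r"\bI([àì])\b", r"l\1", text)   # Ià, Iì -> là, lì (always)
    text = re.sub(r"\bI'", "l'", text)            # I' -> l' (always)
    text = re.sub(r"\bI([oai])\b",                # Io, Ia, Ii -> lo, la, li
                  _unless_sentence_start(lambda m: "l" + m.group(1)), text)
    text = re.sub(r"\bII\b",                      # II -> Il
                  _unless_sentence_start(lambda m: "Il"), text)
    return text
```

For example, `fix_italian("Non Io so")` gives `"Non lo so"`, while a leading `"Io"` is left alone. Note that a Roman numeral II in the middle of a sentence would also be rewritten to "Il" by this sketch.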
__________________
@turment on Telegram

Last edited by tormento; 12th December 2021 at 09:25.
Old 12th December 2021, 23:35   #1505  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
Here is one way to successfully fix "l" to "I" and "I" to "l" in the OCR process.

Files to download:
it.test.txt - random text from an Italian website (https://www.ilsole24ore.com/), so probably flawless,
it.test.0.srt - Italian text converted to srt file (it will be used to compare the OCR result),
it.test.sup - the .sup file contains both the letters "l" and "I",
it.test.i.sup - the .sup file does not contain "l" ("l" has been replaced with "I"),
it.test.nocr - character database contains both "l" and "I",
ita_OCRFixReplaceList.xml - this file does all the work.
The files should be placed in the correct directories.
A few words about the settings in the program:
Option / Settings / Tools:
Fix common OCR errors ... - on,
Auto fix names where ... - on / off, Also fix names via ... - on / off (for this test, no difference).
Import / OCR:
OCR method - OCR via nOCR, No of pixels is space - 10 (11 starts to connect words)
Max wrong pixels - 8, Constants italic - off, Line split min ... - Auto, Language - it.test
Dictionary - Italian, Prompt for unknown words - off, Try to guess unknown words - on
Binary image compare threshold - 200.

I will leave the result of the comparison without comment. Everyone can judge it for themselves.
Whether the method I used will give an equally good result in another language - I do not know.
In Polish, English, Italian (as you can see) - yes. I'm not saying that this way of solving the "l" and "i" problems is perfect.
My character databases do not contain "I" so I can say that the problem with "l" does not apply to me, regardless of the language used.
What about "I"? I think 99% or more is done by this one RegEx.

@tormento:
Quote:
II can't exist as single word unless a. is before it or it's on a new line - It should be Il
What about: I II III IIII IV V etc? (IIII is also correct).

@Nikse:
From the number and type of differences (4) it can be seen that Fix common OCR errors ... requires some fine-tuning ('' double accent - 3 errors).
Using this method, I work around the fact that Fix common OCR errors ... does not correct words shorter than 5 characters.
Thanks for this great software.
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 13th December 2021 at 05:44.
Old 13th December 2021, 09:27   #1506  |  Link
tormento
Acid fr0g
 
Join Date: May 2002
Location: Italy
Posts: 2,582
Quote:
Originally Posted by Janusz View Post
What about: I II III IIII IV V etc? (IIII is also correct).
I prefer to have rare wrong Roman numerals than frequent wrong I* spellings. And no, IIII is not correct. Some people use it, but it's not.
__________________
@turment on Telegram
Old 13th December 2021, 11:32   #1507  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
Quote:
Originally Posted by tormento View Post
I prefer to have rare wrong roman numbers than frequent wrong I* spellings.
We are of one mind on this matter.

Quote:
And, no, IIII is not correct. Someone uses it but it's not.
As for IIII - I have a different opinion. Just because it's not common doesn't mean it's wrong.
__________________
Sorry for my mistakes - I'm using a translator.
Old 13th December 2021, 11:43   #1508  |  Link
tormento
Acid fr0g
 
Join Date: May 2002
Location: Italy
Posts: 2,582
Quote:
Originally Posted by Janusz View Post
As for IIII - I have a different opinion. Just because it's not common doesn't mean it's wrong.
I did five years of Latin in the Italian lyceum. Believe me, it's a vernacular notation rather than a correct Latin numeral, much like the typical and unfortunate American habit of conjugating irregular past verbs with -ed.
__________________
@turment on Telegram
Old 19th December 2021, 13:10   #1509  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
Quote:
Originally Posted by tormento View Post

English
  • 'II can't exist, it should always be 'll
The English issue should be fixed in this commit:
https://github.com/SubtitleEdit/subt...83dbc541e5b27f

The Italian corrections I've tried to fix in this commit: https://github.com/SubtitleEdit/subt...56c93f0ddfe746

SE beta updated: https://github.com/SubtitleEdit/subt...leEditBeta.zip


The regular expressions for Italian can be tested here:
http://regexstorm.net/tester?p=%28%5...0a&r=%241l%243

http://regexstorm.net/tester?p=%5cb%...e+word&r=l%242

http://regexstorm.net/tester?p=%28%5...+line&r=%241Il
Old 20th December 2021, 07:39   #1510  |  Link
tormento
Acid fr0g
 
Join Date: May 2002
Location: Italy
Posts: 2,582
Quote:
Originally Posted by Nikse555 View Post
The regular expressions for Italian can be tested here:
The "Io ", "Ia " and "Ii " are missing.
__________________
@turment on Telegram
Old 24th December 2021, 17:59   #1511  |  Link
locotus
Registered User
 
Join Date: Nov 2005
Posts: 112
That'|| keep you around and I don't think
you'll be doing any card tricks either.

That error survives OCR with Tesseract 3.02, spelling correction, and Fix common errors.
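
One possible post-OCR fix for this kind of error, sketched in Python: treat pipe characters that directly follow a letter or an apostrophe as misread lowercase "l". The function name is made up, and this is not SE's actual rule.

```python
import re


def fix_pipes_as_l(text):
    # A run of "|" glued to a preceding letter or apostrophe is assumed
    # to be misread "l" characters; a standalone "|" is left untouched.
    return re.sub(r"(?<=[A-Za-z'])\|+",
                  lambda m: "l" * len(m.group(0)),
                  text)
```

Applied to the line above, `That'||` becomes `That'll`.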

Merry Christmas to all.

Last edited by locotus; 24th December 2021 at 18:26.
Old 30th December 2021, 20:39   #1512  |  Link
GCRaistlin
Registered User
 
Join Date: Jun 2006
Posts: 353
Please add a keyboard shortcut for [ ] Auto submit on first char (underline one of its letters) in the OCR - Manual image to text window.

UPD: sorry, missed that it is already there.

UPD2: the underlined letter doesn't work when a non-English keyboard layout is active. That is expected, but still not handy. Could you please change the order of the elements in this window so that Shift-Tab in the Character(s) as text field moves the focus to [ ] Auto submit on first char?
__________________
Windows 8.1 x64

Magically yours
Raistlin

Last edited by GCRaistlin; 31st December 2021 at 00:58.
Old 31st December 2021, 01:14   #1513  |  Link
GCRaistlin
Registered User
 
Join Date: Jun 2006
Posts: 353
Bug:
  1. During OCRing, we're submitting the incorrect character for a glyph (а instead of АЯ):
  2. Pressing Edit last: a button to remove the incorrect database record:
  3. Deleting the incorrect record by pressing Delete:
  4. Note the selected area:
__________________
Windows 8.1 x64

Magically yours
Raistlin
Old 31st December 2021, 21:41   #1514  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
Beta updated: https://github.com/SubtitleEdit/subt...leEditBeta.zip

@tormento: The "Io ", "Ia " and "Ii " should work.

@locotus: "That'||" and words like that should also work better.

@GCRaistlin: Ctrl+I and Ctrl+A should now toggle check boxes. Also improved tab stop a little. Could not re-create the other issues.


Also, SE now uses a new "fix-words-without-spaces" word list - should improve OCR'ing e.g. italic text - see more here: https://github.com/SubtitleEdit/subt...scussions/5616

Happy New Year
Old 1st January 2022, 23:01   #1515  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
@GCRaistlin
Quote:
Originally Posted by GCRaistlin View Post
Bug:
  1. During OCRing, we're submitting the incorrect character for a glyph (а instead of АЯ): ...
This is not a bug in the program.
OCR ran correctly up to the point where it stopped, i.e. the unrecognized character "7",
so "a" is displayed in place of "АЯ" because that is how it was processed and saved due to your error.
Changing the contents of the character database at this point with Delete does not affect text that has already been processed.
Only when you press [Start OCR] again for this image will you be asked, at the place where "a" was deleted, to reassign the character(s) for "АЯ",
and the new, correct screen content is displayed only after the entire line (the whole picture) has been processed.
That's how it works.

@Nikse555
Quote:
@GCRaistlin: Ctrl + I and Ctrl + A should now toggle check boxes. Also improved tab stop a little. Could not re-create the other issues.
CTRL + A - This is a bad idea. This is the default Windows shortcut for selecting everything in a document or window. In this case, selecting the text in the [Character (s) as text] box also sets [Auto submit on first char].

*** Happy New Year ***
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 2nd January 2022 at 11:40.
Old 2nd January 2022, 14:10   #1516  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@Janusz: "Ctrl + A" for "auto-submit first char" is now "Ctrl+F". Thx, nice catch
I've added a Polish word split list built from 44 subtitle files. Do give it a test (via "Fix common errors" - "Fix common OCR errors" with lines-without-spaces). The "fix-words-without-spaces" word list should improve OCR'ing of e.g. italic text - see more here: https://github.com/SubtitleEdit/subt...scussions/5616
Old 2nd January 2022, 17:11   #1517  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
@Nikse555
I know this discussion and have been following it for some time, so my dictionary already contains over 26,500 words. For a sup file generated from this dictionary, with the words in random order, it works very well, with one "but".
I had to remove all single letters, both those that are words in Polish (a, i, o, u, w, z) and those that are abbreviations (l, m, C). In normal text containing unknown words (e.g. untranslated surnames, proper names), the single letters caused an unknown word to be split into smaller one- or two-letter words that match the Polish dictionary.
This new functionality will certainly be useful where italics or fonts of different sizes are used that differ significantly in the spacing between words.
I will, of course, check your dictionary.
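
The single-letter problem can be reproduced with a small sketch. This is a generic dictionary-based splitter in Python, written only to illustrate the effect; it is not SE's algorithm, and the function name and sample word sets are invented.

```python
def split_joined_words(joined, dictionary):
    # Dynamic programming: best[i] holds the split of joined[:i] into
    # dictionary words that uses the fewest words, or None if no split exists.
    n = len(joined)
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = joined[start:end]
            if best[start] is not None and piece in dictionary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


words = {"to", "jest", "dom"}
with_single_letters = words | {"a", "i", "o", "u", "w", "z"}

print(split_joined_words("tojestdom", words))          # ['to', 'jest', 'dom']
print(split_joined_words("iwo", with_single_letters))  # ['i', 'w', 'o']
print(split_joined_words("iwo", words))                # None
```

With single letters in the list, the unknown name "iwo" is chopped into three one-letter "words"; without them, no split is found and the name can be left untouched.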
__________________
Sorry for my mistakes - I'm using a translator.
Old 3rd January 2022, 18:05   #1518  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
I do not know what to call this. In any case, here is what happens.

Image to download:

I was doing OCR and encountered something like this:



OCR misread "Ź" - you need to fix it, so:



Surprise. Line read almost flawlessly and here the same "?".
What is it about? I press [Add].



Really, I don't know what to type. @Nikse555, save a father of a large family.
I would like to add that the Preview window shows the correct division of the image into the upper and lower text.

I also tried binary comparison.
First OCR start - ok - flawless, everything in its place,
second time [Start OCR] - I just had the program asking to add "Z".
With Tesseract 3.02 - standard - almost fine, but you can press [Start OCR] at will - the result is always the same.
And here is probably a bug in the program. With Tesseract 3.02 I went back to binary comparison and after pressing [Start OCR] the program crashed. This has been the case several times.
After a reboot, the program no longer hangs, but after another [Start OCR] letters "c" and "Ź" disappear. Although everything is in place in the [Inspect ...] window.
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 3rd January 2022 at 20:10.
Old 8th January 2022, 22:26   #1519  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@Janusz: Perhaps you could use a file host?
So I should remove the single letters from the Polish word-split-list? If you generate one, I'll be happy to include it too.

Last edited by Nikse555; 9th January 2022 at 13:09.
Old 10th January 2022, 14:37   #1520  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
@Nikse
I have just sent pol_WordSplitList.txt to your email.
The dictionary (385 kB, 39256 words) was generated for len<5 = 20 and len>=5 = 10.
With "default 11 and 5", there are not many more words and the file grows twice as large.
I have left only the really necessary single letters in it. I have also removed a few 3-letter words that are missing from the dictionary I use (pl_PL.dic from 1 December 2021).
Because the rules I have used so far for splitting concatenated words work fine with the optimal [No of pixels is space] value, the effect of pol_WordSplitList.txt is visible only once that value is exceeded by 2 or more pixels. It works very well and can deal with clumps of 3 or even 4 words.
The number of errors depends on how far [No of pixels is space] is from the optimal value.

A few remarks on the operation of the program:
  1. The rule in pol_OCRFixReplaceList.xml does not work for:
    <WordPart from="W" to="W " />
    <WordPart from="w" to="w " />
    if "w" or "W" is followed by "ż". For other characters, it's OK. See the picture below: lines 1, 3 and 4 (ok) and 2 (wrong).


  2. Regarding the use of hard-coded rules: this concerns Polish, and perhaps other languages as well. Already during OCR, I miss:
    • removing whitespace, if any, before the characters ! ? : ; , and adding whitespace after them if there is none (lines 5 and 6).
    • no formatting for "." - that's probably good - at least line 9 looks correct.
    • since it is impossible to determine whether a single (") on a line opens or closes a quotation, and therefore whether its
      space belongs on the left or the right, replacing (".) with (...) is in my opinion a mistake (lines 13 and 17).
    • why is (i) unconditionally changed to (I) already at the OCR stage in line 8, when that does not happen after (.) and (...) in line 6?
Thanks for your work.
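
The whitespace rule from the first bullet above could look roughly like this. It is a Python sketch under my own assumptions (digits, quotes, tags and further punctuation are skipped so that "12:30", "1,5" or "?!" stay intact); the function name is invented and this is not SE's implementation.

```python
import re

# Spaces or tabs directly before one of ! ? : ; ,
_SPACE_BEFORE = re.compile(r"[ \t]+([!?:;,])")
# One of ! ? : ; , directly followed by a character that should be
# separated from it (not whitespace, punctuation, a digit, a quote or "<").
_MISSING_SPACE_AFTER = re.compile(r"""([!?:;,])(?=[^\s!?:;,.\d"'<])""")


def normalize_punctuation_spacing(line):
    line = _SPACE_BEFORE.sub(r"\1", line)          # "jest ?" -> "jest?"
    line = _MISSING_SPACE_AFTER.sub(r"\1 ", line)  # "?Nie"   -> "? Nie"
    return line
```

For example, `"Co to jest ?Nie wiem ,naprawdę !"` becomes `"Co to jest? Nie wiem, naprawdę!"`.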
__________________
Sorry for my mistakes - I'm using a translator.