3rd December 2021, 12:39 | #1501 | Link |
Registered User
Join Date: Feb 2004
Location: Mars
Posts: 428
|
SE 3.6.4 is out: https://github.com/SubtitleEdit/subtitleedit/releases
Fixes an issue with the Blu-ray sup palette (thx to Master Yoda), fixes an issue with "Set start and offset the rest" where the first selected line would not change, and adds support for Tesseract OCR 5.00 final.
Change log: https://raw.githubusercontent.com/Su.../Changelog.txt
Last edited by Nikse555; 3rd December 2021 at 14:08. |
4th December 2021, 12:11 | #1503 | Link | |
Registered User
Join Date: Feb 2004
Location: Mars
Posts: 428
|
Quote:
I really don't know if all of the Tesseract dlls are needed. If some C++ experts read this, perhaps they will know whether e.g. static linking into a single exe file would make Tesseract faster to load? If SE used the dll instead of calling "tesseract.exe" for each image, that would be faster - but using the dll had some problems the last time I tested it (some years back). |
|
9th December 2021, 10:56 | #1504 | Link |
Acid fr0g
Join Date: May 2002
Location: Italy
Posts: 2,582
|
Please try again.
While decoupling of OCR hard rules hasn't happened yet, would you please add: English
__________________
@turment on Telegram Last edited by tormento; 12th December 2021 at 09:25. |
12th December 2021, 23:35 | #1505 | Link | |
Registered User
Join Date: Apr 2020
Location: Poland
Posts: 143
|
Here is one way to successfully fix "l" to "I" and "I" to "l" in the OCR process.
Files to download:
it.test.txt - random text from an Italian website, so probably flawless (h__ps: //www.ilsole24ore.com/)
it.test.0.srt - the Italian text converted to an srt file (used to compare against the OCR result)
it.test.sup - the .sup file containing both the letters "l" and "I"
it.test.i.sup - the .sup file that does not contain "l" ("l" has been replaced with "I")
it.test.nocr - character database containing both "l" and "I"
ita_OCRFixReplaceList.xml - this file does all the work
The files should be placed in the correct directories.
A few words about the settings in the program:
Option / Settings / Tools: Fix common OCR errors ... - on, Auto fix names where ... - on / off, Also fix names via ... - on / off (for this test, no difference).
Import / OCR: OCR method - OCR via nOCR, No of pixels is space - 10 (11 starts to connect words), Max wrong pixels - 8, Constants italic - off, Line split min ... - Auto, Language - it.test, Dictionary - Italian, Prompt for unknown words - off, Try to guess unknown words - on, Binary image compare threshold - 200.
I will leave the result of the comparison without comment. Everyone can judge it for themselves. Whether the method I used will give an equally good result in another language - I do not know. In Polish, English, Italian (as you can see) - yes. I'm not saying that this way of solving the "l" and "I" problems is perfect. My character databases do not contain "I", so I can say that the problem with "l" does not apply to me, regardless of the language used. What about "I"? I think 99% or more is done by this one RegEx.
@tormento: Quote:
@Nikse: From the number and type of differences (4) it can be seen that Fix common OCR errors ... requires some fine-tuning ('' double accent - 3 errors). This method also works around the lack of correction for words shorter than 5 characters in Fix common OCR errors ... Thanks for this great software.
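For readers who want to see the general idea, here is a minimal Python sketch of the kind of rule a replace list can apply. The two patterns and the function name fix_capital_i are assumptions made up for illustration; they are not the contents of ita_OCRFixReplaceList.xml and not Subtitle Edit's own rules. The sketch only treats an uppercase "I" that directly follows a lowercase letter as a misread "l".

```python
import re

# Hypothetical rules, written for illustration only.
LOWER = "a-zàèéìòù"  # lowercase letters, including common Italian accented vowels

# Rule 1: an "I" sandwiched between two lowercase letters is taken to be "l".
I_INSIDE_WORD = re.compile(rf"(?<=[{LOWER}])I(?=[{LOWER}])")
# Rule 2: an "I" that ends a word and follows a lowercase letter is taken to be "l".
I_AT_WORD_END = re.compile(rf"(?<=[{LOWER}])I\b")


def fix_capital_i(text: str) -> str:
    """Replace a misread uppercase "I" with "l" when it follows a lowercase letter."""
    text = I_INSIDE_WORD.sub("l", text)
    return I_AT_WORD_END.sub("l", text)


if __name__ == "__main__":
    print(fix_capital_i("Il soIe neI cielo"))
```

A word-initial "I" (as in "Il") is left alone on purpose, because nothing in the letters alone says whether it is correct. Doubled cases such as "II" inside a word are not handled by this sketch either.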
__________________
Sorry for my mistakes - I'm using a translator. Last edited by Janusz; 13th December 2021 at 05:44. |
|
13th December 2021, 11:32 | #1507 | Link | ||
Registered User
Join Date: Apr 2020
Location: Poland
Posts: 143
|
Quote:
Quote:
__________________
Sorry for my mistakes - I'm using a translator. |
||
13th December 2021, 11:43 | #1508 | Link | |
Acid fr0g
Join Date: May 2002
Location: Italy
Posts: 2,582
|
Quote:
__________________
@turment on Telegram |
|
19th December 2021, 13:10 | #1509 | Link |
Registered User
Join Date: Feb 2004
Location: Mars
Posts: 428
|
The English issue should be fixed in this commit:
https://github.com/SubtitleEdit/subt...83dbc541e5b27f
The Italian corrections I've tried to fix in this commit:
https://github.com/SubtitleEdit/subt...56c93f0ddfe746
SE beta updated: https://github.com/SubtitleEdit/subt...leEditBeta.zip
The regular expressions for Italian can be tested here:
http://regexstorm.net/tester?p=%28%5...0a&r=%241l%243
http://regexstorm.net/tester?p=%5cb%...e+word&r=l%242
http://regexstorm.net/tester?p=%28%5...+line&r=%241Il |
24th December 2021, 17:59 | #1511 | Link |
Registered User
Join Date: Nov 2005
Posts: 112
|
That'|| keep you around and I don't think
you'll be doing any card tricks either.
That error survives OCR with Tesseract 3.02, spelling correction and fix common errors. Merry Christmas to all.
Last edited by locotus; 24th December 2021 at 18:26. |
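As a rough illustration of how an error like "That'||" could be caught, here is a small Python sketch. It assumes that two vertical bars touching a letter or an apostrophe are a misread "ll". The pattern and the function name fix_double_pipe are hypothetical and are not taken from Subtitle Edit.

```python
import re

# Assumption for illustration: "||" glued to a letter or apostrophe is a misread "ll".
# First alternative: bars preceded by a letter or apostrophe ("That'||", "we||").
# Second alternative: bars followed by a letter ("||ama").
PIPES_IN_WORD = re.compile(r"(?<=[A-Za-z'])\|\||\|\|(?=[A-Za-z])")


def fix_double_pipe(line: str) -> str:
    """Replace "||" with "ll" only where it is attached to a word."""
    return PIPES_IN_WORD.sub("ll", line)


if __name__ == "__main__":
    print(fix_double_pipe("That'|| keep you around and I don't think"))
```

Bars standing on their own between spaces are left untouched, so a line that really contains "||" as a separator is not changed.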
30th December 2021, 20:39 | #1512 | Link |
Registered User
Join Date: Jun 2006
Posts: 353
|
Please add a keyboard shortcut for [ ] Auto submit on first char (underline one of its letters as an access key) in the OCR - Manual image to text window.
UPD: sorry, I missed that it is already there.
UPD2: the underlined letter doesn't work when a non-English keyboard layout is active. That is expected, but still not handy. Could you please change the order of the elements in this window so that Shift-Tab in the Character(s) as text field moves the focus to [ ] Auto submit on first char?
__________________
Windows 8.1 x64 Magically yours Raistlin Last edited by GCRaistlin; 31st December 2021 at 00:58. |
31st December 2021, 21:41 | #1514 | Link |
Registered User
Join Date: Feb 2004
Location: Mars
Posts: 428
|
Beta updated: https://github.com/SubtitleEdit/subt...leEditBeta.zip
@tormento: The "Io ", "Ia " and "Ii " should work.
@locotus: "That'||" and words like that should also work better.
@GCRaistlin: Ctrl+I and Ctrl+A should now toggle check boxes. Also improved tab stop a little. Could not re-create the other issues.
Also, SE now uses a new "fix-words-without-spaces" word list - should improve OCR'ing e.g. italic text - see more here: https://github.com/SubtitleEdit/subt...scussions/5616
Happy New Year |
1st January 2022, 23:01 | #1515 | Link | ||
Registered User
Join Date: Apr 2020
Location: Poland
Posts: 143
|
@GCRaistlin
Quote:
OCR ran correctly up to the point where it stopped to ask you about the unrecognized character "7", so "a" is displayed in place of "АЯ" because that is how it was processed and saved as a result of your error. Deleting the entry from the character database at this point will not affect text that has already been processed. Only when you press [Start OCR] again for this image will you be asked, at the place where "a" was deleted, to reassign the character(s) for "АЯ", and the new, correct screen content will be displayed only after the entire line (the whole image) has been processed. That's how it works.
@Nikse555 Quote:
*** Happy New Year ***
__________________
Sorry for my mistakes - I'm using a translator. Last edited by Janusz; 2nd January 2022 at 11:40. |
||
2nd January 2022, 14:10 | #1516 | Link |
Registered User
Join Date: Feb 2004
Location: Mars
Posts: 428
|
@Janusz: "Ctrl + A" for "auto-submit first char" is now "Ctrl+F". Thx, nice catch
I've added a Polish word split list built from 44 subtitle files. Do give it a test (via "Fix common errors" - "Fix common OCR errors" with lines without spaces). The "fix-words-without-spaces" word list should improve OCR'ing of e.g. italic text - see more here: https://github.com/SubtitleEdit/subt...scussions/5616 |
2nd January 2022, 17:11 | #1517 | Link |
Registered User
Join Date: Apr 2020
Location: Poland
Posts: 143
|
@Nikse555
I know this discussion and have been following it for some time, so my dictionary already contains over 26,500 words. On a sup file generated from this dictionary with the words in random order, it works very well, with one "but". I had to remove all single letters, both those that are words in Polish (a, i, o, u, w, z) and those that are abbreviations (l, m, C), because in normal text containing unknown words (e.g. untranslated surnames, proper names) the single letters caused the unknown word to be split into smaller one- or two-letter words that match the Polish dictionary. This new functionality will certainly be useful where italics or fonts of different sizes produce significantly different spacing between words. I will, of course, check your dictionary.
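To make the single-letter problem concrete, here is a small Python sketch. It is not Subtitle Edit's algorithm. The splitter split_words, the tiny word sets BASE_WORDS and SINGLE_LETTERS, and the test name "iza" are all assumptions chosen for illustration. The splitter looks for the split of a run-together string into the fewest words from the list.

```python
from typing import List, Optional, Set

# Tiny hypothetical word lists, for illustration only.
BASE_WORDS: Set[str] = {"to", "jest", "dom", "na", "za"}
SINGLE_LETTERS: Set[str] = {"a", "i", "o", "u", "w", "z"}


def split_words(text: str, words: Set[str]) -> Optional[List[str]]:
    """Split text into listed words using as few pieces as possible.

    Returns None when no complete split exists, so the text is left alone.
    """
    # best[i] holds the shortest known split of text[:i], or None if there is none.
    best: List[Optional[List[str]]] = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            prefix = best[start]
            piece = text[start:end]
            if prefix is None or piece not in words:
                continue
            candidate = prefix + [piece]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[len(text)]


if __name__ == "__main__":
    print(split_words("tojestdom", BASE_WORDS))
    print(split_words("iza", BASE_WORDS))
    print(split_words("iza", BASE_WORDS | SINGLE_LETTERS))
```

Without the single letters, the unknown name "iza" cannot be split and stays intact. With them in the list, it is broken into "i" and "za", which is the effect described above.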
__________________
Sorry for my mistakes - I'm using a translator. |
3rd January 2022, 18:05 | #1518 | Link |
Registered User
Join Date: Apr 2020
Location: Poland
Posts: 143
|
I do not know what to call this. In any case, the situation is as follows.
Image to download: I was doing OCR and encountered something like this: OCR misread "Ź" - it needs to be fixed, so: Surprise. The line was read almost flawlessly, and here is the same "?". What is this about? I press [Add]. Really, I don't know what to type. @Nikse555, please save a father of a large family.
I would like to add that the Preview window shows the correct division of the image into the upper and lower text.
I also tried binary comparison. First OCR start - ok - flawless, everything in its place; second [Start OCR] - the program just asked me to add "Z".
With Tesseract 3.02 - standard - almost fine, and you can press [Start OCR] as often as you like - the result is always the same.
And here is probably a bug in the program. With Tesseract 3.02 I went back to binary comparison, and after pressing [Start OCR] the program crashed. This has happened several times. After a reboot, the program no longer hangs, but after another [Start OCR] the letters "c" and "Ź" disappear, although everything is in place in the [Inspect ...] window.
__________________
Sorry for my mistakes - I'm using a translator. Last edited by Janusz; 3rd January 2022 at 20:10. |
8th January 2022, 22:26 | #1519 | Link |
Registered User
Join Date: Feb 2004
Location: Mars
Posts: 428
|
@Janusz: Perhaps you could use a file host?
So I should remove the single letters from the Polish word-split-list? If you generate one, I'll be happy to include it too.
Last edited by Nikse555; 9th January 2022 at 13:09. |
10th January 2022, 14:37 | #1520 | Link |
Registered User
Join Date: Apr 2020
Location: Poland
Posts: 143
|
@Nikse
I have just sent pol_WordSplitList.txt to your email. The dictionary (385 kB, 39256 words) was generated with len<5 = 20 and len>=5 = 10. With the "default 11 and 5", there are not many more words, yet the file grows twice as large. I have left only the really necessary single letters in it. I have also removed a few 3-letter words that are missing from the dictionary I use (pl_PL.dic from 1 December 2021).
Because the rules I have used so far for splitting concatenated words work fine with the optimal [No of pixels is space], the effect of pol_WordSplitList.txt becomes visible only once that value is exceeded by 2 or more pixels. It works very well and can deal with clumps of 3 or even 4 words. The number of errors depends on how far [No of pixels is space] is from the optimal value.
A few remarks on the operation of the program:
__________________
Sorry for my mistakes - I'm using a translator. |
|
|