Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion. Before you start posting please read the forum rules. By posting to this forum you agree to abide by the rules. |
21st October 2021, 10:01 | #1441 | Link | |
Acid fr0g
Join Date: May 2002
Location: Italy
Posts: 2,580
|
Quote:
<PartialWordsAlways>
  <!-- Will be replaced always -->
  <WordPart from="II" to="Il" />
  <WordPart from="I'" to="l'" />
  <WordPart from="Ii" to="li" />
  <WordPart from="Ià" to="là" />
  <WordPart from="Iè" to="lè" />
  <WordPart from="Ié" to="lé" />
  <WordPart from="Iì" to="lì" />
  <WordPart from="Iò" to="lò" />
  <WordPart from="Iù" to="lù" />
</PartialWordsAlways>
doesn't work?
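For readers unfamiliar with the replace list, here is a minimal Python sketch of how such a rule block could be interpreted. It assumes each WordPart is a plain, ordered substring replacement; that is a guess based on the "Will be replaced always" comment, not Subtitle Edit's actual implementation. The function names `load_word_parts` and `apply_word_parts` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Rule block copied from the post above.
RULES_XML = """
<PartialWordsAlways>
  <WordPart from="II" to="Il" />
  <WordPart from="I'" to="l'" />
  <WordPart from="Ii" to="li" />
  <WordPart from="Ià" to="là" />
  <WordPart from="Iè" to="lè" />
  <WordPart from="Ié" to="lé" />
  <WordPart from="Iì" to="lì" />
  <WordPart from="Iò" to="lò" />
  <WordPart from="Iù" to="lù" />
</PartialWordsAlways>
"""


def load_word_parts(xml_text):
    """Return the (from, to) pairs in document order."""
    root = ET.fromstring(xml_text)
    return [(wp.get("from"), wp.get("to")) for wp in root.iter("WordPart")]


def apply_word_parts(text, rules):
    """Apply each rule as a plain substring replacement, in order.

    Assumption: this mirrors "replaced always"; the real tool may differ.
    """
    for src, dst in rules:
        text = text.replace(src, dst)
    return text


rules = load_word_parts(RULES_XML)
print(apply_word_parts("I'amore è Ià", rules))  # l'amore è là
```

Under this reading, the "II" rule would also rewrite a Roman numeral such as "Rocky II" to "Rocky Il", which is one reason unconditional partial-word rules need care.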
__________________
@turment on Telegram |
|
22nd October 2021, 11:02 | #1442 | Link |
Registered User
Join Date: Apr 2020
Location: Poland
Posts: 143
|
I assume you have "Settings/Tools/Fix common OCR errors - also use hard-coded rules" enabled.
I keep this option turned off so that the hard-coded rules hidden behind it do not alter text that my own rules have already corrected.
__________________
Sorry for my mistakes - I'm using a translator. Last edited by Janusz; 22nd October 2021 at 12:32. |
29th October 2021, 21:42 | #1444 | Link |
Registered User
Join Date: Apr 2020
Location: Poland
Posts: 143
|
@darksen
Can you post the regex that doesn't work and the text it fails on (a fragment, one sentence)? My beta 231 works fine. It searched (Ctrl+F, then F3) for all words starting with a capital letter (237) in the test text (415 lines) and did not hang.
__________________
Sorry for my mistakes - I'm using a translator. Last edited by Janusz; 29th October 2021 at 21:47. |
29th October 2021, 23:51 | #1445 | Link |
Registered User
Join Date: Apr 2019
Posts: 64
|
Sure, this is the regex I'm using:
Code:
(\{.+\})*(\s+|^)(\{.+\})*([A-ZÁ-ÚÑ][A-ZÁ-ÚÑ]+(\s*[A-ZÁ-ÚÑ0-9]*)+(\.|,)*)|([a-zá-úñ]$)|([0-9]$)|(^[0-9]+\s([A-ZÁ-ÚÑ]*|\s*)+$)|(^([A-ZÁ-ÚÑ]*|\s*)+[0-9]+([A-ZÁ-ÚÑ0-9]*|\s*)+$)
This is the SRT I'm testing with: https://app.box.com/s/qts6ml3cvvefxbxwc6l30z0b595fdoqq
I've tried another SRT and it doesn't hang with that one. Can you try with the SRT I shared? |
30th October 2021, 07:53 | #1446 | Link |
Registered User
Join Date: Feb 2004
Location: Mars
Posts: 428
|
Hi darksen,
It seems the regex engine has problems with this pattern. SE will now check for a timeout and display a message like this (instead of hanging):
Code:
The RegEx engine has timed out while trying to match a pattern to an input string. This can occur for many reasons, including very large inputs or excessive backtracking caused by nested quantifiers, back-references and other factors.
SE 3.6.3 should be out soonish |
30th October 2021, 09:37 | #1447 | Link |
Registered User
Join Date: Apr 2020
Location: Poland
Posts: 143
|
@darksen
Your regex hangs on the text: "A NUESTRO AMIGO HARRY HOUDINI DE LA GENTE DE KILLARNEY, IRELAND " (this is line 248). Debugger message: "Catastrophic backtracking has been detected and the execution of your expression has been halted." @Nikse: this also causes SE, including the latest beta 232, to crash.
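A likely culprit is the last alternative of the posted pattern, `(^([A-ZÁ-ÚÑ]*|\s*)+[0-9]+([A-ZÁ-ÚÑ0-9]*|\s*)+$)`: a group of starred alternatives that is itself repeated can split a run of capitals and spaces in exponentially many ways before giving up at the comma. The sketch below, in Python rather than the .NET engine SE uses, flattens each nested group into a single character class, which accepts the same strings without the nested quantifier. It assumes the hang originates in that alternative and that the capture groups inside it are not needed; the name `SAFE_ALT` is made up for illustration.

```python
import re

# Line 248 from the thread, the text reported to trigger the hang.
LINE_248 = "A NUESTRO AMIGO HARRY HOUDINI DE LA GENTE DE KILLARNEY, IRELAND "

# Last alternative of the posted pattern, kept as a string for reference only.
# It is deliberately NOT compiled or run here: it may backtrack catastrophically.
ORIGINAL_ALT = r"^([A-ZÁ-ÚÑ]*|\s*)+[0-9]+([A-ZÁ-ÚÑ0-9]*|\s*)+$"

# Flattened form: "(X*|\s*)+" accepts exactly the strings that "[X\s]*" accepts,
# so each nested group becomes one character class with a single quantifier.
SAFE_ALT = re.compile(r"^[A-ZÁ-ÚÑ\s]*[0-9]+[A-ZÁ-ÚÑ0-9\s]*$")

# Line 248 contains no digit, so the match fails, and fails quickly.
print(SAFE_ALT.search(LINE_248))  # None

# An all-caps line containing a number still matches.
print(bool(SAFE_ALT.search("CAPITULO 12 PARTE B")))  # True
```

The same flattening could be applied to `(\s*[A-ZÁ-ÚÑ0-9]*)+` and `([A-ZÁ-ÚÑ]*|\s*)+` in the other alternatives.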
__________________
Sorry for my mistakes - I'm using a translator. Last edited by Janusz; 30th October 2021 at 12:52. |
30th October 2021, 12:05 | #1449 | Link |
Registered User
Join Date: Apr 2020
Location: Poland
Posts: 143
|
@Nikse
In the Find window, click Count, then Cancel. Obviously, do this with the regex and text @darksen gave above.
__________________
Sorry for my mistakes - I'm using a translator. Last edited by Janusz; 30th October 2021 at 12:52. |
30th October 2021, 14:56 | #1450 | Link |
Registered User
Join Date: Feb 2004
Location: Mars
Posts: 428
|
@Janusz: thx, slightly improved in beta 234: https://github.com/SubtitleEdit/subt...leEditBeta.zip
|
31st October 2021, 17:43 | #1453 | Link |
Registered User
Join Date: Feb 2004
Location: Mars
Posts: 428
|
@tormento:
I think the hard-coded rules should probably be moved to the OCR fix replace list... at some point. I did a small test and mostly got stuff about periods (right part is with hard-coded rules):
Code:
. ..is a meat by-product. <-> ...is a meat by-product.
How did you.. .? <-> How did you...?
For now I guess you should disable the hard-coded rules, and add something for the periods.
@Janusz: Did you add some rules to handle periods? |
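As an illustration of "something for the periods", here is a minimal Python sketch that collapses three periods separated by stray whitespace into a tight ellipsis. The pattern and the function name `fix_spaced_ellipsis` are assumptions for illustration, not a rule taken from Subtitle Edit or from anyone's replace list.

```python
import re

# Three periods with optional whitespace between them, e.g. ". .." or ".. .".
SPACED_ELLIPSIS = re.compile(r"\.(?:\s*\.){2}")


def fix_spaced_ellipsis(line):
    """Collapse a whitespace-broken ellipsis into '...'."""
    return SPACED_ELLIPSIS.sub("...", line)


print(fix_spaced_ellipsis(". ..is a meat by-product."))  # ...is a meat by-product.
print(fix_spaced_ellipsis("How did you.. .?"))           # How did you...?
```

A rule this simple would also merge a sentence-ending period with an ellipsis that follows it, so a real replace-list entry would probably need to be stricter.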
1st November 2021, 03:24 | #1454 | Link |
Registered User
Join Date: Apr 2020
Location: Poland
Posts: 143
|
@Nikse
Yes. I added a few more rules that were missing once I turned off "hard-coded rules", for example removing spaces, but only between "1" and the next digit, and setting correct entries for: , . ; : ! ? <i> - . Since my character base does not contain an "I", I had to add the replacement of "l" with "I".
@tormento Here are the test files: ita_OCRFixReplaceList.xml and test_8.20.237.100e.nocr with the character base (it contains "l" and "I"). The OCR options are encoded in the file name: No of ... 8, Max wr ... 20, threshold ... 237.
From test.txt and test.srt I created test.sup, from which I got test_ocr.srt. In my opinion everything works as it should, even with the "hard-coded rules" option turned on.
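The "1 and the next digit" rule can be sketched in Python as follows. The exact expression in the replace list is not shown in the thread, so the pattern and the name `join_one_with_next_digit` are hypothetical.

```python
import re

# A "1" followed by whitespace, but only when a digit comes right after it.
# The lookahead keeps the following digit out of the match.
ONE_THEN_DIGIT = re.compile(r"1\s+(?=\d)")


def join_one_with_next_digit(line):
    """Remove the gap between a '1' and the digit that follows it."""
    return ONE_THEN_DIGIT.sub("1", line)


print(join_one_with_next_digit("It happened in 1 984."))  # It happened in 1984.
print(join_one_with_next_digit("Room 2 5"))               # Room 2 5
```

Spaces after other digits are left alone, which matches the "only between 1 and the next digit" restriction described above.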
__________________
Sorry for my mistakes - I'm using a translator. Last edited by Janusz; 1st November 2021 at 11:04. |
1st November 2021, 10:58 | #1455 | Link | |
Acid fr0g
Join Date: May 2002
Location: Italy
Posts: 2,580
|
Quote:
Plus, as I raised some time ago, it would be really helpful to have an additional "common" name list button in OCR, so that a name doesn't have to be added multiple times when you OCR multiple languages. I usually OCR the original language plus Italian, and proper names are the same in all languages, e.g. Luke is always Luke and so on.
__________________
@turment on Telegram |
|
1st November 2021, 11:39 | #1456 | Link |
Registered User
Join Date: Apr 2020
Location: Poland
Posts: 143
|
@tormento
Words or expressions added to names.xml are checked regardless of the language used. Add a word or phrase by editing the file directly, or use the "Name list manager" plug-in in SE.
__________________
Sorry for my mistakes - I'm using a translator. Last edited by Janusz; 1st November 2021 at 12:07. |
1st November 2021, 11:51 | #1457 | Link | |
Acid fr0g
Join Date: May 2002
Location: Italy
Posts: 2,580
|
Quote:
I know that file exists, but it would be really inconvenient to exit SE every time I find a name, manually edit the file, run SE again and go on like that. A button inside the OCR window would be much better.
__________________
@turment on Telegram |
|
1st November 2021, 12:45 | #1458 | Link |
Registered User
Join Date: Apr 2020
Location: Poland
Posts: 143
|
You don't need to close the program during OCR.
Just stop the OCR, add the word to the file, switch the currently used dictionary to another one or to "none", then switch back to your dictionary; the edited dictionary files will then be read again. This is definitely not a convenient solution; an extra button for adding a word to names.xml would be better. At least for now there is no other option. One thing that helps is that words added to ..._ names_user.xml are appended at the end and are not sorted, so it's easy to find them and move them to names.xml.
__________________
Sorry for my mistakes - I'm using a translator. Last edited by Janusz; 1st November 2021 at 13:07. |
2nd November 2021, 03:08 | #1459 | Link | |
Registered User
Join Date: Apr 2019
Posts: 64
|
Quote:
|
|
|
|