Subtitle Edit 4.0.4 - Page 73

tormento · 21st October 2021, 10:01

Quote:

Originally Posted by Janusz

Sample content of ita_OCRFixReplaceList_User.xml

Code:

<ReplaceList>
  <WholeWords />
  <RegularExpressions>
    <RegEx find="II" replaceWith="ll" /> #
    <RegEx find="I'" replaceWith="l'" /> #
  </RegularExpressions>
  <RemovedWholeWords />
  <RemovedRegularExpressions />
</ReplaceList>

Why

<PartialWordsAlways>

<WordPart from="II" to="Il" />
<WordPart from="I'" to="l'" />
<WordPart from="Ii" to="li" />
<WordPart from="Ià" to="là" />
<WordPart from="Iè" to="lè" />
<WordPart from="Ié" to="lé" />
<WordPart from="Iì" to="lì" />
<WordPart from="Iò" to="lò" />
<WordPart from="Iù" to="lù" />
</PartialWordsAlways>

doesn't work?

Janusz · 22nd October 2021, 11:02

I assume you have "Settings/Tools/Fix common OCR errors - also use hard-coded rules" enabled.
I have this option turned off so that the rules hidden under it do not change the text corrected according to my rules.
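
For readers unfamiliar with the replace-list format, here is a minimal sketch of what applying `WordPart` entries as plain substring replacements could look like. It is not Subtitle Edit's implementation: the helper names `load_word_parts` and `apply_word_parts` are hypothetical, and the thread does not show how SE applies `PartialWordsAlways` or how the hard-coded rules interact with it.

```python
import xml.etree.ElementTree as ET

# A subset of the rules quoted above, used only for illustration.
RULES_XML = """
<PartialWordsAlways>
  <WordPart from="I'" to="l'" />
  <WordPart from="Ii" to="li" />
  <WordPart from="Iì" to="lì" />
</PartialWordsAlways>
""".strip()


def load_word_parts(xml_text):
    """Return (from, to) pairs in document order (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [(wp.get("from"), wp.get("to")) for wp in root.iter("WordPart")]


def apply_word_parts(text, rules):
    """Apply every rule as a plain substring replacement, in order."""
    for old, new in rules:
        text = text.replace(old, new)
    return text


rules = load_word_parts(RULES_XML)
print(apply_word_parts("I'amico è Iì", rules))
```

Because the rules run in sequence, any later pass over the same text (such as another rule set) can change the result again, which is the kind of interference Janusz avoids by turning the hard-coded rules off.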

darksen · 29th October 2021, 08:50

SE hangs when I use a regex search: after a few F3 presses it freezes and I have to force-close it. I'm using the latest beta, downloaded 30 minutes ago.
This didn't happen before.

Janusz · 29th October 2021, 21:42

@darksen
Can you post the regex that doesn't work, along with the text it fails on (a fragment or a single sentence)?
My beta 231 works fine. It searched (Ctrl+F, then F3) for all words starting with a capital letter (237) in the test text (415 lines) and did not hang.

darksen · 29th October 2021, 23:51

Sure, this is the regex I'm using:

Code:

(\{.+\})*(\s+|^)(\{.+\})*([A-ZÁ-ÚÑ][A-ZÁ-ÚÑ]+(\s*[A-ZÁ-ÚÑ0-9]*)+(\.|,)*)|([a-zá-úñ]$)|([0-9]$)|(^[0-9]+\s([A-ZÁ-ÚÑ]*|\s*)+$)|(^([A-ZÁ-ÚÑ]*|\s*)+[0-9]+([A-ZÁ-ÚÑ0-9]*|\s*)+$)

I just tested it again: this time the first search (Ctrl+F) ran without problems, but pressing F3 afterwards makes it hang.
This is the SRT I'm testing with: https://app.box.com/s/qts6ml3cvvefxbxwc6l30z0b595fdoqq

I've tried another SRT and it doesn't hang with that one. Can you try with the SRT I shared?

Nikse555 · 30th October 2021, 07:53

Hi darksen,

It seems the regex engine has problems with this pattern.

SE will now check for timeout, and display a message like this (instead of hanging):

Code:

The RegEx engine has timed out while trying to match a pattern to an input string. This can occur for many reasons, including very large inputs or excessive backtracking caused by nested quantifiers, back-references and other factors.

Beta updated: https://github.com/SubtitleEdit/subt...leEditBeta.zip

SE 3.6.3 should be out soonish

Janusz · 30th October 2021, 09:37

@darksen
Your regex hangs on the text: "A NUESTRO AMIGO HARRY HOUDINI
DE LA GENTE DE KILLARNEY, IRELAND " (this is line 248).
Debugger message: "Catastrophic backtracking has been detected and the execution of your expression has been halted."

@Nikse
This also crashes SE, including the latest beta 232.
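
The "catastrophic backtracking" message is consistent with the nested quantifiers in the pattern, for example `(\s*[A-ZÁ-ÚÑ0-9]*)+`. In a construct of the form `(X*)+`, a run of n matching characters can be divided among the loop iterations in many different ways, and a backtracking engine may try a large share of them before giving up on a failed match. The sketch below only counts those divisions for a run of n characters; it does not reproduce the regex engine, and `count_splits` is a hypothetical helper.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n):
    """Count the ways to cut a run of n characters into consecutive non-empty chunks."""
    if n == 0:
        return 1  # nothing left to split: exactly one way
    # Choose the length of the first chunk, then split the remainder.
    return sum(count_splits(n - first) for first in range(1, n + 1))


for n in (5, 10, 20):
    print(n, count_splits(n))
```

The count doubles with every extra character (2^(n-1) for n ≥ 1), which would explain why a long all-caps subtitle line triggers the problem while shorter lines in other files do not.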

Nikse555 · 30th October 2021, 10:39

@Janusz: Crash, how?
After using find I get a msgbox with the timeout error... where did you make SE crash?

Janusz · 30th October 2021, 12:05

@Nikse
In the Find window, click "Count", then "Cancel".
This is with the regex and text @darksen gave above, of course.

Nikse555 · 30th October 2021, 14:56

@Janusz: thx, slightly improved in beta 234: https://github.com/SubtitleEdit/subt...leEditBeta.zip

Janusz · 30th October 2021, 15:42

@Nikse
Thanks for the fix. It works.
@darksen has to find a bug in his script that causes it to hang.

tormento · 30th October 2021, 19:09

Quote:

Originally Posted by Janusz

I assume you have "Settings/Tools/Fix common OCR errors - also use hard-coded rules" enabled.
I have this option turned off so that the rules hidden under it do not change the text corrected according to my rules.

@Nikse555 could you expose the hidden OCR rules?

Nikse555 · 31st October 2021, 17:43

@tormento:
I think the hard coded rules should probably be moved to the OCR fix replace list... at some point.

I did a small test and mostly got stuff about periods (right part is with hard coded rules):

Code:

. ..is a meat by-product.   <->   ...is a meat by-product.
How did you.. .?            <->   How did you...?

The code is here: https://github.com/SubtitleEdit/subt...Engine.cs#L999

For now I guess you should disable the hard-coded rules, and add something for the periods.

@Janusz: Did you add some rules to handle periods?
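
As an illustration of "something for the periods", the sketch below collapses three periods separated by optional single spaces into a plain ellipsis. It is an assumption about what such a rule could look like, not a copy of SE's hard-coded logic; the pattern and the name `fix_spaced_ellipsis` are hypothetical.

```python
import re

# Three periods, each optionally separated from the next by one space.
SPACED_ELLIPSIS = re.compile(r"\. ?\. ?\.")


def fix_spaced_ellipsis(text):
    """Rewrite '. ..', '.. .' and '. . .' as '...'."""
    return SPACED_ELLIPSIS.sub("...", text)


for sample in (". ..is a meat by-product.", "How did you.. .?"):
    print(fix_spaced_ellipsis(sample))
```

The same expression could in principle be placed in a `<RegEx find="..." replaceWith="..." />` entry of the user replace list, though that has not been checked against SE here.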

Janusz · 1st November 2021, 03:24

Quote:

Originally Posted by Nikse555

@Janusz: Did you add some rules to handle periods?

@Nikse
Yes. I added a few more rules that were missing after I turned off "hard-coded rules": for example, removing spaces, but only between "1" and the next digit, and setting correct entries for: , . ; : ! ? <i> - .
Since my character base does not contain an "I", I had to add the replacement of "l" with "I".

@tormento
Here are the test files:
ita_OCRFixReplaceList.xml and test_8.20.237.100e.nocr with the character base (contains "l" and "I"). The OCR options are given in the file name: No of ... 8, Max wr ... 20, threshold ... 237
From test.txt and test.srt I created test.sup, from which I got test_ocr.srt. In my opinion everything works as it should, even with the "hard coded rules" option turned on.

tormento · 1st November 2021, 10:58

Quote:

Originally Posted by Nikse555

I think the hard coded rules should probably be moved to the OCR fix replace list... at some point.

YES, please.

Plus, as I mentioned some time ago, it would be really helpful to have an additional "common" name list button in OCR, so that a name doesn't have to be added multiple times when you OCR multiple languages. I usually OCR the original language plus Italian, and the proper names are the same in all languages, e.g. Luke is always Luke, and so on.

Janusz · 1st November 2021, 11:39

@tormento
Words or expressions added to names.xml are checked regardless of the language used.
Add a word or phrase directly to the file by editing, or use the "Name list manager" plug-in in SE.

tormento · 1st November 2021, 11:51

Quote:

Originally Posted by Janusz

@tormento
Words or expressions added to names.xml are checked regardless of the language used.
Use [Word lists], switch to English, add a new word or phrase. From now on you will have the word added in your Italian and I will have the Polish dictionary.

I know that file exists, but it would be really inconvenient to exit SE every time I find a name, manually edit the file, start SE again, and carry on like that. A button inside the OCR would be much better.

Janusz · 1st November 2021, 12:45

You don't need to turn off the program during ocr etc.
Just stop the OCR, add the word to the file, switch the currently used dictionary to another one (or to "none"), and switch back to your dictionary; the necessary, already corrected dictionary files will then be read again. This is definitely not a convenient solution: an extra button for adding a word to names.xml would be better. For now there is no other option. One thing that helps is that words added to ..._names_user.xml sit at the end of the file and are not sorted, so they are easy to find and transfer to names.xml.

darksen · 2nd November 2021, 03:08

Quote:

Originally Posted by Nikse555

@Janusz: thx, slightly improved in beta 234: https://github.com/SubtitleEdit/subt...leEditBeta.zip

Quote:

Originally Posted by Janusz

@Nikse
Thanks for the fix. It works.
@darksen has to find a bug in his script that causes it to hang.

Thanks, both. I found where the problem with the regex was: it was searching nonstop.
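
darksen does not say what the fix was. One common way to remove this kind of ambiguity, shown here purely as an illustration and not as darksen's actual change, is to make every loop iteration consume a mandatory separator and at least one character, so that a run of upper-case words can be divided in only one way. The pattern below covers only the upper-case-run branch and is not equivalent to the original expression; `UPPER_RUN` is a hypothetical name.

```python
import re

# Two or more upper-case letters, then zero or more groups of
# "mandatory whitespace + one or more upper-case letters or digits",
# then optional trailing periods or commas.
UPPER_RUN = re.compile(r"[A-ZÁ-ÚÑ]{2,}(?:\s+[A-ZÁ-ÚÑ0-9]+)*[.,]*")

text = "A NUESTRO AMIGO HARRY HOUDINI\nDE LA GENTE DE KILLARNEY, IRELAND "
matches = UPPER_RUN.findall(text)
for m in matches:
    print(repr(m))
```

With the separator made mandatory (`\s+` instead of `\s*`) and the inner class required to match at least once (`+` instead of `*`), the engine has no alternative divisions of the same text to try when a match fails.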

tormento · 3rd November 2021, 15:14

Quote:

Originally Posted by Janusz

You don't need to turn off the program during ocr etc.

Ease of use is always preferred.
