Old 21st October 2021, 10:01   #1441  |  Link
tormento
Acid fr0g
 
 
Join Date: May 2002
Location: Italy
Posts: 2,580
Quote:
Originally Posted by Janusz View Post
Sample content of ita_OCRFixReplaceList_User.xml
Code:
<ReplaceList>
  <WholeWords />
  <RegularExpressions>
    <RegEx find="II" replaceWith="ll" /> #
    <RegEx find="I'" replaceWith="l'" /> #
  </RegularExpressions>
  <RemovedWholeWords />
  <RemovedRegularExpressions />
</ReplaceList>
Why doesn't this work?

Code:
<PartialWordsAlways>
  <!-- Will be replaced always -->
  <WordPart from="II" to="Il" />
  <WordPart from="I'" to="l'" />
  <WordPart from="Ii" to="li" />
  <WordPart from="Ià" to="là" />
  <WordPart from="Iè" to="lè" />
  <WordPart from="Ié" to="lé" />
  <WordPart from="Iì" to="lì" />
  <WordPart from="Iò" to="lò" />
  <WordPart from="Iù" to="lù" />
</PartialWordsAlways>
__________________
@turment on Telegram
Old 22nd October 2021, 11:02   #1442  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
I assume you have "Settings/Tools/Fix common OCR errors - also use hard-coded rules" enabled.
I have this option turned off so that the rules hidden under it do not change the text corrected according to my rules.
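For anyone who wants to see what a user replace list like the one quoted above does when nothing else touches the text, here is a minimal Python sketch. It is only an illustration and not how SE itself applies the list: the function names `load_regex_rules` and `apply_rules` are made up for this example, and it only reads the `RegEx` entries with their `find` and `replaceWith` attributes.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the list quoted above, used here as test input.
SAMPLE_XML = """<ReplaceList>
  <WholeWords />
  <RegularExpressions>
    <RegEx find="II" replaceWith="ll" />
    <RegEx find="I'" replaceWith="l'" />
  </RegularExpressions>
  <RemovedWholeWords />
  <RemovedRegularExpressions />
</ReplaceList>"""


def load_regex_rules(xml_text):
    """Return (compiled pattern, replacement) pairs in document order."""
    root = ET.fromstring(xml_text)
    return [
        (re.compile(node.get("find")), node.get("replaceWith"))
        for node in root.iter("RegEx")
    ]


def apply_rules(text, rules):
    """Apply every rule in order; replacements are inserted literally."""
    for pattern, replacement in rules:
        # A function replacement sidesteps Python's own escape handling,
        # which differs from the .NET replacement syntax SE uses.
        text = pattern.sub(lambda _m, r=replacement: r, text)
    return text
```

Calling `apply_rules("beIIa", load_regex_rules(SAMPLE_XML))` runs both rules over the string in the order they appear in the file, which is a quick way to check a rule before putting it into the real list.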
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 22nd October 2021 at 12:32.
Old 29th October 2021, 08:50   #1443  |  Link
darksen
Registered User
 
Join Date: Apr 2019
Posts: 64
SE hangs during a regex search: after pressing F3 a few times it freezes and I have to force-close it. I'm using the latest beta, downloaded 30 minutes ago.
This didn't happen before.
Old 29th October 2021, 21:42   #1444  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
@darksen
Can you post the regex that doesn't work and the text it fails on (a fragment, even one sentence)?
My beta 231 works fine. It searched (Ctrl+F, then F3) for all words starting with a capital letter (237) in the test text (415 lines) and did not hang.
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 29th October 2021 at 21:47.
Old 29th October 2021, 23:51   #1445  |  Link
darksen
Registered User
 
Join Date: Apr 2019
Posts: 64
Sure, this is the regex I'm using:

Code:
(\{.+\})*(\s+|^)(\{.+\})*([A-ZÁ-ÚÑ][A-ZÁ-ÚÑ]+(\s*[A-ZÁ-ÚÑ0-9]*)+(\.|,)*)|([a-zá-úñ]$)|([0-9]$)|(^[0-9]+\s([A-ZÁ-ÚÑ]*|\s*)+$)|(^([A-ZÁ-ÚÑ]*|\s*)+[0-9]+([A-ZÁ-ÚÑ0-9]*|\s*)+$)
I just tested it again, and this time only the first search (using Ctrl+F) ran without problems; pressing F3 afterwards made it hang.
This is the srt with which I'm trying: https://app.box.com/s/qts6ml3cvvefxbxwc6l30z0b595fdoqq

I've tried with another SRT and it doesn't hang with it, can you try with the SRT I shared?
Old 30th October 2021, 07:53   #1446  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
Hi darksen,

It seems the regex engine has problems with this pattern.

SE will now check for timeout, and display a message like this (instead of hanging):

Code:
The RegEx engine has timed out while trying to match a pattern to an input string. This can occur for many reasons, including very large inputs or excessive backtracking caused by nested quantifiers, back-references and other factors.
Beta updated: https://github.com/SubtitleEdit/subt...leEditBeta.zip

SE 3.6.3 should be out soonish
Old 30th October 2021, 09:37   #1447  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
@darksen
Your regex hangs on this text (line 248):
"A NUESTRO AMIGO HARRY HOUDINI
DE LA GENTE DE KILLARNEY, IRELAND"
Debugger message: "Catastrophic backtracking has been detected and the execution of your expression has been halted."
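A possible contributor, offered as an assumption rather than a diagnosis of the whole pattern, is a quantified group whose alternatives can match nothing or can split the same run of letters in many ways, such as `([A-ZÁ-ÚÑ]*|\s*)+`. The Python sketch below uses a simplified version of one branch of the posted regex (the names `NESTED`, `FLAT` and `same_verdict` are invented for the example) and compares it with a single character class that accepts the same strings but leaves the engine only one way to consume each character.

```python
import re

UPPER = "A-ZÁ-ÚÑ"  # same character ranges as in the posted regex

# Simplified from one branch of the posted pattern: a repeated group whose
# alternatives can match empty text or cut one run of letters into pieces.
NESTED = re.compile(rf"^([{UPPER}]*|\s*)+[0-9]+$")

# Accepts the same strings, but each character is consumed in only one way.
FLAT = re.compile(rf"^[{UPPER}\s]*[0-9]+$")


def same_verdict(text: str) -> bool:
    """True when both patterns agree on whether `text` matches."""
    return bool(NESTED.match(text)) == bool(FLAT.match(text))


if __name__ == "__main__":
    # Short inputs only: the nested form slows down sharply on long
    # all-caps lines that end up not matching.
    for sample in ["DE LA 12", "12", "AB!", "Ab 1"]:
        print(repr(sample), bool(FLAT.match(sample)), same_verdict(sample))
```

The difference only shows on longer input that fails to match, where the nested form has to try every way of splitting the capital-letter runs before giving up. Flattening each such group in the real regex would need checking branch by branch.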

@Nikse
This also causes SE to crash, even the latest beta 232.
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 30th October 2021 at 12:52.
Old 30th October 2021, 10:39   #1448  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@Janusz: Crash, how?
After using find I get a msgbox with the timeout error... where did you make SE crash?
Old 30th October 2021, 12:05   #1449  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
@Nikse
In the Find window, start counting, then press Cancel.
This is with the regex and text @darksen gave above, of course.
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 30th October 2021 at 12:52.
Old 30th October 2021, 14:56   #1450  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@Janusz: thx, slightly improved in beta 234: https://github.com/SubtitleEdit/subt...leEditBeta.zip
Old 30th October 2021, 15:42   #1451  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
@Nikse
Thanks for the fix. It works.
@darksen still has to find the bug in his regex that causes the hang.
__________________
Sorry for my mistakes - I'm using a translator.
Old 30th October 2021, 19:09   #1452  |  Link
tormento
Acid fr0g
 
 
Join Date: May 2002
Location: Italy
Posts: 2,580
Quote:
Originally Posted by Janusz View Post
I assume you have "Settings/Tools/Fix common OCR errors - also use hard-coded rules" enabled.
I have this option turned off so that the rules hidden under it do not change the text corrected according to my rules.
@Nikse555 could you expose the hidden OCR rules?
__________________
@turment on Telegram
Old 31st October 2021, 17:43   #1453  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@tormento:
I think the hard-coded rules should probably be moved to the OCR fix replace list... at some point.

I did a small test and mostly got changes involving periods (the right-hand side is with hard-coded rules enabled):
Code:
. ..is a meat by-product.   <->   ...is a meat by-product.
How did you.. .?            <->   How did you...?
The code is here: https://github.com/SubtitleEdit/subt...Engine.cs#L999

For now I guess you should disable the hard-coded rules, and add something for the periods.
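As a rough idea of what "something for the periods" could look like, here is a small Python sketch that repairs the two broken ellipsis shapes from the test output above. It is an assumption about one workable rule, not a port of the hard-coded SE logic, and `fix_ellipsis` is a name made up for the example.

```python
import re

# Matches an ellipsis split by a stray space: ". .." or ".. ."
BROKEN_ELLIPSIS = re.compile(r"\. \.\.|\.\. \.")


def fix_ellipsis(line: str) -> str:
    """Rejoin a three-dot ellipsis that OCR split with a single space."""
    return BROKEN_ELLIPSIS.sub("...", line)


if __name__ == "__main__":
    for sample in [". ..is a meat by-product.", "How did you.. .?"]:
        print(fix_ellipsis(sample))
```

The same alternation should be usable as the `find` value of a `RegEx` entry in the user replace list with `...` as the replacement, though other broken shapes such as ". . ." would need their own rule.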


@Janusz: Did you add some rules to handle periods?
Old 1st November 2021, 03:24   #1454  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
Quote:
Originally Posted by Nikse555 View Post
@Janusz: Did you add some rules to handle periods?
@Nikse
Yes. I added a few more rules that were missing once I turned off "hard-coded rules", for example removing spaces, but only between "1" and the following digit, and setting correct entries for: , . ; : ! ? <i> - .
Since my character base does not contain an "I", I also had to add a replacement of "l" with "I".
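The "1 followed by a digit" rule can be written with lookarounds so that only the space itself is removed. This is a guess at the shape of such a rule, shown in Python for testing; `join_split_number` is a hypothetical helper name and the actual entry in the replace list may look different.

```python
import re

# A single space that has "1" directly before it and any digit after it.
SPACE_AFTER_ONE = re.compile(r"(?<=1) (?=\d)")


def join_split_number(line: str) -> str:
    """Remove the space OCR tends to insert after a narrow "1"."""
    return SPACE_AFTER_ONE.sub("", line)


if __name__ == "__main__":
    print(join_split_number("Meet me at 1 5:30"))
```

Spaces after any other digit are left alone, which matches the "only between 1 and the next digit" restriction described above.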

@tormento
Here you have the test files:
ita_OCRFixReplaceList.xml, test_8.20.237.100e.nocr with character base (contains "l" and "I") - options for ocr set by name: No of ... 8, Max wr ... 20, threshold ... 237
From test.txt, test.srt I created test.sup, from which I got test_ocr.srt. In my opinion everything works as it should, even with the "hard coded rules" option turned on.
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 1st November 2021 at 11:04.
Old 1st November 2021, 10:58   #1455  |  Link
tormento
Acid fr0g
 
 
Join Date: May 2002
Location: Italy
Posts: 2,580
Quote:
Originally Posted by Nikse555 View Post
I think the hard coded rules should probably be moved to the OCR fix replace list... at some point.
YES, please.

Also, as I mentioned some time ago, it would be really helpful to have an additional "common" name list button in OCR, so that a name doesn't have to be added several times when you OCR several languages. I usually OCR the original language plus Italian, and proper names are the same in every language, e.g. Luke is always Luke.
__________________
@turment on Telegram
Old 1st November 2021, 11:39   #1456  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
@tormento
Words or expressions added to names.xml are checked regardless of the language used.
Add a word or phrase directly to the file by editing, or use the "Name list manager" plug-in in SE.
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 1st November 2021 at 12:07.
Old 1st November 2021, 11:51   #1457  |  Link
tormento
Acid fr0g
 
 
Join Date: May 2002
Location: Italy
Posts: 2,580
Quote:
Originally Posted by Janusz View Post
@tormento
Words or expressions added to names.xml are checked regardless of the language used.
Use [Word lists], switch to English, add a new word or phrase. From now on you will have the word added in your Italian and I will have the Polish dictionary.

I know that file exists, but it would be really inconvenient to exit SE every time I find a name, edit the file by hand, start SE again and carry on like that. A button inside the OCR window would be much better.
__________________
@turment on Telegram
Old 1st November 2021, 12:45   #1458  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
You don't need to close the program during OCR.
Just stop the OCR, add the word to the file, switch the current dictionary to another one or to "none", then switch back to your dictionary; the corrected dictionary files will then be read again. This is definitely not a comfortable solution, and an extra button for adding a word to names.xml would be better, but today there is no other option. It does help that words added to ..._names_user.xml are placed at the end and are not sorted, so they are easy to find and move to names.xml.
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 1st November 2021 at 13:07.
Old 2nd November 2021, 03:08   #1459  |  Link
darksen
Registered User
 
Join Date: Apr 2019
Posts: 64
Quote:
Originally Posted by Nikse555 View Post
@Janusz: thx, slightly improved in beta 234: https://github.com/SubtitleEdit/subt...leEditBeta.zip
Quote:
Originally Posted by Janusz View Post
@Nikse
Thanks for the fix. It works.
@darksen still has to find the bug in his regex that causes the hang.
Thanks, both. I found where the problem in the regex was: it was searching nonstop.
Old 3rd November 2021, 15:14   #1460  |  Link
tormento
Acid fr0g
 
 
Join Date: May 2002
Location: Italy
Posts: 2,580
Quote:
Originally Posted by Janusz View Post
You don't need to turn off the program during ocr etc.
Ease of use is always preferred.
__________________
@turment on Telegram