Subtitle Edit [Archive] - Page 21

tormento

20th May 2020, 14:35

@tormento: Ah, did you set the proper "italic factor"? Right click in the list view, and choose "Set un-italic" factor (I think it's called).
What value did you put in the un-italic? Tried many and no luck.

Janusz

20th May 2020, 14:35

@Janusz: Is the crash fixed in this beta?
https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

Yes too.
But now I don't have the "error_log.txt" file, the warning window is the same as before.

tormento

20th May 2020, 14:36

Don't be a child.
Oh.

My.

God.

41 posts and you sermonize.

:rolleyes:

Nikse555

20th May 2020, 17:46

@Janusz: OK, think I got the crash now: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

@tormento: Right click on image for line 3 in Apollo13 and choose "Set align angle" (previously "Set un-italic factor"). Looks like "0,21" is a good value. Does that work for you? (#pixels is space=15)

Janusz

20th May 2020, 18:02

@Janusz: OK, think I got the crash now: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

Yes, it works, thank you.
I have a few more comments for this wonderful program that do not depend on the OCR method chosen, but first I need to prepare the appropriate files.

tormento

20th May 2020, 18:46

Right click on image for line 3 in Apollo13 and choose "Set align angle" (previously "Set un-italic factor"). Looks like "0,21" is a good value. Does that work for you? (#pixels is space=15)
Unfortunately, even with fewer pixels set for a space, SE recognizes "of" as outside the italic part and attaches it to "Apollo".

https://i.lensdump.com/i/j3Ik5z.md.png (https://lensdump.com/i/j3Ik5z)

Nikse555

20th May 2020, 20:18

@tormento: thx :) This should now work (did not work because it was half italic / half regular): https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip
But it was also working before for me... and for you too, if you had been using the dictionaries included with SE, like "eng_OCRFixReplaceList.xml". Why would you not use them?

Janusz

21st May 2020, 14:17

@tormento: thx :) This should now work (did not work because it was half italic / half regular)

Beta 129.
For this example I created a sup file from the text "the tragedy ofAlabama (https://drive.google.com/uc?export=view&id=14y2YPojPvhiPJOy94E8r3yhjiLohuf1R)", where I marked "the tragedy of" as italic.

I installed only the following dictionaries: French, German, Italian and English, without additional OCRFixReplaceList.xml files.
For French, German, Italian and English the patch works OK. The text after OCR looks like this: "the tragedy of Alabama".
For Polish and "none" it looks like this: "the tragedy ofAlabama".
I did not check others, but I think the fix should work in all languages, because the word "ofAlabama" is not correct in any language, and a split like this can always occur at an italic/regular or regular/italic boundary, regardless of the language chosen or how many words exist in the selected dictionary. Example from Polish: "fotografAdam" (photographer Adam).
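
A minimal sketch of the idea above, in Python. It is not Subtitle Edit's code: the function names and the tiny word sets are made up for illustration. A fused word is split at a lowercase-to-uppercase boundary only when the left part is a known word, which is one possible reason the result depends on the selected dictionary.

```python
def split_fused_word(word, known_words):
    """Split e.g. 'ofAlabama' into 'of Alabama' at a lowercase->uppercase boundary.

    known_words is a hypothetical set of lowercase dictionary words.
    """
    if word.lower() in known_words:
        return word  # already a valid word, leave it alone
    for i in range(1, len(word)):
        if word[i - 1].islower() and word[i].isupper():
            left, right = word[:i], word[i:]
            # Only split when the left part is a word the dictionary knows.
            if left.lower() in known_words:
                return left + " " + right
    return word


def fix_line(line, known_words):
    """Apply split_fused_word to every space-separated word of a line."""
    return " ".join(split_fused_word(w, known_words) for w in line.split(" "))


english = {"the", "tragedy", "of"}
print(fix_line("the tragedy ofAlabama", english))  # split happens
print(fix_line("the tragedy ofAlabama", set()))    # no dictionary: unchanged
print(fix_line("fotografAdam", {"fotograf"}))
```

With an empty word set nothing is split, which mirrors the "none" case described above; a name such as "McDonald" also stays intact because "Mc" is not a known word.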

Nikse555

21st May 2020, 16:50

Beta 129.
For this example I created a sup file from the text "the tragedy ofAlabama (https://drive.google.com/uc?export=view&id=14y2YPojPvhiPJOy94E8r3yhjiLohuf1R)" where "the tragedy of" I marked italic.

I only installed the following dictionaries: French, German, Italian, English without additional OCRFixReplaceList.xml files.
For: French, German, Italian, English - the patch works ok. The text after OCR looks like this: "the tragedy of Alabama",
for: Polish and "none" like this: "the tragedy ofAlabama".
I did not check others, but I think the amendment should work in all languages because the word "ofAlabama" is not correct in any language, and any division in this case may occur between italics/regular or regular/italics always regardless of the language chosen how many new words exist in the selected dictionary. Example from Poland: "fotografAdam" (photographer Adam).

Yes, the OCR process benefits from a good OCR fix replace list.
I've added a Polish one based on your input here: https://github.com/SubtitleEdit/subtitleedit/blob/dfad7c2e5e90b43e8215af711d1edbff3f27e4fe/Dictionaries/pol_OCRFixReplaceList.xml
Feel free to add to it :)

Janusz

21st May 2020, 21:57

Yes, the OCR process benefits from a good OCR fix replace list.

<WordPart from="f" to="f " /> <!-- "f" will be two words -->

I had this line in <OCRFixReplaceList> so there had to be something else here.
In the first version of the file, the line contained only one phrase "photographerAdam".
"Adam" is only 5 letters, I thought maybe this is it?

I have created a new file (https://drive.google.com/uc?export=view&id=1Zr30Eu2Ae0EGnf1Mw7cbKCGEs56suVHy). I added a few lines and longer words starting with "A".

https://drive.google.com/uc?export=view&id=1rHvddcUtWsdBTiOmhvobpmRBKDnZkKN7

OCR worked, but as you can see above, not completely.
The split happened, but not everywhere it should. Only line 2 is correct.
I then disabled the split after "f" in "pol_OCRFixReplaceList.xml". The result is shown at the bottom.
There the split is correct and where it should be, including on line 1.

Conclusion: because such word combinations are rare, we have to decide for ourselves
whether and when to enable this rule in our "OCRFixReplaceList.xml",
because it can do more harm than good.

If you have some spare time, you could look into the source: repeatedly switching to any other installed dictionary
and running OCR with each new dictionary makes what now looks so nice at the bottom look like the top again.
Only restarting OCR restores order.
I know that nobody mixes dictionaries in normal use, but the problem exists.

tormento

23rd May 2020, 10:16

But it was also working before for me... and for you too, if you had been using the dictionaries included with SE, like "eng_OCRFixReplaceList.xml". Why would you not use them?
It does work with OCR fix, not without.

I tend not to use it because I have trained the OCR so well that I can postprocess I-l after OCR and get the job done faster.

Perhaps you could implement an OCR fix that handles OCR errors only and is not word-dictionary aware.

Janusz

23rd May 2020, 12:56

@Nikse555
I'd like to report a bug.
It occurs since beta 119; beta 112 works fine.

The previously reported bug in beta 123 and later concerned a missing dictionary.
Because I rarely use "Prompt for unknown words", the option was not enabled and so was not tested.

In my previous thread I used "Binary image compare" so I didn't notice this error.
Today I returned to nOCR. My Settings: for the function to work, the dictionary must be selected so it is selected.
"Draw missing texts" - disabled so that the program does not call for every new unknown letter.
(:) Even for this function of the program it is worth using nOCR :) ).
"Prompt for unknown words" - enabled.
"Fix OCR errors" - disabled - OCR does not use user files.
"Try to guess unknown words" - does not matter with "Fix ..." = disabled. It doesn't work though it's turned on.

"Start OCR" processes the text until it encounters the first unknown word.
With a well-constructed character base, that will be a word not in the dictionary; otherwise, a word with an unrecognized character.
The process opens the "Spell check" window. Choosing "Skip one", "Skip all" or "Abort" brings up an error window:

https://drive.google.com/uc?export=view&id=1Rwh-vy_6HSbKEQz2En7mSsPA-V99yeko

Depending on what we choose, we either return to Windows (the program crashes) or return to the program.
I checked on various sup files, including those available from this forum.

@Tormento
I don't usually use it because I trained OCR so well that I can postprocess Il after OCR and have a faster job.
Perhaps you could only implement the OCR patch with OCR errors and not recognize the word dictionary.

I do not use "Fix common OCR errors - also use hard-coded rules" because this option does something more than just what results from its name - especially when it comes to Il and iL. For this reason I do not use "I" in the character database. I have definitely fewer mistakes to improve, at least I know which ones.
My assumption is that changes made for English should not affect other languages available in the program.

Nikse555

23rd May 2020, 20:33

@Janusz: thx for the crash info :)
Should hopefully be fixed here: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

Also, Ctrl+T in the OCR window will start some auto-training... probably not too useful, but it's a little fun to play with.

Janusz

23rd May 2020, 21:55

Patch works, thank you. :thanks:

Also, Ctrl+T in the OCR window will start some auto-training... probably not too useful, but it's a little fun to play with.

This function was and is also available under the right mouse button.
However, using it did not bring up any additional windows as it does now.
Something was happening in the background, the effects of this work could not be seen.
I noticed this window yesterday, but I didn't have time to check exactly what it was.
I am curious myself how this file (https://drive.google.com/uc?export=view&id=1zSiOA9OOplCFnzmIdVMd69piN3bm7w1n) will look. ;)

Nikse555

24th May 2020, 08:58

Forgot... for training you need a .srt file with spaces around characters, like:
1
00:00:00,490 --> 00:00:02,350
a b c d e f g h i j k l m n o p q r s t u

2
00:00:02,530 --> 00:00:04,150
v w x y z

3
00:00:04,240 --> 00:00:06,240
0 1 2 3 4 5 6 7 8 9 , . ( ) [ ] ' " $ % ♫ ♪ &

4
00:00:06,510 --> 00:00:08,200
A B C D E F G H I J K L M N O P Q R S T U

5
00:00:08,320 --> 00:00:10,570
V W X Y Z

6
00:00:11,510 --> 00:00:13,510
: ; - ! ?

7
00:00:13,540 --> 00:00:15,540
é É Č Ę Ė Į Š Ū Ž č ę ė į š ų ž

8
00:00:15,560 --> 00:00:17,560
ß ü Ü æ ø å ä ö Æ Ø Å Ä Ö

9
00:00:17,584 --> 00:00:19,584
ff ft fi fj fl rf rt rv rw ry rt ryt tt TV tw yt yw
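
A training file in this layout can also be generated instead of typed by hand. The following Python sketch is only an illustration: the function names, the 2-second cue length, the 100 ms gap and the Polish letter group are assumptions, not anything Subtitle Edit requires.

```python
def format_time(ms):
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02}:{minutes:02}:{seconds:02},{millis:03}"


def build_training_srt(groups, duration_ms=2000, gap_ms=100):
    """Build SRT text with one cue per group, items separated by single spaces."""
    blocks = []
    start = 0
    for number, items in enumerate(groups, start=1):
        end = start + duration_ms
        text = " ".join(items)
        blocks.append(f"{number}\n{format_time(start)} --> {format_time(end)}\n{text}\n")
        start = end + gap_ms
    return "\n".join(blocks)


groups = [
    list("abcdefghijklmnopqrstuvwxyz"),
    list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"),
    list("0123456789"),
    list("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ"),       # example extra letters, here Polish
    ["ff", "ft", "fi", "fj", "fl"],  # letter pairs that tend to stick together
]

with open("train.srt", "w", encoding="utf-8") as handle:
    handle.write(build_training_srt(groups))
```

Keeping each group on its own cue follows the sample above; extra double or triple letter combinations can be added as further list entries.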

Also, the unattended OCR alarm (taskbar blink/beep) is now customizable via these settings in Settings.xml (in latest beta: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip):
<UnfocusedAttentionBlinkCount>50</UnfocusedAttentionBlinkCount>
<UnfocusedAttentionPlaySoundCount>2</UnfocusedAttentionPlaySoundCount>
<UnfocusedAttentionPlaySoundEvery>2</UnfocusedAttentionPlaySoundEvery>
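
For those who prefer to script such a change, here is a hedged Python sketch that edits values by element name. Only the three element names come from the lines above; the wrapper elements in the sample and the function name are placeholders, and where these elements really sit inside Settings.xml is not stated here. Subtitle Edit should be closed before its settings file is edited.

```python
import xml.etree.ElementTree as ET


def set_settings(xml_text, values):
    """Return xml_text with the text of the named elements replaced.

    values maps an element name to its new value. Raises KeyError when an
    element is not found, so a typo does not pass silently.
    """
    root = ET.fromstring(xml_text)
    for tag, value in values.items():
        element = root.find(f".//{tag}")
        if element is None:
            raise KeyError(tag)
        element.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Placeholder document: <Settings> and <Section> are made up for the example.
sample = (
    "<Settings><Section>"
    "<UnfocusedAttentionBlinkCount>50</UnfocusedAttentionBlinkCount>"
    "<UnfocusedAttentionPlaySoundCount>2</UnfocusedAttentionPlaySoundCount>"
    "<UnfocusedAttentionPlaySoundEvery>2</UnfocusedAttentionPlaySoundEvery>"
    "</Section></Settings>"
)

print(set_settings(sample, {"UnfocusedAttentionBlinkCount": 10}))
```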

Janusz

25th May 2020, 11:57

And the game is over.
The text with 6052 lines (31736 words, 189721 characters) was read without the need to add even a single character. I used nOCR. I'm really shocked how well it worked for the "Arial Black" font.
The one thing I changed beforehand is that I added a few triple and a dozen double characters to your train.srt file.
A great tool.

tormento

25th May 2020, 12:54

I used nOCR.
What is nOCR?

Janusz

25th May 2020, 14:01

What is nOCR?

Close Subtitle Edit, then in Settings.xml find "<ShowBetaStuff>" and replace "False" with "True".
Launch the program. In [OCR Method] you will have a new method: "OCR via nOCR".
As the parameter name suggests, not everything may work as it should.
And that is the case now. I didn't take notes of what I was doing, and I can't reproduce what I wrote above. Fortunately, I saved the character base, so the result can be repeated with it, but I can't generate the same database a second time.

---
For sure @Nikse555 will read it so I will add that:
the original character base, entered by hand to read the entire file error-free, contains 367 elements; the new one, created by nOCR training, has 481 characters, so it may contain characters that were already recognized. I don't have the tools to check this.
---
It turned out that my admiration was premature. My mistake: I left my character base in the working directory, so the newly generated characters were added to my base, hence the sensational result. A pity. It seems that this project is no longer being developed. In its current state it can only serve as a curiosity.
--------------------------------------------------------------------------------------------

@Nikse555

There was a problem with beta 161.

- nOCR has stopped recognizing: . , - (three characters) and calls for each character encountered as a new one - unknown.

- 'o' is recognized as '0' or 'c', but this does not happen for every 'o' in the text.

Example:
beta 145: Nie. Na United Fusion Corporation. To co innego. (I worked on this version until then).
beta 161: Nie* Na United Fusion Corporation* T0 co inneg0*

tormento

27th May 2020, 17:06

@Nikse555

Out of curiosity, would you please do an x64 compile? I am curious to see if it gets faster on binary OCR.

GCRaistlin

27th May 2020, 17:50

>The latest beta still allows to add an empty better multi match.
I think that "empty string" could be a valid text... perhaps a warning?
Sure, a warning would be nice.

>Could you please allow selecting a character by a right click in 'Inspect items' area of 'Inspect compare matches for current image' window?
I don't follow... ?
I call 'Inspect compare matches for current image' window. By default, the 1st item is selected on the left. I want to add better multi match for, say, the 5th item. I do a right-click on the 5th item and select 'Add better multi match' - but I don't get what I expected because right click doesn't select anything so this way I add better multi match for the 1st item, not the 5th one.

Nikse555

27th May 2020, 17:52

@Janusz: nOCR training + ,.- should be improved here: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

@tormento: SE already runs 64-bit if you have a 64-bit OS. Normally 64-bit programs run a little slower...

Latest beta now does fallback to "Latin.nocr" (nOCR) from "Binary image compare" db "Latin.db".
Included large (auto-trained) "Latin.nocr" db in beta.

Janusz

28th May 2020, 01:14

@Janusz: nOCR training + ,.- should be improved here: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

Unfortunately. Version 168 does not work well. I sent files for testing and comparison to the email address.

nOCR Training works much better.
In the sentence: "The quick brown fox jumps over the Iazy do*." there are only 2 errors: "l" was read as "I" and "g" was not recognized.
It is poor with punctuation marks.
In testing the same files that I sent, I am unable to determine the source of the character swap compared to beta 145.
In any case, beta 168 compared to beta 145 recognizes characters created in the new nOCR training much better.

tormento

28th May 2020, 09:33

@tormento: SE already runs 64-bit if you have a 64-bit OS.
The only x64 part I can see is the Hunspell spell checker. Main exe is x86, tesseract (both) are x86. What part of your app runs in x64?
Normally 64-bit programs run a little slower...
That is a very questionable statement. All the x64 programs I use are definitely faster.

ACKR

28th May 2020, 11:28

Hello, I want to change the framerate of a large number of subs from 24 to 29 fps. How do I do this?

varekai

28th May 2020, 14:23

I'm guessing you talk about .srt subtitles?
Subtitle Edit
Tools -> Batch convert -> Change framerate
Or you can try this:
https://www.videohelp.com/software/Subtitle-framerate-changer
It's no longer developed so I have no idea if it works for you.
Tried a few subtitles and it seems to work...

Best regards
varekai
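
What a frame rate change does to the time codes can be sketched in a few lines of Python. This is an illustration for .srt time codes, not the code of either tool mentioned above. It assumes each cue is tied to a frame number, so every time is multiplied by from_fps / to_fps; the function names are made up.

```python
import re

SRT_TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def to_ms(hours, minutes, seconds, millis):
    """Convert the four timestamp fields (as strings) to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def format_ms(total):
    """Format milliseconds as HH:MM:SS,mmm."""
    hours, rest = divmod(total, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02}:{minutes:02}:{seconds:02},{millis:03}"


def change_frame_rate(srt_text, from_fps, to_fps):
    """Rescale every SRT timestamp in srt_text by from_fps / to_fps."""
    factor = from_fps / to_fps

    def rescale(match):
        return format_ms(round(to_ms(*match.groups()) * factor))

    return SRT_TIME.sub(rescale, srt_text)


print(change_frame_rate("00:00:10,000 --> 00:00:20,000", 24, 30))
```

The pattern rewrites anything shaped like a timestamp, including one that happens to appear in the subtitle text, so a real batch job would restrict it to the timing lines.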

ACKR

28th May 2020, 15:52

I'm guessing you talk about .srt subtitles?
Subtitle Edit
Tool -> Batch convert -> Change framerate
Or you can try this:
https://www.videohelp.com/software/Subtitle-framerate-changer
It's no longer developed so I have no idea if it works for you.
Tried a few subtitles and it seems to work...

Best regards
varekai

.ass subtitles

Also can Batch convert be used to add delays?

sneaker_ger

29th May 2020, 17:57

Also can Batch convert be used to add delays?
SubtitleEdit's batch converter calls it "Offset time codes".
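
For .ass files, the effect of such an offset can be sketched as follows. This is a Python illustration, not Subtitle Edit's implementation. It assumes the usual "Dialogue:" line layout with Start and End as the second and third comma-separated fields in H:MM:SS.cc form; the function names are made up, and negative results are clamped to zero by choice.

```python
import re

ASS_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})\.(\d{2})")


def shift_ass_time(match, delta_cs):
    """Shift one H:MM:SS.cc timestamp by delta_cs centiseconds, not below zero."""
    hours, minutes, seconds, centis = (int(group) for group in match.groups())
    total = ((hours * 60 + minutes) * 60 + seconds) * 100 + centis + delta_cs
    total = max(0, total)
    hours, rest = divmod(total, 360_000)
    minutes, rest = divmod(rest, 6_000)
    seconds, centis = divmod(rest, 100)
    return f"{hours}:{minutes:02}:{seconds:02}.{centis:02}"


def offset_dialogue_line(line, delta_seconds):
    """Offset Start and End of a 'Dialogue:' line; other lines pass through."""
    if not line.startswith("Dialogue:"):
        return line
    delta_cs = round(delta_seconds * 100)
    fields = line.split(",", 3)  # Layer, Start, End, everything else
    for index in (1, 2):
        fields[index] = ASS_TIME.sub(lambda m: shift_ass_time(m, delta_cs), fields[index])
    return ",".join(fields)


line = "Dialogue: 0,0:00:01.00,0:00:03.50,Default,,0,0,0,,Hello"
print(offset_dialogue_line(line, 2.5))
```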

18fps

30th May 2020, 12:28

When correcting capitalization of all caps subtitles, it would be of great help if the program could read the names of characters from the imdb page of the film (the user would give the http address).

Nikse555

30th May 2020, 14:53

@tormento: SubtitleEdit.exe (all the C# code) runs 64-bit on 64-bit OS (check task manager, there should be no "(32-bit)" after the name). I've not actually tested SE 32-bit vs SE 64-bit performance... at least you have more memory with 64-bit programs.
Tesseract exe runs 32-bit (the 32-bit tesseract is faster than the 64-bit tesseract - well tested)

@18fps:
>When correcting capitalization of all caps subtitles, it would be of great help if the program could read the names of characters from the imdb page of the film (the user would give the http address)
Actually not a bad idea... but there seems to be a lot of "Cute Girl" and "Prison Guard" which I guess would make this hard. Ideas?

@Janusz: The nOCR (line ocr) works best with larger subtitles, like from bluray sup files, so go for "Binary image compare" with small subtitles or even Tesseract 5.
nOCR is still missing a lot of work - latest beta has added "fallback to nOCR" from "Binary image compare" which I think will work nicely. Also added a "Max bad pixels" for nOCR (based on not matching pixels from lines).
The nOCR auto-training now calculates correct top margin (I hope) which was also missing in the normal OCR run (so all existing nOCR dbs will work less well).
Also fixed in training: quote + percentage sign. Missing in training: combined letters like "ff" and "rt" are not working.
Many changes have been made regarding OCR (mostly nOCR): https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

Janusz

30th May 2020, 20:57

@Janusz: The nOCR (line ocr) works best with larger subtitles, like from bluray sup files, so go for "Binary image compare" with small subtitles or even Tesseract 5.
nOCR is still missing a lot of work - latest beta has added "fallback to nOCR" from "Binary image compare" which I think will work nicely. Also added a "Max bad pixels" for nOCR (based on not matching pixels from lines).
The nOCR auto-training now calculates correct top margin (I hope) which was also missing in the normal OCR run (so all existing nOCR dbs will work less well).
Also fixed in training: quote + percentage sign. Missing in training: combined letters like "ff" and "rt" are not working.
Many changes has been made regarding OCR (mostly nOCR): https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

The problem is not about OCR itself. The program changes the character assignment in the character database. Change from ż to Ż, ć to Ć, from z to Z.

Attempt on _index.html file with "Batman Begins".
The character base created only for the first two lines for each new unrecognized character consists of 17 characters: ! , ? a c e h k l o p P R s t z ż
OCR was stopped at "n" on the third line.
As you can see, a small "z" instead of "Z" appeared in the third line.
We can repeatedly start OCR from the first line, each time OCR will stop at "n". Character Database content does not change.
However, if we call "Inspect nocr matches for ..." on, for example, the first line, the "nOCR inspekt" window opens. Click in the "Inspect items" box, then select OK or Cancel to close the window without making any changes.
Reopening this window shows that "ż" has been reassigned to "Ż", although we did not do that. These changes are now saved permanently. The next OCR run shows that "ż" on lines 1 and 2 has been replaced with "Ż", and "z" with "Z" on line 3. Running OCR once more prompts for "ż" again.
If you need pictures I can attach.
-----
Beta 145 doesn't have this problem.
It started with beta 161. I could check this version. I wrote about beta 168 earlier, but at that time I didn't know where to look for the cause.

jlw_4049

30th May 2020, 21:23

The problem is not about OCR itself. The program changes the character assignment in the character database. Change from ż to Ż, ć to Ć, from z to Z.

Attempt on _index.html file with "Batman Begins".
The character base created only for the first two lines for each new unrecognized character consists of 17 characters: ! ,? a c e h k l o p P R s t z ż
OCR was stopped at "n" on the third line.
As you can see, a small "z" instead of "Z" appeared in the third line.
We can repeatedly start OCR from the first line, each time OCR will stop at "n". Character Database content does not change.
However, if, for example, on the first line we call "Inspect nocr matches for ..." the "nOCR inspekt" window opens and click in the "Inspect items" box, then select OK or Cancel to close the window without any changes.
Reopening this window will show us that "ż" was assigned to "Ż" although we did not. These changes are now saved permanently. Another OCR will show us that "ż" on lines 1 and 2 has been replaced with "Ż" and "z" on "Z" on line 3. The re-OCR is again calling for "ż".

You can also make changes to the characters in the settings yourself


Janusz

30th May 2020, 21:49

You can also make changes to the characters in the settings yourself

Yes. Where?
The program should not change the assignment of an image to a character by itself. If you have once determined that "a" is "a", how does "a" suddenly become "A"?
-----
https://drive.google.com/uc?export=view&id=1onLeBTQK-R1yaABauzds38KRMvlZPycT

Please explain to me what "in the settings" setting causes such a change. The left side, although the subtitles look strange, is correct.
Sup and nocr files to download (https://drive.google.com/uc?export=view&id=11abfRgCGZZxtb-dxTZcrkvxCX4m6bVHs).
The first OCR run will be OK. The second and subsequent OCR runs on the same file will swap the characters.
-----
Everything indicates that the problem concerns only the Polish language, so it went unnoticed by other users.

18fps

31st May 2020, 10:31

@18fps:
>When correcting capitalization of all caps subtitles, it would be of great help if the program could read the names of characters from the imdb page of the film (the user would give the http address)
Actually not a bad idea... but there seems to be a lot of "Cute Girl" and "Prison Guard" which I guess would make this hard. Ideas?

Well, many of these will not actually be in the text of the subtitles ("man in the counter"), so maybe, just the way the program shows the hearing-impaired text it is going to remove, it could show the list of names it found in the actual subtitles, for approval.
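
The approval step suggested here could look roughly like this Python sketch. It is only an illustration: the function names are made up, the cast list is typed in by hand rather than fetched from any site, and the casing is applied to text that has already been lowercased by some other step.

```python
import re


def find_candidate_names(cast_names, subtitle_text):
    """Return the cast name parts that actually occur in the subtitle text."""
    words_in_text = set(re.findall(r"[^\W\d_]+", subtitle_text.upper()))
    found = []
    for name in cast_names:
        for part in name.split():
            if part.upper() in words_in_text and part not in found:
                found.append(part)
    return found


def apply_names(text, approved_names):
    """Restore the casing of every approved name, matching case-insensitively."""
    result = text
    for name in approved_names:
        pattern = re.compile(rf"\b{re.escape(name)}\b", re.IGNORECASE)
        result = pattern.sub(name, result)
    return result


cast = ["Bruce Wayne", "Alfred", "Prison Guard"]
candidates = find_candidate_names(cast, "BRUCE, WHERE IS ALFRED?")
print(candidates)  # shown to the user for approval before anything is changed
print(apply_names("bruce, where is alfred?", candidates))
```

Generic roles such as "Prison Guard" drop out on their own when the words never occur in the dialogue; if "GUARD" does occur, it shows up in the list and the user can reject it.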

Nikse555

31st May 2020, 20:47

@Janusz: Yes, nOCR is not finished yet... in SE 3.5.15 it was probably about 60% done, and now it's about 85% done.
I was close to giving up on it, but after a few fixes in auto-training it's actually working very well (besides words that are stuck together like "rw", "ff" etc... - it's on my todo list + also italic font might be a problem).
I've fixed the OCR inspect in latest beta and also added some code for correcting the casing of Polish letters... do let me know how that works: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

@18fps: OK, I might give it a try.

Janusz

1st June 2020, 09:28

@Nikse555:
It is certainly good for characters without accents: a b c ... A B C ...
It is not bad for lowercase letters with an accent: é č ę ė š ž ü å ä ö ą ś ż ś ...
Unfortunately, capital letters with accent: polish: Ś Ó Ż Ź Ć, spanish, portuguese, czech are recognized as two separate signs: accent and capital letter. This cannot be improved by "Add better match ...".
This does not apply to German. Here Ü Ä Ö is recognized as one character.
Until the problem with single characters is solved, you need not bother with "besides words that are stuck together like "rw", "ff" etc...".
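
To illustrate the "accent plus capital letter" reading: Unicode itself can decompose these letters into a base letter and a combining mark, so a detached mark could in principle be merged back in a post-processing step. The Python sketch below is an assumption-laden illustration, not how Subtitle Edit handles it; the mapping from stand-alone marks to combining marks and the function name are made up.

```python
import unicodedata

# Assumed mapping from stand-alone marks an OCR pass might emit
# to the matching Unicode combining marks.
MARK_TO_COMBINING = {
    "\u00b4": "\u0301",  # acute accent, as in Ó Ś Ź Ć
    ".": "\u0307",       # dot above, as in Ż
    "\u02c7": "\u030c",  # caron, as in Ř Š
}


def compose(mark, letter):
    """Merge a detached mark and a base letter into one character if possible."""
    combining = MARK_TO_COMBINING.get(mark)
    if combining is None:
        return mark + letter
    composed = unicodedata.normalize("NFC", letter + combining)
    # Keep the merge only when Unicode has a single precomposed character.
    return composed if len(composed) == 1 else mark + letter


print(compose("\u00b4", "O"))  # acute + O
print(compose(".", "Z"))       # dot + Z
print(compose(".", "Q"))       # no such letter, left as two characters
```

A real fix would also need the position of the mark relative to the letter, since a "." above "Z" and a full stop before "Z" look the same once reduced to text.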

Nikse555

1st June 2020, 11:09

@Nikse555:
Unfortunately, capital letters with accent: polish: Ś Ó Ż Ź Ć, spanish, portuguese, czech are recognized as two separate signs
Latest beta should work a little better with accents... do you have a sample file with problematic accents?

@Nikse555:
Until the problem with single characters is solved, you can forgive yourself "besides words that are stuck together like "rw","ff"etc .."
Latest beta can now train letters that are stuck together :)

https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

Janusz

1st June 2020, 11:48

@Nikse

https://drive.google.com/uc?export=view&id=1V2c_eNlY_BgkUj5TRziB8p33l0otiSv7

File to download: (https://drive.google.com/uc?export=view&id=1ocRt8ae4d4pWcUm1GcCVA0-MNmYNaJpj)

Because there was a problem with capital letters with accents, e.g. Ś, Ć, Ó, Ř, Í, Š, É etc., I prepared a text consisting of sentences containing all letters used in these languages: English, Polish, German, Spanish and Czech. If we compare the line marked in blue on the right side, we will notice that lower case letters hide between the upper case letters, but not everywhere. There is "Ó", which has not been replaced with "ó", as well as "Ń" and several others. Other capital letters also contain substitutions of this type. The exception is English, for obvious reasons: it has no accented letters.

During the OCR I did not make any corrections, I did not add any characters manually. In both cases: beta 145 and beta 187, the text in the form we see has been fully read by the character base created by nOCR Training beta 187. Comparing pages line by line, you can see how much progress has been made since beta 145.
-----
02.06
Why are some characters, e.g. Czech Ř, Ď, Á ..., remembered as one character, while others, e.g. Polish Ó, Ż, Ź, are stored as the letters O, Z plus an accent?
WARNING! Characters stored in the character database as "." or "´" can, I think, be edited, but they cannot be deleted in any way, because deletion causes our character base to crash.
Removal would be possible, but then all associated characters should also be removed from the database.
-----
I will return to _index.html file with "Batman Begins" with the character base attached.
If we start OCR with the [Draw missing text] option enabled, the program will ask for "." or "," in the middle or at the end of a sentence. We can add them and all is well. But when, in line 74, we are asked to add not "Ż" but the "." located above "Z", and we add it, our character base will crash.
From then on, every "-" at the beginning of a dialog line is replaced with ".". If we run OCR again from the first line without recognizing new characters, all "-" at the beginning of a line turn out to be changed to ".". As a bonus, ś is swapped with Ś and vice versa, and z with Z. The list goes on.
I don't know what it looks like in other languages with uppercase letters in indexes - I don't have the right files, but for single letters it works the same way.
-----
I can already see good changes in beta 193. Keep it up. Good job. Thank you.

Melan

1st June 2020, 18:12

I received the message as below. My .nocr file is created from scratch.

https://i.imgur.com/4TJc4kQ.png

Nikse555

2nd June 2020, 15:07

@Melan: Should hopefully be fixed in latest beta: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

@Janusz: thx for the test file - good idea :)
I see the problem with batman begins and "Ż" - but that's about line splitting (also happens in "Binary image compare")

>deletion causes our character base to crash.
Does this still happen?

Melan

2nd June 2020, 16:35

Still the same.

https://i.imgur.com/tysac5k.png

Nikse555

2nd June 2020, 17:43

Still the same.

https://i.imgur.com/tysac5k.png

thx for re-testing, could you supply the steps to re-create the crash?

Nikse555

2nd June 2020, 19:38

Beta updated: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip
Now extended chars in nOCR can also be edited/deleted.

Please give the new "nOCR" a go :)
It's based on lines rather than images, so it works better with scaling than "Binary image compare". Works best with larger fonts. Can be "auto trained" with your own supplied letters/language + fonts (Ctrl+T in OCR window starts training window)
"Binary image compare" can also be combined with a fallback-to-nOCR.

kerry7

2nd June 2020, 21:07

I just saw that on the Github repo, they have committed an .exe file... that really hurts :scared:

Nikse555

2nd June 2020, 21:16

I just saw that on the Github repo, they have committed an .exe file... that really hurts :scared:

Where?

EDIT: It's totally normal to include 3rd party software as binaries... but "Subtitle Edit" should be committed as source (I've seen a few projects where they ONLY committed the .exe file - now that's scary!)

Janusz

2nd June 2020, 23:57

@ Nikse555

https://drive.google.com/uc?export=view&id=1kChrJc88kdiHcDY9FjTW1OmQVnUrcz-a

"Batman" problem. Sup file to download (https://drive.google.com/uc?export=view&id=1MntEYfNYlAnaXms72JhmR9PwOYbUYo3b).

The file can be read using Latin.nocr, or you can train a new character set just for Arial 65/100, and then it will be perfect.
To show what I have a problem with, I selected several correct lines and several lines from the original file.
I also added 4 lines from "Ź" that are not in the original file.
Due to the character set used and OCR errors, the resulting image may differ so I will explain:

- good lines are: 1, 2, 7, 8, 9, 10, 12, 13. Why? Certainly not only because Ż, Ź, Ś are in the top line, but that is how I explain it to myself.

- problematic lines are: 3, 4, 5, 6, 11, and 14. Here Ż, Ś, Ź is in the bottom line. The place where "*" appears depends on which line is longer.
Sometimes it is the beginning of the line, other times we additionally lose the character from the top line.
In these lines, when the [Draw missing texts] option is enabled, OCR calls for a character,
but not for the capital letter with the index, that is: Ż, Ś, Ź, and only for the index itself.

- finally the "pearl" line 15. Characters with the index are in both the top and bottom line, and yet the line was read correctly.

I understand why this is happening and I think it can be solved.

tormento

3rd June 2020, 01:31

@Nikse555

Is there any format that allows OCR recognition of both upper and lower screen text (typically anime)?

I have tried to set .ass in the main window, but when OCR finds both upper and lower screen text, it skips the formatting, while it works when the text is only in the upper part (.srt works too).

If it's a limitation of OCR, would it be possible to add the feature?

Example (https://www.mediafire.com/file/okrj9cap5ysquhf/Evangelion_2.22_%5Bita%5D.7z/file).

Please notice that some negative values for subtitles are present too, perhaps because OCR doesn't know how to manage upper and lower screen text at the same time.

Melan

3rd June 2020, 06:46

thx for re-testing, could you supply the steps to re-create the crash?

Nothing special (I hope) :). I run SE, parse the file and run the nocr module.
I sent the files by mail.

Janusz

3rd June 2020, 08:50

@tormento

This is what it consists of:

1
00:00:01,000 --> 01:00:01,000
{\an8}Żeby budzić strach w innych,
musisz zapanować nad własnym.

2
00:00:06,209 --> 00:00:09,163
Żeby pokonać strach,
trzeba się nim stać.

3
00:00:10,163 --> 00:00:13,782
Wiesz, czemu upadamy?
Żebyśmy mogli się pozbierać.

4
00:00:14,782 --> 00:00:18,474
Podobno twój tata błagał o litość.
Żebrał jak pies.

kerry7

3rd June 2020, 08:56

Where?

EDIT: It's totally normal to include 3rd party software as binaries... but "Subtitle Edit" should be committed as source (I've seen a few project where they ONLY committed the .exe file - now that's scary!)

On the root of the project, the file is `vswhere.exe`. It is a bit confusing, because the last tag of the project is around 3.0.X, yet the commit comment says `Update wswhere to 2.3.2`. What does that mean? (Just out of curiosity, I would like to learn.)

tormento

3rd June 2020, 11:12

This is what it consists of
Try to ocr my sup. ;)

SE can't really do a good job and mixes upper with lower when both are present.