Subtitle Edit [Archive] - Page 22

Nikse555

3rd June 2020, 19:02

Nothing special (I hope) :). I run SE, parse the file and run the nocr module.
I sent the files by mail.

Thx for the info + files :)
Yes, I got the error too - now fixed in latest beta: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

@kerry7:
"vswhere" is a small tool that helps (in exe form) to compile Subtitle Edit: https://github.com/microsoft/vswhere
"vswhere" was version "2.3.2"... which has nothing to do with the SE version number. I just updated "vswhere" to 2.8.4 - see https://github.com/SubtitleEdit/subtitleedit/commit/cc640df5d97a27ff88731b7c2c4aa29b1a33e13c

@tormento:
Sorry, SE does not support this (besides all text at top).
This is pretty complex - text can be all over and even vertical.

tormento

3rd June 2020, 20:29

Sorry, SE does not support this (besides all text at top). This is pretty complex - text can be all over and even vertical.
It would be more than enough to support overlapping subtitles at top and bottom.

Melan

4th June 2020, 17:16

Thx for the info + files :)
Yes, I got the error too - now fixed in latest beta: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

(...)

I parsed a few files and the error didn't appear.
However, something strange happened. SE doesn't choose Polish characters. :rolleyes:

https://i.imgur.com/IZGtqQF.png

Edit.
After the restart, everything returned to normal.

Janusz

4th June 2020, 23:57

@ Nikse555
In Beta 203 import of nOCR character database does not work.
The last one where imports were still active was Beta 194.

Melan

5th June 2020, 09:20

When two characters from both lines are interpreted as one letter then the initial dash always turns into a dot.
https://i.imgur.com/WFhQbb7.png
https://i.imgur.com/D9vXeK8.png

Maybe I'm blind :P, but I really don't see the difference between zero after digit 6 and zero after digit 1.
https://i.imgur.com/oMcXc0a.png

Nikse555

5th June 2020, 09:54

@Janusz: The nOCR import should be fixed now, thx :)
Also, I'm testing a new line splitter - how does that work for you? It will never be perfect...
Latest beta: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

@Melan:
Could you post or email the subtitle file? (you can e.g. right-click on the ocr-window and export as blu-ray sup)
About the "O"... you have to "Add better match" and enter "0"...

Melan

5th June 2020, 10:30

Could you post or email the subtitle file? (you can e.g. right-click on the ocr-window and export as blu-ray sup)

Unfortunately, I can't. Screen 20.05 - chat.

I downloaded the B208 version and ...
https://i.imgur.com/6P8JvB2.png

Nikse555

5th June 2020, 13:25

@Melan: OK, got the crash too... should be fixed here: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

@Janusz: Also, made some fixes (hopefully) to the new line-splitter in above beta too.

Nikse555

5th June 2020, 15:05

And due to some bugs in the new image line splitter... a new beta: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

Janusz

5th June 2020, 15:47

And due to some bugs in the new image line splitter... a new beta: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip
Beta 8 already worked well, but it crashed on line 174 with the _index.html file from the "Batman Begins" directory and the character database added with the file. I do not know why. Beta 12 passes through this line without failure and runs flawlessly.
[Draw Missing texts] was awarded. If [Draw Missing texts] is checked, on line 74 OCR will call for ",". This sign is strangely marked in the top window, although the image at the bottom is correct. It looks the same in Beta 12.
Well done, thank you.
-----
Edit 1:
As we are at Batman, please note what is happening now with the 758 line. Earlier versions did not do that.
If the error cannot be reproduced, I will insert a picture. It looks like some noise picked up by OCR.
-----
I would add that out of 1137 images it appears only in this one. I also OCR'ed a file that consists of 5489 images and nothing like this ever happened. Running OCR on this one image alone, via image import, does not generate the error.

Edit 2:
Just like @Melan showed here:

https://i.imgur.com/yqGFwSW.png

It's just that the whole sign is visible and I have some scraps of different signs from the bottom line.

Melan

5th June 2020, 17:13

File sent to mail.
I still have a crash after right-click on ocr window.

https://i.imgur.com/dPmYu6O.png

I sent another file. Look at the 4th line, please. An interesting thing appears after clicking "skip".
https://i.imgur.com/yqGFwSW.png

Nikse555

6th June 2020, 06:58

File sent to mail.
I still have a crash after right-click on ocr window.

https://i.imgur.com/dPmYu6O.png

How/where did you get this error?

@Melan/Janusz: Hopefully fixed issue with strange split/position: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

Melan

6th June 2020, 08:00

How/where did you get this error?

https://i.imgur.com/CUOAb0T.png

Nikse555

6th June 2020, 08:32

https://i.imgur.com/CUOAb0T.png

Thx :)
Should now be fixed: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

Janusz

6th June 2020, 11:15

It's not good. The questionable character on line 30 was written on the third line. This bug has already occurred in earlier beta versions.

https://drive.google.com/uc?export=view&id=16NWU5qIKMAMCMtR7iRiNR7Ek1C3mErwX

Files to download: (https://drive.google.com/uc?export=view&id=1gzvY6AclUVNYLTm6Hk8jcE6RmU5iqUOH)

Edit 1
This effect will not occur if you start the scan from this line, but each subsequent scan breaks the line and places the character on the third line.
The effect is invisible if we have any dictionary enabled.

Melan

6th June 2020, 12:51

It seems that sometimes, the area of letter identification is too small. This applies to the letter "ę".

https://i.imgur.com/CBvfiiB.png

Nikse555

6th June 2020, 13:02

@Janusz: It's really difficult to split lines...
New version up, which hopefully has better splitting: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip
(also enabled threading)

It seems that sometimes, the area of letter identification is too small. This applies to the letter "ę".

https://i.imgur.com/CBvfiiB.png

Subtitle?

Janusz

6th June 2020, 14:42

This effect of the rearranged words came out when I removed "." and "," from the character database, then I started the scan from the first line.
I added these characters when OCR asked for them. The first 6 lines are good, later it gets worse until the end.
The second scan from the first line restores the correct word order.

Edit:
Beta 18 after removing and adding these two characters back to the base works well up to line 30, above.

https://drive.google.com/uc?export=view&id=1F6MZ03GRqG_U9OdZgRZqsqq2YtX4ou_A

Nikse555

6th June 2020, 19:18

I've rolled back the line splitter... I could find no logic of how to divide lines :(

https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

Janusz

7th June 2020, 01:17

I've rolled back the line splitter... I could find no logic of how to divide lines :(

There is some logic in this mixing of parts of the line. I applied 6 markers to the image.
Red points indicate places where the first dot or comma appears in the correct text (green points).
As you can see, from the red point to the end of the line, the text is correct, including breaking lines.
I have reviewed several further lines and this rule is confirmed. This division does not occur if there is no dot or comma in the line.
If these characters are the last in the line, the whole line is repeated.

We are talking about two characters all the time, because I changed these two characters.
Another test, when I removed only "," changes occur only on lines where a comma appears. The same is with "!", "?".
The most important thing is that the second scan should repair everything.
I have no idea why the errors do not appear right away on the first line, but only from line seven on.

Edit:
Note 1. After deleting a character from the database, we start the scan with the [Draw missing texts] option turned off, in the place of the missing characters we get "*", the text does not spill - it's good, it should be.
Note 2. Add all the missing characters (2) using [Inspect nOCR .../Add better match], enable the [Draw missing texts] option and start the scan, the text does not spill - it's good.

From this you can see that the text crumbles only if previously added characters are added during the scan with the [Draw missing texts] option enabled.

During all these experiments the message below was displayed. While it was shown, the program kept working until the scan was completed. After selecting Ignore, you are returned to the program.

https://drive.google.com/uc?export=view&id=1-JZQFoGgt9JPB2iRJ8PSPLDzDb5nV2Ha

Edit 2:

Could this be related to this change? https://forum.doom9.org/showpost.php?p=1911621&postcount=964

Melan

7th June 2020, 10:18

Why? :rolleyes:
B224

https://i.imgur.com/BVCoLHr.png

Janusz

7th June 2020, 11:59

@Melan
For "-", make a better match and state that it is the " ' " accent.
Remember that you do it on the basis of "Latin" and after reinstalling the program or update you will lose these changes.

GCRaistlin

7th June 2020, 13:50

I've got a crash trying to open an m2ts file (22649739264 bytes):
https://i112.fastpic.ru/thumb/2020/0607/e3/62c4461666893bc7ca86533263d0e4e3.jpeg (https://fastpic.ru/view/112/2020/0607/62c4461666893bc7ca86533263d0e4e3.jpg.html) https://i112.fastpic.ru/thumb/2020/0607/bb/065584017e30e05ace108e93d902fcbb.jpeg (https://fastpic.ru/view/112/2020/0607/065584017e30e05ace108e93d902fcbb.jpg.html) https://i112.fastpic.ru/thumb/2020/0607/de/372d458f5f790e3de32e7f4c352961de.jpeg (https://fastpic.ru/view/112/2020/0607/372d458f5f790e3de32e7f4c352961de.jpg.html) https://i112.fastpic.ru/thumb/2020/0607/e1/bf1449ce3d50a932f65d2c22734922e1.jpeg (https://fastpic.ru/view/112/2020/0607/bf1449ce3d50a932f65d2c22734922e1.jpg.html)

Nikse555

7th June 2020, 14:58

Thx for the bug reports :)
New beta up with at least a few crashes fixed: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

@GCRaistlin: Does the first error still occur? How can I re-create the error if it still occurs? Did it also occur in the "old" 3.5.15 "final" ?

GCRaistlin

7th June 2020, 17:15

Nikse555
Both 3.5.15 and the latest beta crash. See PM.

Melan

7th June 2020, 19:49

@Melan
(...)
Remember that you do it on the basis of "Latin" and after reinstalling the program or update you will lose these changes.

I never overwrite files except SE.exe :)

varekai

8th June 2020, 08:50

:D
https://i.imgur.com/LbYxzIO.gif

varekai

8th June 2020, 09:31

Janusz

8th June 2020, 11:24

Error message.

https://drive.google.com/uc?export=view&id=11M7PbD4uv0xYKLKWphYR8GbEEvmNKws9

An error occurs when in the main window of the program in the [List view] tab the [End time] column is invisible and you want to save the results of the comparison in [Compare subtitles].
This also occurs in stable versions 3.5.14 and 3.5.15.

Nikse555

8th June 2020, 17:20

An error occurs when in the main window of the program in the [List view] tab the [End time] column is invisible and you want to save the results of the comparison in [Compare subtitles].
This also occurs in stable versions 3.5.14 and 3.5.15.

Thanks, nice catch :)
Beta updated: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

nOCR seems close to first public release... I just OCR'ed some Swedish/German Bluray sup files and only had to enter 4-8 letters/double-letters :)

Melan

8th June 2020, 17:42

It's not good. The result of extracting subtitles using the OCR method. Letters with Polish diacritical marks are not correctly read from the database.
https://imgur.com/a/nJjYlR5

In addition, I noticed that the "scanning" in the SE changed and the SE seemed to return to the line, which was already checked.

nOCR method - duplicate line. This error appeared for the first time.
https://i.imgur.com/DdNUglJ.png

Nikse555

8th June 2020, 20:24

Janusz

8th June 2020, 20:47

@Melan
 is not a valid regular expression and will not work.
If that is <\/i> it will search for empty strings between italic flags.

Melan

8th June 2020, 21:35

Sorry, I don't know what you mean...
Second image is not found...

The first screen shows the effect of extracting subtitles from a file that I sent to an email.

I re-uploaded the second screen.

https://i.imgur.com/9vLi9d0.png
https://i.imgur.com/DdNUglJ.png

GCRaistlin

9th June 2020, 12:37

It is not possible to add a better multi match for a dot. Hence, three dots in a row cannot be recognized as a horizontal ellipsis (U+2026).
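
One possible workaround outside SE is to post-process the recognized text. The following is only a minimal sketch, assuming the OCR result is already available as a plain string; the function name `dots_to_ellipsis` is made up for illustration and is not part of Subtitle Edit.

```python
import re

ELLIPSIS = "\u2026"  # HORIZONTAL ELLIPSIS


def dots_to_ellipsis(text: str) -> str:
    """Replace every run of exactly three dots with U+2026.

    Runs of two, four or more dots are left alone, since it is not
    obvious what they are meant to be.
    """
    return re.sub(r"(?<!\.)\.{3}(?!\.)", ELLIPSIS, text)


line = "Но что Джимми любил больше всего... Что он по-настоящему"
print(dots_to_ellipsis(line))
```

The lookbehind and lookahead keep the pattern from matching three dots inside a longer run of dots.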

GCRaistlin

9th June 2020, 20:03

Merging and splitting subtitles work in a strange way sometimes:

175
00:12:45,863 --> 00:12:47,447
Креветки и раки были лучше всего.

176
00:12:47,530 --> 00:12:48,864
Они быстро сплавливались.

After merging:

175
00:12:45,863 --> 00:12:48,864
Креветки и раки были лучше
всего. Они быстро сплавливались.

Why move the last word of the first sentence to the next line?

163
00:12:20,006 --> 00:12:24,177
Но что Джимми любил
больше всего... Что он по-настоящему
любил, так это воровать.

After splitting:

163
00:12:20,006 --> 00:12:23,003
Но что Джимми любил больше
всего... Что он по-настоящему -

164
00:12:23,027 --> 00:12:24,177
- любил, так это воровать.

Why not split right after the dots?

GCRaistlin

9th June 2020, 20:13

A new strange splitting example:

185
00:13:12,803 --> 00:13:16,223
Вы будете вместе работать, понял?
Помоги ему. Давай.

After splitting:

185
00:13:12,803 --> 00:13:14,937
Вы будете вместе
работать, понял?

186
00:13:14,961 --> 00:13:16,223
Помоги ему. Давай.

Why make the first subtitle two-lined?

Janusz

9th June 2020, 23:12

@GCRaistlin
Merging and splitting subtitles work in a strange way sometimes:

It depends on the values you have set in Settings / General:
[Single line max. length]
[Unbreak subtitles shorter than] - here you must have 33 or less. Set e.g. 35 and you will have what you want.
You can always use the [Text] window and make [Split line at cursor position].
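
To make the interplay of these two settings easier to picture, here is a rough sketch of a two-line auto-break. It is not SE's actual algorithm: the function `auto_break`, its parameter names and the rule "break at the space nearest the middle" are assumptions for illustration, and the defaults 40 and 35 are simply values mentioned in this thread.

```python
def auto_break(text, single_line_max=40, unbreak_shorter_than=35,
               prefer_sentence_end=True):
    """Break subtitle text into at most two lines (illustrative only)."""
    flat = " ".join(text.split())  # remove existing line breaks
    if len(flat) < unbreak_shorter_than:
        return flat  # short enough: keep it on a single line

    spaces = [i for i, ch in enumerate(flat) if ch == " "]
    if not spaces:
        return flat  # nowhere to break

    middle = len(flat) / 2

    def fits(i):
        # Both resulting lines must respect the single line maximum.
        return i <= single_line_max and len(flat) - i - 1 <= single_line_max

    if prefer_sentence_end:
        # Prefer a break right after ".", "!" or "?" when both halves fit.
        ends = [i for i in spaces if flat[i - 1] in ".!?" and fits(i)]
        if ends:
            i = min(ends, key=lambda pos: abs(pos - middle))
            return flat[:i] + "\n" + flat[i + 1:]

    # Otherwise break at the space closest to the middle of the text.
    i = min(spaces, key=lambda pos: abs(pos - middle))
    return flat[:i] + "\n" + flat[i + 1:]


merged = "Креветки и раки были лучше всего. Они быстро сплавливались."
print(auto_break(merged))
print(auto_break(merged, prefer_sentence_end=False))
```

In this sketch, a text shorter than the unbreak threshold is never broken, and a sentence end is only preferred when both halves stay within the single line maximum. That is why changing either value can move the break.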

Janusz

10th June 2020, 01:59

@ Nikse555

Strange not to say bad OCR behavior.
I noticed this in other subtitles, but again I will use the _index.sup file from "Batman".
My Settings:
[Image preprocessing] = 142 (after 49 the first value that makes visible changes in the image)
[No of pixels is space] = 4 (value checked and is correct for this file)
[Max wrong pixels] = 0 (now zero to eliminate randomness)
[Contains italic] = off
[Draw missing texts] = on
[Line split ...] = Auto (I use it most often because it works, besides I don't quite understand how it should work)
[Lines to draw]=100

1. We create a new character base - any name.
We start and enter two lines of text and then stop by selecting [Abort]. That's enough to see what's going on.
In the database of our characters, with the exception of "ż", we have one character for each new letter entered. That would seem great, but unfortunately it is not. Let's go back to the first line, uncheck [Draw missing texts] and select [Start]; after several lines we stop the process and return to the first line. You don't need to know the language: comparing with the image shows that on lines 3, 4, 5, 9 and further we have "s" instead of ".". In this case it fell on "s", but it can also be any other letter, e.g. "e" or "a" (checked). I have no idea where it comes from.

Often "." is inserted in the text instead of "-" if "." is already in our database, even when "-" is in the database but the character was not recognized. I know that this can be improved by using [Add better match], but I also know that such a conversion of "s, e, a" into ".", or of "." into "-", will bring with it the need for other corrections. I have already experienced this.
My question: why was the "s" character, sized 12x17, fitted into a 4x4 square? The same applies to "-" (7x4) and "." (4x4). Meanwhile, the characters "ż" 11x21 and 12x21 were treated as two different characters.
The worst part is that, through to the end of this text, I will never be asked to enter ".". And how many other characters will be omitted?

2. The same settings, new base, we only scan the third line.
Let's see what we have. Instead of a small "z" we have a large "Z". Same question as above: why?

3. Finally, let's [Edit] our character base and delete all characters with [Delete character], then OK. Logically, the database should now be empty. We check, and it is: zero elements in the database.
But out of curiosity we press [Start], the text is read, we open [Edit] again, and what do we see? Our database is still there as if nothing had happened.
Character base support requires refinement. Without this, there is no point in moving forward, as we do not know what our character base really contains. Sometimes I delete a character from the database and a moment later I have it back; I mark italics, and a moment later I have italics back next to the character.

Nikse555

10th June 2020, 06:20

@Melan:
One error was caused by a bad "auto min line height" (which is used for line splitting - press Ctrl+H to see how a subtitle is split into lines).
Second error I think was due to some multi-threading error.
Both hopefully fixed now: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

@Raistlin: There are also settings for auto-break in Options -> Settings -> Tools.

@Janusz: nOCR is vector based and is easy to scale (unlike image compare) and different sized images will be compared. It's possible something could be improved, of course. I'll try to re-create the "." vs "-" issue.
About the deletion of characters... hopefully that was a multi-threading issue and is fixed now.
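
To illustrate why a vector-based matcher can accept images of very different sizes, here is a small sketch. It is not SE's nOCR code: the template layout (a dict with "foreground" and "background" line segments) and the names `scaled_points` and `matches` are assumptions made up for this example.

```python
def scaled_points(line, tmpl_w, tmpl_h, w, h):
    """Yield pixel coordinates of a template line scaled to a w x h bitmap."""
    (x0, y0), (x1, y1) = line
    steps = max(abs(x1 - x0), abs(y1 - y0), 1)
    for s in range(steps + 1):
        x = x0 + (x1 - x0) * s / steps
        y = y0 + (y1 - y0) * s / steps
        # Map template coordinates to bitmap coordinates, clamped to the edges.
        yield min(w - 1, int(x * w / tmpl_w)), min(h - 1, int(y * h / tmpl_h))


def matches(template, bitmap):
    """True if all foreground lines hit ink and all background lines miss it."""
    h, w = len(bitmap), len(bitmap[0])
    size = (template["width"], template["height"], w, h)
    for line in template["foreground"]:
        if any(not bitmap[y][x] for x, y in scaled_points(line, *size)):
            return False
    for line in template["background"]:
        if any(bitmap[y][x] for x, y in scaled_points(line, *size)):
            return False
    return True


# Hypothetical 12x17 template described by two foreground strokes only.
loose = {"width": 12, "height": 17,
         "foreground": [((2, 2), (9, 2)), ((2, 14), (9, 14))],
         "background": []}
# The same strokes plus one background line that must stay empty.
strict = dict(loose, background=[((5, 5), (5, 6))])

dot = [[True] * 4 for _ in range(4)]  # a solid 4x4 blob, such as "."

print(matches(loose, dot), matches(strict, dot))
```

In this toy model a template with only foreground lines accepts the solid 4x4 blob, because every scaled stroke lands on ink, while adding one background line makes the same template reject it. Whether this is what happens in the "s" vs "." case above is not confirmed here.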

Melan

10th June 2020, 07:08

Previous errors no longer occur. Thx.

After pressing ctrl+H I received something like this.
https://i.imgur.com/WIfQyac.png
https://i.imgur.com/JTv2c9q.png

GCRaistlin

10th June 2020, 09:57

Nikse555
'Options - Settings - Unbreak lines shorter than' seems to count non-displayable characters like <i> and </i>, and so treats italic and non-italic subtitles differently:

175
00:12:45,863 --> 00:12:47,447
Креветки и раки были лучше всего.

176
00:12:47,530 --> 00:12:48,864
Они быстро сплавливались.

After merging:

Креветки и раки были лучше
всего. Они быстро сплавливались.

170
00:12:45,863 --> 00:12:47,447
Креветки и раки были лучше всего.

171
00:12:47,530 --> 00:12:48,864
Они быстро сплавливались.

After merging:

170
00:12:45,863 --> 00:12:48,864
Креветки и раки были лучше всего.
Они быстро сплавливались.

I was unable to make this subtitle:

163
00:12:20,006 --> 00:12:24,177
Но что Джимми любил
больше всего... Что он по-настоящему
любил, так это воровать.

to be split after a dot. I have 'Break early for end of sentence' checked but it is still being split the same old way:

163
00:12:20,006 --> 00:12:22,898
Но что Джимми любил больше всего...
Что он по-настоящему

164
00:12:22,922 --> 00:12:24,177
любил, так это воровать.

UPD: I got it. I mixed up splitting and breaking. Is there a way to control how subtitles are being split?
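
On the tag-counting point at the top of this post, a length check that ignores formatting tags could look like the sketch below. This is only an illustration, not SE's implementation: the helper name `visible_length`, the list of tags and the choice not to count line breaks are all assumptions.

```python
import re

# Simple formatting tags as used in SubRip text: <i>, <b>, <u>, <font ...>
TAG = re.compile(r"</?(?:i|b|u|font)\b[^>]*>", re.IGNORECASE)


def visible_length(text: str) -> int:
    """Length of subtitle text without formatting tags and line breaks."""
    return len(TAG.sub("", text).replace("\n", ""))


plain = "Креветки и раки были лучше всего."
italic = "<i>" + plain + "</i>"  # hypothetical italic version of the same line

print(len(italic), visible_length(italic), visible_length(plain))
```

With a threshold of 35, the raw italic string (40 characters) and its visible text (33 characters) fall on different sides of the limit, which would explain italic and non-italic lines being handled differently if tags are counted.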

GCRaistlin

10th June 2020, 10:18

Another unexpected result when splitting a subtitle (Unbreak subtitles shorter than: 35).

1096
01:17:21,598 --> 01:17:25,519
А пока, Джимми и Томми едут в
Темпу в эти выходные, кое-что забрать.

After splitting:

1096
01:17:21,598 --> 01:17:23,547
А пока, Джимми и
Томми едут в Темпу

1097
01:17:23,571 --> 01:17:25,519
в эти выходные, кое-что забрать.

The new subtitle 1096 is 34 characters long - why was it broken then?

Janusz

10th June 2020, 12:14

@GCRaistlin
Set [Single line max. length] = [Unbreak subtitles shorter than], e.g. 40 and 40. Remember that there is no golden rule: what improves one place can spoil another. Also remember that the break will occur after 40 characters in the first line; the rest of the characters will go to the second.

Nikse555

10th June 2020, 13:05

@Melan: sorry, I cannot make line splitting better... you can play with "split min line height" and see if that helps.
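
For readers wondering what a "min line height" does during splitting, here is a generic sketch of row-based line splitting. It is not SE's splitter: the function `split_lines` and the merge rule are assumptions, and the bitmap is assumed to be a list of rows of booleans (True = ink).

```python
def split_lines(bitmap, min_line_height):
    """Return (top, bottom) row ranges of text lines, bottom exclusive."""
    bands, start = [], None
    for y, row in enumerate(bitmap):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                    # a band of ink rows begins
        elif not has_ink and start is not None:
            bands.append((start, y))     # an empty row ends the band
            start = None
    if start is not None:
        bands.append((start, len(bitmap)))

    # A band lower than min_line_height (e.g. an accent separated from its
    # letter by an empty row) is merged with the band that follows it.
    merged = []
    for top, bottom in bands:
        if merged and merged[-1][1] - merged[-1][0] < min_line_height:
            merged[-1] = (merged[-1][0], bottom)
        else:
            merged.append((top, bottom))
    return merged


ink, blank = [True, False], [False, False]
# 2 rows of accents, 1 empty row, 10 rows of letters, a gap, 10 rows of letters
image = [ink] * 2 + [blank] + [ink] * 10 + [blank] * 3 + [ink] * 10
print(split_lines(image, 5))
print(split_lines(image, 1))
```

In this sketch a value that is too small splits accents off as their own "line", while a value larger than a real text line glues neighbouring lines together, so the useful range depends on the font size in the images.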

Still got the double text error in nOCR, but hopefully fixed in beta 255: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.15/SubtitleEditBeta.zip

Melan

10th June 2020, 13:25

I usually have this value set to 25. For files that I have, it is optimal. For the case from the screen I set the entire possible range from 5 to 150 and without the expected result.

BTW. It would be good if the value from the "Min. Line height" field was "remembered" by SE.

Nikse555

10th June 2020, 14:05

BTW. It would be good if the value from the "Min. Line height" field was "remembered" by SE.

It's remembered here... it's not remembered for you in beta 255?

Melan

10th June 2020, 16:28

No. Always starts with "Auto"

https://i.imgur.com/wCQqhLF.png

GCRaistlin

10th June 2020, 17:53

@GCRaistlin
Set [Single line max. length] = [Unbreak subtitles shorter than], e.g. 40 and 40.
It is already set this way. Besides that, as far as I see, it has nothing to do with the issue I reported.

Melan

10th June 2020, 19:11

After changing the dot (.) to dash (-) in .nocr db, with "Add better match", my base is not read correctly.

https://i.imgur.com/0a578XS.png