
View Full Version : Subtitle Edit



Nikse555
29th March 2020, 12:08
@tormento: I don't think those files are correctly encoded, sorry.
EDIT: The UTF-8 files have UTF-8 BOM (EF BB BF) but they are not using UTF-8 encoding, they are ANSI encoded! Yes, really a mess :)
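A minimal Python sketch (not part of Subtitle Edit) of how such a mismatch can be detected: look for the UTF-8 BOM (EF BB BF) and then test whether the rest of the file actually decodes as UTF-8. The function name is made up for illustration, and "ANSI" is assumed to mean a single-byte Windows code page such as Windows-1252.

```python
import codecs


def check_utf8_with_bom(data: bytes) -> str:
    """Classify raw file bytes by BOM presence and UTF-8 validity."""
    has_bom = data.startswith(codecs.BOM_UTF8)  # EF BB BF
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    try:
        body.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        valid = False
    if has_bom and not valid:
        return "BOM present, but body is not valid UTF-8"
    if has_bom:
        return "UTF-8 with BOM"
    return "valid UTF-8 without BOM" if valid else "not UTF-8"
```

Note that a file containing only ASCII characters is valid in both encodings, so the mismatch only shows up once accented characters are present.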

tormento
30th March 2020, 12:28
@tormento: I don't think those files are correctly encoded, sorry. EDIT: The UTF-8 files have UTF-8 BOM (EF BB BF) but they are not using UTF-8 encoding, they are ANSI encoded! Yes, really a mess :)
They come from SubRip, latest version. I don't know if the author is still active, so that I could report this mess to him.

What about the other message about "L" OCR?

Nikse555
30th March 2020, 14:41
What about the other message about "L" OCR?

I've updated latest beta somewhat: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.14/SubtitleEditBeta.zip

Your subtitle runs very well through the OCR using "Binary image compare" with number-of-pixels-is-space=7 and max-error-pct=1.
I did not have any problems with "L".

What OCR method are you using and what lines are problematic?

tormento
31st March 2020, 12:57
Your subtitle runs very well through the OCR using "Binary image compare" with number-of-pixels-is-space=7 and max-error-pct=1. I did not have any problems with "L".
Clean installation, same settings as yours. I drop the idx onto SE, with no OCR auto-correction enabled.

Many many "L":

570
00:51:39,520 --> 00:51:42,478
L know the book is tough,
but l liked it.

571
00:51:42,600 --> 00:51:43,476
L know.

Have you tried with a fresh installation and an "untrained" OCR database?

Nikse555
31st March 2020, 14:32
Clean installation, same settings of yours. I drop idx to SE, no OCR auto correction enabled.

Many many "L"...

I get "l" (lowercase L) instead of "I" (uppercase i)... because the two images are exactly alike. Enabling "Fix OCR errors" should fix those...

Result here, starting with lowercase "L":
l know the book is tough,
but l liked it.
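As a rough illustration of the kind of post-OCR correction described here (this is not Subtitle Edit's actual "Fix OCR errors" code), a standalone lowercase "l" can be replaced with "I" using a regular expression. The pattern and function name are assumptions for this sketch; the lookahead is there so that elided articles such as "l'amore" are left alone.

```python
import re

# A lowercase "l" standing alone as a word is treated as a misread "I".
# The negative lookahead skips an "l" followed by an apostrophe.
STANDALONE_L = re.compile(r"\bl\b(?!['’])")


def fix_standalone_l(text: str) -> str:
    """Replace a lone lowercase 'l' with an uppercase 'I'."""
    return STANDALONE_L.sub("I", text)
```

This only covers the lowercase case shown above; it does nothing for a capital "L".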

tormento
31st March 2020, 15:20
I get "l" (lowercase L) instead of "I" (uppercase i)... because the two images are exactly alike. Enabling "Fix OCR errors" should fix those...

Result here, starting with lowercase "L":
l know the book is tough,
but l liked it.

I get capital L!

Boulder
2nd April 2020, 18:15
I get capital L!

I've added the OCR fix list pair (Options -> Settings -> Word lists) l --> I to fix this, if I remember correctly. You can also do that after the OCR run.

That subtitle example was very straightforward to OCR and the characters look like most DVDs, so you get a lot of good matches for future subs.

I uploaded my dictionary files and latin.db in case someone finds them useful (Nikse555 can freely use the content with SE if he wants to):

https://drive.google.com/open?id=1BoeF5_dwzIVpbxJYWS8tAwXRn-IWRgnz
https://drive.google.com/open?id=1Nz1JLOmhO8PIYajV5RCBOW9K93DZVxHb

tormento
3rd April 2020, 11:07
One of the best things about SubRip was the ability to save different matrices and automatically scan for the most effective one for OCR when loading the sub bitmap file.

That makes it possible to avoid polluting the well-trained ones with some unusual subtitle, and to organize them effectively.

Moreover, Subtitle Edit could come with pretrained ones for the most commonly used fonts, like SupRip did.

@nikse would you, please?

Boulder
3rd April 2020, 11:14
SE does have the ability to use different databases for OCR and the scanning could be useful, I agree.

tormento
3rd April 2020, 11:21
SE does have the ability to use different databases for OCR and the scanning could be useful, I agree.



Another missing thing is the possibility to expand selection, such as for % that sometimes goes wrong on OCR. A point and click expansion thing would be even better.

Boulder
3rd April 2020, 11:37
If the character is not recognized, it's possible to expand. If it's detected wrong, afterwards, still in the OCR dialog you can right click on the line with the issue and choose to inspect the matches. Then right click on the incorrect match and you get an option to select a better multi match. It's something I reported as an issue a long time ago so I just happen to know where it is. It's not easy to come by by accident, the same with all those special characters that you can add using the right click in the OCR dialog where it asks which character the image represents. I've used the software for years and found this one out last week :)

tormento
3rd April 2020, 12:05
If the character is not recognized, it's possible to expand.
How?

Manually changing the OCR recognition is something I already know.

Expanding the bitmap is new to me.

Boulder
3rd April 2020, 12:08
How?

Manually changing the OCR recognition is something I already know.

Expanding the bitmap is new to me.

There should be a button to expand the selection if the process runs into a character it does not recognize. Off the top of my head, it's in the top area of the window. Using it is so automatic to me that I don't recall the exact place.
EDIT: the problem remains if the first part of % (a dot) is recognized, because expansion only works forwards and not backwards. In these cases, I abort the OCR, manually check the matching for that line to fix the incorrect detection, and restart the process from there.

tormento
3rd April 2020, 12:13
There should be the button to expand selection if the process runs into a character it does not recognize.
Where?

https://i.lensdump.com/i/jkrp7z.md.png (https://lensdump.com/i/jkrp7z)

Boulder
3rd April 2020, 12:15
It's not in that main dialog. It appears when you start the OCR process, in the bitmap/character matching phase if there is no match.

tormento
3rd April 2020, 12:17
It's not in that main dialog. It appears when you start the OCR process, in the bitmap/character matching phase if there is no match.
Yep, you are right.

Boulder
3rd April 2020, 12:19
And if you haven't already: download and set the correct dictionary and enable Fix OCR errors to make your life a bit easier.

tormento
3rd April 2020, 12:40
And if you haven't already: download and set the correct dictionary and enable Fix OCR errors to make your life a bit easier.
I tried to use it but it doesn't work really well for Italian.

Boulder
3rd April 2020, 13:15
I tried to use it but it doesn't work really well for Italian.

There are two dictionaries; are they equally bad? You can of course modify the dictionaries if there are clear errors. In my opinion, they are the key to getting the whole process fast and accurate, but it takes a lot of time in the beginning.

tormento
4th April 2020, 11:14
Ok, I reset Latin.db and started to create a better OCR database, using italics and so on.

I find it really annoying to have to reopen the same IFO several times for different subs.

Am I doing something wrong?

jlw_4049
15th April 2020, 19:18
Any way to minimize the program while it's OCR'ing?
Also, is there any way to make the program flash on the taskbar when it has a prompt, so you know?
Also, last question: when updating from older versions, where is the user dictionary/name dictionary saved, so that it can be carried over to the newer version?
Also, is there any way to minimize the OCR window, to let it do its work in the background instead of being forced above all other applications?

Thanks! Program is awesome!

Nikse555
16th April 2020, 10:37
@tormento/Boulder: A few years back I actually tried to programmatically create a Binary-image-compare-db with all Windows fonts in 5 different sizes... it was slow and, to my surprise, not even very good.
I'm not really sure what the best solution is, but I'm pretty sure it's not one db.

Binary image compare and "backward expansion": If you have a difficult character (like "%" where first part is recognized as "o") - you can fix it in the "Inspect compare matches" (dbl click on line in list view), right click in "Inspect items" and choose "Add better multi match".


@tormento:
>I find it really annoying to have to reopen the same IFO several times for different subs.
In the "Choose language" window you can do a "Save as..." for each language stream id.


@jlw_4049:
To minimize the program while it's OCR'ing: latest beta now has enabled the minimize icon.
Latest beta blinks in the taskbar when OCR has a prompt (up to 25 times, but only if OCR window is not focused).
Dictionaries are saved in... press Win+R, paste "%appdata%\Subtitle Edit\Dictionaries" and press enter. Most (*not all*) dictionaries have a "_user" file... e.g. the Binary-image-compare-db "latin.db" does not.

Please test latest beta :)
https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.14/SubtitleEditBeta.zip
(can also use "Tesseract 5 alpha")
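For carrying dictionaries over between installations, a small Python sketch that copies the folder mentioned above to a backup location. It assumes Windows (it reads the `APPDATA` environment variable) and the path given in this post; the function name and the `dest` argument are made up for illustration.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def backup_dictionaries(dest: Path, source: Optional[Path] = None) -> Path:
    """Copy the Subtitle Edit dictionaries folder to `dest` and return `dest`."""
    if source is None:
        # Same location as "%appdata%\Subtitle Edit\Dictionaries" (Windows only).
        source = Path(os.environ["APPDATA"]) / "Subtitle Edit" / "Dictionaries"
    # dirs_exist_ok=True lets the backup be refreshed in place (Python 3.8+).
    shutil.copytree(source, dest, dirs_exist_ok=True)
    return dest
```

Since "latin.db" has no "_user" counterpart, copying the whole folder also keeps the trained Binary-image-compare database.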

tormento
16th April 2020, 11:55
@tormento/Boulder: A few years back I actually tried to programmatically create a Binary-image-compare-db with all Windows fonts in 5 different sizes... it was slow and, to my surprise, not even very good.
It's not necessary to build one for every single font. Subtitles mainly use Arial/Helvetica derivatives, and for the "strange ones" we can build a db on our own. That's why I suggested using more than one db, so we can add more characters without "polluting" our standard font sets. Please take a look at SupRip and SubRip. Both use more than one set of fonts, and the latter can scan through them to find the best fit.

Nikse555
16th April 2020, 13:54
SE already scans through dbs and chooses the best fit (I hope).
I guess that similar fonts should have their own db for each font size - to avoid "polluting".
What fonts should be used?

tormento
16th April 2020, 15:24
I hope
You're the programmer! You should know! :D
What fonts should be used?
SupRip uses (you can find the file inside, did you look?):
arial-bold.font.txt
arial.font.txt
arial2.font.txt
arial3.font.txt
arial4.font.txt
calibri-variant.font.txt
geneva-bold.font.txt (I think it's Helvetica)
greek-arial.font.txt
narrow.txt (I think it's Arial Narrow)
SubRip uses cryptic file names, but there are 106 font matrices.

At the very least we could start creating multiple databases on our own, given your assurance that SE scans through them; otherwise it would be useless.

Nikse555
16th April 2020, 17:59
@tormento: SE does search for the best match of OCR DBs (only via first sub now, but that could easily be changed)

jlw_4049
16th April 2020, 18:01
@jlw_4049:
To minimize the program while it's OCR'ing: latest beta now has enabled the minimize icon.
Latest beta blinks in the taskbar when OCR has a prompt (up to 25 times, but only if OCR window is not focused).
Dictionaries are saved in... press Win+R, paste "%appdata%\Subtitle Edit\Dictionaries" and press enter. Most (*not all*) dictionaries have a "_user" file... e.g. the Binary-image-compare-db "latin.db" does not.

Please test latest beta :)
https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.14/SubtitleEditBeta.zip
(can also use "Tesseract 5 alpha")

Thank you! I'll try it out and let you know! :)

Once minimized it cannot be re-opened until it's done. Which isn't a negative for me, but just letting you know!

Edit:
Is there any way to do a batch OCR? I had 10 copies of the program running last night, it had a memory leak, and I had to hard power cycle the machine.

That way I could add multiple files at a time and have them done one by one.

eddified
20th April 2020, 03:53
I tried this app out for the first time today. Seems pretty awesome. Thank you so much! My first try reading PGS, I went with default OCR settings, binary image compare (don't do that, results were terrible -- just about every "o" was interpreted as a G). Though almost all text was italic, maybe that was the problem with the bad "

The very first time, it asked me to confirm character by character. Ex: it showed me an "a" and said, "what is this?". I just clicked ok for a while. It took me a few subtitles before I realized I was telling it every character was "" (the empty string) in text. Then I wondered if I had ruined some dictionary by having it save away those bad values. So I ended up starting again from scratch, which requires a time-consuming re-load. By the 3rd try I realized Tesseract was much, much better than the default binary image compare. :)

Some questions:
1) Loading a Blu-ray rip file takes several minutes. Any tricks to speed it up?
2) Is there a better (or different) online forum for this software, other than this thread?
3) If a file has two sets of subs (same language), is there a way to work on one, then load up the other, without re-loading the file from scratch? Looking at your comment, "In the "Choose language" window you can do a "Save as..." for each language stream id." --> I didn't see any "save as" under "options-> choose language". Is that where you were talking about?
4) What does a red duration cell indicate?
5) My biggest, worst problem: I load the file, which takes several minutes, then I do OCR on PGS subs and spend a bunch of time fixing them all up, then I hit "OK" to leave the OCR dialog, and .... all my work is gone. The program doesn't crash. It's just that nothing at all shows up in the main window. So I had nothing to show for all my work. It happened to me several times, so I never got any complete srt out of it, except one time when I thought I was going to lose all my work yet again, I got it to write out an SRT with only a few subs as a test. Not sure what's going on. I'm afraid to put any work into editing subs, for it to all just go away when I click "OK". Please advise. (TL;DR: sometimes the OCR work is copied into the main window, but often not, and I haven't figured out why it behaves one way sometimes, and the other way sometimes.) This is a huge deal breaker for me, as I don't know if all of my work will be lost, or not. I think it might have something to do with click-and-dragging the mkv file (with subs) into the window to load it, vs loading using the "Open" dialog.

Thanks for the hard work!

Nikse555
20th April 2020, 07:23
@jlw_4049:
>Once minimized it cannot be re-opened until it's done. Which isn't a negative for me, but just letting you know!
If I click on the "Maximize icon" in the lower left corner, it comes back...
"Tools" - "Batch convert" can also OCR, but you'll lose the possible bad words etc.

@eddified:
"Binary image compare" takes a bit longer to get running (and learn). You must find the best "x pixels is space value" and you must add letters and fix wrong letters (double click on a line in the list view to inspect/fix). It does not work well for *all* subtitles, but it's especially nice if you got more than one file with the same font.
Tesseract is good too, very easy, but if problems occur they tend to be harder to fix. Also a little slower.

1) It takes about 1 second here for a 25 mb file...
2) This is probably the best forum for OCR stuff
3) That's for Vob ripping from DVD...
4) Red background color in duration cell indicates that the duration is too short or too long - see options - settings - general (min/max/cps)
5) Sorry about that, it's a bug in SE 3.5.14 with drag'n'drop (File - open, works fine)
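To illustrate point 4, a tiny Python sketch of a min/max duration check. The limits used here (1000 ms and 8000 ms) are placeholder values picked for the example, not Subtitle Edit's actual settings, and the function name is hypothetical.

```python
def duration_flag(start_ms: int, end_ms: int,
                  min_ms: int = 1000, max_ms: int = 8000) -> str:
    """Return 'too short', 'too long' or 'ok' for a cue duration."""
    duration = end_ms - start_ms
    if duration < min_ms:
        return "too short"
    if duration > max_ms:
        return "too long"
    return "ok"
```

With these placeholder limits, a cue lasting 876 ms would be flagged as too short.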

And please test latest beta as SE 3.5.15 should be out soon :)
https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.14/SubtitleEditBeta.zip

GCRaistlin
20th April 2020, 13:14
Nikse555
Bugs:

I use Ctrl-Win-X as a system-wide hotkey. Surprisingly, pressing Ctrl-Win-X in SE brings up the 'Delete one line?' prompt window. I believe SE doesn't distinguish between Ctrl-X and Ctrl-Win-X. What is more, I was unable to find out how to disable the Ctrl-X keyboard shortcut.
An attempt to shift one subtitle forward so that it would overlap the next subtitle is silently corrected by SE. Example:

9
00:08:29,018 --> 00:08:30,810
<some text 1>

10
00:08:31,510 --> 00:08:34,018
<some text 2>

If we try to shift subtitle 9 for 800 ms forward then we get "00:08:31,653 --> 00:08:34,161" instead of the expected "00:08:32,310 --> 00:08:34,818".
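The arithmetic behind such a shift can be checked with a short Python sketch that converts SRT timestamps to milliseconds and back. The helper names are made up for illustration, and this is not how SE implements it.

```python
import re

TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def to_ms(stamp: str) -> int:
    """Convert 'HH:MM:SS,mmm' to milliseconds."""
    h, m, s, ms = (int(g) for g in TIME_RE.fullmatch(stamp).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def to_stamp(total_ms: int) -> str:
    """Convert milliseconds back to 'HH:MM:SS,mmm'."""
    s, ms = divmod(total_ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def shift_cue(line: str, delta_ms: int) -> str:
    """Shift both ends of a 'start --> end' line by delta_ms."""
    start, end = line.split(" --> ")
    return f"{to_stamp(to_ms(start) + delta_ms)} --> {to_stamp(to_ms(end) + delta_ms)}"
```

Applied to the times "00:08:31,510 --> 00:08:34,018" from the example, `shift_cue(..., 800)` gives "00:08:32,310 --> 00:08:34,818", the expected values quoted above, while the values SE produced correspond to a shift of 143 ms.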



Feature requests:

It's better to pre-select 'Selected line(s) and forward' instead of 'All lines' in the Adjust all times... dialog window. If we want to shift everything, we'll probably do it without selecting another line after opening the dialog window, so 'Selected line(s) and forward' would mean the same as 'All lines' in most cases; if we select another line, we probably don't want to shift everything, so the default 'All lines' makes no sense in this case. The 'All lines' option seems superfluous to me altogether.
Or at least remember the last used selection in this window.
Disable the mouse wheel in the 'Start time' and 'Duration' fields. Here's why. We click on the Up or Down arrow in the 'Duration' field to adjust the subtitle. Then we want to scroll the subtitle list with the mouse wheel. But instead of scrolling the subtitle list we end up changing the current subtitle's duration!
Don't move the currently selected subtitle to the center of the screen when performing an Undo/Redo action. It prevents us from seeing what we undo/redo.
Show the asterisk that indicates unsaved changes before the filename, not after. This way it will be always visible. Now it isn't if SE window isn't maximized or if there is an OSD of some other application in the top right corner of the screen.


Just in case you missed it: what do you think of my suggestion #1 here (https://forum.doom9.org/showthread.php?p=1904263#post1904263)? It would give us great flexibility when adjusting subtitles. Now we have to perform many manual calculations, e. g. if we want to shift the subtitle start time without changing the subtitle end time.

GCRaistlin
20th April 2020, 15:12
The 'History (for undo)' window isn't intuitive. Example: I wanted to apply a delay of +200 ms to the selected subtitle and forward. I did it, but I wasn't sure whether I had selected "Selected line(s) and forward". I go to Edit -> Show history (for undo) and see there: "Before show selected lines earlier/later: 00:00:00,200". It is completely unclear what that means. But okay. I press "Compare with current" and see that no, I didn't select the right option before applying the delay. I press 'Rollback'. The window closes; this is unexpected - I'd prefer it to stay open to give me the possibility to "Compare with current" again. But okay. For some reason I want to redo the action - I press Ctrl-Y. Then I press Ctrl-Z again. I perform these actions a couple of times, as I'm sure that it doesn't touch anything but the last action. But then I find out that I was wrong - now I can't undo anything that I did before applying this delay!

This is how it should work, in my opinion:

Undoing an action isn't an action itself and should not overwrite other actions in the undo stack. It's the most important thing.
Actions in the history should be named naturally. In my case above - "Delay +00:00:00,200 for all lines". If the delay was applied to the selected lines: "Delay +00:00:00,200 for lines ##10-14". And so on.
Actions that are undone should be displayed in italics.
There should be two buttons in the dialog window: "Undo" and "Redo". If the selected action is undone the "Redo" button is active and "Undo" button is disabled. And vice versa.
If the selected action isn't the last action Undo/Redo are applied to the selected action and all actions after it.
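The first point can be sketched with the conventional two-stack undo/redo model, shown here in Python. This is an illustration of the proposal under simple assumptions (whole document states are stored, class and method names are made up), not SE's implementation.

```python
class History:
    """Linear undo/redo history over document states."""

    def __init__(self, initial):
        self._undo = []   # (label, state before the action)
        self._redo = []   # (label, state after the action)
        self.state = initial

    def apply(self, label, new_state):
        """Record a genuinely new action."""
        self._undo.append((label, self.state))
        self._redo.clear()  # only a new action discards the redo entries
        self.state = new_state

    def undo(self):
        """Step back; the entry moves to the redo stack, nothing is added."""
        if self._undo:
            label, previous = self._undo.pop()
            self._redo.append((label, self.state))
            self.state = previous

    def redo(self):
        """Step forward again; the entry moves back to the undo stack."""
        if self._redo:
            label, following = self._redo.pop()
            self._undo.append((label, self.state))
            self.state = following
```

Because undo and redo only move entries between the two stacks, pressing Ctrl-Z and Ctrl-Y repeatedly leaves the earlier actions undoable.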

GCRaistlin
20th April 2020, 20:58
The cursor in 'Start time' (in the main window) and 'Hour:min:sec:ms' (in 'Adjust all times...' dialog window) fields is in Overwrite mode: what we enter overwrites the current value. The cursor in 'Duration' field is in Insert mode: what we enter doesn't overwrite the current value. It would be better if the cursor was in Overwrite mode everywhere.

Nikse555
21st April 2020, 15:48
@GCRaistlin: thx for the feedback :)
1) About "Ctrl-Win-X" hotkey - I presume that SE should ignore all shortcuts if a Win-key is down, right?
2) Undo... I think what I actually wanted to show is more of an "Event log" - where *all* entries get saved and can be rolled back to. I am really annoyed by the undo in many gfx applications, where you undo something that you want to re-do later, but suddenly the history has been cleared due to a new change and you cannot re-do as you planned (and changes are lost)! I can see that the current "History for undo" is kinda stuck in the middle...
3) Duration field overwrite mode... yes, please. But I actually don't know how.
4) I can see the idea in the asterisk that indicates unsaved changes before the filename, not after...

GCRaistlin
21st April 2020, 17:10
1) About "Ctrl-Win-X" hotkey - I presume that SE should ignore all shortcuts if a Win-key is down, right?
Yes. Besides that I believe that all keyboard shortcuts should be customizable (since you have implemented this for some of them anyway).
I am really annoyed by the undo in many gfx applications, where you undo something that you want to re-do later, but suddenly the history has been cleared due to a new change and you cannot re-do as you planned (and changes are lost)!
That may make sense, but repeatedly undoing/redoing this way kills the possibility of undoing previous actions. I believe it is worse than a new change killing the history, because it is completely unexpected.
3) Duration field overwrite mode... yes, please. But I actually don't know how.
You did it for 'Start time' field but you don't know how to do it for 'Duration' field?

Can you please make the 'Adjust all times' dialog window close automatically when 'Show earlier' or 'Show later' is pressed?

Nikse555
21st April 2020, 18:02
Yes. Besides that I believe that all keyboard shortcuts should be customizable (since you have implemented this for some of them anyway).
Yes, not many hardcoded shortcuts left now :)
Latest beta should ignore shortcuts when the Windows-Key is pressed:
https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.14/SubtitleEditBeta.zip
Does that work for you?


You did it for 'Start time' field but you don't know how to do it for 'Duration' field?
Yeah, sorry. Winforms does not have the best controls in the world. It's of course possible in some way but there's no quick fix to do this for a "NumericUpDownControl" as far as I know.


Can you please make the 'Adjust all times' dialog window close automatically when 'Show earlier' or 'Show later' is pressed?
I often keep the window open or click "Show later" multiple times...

Nikse555
22nd April 2020, 06:49
Titlebar now also changed so asterisk is first: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.14/SubtitleEditBeta.zip

tormento
26th April 2020, 09:05
Titlebar now also changed so asterisk is first
Hi Nikse!

One regression (to me) was the "fix continuation style". I was really happy with the removal of leading … and now that new rule sometimes makes a mess.

Some requests:
in the binary compare dialogue, right clicking in the input field offers some commonly used characters from different language families. You forgot Italian: we use the same vowels as Spanish but with an open accent, i.e. àèìòù and é, plus all their capitals. For the lower case it's easy, as we have them on the keyboard; for the upper case I have to use charmap. And please add French letters too, such as âêîôû and their capitals. Perhaps it's better to make a separate menu for vowels and their variations only, as they repeat in a lot of languages.
set a new rule to find mixed case words, usually OCR errors, such as RAvEN or raVen, would you?
the ability to reorder the rules in the fix errors dialogue or preferences
move the big red Italic warning to the right of the character to type, it's more visible there
add a green Regular warning too
add a flag not to add the typed character to the OCR DB: sometimes there are strange ones that would cause issues in later OCR
report the OCR progress on the Subtitle Edit taskbar icon (as other apps do) and/or ring a bell when it needs attention
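The mixed case rule from the second request could look roughly like this in Python. It is only a sketch of the idea: a word is flagged unless it is all lowercase, all uppercase, or capitalized, so legitimate names such as "McDonald" would be flagged too. The function name is hypothetical, and the check is meant for alphabets that have upper and lower case.

```python
import re

# Runs of letters only (digits, underscores and punctuation split words).
WORD = re.compile(r"[^\W\d_]+")


def mixed_case_words(text: str) -> list:
    """Return words whose casing is neither lower, UPPER nor Capitalized."""
    suspicious = []
    for w in WORD.findall(text):
        capitalized = w[0].isupper() and w[1:].islower()
        if not (w.islower() or w.isupper() or capitalized):
            suspicious.append(w)
    return suspicious
```

A rule like this would probably work best as a "possible error" listing rather than an automatic fix, since the correct casing cannot be derived from the word alone.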

Thanks!

Nikse555
26th April 2020, 15:03
Hi tormento!

I'm just finishing up SE 3.5.15 and am only doing bug fixes right now - but I will keep your input in mind.

The "Continuation style" is now default "not-enabled" (works like it used to). You can enable it (or change style or disable it) via Options -> Settings -> General - and "Continuation style" in profile.

The SE OCR taskbar icon will now blink (for a while) when input is required and SE is not focused. Is that what you mean by #7?

Last chance to find bugs before 3.5.15 final :)
https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.14/SubtitleEditBeta.zip

tormento
26th April 2020, 15:11
The "Continuation style" is now default "not-enabled"
Nice and I would like to see "remove leading ..." back :)
Is that what you mean by #7?
Yep and I'd like some bells too. :)
Last chance to find bugs before 3.5.15 final :)
There is a really strange bug when smart fixing Italian subs that transforms every I or l into L. I reported it a few posts ago. It's rare but sometimes it happens. I will check whether it happens again and give you some subs to test.

jlw_4049
27th April 2020, 05:35
Does the batch tool use tesseract? There is no way to select which one it uses. So I am unsure as to which is being used.

Nikse555
27th April 2020, 06:53
@tormento: "Remove leading ..." is back when not using "Continuation style".
Bells... at first blink only or at every blink? or ?

@jlw_4049: Yes, the batch tool uses Tesseract, with last used language.

tormento
27th April 2020, 08:15
@tormento: "Remove leading ..." is back when not using "Continuation style". Bells... at first blink only or at every blink? or ?

Thanks, you could move remove leading to continuation style in options.

For every, let’s say, 5 blinks. If you have time put a value in options, with enabling or disabling sound.

Janusz
28th April 2020, 09:35
Hello.

Sorry, I don't know English so I used a translator.

Many thanks to the author for this excellent program and to everyone who develops it.
I have used it for many years, mainly to extract subtitles from the *.ts streams broadcast by TV stations. I use nOCR because I think this method is unrivaled in terms of speed and accuracy.

Now my comments:

1. With one station, there is not enough space between the first and second line of subtitles, and OCR cannot deal with it.

https://forum.doom9.org/attachment.php?attachmentid=17310&stc=1&d=1588067414

https://forum.doom9.org/attachment.php?attachmentid=17311&stc=1&d=1588067512


2. The same glyph is used as ',' (comma) and as ' (apostrophe), e.g. in English spelling. I worked around this problem by combining ' with a letter, but a combination of characters created with "expand", e.g. <' a>, is not visible in the nOCR character database. The same goes for, for example, % ("o" extended by another two characters, "/ o"). Characters created with "expand" are lost when exporting / importing the character database.

3. <No of pixels is space> in <OCR method>
- If I import images from a *.ts file: <No of pixels is space = 4> (this is stored in the configuration file <Settings.xml>, and the value is saved for future use).
- If I import images from index.html: <No of pixels is space = 12>, always,
no matter that I changed this value before.
- <No of pixels is space> for italics. Here it would be useful to have a separate value, for example 4 for regular text and 3 for italics. Decreasing the space by 1 for italics generates far fewer errors with joined words, especially for <j> in italics after <w, r, A>.

Best regards, Janusz.

GCRaistlin
28th April 2020, 13:36
Latest beta should ignore shortcuts when the Windows-Key is pressed:
Does that work for you?

Yes, it does, thanks.

I often keep the window open or click "Show later" multiple times...
You can implement it as an option. In my opinion it's handier to close the dialog and then open it again with a keyboard shortcut.
Also, it would be nice to have the following things implemented here:

The time input field is active after opening the dialog.
Underscored letters for quick access to the radio buttons and keys via Alt-<letter>: Show earlier, Show later, Selected line(s) only, Selected line(s) and forward, All lines.

This way one can call the dialog and have access to all its controls with the keyboard and, what is more, by using only his left hand (except 'All lines' but it doesn't really matter).

Is there a way to expand the selection forward when performing OCR? The first part of Cyrillic "ы" coincides with Cyrillic "ь".
Found a solution: delete "ь" from the database, define "ы", define "ь" again.
It isn't the best solution though as "ь" could be defined when performing OCR of a different sup file so it may not be easy to redefine "ь" again. Can you please add the possibility to add a better match with expanding the selection?

It is not possible to load a DVD sup file from within the UI, although it works when the file is supplied as a command line argument.

GCRaistlin
30th April 2020, 13:10
Feature requests:

Check the file being opened for errors (erroneous example (https://mir.cr/6FD66P54)). DVDSubEdit does.
A 'Skip current subpic' button in the 'Manual image to text' OCR dialog. It would make it possible to continue the OCR instead of aborting it when the scanned file contains errors (erroneous example (https://mir.cr/018C3DT1), see #358).

GCRaistlin
30th April 2020, 14:15
An issue similar to the one I reported earlier (https://forum.doom9.org/showthread.php?p=1904263#post1904263) (# 2). The same example as for FR # 2 above. Try to OCR it with my Cyrillic.db (https://mir.cr/12QAJSAB). If you start from the beginning, the 2nd word in the 2nd line of # 231 will be recognized as "остаьаться". If you start from # 231 itself, it'll be "оставаться".
What is interesting is that in 'Inspect compare matches for current image' dialog the problematic character seems to be recognized properly:
https://i111.fastpic.ru/thumb/2020/0430/73/b3adfdb2898006944048a7f8ef9c4073.jpeg (https://fastpic.ru/view/111/2020/0430/b3adfdb2898006944048a7f8ef9c4073.jpg.html)

darksen
30th April 2020, 20:59
SE 3.5.12 is out :)
@darksen: Did you get a lot of false corrections? Perhaps that could be improved if you posted some.

Sorry for taking so damn long to reply back.
Yes, I always get a lot of that. I make subtitles in Spanish and obviously I use the Spanish rules. Some errors I get very often are about the OCR correction: sometimes, and this is just one example, SE wants to replace Marlboro with Mariboro (l for i). Other times it wants to replace names with their localized versions, like Ivan with Iván. Other times it wants to fix quotes when it doesn't have to. In Spanish the main quotes we use are angled quotation marks (« and »); if there is a quote inside that quote we use double quotes (" and "), and if there is another quote inside that we use single quotes (' and ' ). But SE sometimes wants to replace the double quotes with angled ones even though they are already inside angled quotes, e.g.:

Ivan said: «Blahblahblah "blah-blah-blah" blahblah»

SE wants to fix it like:

Ivan said: «Blahblahblah blah-blah-blah blahblah»

Or sometimes:

«Blahblahblah «blah-blah-blah» blahblah»

It also happens when the dialogue is split in more than one subtitle because SE doesn't detect that angled quotes have been used in the previous sub.
And there have been more errors but I can't remember them all right now. That's why I'm asking to split the "Fix common OCR errors (using OCR replace list)" into two, one informing that the default list is used for said correction and the other that the correction comes from the user list.

Thank you.

GCRaistlin
30th April 2020, 22:34
Feature request: split 'No of pixels is space' ('Import/OCR Blu-ray (.sup) subtitle' dialog window) into 'No of pixels is space after an Italic character' and 'No of pixels is space after a non-Italic character'. These values should definitely differ.
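A hypothetical Python sketch of what the split setting would mean in practice: when joining recognized glyphs, the gap that counts as a space depends on whether the preceding glyph is italic. The `Glyph` structure, the function name and the "gap >= threshold" rule are assumptions made for this illustration; the values 4 and 3 are the ones suggested earlier in the thread.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    text: str
    italic: bool
    gap_after: int  # empty pixel columns before the next glyph


def join_glyphs(glyphs: List[Glyph], space_px: int = 4,
                italic_space_px: int = 3) -> str:
    """Join glyphs, inserting a space where the gap reaches the threshold."""
    out = []
    for i, g in enumerate(glyphs):
        out.append(g.text)
        threshold = italic_space_px if g.italic else space_px
        is_last = i == len(glyphs) - 1
        if not is_last and g.gap_after >= threshold:
            out.append(" ")
    return "".join(out)
```

With a 3-pixel gap after an italic "w", a single threshold of 4 would glue the next word on, while the separate italic threshold of 3 keeps the two words apart.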

GCRaistlin
30th April 2020, 23:35
Feature requests:

Quick paste of characters with acutes and umlauts in 'Inspect compare matches for current image' dialog (as in 'Manual image to text' dialog).
The possibility to paste a subtitle from the clipboard. Here's what I mean. Let's say I have two srt files open in two different copies of SE. In one copy, I right-click on a line and select 'Copy as text to clipboard'. In the other copy, I right-click on a line and select 'Paste from clipboard before'. SE adds a subtitle from the clipboard, preserving everything but the subtitle number (i. e. start time, duration, text). Why do I need this? It may be useful when syncing subtitles (*) that we downloaded from some site to the ones (**) that we ripped from the disc we want to view with (*), when subtitles (**) lack some lines that are present in subtitles (*). This way, we could easily copy-and-paste missing lines from (*) to (**), then shift them by applying a delay, then import time codes from (**) to (*).

Janusz
1st May 2020, 12:23
Feature request: split 'No of pixels is space' ('Import/OCR Blu-ray (.sup) subtitle' dialog window) into 'No of pixels is space after an Italic character' and 'No of pixels is space after a non-Italic character'. These values should definitely differ.

That's exactly what I mean, as I wrote above.

Feature requests:

Quick paste of characters with acutes and umlauts in 'Inspect compare matches for current image' dialog (as in 'Manual image to text' dialog).
The possibility to paste a subtitle from the clipboard. Here's what I mean. Let's say I have two srt files open in two different copies of SE. In one copy, I right-click on a line and select 'Copy as text to clipboard'. In the other copy, I right-click on a line and select 'Paste from clipboard before'. SE adds a subtitle from the clipboard, preserving everything but the subtitle number (i. e. start time, duration, text). Why do I need this? It may be useful when syncing subtitles (*) that we downloaded from some site to the ones (**) that we ripped from the disc we want to view with (*), when subtitles (**) lack some lines that are present in subtitles (*). This way, we could easily copy-and-paste missing lines from (*) to (**), then shift them by applying a delay, then import time codes from (**) to (*).


Instead of starting two sessions of the program, did you try "translator mode" with the second file as "original text"?
With: <Options / Settings / Allow edit of original subtitle> = "enabled"

I would need a new function:
If I mark a line in the "Compare subtitle" window, it is automatically selected in the main program window.