Standalone Faster-Whisper-XXL - AI auto-transcription-translation [Archive]

VoodooFX

28th April 2023, 13:00

Whisper is a state-of-the-art automatic transcription and translation model, introduced in "Robust Speech Recognition via Large-Scale Weak Supervision".

https://i.imgur.com/DYVm3u6.png

Here are my compiled binaries for newbies: https://github.com/Purfview/whisper-standalone-win

If you don't have a fast CUDA GPU, you can use one for free in Colab.

I set up a ready-to-use Jupyter notebook with Faster-Whisper-XXL here: https://colab.research.google.com/drive/17EE-Ty6do7LKYGGUAoog-tz0QOx5V0q0

Guide on how to use it:
1) Run the first cell to download and set up Faster-Whisper-XXL. [Press the triangle at the front of the cell]
2) When the session connects, press the "Files" icon on the left and drag & drop your audio file(s) there (it's the "/content" folder).
Note: Do not drop video files, as uploading is kinda slow. Remux the video with MKVToolNix, deselecting the video and other unneeded tracks.
3) When the first cell is done [stopped spinning] you can run the second cell to transcribe [on the first run a model will be downloaded].
Note: Adjust Faster-Whisper-XXL parameters to your liking.
4) When you are done working: "Runtime" > "Disconnect and delete runtime".
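
For the remux note in step 2, MKVToolNix also ships a command-line tool, mkvmerge. Here is a rough sketch in Python that only builds the command line for an audio-only remux. The file names are made up, and it assumes mkvmerge is on your PATH if you actually run the command:

```python
def build_audio_only_remux(src, dst):
    """Build an mkvmerge command that copies src to dst without video or subtitle tracks."""
    # --no-video and --no-subtitles drop those track types; audio is copied as-is.
    return ["mkvmerge", "-o", dst, "--no-video", "--no-subtitles", src]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_audio_only_remux("movie.mkv", "movie_audio.mka")
    print(" ".join(cmd))
```

To execute it, the list could be passed to subprocess.run(cmd, check=True). The resulting .mka is then small enough to upload to "/content".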

StainlessS

28th April 2023, 15:14

WOW! VX, you be da man.

VoodooFX

1st May 2023, 13:09

@StainlessS How is the internet at the pub, did you download a release with GPU?

StainlessS

1st May 2023, 18:03

Not yet, I thought that it seemed a little bit weird, what with all of the various downloads necessary.
I did not have a clue what to do with stuff on the github site, did not seem to be anything downloadable.
However, I did find some stuff here:- https://github.com/openai/whisper/discussions/63
which would seem to be the model thingies.
Maybe I down them in pub, but in no great hurry at the moment.
I'll also down the GPU thingy.

EDIT: Yes I know it can auto download the models, but I want offline download and want to know where they come from.

VoodooFX

1st May 2023, 18:53

I did not have a clue what to do with stuff on the github site, did not seem to be anything downloadable.

It seemed that you knew where to download:

EDIT: What kind of speed can one expect (I downed the 170MB-ish CPU version, 1.6-ish GB for GPU ver$ is a bit rich for me, is that much faster ?) ?.

Anyway, on GitHub you should see a "Releases" button on the right side.

Model[s] are downloaded separately, automatically or manually; the link for the models is on the front page of the repo.

You don't want "OpenAI" stuff, it's very slow.

StainlessS

4th May 2023, 05:24

OK, I got it working [with auto download of the model, I downloaded the pytorch models [.pt extension] by mistake, in pub].

I tested with both CPU and GPU versions, GPU pleasantly faster,
01:33:xx movie under GPU/medium.en model took 269 seconds for 933 subtitles.

Thanks for prodding me in this direction.

Emulgator

4th May 2023, 22:30

CPU: I get
2023-05-04 23:28:28.0660903 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:1671 onnxruntime::python::CreateInferencePybindStateModule] Init provider bridge failed.
but after that recognition starts. Aborted manually to see CUDA blow...

CUDA: the same report, then it starts munching...

Still it repeats some lines over and over, but
WOW IS THAT QUICK ! Like 10-fold going from a .wav file.
635s for a 1:43:40 musical movie in English, 1769 subs, songs included.
103 mins playing time transcribed in 10.5 min with ~95% accuracy.
Found slang stuff I was helpless to translate before.
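
The "10-fold" figure can be checked with a little arithmetic. A minimal sketch, using only the duration and run time quoted in this post (the function names are made up for illustration):

```python
def hms_to_seconds(text):
    """Convert 'H:MM:SS' or 'MM:SS' to a number of seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def speed_factor(duration, elapsed_seconds):
    """How many times faster than real time the transcription ran."""
    return hms_to_seconds(duration) / elapsed_seconds


if __name__ == "__main__":
    # 1:43:40 of audio transcribed in 635 s.
    print(round(speed_factor("1:43:40", 635), 1))  # 9.8
```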

CPU 11900K 17%, GPU RTX3080 1% How, well...

Awesome work, many thanks, VoodooFX !

VoodooFX

4th May 2023, 23:09

That message from onnxruntime is "normal", expected nonsense from Microsoft's lib. :)

Still it repeats some lines over and over
Can you cut and share a shorter sample of that audio where it happens? [Test if the cut sample still loops]

EDIT:

going from a .wav file
Use the original audio, don't convert it to anything; results from ffmpeg conversion are noticeably worse for some reason.

EDIT2:

Btw, some people reported that on CPU it's a bit more accurate.

Emulgator

4th May 2023, 23:36

No time to continue right now but I guess it will be the same behaviour as with the other Whisper versions within SubtitleEdit:
If I just restrict and give only a smaller range it works most of the time.

Emulgator

5th May 2023, 08:14

Now let's have this engine on Android (I once had to help a hearing-disabled person using their mobile phone for live transcription),
and finally Big Goo's online-listening would be out of the water...

StainlessS

5th May 2023, 09:12

Btw, some people reported that on CPU it's a bit more accurate.
My CPU and GPU outputs were definitely different, GPU quite often split longer single CPU sub
into two shorter GPU subs. [I think]

EDIT:
I wonder how it would do on the movie SNATCH, with Brad Pitt's Irish gypsy accent [fantastic job by Pitt to make it totally unintelligible].

Snatch/Pitt: https://www.youtube.com/watch?v=Gfzxz7asbZs

VoodooFX

13th May 2023, 06:57

On English audio I get better results with multilingual "medium" model than with "medium.en". :confused:

@Emulgator
@StainlessS
Could you test if new Faster-Whisper r117 runs OK on CUDA? [Executable is small]
And I would be interested in transcription accuracy and speed benchmark vs previous version.

StainlessS

13th May 2023, 18:05

[Executable is small]
Yeah, maybe, but requires
cuBLAS 11.x @ 3.4GB, and
cuDNN 8.x @ unknown size (I have an nVidia devs account somewhere, I'll havta find it).
I'm currently on 50GB data per month [EE @ £20/month], I think maybe I'll up it to 130GB/month next month [think next up is 130GB for £30/month].
Anyway, I'll down them in pub next time I'm there.

EDIT: The models were updated 22hours ago.
EDIT: Actually EE 125GB for £30/month for my 4G+ Router. https://shop.ee.co.uk/sim-only/pay-as-you-go-phones#

VoodooFX

13th May 2023, 19:56

Yeah, maybe, but requires
cuBLAS 11.x @ 3.4GB, and
cuDNN 8.x @ unknown size (I have an nVidia devs account somewhere, I'll havta find it).

EDIT: The models were updated 22hours ago.

Maybe it would work out of the box. Shouldn't that be already installed with CUDA drivers/stuff?

EDIT:
Actually all that stuff should be present in previous version, copying dlls to the same folder from there should work.

VoodooFX

13th May 2023, 20:02

I'm currently on 50GB data per month [EE @ £20/month], I think maybe I'll up it to 130GB/month next month [think next up is 130GB for £30/month].

Get GiffGaff PAYG, unlimited* for £25. [GiffGaff is basically O2]

*Full 4G speed till 80GB, after that it limits to 386kb at daytime, but at night it's full speed again.

StainlessS

13th May 2023, 21:13

Thanx VFX, but I'll stick with faster EE, says in my given link that speed is max 25mbps.
EtherNet, USB etc, tend to have a management overhead of 20%, I assume same for 4G,
So to convert mbps to MB/s, just divide by 10.
Despite what EE says (max 25 mbps) I regularly get 3 or 4 MB/s (as for 30 or 40 mbps), and have
once noticed it at 8MB/s during the night.
That's pretty good speed considering that I'm quite a way from the city urban area (I'm near green parkland),
and only get 2 out of 5 bar signal.
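
The divide-by-10 rule of thumb written out as arithmetic. This is only a sketch; the 20% overhead is the assumption stated above, not a measured figure for 4G:

```python
def mbps_to_megabytes_per_s(mbps, overhead=0.20):
    """Convert a line rate in megabits/s to usable megabytes/s.

    overhead is the assumed fraction of the rate lost to protocol management.
    """
    return mbps * (1 - overhead) / 8


if __name__ == "__main__":
    # With 20% overhead this is the same as dividing by 10.
    print(round(mbps_to_megabytes_per_s(25), 3))
    # With no overhead at all it is a plain divide by 8.
    print(round(mbps_to_megabytes_per_s(25, overhead=0.0), 3))
```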

EDIT: I presume that we get charged for the management overhead, and that it is included in max 25 mbps.

*Full 4G speed till 80GB, after that it limits to 386kb at daytime, but at night it's full speed again.
Did not know that, thanx.

VoodooFX

15th May 2023, 18:10

@StainlessS
I could make a release [or separate download] including the Nvidia libs, but I don't know which libs are actually needed, and I don't wanna include the whole 4GB of stuff.
Could someone check which libs are actually needed, by copying dlls one by one, on each error, from the "b103" release [there the libs are located at "Whisper-Faster\torch\lib"] to the same folder as whisper.exe from "r117"?
That would need a Windows install with only the Nvidia drivers installed, without CUDA Toolkit & cuDNN.

Emulgator

16th May 2023, 16:47

If swapped into the same python-laden folder as r103, r117 throws error:
"Could not load cudnn_ops_infer64_8.dll. Error code 126.
Make sure that cudnn_ops_infer64_8.dll is in your path."
That file was indeed in torch\lib, together with 36 more .dlls.
I copied that side by side to 117, then
"Could not load cudnn_cnn_infer64_8.dll. Error code 126.
Make sure that cudnn_cnn_infer64_8.dll is in your path."
I copied it side by side, then still the same fault.

Win10P64, i9-11900K, RTX3080, cudart64_110.dll 6.14.11.11080 in system32.
P.S. This is my replacement system, so no CUDA 11.8 installed yet.

VoodooFX

16th May 2023, 18:49

@Emulgator Thanks for testing it.

r117 is standalone single executable, no need to copy it to r103 folder. [It's not some incremental patch]

"cudnn_cnn_infer64_8.dll" has dependency on "zlibwapi.dll", so copy it too.

Emulgator

17th May 2023, 15:17

Ah, ok. Continuing.
Removed cudart64_110.dll 6.14.11.11080 from system32,
(still no CUDA 11.8 installed).

Now running from a separate folder, containing only the .exe, the .bat and the _models folder,
and adding dependencies as we speak.

We had
cudnn_ops_infer64_8.dll
cudnn_cnn_infer64_8.dll

You hinted
zlibwapi.dll
I added that.

Now it asked for
cublasLt64_11.dll
I added that.

Then it asked for
cublas64_11.dll
I added that.

Start: Success ! Now it runs with just these 5 dependencies.
14..17..18% on CPU, 3 of 16 cores (1 is 90%, 1 is 66%, 1 is 33%), 1% GPU.
Load distribution looks the same as with r103.
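
For anyone repeating this, a small sketch that checks a folder for the five DLLs named above. The list is taken from this post only; whether other releases or other GPUs need more is not something this script can tell, and the function name is made up:

```python
from pathlib import Path

# The five dependencies r117 asked for in the test above.
REQUIRED_DLLS = [
    "cudnn_ops_infer64_8.dll",
    "cudnn_cnn_infer64_8.dll",
    "zlibwapi.dll",
    "cublasLt64_11.dll",
    "cublas64_11.dll",
]


def missing_dlls(folder):
    """Return the names from REQUIRED_DLLS that are not present in folder."""
    # Compare case-insensitively, since Windows file names are case-insensitive.
    present = {p.name.lower() for p in Path(folder).iterdir() if p.is_file()}
    return [name for name in REQUIRED_DLLS if name.lower() not in present]


if __name__ == "__main__":
    # Run from the folder that holds whisper.exe.
    print(missing_dlls("."))
```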

Speed: will tell when it is finished. Feels the same range as with r103.
Then I will compare again r103 vs. r117 apples-to-apples.

Finished: 949s for the same movie. 50% slower.
Was stuck quite a bit at 01:37:40.700 (song beginning)
generating of subtitles ended there at #1757,
so it did not reach the movie's end at 01:43:40

Repeated r103: 633s for the same movie, ran until the end.
At the moment r103 within its full dependency bag looks better.

nVidia Driver on this (older, clone father, now replacement) system SSD: 462.75

VoodooFX

17th May 2023, 23:53

Removed cudart64_110.dll 6.14.11.11080 from system32
Why did you remove it? Maybe it's present in another folder?

generating of subtitles ended there at #1757,
so it did not reach the movie's end at 01:43:40
Subtitle numbers don't mean anything, several lines could be merged into one sub; you need to check the actual subtitles.

At the moment r103 looks better.
Did you check transcription differences?

By default r117 runs int8 quantization on GPU, while r103 runs float16. [On CPU both use int8]
I changed that because a few users reported that int8 is more accurate than float16 and that the speed is the same.

Quantization can be set by "--compute_type".

EDIT:
@Emulgator
Could you test these 2 short files with "medium": https://we.tl/t-S5gnRvMuQB , with "--compute_type=float16" & "--compute_type=int8" on CUDA, and share the 4 srt files?
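
A sketch of how the four runs could be scripted. It only builds and prints the command lines; the flags are the ones used in this thread ("--language", "--model", "--compute_type"), while the two file names are assumed from the test runs posted later and the helper name is made up:

```python
from itertools import product


def build_commands(exe, files, compute_types, model="medium", language="en"):
    """Build one whisper.exe command line per (file, compute type) pair."""
    return [
        [exe, audio, "--language", language, "--model", model, f"--compute_type={ct}"]
        for audio, ct in product(files, compute_types)
    ]


if __name__ == "__main__":
    commands = build_commands(
        "whisper.exe",
        ["test_original.aac", "test_ffmpeg6.wav"],  # assumed file names
        ["float16", "int8"],
    )
    for cmd in commands:
        print(" ".join(cmd))
```

Each list can be handed to subprocess.run(cmd, check=True) to actually run it.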

Emulgator

18th May 2023, 11:41

Originally Posted by Emulgator:
Removed cudart64_110.dll 6.14.11.11080 from system32

Why did you remove it? Maybe it's present in another folder?

To make sure that only the .dlls I should introduce are loaded and not an additional dependency not accounted for.

Ah, well, 8bit vs. 16bit can make all the difference !
I feed it a 32-bit float .wav decoded from the DVD .ac3 track and use the large multilingual model only.

Comparing the 3 runs from a 25fps-speed-up 1961 musical movie English soundtrack,
quick, cockney and other slang talking, interleaved with songs
using WinMerge triple comparison:

r103 GPU from 04.05.2023
r103 GPU from 17.05.2023
r117 GPU from 17.05.2023

All versions have their uses and guess differently.
Which is good for me: a wealth to choose from.
Now it is up to the subtitler (me) just to merge the best parts.

Will have to talk Nikse into having 3 editor tabs in SubtitleEdit, muhahaha ;-)

Downloaded your sample, testing soon.

Emulgator

18th May 2023, 14:15

C:\_PROG\! Subtitle Tools\Whisper-Faster_Win.x64_2023.05.13.b117_GPU>whisper.exe "C:\_PROG\! Subtitle Tools\! Testfile VoodooFX 2023 05 18\test_original.aac" --language en --model "large" --compute_type=float16

Standalone Faster-Whisper r117 running on: CUDA

Estimating duration from bitrate, this may be inaccurate
2023-05-18 15:05:00.1132781 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:1671 onnxruntime::python::CreateInferencePybindStateModule] Init provider bridge failed.

[00:00.760 --> 00:02.760] Feeling inspired yet?
[00:02.760 --> 00:03.760] No.
[00:03.760 --> 00:08.890] Thank you.
[00:08.890 --> 00:10.890] I thought you said you were hungry.
[00:12.890 --> 00:17.890] There's a boat tour of the Trakla Island formations this afternoon.
[00:17.890 --> 00:22.890] I was thinking we could go on that and make reservations in town for dinner.
[00:23.890 --> 00:25.890] You could try the Chinese place.
[00:25.890 --> 00:28.890] I don't think I'd survive another dinner in town.
[00:29.890 --> 00:31.890] Even the idea that...
[00:31.890 --> 00:33.890] Does anyone think it's a real town?
[00:36.890 --> 00:38.890] Why would they have a Chinese place?
[00:43.280 --> 00:46.280] Is it okay if I go? I'll meet you on the beach.
[00:47.280 --> 00:49.280] Yeah, sure.
[01:22.270 --> 01:24.270] Someone's making a statement.
[01:24.270 --> 01:26.270] One of the locals, I guess.
[01:29.270 --> 01:31.270] What do you think he's trying to say?
[01:31.270 --> 01:35.270] He's saying that he wants to put a long knife right through her.
[01:35.270 --> 01:40.270] And after you die, he'll hang your body at the airport to scare off the other tourists.
[01:42.270 --> 01:44.270] Seems a bit extreme.
[01:46.270 --> 01:49.270] The Latokans are a melodramatic people.
[01:54.740 --> 01:56.740] I loved your book.
[01:58.740 --> 01:59.740] Sorry?
[02:00.740 --> 02:03.740] You're James Foster. I loved your book.
[02:06.740 --> 02:09.740] Sorry, is that good? I don't mean to put you in the spot.
[02:09.740 --> 02:11.740] No, thank you.
[02:11.740 --> 02:13.740] It's just, um...
[02:13.740 --> 02:15.740] Not a lot of people read my book.
[02:15.740 --> 02:17.740] I'm Gabby Bauer.
[02:17.740 --> 02:19.740] I'm James Foster.
[02:21.740 --> 02:22.740] Alvin!
[02:25.520 --> 02:27.520] This is James Foster.
[02:27.520 --> 02:29.520] Hi, nice to meet you. Albon Bauer.
[02:29.520 --> 02:30.520] Pleasure.
[02:30.520 --> 02:32.520] He wrote your book that I love, The Variable Sheath.
[02:32.520 --> 02:34.520] Oh, yeah, I remember.
[02:34.520 --> 02:36.520] I thought it was brilliant.
[02:36.520 --> 02:37.520] Yes.
[02:37.520 --> 02:41.520] James, do you think I could convince you to join us for dinner this evening?
[02:42.520 --> 02:46.520] I've been seeing you around the resort for a few days now and I would love to get to know you.
[02:46.520 --> 02:49.520] We have a reservation tonight at Yang's.
[03:00.450 --> 03:02.450] Yeah, it was a good...
[03:04.450 --> 03:06.450] ...learning experience.
[03:06.450 --> 03:07.450] All right.
[03:07.450 --> 03:10.450] Is there anything else I can get you?
[03:10.450 --> 03:12.450] Um, that's all I think.
[03:12.450 --> 03:14.450] All right, everyone, please have a great meal.
[03:14.450 --> 03:15.450] Thank you.
[03:15.450 --> 03:19.450] And let me know any time if I can make your experience even more enjoyable.
[03:22.450 --> 03:24.450] He's an interesting guy.
[03:24.450 --> 03:25.450] Yes.
[03:25.450 --> 03:29.450] This resort is labelled in the resort guide as a multicultural dining experience.
[03:30.450 --> 03:32.450] Well, it certainly is an experience.
[03:33.450 --> 03:36.450] So, Albon, what is it you do for a living?
[03:36.450 --> 03:39.450] Oh, architecture. But I'm mostly retired.
[03:39.450 --> 03:42.450] Now I run a journal out of Los Angeles called Glass Pane.
[03:42.450 --> 03:43.450] You're French?
[03:43.450 --> 03:46.450] Oh, no. Swiss first, from Geneva.
[03:46.450 --> 03:48.450] Then Paris, then LA.
[03:49.450 --> 03:52.450] I'm from London first. Then Paris.
[03:52.450 --> 03:53.450] We met there.
[03:53.450 --> 03:54.450] That's how we met.
[03:54.450 --> 03:57.450] But I couldn't get work there, so I made Albon move with me.
[03:58.450 --> 04:00.450] And what do you do?
[04:00.450 --> 04:03.450] Well, I'm an actress, of course.
[04:03.450 --> 04:05.450] Oh, really? She's great.
[04:06.450 --> 04:07.450] For commercials.
[04:07.450 --> 04:09.450] I have a contract with an LA company.
[04:09.450 --> 04:11.450] They've been grooming me.
[04:11.450 --> 04:13.450] I specialize in failing naturally.
[04:14.450 --> 04:17.450] What does that mean? Failing naturally?
[04:18.450 --> 04:22.450] Finding a natural-seeming way to fail at any given task.
[04:22.450 --> 04:24.450] In each of the commercials that I'm in,
[04:24.450 --> 04:27.450] I'm the one who simply can't go on without the product.
[04:27.450 --> 04:29.450] It's ridiculous for me not to have the product.
[04:30.450 --> 04:31.450] Okay.
[04:31.450 --> 04:32.450] Show them.
[04:32.450 --> 04:33.450] No.
[04:33.450 --> 04:34.450] No, you should.
[04:34.450 --> 04:35.450] Yeah.
[04:35.450 --> 04:36.450] Please.
[04:36.450 --> 04:37.450] Do you want to see?
[04:37.450 --> 04:38.450] I want to see.
[04:38.450 --> 04:39.450] Here.
[04:42.450 --> 04:43.450] She's amazing.
[04:56.660 --> 05:02.450] I just...
[05:04.450 --> 05:05.450] I...

Standalone Faster-Whisper operation finished in: 25 seconds

C:\_PROG\! Subtitle Tools\Whisper-Faster_Win.x64_2023.05.13.b117_GPU>pause
Press any key to continue . . .

Emulgator

18th May 2023, 14:16

C:\_PROG\! Subtitle Tools\Whisper-Faster_Win.x64_2023.05.13.b117_GPU>whisper.exe "C:\_PROG\! Subtitle Tools\! Testfile VoodooFX 2023 05 18\test_original.aac" --language en --model "large" --compute_type=int8

Standalone Faster-Whisper r117 running on: CUDA

Estimating duration from bitrate, this may be inaccurate
2023-05-18 15:08:01.9382416 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:1671 onnxruntime::python::CreateInferencePybindStateModule] Init provider bridge failed.

[00:00.760 --> 00:02.760] Feeling inspired yet?
[00:02.760 --> 00:08.890] No, thank you.
[00:08.890 --> 00:10.890] I thought you said you were hungry.
[00:12.890 --> 00:17.890] There's a boat tour of the Trakla Island formations this afternoon.
[00:17.890 --> 00:22.890] I was thinking we could go on that and make reservations in town for dinner.
[00:23.890 --> 00:25.890] You could try the Chinese place.
[00:25.890 --> 00:28.890] I don't think I'd survive another dinner in town.
[00:29.890 --> 00:31.890] Even the idea that...
[00:31.890 --> 00:33.890] Does anyone think it's a real town?
[00:35.890 --> 00:38.890] Why would they have a Chinese place?
[00:43.280 --> 00:46.280] Is it okay if I go? I'll meet you on the beach.
[00:47.280 --> 00:49.280] Yeah, sure.
[01:22.270 --> 01:24.270] Someone's making a statement.
[01:24.270 --> 01:27.270] One of the locals, I guess.
[01:29.270 --> 01:31.270] What do you think he's trying to say?
[01:31.270 --> 01:35.270] He's saying that he wants to put a long knife right through her.
[01:35.270 --> 01:40.270] And after you die, he'll hang your body at the airport to scare off the other tourists.
[01:42.270 --> 01:44.270] Seems a bit extreme.
[01:46.270 --> 01:49.270] The Latokans are a melodramatic people.
[01:54.740 --> 01:56.740] I loved your book.
[01:58.740 --> 01:59.740] Sorry?
[02:00.740 --> 02:03.740] You're James Foster. I loved your book.
[02:06.740 --> 02:09.740] Sorry, is that good? I don't mean to put you in the spot.
[02:09.740 --> 02:11.740] No, thank you.
[02:11.740 --> 02:13.740] It's just, um...
[02:13.740 --> 02:15.740] Not a lot of people read my book.
[02:15.740 --> 02:17.740] I'm Gabby Bauer.
[02:17.740 --> 02:19.740] I'm James Foster.
[02:21.740 --> 02:22.740] Alvin!
[02:25.520 --> 02:27.520] This is James Foster.
[02:27.520 --> 02:29.520] Hi, nice to meet you. Albon Bauer.
[02:29.520 --> 02:30.520] Pleasure.
[02:30.520 --> 02:32.520] He wrote your book that I love, The Variable Sheath.
[02:32.520 --> 02:34.520] Oh, yeah, I remember.
[02:34.520 --> 02:36.520] I thought it was brilliant.
[02:36.520 --> 02:37.520] Yes.
[02:37.520 --> 02:41.520] James, do you think I could convince you to join us for dinner this evening?
[02:42.520 --> 02:46.520] I've been seeing you around the resort for a few days now and I would love to get to know you.
[02:46.520 --> 02:49.520] We have a reservation tonight at Yang's.
[03:00.450 --> 03:02.450] Yeah, it was a good...
[03:04.450 --> 03:06.450] ...learning experience.
[03:06.450 --> 03:07.450] All right.
[03:07.450 --> 03:10.450] Is there anything else I can get you?
[03:10.450 --> 03:12.450] Um, that's all I think.
[03:12.450 --> 03:14.450] All right, everyone, please have a great meal.
[03:14.450 --> 03:15.450] Thank you.
[03:15.450 --> 03:19.450] And let me know any time if I can make your experience even more enjoyable.
[03:22.450 --> 03:24.450] He's an interesting guy.
[03:24.450 --> 03:25.450] Yes.
[03:25.450 --> 03:29.450] This resort is labelled in the resort guide as a multicultural dining experience.
[03:30.450 --> 03:32.450] Well, it certainly is an experience.
[03:33.450 --> 03:36.450] So, Albon, what is it you do for a living?
[03:36.450 --> 03:39.450] Oh, architecture. But I'm mostly retired.
[03:39.450 --> 03:42.450] Now I run a journal out of Los Angeles called Glass Pane.
[03:42.450 --> 03:43.450] You're French?
[03:43.450 --> 03:46.450] Oh, no. Swiss first, from Geneva.
[03:46.450 --> 03:48.450] Then Paris, then L.A.
[03:49.450 --> 03:52.450] I'm from London first. Then Paris.
[03:52.450 --> 03:53.450] We met there.
[03:53.450 --> 03:54.450] That's how we met.
[03:54.450 --> 03:57.450] But I couldn't get work there, so I made Albon move with me.
[03:58.450 --> 04:00.450] And what do you do?
[04:00.450 --> 04:03.450] Well, I'm an actress, of course.
[04:03.450 --> 04:05.450] Oh, really? She's great.
[04:06.450 --> 04:07.450] For commercials.
[04:07.450 --> 04:09.450] I have a contract with an L.A. company.
[04:09.450 --> 04:11.450] They've been grooming me.
[04:11.450 --> 04:13.450] I specialize in failing naturally.
[04:14.450 --> 04:17.450] What does that mean? Failing naturally?
[04:18.450 --> 04:22.450] Finding a natural-seeming way to fail at any given task.
[04:22.450 --> 04:24.450] In each of the commercials that I'm in,
[04:24.450 --> 04:27.450] I'm the one who simply can't go on without the product.
[04:27.450 --> 04:29.450] It's ridiculous for me not to have the product.
[04:30.450 --> 04:31.450] Okay.
[04:31.450 --> 04:32.450] Show them.
[04:32.450 --> 04:33.450] No.
[04:33.450 --> 04:34.450] No, you should.
[04:34.450 --> 04:35.450] Yeah.
[04:35.450 --> 04:36.450] Please.
[04:36.450 --> 04:37.450] Do you want to see?
[04:37.450 --> 04:38.450] I want to see.
[04:38.450 --> 04:39.450] Here.
[04:42.450 --> 04:43.450] She's amazing.
[04:56.660 --> 05:02.450] I just...
[05:04.450 --> 05:05.450] I...

Standalone Faster-Whisper operation finished in: 38 seconds

C:\_PROG\! Subtitle Tools\Whisper-Faster_Win.x64_2023.05.13.b117_GPU>pause
Press any key to continue . . .
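
Spotting the differences between two such console logs by eye is tedious. Here is a minimal sketch that pulls the "[start --> end] text" lines out of two logs and lists only the lines that differ. It assumes the MM:SS.mmm timestamp format seen in these runs (longer files may print hours too), and the sample lines are copied from the float16 and int8 runs on test_original.aac:

```python
import difflib
import re

# Matches lines such as "[00:02.760 --> 00:03.760] No."
LINE_RE = re.compile(r"^\[(\d+:\d{2}\.\d{3}) --> (\d+:\d{2}\.\d{3})\]\s*(.*)$")


def parse_log(text):
    """Return (start, end, text) tuples for every subtitle line in a console log."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            rows.append(match.groups())
    return rows


def diff_logs(log_a, log_b):
    """Return the subtitle lines that differ, prefixed with '- ' (only in a) or '+ ' (only in b)."""
    a = [f"[{s} --> {e}] {t}" for s, e, t in parse_log(log_a)]
    b = [f"[{s} --> {e}] {t}" for s, e, t in parse_log(log_b)]
    return [d for d in difflib.ndiff(a, b) if d[0] in "+-"]


FLOAT16_LOG = """[00:00.760 --> 00:02.760] Feeling inspired yet?
[00:02.760 --> 00:03.760] No.
[00:03.760 --> 00:08.890] Thank you.
[00:08.890 --> 00:10.890] I thought you said you were hungry."""

INT8_LOG = """[00:00.760 --> 00:02.760] Feeling inspired yet?
[00:02.760 --> 00:08.890] No, thank you.
[00:08.890 --> 00:10.890] I thought you said you were hungry."""

if __name__ == "__main__":
    for line in diff_logs(FLOAT16_LOG, INT8_LOG):
        print(line)
```

The same two functions work on whole logs saved to text files, so all four runs can be compared pairwise.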

Emulgator

18th May 2023, 14:16

C:\_PROG\! Subtitle Tools\Whisper-Faster_Win.x64_2023.05.13.b117_GPU>whisper.exe "C:\_PROG\! Subtitle Tools\! Testfile VoodooFX 2023 05 18\test_ffmpeg6.wav" --language en --model "large" --compute_type=float16

Standalone Faster-Whisper r117 running on: CUDA

2023-05-18 15:09:37.3092070 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:1671 onnxruntime::python::CreateInferencePybindStateModule] Init provider bridge failed.

[00:00.760 --> 00:02.760] Feeling inspired yet?
[00:02.760 --> 00:03.760] No.
[00:03.760 --> 00:08.890] Thank you.
[00:08.890 --> 00:10.890] I thought you said you were hungry.
[00:12.890 --> 00:17.890] There's a boat tour of the Trakla Island formations this afternoon.
[00:17.890 --> 00:22.890] I was thinking we could go on that and make reservations in town for dinner.
[00:23.890 --> 00:25.890] You could try the Chinese place.
[00:25.890 --> 00:28.890] I don't think I'd survive another dinner in town.
[00:29.890 --> 00:31.890] Even the idea that...
[00:31.890 --> 00:33.890] Does anyone think it's a real town?
[00:36.890 --> 00:38.890] Why would they have a Chinese place?
[00:43.280 --> 00:46.280] Is it okay if I go? I'll meet you on the beach.
[00:47.280 --> 00:49.280] Yeah, sure.
[01:22.270 --> 01:24.270] Someone's making a statement.
[01:24.270 --> 01:26.270] One of the locals, I guess.
[01:29.270 --> 01:31.270] What do you think he's trying to say?
[01:31.270 --> 01:35.270] He's saying that he wants to put a long knife right through her.
[01:35.270 --> 01:40.270] And after you die, he'll hang your body at the airport to scare off the other tourists.
[01:42.270 --> 01:44.270] Seems a bit extreme.
[01:46.270 --> 01:49.270] The Latokans are a melodramatic people.
[01:54.740 --> 01:56.740] I loved your book.
[01:58.740 --> 01:59.740] Sorry?
[02:00.740 --> 02:03.740] You're James Foster. I loved your book.
[02:06.740 --> 02:09.740] Sorry, is that good? I don't mean to put you in the spot.
[02:09.740 --> 02:11.740] No, thank you.
[02:11.740 --> 02:13.740] It's just, um...
[02:13.740 --> 02:15.740] Not a lot of people read my book.
[02:15.740 --> 02:17.740] I'm Gabby Bauer.
[02:17.740 --> 02:19.740] I'm James Foster.
[02:21.740 --> 02:22.740] Alvin!
[02:25.520 --> 02:27.520] This is James Foster.
[02:27.520 --> 02:29.520] Hi, nice to meet you. Albon Bauer.
[02:29.520 --> 02:30.520] Pleasure.
[02:30.520 --> 02:32.520] He wrote your book that I love, The Variable Sheath.
[02:32.520 --> 02:34.520] Oh, yeah, I remember.
[02:34.520 --> 02:36.520] I thought it was brilliant.
[02:36.520 --> 02:37.520] Yes.
[02:37.520 --> 02:41.520] James, do you think I could convince you to join us for dinner this evening?
[02:42.520 --> 02:46.520] I've been seeing you around the resort for a few days now and I would love to get to know you.
[02:46.520 --> 02:49.520] We have a reservation tonight at Yang's.
[03:00.450 --> 03:02.450] Yeah, it was a good...
[03:04.450 --> 03:06.450] ...learning experience.
[03:06.450 --> 03:07.450] All right.
[03:07.450 --> 03:10.450] Is there anything else I can get you?
[03:10.450 --> 03:12.450] Um, that's all I think.
[03:12.450 --> 03:14.450] All right, everyone, please have a great meal.
[03:14.450 --> 03:15.450] Thank you.
[03:15.450 --> 03:19.450] And let me know any time if I can make your experience even more enjoyable.
[03:22.450 --> 03:24.450] He's an interesting guy.
[03:24.450 --> 03:25.450] Yes.
[03:25.450 --> 03:29.450] This resort is labelled in the resort guide as a multicultural dining experience.
[03:30.450 --> 03:32.450] Well, it certainly is an experience.
[03:33.450 --> 03:36.450] So, Albon, what is it you do for a living?
[03:36.450 --> 03:39.450] Oh, architecture. But I'm mostly retired.
[03:39.450 --> 03:42.450] Now I run a journal out of Los Angeles called Glass Pane.
[03:42.450 --> 03:43.450] You're French?
[03:43.450 --> 03:46.450] Oh, no. Swiss first, from Geneva.
[03:46.450 --> 03:48.450] Then Paris, then LA.
[03:49.450 --> 03:52.450] I'm from London first. Then Paris.
[03:52.450 --> 03:53.450] We met there.
[03:53.450 --> 03:54.450] That's how we met.
[03:54.450 --> 03:57.450] But I couldn't get work there, so I made Albon move with me.
[03:58.450 --> 04:00.450] And what do you do?
[04:00.450 --> 04:03.450] Well, I'm an actress, of course.
[04:03.450 --> 04:05.450] Oh, really? She's great.
[04:06.450 --> 04:07.450] For commercials.
[04:07.450 --> 04:09.450] I have a contract with an LA company.
[04:09.450 --> 04:11.450] They've been grooming me.
[04:11.450 --> 04:13.450] I specialize in failing naturally.
[04:14.450 --> 04:17.450] What does that mean? Failing naturally?
[04:18.450 --> 04:22.450] Finding a natural-seeming way to fail at any given task.
[04:22.450 --> 04:24.450] In each of the commercials that I'm in,
[04:24.450 --> 04:27.450] I'm the one who simply can't go on without the product.
[04:27.450 --> 04:29.450] It's ridiculous for me not to have the product.
[04:30.450 --> 04:31.450] Okay.
[04:31.450 --> 04:32.450] Show them.
[04:32.450 --> 04:33.450] No.
[04:33.450 --> 04:34.450] No, you should.
[04:34.450 --> 04:35.450] Yeah.
[04:35.450 --> 04:36.450] Please.
[04:36.450 --> 04:37.450] Do you want to see?
[04:37.450 --> 04:38.450] I want to see.
[04:38.450 --> 04:39.450] Here.
[04:42.450 --> 04:43.450] She's amazing.
[04:56.660 --> 05:02.450] I just...
[05:04.450 --> 05:05.450] I...

Standalone Faster-Whisper operation finished in: 21 seconds

C:\_PROG\! Subtitle Tools\Whisper-Faster_Win.x64_2023.05.13.b117_GPU>pause
Press any key to continue . . .

Emulgator

18th May 2023, 14:17

C:\_PROG\! Subtitle Tools\Whisper-Faster_Win.x64_2023.05.13.b117_GPU>whisper.exe "C:\_PROG\! Subtitle Tools\! Testfile VoodooFX 2023 05 18\test_ffmpeg6.wav" --language en --model "large" --compute_type=int8

Standalone Faster-Whisper r117 running on: CUDA

2023-05-18 15:11:27.5628509 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:1671 onnxruntime::python::CreateInferencePybindStateModule] Init provider bridge failed.

[00:00.760 --> 00:02.760] Feeling inspired yet?
[00:02.760 --> 00:08.890] No, thank you.
[00:08.890 --> 00:10.890] I thought you said you were hungry.
[00:12.890 --> 00:17.890] There's a boat tour of the Trakla Island formations this afternoon.
[00:17.890 --> 00:22.890] I was thinking we could go on that and make reservations in town for dinner.
[00:23.890 --> 00:25.890] You could try the Chinese place.
[00:25.890 --> 00:28.890] I don't think I'd survive another dinner in town.
[00:29.890 --> 00:31.890] Even the idea that...
[00:31.890 --> 00:33.890] Does anyone think it's a real town?
[00:36.890 --> 00:38.890] Why would they have a Chinese place?
[00:43.280 --> 00:46.280] Is it okay if I go? I'll meet you on the beach.
[00:47.280 --> 00:49.280] Yeah, sure.
[01:22.270 --> 01:24.270] Someone's making a statement.
[01:24.270 --> 01:26.270] One of the locals, I guess.
[01:29.270 --> 01:31.270] What do you think he's trying to say?
[01:31.270 --> 01:35.270] He's saying that he wants to put a long knife right through her.
[01:35.270 --> 01:40.270] And after you die, he'll hang your body at the airport to scare off the other tourists.
[01:42.270 --> 01:44.270] Seems a bit extreme.
[01:46.270 --> 01:49.270] The Latokans are a melodramatic people.
[01:54.740 --> 01:56.740] I loved your book.
[01:58.740 --> 01:59.740] Sorry?
[02:00.740 --> 02:03.740] You're James Foster. I loved your book.
[02:06.740 --> 02:09.740] Sorry, is that good? I don't mean to put you in the spot.
[02:09.740 --> 02:11.740] No, thank you.
[02:11.740 --> 02:13.740] It's just, um...
[02:13.740 --> 02:15.740] Not a lot of people read my book.
[02:15.740 --> 02:17.740] I'm Gabby Bauer.
[02:17.740 --> 02:19.740] I'm James Foster.
[02:21.740 --> 02:22.740] Alvin!
[02:25.520 --> 02:27.520] This is James Foster.
[02:27.520 --> 02:29.520] Hi, nice to meet you. Albon Bauer.
[02:29.520 --> 02:30.520] Pleasure.
[02:30.520 --> 02:32.520] He wrote your book that I love, The Variable Sheath.
[02:32.520 --> 02:34.520] Oh, yeah, I remember.
[02:34.520 --> 02:36.520] I thought it was brilliant.
[02:36.520 --> 02:37.520] Yes.
[02:37.520 --> 02:42.520] James, do you think I could convince you to join us for dinner this evening?
[02:42.520 --> 02:46.520] I've been seeing you around the resort for a few days now and I would love to get to know you.
[02:46.520 --> 02:49.520] We have a reservation tonight at Yang's.
[03:00.450 --> 03:02.450] Yeah, it was a good...
[03:04.450 --> 03:06.450] learning experience.
[03:06.450 --> 03:07.450] All right.
[03:07.450 --> 03:10.450] Is there anything else I can get you?
[03:10.450 --> 03:12.450] Um, that's all I think.
[03:12.450 --> 03:14.450] All right, everyone, please have a great meal.
[03:14.450 --> 03:15.450] Thank you.
[03:15.450 --> 03:20.450] And let me know any time if I can make your experience even more enjoyable.
[03:22.450 --> 03:24.450] He's an interesting guy.
[03:24.450 --> 03:25.450] Yes.
[03:25.450 --> 03:30.450] This resort is labeled in the resort guide as a multicultural dining experience.
[03:30.450 --> 03:33.450] Well, it certainly is an experience.
[03:33.450 --> 03:36.450] So, Albon, what is it you do for a living?
[03:36.450 --> 03:39.450] Oh, architecture. But I'm mostly retired.
[03:39.450 --> 03:42.450] Now I run a journal out of Los Angeles called Glass Pane.
[03:42.450 --> 03:43.450] You're French?
[03:43.450 --> 03:46.450] Oh, no. Swiss first, from Geneva.
[03:46.450 --> 03:48.450] Then Paris, then LA.
[03:49.450 --> 03:52.450] I'm from London first. Then Paris.
[03:52.450 --> 03:53.450] We met there.
[03:53.450 --> 03:54.450] That's how we met.
[03:54.450 --> 03:57.450] But I couldn't get work there, so I made Albon move with me.
[03:58.450 --> 04:00.450] And what do you do?
[04:00.450 --> 04:03.450] Well, I'm an actress, of course.
[04:03.450 --> 04:05.450] Oh, really? She's great.
[04:06.450 --> 04:07.450] For commercials.
[04:07.450 --> 04:09.450] I have a contract with an LA company.
[04:09.450 --> 04:11.450] They've been grooming me.
[04:11.450 --> 04:13.450] I specialize in failing naturally.
[04:14.450 --> 04:17.450] What does that mean? Failing naturally?
[04:17.450 --> 04:22.450] Finding a natural-seeming way to fail at any given task.
[04:22.450 --> 04:24.450] In each of the commercials that I'm in,
[04:24.450 --> 04:27.450] I'm the one who simply can't go on without the product.
[04:27.450 --> 04:29.450] It's ridiculous for me not to have the product.
[04:30.450 --> 04:31.450] Okay.
[04:31.450 --> 04:32.450] Show them.
[04:32.450 --> 04:33.450] No.
[04:33.450 --> 04:34.450] No, you should.
[04:34.450 --> 04:35.450] Yeah.
[04:35.450 --> 04:36.450] Please.
[04:36.450 --> 04:37.450] Do you want to see?
[04:37.450 --> 04:38.450] I want to see.
[04:38.450 --> 04:39.450] Here.
[04:42.450 --> 04:43.450] She's amazing.
[04:56.660 --> 05:02.450] I just...
[05:04.450 --> 05:05.450] I...

Standalone Faster-Whisper operation finished in: 38 seconds

C:\_PROG\! Subtitle Tools\Whisper-Faster_Win.x64_2023.05.13.b117_GPU>pause
Press any key to continue . . .

Emulgator

18th May 2023, 14:22

Tiny differences: float16 was quicker than int8, and ffmpeg6.wav with float16 was quickest.
.aac was slower.
As I thought: Precision pays off ?
After all it is about cross-comparing spectrograms,
and tiny losses in density differences can lead to costlier, because more exhaustive, searches.

All of these were run with the large model; only this one was available on that system for now,
and I did not want to let that machine go onto the internet again after being bluescreened twice
by the last 2 forced M$ Win10 updates, the last time leaving me with an unrepairable system.

I was not aware that M$ had decided the unspeakable from W10 r1803 on:
NOT to perform any registry backups anymore by default...
To save HDD space. WTF?

https://learn.microsoft.com/en-us/troubleshoot/windows-client/deployment/system-registry-no-backed-up-regback-folder

NOT to perform any system restore points anymore by default...
Even deleting manually made ones. WTF?

https://answers.microsoft.com/en-us/windows/forum/all/windows-10-restore-points-are-being-deleted/4ea28db7-105e-420d-924e-b605ea095ab1

https://answers.microsoft.com/en-us/windows/forum/all/why-is-system-restore-off-by-default-for-many/3049a3ac-f77f-4bff-af19-f6fd51184185

https://learn.microsoft.com/en-us/troubleshoot/windows-client/deployment/system-restore-points-disabled

VoodooFX

18th May 2023, 15:24

I need, and asked for, the medium model and srt files. [uploaded somewhere like Wetransfer] :)
But I'll check these large tests too. [saved, so those posts are not needed anymore]

Ah, well, 8bit vs. 16bit can make all the difference !
There are many other quantization types, run --verbose to see all supported on your device.

I give a .wav 32bit float decode from the DVD .ac3 track
Don't. Use original audio.

r103 GPU from 04.05.2023
r103 GPU from 17.05.2023
r117 GPU from 17.05.2023

All versions have their uses and guess differently.

But there was only one "b103" version. Why do you need an old version?

Will have to talk Nikse into having 3 editor tabs in SubtitleEdit, muhahaha ;-)

You can open other SE instances.

Tiny differences, float16 was quicker than int8
Benchmarks on short files don't mean much.

As I thought: Precision pays off ?

Did you mean compute types? I'm not sure how they correlate with accuracy or speed. So far, for me, int8 looks best while float32 is fastest. Some users reported the opposite.

EDIT:
Or did you mean something about the audio? That "wav" test file is only there to check some quirks with FFmpeg v6. For some reason results from v6 can be worse or different; it affects the int types.

Emulgator

18th May 2023, 15:38

Originally Posted by Emulgator
r103 GPU from 04.05.2023
r103 GPU from 17.05.2023
r117 GPU from 17.05.2023

All versions have their uses and guess differently.
But there was only one "b103" version. Why do you need an old version?

These were my test dates, not the .exe dates.
Sorry for the ambiguity.

Originally Posted by Emulgator
I give a .wav 32bit float decode from the DVD .ac3 track
Don't. Use original audio.

This (in this case .ac3) will have to be decoded to uncompressed before FFTing anyway,
and I want to be in control about the decoding precision.
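
A minimal sketch of the decode step described above, assuming ffmpeg is on the PATH. The file names are placeholders; "pcm_f32le" is ffmpeg's encoder name for 32-bit float little-endian PCM. The sketch only builds and prints the command so it can be inspected before running.

```python
import subprocess

def decode_to_float_wav_cmd(src, dst):
    """Build an ffmpeg command that decodes `src` to a 32-bit float PCM WAV."""
    return ["ffmpeg", "-i", src, "-c:a", "pcm_f32le", dst]

def run(cmd):
    """Execute the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(cmd, check=True)

# Placeholder file names, not taken from the thread.
cmd = decode_to_float_wav_cmd("audio.ac3", "audio_f32.wav")
print(" ".join(cmd))  # call run(cmd) to actually decode
```

Whether this pre-decoding improves results at all is exactly what is disputed in the posts above; the snippet only shows how the decode could be done.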

VoodooFX

20th May 2023, 13:44

@Emulgator Could you do a few more tests on aac with CUDA: "--language en --model=large --compute_type=float32" and "--language en --model=medium --compute_type=float16"?
[Results in the same form as your previous tests.]

Btw, for your own tests you can try "--beam_size=5", it's slower but should produce better results.
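
The requested runs could be scripted roughly like this. It is only a sketch: the executable name "whisper-faster.exe" and the file name "test.aac" are assumptions, while the flags are the ones quoted in this post. Timing is plain wall-clock time around the whole process, so it includes model loading.

```python
import subprocess
import time

# Assumption: executable name; adjust to whatever your release ships.
EXE = "whisper-faster.exe"

def build_cmd(audio, model, compute_type, beam_size=None):
    """Return the argument list for one test run."""
    cmd = [EXE, audio, "--language", "en",
           f"--model={model}", f"--compute_type={compute_type}"]
    if beam_size is not None:
        cmd.append(f"--beam_size={beam_size}")
    return cmd

def timed_run(cmd):
    """Run one command and return the wall-clock seconds it took."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

# The two requested combinations, plus one optional beam_size=5 run.
commands = [
    build_cmd("test.aac", "large", "float32"),
    build_cmd("test.aac", "medium", "float16"),
    build_cmd("test.aac", "medium", "float16", beam_size=5),
]

for cmd in commands:
    print(" ".join(cmd))  # inspect first; call timed_run(cmd) to execute
```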

Emulgator

22nd May 2023, 23:23

Soon (...still trying to get my main system up and running as before)

StainlessS

13th July 2023, 23:03

These are basically the two that I've tried [only on a few occasions, maybe 5 or 6],

Whisper-Faster\whisper.exe --model_dir ".\_models" --language en --model "large-v2" ".\audio.wav"

Whisper-Faster\whisper.exe --model_dir ".\_models" --language en --model "large-v2" ".\audio.dts"

I just use above to paste into command line.
It's weird how some subs are flagged <during non talkative periods> maybe up to a minute ahead of the actual start of speech, and stop pretty much at end of speech.
Also,

EDIT: I did one recently on music video {live gig} containing Eng and Spanish, some of the Spanish speech
came out in Spanish, some of it came out translated to English. {& Eng came out Eng}.

EDIT: A few hiccoughs can occur, in one instance, the name "Hiller" was transformed throughout video, into "Hitler" :)
{Perhaps "Captain Steve Hitler" rings a bell}

EDIT: I wonder if it's worth giving it a go on some Star Trek with lots of Klingon, I bet that some of that stuff was
scanned during A.I. training, might auto convert to earthling English.
Spanish/English thingy is Odd.

VoodooFX

14th July 2023, 01:44

whisper.exe --model_dir ".\_models" --language en --model "large-v2" ".\audio.wav"

I think you are using an old version, current is r134.6.
The model_dir parameter is redundant in your example, at least in the latest version.

...up to a minute ahead of the actual start of speech

You probably won't see that in the latest version.

Spanish/English thingy is Odd.
Not odd, Whisper models don't support transcription of multilingual audio. You can try to process it twice, first with the English and then with the Spanish language parameter.
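
The two-pass idea can be sketched as below. The executable name and the audio file name are assumptions; "--language" is the flag used earlier in the thread, and "es" is Whisper's language code for Spanish. Both passes will likely write a subtitle file with the same name, so rename or move the first result before starting the second pass.

```python
# Assumption: executable name; adjust to your release.
EXE = "whisper-faster.exe"

def per_language_cmds(audio, languages):
    """Build one command per language so the same audio is processed once for each."""
    return [[EXE, audio, "--language", lang] for lang in languages]

# English pass first, then Spanish, as suggested above.
passes = per_language_cmds("gig.wav", ["en", "es"])

for cmd in passes:
    print(" ".join(cmd))
```

The two subtitle files then have to be merged by hand, keeping each passage from the pass whose language matches the speech.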

SaurusX

14th July 2023, 15:15

Does this have the word-level timing feature of the original Whisper? I've yet to find a version of this with that word-level timing, CUDA, and the ability to use the HF models.

VoodooFX

14th July 2023, 15:31

Does this have the word-level timing feature of the original Whisper? I've yet to find a version of this with that word-level timing, CUDA, and the ability to use the HF models.

Yes, it's enabled by default. It includes all those things.

SaurusX

14th July 2023, 16:00

VoodooFX

14th July 2023, 17:44

But the examples shown in the OP screenshot and by Emulgator do not show this. They're all specific second-based intervals that are seemingly locked into a particular fraction-of-a-second start point.

No idea what you mean by that, it shows same thing as original Whisper.
Post your screenshot of what your "original Whisper" shows.

SaurusX

14th July 2023, 18:23

No idea what you mean by that, it shows same thing as original Whisper.
Post your screenshot of what your "original Whisper" shows.

"Original Whisper" as in from OpenAI's github repo.

https://github.com/openai/whisper

When using their CLI I add "--word_timestamps True" and the timing of each sentence or segment is more precise. To the fraction of a second usually, though it can hiccup.

I'll add some screenshots later today when I get to my computer.
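
The "locked into a particular fraction-of-a-second" observation can be checked on saved console output with a small parser. This is a sketch that assumes the "[MM:SS.mmm --> MM:SS.mmm] text" line format shown in the logs earlier in the thread; it does not handle an hours field.

```python
import re
from collections import Counter

# Matches lines such as: [01:24.270 --> 01:26.270] One of the locals, I guess.
LINE_RE = re.compile(
    r"\[(\d+):(\d{2})\.(\d{3}) --> (\d+):(\d{2})\.(\d{3})\]\s*(.*)"
)

def parse_line(line):
    """Return (start_ms, end_ms, text), or None if the line is not a segment."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    sm, ss, sf, em, es, ef, text = m.groups()
    start_ms = (int(sm) * 60 + int(ss)) * 1000 + int(sf)
    end_ms = (int(em) * 60 + int(es)) * 1000 + int(ef)
    return start_ms, end_ms, text

def start_fractions(lines):
    """Count how often each millisecond fraction occurs among segment starts."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            counts[f"{parsed[0] % 1000:03d}"] += 1
    return counts

sample = [
    "[01:24.270 --> 01:26.270] One of the locals, I guess.",
    "[01:29.270 --> 01:31.270] What do you think he's trying to say?",
    "[01:54.740 --> 01:56.740] I loved your book.",
]
print(start_fractions(sample))
```

If only a handful of distinct fractions show up across a whole file, the timestamps are segment-level; with word-level timing the fractions should be spread much more widely.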

VoodooFX

14th July 2023, 19:15

When using their CLI I add "--word_timestamps True" and the timing of each sentence or segment is more precise.
Yep, it includes that. The screenshot in the first post is an old one.

SaurusX

15th July 2023, 00:18

I was getting a dll error saying that I was missing "cudnn_ops_infer64_8.dll" and to put it into my system path. I downloaded it from this zip and dropped it into my CUDA bin folder.

https://developer.download.nvidia.com/compute/redist/cudnn/v8.3.0/

The word_timestamps is working as you said it would be. Doing other tests now with the different model sizes.

http://i.ibb.co/pW5yrcr/Whisper-faster.jpg
OK, that's fast. Using the large-v2 model!

http://i.ibb.co/jkDf0mF/whisper-medium-en.jpg
Using the medium.en model.

StainlessS

15th July 2023, 01:22

I think you are using an old version, current is r134.6.
Yeah, probably bout 5 weeks since update.

Not odd, Whisper models don't support transcription of multilingual audio. ...

Well whether it supports it or not, some (but not all) of it (Spanish) was translated to English.

VoodooFX

15th October 2023, 02:00

Compiled Linux and Mac OS X executables. Enjoy it.

xarzu

5th December 2023, 10:52

I got an error which, I assume, is due to the fact that my input is a 4 hour long video, or maybe I have too many windows open, or both.
https://www.likablelogic.org/images/VUE.JS/whisper-faster.png

Emulgator

5th December 2023, 13:19

I can only suggest splitting long files into multiple jobs; it helps a great deal with everything Whisper.
For sequences that are hard to guess, restrict a Whisper job to only that part; repeats and overlooked or bad guesses can then be mended.
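
One way to plan such a split, as a sketch: compute overlapping time ranges and cut the audio with ffmpeg using stream copy. The chunk length, the overlap, and the file names are assumptions chosen for illustration; "-ss", "-t", "-vn" and "-c:a copy" are standard ffmpeg options.

```python
def plan_chunks(total_s, chunk_s=1800.0, overlap_s=5.0):
    """Return (start, end) pairs in seconds covering 0..total_s with a small overlap."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    chunks = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        chunks.append((start, end))
        if end >= total_s:
            break
        start = end - overlap_s  # step back so no speech is cut at a boundary
    return chunks

def ffmpeg_cut_cmd(src, dst, start, end):
    """Build an ffmpeg command that copies the audio of one time range, dropping video."""
    return ["ffmpeg", "-ss", f"{start:.3f}", "-t", f"{end - start:.3f}",
            "-i", src, "-vn", "-c:a", "copy", dst]

# Placeholder input: a 4-hour file cut into roughly 30-minute jobs.
for i, (start, end) in enumerate(plan_chunks(4 * 3600)):
    print(" ".join(ffmpeg_cut_cmd("movie.mkv", f"part_{i:02d}.mka", start, end)))
```

Each chunk's subtitles start at zero, so when merging them back together every timestamp has to be shifted by that chunk's start time.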

VoodooFX

5th December 2023, 19:58

I got an error which, I assume, is due to the fact that my input is a 4 hour long video, or maybe I have too many windows open, or both.

No, that's irrelevant.
From the screenshot you can see that it ran out of video memory.
How much VRAM does your GPU have, and which GPU model is it?
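
On NVIDIA cards both answers can be read from nvidia-smi. A sketch, assuming nvidia-smi is installed with the driver and on the PATH; the sample line fed to the parser is made up for illustration and is not from the screenshot.

```python
import subprocess

# Query GPU name plus total and used memory (values are in MiB with "nounits").
QUERY = ["nvidia-smi",
         "--query-gpu=name,memory.total,memory.used",
         "--format=csv,noheader,nounits"]

def parse_gpu_csv(text):
    """Parse nvidia-smi CSV output into a list of dicts, one per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        # Split from the right so a comma inside the GPU name cannot break parsing.
        name, total, used = [field.strip() for field in line.rsplit(",", 2)]
        gpus.append({"name": name, "total_mib": int(total), "used_mib": int(used)})
    return gpus

def query_gpus():
    """Run nvidia-smi and return the parsed result."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)

# Hypothetical sample output, for illustration only.
print(parse_gpu_csv("NVIDIA GeForce GTX 1050, 2048, 1650\n"))
```

The "used" figure matters as well: other programs holding VRAM reduce what is left for the model.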

VoodooFX

7th April 2024, 13:04

Released Standalone Faster-Whisper-XXL (https://github.com/Purfview/whisper-standalone-win/discussions/231) with the additional features.
You'll want to use them on movies or noisy podcasts containing music, etc.

Yosho

25th April 2024, 22:07

I could use some help with the syntax if you wouldn't mind..

So far I've got C:\Users\Downloads\Faster-Whisper-XXL_r192.3.4_windows\Faster-Whisper-XXL\faster-whisper-xxl.exe "C:\Users\Downloads\blah.mkv" --language English -m large -d CPU -o "C:\Users\Downloads" -f srt --task transcribe

Additionally, is there a way to include this in Subtitle Edit by chance? Or would it need to be added by the app's developer themselves?
Thanks for the help!

VoodooFX

26th April 2024, 01:13

I could use some help with the syntax if you wouldn't mind..

I don't mind.

Additionally, is there a way to include this in Subtitle Edit by chance? Or would it need to be added by the app's developer themselves?

That you should ask the developer of Subtitle Edit.

Yosho

26th April 2024, 06:14

I don't mind.
Great, thanks! What am I missing from the commands I posted to run it to first download the _models and then do the transcription with the whisper?

So far I've got C:\Users\Downloads\Faster-Whisper-XXL_r192.3.4_windows\Faster-Whisper-XXL\faster-whisper-xxl.exe "C:\Users\Downloads\blah.mkv" --language English -m large -d CPU -o "C:\Users\Downloads" -f srt --task transcribe

Thank you for your help!

VoodooFX

26th April 2024, 11:07

What am I missing from the commands...

No idea what you are missing there, you tell me what you are missing... only the input is needed to run it.

I can tell you what I'm missing - I don't see you posting your problem, if you have any.