VSFilter and unicode


vlada
6th October 2008, 12:18
Hi,

I have a question regarding VSFilter, AviSynth and Unicode support. Is it possible to use UTF-8 encoded subtitles? There is a field to set encoding in SSA style file. If UTF-8 is supported what should I set here?

Btw. what version of VSFilter should I use? The one from Guliverkli2 or from MPC-HC? Are there any differences between them?

TheRyuu
6th October 2008, 12:32
AFAIK you can use UTF-16 encoded subtitles with vsfilter.
http://www.animereactor.dk/aegisub/vsfilter-2.39c.rar (there's a newer one bundled with latest cccp, too lazy to upload/find a link)

clsid
6th October 2008, 13:19
Quote (vlada): Btw. what version of VSFilter should I use? The one from Guliverkli2 or from MPC-HC? Are there any differences between them?

No difference afaik.

vlada
6th October 2008, 18:53
Thanks for the advice. UTF-16 works correctly. I'm just afraid that Matroska subtitles are UTF-8, so I will have to convert them after demuxing. But hopefully this shouldn't be difficult.

edogawaconan
6th October 2008, 19:12
vsfilter can read UTF-8 encoded subtitles just fine.

mikeytown2
6th October 2008, 19:52
I'm not sure which one is the best, but here are 2 versions of vsfilter
Guliverkli2 - DirectVobSub 2.39 (http://sourceforge.net/project/showfiles.php?group_id=205650&package_id=246121&release_id=541232) 2008-08-28

MPC - Homecinema - mpchc_x86_v1.1.604.0_VSFilter.zip (http://sourceforge.net/project/showfiles.php?group_id=170561&package_id=264678&release_id=609982) 2008-06-28

jiifurusu
7th October 2008, 00:21
If a subtitle file starts with a UTF-8 BOM (Byte Order Mark), VSFilter assumes it is UTF-8 and reads it as such, without performing any character set conversion (other than to UTF-16, which is the in-memory format for all Win32 Unicode applications).
When "streaming" subtitles from a container such as MKV, the stream headers contain information about what encoding the subtitles are in. That is usually ASCII (i.e. 7-bit clean) or UTF-8, rarely anything else.
UTF-8 files generally take up less space than UTF-16 files; UTF-16 can take up to twice as much space. There's almost never any good reason to use UTF-16 in files. (UTF-8 and UTF-16 are two different ways of storing Unicode data. There's also UTF-32, which is by far the easiest to work with programmatically, but it also takes up by far the most space.)

Some very old versions of VSFilter (then mostly known as TextSub or DirectVobSub) do not understand UTF-8 files with a BOM; those are generally the 2.2x versions. Don't use those, they're vastly obsolete. (It'd be like installing Windows 95 on your brand new computer. You wouldn't do that.)

Long story short: UTF-8 works just fine and is a better choice than UTF-16.
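
To put rough numbers on the size difference, here is a small Python sketch. The sample lines are made up for illustration; real subtitle files mix dialogue with ASCII-heavy timing and style fields, so actual results will vary.

```python
# Compare the encoded size of the same text in UTF-8, UTF-16 and UTF-32.
# The sample lines are invented for illustration only.

def encoded_sizes(text):
    """Return the byte length of text in each encoding (BOM not counted)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

ascii_line = "Dialogue: 0,0:00:01.00,0:00:03.00,Default,,0,0,0,,Hello there"
japanese_line = "こんにちは"

for label, line in (("ASCII-only", ascii_line), ("Japanese", japanese_line)):
    print(label, encoded_sizes(line))
```

For ASCII-only text UTF-16 is exactly twice the size of UTF-8. For pure Japanese text UTF-8 needs three bytes per character against two for UTF-16, though the ASCII timing and style fields in a typical script usually tip the total back in UTF-8's favour.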


Related: The Encoding field of a style and the \fe override tag in SSA/ASS formats change their meaning slightly when the subtitle data are stored in a Unicode encoding. When they're used in non-Unicode subtitle data (i.e. files in some ANSI encoding) they control what character set/encoding is used for translating the raw byte stream from the file into Unicode. They also control what font encoding is used for translating the text into glyphs in the font. When they're used in Unicode subtitle files they only control the font encoding. In the case of some older font files that don't have a Unicode mapping table, it might affect whether some characters "work" in the chosen font; for example, some Japanese fonts only work if the encoding is set to 128. (It might also pick different language variations of a character based on the language the encoding is usually used for, due to Han unification in Unicode.)
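
As a rough illustration of the first role described above (turning the raw bytes of a non-Unicode file into Unicode), here is a Python sketch that maps a few Encoding values to code pages. The numbers are Windows charset identifiers (128 is SHIFTJIS_CHARSET, for instance). The table is partial, the function name is made up, and this is not VSFilter's actual code.

```python
# Partial map from SSA/ASS Encoding values (Windows charset IDs) to the
# Python codec for the matching Windows code page. Illustration only.
CHARSET_TO_CODEC = {
    0: "cp1252",    # ANSI_CHARSET (Western European)
    128: "cp932",   # SHIFTJIS_CHARSET
    129: "cp949",   # HANGUL_CHARSET
    134: "cp936",   # GB2312_CHARSET
    136: "cp950",   # CHINESEBIG5_CHARSET
    161: "cp1253",  # GREEK_CHARSET
    162: "cp1254",  # TURKISH_CHARSET
    204: "cp1251",  # RUSSIAN_CHARSET
    238: "cp1250",  # EASTEUROPE_CHARSET
}

def decode_ansi_text(raw, encoding_field, fallback="cp1252"):
    """Decode raw subtitle bytes using the code page implied by Encoding.

    Unknown values (including 1, DEFAULT_CHARSET, which depends on the
    system locale) fall back to the codec given in `fallback`; that
    fallback is an assumption made for this sketch.
    """
    codec = CHARSET_TO_CODEC.get(encoding_field, fallback)
    return raw.decode(codec, errors="replace")

raw = "日本語".encode("cp932")      # bytes as stored in a Shift-JIS file
print(decode_ansi_text(raw, 128))   # decoded with the matching code page
print(decode_ansi_text(raw, 0))     # wrong code page: mojibake
```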

vlada
7th October 2008, 00:58
Thanks for the clarification, maybe the UTF-8 subtitle file I tried didn't start with the UTF-8 BOM. I converted it to UTF-16 and suddenly it was displayed correctly.

IanB
7th October 2008, 09:35
Open your subtitle file in a hex editor (VirtualDub has one under the Tools menu). If it is UTF-8 with a BOM, it will start with the 3-byte BOM code, 0xEF 0xBB 0xBF. The BOM code is invisible in most text editors.

If it starts with ordinary text, e.g. "Now is the time to party.", then it has no BOM and will be treated as plain ordinary ANSI in your current code page, so you may need to convert it to UTF-8 (Notepad's Save As can do this for you).
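
The same check and conversion can also be scripted. A minimal Python sketch, assuming you know which ANSI code page the source file uses (cp1250 below is only an example, and the function names are made up):

```python
import codecs

def has_utf8_bom(data):
    """True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def to_utf8_with_bom(data, source_codec):
    """Return data re-encoded as UTF-8 with a leading BOM.

    Data that already has the BOM is returned unchanged; anything else is
    decoded with source_codec, which the caller has to know or guess.
    """
    if has_utf8_bom(data):
        return data
    text = data.decode(source_codec)
    return codecs.BOM_UTF8 + text.encode("utf-8")

def convert_file(src_path, dst_path, source_codec="cp1250"):
    """Read src_path as raw bytes and write a UTF-8-with-BOM copy."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_utf8_with_bom(data, source_codec))
```

Calling convert_file("subs.ass", "subs_utf8.ass", "cp1250") would write a converted copy (the file names are placeholders). For a file that is already UTF-8 but lacks the BOM, pass "utf-8" as the source codec.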