Unicode File Paths [Archive]

MysteryX

24th March 2017, 04:10

stax76

24th March 2017, 09:08

The ANSI limitation applies to the script file paths as well. Most languages are covered by their ANSI code page, however, and there are hardly any requests for unicode support, only from time to time from tool makers like us.

Since you raised this topic: unicode and console/batch don't work together in Win 7 either, because Win 7 has unicode bugs. Batch files are needed for x265 piping, for instance, so in practice x265 does not support unicode either, at least not for Win 7 users, of which there are still a lot.

Last but not least .NET and Windows are getting long file path support (more than MAX_PATH/260 characters), it can already be used but only with group policy change and manifest entry. I use it already in personal scripts.

Groucho2004

24th March 2017, 09:45

Here's something strange I noticed. I know script files must be in ANSI format and don't support Unicode characters. What about the script file name itself? It works with characters from a variety of languages without problem.

However, if I open a file name with Chinese or Thai characters, it crashes saying "Import: couldn't open ..." followed with the file name with the Chinese characters replaced with ???

Why is this not working? Neither MPC-HC nor VirtualDub opens it.

How can I either open the files, or validate file names to make sure they will work?

I'm using Avisynth+
You have to set your system locale to the correct language. MPC-HC will play the file happily:
https://s9.postimg.org/tvbb9b2pb/Image1.png

Same for a console app (AVSMeter):
https://s9.postimg.org/6fte3yiy7/Image2.png

Strangely, VirtualDub (1.10.4) will not open the file.

In WinNT using NTFS, file names are conveniently stored in Unicode internally, it's up to the programmer to interpret them correctly.

MysteryX

24th March 2017, 17:43

Detecting the character language and shifting the system locale for every file is not an option.

What I'm trying to do is create an AVS file with the same file name as the video and only replacing the extension. It works 96% of the time but a few videos are crashing. I need some way of determining which ones won't work so I can decide to use another file name.

If only ANSI characters were supported, then video names with Arabic characters would also fail, but they work.

Groucho2004

24th March 2017, 18:03

Detecting the character language and shifting the system locale for every file is not an option.
You asked a question, I answered it.

What I'm trying to do is create an AVS file with the same file name as the video and only replacing the extension. It works 96% of the time but a few videos are crashing.
The 96% indicates that you tried at least 25 different names. What languages did you try? What do you mean by "crashing"?

If only ANSI characters were supported, then video names with Arabic characters would also fail, but they work.
Arabic (CP1256) is not a multi byte character set which might explain that. Chinese, Korean and Japanese are MBCS. Do you have trouble with other single byte character sets?

MysteryX

24th March 2017, 18:31

The 96% indicates that you tried at least 25 different names. What languages did you try? What do you mean by "crashing"?
I already mentioned the error message.

I have a database of 500+ videos, so yes I tried at least 25 names. Only 2 or 3 file names with Chinese or Thai characters failed to load.

Arabic (CP1256) is not a multi byte character set which might explain that. Chinese, Korean and Japanese are MBCS. Do you have trouble with other single byte character sets?
This would explain why only Chinese and Thai fail while Arabic and other languages work.

However, detecting those isn't so simple

It's the encoding (character set) that decides whether a specific character is encoded as a single byte or as multiple bytes. For example, if you use ISO-8859-1 as the encoding, the character Ø is encoded as a single byte, but if you use UTF-8, it's encoded as 2 bytes. So to know how many bytes a character will be encoded with, you need to know which character set you're going to transport the text in.
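To make the Ø example concrete, here is a minimal C++ sketch that encodes a single code point both ways. The function names `encode_utf8` and `encode_latin1` are made up for illustration and are not part of AviSynth or any library.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
// Illustrative only: no check for surrogates or values above U+10FFFF.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point as ISO-8859-1.
// Returns an empty string when the code point is not representable.
std::string encode_latin1(char32_t cp) {
    if (cp > 0xFF) return std::string();
    return std::string(1, static_cast<char>(cp));
}
```

For Ø (U+00D8), `encode_latin1(0x00D8).size()` is 1 and `encode_utf8(0x00D8).size()` is 2, while a character such as 夏 (U+590F) has no ISO-8859-1 encoding at all and takes 3 bytes in UTF-8.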

real.finder

24th March 2017, 19:17

Some time ago I was thinking of suggesting UTF-8 support for avs+ scripts. For compatibility, I was thinking of suggesting that old ANSI scripts be converted to UTF-8 internally when the encoding script is UTF-8, with the conversion done only for the scripts that the encoding script actually uses, not for every script in the autoload folder.

don't know if this can be done or not

Groucho2004

24th March 2017, 20:02

I already mentioned the error message.
We appear to have different interpretations of the word "crashing".

Only 2 or 3 file names with Chinese or Thai characters failed to load.
Can you post the names that fail?

TheFluff

24th March 2017, 20:26

Adding Unicode support to Avs+ is probably pretty trivial. You can pass UTF-8 around in AVSValues no problem, so all you need to do is wrap the file I/O functions that exist in a few places (like in import (https://github.com/pinterf/AviSynthPlus/blob/MT/avs_core/core/parser/script.cpp#L409)) with a trivial function that does MultiByteToWideChar and then calls the W-version of the I/O function.
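A rough sketch of the kind of wrapper described above, under a few assumptions: `fopen_utf8` is a hypothetical name, and the UTF-8 to UTF-16 step is written out by hand as a portable stand-in for `MultiByteToWideChar(CP_UTF8, ...)` so the snippet also compiles outside Windows. It does not validate malformed input, which real code should.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Decode UTF-8 into UTF-16 code units (stand-in for MultiByteToWideChar).
// Assumes mostly well-formed input; stray bytes become U+FFFD.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;
        if (b < 0x80)              { cp = b;        extra = 0; }
        else if ((b >> 5) == 0x06) { cp = b & 0x1F; extra = 1; }
        else if ((b >> 4) == 0x0E) { cp = b & 0x0F; extra = 2; }
        else if ((b >> 3) == 0x1E) { cp = b & 0x07; extra = 3; }
        else                       { cp = 0xFFFD;   extra = 0; }
        ++i;
        // Fold in the continuation bytes (6 payload bits each).
        for (std::size_t k = 0; k < extra && i < in.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        if (cp >= 0x10000) {
            // Outside the BMP: emit a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<char16_t>(cp);
        }
    }
    return out;
}

// Hypothetical wrapper: takes UTF-8, calls the wide I/O function on Windows.
std::FILE* fopen_utf8(const char* path, const char* mode) {
#ifdef _WIN32
    const std::u16string p = utf8_to_utf16(path);
    const std::u16string m = utf8_to_utf16(mode);
    const std::wstring wpath(p.begin(), p.end());
    const std::wstring wmode(m.begin(), m.end());
    return _wfopen(wpath.c_str(), wmode.c_str());
#else
    return std::fopen(path, mode);  // most other platforms take UTF-8 as-is
#endif
}
```

The same shape would apply to a `CreateFile` call: convert once at the boundary, call the W variant, and leave every `char*` that is passed around elsewhere untouched.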

You'll be incompatible with old scripts that use some local code page, but re-saving as UTF-8 should hardly be a huge problem for anyone. You should not support local code pages, there is absolutely nothing to be gained from that.

Oh, and then you get to fix the VFW interface. Have fun with that.

Win 7 has unicode bugs
what

no seriously, what

Last but not least .NET and Windows are getting long file path support (more than MAX_PATH/260 characters), it can already be used but only with group policy change and manifest entry. I use it already in personal scripts.
I know this whole "unicode" thing is painfully new to you guys, but seriously now. UNC paths have been around since Windows 2000. I know in Win10 they removed the MAX_PATH restriction for regular paths but that doesn't solve the problem because all the old garbage from the 90's still does "TCHAR filename[MAX_PATH];" somewhere and the user still has to opt in to it. So UNC or bust.

Groucho2004

24th March 2017, 20:38

TheFluff

24th March 2017, 20:46

The real problem here is that one can't expect every program to handle file names with Thai, Chinese, etc characters. If I recall correctly, Win32 CreateFileW() can handle these files. However, most tools use standard C library or STL functions to open/save files which will fail in some cases.

Unicode aware programs like MS Word or EMEditor handle them without problems independent of the system locale.

Also, we're just talking about file names, not the content of these files. Using these file names within scripts opens another can of worms.
That's what I said, though? Literally all the VFW interface does is call env->Import() with the filename it gets from VFW itself, so if you've fixed import you only need to switch the VFW API functions to the W variant (which I am quite sure exist, but can't be arsed to look up on MSDN).

Now, things that interact with the Avisynth API directly instead of going through VFW will of course have to be made aware of the fact that the new hot thing to do is to pass UTF8. Oh. Wait, this is Avisynth and you will never break API backwards compatibility ever. Never mind.

Pretty sure the FFMS2 Avisynth plugin supports UTF8 filenames but breaks on local code page, by the way.

e:
The real problem here is that one can't expect every program to handle file names with Thai, Chinese, etc characters.
I'm pretty sure that in 2017, doom9 is one of very few places on the internet where you will not only hear someone say this, but also expect it to be seen as a reasonable standpoint to have.

TheFluff

24th March 2017, 21:10

Using these file names within scripts opens another can of worms.
It does not. As I mentioned before, I'm prrrretty sure FFMS2 supports this right now and it definitely did so in the past (because I wrote the code that did it, but it has since been replaced with a simpler solution). The only thing you need to do is save the script as UTF8 without BOM. UTF8 can safely be treated as any other array of char. To Avisynth, it's just a regular string, which gets passed to FFMS2, which is char* everywhere in the API, so it goes straight to the libavformat I/O, which does the actual file opening and actually does support converting from UTF8 to the Windows style wchar_t API's.

Groucho2004

24th March 2017, 22:01

It does not. As I mentioned before, I'm prrrretty sure FFMS2 supports this right now and it definitely did so in the past (because I wrote the code that did it, but it has since been replaced with a simpler solution). The only thing you need to do is save the script as UTF8 without BOM. UTF8 can safely be treated as any other array of char. To Avisynth, it's just a regular string, which gets passed to FFMS2, which is char* everywhere in the API, so it goes straight to the libavformat I/O, which does the actual file opening and actually does support converting from UTF8 to the Windows style wchar_t API's.
You're singling out ffms2, what about other filters or Avisynth internal functions that take file names as arguments?

TheFluff

24th March 2017, 22:14

You're singling out ffms2, what about other filters or Avisynth internal functions that take file names as arguments?

AviSource and DirectShowSource are both trivial, they each have like one or two CreateFile or similar function calls that need to be wrapped, everything else can remain unchanged.

Everything else, no idea but it's almost definitely gonna be similar - you replace/wrap the call to fopen/CreateFile and that's it, everything else works by passing the handle around and doesn't need to be changed. That's the entire point of UTF-8; everything that uses 1-byte char encodings keeps working as normal.

It's not like the current situation is any good either - only accepting filenames in your local codepage simply isn't an acceptable solution today. I mean, 7-bit ASCII still works everywhere, but come on, this is 2017. The unicode consortium is so bored it's busy adding entire codepages full of emojis.

Groucho2004

24th March 2017, 22:41

That's the entire point of UTF-8; everything that uses 1-byte char encodings keeps working as normal.
Hm, Russian encoded in CP1251 is a single byte character set. Converted to UTF-8, all (or most, not sure) characters use 2 bytes.
If you're referring to the ASCII character subset (0 - 127) you're right.

TheFluff

24th March 2017, 23:00

You're missing the point. UTF-8 is just a multibyte encoding, and how many bytes you use for encoding a single character isn't at all interesting to anyone, really (unless you're the kind of person who expects strlen to return the number of natural language characters, but in that case you're beyond hope). If your codepage is set to, say, 932 (Japanese, ShiftJIS) almost every character an actual Japanese person is interested in will take more than one byte. Avisynth handles that just fine - you can put Japanese characters in your script all you want as long as the script uses the local codepage. Functions that parse directories from a path string still work because most multibyte encodings (including ShiftJIS and UTF-8) leave the 7-bit ASCII range alone (it doesn't get used in the extra bytes so you can't mistake the second or third byte of some many-byte character for a regular 7-bit ASCII character). The problems arise when you encode your script in one charset and the win32 API non-W functions expect another, which is what MysteryX seems to have done above.
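The path-parsing point can be sketched like this for UTF-8, where every byte of a multi-byte character is 0x80 or above, so a byte-wise search for a separator can never land inside a character. `dir_of` is a hypothetical helper, not an AviSynth function. One caveat worth hedging: as far as I know this guarantee is specific to UTF-8, because Shift-JIS trail bytes can fall in the 0x40-0x7E range (which includes 0x5C, the backslash), so the same byte-wise search is not safe there in general.

```cpp
#include <string>

// Return the directory part of a path by searching for the last '\\' or '/'
// byte. The path is never decoded; it is treated as a plain byte array.
// Safe for UTF-8 input because all bytes of multi-byte characters are >= 0x80.
std::string dir_of(const std::string& path) {
    const std::string::size_type pos = path.find_last_of("\\/");
    if (pos == std::string::npos) return std::string();  // no directory part
    return path.substr(0, pos);
}
```

For a UTF-8 path such as `C:\video\SNH48 - 夏日.avs`, `dir_of` returns `C:\video` without knowing anything about the Chinese characters in the file name.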

So, you need one charset that can represent all characters. Unicode is that, but for historical reasons Windows uses the UTF-16 encoding where one wchar_t is two bytes and you have nulls everywhere so none of the old functions work and there are ABI breaks and so on and so forth. Nobody wants that. That's why you use UTF-8, which is a regular multibyte encoding just like all the other local multibyte encodings so everything that expects a single null byte to terminate a string still works, strlen still works, parsing URL's and filepaths still work etc etc. But the Windows API functions don't support that so before passing strings to them you have to encode UTF-8 to UTF-16. After doing that though you're good.
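A small sketch of why UTF-16 breaks `char*`-based code while UTF-8 does not. The helper names are invented for the example; `utf16le_bytes` just lays out UTF-16 code units as little-endian bytes, which is how they sit in memory on Windows.

```cpp
#include <cstring>
#include <string>

// Serialize UTF-16 code units as little-endian bytes.
std::string utf16le_bytes(const std::u16string& s) {
    std::string out;
    for (char16_t u : s) {
        out += static_cast<char>(u & 0xFF);  // low byte first
        out += static_cast<char>(u >> 8);    // high byte, 0 for ASCII text
    }
    return out;
}

// Length of a byte buffer as an old char*-based C function would see it.
std::size_t c_length(const std::string& bytes) {
    return std::strlen(bytes.c_str());
}
```

With these, `c_length("a.avs")` is 5, and the UTF-8 bytes of `Ø.avs` give 6 (bytes, not characters), but `c_length(utf16le_bytes(u"a.avs"))` is 1 because the high byte of `a` is a null that terminates the string early.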

That make things any clearer?

(e: this is what Windows has done internally for you all the time by the way when you called the old non-W API's, because both FAT32 and NTFS have used Unicode filenames on the filesystem level since the 1990's)

Wilbert

25th March 2017, 00:08

https://forum.doom9.org/showthread.php?p=1420439#post1420439

MysteryX

25th March 2017, 01:28

Wow this thread has gone into all sorts of directions.

We appear to have different interpretations of the word "crashing".
https://s28.postimg.org/3qs478g09/Unicode_Bug.png (https://postimg.org/image/3qs478g09/)

Can you post the names that fail?
SNH48 - 夏日主题泳装.mkv

My question is extremely simple: how to handle this to either make it work with those paths, or detect unsupported characters to remove them. I don't need anything else.

First question is: why does it crash to begin with? Which part is responsible for the crash? If I comment everything from the file and open an empty script file, I still get the same error, so we can discard plugins as being the cause. This more looks like a bug in Avisynth+ for Pinterf to fix.

But meanwhile, I must also find another work-around.

Groucho2004

25th March 2017, 09:51

SNH48 - 夏日主题泳装.mkv
That file name doesn't give me any trouble. I can open it in MPC-HC and VirtualDub even with the system locale set to my usual CP1252 (which I suppose you use too).

First question is: why does it crash to begin with? Which part is responsible for the crash?
The screen shot shows a .avs, not .mkv. I suppose you generate that file name in your software? If so, check your code for proper handling of such names.

Again, it's not a crash if the application displays an error message and can be terminated the usual way.

This is a crash:
http://www.nirsoft.net/utils/wincrashreport_windows_xp.png

TheFluff

25th March 2017, 13:15

That file name doesn't give me any trouble. I can open it in MPC-HC and VirtualDub even with the system locale set to my usual CP1252 (which I suppose you use too).
That's just because VDub and MPC-HC and everything else uses the unicode API's. Reminder that the year is 2017.

The screen shot shows a .avs, not .mkv. I suppose you generate that file name in your software? If so, check your code for proper handling of such names.
That won't help. The name isn't representable in cp1252 so when Windows attempts to translate it from 1252 (which is what you've told it that you're using) to the internal Unicode codepage used in the filesystem, it won't get the right filename and you'll get the "can't open file" message.

My question is extremely simple: how to handle this to either make it work with those paths, or detect unsupported characters to remove them. I don't need anything else.
The only way to really detect it is to try it and see if it fails. You can't "detect" unsupported characters, since the problem isn't really that the characters are unsupported, it's that you haven't told Windows what charset to translate from. A byte sequence that's perfectly valid 1252 and also perfectly valid ShiftJIS may open or not open depending on what the actual Unicode filename of the target file is.

First question is: why does it crash to begin with?
It doesn't crash. That's just the standard way the VFW interface does error reporting. It tries to env->import the .avs file but can't find it, and the nicest way to report an error like that in VFW is to print the error message on the video stream, so that's what it does.

But meanwhile, I must also find another work-around.
Generate a long random string with only 7-bit ASCII contents and use that. It's either that or patch Avisynth.
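One possible way to generate such a name, as a sketch only. The function name, the alphabet and the length are arbitrary choices for illustration, and nothing here checks whether the file already exists.

```cpp
#include <cstddef>
#include <random>
#include <string>

// Build a random file name using only 7-bit ASCII letters and digits,
// followed by the given extension (for example ".avs").
std::string random_ascii_name(std::size_t length, const std::string& extension) {
    static const char alphabet[] = "abcdefghijklmnopqrstuvwxyz0123456789";
    std::random_device rd;
    std::mt19937 gen(rd());
    // sizeof(alphabet) counts the trailing '\0', so the last valid index is size - 2.
    std::uniform_int_distribution<std::size_t> pick(0, sizeof(alphabet) - 2);
    std::string name;
    name.reserve(length + extension.size());
    for (std::size_t i = 0; i < length; ++i) {
        name += alphabet[pick(gen)];
    }
    return name + extension;
}
```

A call like `random_ascii_name(32, ".avs")` yields a 36-character name that any code page will pass through unchanged, at the cost of losing the readable title.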

MysteryX

25th March 2017, 14:33

Generate a long random string with only 7-bit ASCII contents and use that. It's either that or patch Avisynth.
Before I was using the file name "Player.avs", but I wanted to see in the title bar the name of the file playing.

I might look into the source code to see where it crashes. AviSynth shouldn't have to worry about the script file name ... although it exposes it in a few properties, and that can be a problem.

There are only 2 solutions for ScriptFile to work (because obviously it won't work for Unicode characters, although I doubt that's the issue here -- but maybe).

1. Support Unicode characters in some way, but this can open up some can of worms.

2. If non-unicode characters are detected in the path, call GetShortPathName to get the DOS short path. That's what I'm doing when generating the script for the input file name.

The screen shot shows a .avs, not .mkv. I suppose you generate that file name in your software? If so, check your code for proper handling of such names.
I live in a .NET world. I don't ever have to care about character encoding format unless I need to convert strings into ASCII or any other specific encoding for some reason. Writing a file requires no special handling from my part. My OS knows what it's doing.

Again, it's not a crash if the application displays an error message and can be terminated the usual way.

This is a crash:
http://www.nirsoft.net/utils/wincrashreport_windows_xp.png
Playing with semantics won't help us here.

TheFluff

25th March 2017, 14:39

I might look into the source code to see where it crashes. AviSynth shouldn't have to worry about the script file name ... although it exposes it in a few properties, and that can be a problem.
I've told you before that you kinda need to learn to read. It doesn't crash. It obviously needs to know the script file name because it needs to open the script so it can be parsed. Here you go, replace this call to CreateFile and you'll be good: https://github.com/pinterf/AviSynthPlus/blob/MT/avs_core/core/parser/script.cpp#L435

1. Support Unicode characters in some way, but this can open up some can of worms.
it doesn't, I already told you

2. If non-unicode characters are detected in the path, call GetShortPathName to get the DOS short path. That's what I'm doing when generating the script for the input file name.
the 8.3 filenames are not guaranteed to exist so this isn't a reliable solution either

MysteryX

25th March 2017, 14:45

the 8.3 filenames are not guaranteed to exist so this isn't a reliable solution either
http://stackoverflow.com/questions/843843/getshortpathname-unpredictable-results

From here it should be pretty straightforward to fix.

What's a bit trickier is that here the code is getting the path from ScriptName, ScriptFile and ScriptDir -- all 3 need to be converted to 8.3 filename as independent parts, and I need to find where they are set. Edit: sorry it's taking it from args[i] instead. Not sure why it's reading ScriptName, ScriptFile and ScriptDir at the beginning and resetting them to that value at the end.

Also, GetShortPathName doesn't work reliably... does it skip files without special characters, but reliably generate a 8.3 name for file names having unicode characters?

TheFluff

25th March 2017, 15:06

http://stackoverflow.com/questions/843843/getshortpathname-unpredictable-results

From here it should be pretty straightforward to fix.
lol

it really, really isn't, but have fun

What's a bit trickier is that here the code is getting the path from ScriptName, ScriptFile and ScriptDir -- all 3 need to be converted to 8.3 filename as independent parts, and I need to find where they are set.
what

I have no idea what you're doing but I don't think you've understood it either

Also, GetShortPathName doesn't work reliably... does it skip files without special characters, but reliably generate a 8.3 name for file names having unicode characters?

GetShortPathName doesn't generate an 8.3 path, it gets the existing path name from the filesystem... if it exists. Sometimes parts of it exists (some folders on the path may have an 8.3 name, but not all, for example), and then you can get a path that mixes regular filenames and 8.3 ones back, with no warning. The 8.3 names are (or are not) generally set by the system on file creation.

Also, stop thinking of things like "unicode characters" and "non unicode characters". You're fundamentally misunderstanding the technology.

I told you before, but you can't detect if a filename has "unicode characters" in it. It's just a byte array. What you really want to know is "if I translate this byte array from some arbitrary chosen charset to Unicode, will I get a string that says the same thing in natural language that the original byte array did when presented in that arbitrarily chosen charset?". Answering that question with a computer program is really obnoxious and nobody does it. What you can do is attempt the translation with the system's codepage and see if the string you get back will successfully open a file that you think has a matching name. If it doesn't, you at least know you have a problem.

tl;dr: there are number of retarded "solutions" for retards to this problem, but none of them are reliable and most of them are really annoying to implement. Fixing the problem the correct way (using UTF-8 in old API's that use char* for filenames and wrapping/writing shims for system calls with filename parameters) is fairly trivial and the only reasons you think there are "cans of worms" is because you don't understand the underlying technology. Please do it right instead of being retarded.

MysteryX

25th March 2017, 15:50

aahh... I see. Better if I leave this one to Pinterf so that it's fixed the right way instead of hacking around the issue.

I was thinking of reading the file name as Unicode and working from there -- but the issue is getting to the file to begin with. Need to work with the char array passed from args.

For now my best option is to discard non-ASCII characters when generating file names until Pinterf maybe fixes the code.

As for "crash" vs "error"... here's the thing. In the world I live in (.NET), I have a generic error handler for when unhandled exceptions occur to display the error to the user with the stack trace. Whether a problem is handled smoothly or not really makes no difference when it comes to solving those problems. Sure I can say "error" instead of "crash" if it creates confusion, but it makes no difference whatsoever.

MysteryX

26th March 2017, 04:00

For now I'll just use this to strip out non-ascii characters.

FilePath = Regex.Replace(FilePath, @"[^\u0000-\u007F]+", string.Empty);
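For anyone doing the same thing on the C++ side, a comparable sketch that works on a UTF-8 byte string could look like this. `strip_non_ascii` is a made-up name, and the input is assumed to be UTF-8, where every byte belonging to a non-ASCII character is 0x80 or above.

```cpp
#include <string>

// Drop every byte that is not 7-bit ASCII. For UTF-8 input this removes whole
// non-ASCII characters, since all of their bytes are >= 0x80.
std::string strip_non_ascii(const std::string& utf8) {
    std::string out;
    out.reserve(utf8.size());
    for (unsigned char b : utf8) {
        if (b < 0x80) out += static_cast<char>(b);
    }
    return out;
}
```

Note that different source names can collapse to the same stripped name (a title made up entirely of Chinese characters becomes just the extension), so the result may still need a uniqueness check.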

pinterf

10th April 2017, 18:16

I have made a patch to the VFW interface and now it is able to import avs files with unicode filenames and files that are behind a unicode directory name. Tested with Virtualdub and MPC-HC. See it in the next release. (Along with utf8 option in SubTitle)

v0lt

24th May 2020, 09:10

I use the following code to work with AviSynth+ script files.
// Load the AviSynth+ DLL and look up its factory function.
HMODULE hAviSynthDll = LoadLibraryW(L"Avisynth.dll");
IScriptEnvironment* (WINAPI *CreateScriptEnvironment)(int version) = (IScriptEnvironment * (WINAPI *)(int)) GetProcAddress(hAviSynthDll, "CreateScriptEnvironment");
IScriptEnvironment* ScriptEnvironment = CreateScriptEnvironment(6);
AVS_linkage = m_Linkage = ScriptEnvironment->GetAVSLinkage();
// Convert the wide-character file name to the ANSI code page and import the script.
std::string ansiFile = ConvertWideToANSI(name);
AVSValue arg(ansiFile.c_str());
AVSValue avsvalue = ScriptEnvironment->Invoke("Import", AVSValue(&arg, 1));
But this does not work if the file name has the following form "Duck_Утка_Πάπια_오리.avs" (yes, my file name can contain any characters).

Is it possible to get AviSynth+ to work with such files without changing the file name?

pinterf

24th May 2020, 09:27

Use utf8 in path and specify a 2nd parameter utf8=true

v0lt

24th May 2020, 10:22

Use utf8 in path and specify a 2nd parameter utf8=true
Thank you, but I can’t. I tried the following options:
...
std::string utf8file = ConvertWideToUtf8(name);
AVSValue args[2] = { utf8file.c_str(), true };
AVSValue avsvalue = ScriptEnvironment->Invoke("Import", AVSValue(&args, 2));
...
std::string utf8file = ConvertWideToUtf8(name);
AVSValue args[2] = { utf8file.c_str(), "utf8=true" };
AVSValue avsvalue = ScriptEnvironment->Invoke("Import", AVSValue(&args, 2));
And got a runtime error.
How to do it right?