Foreign Language characters in filenames - Page 5

IanB · 25th July 2010, 13:42

Quote:

Originally Posted by Gavino

Note that Import is not the sole route for getting text into Avisynth. Applications using the library interface can also use Eval - for example, I believe AvsP uses Eval instead of Import. Does your proposal mean that such applications will in future have to pass a UTF-8 string to Eval instead of an ANSI one?

So some existing scripts (using these functions on strings with 8-bit ANSI) will stop working?

And yes, you stepped on the landmines as well.

I have been avoiding spoon feeding this discussion too much because I am hoping someone will come up with some ideas that do not paint one into a corner. Just about every way I look at this something always comes unglued somewhere.

My first thought about the API text was existing calls as ANSI, add new calls for UTF8, but that has problems.

And yes I really, really don't want existing scripts or applications to stop working.

Thoughts?

@krieger2005,

UTF16BE/LE, UTF32BE/LE are dead easy, I do not have any difficulties with 16 or 32 bit file encoding.

The hard part is 8 bit byte encodings.

The state of play has to be that 8-bit ANSI codepage encoding has priority.

That is what works now and should continue to work in the future.

UTF-8 is probably going to have to be externally identified. Either by a BOM, some reserved opening sequence on the first line of the file or some other devious scheme.
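
A minimal sketch of the BOM route mentioned above, for illustration only. The enum and function names are made up here and are not part of Avisynth; the byte sequences are the standard Unicode byte order marks.

```cpp
#include <cstddef>
#include <string>

// Illustrative names only; nothing here is an existing Avisynth type.
enum class BomKind { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Inspect the first bytes of a file's contents for a byte order mark.
// UTF-32LE is tested before UTF-16LE because FF FE 00 00 also starts with FF FE
// (this assumes a script never begins with a NUL character).
BomKind detect_bom(const std::string& data) {
    const auto* p = reinterpret_cast<const unsigned char*>(data.data());
    const std::size_t n = data.size();
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return BomKind::Utf32LE;
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return BomKind::Utf32BE;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return BomKind::Utf8;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return BomKind::Utf16LE;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return BomKind::Utf16BE;
    return BomKind::None;  // no BOM: treat as 8-bit codepage text
}
```

With this approach a file without a BOM keeps its current meaning (8-bit codepage), which is the backward-compatible default asked for above.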

krieger2005 · 25th July 2010, 13:48

Quote:

Originally Posted by Gavino

How practical/reliable is this? Can you give a precise algorithm?

This algorithm is used for detecting the encoding of HTML/XML content, but there the parser searches for "<".

Quote:

Originally Posted by Gavino

This is not backwards compatible - some existing scripts (using 8-bit ANSI strings) will stop working.

ANSI still has the first 7 bits in common with ASCII and UTF-8. And none of the scripts I know use special characters above the value 127. And even for that case, it would be practical to add a codepage command, so that the Avisynth parser can convert the script internally to UTF-8.
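
To make the 7-bit point concrete, here is a small sketch (the function name is hypothetical) that tells whether a script contains any byte above 127. If it does not, the bytes read the same as ASCII, as an ASCII-compatible ANSI codepage, or as UTF-8, so the encoding question does not arise for that script.

```cpp
#include <string>

// Returns true when every byte is below 0x80. Such a script is byte-for-byte
// the same text in ASCII, in an ASCII-compatible codepage, and in UTF-8.
bool is_seven_bit(const std::string& script) {
    for (unsigned char c : script) {
        if (c >= 0x80) return false;
    }
    return true;
}
```

Only scripts for which this returns false need any codepage or UTF-8 decision at all.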

Quote:

Originally Posted by Gavino

Depends what you mean by 'command'. At best, it would be some simple directive outside the script language

I thought about an Avisynth command like SetMemoryMax (for example: UseCodepage(ANSI)). I don't think an external directive is a good solution, because when people share scripts and one of them has set this external directive and the other has not, it could cause problems reading the scripts.

Quote:

Originally Posted by Gavino

But existing plugins expect to see 8-bit ANSI characters in strings, not UTF-8. Filenames, for example, would need to be handled differently.

Do they expect ANSI? Or do they just take what they get? I mean, plugins which receive Unicode strings have to be updated in any case. For example, mpeg2source must call the wide-character functions to open the file.

EDIT: Sorry forgot the Algorithm:
1. Detect BOM (i will not describe this)
2. if BOM couldn't be detected:

Code:

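const char *char_codes = ".{}()=";
for (int i = 0; i < min(filesize, 50); i++) {
 b = read_byte_from_file();
 // strchr also matches the terminating 0, so skip zero bytes
 if (b != 0 && strchr(char_codes, b) != NULL) {
  look if next byte is a 0      // UTF-16LE ?? (low byte comes first)
  look if previous byte was a 0 // UTF-16BE ?? (high byte comes first)
  if (UTF16LE)
   look if the three next bytes are 0     // UTF-32LE
  if (UTF16BE)
   look if the three previous bytes are 0 // UTF-32BE
 }
}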

If the script contains one of the searched characters, then the algorithm can determine the correct file encoding with 100% certainty. I assume that 99.9% of all scripts contain one of the characters above, so this algorithm will work for 99.9% of all files.
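
The heuristic can be sketched as runnable code as follows. This is only one possible reading of the idea, not a tested detector: the names are invented, the input is assumed to be the raw file contents with no BOM, and the use of the byte offset's parity to tell little endian from big endian is an addition that the description above does not mention. UTF-16LE and UTF-32LE store the low byte first, so an ASCII character is followed by zero bytes; the BE forms store the zero bytes in front of it.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

enum class Guess { Unknown, EightBit, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Look at the first 50 bytes for one of the marker characters and classify
// the file by the zero bytes around it. Returns Unknown if no marker is found.
Guess guess_without_bom(const std::string& data) {
    static const char kMarkers[] = ".{}()=";
    const std::size_t n = data.size();
    const std::size_t limit = std::min<std::size_t>(n, 50);
    auto zero = [&](std::size_t i) { return i < n && data[i] == '\0'; };

    for (std::size_t i = 0; i < limit; ++i) {
        const char b = data[i];
        // strchr would also match the terminating NUL, so exclude zero bytes.
        if (b == '\0' || std::strchr(kMarkers, b) == nullptr) continue;

        // 32-bit units first: three zero bytes after (LE) or before (BE).
        if (i % 4 == 0 && zero(i + 1) && zero(i + 2) && zero(i + 3)) return Guess::Utf32LE;
        if (i % 4 == 3 && zero(i - 1) && zero(i - 2) && zero(i - 3)) return Guess::Utf32BE;
        // 16-bit units: one zero byte after (LE) or before (BE).
        if (i % 2 == 0 && zero(i + 1)) return Guess::Utf16LE;
        if (i % 2 == 1 && zero(i - 1)) return Guess::Utf16BE;
        return Guess::EightBit;  // ANSI or UTF-8; this test cannot tell them apart
    }
    return Guess::Unknown;
}
```

Note that the EightBit result still leaves the ANSI versus UTF-8 question open, which is the hard part raised earlier in the thread.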

krieger2005 · 25th July 2010, 15:10

Quote:

Originally Posted by IanB

The hard part is 8 bit byte encodings.

Ah, I forgot this step... In my program (a parser for HTML files) I had a fallback to ISO-8859-1. Additionally I ran the ICU charset detector and only trusted its result when it reported high confidence.

But, as I stated above, I prefer declaring the charset inside the file instead of defining it somewhere outside. That way it works just as HTML does in its files.

There is still one difference. I have heard here that some people use codepages so that their scripts work in foreign languages; others use ANSI. But which part of a script uses characters above 127 (and not in comments)? Based on the answer to this question, I think you can decide whether to support ANSI as a fallback or simply assume the script is UTF-8.

kemuri-_9 · 25th July 2010, 16:04

Quote:

Originally Posted by IanB

And yes, you stepped on the landmines as well.

I have been avoiding spoon feeding this discussion too much because I am hoping someone will come up with some ideas that do not paint one into a corner. Just about every way I look at this something always comes unglued somewhere.

My first thought about the API text was existing calls as ANSI, add new calls for UTF8, but that has problems.

And yes I really, really don't want existing scripts or applications to stop working.

Thoughts?

the least evil way of having this work as i see it is to have an avisynth variable control what character encoding Eval() is expected to be given
(outside of Import's currently planned handling, as import will have the file encoding to handle the situation)

something similar to the other global variables like OPT_UseWaveExtensible, except that this new variable can't be set within scripts to avoid issues that would bring.

something like

Code:

AVSValue y = AVSValue( true );
env->SetVar( "$OPT_EvalIsUTF8", y );

would trigger Eval to expect UTF8 strings instead of ANSI and the value is defaulted to false.

this would technically complicate Import though, as it would need to handle dealing with this variable...

Groucho2004 · 25th July 2010, 16:19

Maybe I'm missing something but why would we need UTF-8 at all? What about either 8 bit Ansi or UTF-16 as choices?

The advantage would be that both encodings are easily identified, whether there is a BOM in UTF-16 or not.

mariush · 25th July 2010, 17:41

In Notepad and most editors, when you select Unicode in Save As, the default is UTF-8. You would have to do 2-3 extra clicks to select UTF-16 on purpose.
Once you say your app supports Unicode, people will naturally assume you support UTF-8 too, so they'd be constantly surprised if their scripts fail. So you'd have to support UTF-8 anyway.

kemuri - you could also use an environment variable for that, I think.

ps. Would installing two libraries be possible, for example avisynth.dll and avisynthw.dll for unicode ? maybe have some "Stub" or whatever it's called that would then load the appropriate dll depending on how it's initialized?

ps2. Well ok, maybe not Notepad, which has ASCII, then Unicode, then Unicode Big Endian and then UTF-8, at least on my Windows 2003, but other editors have UTF-8 first, for example the popular UltraEdit.

Groucho2004 · 25th July 2010, 17:50

Quote:

Originally Posted by mariush

Once you say your app supports Unicode, people will naturally assume you support UTF-8 too, so they'd be constantly surprised if their scripts fail.

Well, that's just a matter of documentation, isn't it? You don't see lots of people surprised that their car's engine is broken because they put diesel in it instead of petrol (that's gas for our American readers), right?

mariush · 25th July 2010, 18:27

When you have a car worth 15-25 thousand and risk losing it for a day, you're more careful about that sort of thing. Even so, people will still get it wrong.

People won't be as careful with a script as with a car, and for lots of people it would be routine, automatism, and they'd expect for something to work as it works with other programs - most won't even read the documentation, as I'm sure you didn't read the manual for Windows or other programs you have first. People will just make mistakes a lot and will be pissed and will fill the forum and most won't even understand the difference between UTF-8 and UTF-16 and won't care to learn - they just want to convert a video.

I've been reading Raymond Chen's blog for years (he's a guy that worked at Microsoft on Windows 95 and still works there) and I highly recommend you browse the posts in the History section... I've learned to appreciate the extent to which Microsoft went to preserve compatibility with lots of things as operating systems improved, and often found that things I thought were wrong were actually quite smart and appropriate.

I've also found out that we often make assumptions that are not really true in the real world - open source programmers and small developers don't have access to millions of people using the software to know how they think so just assuming they won't use or that they'll learn not to use UTF-8 is bad and won't work in reality.

Gavino · 25th July 2010, 19:04

Quote:

Originally Posted by krieger2005

Which part of a script uses characters above 127 (and not in comments)?

String literals, of course.
Especially (but not limited to) filenames - which is what I thought was the main point of this thread (or at least the starting point).

krieger2005 · 25th July 2010, 19:38

Quote:

Originally Posted by Gavino

String literals, of course.
Especially (but not limited to) filenames - which is what I thought was the main point of this thread (or at least the starting point).

Exactly - the main point of this thread is filenames. But who uses a script with the same filename multiple times? I myself write a separate script for every movie. So how many people out there actually reuse such scripts (PS: I forgot filenames in Imports... okay)?

This is the reason why I asked whether it is really useful to be backward compatible.

The reason why I suggested UTF-8 (I do not know the internals of Avisynth) was that I thought Avisynth hands the data of the script to the plugins after it has parsed it. So the main question is: how should Avisynth store the data internally during the parsing step? Supporting two formats and two interfaces is one possibility. But why should one do this?

The benefit of UTF-8 is that characters are represented as a byte stream. So every old plugin can still be used with the old interface without a recompile (no wchar, no int for UTF-16 characters, simply char*).

But maybe I misunderstood how Avisynth works internally.

kemuri-_9 · 25th July 2010, 20:16

Quote:

Originally Posted by mariush

kemuri - you could also use an environment variable for that, I think.

no, that's asking to get shot with a handgun:
a user could have happened to set the environment variable to 'true' and then they use a program that isn't up-to-date in this regard expecting to use ASCII strings as things stand now.
But due to the environment variable, avisynth will be expecting UTF8 strings.
this would very likely cause unpredictable results and possibly even crashes.

don't even say 'use the registry' either, that's asking to get blasted with a shotgun.

If this route is actually taken, Import() would be required to do something like
1) backup the current value of the variable, whatever it is.
2) read in the file, detecting the encoding as necessary
3) set the variable to be false for the normal ASCII files, otherwise set to true and use UTF-16 -> UTF-8 conversion before passing to Eval(), if necessary.
4) restore the backed-up variable.
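
The backup/set/restore sequence in steps 1 to 4 can be sketched with a small scope guard. This is only an illustration: `VarTable` is a plain map standing in for the script environment and is not the real IScriptEnvironment interface, and `ScopedFlag` and `import_with_encoding` are invented names. Only the variable name `$OPT_EvalIsUTF8` comes from the proposal above.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the environment's variable table.
using VarTable = std::map<std::string, bool>;

// Saves one boolean variable on construction and restores it on destruction,
// so the previous state comes back even if evaluation throws.
class ScopedFlag {
public:
    ScopedFlag(VarTable& vars, const std::string& name, bool value)
        : vars_(vars), name_(name), had_value_(vars.count(name) != 0),
          old_value_(had_value_ ? vars[name] : false) {
        vars_[name_] = value;                        // step 3: set for this file
    }
    ~ScopedFlag() {
        if (had_value_) vars_[name_] = old_value_;   // step 4: restore
        else vars_.erase(name_);                     // it was unset before
    }
    ScopedFlag(const ScopedFlag&) = delete;
    ScopedFlag& operator=(const ScopedFlag&) = delete;

private:
    VarTable& vars_;
    std::string name_;
    bool had_value_;   // step 1: remember whether the variable existed...
    bool old_value_;   // ...and what it was
};

// Evaluate an imported file with the flag forced to match the file's encoding.
// `eval` is any callable that stands in for the nested Eval() call.
template <typename EvalFn>
void import_with_encoding(VarTable& vars, bool file_is_utf8, EvalFn eval) {
    ScopedFlag guard(vars, "$OPT_EvalIsUTF8", file_is_utf8);
    eval();
}
```

A guard object keeps nested Imports correct as well, since each level restores exactly the value it found.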

doing this with environment variables is ugly especially since they are string values.
the registry is unusable as it can break when multiple instances of avisynth are being used simultaneously.

this is why i proposed to just keep this handling localized within avisynth.

Gavino · 25th July 2010, 20:26

Quote:

Originally Posted by krieger2005

But who uses a script with the same filename multiple times? I myself write a separate script for every movie.

I get your point, but the non-ASCII characters could be in a folder name that is used for many movies. Or (as you say) in an Import of common script code. Also, I have many scripts that I use over and over again for playing (eg a specific combination of clips) rather than encoding.

Quote:

The reason why I suggested UTF-8 (I do not know the internals of Avisynth) was that I thought Avisynth hands the data of the script to the plugins after it has parsed it.

That's correct (assuming by 'data' you mean the parameters passed to the plugin function).

Quote:

The benefit of UTF-8 is that characters are represented as a byte stream. So every old plugin can still be used with the old interface without a recompile (no wchar, no int for UTF-16 characters, simply char*).

But if the plugin doesn't 'know' that the string is UTF-8, it won't necessarily do the right thing when using the data. For example, if it is a filename, a different procedure must be used to open it.

henryho_hk · 27th July 2010, 07:03

Alternate proposal (similar to shell script headers) for file format:

1) Support 8-bit local codepage and UTF-8 only
2) BOM is optional and ignored (it is optional for UTF-8 plain text files anyway)
3) All UTF-8 encoded AVS files must have a special comment like "#! UTF-8" at the very first line. If it is not found, the file is assumed to be in 8-bit local code page encoding (irrespective of BOM).

99.999% full backward compatibility with existing scripts and easy to implement.
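
A minimal sketch of the first-line check in this proposal. The function name is hypothetical, the marker is taken literally as `#! UTF-8` (the proposal only says "like"), and ignoring trailing blanks on the line is an assumption.

```cpp
#include <cstddef>
#include <string>

// Returns true when the first line of the script is the marker comment.
// A UTF-8 BOM in front of the marker is skipped and otherwise ignored (point 2).
// Anything else means: treat the file as 8-bit local codepage text (point 3).
bool has_utf8_header(const std::string& script) {
    std::size_t pos = 0;
    if (script.compare(0, 3, "\xEF\xBB\xBF") == 0) pos = 3;

    const std::size_t eol = script.find_first_of("\r\n", pos);
    std::string first = script.substr(
        pos, eol == std::string::npos ? std::string::npos : eol - pos);

    // Tolerate trailing spaces or tabs after the marker.
    while (!first.empty() && (first.back() == ' ' || first.back() == '\t'))
        first.pop_back();

    return first == "#! UTF-8";
}
```

Because the marker is an ordinary comment, a version of Avisynth that does not know about it would still parse the line without error.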

LoRd_MuldeR · 21st May 2011, 14:02

Doesn't look like this thread ever came to a final conclusion. And even if it eventually does, Unicode (UTF-8) support will only be in new versions, not in the old 2.5.x series, right?

So I guess I will stick with the VfW interface for now, which at least should support Unicode names...

I know that this won't work if Unicode strings are used inside the Avisynth script, but at least I can open the AVS file, if the name of the AVS file itself contains Unicode characters.

(Still I have no clue why VFW on my system apparently opens such files successfully, but then reports that there are no streams)

kemuri-_9 · 21st May 2011, 15:06

indeed, avisynth does not support utf-8 within scripts and API calls, this includes the basic 'Import(file.avs)' call necessary to load existing scripts into avisynth via the API.

when support does come, it is very likely that it will not be backported to earlier releases either.

you could always work around the vfw interface not failing when the file doesn't exist by doing a _waccess( filepath, 0 ) check beforehand.

LoRd_MuldeR · 21st May 2011, 15:14

Quote:

Originally Posted by kemuri-_9

indeed, avisynth does not support utf-8 within scripts and API calls, this includes the basic 'Import(file.avs)' call necessary to load existing scripts into avisynth via the API.

But what happens in the case when we have an AVS file that does not contain any Unicode (non-Latin1) characters, but the AVS file itself has a Unicode (non-Latin1) name?

I currently won't be able to import that script file with Avisynth's native API, but I am able to open the file with AVIFileOpenW(). Does Avisynth's VFW wrapper fail internally in that case?

At least I have not been able to successfully load a simple AVS file with Unicode-characters in the file name this way...

Quote:

Originally Posted by kemuri-_9

you could always work around the vfw interface not failing when the file doesn't exist by doing a _waccess( filepath, 0 ) check beforehand.

Currently I check for the existence/accessibility manually with _wfopen(), but this way it's more convenient. Thanks for the hint!

kemuri-_9 · 21st May 2011, 16:35

Quote:

Originally Posted by LoRd_MuldeR

But what happens in the case when we have an AVS file that does not contain any Unicode (non-Latin1) characters, but the AVS file itself has a Unicode name?

I currently won't be able to import that script file with Avisynth's native API, but I am able to open the file with AVIFileOpenW(). Does Avisynth's VFW wrapper fail internally in that case?

At least I have not been able to successfully load a simple AVS file with Unicode-characters in the file name this way...

from what i read in the codebase, the (simplified) VFW wrapper flow goes:

Code:

entry -> Load( OLEStr, mode )

unchecked convert OLEStr -> char *filename (native/oem codepage - not utf-8)
copy filename to a member variable
...
IScriptEnvironment->Invoke( "Import", filename ) 

... same code flows as AVS API.

so the only difference is that the VFW does the conversion from utf-16 filename to a native/oem codepage filename itself.

It's Import() that throws an error if the file doesn't exist, and it primarily uses xxxxxA versions of the windows API (avisynth is compiled with MBCS after all).

So if it is working with utf-16 filenames that do not map to the native/oem codepage, that's quite surprising.

LoRd_MuldeR · 21st May 2011, 17:50

Yup, that's clearly the reason why it doesn't work with Unicode (non-Latin1) file names, even when AVIFileOpenW() initially succeeds. I think the only workaround at this time is converting the file names to "short" names and passing the short ones to AVIFileOpenW(). Short names usually survive the conversion to the local 8-bit codepage. It's not nice and it's not guaranteed to work (AFAIK the MSDN doc nowhere says that "short" names can't contain Unicode chars, but it does say that files are not guaranteed to have a short name), but it will probably save us most of the time...

kemuri-_9 · 21st May 2011, 19:00

Using shortnames is not likely to work either:

Import uses GetFullPathName or SearchPath (which is used depends on if there's a path delimiter in the filename) to get the fullpath/filename of the script file.
these will error if the filepath contains characters outside of the native/oem codepage.

LoRd_MuldeR · 21st May 2011, 19:59

Quote:

Originally Posted by kemuri-_9

Using shortnames is not likely to work either:

Import uses GetFullPathName or SearchPath (which is used depends on if there's a path delimiter in the filename) to get the fullpath/filename of the script file.
these will error if the filepath contains characters outside of the native/oem codepage.

GetFullPathName() is one of the things that can prevent the "pass short path name" workaround from working, I know.

However with Avisynth it works for me here, passing the "short" version of a fully-qualified path:

Code:

avs2wav v1.2 [May 21 2011]
by Jory Stone <jcsston@toughguy.net>, updates by LoRd_MuldeR <mulder2@gmx.de>

Input: G:\ViDeOz\Music\Володимирович 菅直人 κλασικής.avs
Output: Dump.wav

Checking Avisynth... Done
Analyzing input file... Done
Opening output file... Done

[Audio Info]
TotalSamples: 11391413
TotalSeconds: 258
SamplesPerSec: 44100
BitsPerSample: 16
Channels: 2
AvgBytesPerSec: 176400

Dumping audio data, please wait:
0/11391413 [0%]
11391413/11391413 [100%]

All samples have been dumped. Exiting.
