Old 26th April 2006, 04:19   #1  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,168
Foreign Language characters in filenames

Unicode gurus, your help please!

A tracker bug report, [ 1407212 ] Problem with spanish unicode characters in filenames, has me a little perplexed.

Most of AviSynth uses 8-bit character file system calls like fopen, and as such relies on the system default conversion of filenames to Unicode.

DirectShowSource() is the one exception: it explicitly uses "MultiByteToWideChar(CP_ACP, ..." to do the filename Unicode translation.

To the best of my research, the system default translation should be equivalent to this code, but for Sven Rieke and naugas, and probably many others, it is not.

Enlightenment required, please.
Old 28th April 2006, 23:03   #2  |  Link
Richard Berg
developer wannabe
 
Join Date: Nov 2001
Location: Brooklyn, NY
Posts: 1,212
Are string variables in the script language Unicode?
Old 29th April 2006, 07:52   #3  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,168
No, the entire world is 8 bit: the text files that hold the script, the file I/O calls that read it, the memory arrays that hold the data, etc.
Old 29th April 2006, 11:28   #4  |  Link
Richard Berg
developer wannabe
 
Join Date: Nov 2001
Location: Brooklyn, NY
Posts: 1,212
8 bit is ok if it's treated as UTF-8 (not plain ASCII). I think you'll have to use platform APIs for that to happen, though, not ANSI stuff like fopen.
Old 29th April 2006, 13:58   #5  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,168
Okay, who are you and what have you done with the real Richard Berg?

No, seriously, the fundamental question is: what translation does Windows use for ANSI file system calls like fopen?

I assumed it was based on the current active code page, but obviously it is not. Can we help these users without recoding half the internals?

Last edited by IanB; 29th April 2006 at 14:32.
Old 29th April 2006, 17:37   #6  |  Link
foxyshadis
ангел смерти
 
Join Date: Nov 2004
Location: Lost
Posts: 9,175
fopen is overloaded depending on what you defined in the way of multibyte support.

Code:
TCHAR.H routine   _UNICODE & _MBCS not defined   _MBCS defined   _UNICODE defined
_tfopen           fopen                          fopen           _wfopen
If it still uses the plain ANSI version, it uses setlocale() and defaults to the system default code page. So if it isn't defaulting, something bizarre is happening, like a setlocale() in a header. reference

I found nothing about fopen being translated any differently from anything else, so I assume there's nothing special with its conversion.
__________________
There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order. ~ Ed Howdershelt
Old 30th April 2006, 06:15   #7  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,168
@foxyshadis,

No, fopen and the like are never overloaded. The TCHAR versions are the ones that are overloaded, like _tfopen, which will map to either fopen or _wfopen. Other calls like CreateFile map to either CreateFileW(LPCWSTR.... or CreateFileA(LPCSTR.... depending on whether UNICODE is defined.
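A minimal sketch of that compile-time mapping, using made-up stand-ins rather than the real header: my_tchar, MY_T, my_tcslen and MY_UNICODE are hypothetical names that mimic what tchar.h does with TCHAR, _T() and _UNICODE. The point is only that the preprocessor picks the narrow or wide spelling once, at build time.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for the tchar.h machinery; the real header keys off
// _UNICODE and provides TCHAR, _T() and _tfopen.
// Uncomment the next line to select the wide-character branch.
// #define MY_UNICODE

#ifdef MY_UNICODE
typedef wchar_t my_tchar;
#define MY_T(s) L##s
inline std::size_t my_tcslen(const my_tchar* s) { return std::wcslen(s); }
#else
typedef char my_tchar;
#define MY_T(s) s
inline std::size_t my_tcslen(const my_tchar* s) { return std::strlen(s); }
#endif

// Callers write one spelling; the preprocessor picks the narrow or wide form.
inline std::size_t name_length() { return my_tcslen(MY_T("clip.avi")); }
```

With MY_UNICODE left undefined, my_tchar is plain char and the call resolves to strlen, which is the situation the thread describes for the existing AviSynth code.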

However, thanks for the reference; it pushed me through another documentation iteration of code page stuff, and I may be a little closer to understanding what is happening.

Current guess is that the systems with a problem are using the OEM code page instead of the ANSI code page for the Unicode translation, or maybe vice versa.

It is more likely to be fopen inappropriately using the OEM code page, because DSS works and it explicitly does a manual translation using the system default ANSI code page (which incidentally is probably wrong; the code probably should be this ... AreFileApisANSI() ? CP_ACP : CP_OEMCP ...).

Whatever it is, the path from the person's fingers to the final data block representing the filename inside the OS is not consistent.

Short of converting the internals to use WCHARs everywhere AND forcing the .AVS scripts to be 16-bit Unicode text files, I cannot see how to resolve this. I need someone with practical experience who actually suffers the problem to step up.
Old 2nd May 2006, 13:50   #8  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,168
Maybe some of our friends that use non "US English" Windows can ask around for a little guidance and report back.
Old 2nd May 2006, 18:57   #9  |  Link
Fizick
AviSynth plugger
 
Join Date: Nov 2003
Location: Russia
Posts: 2,183
I am not a Unicode guru, but I am one of your friends who use non "US English" Windows.

I have no problems with Russian (Cyrillic) names of AVI files, log files, or AVS scripts in Win2k or WinXP.
The AVS file must be created in the ANSI code page, of course, by Notepad, etc.
But if I create the AVS file in a console application editor like Far Manager, in the OEM code page, then Russian names of AVI files in the AVS script do not work ("not found" error), of course.
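To illustrate why the OEM-saved script fails, here is a rough sketch that decodes the same bytes under the two Russian code pages, assuming Windows-1251 as the ANSI page and CP866 as the OEM page. The decoders are deliberately partial (ASCII plus the basic letters only) and the function names are invented for this example.

```cpp
#include <string>

// Decode one byte as Windows-1251 (the Russian ANSI code page).
// Only ASCII and the basic letters are handled; anything else yields U+FFFD.
char32_t decode_cp1251(unsigned char b) {
    if (b < 0x80) return b;
    if (b >= 0xC0) return 0x0410 + (b - 0xC0);               // 0xC0..0xFF -> U+0410..U+044F
    return 0xFFFD;
}

// Decode one byte as CP866 (the Russian OEM/console code page), same restrictions.
char32_t decode_cp866(unsigned char b) {
    if (b < 0x80) return b;
    if (b <= 0xAF) return 0x0410 + (b - 0x80);               // 0x80..0xAF -> U+0410..U+043F
    if (b >= 0xE0 && b <= 0xEF) return 0x0440 + (b - 0xE0);  // 0xE0..0xEF -> U+0440..U+044F
    return 0xFFFD;
}

// Decode a whole byte string with the chosen per-byte decoder.
std::u32string decode_bytes(const std::string& bytes, char32_t (*decode_byte)(unsigned char)) {
    std::u32string out;
    for (unsigned char b : bytes) out.push_back(decode_byte(b));
    return out;
}
```

For example, the bytes E1 E3 E2 EC spell "суть" when read as CP866 but "бгвм" when read as Windows-1251, so a filename typed in an OEM editor names a different (nonexistent) file once it is translated with the ANSI page.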

Some unrelated info. Some time ago I considered making a language wrapper of Avisynth functions in Russian.
For example, instead of BlankClip I can create a function:

Function ПустойКлип (clip ..., ...)
{
BlankClip(......)
}
ПустойКлип()

This produces Error: expected a function name

But I can use some lowercase Russian letters if the first symbol of the function name is Latin or _:

Function _пустойклип (......)
{
BlankClip(......)
}
_пустойклип()

Some Russian letters produce an "unexpected character" error.
Why?
For clip names the first symbol may be Russian too.
Sorry for the off-topic.

Last edited by Fizick; 18th November 2007 at 11:37.
Old 3rd November 2007, 00:51   #10  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,168
Ping! Here is another example :- Avisynth and japanese characters
Old 17th November 2007, 07:57   #11  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,168
Ping! And here is yet another example :- Avisynth + Unicode
Old 17th November 2007, 13:51   #12  |  Link
dukey
Registered User
 
Join Date: Dec 2005
Posts: 560
to fix avisynth with funky file names ..
without looking at the code
just use _wfopen
or rather use _tfopen and define _UNICODE and maybe UNICODE as well. Then replace char with TCHAR.

that still leaves the problem though of the scripts being ascii 7 or 8 bit
Old 17th November 2007, 22:51   #13  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,168
Quote:
Originally Posted by dukey
... the problem though of the scripts being ascii 8 bit
The problem you are looking to fix is currently at core\main.cpp@588
Code:
env->Invoke("Import", szScriptName);
So your last statement is slightly wrong: it is not the remainder of the problem, it is the whole problem.


One possibility is to switch to using CP_UTF8 in place of the current CP_ACP and CP_OEMCP. (Ding! Richard, your last comment suddenly makes sense now.) This will not be transparent: you will need to find text editors and configure them to output UTF-8 text files for scripts.

I could provide ANSItoUTF8, ACPtoUTF8 and OEMtoUTF8 script functions to ease the pain a little. i.e.
Code:
AviSource(ANSItoUTF8("ПустойКлип"))
I could also flip the system code page to CP_UTF8 (65001) on entry to Avisynth and restore it on exit to assist plugins that do fopen, etc, calls.
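As a rough idea of what an ANSItoUTF8-style helper would have to do, here is a sketch for one specific ANSI page, Windows-1252 (Western European). The function name cp1252_to_utf8 is hypothetical, and the sketch takes a shortcut: bytes 0x80..0x9F, which Windows-1252 maps to characters outside Latin-1, are replaced with '?' instead of being looked up in a table.

```cpp
#include <string>

// Hypothetical helper in the spirit of the proposed ANSItoUTF8(): converts text in
// Windows-1252 to UTF-8. Simplification: bytes 0x80..0x9F are replaced with '?'.
std::string cp1252_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char b : in) {
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));             // ASCII passes through
        } else if (b >= 0xA0) {
            // For 0xA0..0xFF the code point equals the byte value; emit two UTF-8 bytes.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        } else {
            out.push_back('?');                              // not handled in this sketch
        }
    }
    return out;
}
```

The Spanish ñ from the original bug report is byte F1 in Windows-1252 and becomes the two bytes C3 B1 in UTF-8. On Windows a real implementation would more likely go through MultiByteToWideChar and WideCharToMultiByte than a hand-written table.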

Thorts?
Old 18th November 2007, 02:30   #14  |  Link
Fizick
AviSynth plugger
 
Join Date: Nov 2003
Location: Russia
Posts: 2,183
AviSource("ПустойКлип.avi", UTF8=true) would be more compatible with existing scripts.
__________________
My Avisynth plugins are now at http://avisynth.org.ru and mirror at http://avisynth.nl/users/fizick
I usually do not provide a technical support in private messages.
Old 18th November 2007, 03:22   #15  |  Link
Leak
ffdshow/AviSynth wrangler
 
Join Date: Feb 2003
Location: Austria
Posts: 2,441
Quote:
Originally Posted by Fizick
AviSource("ПустойКлип.avi", UTF8=true) would be more compatible with existing scripts.
Looks a bit weird IMHO, having to tack on a parameter for this.

How about AviSource and AviSourceUTF8?
Old 18th November 2007, 04:41   #16  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,168
@Leak, The blind following the blind straight over a cliff :P

@Fizick, You seem to have missed the issue: AviSource() is but one example. The whole problem also covers Import(), DirectShowSource(), Mpeg2Source(), WavSource(), Nic*Audio(), etc, etc, etc, ...

@All,

Neither AviSource(UTF8=True) nor AviSourceUTF8() type thinking addresses the issue, which is: what is the true Unicode filename string intended by the user in the 8-bit text of the script? As I said earlier, the path between the user's fingers and the Unicode file system must not be ambiguous.

Currently we use the "fopen" class of ANSI I/O library calls, which to the best of my research so far internally do MultiByteToWideChar(AreFileApisANSI() ? CP_ACP : CP_OEMCP, ...) calls to get a Unicode string for the tail-end kernel calls. This mostly works for users who can nail down their code page environment. It comes unstuck as soon as you need more than one code page.

UTF-8 is an encoding scheme to represent Unicode characters unambiguously in 8-bit data. Assuming that scripts are UTF-8 solves all my problems dealing with a Unicode-based file system behind the I/O library calls.
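For reference, a small sketch of the encoding itself. The function names utf8_from_code_point and utf8_from_text are made up for this example; the bit layout is the standard UTF-8 one.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
// Surrogates and values above U+10FFFF are not encodable; they yield "".
std::string utf8_from_code_point(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;   // surrogate range
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode a whole UTF-32 string by concatenating the per-code-point encodings.
std::string utf8_from_text(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) out += utf8_from_code_point(cp);
    return out;
}
```

A Spanish ñ, a Russian П and a Japanese 日 can sit side by side in one 8-bit string this way (C3 B1, D0 9F, E6 97 A5), which no single ANSI or OEM code page allows.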

But UTF-8 is an encoding scheme; it is not a very friendly replacement for the warm, safe, language-based ANSI code page world.

As a native English speaker whose scripts consist entirely of chars [32..126], I will never see a problem. Everybody else will have serious issues!

More Thorts?
Old 18th November 2007, 04:47   #17  |  Link
foxyshadis
ангел смерти
 
Join Date: Nov 2004
Location: Lost
Posts: 9,175
You can start by looking for a BOM in the parser; that should be a very good indicator of how upper characters should be translated. (Table of BOMs.) All Unicode-aware text editors are supposed to use it. If it's not present, as in many files, it'll be more difficult, but what about starting by assuming it's UTF-8 and translating that to UTF-16 internally, and if that fails (or if it succeeds but the file isn't found), assuming it's some code page and retrying the open with that translation? Further translation attempts are up to you.

I just ask because anything extra in the script is going to be a big pain.

That doesn't work for 3rd party plugins, obvs, and it'll cause some more complexity internally, but as you know there isn't really a way you can translate everything to one format (like UTF-16) without some things failing or people not realizing they're working in one encoding and not another.
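A sketch of the "try UTF-8 first, otherwise fall back to a code page" step suggested above. The names is_valid_utf8, GuessedEncoding and guess_encoding are invented for illustration, and this is only a heuristic: pure ASCII passes as UTF-8, and some code page text can happen to be well-formed UTF-8 too.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8 (rejects overlong forms, surrogates,
// truncated sequences and code points above U+10FFFF).
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0, n = s.size();
    while (i < n) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len;
        char32_t cp, lowest;   // lowest = smallest code point allowed for this length
        if (b < 0x80)                { len = 1; cp = b;        lowest = 0; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; lowest = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; lowest = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; lowest = 0x10000; }
        else return false;                       // stray continuation or invalid lead byte
        if (i + len > n) return false;           // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        i += len;
    }
    return true;
}

enum class GuessedEncoding { Utf8, CodePage };

// First guess for a script with no BOM: UTF-8 if it validates, else some code page.
GuessedEncoding guess_encoding(const std::string& script_bytes) {
    return is_valid_utf8(script_bytes) ? GuessedEncoding::Utf8 : GuessedEncoding::CodePage;
}
```

A filename such as "mañana" saved in Windows-1252 contains a lone F1 byte followed by ASCII, which fails validation, so the guess falls through to the code page branch.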

Last edited by foxyshadis; 18th November 2007 at 04:50.
Old 18th November 2007, 04:51   #18  |  Link
[P]ako
A geek wannabe
 
Join Date: Apr 2007
Posts: 231
I had a similar problem. I had the antialiasing script (aaa) saved as Unicode, maybe because of the accent mark in Didé's nick. Avisynth couldn't load the script at all; the error pointed to the line importing the script. I solved the issue by converting said script to UTF-8.
Old 18th November 2007, 09:14   #19  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,168
@foxyshadis,

Looking for a BOM doesn't help. It's cart before the horse. The scripts are plain 8 bit text. To look for the BOM you have to start with the assumption that the file is UTF, the BOM then tells you which flavour of UTF.

Here's a UTF-8 BOM "" .... oh, it doesn't do anything.

@[P]ako,

Yes, trying to feed raw Unicode (UTF-16) into something expecting ASCII chokes big time: every second byte is a NUL (as in NUL-terminated strings). Converting to any 8-bit code page would have masked the problem.
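A tiny demonstration of that failure mode, assuming the script was saved as UTF-16LE (what Notepad calls "Unicode"); the BOM such a file would normally start with is left out to keep the byte count obvious, and the names are invented for the example.

```cpp
#include <cstddef>
#include <cstring>

// The text "Import" as UTF-16LE bytes: each ASCII letter is followed by a 0x00 byte.
static const char kUtf16Le[] = {
    'I', 0, 'm', 0, 'p', 0, 'o', 0, 'r', 0, 't', 0,
    0, 0   // two-byte UTF-16 terminator
};

// Length as seen by 8-bit C string code, which stops at the first zero byte.
std::size_t length_seen_by_c_string_code() { return std::strlen(kUtf16Le); }

// Number of bytes that actually hold text (terminator excluded).
std::size_t actual_byte_length() { return sizeof(kUtf16Le) - 2; }
```

Any 8-bit parser sees a one-character string "I" where twelve bytes of text were stored.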


@All,

At last some inspiration!

I zap all the fopen/CreateFileA type calls and replace them with _wfopen/CreateFileW type calls. Plus I manually use MultiByteToWideChar(AVSCodePage, ...) to do the char to wchar translations.

I set AVSCodePage to (AreFileApisANSI() ? CP_ACP : CP_OEMCP) after the script verb "Import" has opened the script file but before it parses it, i.e. it _wfopen's the script file using the existing AVSCodePage translation, then resets AVSCodePage to its default value.

I provide a script verb to let the user specify the code page to be used for char to wchar translations in subsequent script verbs that result in Unicode system or library calls.

In the AVIFileOpenW implementation code for the Avisynth handler I do a WideCharToMultiByte(CP_UTF8, ...), set AVSCodePage=CP_UTF8 and then do the env->Invoke("Import", ...) [this is why I need to keep resetting AVSCodePage].

Maybe I can allow an .AVSI script to set a default default code page.
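A toy model of the bookkeeping described above, with no real file access. ScriptEnv, open_file, set_code_page and import are hypothetical names, kDefaultCodePage stands in for the AreFileApisANSI() ? CP_ACP : CP_OEMCP choice, and each "open" just records which code page would have been used for the translation.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

const unsigned kDefaultCodePage = 0;   // stand-in for AreFileApisANSI() ? CP_ACP : CP_OEMCP
const unsigned kUtf8CodePage = 65001;  // numeric value of CP_UTF8

struct ScriptEnv {
    unsigned avs_code_page = kDefaultCodePage;
    // (file name, code page used to translate it); stands in for real _wfopen calls.
    std::vector<std::pair<std::string, unsigned>> opens;

    void open_file(const std::string& name) { opens.emplace_back(name, avs_code_page); }

    // Models the proposed script verb that selects the translation for later verbs.
    void set_code_page(unsigned cp) { avs_code_page = cp; }

    // Import: open the script with the caller's code page, then reset to the
    // default before the script body runs.
    void import(const std::string& script, const std::function<void(ScriptEnv&)>& body) {
        open_file(script);
        avs_code_page = kDefaultCodePage;
        body(*this);
    }
};
```

In this model the script name handed over by the AVIFileOpenW path is translated as UTF-8, the first source filter inside the script uses the default page, and anything after an explicit set_code_page call uses the page the script asked for.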

Many more thorts please?
Old 20th November 2007, 17:32   #20  |  Link
DonQ
Registered User
 
Join Date: Oct 2006
Location: Estonia
Posts: 45
Read the first two bytes from any source/script/include file in binary mode. If they are a BOM (were they FF FE and FE FF?), consider the file UTF8, otherwise ASCII. Of course you need to convert both kinds of file to the same format afterwards, UTF8 or UTF-16 for example.

This corresponds exactly to how Notepad saves [new] files: if it detects any character not in the current ANSI code page, it automatically adds a BOM and saves the file in UTF8.
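On the byte values: FF FE and FE FF are the UTF-16 little-endian and big-endian BOMs, while the UTF-8 BOM is the three bytes EF BB BF, so two bytes are not quite enough to spot UTF-8. A minimal detection sketch follows; the Bom enum and detect_bom name are invented here, and UTF-32 (whose little-endian BOM also starts with FF FE) is ignored.

```cpp
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspect the leading bytes of a file's contents for a byte order mark.
Bom detect_bom(const std::string& bytes) {
    auto at = [&](std::size_t i) { return static_cast<unsigned char>(bytes[i]); };
    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF) return Bom::Utf8;
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE) return Bom::Utf16LE;
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF) return Bom::Utf16BE;
    return Bom::None;
}
```

A result of Bom::None says nothing either way: the file may still be UTF-8 without a BOM, or text in some 8-bit code page.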