Old 20th November 2007, 22:48   #21  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
Hmm, keep the ruminations coming
Code:
Bytes	Encoding Form
FE FF		UTF-16, big-endian
FF FE		UTF-16, little-endian
EF BB BF	UTF-8
If I see FF FE or FE FF as the first 2 bytes of a file I will know it is going to be 16bit unicode and I will NOT be able to do anything with it (maybe not so).

However, if I saw the EF BB BF sequence I could set the default AVSCodePage to UTF-8. I am not aware of editors that do this, but I could easily add the processing in anticipation, effectively treating it as if the script had started with
AVSCodePage("UTF-8") ... [Is this transparent?]

As for the "maybe not so", I could easily do a WideCharToMultiByte(CP_UTF8, ...) and set AVSCodePage=CP_UTF8 in the file read code of Import() ... [Is this transparent?]

Consider your wishes carefully.

I will not implement anything that will break the existing 7 bit ascii behaviour of Avisynth.

I will resist implementing anything that will break existing 8 bit codepage behaviour that currently works satisfactorily.

I will not implement anything that will prevent a future pure 16bit unicode implementation of Avisynth.
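
For illustration, the BOM check described above could be sketched roughly like this. It is a minimal standalone sketch, not Avisynth code: `Bom`, `BomInfo` and `detect_bom` are hypothetical names, and it only covers the three byte sequences in the table (it does not distinguish a UTF-32 little-endian BOM, which also starts with FF FE).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names, for illustration only.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

struct BomInfo {
    Bom kind;            // which BOM was found, if any
    std::size_t length;  // number of leading bytes the BOM occupies
};

// Inspect the first bytes of a buffer and report which BOM it starts with.
inline BomInfo detect_bom(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    // Check the 3-byte UTF-8 signature first, then the 2-byte UTF-16 ones.
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return {Bom::Utf8, 3};
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return {Bom::Utf16BE, 2};
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return {Bom::Utf16LE, 2};
    return {Bom::None, 0};
}
```

A caller could skip `length` bytes and pick the decoding path from `kind`; a file without a BOM would keep the existing 8-bit behaviour.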
Old 21st November 2007, 00:40   #22  |  Link
foxyshadis
Angel of Night
 
Join Date: Nov 2004
Location: Tangled in the silks
Posts: 9,559
Why do you want to transform back to UTF-8, instead of dropping internal ASCII parsing altogether? Convert to UTF-16 based on either the UTF marked by the BOM or the default code page in use. Convert back to 8-bit for string arguments based on the code page only. If people want to use Unicode without a BOM, perhaps use the Python method: assume everything is 8-bit unless a string is preceded by u, as in u"ангел смерти".

Define a new character to replace 's' ('u'?) for Unicode or UTF-8 or whatever you want, and then convert all characters to that, for any filters that want to support it. Tell people that they need two filter definitions, one with s and one with u, and show them how to build callbacks using WideCharToMultiByte(CP_ACP, ...) to hand to a WCHAR constructor.

Avisynth will prefer the u definition over the s definition.

Supporting legacy ASCII and Unicode behavior is not an easy problem, and any easy workarounds are unfortunately likely to make a later conversion to full unicode impossible without breaking something.
Old 21st November 2007, 01:29   #23  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
Because converting Avisynth to unicode is going to be very very hard and I am lazy

Just saying scripts are UTF-8 period! would be exceedingly easy, but I am not that much of an arsehole, so I look for simple ways to maintain the status quo compatibility wise and give Foreign Language users some way to move forward.
Old 21st November 2007, 13:52   #24  |  Link
dukey
Registered User
 
Join Date: Dec 2005
Posts: 560
UTF-8 seems like a good standard to go for

Under Linux I believe you can pass a UTF-8 filename to fopen() and it'll work fine.

Under Windows this doesn't work, so you would have to convert the filenames to wchar_t (or whatever the Windows defines are; TCHAR with the _UNICODE define, I think) and then use _wfopen().

But that doesn't seem like too much of a big deal.
That way all existing scripts will still work and can easily be extended.
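
A rough sketch of that idea, assuming file names are held as UTF-8 internally: `fopen_utf8` is a hypothetical helper name, not part of Avisynth or the C runtime. On Windows it widens the name with MultiByteToWideChar(CP_UTF8, ...) and calls _wfopen(); elsewhere it passes the bytes straight to fopen().

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: open a file whose name is given as UTF-8.
inline std::FILE* fopen_utf8(const std::string& path, const char* mode) {
    assert(mode != nullptr);
#ifdef _WIN32
    // Widen a UTF-8 string to UTF-16; returns an empty string on failure.
    auto widen = [](const std::string& s) -> std::wstring {
        if (s.empty()) return std::wstring();
        const int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                          s.data(), static_cast<int>(s.size()),
                                          nullptr, 0);
        if (n <= 0) return std::wstring();
        std::wstring w(static_cast<std::size_t>(n), L'\0');
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            s.data(), static_cast<int>(s.size()), &w[0], n);
        return w;
    };
    const std::wstring wpath = widen(path);
    const std::wstring wmode = widen(mode);
    if (wpath.empty() || wmode.empty()) return nullptr;
    return _wfopen(wpath.c_str(), wmode.c_str());
#else
    // Most Unix-like systems treat file names as opaque bytes.
    return std::fopen(path.c_str(), mode);
#endif
}
```

Existing scripts with plain ASCII names would behave the same either way, since ASCII bytes are identical in UTF-8.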
Old 14th January 2009, 13:57   #25  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
Got this email from SourceForge; interesting idea about using the short (8.3) names as a workaround.

A difficulty to consider might be how Windows maps long filenames that collide in the short namespace. I believe it is chronological, i.e. the 1st file created is FRED~1.AVS, the 2nd file created is FRED~2.AVS.
Quote:
Originally Posted by Bernhard
Message body follows:

Hello there

First things first. I'm probably using AviSynth for
something it wasn't intended for. But I quite like the
fanciness of it :-)
What I'm doing is, I have an AVS script with embedded
subtitles in it (SSA script after __END__ works like a
charm). The script itself checks its own filename and loads
whatever audio and video is available around it with the
same name and plays them with the subtitles over it.
I have a Perl script which generates such files from an AVS
template and timed text information.

Anyway, I've run into some Unicode issues which were quite
easy to fix (or at least circumvent a bit for the AVS
filename part).

First patch is for AVS files with Unicode filenames.
If one tries to play a file with non-local-CP characters in
the file or any part of the path the AVS script cannot be
opened at all.
I'm aware that it's not really feasible to rewrite AviSynth
to use wide characters, so I used windows' short filename
form feature to get the scripts loaded. It tries to keep the
last part of the path (the filename) in the long form if
possible, because that would help my special case when I try
to load video/audio files with the same name during the
scripts execution.
Works fine on XP and should on all OSs which allow Unicode
filenames. It even works under Wine :-)
Code:
@
-------------------------------------------------------------------------------------------------------------------
@
--- avisynth_base/src/core/main.cpp	Mon Jun  9 23:53:20 2008
+++ avisynth/src/core/main.cpp	Thu Jan  8 16:32:15 2009
@@ -320,6 +320,24 @@ STDMETHODIMP CAVIFileSynth::Load(LPCOLES
 	char filename[MAX_PATH];
 
 	WideCharToMultiByte(AreFileApisANSI() ? CP_ACP : CP_OEMCP, 0, lpszFileName, -1, filename, sizeof filename, NULL, NULL); 
+	if (strchr(filename, '?'))
+	{
+		// if a ? ended up in the converted filename there is a non-representable unicode character in the path
+		// Change to the short file name form (8.3 characters per path element).
+		char filenameshort[MAX_PATH], filenamelong[MAX_PATH];
+		OLECHAR lpszFileNameShort[MAX_PATH];
+		strncpy(filenamelong, filename, sizeof filenamelong);
+		GetShortPathNameW(lpszFileName, lpszFileNameShort, MAX_PATH);
+		WideCharToMultiByte(AreFileApisANSI() ? CP_ACP : CP_OEMCP, 0, lpszFileNameShort, -1, filenameshort, sizeof filenameshort, NULL, NULL);
+		// Depending if the found '?' is before or after (or on both sides) of the last path divider
+		// use either LONG-PATH/SHORT-FILE or SHORT-PATH/LONG-FILE or SHORT-PATH/SHORT-FILE
+		char* pfls = max(strrchr(filenamelong, '/'), strrchr(filenamelong, '\\'));
+		char* pfss = max(strrchr(filenameshort, '/'), strrchr(filenameshort, '\\'));
+		if (!pfss || !pfls) memcpy(filename, filenameshort, sizeof filenameshort);
+		else if (strchr(filenamelong, '?') > pfls) memcpy(filename+(pfls-filenamelong), pfss, sizeof(filenamelong)-(pfls-filenamelong));
+		else if (strchr(pfls, '?')) memcpy(filename, filenameshort, sizeof filenameshort);
+		else { memcpy(filename, filenameshort, pfss-filenameshort); memcpy(filename+(pfss-filenameshort), pfls, sizeof(filenameshort)-(pfss-filenameshort)); }
+	}
 
 	_RPT3(0,"%p->CAVIFileSynth::Load(\"%s\", 0x%X)\n", this, filename, grfMode);
@
-------------------------------------------------------------------------------------------------------------------
@
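The long/short splice decision in the patch above can be restated on plain strings, which may be easier to follow than the pointer arithmetic. This is only a sketch of the same decision table: `choose_path` is a hypothetical name, and it assumes the lossy long path (with '?' placeholders) and the short 8.3 path have already been obtained, for example via WideCharToMultiByte and GetShortPathNameW as in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: pick LONG-PATH/SHORT-FILE, SHORT-PATH/LONG-FILE or
// SHORT-PATH/SHORT-FILE depending on where the '?' placeholders ended up.
inline std::string choose_path(const std::string& long_lossy,
                               const std::string& short_path) {
    const std::size_t npos = std::string::npos;
    const std::size_t ls = long_lossy.find_last_of("/\\");  // last separator, long form
    const std::size_t ss = short_path.find_last_of("/\\");  // last separator, short form
    if (ls == npos || ss == npos) return short_path;        // no directory part to splice
    assert(ls < long_lossy.size() && ss < short_path.size());

    const std::size_t q = long_lossy.find('?');
    if (q == npos) return long_lossy;                        // nothing was lost

    const bool dir_lossy = q < ls;                           // '?' in the directory part
    const bool file_lossy = long_lossy.find('?', ls) != npos;  // '?' in the file name

    if (!dir_lossy)   // only the file name is lossy: long directory + short file
        return long_lossy.substr(0, ls) + short_path.substr(ss);
    if (file_lossy)   // both parts are lossy: use the short form throughout
        return short_path;
    // Only the directory is lossy: short directory + long file name.
    return short_path.substr(0, ss) + long_lossy.substr(ls);
}
```

For example, a lossy directory with a clean file name keeps the long file name, which is the case the email cares about when loading sibling audio/video files by name.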
The second one is for scripts with an added UTF-8 header
thingy often (mis-)called 'byte order mark'. The thing is, I
need these (or rather VSFilter TextSub does) three bytes at
the beginning to get (embedded) Unicode subtitles up and
running. Other than that, it doesn't really hinder the
execution of the script at all, as of course, UTF-8 is fully
compatible with ANSI ASCII.
The setting of the null-terminator was moved up so it will
be there when memmove is being done.
Code:
@
-------------------------------------------------------------------------------------------------------------------
@ 
--- avisynth_bases/src/core/parser/script.cpp	Fri Jun 20 05:58:18 2008
+++ avisynth/src/core/parser/script.cpp	Wed Jan 14 01:13:52 2009
@@ -268,6 +268,7 @@ AVSValue Import(AVSValue args, void*, IS
     if (!ReadFile(h, buf, size, &size, NULL))
       env->ThrowError("Import: unable to read \"%s\"", script_name);
     CloseHandle(h);
+    ((char*)buf)[size] = 0;
 
     // Give Unicode smartarses a hint they need to use ANSI encoding
     if (size >= 2) {
@@ -278,11 +279,9 @@ AVSValue Import(AVSValue args, void*, IS
                           "re-save script with ANSI encoding! : \"%s\"", script_name);
 
       if (q[0]==0xEF && q[1]==0xBB && q[2]==0xBF)
-          env->ThrowError("Import: UTF-8 source files are not supported, "
-                          "re-save script with ANSI encoding! : \"%s\"", script_name);
+          memmove(buf, buf+3, size-3+1);
     }
 
-    ((char*)buf)[size] = 0;
     AVSValue eval_args[] = { (char*)buf, script_name };
     result = env->Invoke("Eval", AVSValue(eval_args, 2));
   }
@
-------------------------------------------------------------------------------------------------------------------
@
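As a standalone illustration of the BOM-stripping step in the patch above, here is a minimal sketch. `strip_utf8_bom` is a hypothetical name; the sketch assumes the buffer holds `size` bytes followed by a null terminator, and it checks for at least three bytes before looking at the third one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: remove a leading UTF-8 signature (EF BB BF) in place.
// Precondition (assumed): buf holds size bytes plus a terminating '\0'.
// Returns the new length, excluding the terminator.
inline std::size_t strip_utf8_bom(char* buf, std::size_t size) {
    assert(buf != nullptr);
    const unsigned char* q = reinterpret_cast<const unsigned char*>(buf);
    if (size >= 3 && q[0] == 0xEF && q[1] == 0xBB && q[2] == 0xBF) {
        // The +1 moves the terminator along with the text, which is why the
        // terminator has to be in place before the move.
        std::memmove(buf, buf + 3, size - 3 + 1);
        return size - 3;
    }
    return size;
}
```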
I'm aware that I'm kinda late now with 2.58 out, but maybe
you can use this to get some kind of new alpha or beta out
with an updated copyright notice :-)

Greetings from Tokyo Japan,
Bernhard

Last edited by Wilbert; 20th January 2009 at 18:49.
Old 20th January 2009, 04:02   #26  |  Link
Barna
Registered User
 
Join Date: Jan 2009
Posts: 1
Yay after 5 long days I am finally allowed to post :-)

There are no collisions possible in the short form.
The operating system takes care of that. On Windows, both of its file systems (VFAT and NTFS) actually store the short forms on disk. That means the short forms are generated when the file is named and will not change until the (long) name does.
After ~1, ~2, ~3, ~4 the naming (at least under NTFS) changes to some hashing/random scheme which keeps only the first two ASCII letters.

Oh, and I'm not sure how I feel about having my full name and an email address that forwards to my main one being posted on a public forum...
Old 20th January 2009, 18:49   #27  |  Link
Wilbert
Moderator
 
Join Date: Nov 2001
Location: Netherlands
Posts: 6,364
Quote:
Oh, and I'm not sure how I feel about having my full name and an email address that forwards to my main one being posted on a public forum...
I removed them for you. Sorry about that.
Old 20th January 2009, 23:38   #28  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
Quote:
Originally Posted by Barna View Post
...
Oh, and I'm not sure how I feel about having my full name and an email address that forwards to my main one being posted on a public forum...
Sorry about that, I need to censor more when I cut and paste. Thanks Wilbert.
Quote:
There are no collisions possible in the short form.
The operating system takes care of that. On Windows, both of its file systems (VFAT and NTFS) actually store the short forms on disk. That means the short forms are generated when the file is named and will not change until the (long) name does.
After ~1, ~2, ~3, ~4 the naming (at least under NTFS) changes to some hashing/random scheme which keeps only the first two ASCII letters.
Yes I guess collisions do not matter here, the appropriate script will get opened.

For the benefit of others, I was worried that you cannot predict the short name for a given long filename, i.e. "Microsoft Sucks.avs" and "Microsoft Blows.avs" translate to "MICROS~1.AVS" and "MICROS~2.AVS"; which is which depends on the order of creation.


Given there will never be enough developer cycles to do a Unicode conversion and that such a change would not be transparent to the plugin API (i.e. recompile ALL the plugins in existence) I am currently leaning towards making the internals UTF-8.


This means all wide char system entries would do WideCharToMultiByte(CP_UTF8, ...

And all file system interaction would do MultiByteToWideChar(CP_UTF8, ... followed by using the ...W() version of the Windows API.

In Import, for BOM-less ANSI files, do a MultiByteToWideChar(AreFileApisANSI() ? CP_ACP : CP_OEMCP, ...) followed by WideCharToMultiByte(CP_UTF8, ...). For files with a BOM (UTF-8, UTF-16, etc.), strip the BOM and translate directly to UTF-8.

To support plugins that deal with the file system, provide a ShortPathName() function that uses similar logic to Bernhard's above, and maybe UTF8_to_ANSI/UTF8_to_OEM/UTF8_to_ASCII() string functions.

Within the 7 bit ASCII character set the change would be transparent.

Only scripts that contain code-page-specific characters and feed those characters to plugins that interact with ANSI file APIs will need modification. Currently this case works or fails by the grace of god anyway. Feeding them to internal functions will work as expected.

Thoughts?
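
A sketch of the BOM-less Import path described above, i.e. local code page to wide char to UTF-8. `legacy_to_utf8` is a hypothetical name. The Windows branch uses the two API calls named in the post; the non-Windows branch is purely an assumption added so the sketch can be compiled anywhere, and it treats the input as Latin-1.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: convert text in the process's legacy code page to UTF-8.
inline std::string legacy_to_utf8(const std::string& in) {
    if (in.empty()) return std::string();
#ifdef _WIN32
    const UINT cp = AreFileApisANSI() ? CP_ACP : CP_OEMCP;
    // Step 1: legacy code page -> UTF-16.
    const int wlen = MultiByteToWideChar(cp, 0, in.data(),
                                         static_cast<int>(in.size()), nullptr, 0);
    if (wlen <= 0) return std::string();
    std::wstring wide(static_cast<std::size_t>(wlen), L'\0');
    MultiByteToWideChar(cp, 0, in.data(), static_cast<int>(in.size()),
                        &wide[0], wlen);
    // Step 2: UTF-16 -> UTF-8.
    const int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen,
                                         nullptr, 0, nullptr, nullptr);
    if (ulen <= 0) return std::string();
    std::string out(static_cast<std::size_t>(ulen), '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen, &out[0], ulen,
                        nullptr, nullptr);
    return out;
#else
    // Assumption for illustration only: treat the input as Latin-1, where each
    // byte >= 0x80 maps to the code point of the same value (two UTF-8 bytes).
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    assert(out.size() >= in.size());
    return out;
#endif
}
```

Pure 7 bit ASCII input comes out byte-for-byte identical, which is the transparency property claimed above.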
Old 20th January 2009, 23:48   #29  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
Remember, because Avisynth is a library, stuffing around with the process/thread codepage settings is not possible, just like we cannot pop message boxes.

Hence the "MultiByteToWideChar(CP_UTF8, ... followed by using the ...W() version of the Windows API." gymnastics will be needed in place of any ANSI file system APIs.
Old 16th February 2010, 12:59   #30  |  Link
stax76
Registered User
 
Join Date: Jun 2002
Location: On thin ice
Posts: 6,837
Apparently the script itself cannot have a Unicode filename like:

...\Hoří, má panenko.avs

Shows: Import: Couldn't open "...\Hoří, má panenko.avs"

Using version 2.58
Old 16th February 2010, 21:40   #31  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
@stax76, yes this is true. As long as we continue to use the ...A() Windows system calls there will be problems with Unicode characters that are not available in the active codepage of the host application.
Old 17th February 2010, 03:30   #32  |  Link
mariush
Registered User
 
Join Date: Dec 2008
Posts: 589
No.

The BOM is optional and will most likely appear only if the architecture is different, like little endian vs big endian - that's why there are two BOMs. A file can be BOTH ASCII and UTF8 at the same time, as UTF8 includes the ASCII subset.

The best thing would be to ASSUME the file is UTF8 and, if you stumble on errors while parsing it, fall back to the ANSI code page. As errors go, there are some documented edge cases and disallowed byte sequences. For example, you're not allowed to write the pound sign (£) as the single byte 0xA3 in UTF8, because 0xA3 on its own is a continuation byte there, so you have to encode it as two bytes (0xC2 0xA3).
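
A minimal sketch of that "assume UTF8, fall back on errors" check, written from the published UTF-8 rules rather than taken from any library: `is_valid_utf8`, `ScriptEncoding` and `guess_encoding` are hypothetical names. It rejects stray continuation bytes, truncated sequences, overlong forms, surrogates and code points above U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true if the whole byte string is well-formed UTF-8.
inline bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) { ++i; continue; }            // plain ASCII byte
        std::size_t extra = 0;                        // continuation bytes expected
        unsigned long cp = 0;                         // decoded code point
        unsigned long lowest = 0;                     // smallest value for this length
        if ((b0 & 0xE0) == 0xC0)      { extra = 1; cp = b0 & 0x1F; lowest = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; lowest = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; lowest = 0x10000; }
        else return false;                            // lone continuation or bad lead byte
        if (n - i <= extra) return false;             // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;     // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;                             // overlong, out of range, surrogate
        i += extra + 1;
    }
    assert(i == n);
    return true;
}

enum class ScriptEncoding { Utf8, LegacyCodePage };

// Assume UTF-8 and fall back to the legacy code page when validation fails.
inline ScriptEncoding guess_encoding(const std::string& bytes) {
    return is_valid_utf8(bytes) ? ScriptEncoding::Utf8
                                : ScriptEncoding::LegacyCodePage;
}
```

With this check a script containing a lone 0xA3 byte would be classified as legacy code page text, while one containing 0xC2 0xA3 would be classified as UTF-8. Pure ASCII passes as UTF-8, which is harmless because the bytes mean the same thing in both.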

There's a library ICU (http://site.icu-project.org/) that's also used by PHP to do normalization between UTF formats and code pages and stuff...

ps... AviSource("something", UTF8=true) is kind of dumb... what's next, you'll have utf16=true? If it comes to this, it would make more sense to have Codepage="UTF8" or something like this.

ps2. DON'T assume that the short file names are there. Some people use registry tweaks to disable the creation of short file names or to disable last access/last modified time updates, for better disk access, less I/O, whatever, so these short file names are not always there. And... if I remember correctly, on some media they don't exist by default, like 8GB micro SD cards or removable drives or some CD/DVD media.

Last edited by mariush; 17th February 2010 at 03:37.
Old 17th February 2010, 09:46   #33  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
@mariush, what do you mean by "No."? (Are you answering my post 1374763 or is this out of context?)

Avisynth currently supports ANSI encoding in the host application's current code page, i.e. what the ...A() Windows API calls support. To avoid bogus support issues it currently detects UTF8, UTF16le and UTF16be BOM codes and reports a helpful error.
Old 17th February 2010, 17:08   #34  |  Link
mariush
Registered User
 
Join Date: Dec 2008
Posts: 589
No... in the sense that I don't agree with what was said in the previous few messages (about short file names), which were posts 25 - 29 when I read the thread (I think I left the page open in a tab and replied after a few hours and that's why it's a bit out of context).

My apologies if it was misunderstood.
Old 27th June 2010, 18:03   #35  |  Link
stax76
Registered User
 
Join Date: Jun 2002
Location: On thin ice
Posts: 6,837
Has there been any progress?
Old 27th June 2010, 22:00   #36  |  Link
krieger2005
Registered User
 
Join Date: Oct 2003
Location: Germany
Posts: 377
I read in the blog of the CppCMS project (http://art-blog.no-ip.info/cppcms/blog) that the developer had some problems with multi-language filenames when developing on Windows. As I understand it, he solved his problem, so maybe looking at his results or talking to him would help with implementing access to multi-language filenames.

PS: Here is the URL of the exact blog entry: http://art-blog.no-ip.info/cppcms/blog/post/62
Old 4th July 2010, 19:21   #37  |  Link
stax76
Registered User
 
Join Date: Jun 2002
Location: On thin ice
Posts: 6,837
I have a report that a script name with Russian Unicode chars like предприятий.avs works on Russian systems. Can somebody confirm this? If it works, can somebody explain to me why? On my German system it's not working.
Old 4th July 2010, 20:15   #38  |  Link
kemuri-_9
Compiling Encoder
 
Join Date: Jan 2007
Posts: 1,348
Quote:
Originally Posted by stax76 View Post
I have a report that a script name with Russian Unicode chars like предприятий.avs works on Russian systems. Can somebody confirm this? If it works, can somebody explain to me why? On my German system it's not working.
IanB answered your questions already:
Quote:
Originally Posted by IanB View Post
Avisynth currently supports ANSI encoding in the host application's current code page, i.e. what the ...A() Windows API calls support.
So whatever your Windows system is using as the native code page is what Avisynth can recognize.
For example, on my Japanese codepage (cp932) system I can use scripts that use filenames with Japanese characters,
but I cannot use scripts that contain filenames with Japanese characters on my English codepage (cp1252) system.
Old 4th July 2010, 20:55   #39  |  Link
stax76
Registered User
 
Join Date: Jun 2002
Location: On thin ice
Posts: 6,837
The report I got must be wrong then. Maybe I could try renaming source files, converting Unicode chars to ANSI chars where there is a representation; not sure if it can be done with the WinAPI or .NET API or if I need a table or something.
Old 4th July 2010, 22:27   #40  |  Link
Groucho2004
 
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
You seem to be confused about something in Kemuri's reply.

You can easily try it yourself even on your German Windows: change the code page to Cyrillic and restart Windows. Avisynth will then accept filenames with Cyrillic characters.

Just in case you don't know how:
Control Panel -> Regional and Language Options -> Advanced -> Language for non-Unicode programs -->> Change it to Russian.

Last edited by Groucho2004; 4th July 2010 at 22:32.