20th November 2007, 22:48 | #21 | Link |
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
|
Hmm, keep the ruminations coming
Code:
Bytes       Encoding Form
FE FF       UTF-16, big-endian
FF FE       UTF-16, little-endian
EF BB BF    UTF-8

However, if I saw the EF BB BF sequence I could set the default AVSCodePage to UTF-8. I am not aware of editors that do this, but I could easily add the processing in anticipation, effectively treating it as if the script had started with AVSCodePage("UTF-8") ... [Is this transparent?]
As for the "maybe not so", I could easily do a WideCharToMultiByte(CP_UTF8, ...) and set AVSCodePage=CP_UTF8 in the file read code of Import() ... [Is this transparent?]
Consider your wishes carefully.
I will not implement anything that will break the existing 7 bit ASCII behaviour of Avisynth.
I will resist implementing anything that will break existing 8 bit codepage behaviour that currently works satisfactorily.
I will not implement anything that will prevent a future pure 16 bit Unicode implementation of Avisynth. |
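As a rough illustration of the BOM check described above, here is a minimal Python sketch (not Avisynth code). The name `sniff_bom` is made up for this example, and it only knows the three signatures from the table, so UTF-32 is deliberately ignored.

```python
import codecs

# The three signatures from the table above, mapped to Python codec names.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no known BOM is present.

    Hypothetical helper for illustration only.
    """
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0
```

A script that starts with EF BB BF would then be handled as if it had declared UTF-8 itself, while a BOM-less script keeps the existing code page behaviour.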
21st November 2007, 00:40 | #22 | Link |
Angel of Night
Join Date: Nov 2004
Location: Tangled in the silks
Posts: 9,559
|
Why do you want to transform back to UTF-8, instead of dropping internal ASCII parsing altogether? Convert to UTF-16 based on either the UTF marked by the BOM or the default code page in use. Convert back to 8-bit for string arguments based on the code page only. If people want to use Unicode without a BOM, perhaps use the Python method: assume everything is 8-bit unless a string is preceded by u, as in u"ангел смерти".
Define a new character to replace 's' ('u'?) for Unicode or UTF-8 or whatever you want, and then convert all characters to that, for any filters that want to support it. Tell people that they need two filter definitions, one with s and one with u, and show them how to build callbacks using WideCharToMultiByte(CP_ACP, ...) to hand to a WCHAR constructor. Avisynth will prefer the u definition over the s definition. Supporting both legacy ASCII and Unicode behavior is not an easy problem, and any easy workarounds are unfortunately likely to make a later conversion to full Unicode impossible without breaking something. |
21st November 2007, 01:29 | #23 | Link |
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
|
Because converting Avisynth to unicode is going to be very very hard and I am lazy
Just saying scripts are UTF-8 period! would be exceedingly easy, but I am not that much of an arsehole, so I look for simple ways to maintain the status quo compatibility wise and give Foreign Language users some way to move forward. |
21st November 2007, 13:52 | #24 | Link |
Registered User
Join Date: Dec 2005
Posts: 560
|
UTF-8 seems like a good standard to go for
Under Linux I believe you can pass a UTF-8 filename to fopen() and it'll work fine. Under Windows this doesn't work, so you would have to convert the filenames to wchar_t or whatever the defines for Windows are (TCHAR with the _UNICODE define, I think) and then use _wfopen(). But that doesn't seem like too much of a big deal. That way all existing scripts will still work and can easily be extended. |
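To make the conversion step concrete, here is a small Python sketch of what turning a UTF-8 filename into a Windows wide string amounts to at the byte level. It only stands in for the MultiByteToWideChar(CP_UTF8, ...) step; `utf8_to_wide` is a made-up name and nothing here calls the Windows API.

```python
def utf8_to_wide(name_utf8: bytes) -> bytes:
    """Re-encode a UTF-8 filename as NUL-terminated UTF-16LE.

    UTF-16LE with a two-byte terminator is the in-memory layout of a
    wchar_t string on Windows, which is what _wfopen() expects.
    Hypothetical helper for illustration only.
    """
    return name_utf8.decode("utf-8").encode("utf-16-le") + b"\x00\x00"
```

Plain ASCII names simply get a zero byte after every character, which is why existing scripts would be unaffected by such a conversion.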
14th January 2009, 13:57 | #25 | Link | |
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
|
Got this email from sourceforge, interesting idea about using the short (8.3) names as a hack around.
A difficulty to consider might be how Windows maps long filenames that collide in the short namespace. I believe it is chronological, i.e. the 1st file created is FRED~1.AVS, the 2nd file created is FRED~2.AVS. Quote:
Last edited by Wilbert; 20th January 2009 at 18:49. |
|
20th January 2009, 04:02 | #26 | Link |
Registered User
Join Date: Jan 2009
Posts: 1
|
Yay after 5 long days I am finally allowed to post :-)
There are no collisions possible in the short form; the operating system takes care of that. For Windows and its two file systems, VFAT and NTFS, the short forms are actually stored in the file system. That means a short form is generated when the file is named and will not change until the (long) name does. After ~1, ~2, ~3, ~4 the naming (at least under NTFS) changes to some hashing/random scheme which keeps only the first two ASCII letters. Oh, and I'm not sure how I feel about having my full name and an email address that forwards to my main one being posted on a public forum... |
20th January 2009, 18:49 | #27 | Link | |
Moderator
Join Date: Nov 2001
Location: Netherlands
Posts: 6,364
|
Quote:
|
|
20th January 2009, 23:38 | #28 | Link | ||
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
|
Quote:
Quote:
For the benefit of others: I was worried that you cannot predict the short name for a given long filename, i.e. "Microsoft Sucks.avs" and "Microsoft Blows.avs" translate to "MICROS~1.AVS" and "MICROS~2.AVS", and which is which depends on the order of creation.
Given there will never be enough developer cycles to do a Unicode conversion, and that such a change would not be transparent to the plugin API (i.e. recompile ALL the plugins in existence), I am currently leaning towards making the internals UTF-8.
This means all wide char system entries would do WideCharToMultiByte(CP_UTF8, ... and all file system interaction would do MultiByteToWideChar(CP_UTF8, ... followed by using the ...W() version of the Windows API.
In Import, for BOM-less ASCII files do a MultiByteToWideChar(AreFileApisANSI() ? CP_ACP : CP_OEMCP, ... ->> WideCharToMultiByte(CP_UTF8, ...; for files with a BOM, i.e. UTF, UNICODE, etc., strip the BOM and directly translate to UTF-8.
To support plugins that deal with the file system, provide a ShortPathName() function that uses similar logic to Bernhard's above, and maybe UTF8_to_ANSI/UTF8_to_OEM/UTF8_to_ASCII() string functions.
Within the 7 bit ASCII character set the change would be transparent. Only scripts that contain code page specific characters and feed those characters to plugins that interact with ANSI file APIs will need modification. Currently this case works and fails by the grace of god anyway. Feeding to internal functions will work as expected.
Thoughts? |
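The Import step proposed above can be sketched in a few lines of Python. This is an illustration of the idea, not Avisynth code: `script_to_utf8` and its `ansi_codepage` parameter are assumptions made for the example, with the parameter standing in for whatever CP_ACP or CP_OEMCP resolves to on the host.

```python
import codecs


def script_to_utf8(data: bytes, ansi_codepage: str = "cp1252") -> bytes:
    """Normalise raw script bytes to BOM-less UTF-8.

    Files with a BOM are decoded according to it and the BOM is dropped;
    BOM-less files are assumed to be in the given 8-bit code page.
    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF8):
        text = data[len(codecs.BOM_UTF8):].decode("utf-8")
    elif data.startswith(codecs.BOM_UTF16_BE):
        text = data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    elif data.startswith(codecs.BOM_UTF16_LE):
        text = data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    else:
        # Equivalent in spirit to MultiByteToWideChar(CP_ACP, ...).
        text = data.decode(ansi_codepage)
    # Equivalent in spirit to WideCharToMultiByte(CP_UTF8, ...).
    return text.encode("utf-8")
```

A pure 7 bit ASCII script comes out byte-for-byte identical, which matches the transparency claim in the post; only bytes above 0x7F change.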
||
20th January 2009, 23:48 | #29 | Link |
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
|
Remember, because Avisynth is a library, stuffing around with the process/thread codepage settings is not possible, just like we cannot pop messageboxes.
Hence the "MultiByteToWideChar(CP_UTF8, ... followed by using the ...W() version of the Windows API." gymnastics will be needed in place of any ANSI file system APIs. |
16th February 2010, 12:59 | #30 | Link |
Registered User
Join Date: Jun 2002
Location: On thin ice
Posts: 6,837
|
Apparently the script itself cannot have a Unicode filename like:
...\Hoří, má panenko.avs
Shows: Import: Couldn't open "...\Hoří, má panenko.avs"
Using version 2.58
__________________
https://github.com/stax76/software-list https://www.youtube.com/@stax76/playlists |
16th February 2010, 21:40 | #31 | Link |
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
|
@stax76, yes this is true. As long as we continue to use ...A() Windows system calls there will be problems using Unicode characters that are not available in the active codepage for the host application.
|
17th February 2010, 03:30 | #32 | Link |
Registered User
Join Date: Dec 2008
Posts: 589
|
No.
The BOM is optional and will most likely appear only if the architecture is different, like little endian vs big endian; that's why there are two BOMs. A file can be BOTH ASCII and UTF8 at the same time, as UTF8 includes the ASCII subset. The best thing would be to ASSUME the file is UTF8 and, if you stumble on errors while parsing it, fall back to the 8-bit code page.
As errors go, there are some documented edge cases and disallowed byte combinations. For example, you're not allowed to write the pound sign (£) as the single byte 0xA3 in UTF8, because bytes in the 0x80-0xBF range are reserved as continuation bytes, so it has to be encoded as two bytes (C2 A3).
There's a library, ICU (http://site.icu-project.org/), that's also used by PHP to do normalization between UTF formats and code pages and stuff...
ps... AviSource("something", UTF8=true) is kind of dumb... what's next, you'll have utf16=true? If it comes to this, it would make more sense to have Codepage="UTF8" or something like this.
ps2. DON'T make assumptions that the short file names are there. Some people do registry tweaks to disable the creation of short file names or to disable last access/last modified time updates, for better disk access, less I/O, whatever, so these short file names are not always there. And... if I remember correctly, on some media they don't exist by default, like 8GB micro SD cards or removable drives or some CD/DVD media.
Last edited by mariush; 17th February 2010 at 03:37. |
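The "assume UTF8, fall back on errors" suggestion might look like the Python sketch below. The name `decode_script` and the cp1252 default are assumptions for the example; a real implementation would use the host's active code page.

```python
def decode_script(data: bytes, fallback_codepage: str = "cp1252") -> str:
    """Try strict UTF-8 first, then fall back to an 8-bit code page.

    Hypothetical helper for illustration only.
    """
    try:
        return data.decode("utf-8")  # strict: raises on invalid sequences
    except UnicodeDecodeError:
        return data.decode(fallback_codepage)
```

A lone 0xA3 byte fails the UTF-8 attempt and is read as £ from the code page, while the two-byte form C2 A3 decodes to the same character directly. One caveat: a short run of code page bytes can occasionally happen to be valid UTF-8, so the guess is very good rather than infallible.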
17th February 2010, 09:46 | #33 | Link |
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
|
@mariush, what do you mean by "No."? (Are you answering my post 1374763 or is this out of context?)
Avisynth currently supports ANSI encoding in the host application's current code page, i.e. what the ...A() Windows API calls support. To avoid bogus support issues it currently detects UTF8, UTF16le and UTF16be BOM codes and reports a helpless error. |
17th February 2010, 17:08 | #34 | Link |
Registered User
Join Date: Dec 2008
Posts: 589
|
No... in the sense that I don't agree with what was said in the previous few messages (about short file names), which were posts 25 - 29 when I read the thread (I think I left the page open in a tab and replied after a few hours and that's why it's a bit out of context).
My apologies if it was misunderstood. |
27th June 2010, 18:03 | #35 | Link |
Registered User
Join Date: Jun 2002
Location: On thin ice
Posts: 6,837
|
Has there been any progress?
27th June 2010, 22:00 | #36 | Link |
Registered User
Join Date: Oct 2003
Location: Germany
Posts: 377
|
I read in the blog of the CppCMS project (http://art-blog.no-ip.info/cppcms/blog) that the developer had some problems with multi-language filenames when developing on Windows. As I understand it, he solved his problem. So maybe looking at his results or talking to him would help with implementing access to multi-language filenames.
PS: Here is the URL of the exact blog entry: http://art-blog.no-ip.info/cppcms/blog/post/62 |
4th July 2010, 19:21 | #37 | Link |
Registered User
Join Date: Jun 2002
Location: On thin ice
Posts: 6,837
|
I have a report that a script name with Russian Unicode chars like предприятий.avs works on Russian systems. Can somebody confirm this? If it works, can somebody explain to me why? On my German system it's not working.
4th July 2010, 20:15 | #38 | Link | ||
Compiling Encoder
Join Date: Jan 2007
Posts: 1,348
|
Quote:
Quote:
For example, on my Japanese code page (cp932) system I can use scripts that contain filenames with Japanese characters, but I cannot use such scripts on my English code page (cp1252) system. |
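The code page dependence described here can be checked with a short Python sketch: a filename survives the ...A() calls only if every character exists in the active code page. `representable` is a made-up helper, and Python's strict encoding is only an approximation, since Windows may apply best-fit substitutions instead of failing outright.

```python
def representable(name: str, codepage: str) -> bool:
    """Return True if every character of name exists in the given code page.

    Hypothetical helper for illustration only.
    """
    try:
        name.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False
```

With this check, предприятий.avs passes for cp1251 (Cyrillic) but not for cp1252, and "Hoří, má panenko.avs" passes for cp1250 (Central European) but not for cp1252, which is consistent with the reports in this thread.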
||
4th July 2010, 20:55 | #39 | Link |
Registered User
Join Date: Jun 2002
Location: On thin ice
Posts: 6,837
|
The report I got must be wrong then. Maybe I could try renaming source files, converting Unicode chars to ANSI chars where there is a representation; I'm not sure if it can be done with the WinAPI or .NET API or if I need a table or something.
4th July 2010, 22:27 | #40 | Link |
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
|
You seem to be confused about something in Kemuri's reply.
You can easily try it yourself even on your German Windows: change the code page to Cyrillic and restart Windows. Avisynth will then accept filenames with Cyrillic characters. Just in case you don't know how: Control Panel -> Regional and Language Options -> Advanced -> Language for non-Unicode programs -> change it to Russian. Last edited by Groucho2004; 4th July 2010 at 22:32. |
|
|