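This post has been brewing for a long time, and I was afraid to publish it. Frankly, these computer fundamentals have always been murky to me. I kept hesitating over what to write and whether I had it right, and I worried about being looked down on. I try to back up what I say with tests, but my level is not high, so I am still afraid that I have not grasped such basic material well.
Before getting into the details of text encoding, a few concepts need to be introduced: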
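Characters:
The symbols used to represent a language visually. This is easy to understand: they are the things we see every day, such as A, B, C, D, 啊, 喔, 额.
Character set:
The total number of characters, symbols, and digits we meet in daily life is enormous, and handling all of them at once is impractical. So we decide in advance which characters will be used, and that collection is called a character set. Representative character sets include the well-known American ASCII, the European ISO 8859, China's GB 2312, and later Unicode, which aims to cover many languages. ASCII, for example, defines 128 characters: control codes, digits, the Latin letters, and basic punctuation.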
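To make the idea of a character set concrete, here is a minimal Java sketch that prints the printable part of ASCII (codes 32 to 126). The class name `AsciiTable` and the helper `row` are made up for this illustration.

```java
/** Prints the printable part of the ASCII table (codes 32 to 126). */
class AsciiTable {

    /** Returns one table row for the given ASCII code: decimal, hex, character. */
    static String row(int code) {
        return String.format("%3d  0x%02X  %c", code, code, (char) code);
    }

    public static void main(String[] args) {
        // Codes 0-31 and 127 are control characters with no visible glyph, so they are skipped.
        for (int code = 32; code <= 126; code++) {
            System.out.println(row(code));
        }
    }
}
```

Every character in the set is simply paired with a small integer, which is all a character set with codes really is.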
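Character code:
Within a character set, each character is assigned a number. That number is called the character code.
Character encoding scheme:
The way a computer represents those character codes, using nothing but integers, is called a character encoding scheme.
Now things are a little clearer. Computers can handle images, animation, and all kinds of programs and data, but the CPU itself only works with binary numbers, so everything has to be converted to binary first. The people who built the first computers spoke English, which is why early character sets such as ASCII contain only letters, digits, and basic symbols. As computers spread to other countries, including China, ASCII was no longer enough, so GB 2312, the character sets of other countries, and eventually Unicode appeared.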
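The distinction between a character code and an encoding scheme can be sketched in a few lines of Java. This is only an illustration under my own naming (`CodePointDemo`, `codePointOf`, and `utf8Hex` are hypothetical helpers): the code point is the number Unicode assigns to the character, and UTF-8 is one scheme for turning that number into bytes.

```java
import java.nio.charset.StandardCharsets;

class CodePointDemo {

    /** Formats the first character's Unicode code point in the usual U+XXXX notation. */
    static String codePointOf(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    /** Returns the UTF-8 bytes of a string as space-separated hex. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF)); // mask because Java bytes are signed
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The character 天, written as an escape so the source file's own encoding does not matter.
        String tian = "\u5929";
        System.out.println(codePointOf("A") + " -> " + utf8Hex("A"));   // U+0041 -> 41
        System.out.println(codePointOf(tian) + " -> " + utf8Hex(tian)); // U+5929 -> E5 A4 A9
    }
}
```

The same code point would produce different bytes under a different encoding scheme, which is what the experiments later in this post show.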
Now let's write some code and look at this in detail.
First, list the encodings that C# (.NET) supports:
using System;
using System.IO;
using System.Text;

namespace Text
{
    class Program
    {
        static void Main(string[] args)
        {
            // Collect the display name and the name of every encoding .NET knows about.
            StringBuilder sb = new StringBuilder();
            foreach (EncodingInfo coif in Encoding.GetEncodings())
            {
                sb.Append("Display Name: " + coif.DisplayName + "----Name: " + coif.Name + "\n");
            }

            // "Unicode" is .NET's name for little-endian UTF-16.
            byte[] coByte = Encoding.GetEncoding("Unicode").GetBytes(sb.ToString());

            // FileMode.Create truncates an existing file, so bytes from an earlier run cannot linger.
            using (FileStream fs = File.Open("c:\\code.txt", FileMode.Create))
            {
                fs.Write(coByte, 0, coByte.Length);
            }
            Console.ReadKey();
        }
    }
}
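I originally printed this to the console, but the output turned out to be long, so I wrote it to a file instead. Here is the output (the display names are localized by Windows, which is why they appear in Chinese):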
Display Name: IBM EBCDIC (美国-加拿大)----Name: IBM037
Display Name: OEM 美国----Name: IBM437
Display Name: IBM EBCDIC (国际)----Name: IBM500
Display Name: 阿拉伯字符(ASMO-708)----Name: ASMO-708
Display Name: 阿拉伯字符(DOS)----Name: DOS-720
Display Name: 希腊字符(DOS)----Name: ibm737
Display Name: 波罗的海字符(DOS)----Name: ibm775
Display Name: 西欧字符(DOS)----Name: ibm850
Display Name: 中欧字符(DOS)----Name: ibm852
Display Name: OEM 西里尔语----Name: IBM855
Display Name: 土耳其字符(DOS)----Name: ibm857
Display Name: OEM 多语言拉丁语 I----Name: IBM00858
Display Name: 葡萄牙语(DOS)----Name: IBM860
Display Name: 冰岛语(DOS)----Name: ibm861
Display Name: 希伯来字符(DOS)----Name: DOS-862
Display Name: 加拿大法语(DOS)----Name: IBM863
Display Name: 阿拉伯字符(864)----Name: IBM864
Display Name: 北欧字符(DOS)----Name: IBM865
Display Name: 西里尔字符(DOS)----Name: cp866
Display Name: 现代希腊字符(DOS)----Name: ibm869
Display Name: IBM EBCDIC (多语言拉丁语 2)----Name: IBM870
Display Name: 泰语(Windows)----Name: windows-874
Display Name: IBM EBCDIC (现代希腊语)----Name: cp875
Display Name: 日语(Shift-JIS)----Name: shift_jis
Display Name: 简体中文(GB2312)----Name: gb2312
Display Name: 朝鲜语----Name: ks_c_5601-1987
Display Name: 繁体中文(Big5)----Name: big5
Display Name: IBM EBCDIC (土耳其拉丁语 5)----Name: IBM1026
Display Name: IBM 拉丁语 1----Name: IBM01047
Display Name: IBM EBCDIC (美国-加拿大-欧洲)----Name: IBM01140
Display Name: IBM EBCDIC (德国-欧洲)----Name: IBM01141
Display Name: IBM EBCDIC (丹麦-挪威-欧洲)----Name: IBM01142
Display Name: IBM EBCDIC (芬兰-瑞典-欧洲)----Name: IBM01143
Display Name: IBM EBCDIC (意大利-欧洲)----Name: IBM01144
Display Name: IBM EBCDIC (西班牙-欧洲)----Name: IBM01145
Display Name: IBM EBCDIC (英国-欧洲)----Name: IBM01146
Display Name: IBM EBCDIC (法国-欧洲)----Name: IBM01147
Display Name: IBM EBCDIC (国际-欧洲)----Name: IBM01148
Display Name: IBM EBCDIC (冰岛语-欧洲)----Name: IBM01149
Display Name: Unicode----Name: utf-16
Display Name: Unicode (Big-Endian)----Name: utf-16BE
Display Name: 中欧字符(Windows)----Name: windows-1250
Display Name: 西里尔字符(Windows)----Name: windows-1251
Display Name: 西欧字符(Windows)----Name: Windows-1252
Display Name: 希腊字符(Windows)----Name: windows-1253
Display Name: 土耳其字符(Windows)----Name: windows-1254
Display Name: 希伯来字符(Windows)----Name: windows-1255
Display Name: 阿拉伯字符(Windows)----Name: windows-1256
Display Name: 波罗的海字符(Windows)----Name: windows-1257
Display Name: 越南字符(Windows)----Name: windows-1258
Display Name: 朝鲜语(Johab)----Name: Johab
Display Name: 西欧字符(Mac)----Name: macintosh
Display Name: 日语(Mac)----Name: x-mac-japanese
Display Name: 繁体中文(Mac)----Name: x-mac-chinesetrad
Display Name: 朝鲜语(Mac)----Name: x-mac-korean
Display Name: 阿拉伯字符(Mac)----Name: x-mac-arabic
Display Name: 希伯来字符(Mac)----Name: x-mac-hebrew
Display Name: 希腊字符(Mac)----Name: x-mac-greek
Display Name: 西里尔字符(Mac)----Name: x-mac-cyrillic
Display Name: 简体中文(Mac)----Name: x-mac-chinesesimp
Display Name: 罗马尼亚语(Mac)----Name: x-mac-romanian
Display Name: 乌克兰语(Mac)----Name: x-mac-ukrainian
Display Name: 泰语(Mac)----Name: x-mac-thai
Display Name: 中欧字符(Mac)----Name: x-mac-ce
Display Name: 冰岛语(Mac)----Name: x-mac-icelandic
Display Name: 土耳其字符(Mac)----Name: x-mac-turkish
Display Name: 克罗地亚语(Mac)----Name: x-mac-croatian
Display Name: Unicode (UTF-32)----Name: utf-32
Display Name: Unicode (UTF-32 Big-Endian)----Name: utf-32BE
Display Name: 繁体中文(CNS)----Name: x-Chinese-CNS
Display Name: TCA 台湾----Name: x-cp20001
Display Name: 繁体中文(Eten)----Name: x-Chinese-Eten
Display Name: IBM5550 台湾----Name: x-cp20003
Display Name: TeleText 台湾----Name: x-cp20004
Display Name: Wang 台湾----Name: x-cp20005
Display Name: 西欧字符(IA5)----Name: x-IA5
Display Name: 德语(IA5)----Name: x-IA5-German
Display Name: 瑞典语(IA5)----Name: x-IA5-Swedish
Display Name: 挪威语(IA5)----Name: x-IA5-Norwegian
Display Name: US-ASCII----Name: us-ascii
Display Name: T.61----Name: x-cp20261
Display Name: ISO-6937----Name: x-cp20269
Display Name: IBM EBCDIC (德国)----Name: IBM273
Display Name: IBM EBCDIC (丹麦-挪威)----Name: IBM277
Display Name: IBM EBCDIC (芬兰-瑞典)----Name: IBM278
Display Name: IBM EBCDIC (意大利)----Name: IBM280
Display Name: IBM EBCDIC (西班牙)----Name: IBM284
Display Name: IBM EBCDIC (UK)----Name: IBM285
Display Name: IBM EBCDIC (日语片假名)----Name: IBM290
Display Name: IBM EBCDIC (法国)----Name: IBM297
Display Name: IBM EBCDIC (阿拉伯语)----Name: IBM420
Display Name: IBM EBCDIC (希腊语)----Name: IBM423
Display Name: IBM EBCDIC (希伯来语)----Name: IBM424
Display Name: IBM EBCDIC (朝鲜语扩展)----Name: x-EBCDIC-KoreanExtended
Display Name: IBM EBCDIC (泰语)----Name: IBM-Thai
Display Name: 西里尔字符(KOI8-R)----Name: koi8-r
Display Name: IBM EBCDIC (冰岛语)----Name: IBM871
Display Name: IBM EBCDIC (西里尔俄语)----Name: IBM880
Display Name: IBM EBCDIC (土耳其语)----Name: IBM905
Display Name: IBM 拉丁语 1----Name: IBM00924
Display Name: 日语(JIS 0208-1990 和 0212-1990)----Name: EUC-JP
Display Name: 简体中文(GB2312-80)----Name: x-cp20936
Display Name: 朝鲜语 Wansung----Name: x-cp20949
Display Name: IBM EBCDIC (西里尔塞尔维亚-保加利亚语)----Name: cp1025
Display Name: 西里尔字符(KOI8-U)----Name: koi8-u
Display Name: 西欧字符(ISO)----Name: iso-8859-1
Display Name: 中欧字符(ISO)----Name: iso-8859-2
Display Name: 拉丁语 3 (ISO)----Name: iso-8859-3
Display Name: 波罗的海字符(ISO)----Name: iso-8859-4
Display Name: 西里尔字符(ISO)----Name: iso-8859-5
Display Name: 阿拉伯字符(ISO)----Name: iso-8859-6
Display Name: 希腊字符(ISO)----Name: iso-8859-7
Display Name: 希伯来字符(ISO-Visual)----Name: iso-8859-8
Display Name: 土耳其字符(ISO)----Name: iso-8859-9
Display Name: 爱沙尼亚语(ISO)----Name: iso-8859-13
Display Name: 拉丁语 9 (ISO)----Name: iso-8859-15
Display Name: 欧罗巴----Name: x-Europa
Display Name: 希伯来字符(ISO-Logical)----Name: iso-8859-8-i
Display Name: 日语(JIS)----Name: iso-2022-jp
Display Name: 日语(JIS-允许 1 字节假名)----Name: csISO2022JP
Display Name: 日语(JIS-允许 1 字节假名 - SO/SI)----Name: iso-2022-jp
Display Name: 朝鲜语(ISO)----Name: iso-2022-kr
Display Name: 简体中文(ISO-2022)----Name: x-cp50227
Display Name: 日语(EUC)----Name: euc-jp
Display Name: 简体中文(EUC)----Name: EUC-CN
Display Name: 朝鲜语(EUC)----Name: euc-kr
Display Name: 简体中文(HZ)----Name: hz-gb-2312
Display Name: 简体中文(GB18030)----Name: GB18030
Display Name: ISCII 梵文----Name: x-iscii-de
Display Name: ISCII 孟加拉语----Name: x-iscii-be
Display Name: ISCII 泰米尔语----Name: x-iscii-ta
Display Name: ISCII 泰卢固语----Name: x-iscii-te
Display Name: ISCII 阿萨姆语----Name: x-iscii-as
Display Name: ISCII 奥里雅语----Name: x-iscii-or
Display Name: ISCII 卡纳达语----Name: x-iscii-ka
Display Name: ISCII 马拉雅拉姆语----Name: x-iscii-ma
Display Name: ISCII 古吉拉特语----Name: x-iscii-gu
Display Name: ISCII 旁遮普语----Name: x-iscii-pa
Display Name: Unicode (UTF-7)----Name: utf-7
Display Name: Unicode (UTF-8)----Name: utf-8
Now the same thing in Java:
package code;

import java.nio.charset.Charset;
import java.util.SortedMap;

public class Code {

    public static void main(String[] args) {
        // availableCharsets() maps canonical names to the charsets this JVM supports.
        SortedMap<String, Charset> availableSet = Charset.availableCharsets();
        for (Charset charset : availableSet.values()) {
            System.out.println("DisplayName: " + charset.displayName() + " Name: " + charset.name());
        }
    }
}
The output:
DisplayName: Big5 Name: Big5
DisplayName: Big5-HKSCS Name: Big5-HKSCS
DisplayName: EUC-JP Name: EUC-JP
DisplayName: EUC-KR Name: EUC-KR
DisplayName: GB18030 Name: GB18030
DisplayName: GB2312 Name: GB2312
DisplayName: GBK Name: GBK
DisplayName: IBM-Thai Name: IBM-Thai
DisplayName: IBM00858 Name: IBM00858
DisplayName: IBM01140 Name: IBM01140
DisplayName: IBM01141 Name: IBM01141
DisplayName: IBM01142 Name: IBM01142
DisplayName: IBM01143 Name: IBM01143
DisplayName: IBM01144 Name: IBM01144
DisplayName: IBM01145 Name: IBM01145
DisplayName: IBM01146 Name: IBM01146
DisplayName: IBM01147 Name: IBM01147
DisplayName: IBM01148 Name: IBM01148
DisplayName: IBM01149 Name: IBM01149
DisplayName: IBM037 Name: IBM037
DisplayName: IBM1026 Name: IBM1026
DisplayName: IBM1047 Name: IBM1047
DisplayName: IBM273 Name: IBM273
DisplayName: IBM277 Name: IBM277
DisplayName: IBM278 Name: IBM278
DisplayName: IBM280 Name: IBM280
DisplayName: IBM284 Name: IBM284
DisplayName: IBM285 Name: IBM285
DisplayName: IBM297 Name: IBM297
DisplayName: IBM420 Name: IBM420
DisplayName: IBM424 Name: IBM424
DisplayName: IBM437 Name: IBM437
DisplayName: IBM500 Name: IBM500
DisplayName: IBM775 Name: IBM775
DisplayName: IBM850 Name: IBM850
DisplayName: IBM852 Name: IBM852
DisplayName: IBM855 Name: IBM855
DisplayName: IBM857 Name: IBM857
DisplayName: IBM860 Name: IBM860
DisplayName: IBM861 Name: IBM861
DisplayName: IBM862 Name: IBM862
DisplayName: IBM863 Name: IBM863
DisplayName: IBM864 Name: IBM864
DisplayName: IBM865 Name: IBM865
DisplayName: IBM866 Name: IBM866
DisplayName: IBM868 Name: IBM868
DisplayName: IBM869 Name: IBM869
DisplayName: IBM870 Name: IBM870
DisplayName: IBM871 Name: IBM871
DisplayName: IBM918 Name: IBM918
DisplayName: ISO-2022-CN Name: ISO-2022-CN
DisplayName: ISO-2022-JP Name: ISO-2022-JP
DisplayName: ISO-2022-JP-2 Name: ISO-2022-JP-2
DisplayName: ISO-2022-KR Name: ISO-2022-KR
DisplayName: ISO-8859-1 Name: ISO-8859-1
DisplayName: ISO-8859-13 Name: ISO-8859-13
DisplayName: ISO-8859-15 Name: ISO-8859-15
DisplayName: ISO-8859-2 Name: ISO-8859-2
DisplayName: ISO-8859-3 Name: ISO-8859-3
DisplayName: ISO-8859-4 Name: ISO-8859-4
DisplayName: ISO-8859-5 Name: ISO-8859-5
DisplayName: ISO-8859-6 Name: ISO-8859-6
DisplayName: ISO-8859-7 Name: ISO-8859-7
DisplayName: ISO-8859-8 Name: ISO-8859-8
DisplayName: ISO-8859-9 Name: ISO-8859-9
DisplayName: JIS_X0201 Name: JIS_X0201
DisplayName: JIS_X0212-1990 Name: JIS_X0212-1990
DisplayName: KOI8-R Name: KOI8-R
DisplayName: KOI8-U Name: KOI8-U
DisplayName: Shift_JIS Name: Shift_JIS
DisplayName: TIS-620 Name: TIS-620
DisplayName: US-ASCII Name: US-ASCII
DisplayName: UTF-16 Name: UTF-16
DisplayName: UTF-16BE Name: UTF-16BE
DisplayName: UTF-16LE Name: UTF-16LE
DisplayName: UTF-32 Name: UTF-32
DisplayName: UTF-32BE Name: UTF-32BE
DisplayName: UTF-32LE Name: UTF-32LE
DisplayName: UTF-8 Name: UTF-8
DisplayName: windows-1250 Name: windows-1250
DisplayName: windows-1251 Name: windows-1251
DisplayName: windows-1252 Name: windows-1252
DisplayName: windows-1253 Name: windows-1253
DisplayName: windows-1254 Name: windows-1254
DisplayName: windows-1255 Name: windows-1255
DisplayName: windows-1256 Name: windows-1256
DisplayName: windows-1257 Name: windows-1257
DisplayName: windows-1258 Name: windows-1258
DisplayName: windows-31j Name: windows-31j
DisplayName: x-Big5-HKSCS-2001 Name: x-Big5-HKSCS-2001
DisplayName: x-Big5-Solaris Name: x-Big5-Solaris
DisplayName: x-euc-jp-linux Name: x-euc-jp-linux
DisplayName: x-EUC-TW Name: x-EUC-TW
DisplayName: x-eucJP-Open Name: x-eucJP-Open
DisplayName: x-IBM1006 Name: x-IBM1006
DisplayName: x-IBM1025 Name: x-IBM1025
DisplayName: x-IBM1046 Name: x-IBM1046
DisplayName: x-IBM1097 Name: x-IBM1097
DisplayName: x-IBM1098 Name: x-IBM1098
DisplayName: x-IBM1112 Name: x-IBM1112
DisplayName: x-IBM1122 Name: x-IBM1122
DisplayName: x-IBM1123 Name: x-IBM1123
DisplayName: x-IBM1124 Name: x-IBM1124
DisplayName: x-IBM1364 Name: x-IBM1364
DisplayName: x-IBM1381 Name: x-IBM1381
DisplayName: x-IBM1383 Name: x-IBM1383
DisplayName: x-IBM33722 Name: x-IBM33722
DisplayName: x-IBM737 Name: x-IBM737
DisplayName: x-IBM833 Name: x-IBM833
DisplayName: x-IBM834 Name: x-IBM834
DisplayName: x-IBM856 Name: x-IBM856
DisplayName: x-IBM874 Name: x-IBM874
DisplayName: x-IBM875 Name: x-IBM875
DisplayName: x-IBM921 Name: x-IBM921
DisplayName: x-IBM922 Name: x-IBM922
DisplayName: x-IBM930 Name: x-IBM930
DisplayName: x-IBM933 Name: x-IBM933
DisplayName: x-IBM935 Name: x-IBM935
DisplayName: x-IBM937 Name: x-IBM937
DisplayName: x-IBM939 Name: x-IBM939
DisplayName: x-IBM942 Name: x-IBM942
DisplayName: x-IBM942C Name: x-IBM942C
DisplayName: x-IBM943 Name: x-IBM943
DisplayName: x-IBM943C Name: x-IBM943C
DisplayName: x-IBM948 Name: x-IBM948
DisplayName: x-IBM949 Name: x-IBM949
DisplayName: x-IBM949C Name: x-IBM949C
DisplayName: x-IBM950 Name: x-IBM950
DisplayName: x-IBM964 Name: x-IBM964
DisplayName: x-IBM970 Name: x-IBM970
DisplayName: x-ISCII91 Name: x-ISCII91
DisplayName: x-ISO-2022-CN-CNS Name: x-ISO-2022-CN-CNS
DisplayName: x-ISO-2022-CN-GB Name: x-ISO-2022-CN-GB
DisplayName: x-iso-8859-11 Name: x-iso-8859-11
DisplayName: x-JIS0208 Name: x-JIS0208
DisplayName: x-JISAutoDetect Name: x-JISAutoDetect
DisplayName: x-Johab Name: x-Johab
DisplayName: x-MacArabic Name: x-MacArabic
DisplayName: x-MacCentralEurope Name: x-MacCentralEurope
DisplayName: x-MacCroatian Name: x-MacCroatian
DisplayName: x-MacCyrillic Name: x-MacCyrillic
DisplayName: x-MacDingbat Name: x-MacDingbat
DisplayName: x-MacGreek Name: x-MacGreek
DisplayName: x-MacHebrew Name: x-MacHebrew
DisplayName: x-MacIceland Name: x-MacIceland
DisplayName: x-MacRoman Name: x-MacRoman
DisplayName: x-MacRomania Name: x-MacRomania
DisplayName: x-MacSymbol Name: x-MacSymbol
DisplayName: x-MacThai Name: x-MacThai
DisplayName: x-MacTurkish Name: x-MacTurkish
DisplayName: x-MacUkraine Name: x-MacUkraine
DisplayName: x-MS932_0213 Name: x-MS932_0213
DisplayName: x-MS950-HKSCS Name: x-MS950-HKSCS
DisplayName: x-MS950-HKSCS-XP Name: x-MS950-HKSCS-XP
DisplayName: x-mswin-936 Name: x-mswin-936
DisplayName: x-PCK Name: x-PCK
DisplayName: x-SJIS_0213 Name: x-SJIS_0213
DisplayName: x-UTF-16LE-BOM Name: x-UTF-16LE-BOM
DisplayName: X-UTF-32BE-BOM Name: X-UTF-32BE-BOM
DisplayName: X-UTF-32LE-BOM Name: X-UTF-32LE-BOM
DisplayName: x-windows-50220 Name: x-windows-50220
DisplayName: x-windows-50221 Name: x-windows-50221
DisplayName: x-windows-874 Name: x-windows-874
DisplayName: x-windows-949 Name: x-windows-949
DisplayName: x-windows-950 Name: x-windows-950
DisplayName: x-windows-iso2022jp Name: x-windows-iso2022jp
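On my machine, Java appears to support somewhat more encodings than C# (166 versus 140).
Setting the default character set in Eclipse
This one is simple. Different machines and programs may use different encodings as their default, so source code copied from one machine to another will not necessarily compile, because the compiler may read the file with the wrong encoding. Here is how to check the default character set from code:
Java: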
package code;

import java.nio.charset.Charset;

public class Code {

    public static void main(String[] args) {
        // The default charset depends on the JVM version, its settings, and the operating system.
        System.out.println("Default CharSet: " + Charset.defaultCharset());
    }
}
Output:
Default CharSet: UTF-8
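Why a different default matters can be sketched as follows. This is a minimal illustration, not a description of what Eclipse does internally; `ExplicitCharset` and `transcode` are names invented here. The text is written with one charset and read with another, which is what happens when two machines disagree on the default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharset {

    /** Encodes with one charset and decodes with another. */
    static String transcode(String text, Charset writer, Charset reader) {
        byte[] bytes = text.getBytes(writer);
        return new String(bytes, reader);
    }

    public static void main(String[] args) {
        String text = "\u5929\u6dfb"; // 天添, escaped so the source encoding does not matter

        // Same charset on both sides: the text survives.
        String same = transcode(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(same.equals(text));

        // Written as UTF-8 but read as ISO-8859-1: every byte becomes its own Latin-1 character.
        String garbled = transcode(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals(text) + ", length " + garbled.length());
    }
}
```

Passing the charset explicitly, instead of relying on `Charset.defaultCharset()`, is one way to keep a program's behaviour the same on every machine.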
The default encoding for C# in my environment:
using System;
using System.Text;

namespace Text
{
    class Program
    {
        static void Main(string[] args)
        {
            // On the .NET Framework, Encoding.Default is the system's current ANSI code page.
            Console.WriteLine(Encoding.Default.EncodingName);
            Console.ReadKey();
        }
    }
}
Output:
Now for something fun: let's see which of the encodings supported by C# can handle Chinese, reusing the first program:
using System;
using System.IO;
using System.Text;

namespace Text
{
    class Program
    {
        static void Main(string[] args)
        {
            string testStr = "天添";
            StringBuilder sb = new StringBuilder();
            foreach (EncodingInfo coif in Encoding.GetEncodings())
            {
                // Round trip: string -> bytes -> string, using the same encoding both ways.
                // Characters the encoding cannot represent come back as '?'.
                Encoding encoding = coif.GetEncoding();
                byte[] desBytes = encoding.GetBytes(testStr);
                string desStr = encoding.GetString(desBytes);

                sb.Append(" Display Name: " + coif.DisplayName + "----Name: " + coif.Name + "----And The result is: " + desStr + "\n");
            }
            byte[] coByte = Encoding.GetEncoding("Unicode").GetBytes(sb.ToString());

            // FileMode.Create truncates an existing file before writing.
            using (FileStream fs = File.Open("c:\\code.txt", FileMode.Create, FileAccess.ReadWrite))
            {
                fs.Write(coByte, 0, coByte.Length);
            }
            Console.ReadKey();
        }
    }
}
Output:
Display Name: IBM EBCDIC (美国-加拿大)----Name: IBM037----And The result is: ??
Display Name: OEM 美国----Name: IBM437----And The result is: ??
Display Name: IBM EBCDIC (国际)----Name: IBM500----And The result is: ??
Display Name: 阿拉伯字符(ASMO-708)----Name: ASMO-708----And The result is: ??
Display Name: 阿拉伯字符(DOS)----Name: DOS-720----And The result is: ??
Display Name: 希腊字符(DOS)----Name: ibm737----And The result is: ??
Display Name: 波罗的海字符(DOS)----Name: ibm775----And The result is: ??
Display Name: 西欧字符(DOS)----Name: ibm850----And The result is: ??
Display Name: 中欧字符(DOS)----Name: ibm852----And The result is: ??
Display Name: OEM 西里尔语----Name: IBM855----And The result is: ??
Display Name: 土耳其字符(DOS)----Name: ibm857----And The result is: ??
Display Name: OEM 多语言拉丁语 I----Name: IBM00858----And The result is: ??
Display Name: 葡萄牙语(DOS)----Name: IBM860----And The result is: ??
Display Name: 冰岛语(DOS)----Name: ibm861----And The result is: ??
Display Name: 希伯来字符(DOS)----Name: DOS-862----And The result is: ??
Display Name: 加拿大法语(DOS)----Name: IBM863----And The result is: ??
Display Name: 阿拉伯字符(864)----Name: IBM864----And The result is: ??
Display Name: 北欧字符(DOS)----Name: IBM865----And The result is: ??
Display Name: 西里尔字符(DOS)----Name: cp866----And The result is: ??
Display Name: 现代希腊字符(DOS)----Name: ibm869----And The result is: ??
Display Name: IBM EBCDIC (多语言拉丁语 2)----Name: IBM870----And The result is: ??
Display Name: 泰语(Windows)----Name: windows-874----And The result is: ??
Display Name: IBM EBCDIC (现代希腊语)----Name: cp875----And The result is: ??
Display Name: 日语(Shift-JIS)----Name: shift_jis----And The result is: 天添
Display Name: 简体中文(GB2312)----Name: gb2312----And The result is: 天添
Display Name: 朝鲜语----Name: ks_c_5601-1987----And The result is: 天添
Display Name: 繁体中文(Big5)----Name: big5----And The result is: 天添
Display Name: IBM EBCDIC (土耳其拉丁语 5)----Name: IBM1026----And The result is: ??
Display Name: IBM 拉丁语 1----Name: IBM01047----And The result is: ??
Display Name: IBM EBCDIC (美国-加拿大-欧洲)----Name: IBM01140----And The result is: ??
Display Name: IBM EBCDIC (德国-欧洲)----Name: IBM01141----And The result is: ??
Display Name: IBM EBCDIC (丹麦-挪威-欧洲)----Name: IBM01142----And The result is: ??
Display Name: IBM EBCDIC (芬兰-瑞典-欧洲)----Name: IBM01143----And The result is: ??
Display Name: IBM EBCDIC (意大利-欧洲)----Name: IBM01144----And The result is: ??
Display Name: IBM EBCDIC (西班牙-欧洲)----Name: IBM01145----And The result is: ??
Display Name: IBM EBCDIC (英国-欧洲)----Name: IBM01146----And The result is: ??
Display Name: IBM EBCDIC (法国-欧洲)----Name: IBM01147----And The result is: ??
Display Name: IBM EBCDIC (国际-欧洲)----Name: IBM01148----And The result is: ??
Display Name: IBM EBCDIC (冰岛语-欧洲)----Name: IBM01149----And The result is: ??
Display Name: Unicode----Name: utf-16----And The result is: 天添
Display Name: Unicode (Big-Endian)----Name: utf-16BE----And The result is: 天添
Display Name: 中欧字符(Windows)----Name: windows-1250----And The result is: ??
Display Name: 西里尔字符(Windows)----Name: windows-1251----And The result is: ??
Display Name: 西欧字符(Windows)----Name: Windows-1252----And The result is: ??
Display Name: 希腊字符(Windows)----Name: windows-1253----And The result is: ??
Display Name: 土耳其字符(Windows)----Name: windows-1254----And The result is: ??
Display Name: 希伯来字符(Windows)----Name: windows-1255----And The result is: ??
Display Name: 阿拉伯字符(Windows)----Name: windows-1256----And The result is: ??
Display Name: 波罗的海字符(Windows)----Name: windows-1257----And The result is: ??
Display Name: 越南字符(Windows)----Name: windows-1258----And The result is: ??
Display Name: 朝鲜语(Johab)----Name: Johab----And The result is: 天添
Display Name: 西欧字符(Mac)----Name: macintosh----And The result is: ??
Display Name: 日语(Mac)----Name: x-mac-japanese----And The result is: 天添
Display Name: 繁体中文(Mac)----Name: x-mac-chinesetrad----And The result is: 天添
Display Name: 朝鲜语(Mac)----Name: x-mac-korean----And The result is: 天添
Display Name: 阿拉伯字符(Mac)----Name: x-mac-arabic----And The result is: ??
Display Name: 希伯来字符(Mac)----Name: x-mac-hebrew----And The result is: ??
Display Name: 希腊字符(Mac)----Name: x-mac-greek----And The result is: ??
Display Name: 西里尔字符(Mac)----Name: x-mac-cyrillic----And The result is: ??
Display Name: 简体中文(Mac)----Name: x-mac-chinesesimp----And The result is: 天添
Display Name: 罗马尼亚语(Mac)----Name: x-mac-romanian----And The result is: ??
Display Name: 乌克兰语(Mac)----Name: x-mac-ukrainian----And The result is: ??
Display Name: 泰语(Mac)----Name: x-mac-thai----And The result is: ??
Display Name: 中欧字符(Mac)----Name: x-mac-ce----And The result is: ??
Display Name: 冰岛语(Mac)----Name: x-mac-icelandic----And The result is: ??
Display Name: 土耳其字符(Mac)----Name: x-mac-turkish----And The result is: ??
Display Name: 克罗地亚语(Mac)----Name: x-mac-croatian----And The result is: ??
Display Name: Unicode (UTF-32)----Name: utf-32----And The result is: 天添
Display Name: Unicode (UTF-32 Big-Endian)----Name: utf-32BE----And The result is: 天添
Display Name: 繁体中文(CNS)----Name: x-Chinese-CNS----And The result is: 天添
Display Name: TCA 台湾----Name: x-cp20001----And The result is: 天添
Display Name: 繁体中文(Eten)----Name: x-Chinese-Eten----And The result is: 天添
Display Name: IBM5550 台湾----Name: x-cp20003----And The result is: 天添
Display Name: TeleText 台湾----Name: x-cp20004----And The result is: 天添
Display Name: Wang 台湾----Name: x-cp20005----And The result is: 天添
Display Name: 西欧字符(IA5)----Name: x-IA5----And The result is: ??
Display Name: 德语(IA5)----Name: x-IA5-German----And The result is: ??
Display Name: 瑞典语(IA5)----Name: x-IA5-Swedish----And The result is: ??
Display Name: 挪威语(IA5)----Name: x-IA5-Norwegian----And The result is: ??
Display Name: US-ASCII----Name: us-ascii----And The result is: ??
Display Name: T.61----Name: x-cp20261----And The result is: ??
Display Name: ISO-6937----Name: x-cp20269----And The result is: ??
Display Name: IBM EBCDIC (德国)----Name: IBM273----And The result is: ??
Display Name: IBM EBCDIC (丹麦-挪威)----Name: IBM277----And The result is: ??
Display Name: IBM EBCDIC (芬兰-瑞典)----Name: IBM278----And The result is: ??
Display Name: IBM EBCDIC (意大利)----Name: IBM280----And The result is: ??
Display Name: IBM EBCDIC (西班牙)----Name: IBM284----And The result is: ??
Display Name: IBM EBCDIC (UK)----Name: IBM285----And The result is: ??
Display Name: IBM EBCDIC (日语片假名)----Name: IBM290----And The result is: ??
Display Name: IBM EBCDIC (法国)----Name: IBM297----And The result is: ??
Display Name: IBM EBCDIC (阿拉伯语)----Name: IBM420----And The result is: ??
Display Name: IBM EBCDIC (希腊语)----Name: IBM423----And The result is: ??
Display Name: IBM EBCDIC (希伯来语)----Name: IBM424----And The result is: ??
Display Name: IBM EBCDIC (朝鲜语扩展)----Name: x-EBCDIC-KoreanExtended----And The result is: ??
Display Name: IBM EBCDIC (泰语)----Name: IBM-Thai----And The result is: ??
Display Name: 西里尔字符(KOI8-R)----Name: koi8-r----And The result is: ??
Display Name: IBM EBCDIC (冰岛语)----Name: IBM871----And The result is: ??
Display Name: IBM EBCDIC (西里尔俄语)----Name: IBM880----And The result is: ??
Display Name: IBM EBCDIC (土耳其语)----Name: IBM905----And The result is: ??
Display Name: IBM 拉丁语 1----Name: IBM00924----And The result is: ??
Display Name: 日语(JIS 0208-1990 和 0212-1990)----Name: EUC-JP----And The result is: 天添
Display Name: 简体中文(GB2312-80)----Name: x-cp20936----And The result is: 天添
Display Name: 朝鲜语 Wansung----Name: x-cp20949----And The result is: 天添
Display Name: IBM EBCDIC (西里尔塞尔维亚-保加利亚语)----Name: cp1025----And The result is: ??
Display Name: 西里尔字符(KOI8-U)----Name: koi8-u----And The result is: ??
Display Name: 西欧字符(ISO)----Name: iso-8859-1----And The result is: ??
Display Name: 中欧字符(ISO)----Name: iso-8859-2----And The result is: ??
Display Name: 拉丁语 3 (ISO)----Name: iso-8859-3----And The result is: ??
Display Name: 波罗的海字符(ISO)----Name: iso-8859-4----And The result is: ??
Display Name: 西里尔字符(ISO)----Name: iso-8859-5----And The result is: ??
Display Name: 阿拉伯字符(ISO)----Name: iso-8859-6----And The result is: ??
Display Name: 希腊字符(ISO)----Name: iso-8859-7----And The result is: ??
Display Name: 希伯来字符(ISO-Visual)----Name: iso-8859-8----And The result is: ??
Display Name: 土耳其字符(ISO)----Name: iso-8859-9----And The result is: ??
Display Name: 爱沙尼亚语(ISO)----Name: iso-8859-13----And The result is: ??
Display Name: 拉丁语 9 (ISO)----Name: iso-8859-15----And The result is: ??
Display Name: 欧罗巴----Name: x-Europa----And The result is: ??
Display Name: 希伯来字符(ISO-Logical)----Name: iso-8859-8-i----And The result is: ??
Display Name: 日语(JIS)----Name: iso-2022-jp----And The result is: 天添
Display Name: 日语(JIS-允许 1 字节假名)----Name: csISO2022JP----And The result is: 天添
Display Name: 日语(JIS-允许 1 字节假名 - SO/SI)----Name: iso-2022-jp----And The result is: 天添
Display Name: 朝鲜语(ISO)----Name: iso-2022-kr----And The result is: 天添
Display Name: 简体中文(ISO-2022)----Name: x-cp50227----And The result is: 天添
Display Name: 日语(EUC)----Name: euc-jp----And The result is: 天添
Display Name: 简体中文(EUC)----Name: EUC-CN----And The result is: 天添
Display Name: 朝鲜语(EUC)----Name: euc-kr----And The result is: 天添
Display Name: 简体中文(HZ)----Name: hz-gb-2312----And The result is: 天添
Display Name: 简体中文(GB18030)----Name: GB18030----And The result is: 天添
Display Name: ISCII 梵文----Name: x-iscii-de----And The result is: ??
Display Name: ISCII 孟加拉语----Name: x-iscii-be----And The result is: ??
Display Name: ISCII 泰米尔语----Name: x-iscii-ta----And The result is: ??
Display Name: ISCII 泰卢固语----Name: x-iscii-te----And The result is: ??
Display Name: ISCII 阿萨姆语----Name: x-iscii-as----And The result is: ??
Display Name: ISCII 奥里雅语----Name: x-iscii-or----And The result is: ??
Display Name: ISCII 卡纳达语----Name: x-iscii-ka----And The result is: ??
Display Name: ISCII 马拉雅拉姆语----Name: x-iscii-ma----And The result is: ??
Display Name: ISCII 古吉拉特语----Name: x-iscii-gu----And The result is: ??
Display Name: ISCII 旁遮普语----Name: x-iscii-pa----And The result is: ??
Display Name: Unicode (UTF-7)----Name: utf-7----And The result is: 天添
Display Name: Unicode (UTF-8)----Name: utf-8----And The result is: 天添
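Looking through the output, 34 of the encodings round-trip these two Chinese characters, and they include Japanese, Korean, and Taiwanese encodings, since those character sets also contain Han characters. Interesting. Note that the test only covers these two characters, so it does not prove that an encoding supports Chinese in general.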
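A similar check can be done in Java without a round trip, by asking each charset's encoder whether it can represent the text. This is a sketch under my own naming (`CanEncodeDemo`, `canRepresent`, `charsetsFor` are hypothetical); it relies on `Charset.canEncode()` and `CharsetEncoder.canEncode(CharSequence)` from `java.nio.charset`, and the list it prints will depend on the JVM it runs on.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class CanEncodeDemo {

    /** True if the charset has an encoder and that encoder can represent every character of the text. */
    static boolean canRepresent(Charset charset, String text) {
        // A few charsets are decode-only; newEncoder() would throw for them, so check first.
        return charset.canEncode() && charset.newEncoder().canEncode(text);
    }

    /** Names of all charsets in this JVM that can represent the text. */
    static List<String> charsetsFor(String text) {
        List<String> names = new ArrayList<>();
        for (Charset charset : Charset.availableCharsets().values()) {
            if (canRepresent(charset, text)) {
                names.add(charset.name());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String text = "\u5929\u6dfb"; // 天添
        List<String> names = charsetsFor(text);
        System.out.println(names.size() + " charsets can encode the text:");
        for (String name : names) {
            System.out.println(name);
        }
    }
}
```

Asking the encoder directly avoids mistaking a literal "??" in the input for a failed conversion, which the round-trip comparison cannot distinguish.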
Several encodings support Chinese, then, but are they really the same? Let's pick a few and look:
using System;
using System.Text;

namespace Text
{
    class Program
    {
        // Prints each byte of the array as a decimal value in square brackets.
        static void PrintBytes(string label, byte[] bytes)
        {
            Console.Write(label + " bytes: ");
            foreach (byte b in bytes)
            {
                Console.Write("[{0}]", b);
            }
            Console.WriteLine();
        }

        static void Main(string[] args)
        {
            string testStr = "天添";
            Console.WriteLine("Original string: " + testStr);

            Encoding ascii = new ASCIIEncoding();
            Encoding utf8 = new UTF8Encoding();
            // "gb2312" is the plain double-byte encoding; "hz-gb-2312" is the 7-bit HZ variant.
            Encoding gb2312 = Encoding.GetEncoding("gb2312");
            Encoding jis = Encoding.GetEncoding("iso-2022-jp");

            byte[] asciiBytes = ascii.GetBytes(testStr);
            byte[] utf8Bytes = utf8.GetBytes(testStr);
            byte[] gb2312Bytes = gb2312.GetBytes(testStr);
            byte[] jpBytes = jis.GetBytes(testStr);

            PrintBytes("ASCII", asciiBytes);
            PrintBytes("UTF-8", utf8Bytes);
            PrintBytes("GB2312", gb2312Bytes);
            PrintBytes("iso-2022-jp", jpBytes);

            // Decode each byte array with the same encoding that produced it.
            Console.WriteLine("ASCII result: " + ascii.GetString(asciiBytes));
            Console.WriteLine("UTF-8 result: " + utf8.GetString(utf8Bytes));
            Console.WriteLine("GB2312 result: " + gb2312.GetString(gb2312Bytes));
            Console.WriteLine("iso-2022-jp result: " + jis.GetString(jpBytes));
            Console.ReadKey();
        }
    }
}
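Result:
One thing stands out:
Even though UTF-8 and GB2312 both decode back to the original string, the byte arrays they produce along the way are different. That is easy to understand: they are different character encodings, so the same characters map to different bytes.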
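The same observation can be reproduced in Java with charsets that every JVM is required to provide. This is a small sketch; `EncodedBytes` and `unsignedBytes` are names chosen for the example, and GB2312 is left out only because it is not one of the guaranteed standard charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodedBytes {

    /** Encodes the text and returns the bytes as unsigned values (0..255). */
    static int[] unsignedBytes(String text, Charset charset) {
        byte[] raw = text.getBytes(charset);
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF; // Java bytes are signed; mask to match C#'s unsigned Byte
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "\u5929\u6dfb"; // 天添
        Charset[] charsets = {
            StandardCharsets.US_ASCII, StandardCharsets.UTF_8, StandardCharsets.UTF_16BE
        };
        for (Charset cs : charsets) {
            System.out.println(cs.name() + ": " + Arrays.toString(unsignedBytes(text, cs)));
        }
        // US-ASCII: [63, 63]                       (63 is '?', the replacement for unmappable characters)
        // UTF-8:    [229, 164, 169, 230, 183, 187] (three bytes per character)
        // UTF-16BE: [89, 41, 109, 251]             (two bytes per character)
    }
}
```

The two characters never change; only the bytes chosen to represent them do.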
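Next, the encoding classes that the .NET Framework provides
ASCIIEncoding and UTF8Encoding were used briefly above. Now let's try the other three (UnicodeEncoding, UTF32Encoding, and UTF7Encoding). While trying them I noticed a small difference, which comes from two known issues with Unicode's multi-byte encodings:
The NUL problem: in UTF-16 and UTF-32, ordinary characters contain zero bytes (for example, 'A' is 0x41 0x00 in little-endian UTF-16). C treats a NUL byte as the end of a string, whereas C# strings store their length, so C code cannot handle such data with its usual string functions. (I am not especially familiar with this one.)
The byte order problem: there are two ways to lay out a 16-bit integer in memory. One is little endian, where the low 8 bits come first; Intel's x86 CPUs are designed this way. The other is big endian, typified by Sun's SPARC CPUs. This causes trouble, so the choice matters: data written in one byte order will be misread on a CPU that assumes the other, unless the reader knows the order and swaps the bytes, which costs extra work.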
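The NUL problem from the list above can be made visible by counting zero bytes. This is a minimal sketch; `NulByteDemo` and `countZeroBytes` are names invented for the illustration.

```java
import java.nio.charset.StandardCharsets;

class NulByteDemo {

    /** Counts the 0x00 bytes in an encoded byte array. */
    static int countZeroBytes(byte[] bytes) {
        int zeros = 0;
        for (byte b : bytes) {
            if (b == 0) {
                zeros++;
            }
        }
        return zeros;
    }

    public static void main(String[] args) {
        String text = "ABC";

        // UTF-8 keeps ASCII characters as single non-zero bytes: 41 42 43.
        System.out.println(countZeroBytes(text.getBytes(StandardCharsets.UTF_8)));    // 0

        // UTF-16LE pads each ASCII character with a zero byte: 41 00 42 00 43 00.
        // A C routine that stops at the first NUL byte would see only "A".
        System.out.println(countZeroBytes(text.getBytes(StandardCharsets.UTF_16LE))); // 3
    }
}
```

This is one reason UTF-8 is convenient for code that still uses NUL-terminated strings.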
using System;
using System.Text;

namespace Text
{
    class Program
    {
        static void Main(string[] args)
        {
            string testStr = "天添";

            // First argument: true = big endian, false = little endian.
            // Second argument: whether GetPreamble() returns a byte order mark.
            UnicodeEncoding unicodingBigEnd = new UnicodeEncoding(true, true);
            UnicodeEncoding unicodingLittleEnd = new UnicodeEncoding(false, true);
            Console.WriteLine("Original string: " + testStr);

            byte[] unicodingBigEndBytes = unicodingBigEnd.GetBytes(testStr);
            Console.Write("Big-endian bytes: ");
            foreach (byte b in unicodingBigEndBytes)
            {
                Console.Write("[{0}]", b);
            }
            Console.WriteLine();

            byte[] unicodingLittleBytes = unicodingLittleEnd.GetBytes(testStr);
            Console.Write("Little-endian bytes: ");
            foreach (byte b in unicodingLittleBytes)
            {
                Console.Write("[{0}]", b);
            }
            Console.WriteLine();

            // Decode each array with the encoding of the matching byte order.
            string unicodeBigEnd = Encoding.GetEncoding("utf-16BE").GetString(unicodingBigEndBytes);
            string unicodeLittleEnd = Encoding.GetEncoding("utf-16").GetString(unicodingLittleBytes);

            Console.WriteLine("Big-endian result: " + unicodeBigEnd);
            Console.WriteLine("Little-endian result: " + unicodeLittleEnd);
            Console.ReadKey();
        }
    }
}
The result:
As expected, the bytes come out in a different order. UTF32Encoding has the same issue.
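For comparison, the byte order difference looks like this in Java. It is a sketch with hypothetical names (`ByteOrderDemo`, `hex`), and it also shows what happens when bytes are decoded with the wrong order.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteOrderDemo {

    /** Encodes the text and returns its bytes as space-separated hex. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String tian = "\u5929"; // 天, code point U+5929

        System.out.println(hex(tian, StandardCharsets.UTF_16BE)); // 59 29 (high byte first)
        System.out.println(hex(tian, StandardCharsets.UTF_16LE)); // 29 59 (low byte first)

        // Java's plain "UTF-16" encoder writes a byte order mark (FE FF) and then big-endian data,
        // so a reader can tell which order was used.
        System.out.println(hex(tian, StandardCharsets.UTF_16));   // FE FF 59 29

        // Decoding big-endian bytes as little-endian yields a different character (U+2959).
        String wrong = new String(tian.getBytes(StandardCharsets.UTF_16BE), StandardCharsets.UTF_16LE);
        System.out.println(wrong.equals(tian));                   // false
    }
}
```

The byte order mark is the usual way for a UTF-16 or UTF-32 file to declare its own byte order.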
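I seem to have said a lot and yet nothing at all, and what I did say is rather messy. Still, I feel I understand encodings a little better now. I am not sure whether my understanding is correct, so discussion is welcome.
How programming languages handle text data, the UCS approach versus the CSI approach, is a topic for another time.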