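How do I accurately read a UTF-8 encoded file into a String in Java?
When I change the encoding of this .java file to UTF-8 (Eclipse
Why should the encoding of the source file have any effect on creating a String from bytes? What is the foolproof way to create a String from bytes when the encoding is known? I may have files in different encodings. Once a file's encoding is known, I must be able to read it into a String, regardless of the value of file.encoding.
The contents of the UTF-8 file are as follows: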
English Hello World.
Korean 안녕하세요.
Japanese 世界こんにちは。
Russian Привет мир.
German Hallo Welt.
Spanish Hola mundo.
Hindi हैलो वर्ल्ड।
Gujarati હેલો વર્લ્ડ.
Thai สวัสดีชาวโลก.
-End of file-
The code is below. My observations are in the comments inside it.
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class App {
    public static void main(String[] args) {
        String slash = System.getProperty("file.separator");
        File inputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text.txt");
        File outputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_out.txt");
        File outputUtfByteWrittenFile = new File(
                "C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_byteout.txt");
        outputUtfFile.delete();
        outputUtfByteWrittenFile.delete();
        try {
            /*
             * read a utf8 text file with internationalized strings into bytes.
             * there should be no information loss here, when read into raw bytes.
             * We are sure that this file is UTF-8 encoded.
             * Input file created using Notepad++. Text copied from Google translate.
             */
            byte[] fileBytes = readBytes(inputUtfFile);
            /*
             * Create a string from these bytes. Specify that the bytes are UTF-8 bytes.
             */
            String str = new String(fileBytes, StandardCharsets.UTF_8);
            /*
             * The console is incapable of displaying this string.
             * So we write into another file. Open in notepad++ to check.
             */
            ArrayList<String> list = new ArrayList<>();
            list.add(str);
            writeLines(list, outputUtfFile);
            /*
             * Works fine when I read bytes and write bytes.
             * Open the other output file in notepad++ and check.
             */
            writeBytes(fileBytes, outputUtfByteWrittenFile);
            /*
             * I am using JDK 8u60.
             * I tried running this on command line instead of eclipse. Does not work.
             * I tried using apache commons io library. Does not work.
             *
             * This means that new String(bytes, charset); does not work correctly.
             * There is no real effect of specifying charset to string.
             */
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    public static void writeLines(List<String> lines, File file) throws IOException {
        BufferedWriter writer = null;
        OutputStreamWriter osw = null;
        OutputStream fos = null;
        try {
            fos = new FileOutputStream(file);
            osw = new OutputStreamWriter(fos);
            writer = new BufferedWriter(osw);
            String lineSeparator = System.getProperty("line.separator");
            for (int i = 0; i < lines.size(); i++) {
                String line = lines.get(i);
                writer.write(line);
                if (i < lines.size() - 1) {
                    writer.write(lineSeparator);
                }
            }
        } catch (IOException e) {
            throw e;
        } finally {
            close(writer);
            close(osw);
            close(fos);
        }
    }
    public static byte[] readBytes(File file) {
        FileInputStream fis = null;
        byte[] b = null;
        try {
            fis = new FileInputStream(file);
            b = readBytesFromStream(fis);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            close(fis);
        }
        return b;
    }
    public static void writeBytes(byte[] inBytes, File file) {
        FileOutputStream fos = null;
        try {
            fos = new FileOutputStream(file);
            writeBytesToStream(inBytes, fos);
            fos.flush();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            close(fos);
        }
    }
    public static void close(InputStream inStream) {
        // Null check: if opening the stream failed, the finally block passes null here.
        if (inStream != null) {
            try {
                inStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    public static void close(OutputStream outStream) {
        // Null check: if opening the stream failed, the finally block passes null here.
        if (outStream != null) {
            try {
                outStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    public static void close(Writer writer) {
        if (writer != null) {
            try {
                writer.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    public static long copy(InputStream readStream, OutputStream writeStream) throws IOException {
        int bytesread = -1;
        byte[] b = new byte[4096]; //4096 is default cluster size in Windows for < 2TB NTFS partitions
        long count = 0;
        bytesread = readStream.read(b);
        while (bytesread != -1) {
            writeStream.write(b, 0, bytesread);
            count += bytesread;
            bytesread = readStream.read(b);
        }
        return count;
    }
    public static byte[] readBytesFromStream(InputStream readStream) throws IOException {
        ByteArrayOutputStream writeStream = null;
        byte[] byteArr = null;
        writeStream = new ByteArrayOutputStream();
        try {
            copy(readStream, writeStream);
            writeStream.flush();
            byteArr = writeStream.toByteArray();
        } finally {
            close(writeStream);
        }
        return byteArr;
    }
    public static void writeBytesToStream(byte[] inBytes, OutputStream writeStream) throws IOException {
        ByteArrayInputStream bis = null;
        bis = new ByteArrayInputStream(inBytes);
        try {
            copy(bis, writeStream);
        } finally {
            close(bis);
        }
    }
}
Edit: for @JB Nizet and everyone :)
//writeLines(list, outputUtfFile, StandardCharsets.UTF_16BE); //does not work
//writeLines(list, outputUtfFile, Charset.defaultCharset()); //does not work.
writeLines(list, outputUtfFile, StandardCharsets.UTF_16LE); //works
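When I read bytes into a String, I need to specify the byte encoding. When I write the bytes of a String to a file, I need to specify the byte encoding.
Once I have a String in the JVM, I don't need to remember the source byte encoding, right?
When I write to a file, it should convert the String to my machine's default charset (be it UTF-8, ASCII or cp1252). That fails. UTF-16BE fails too. Why do some charsets fail?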
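The encoding of the Java source file is indeed irrelevant. And the reading part of your code is correct (although inefficient). What is incorrect is the writing part: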
osw = new OutputStreamWriter(fos);
should be changed to
osw = new OutputStreamWriter(fos, StandardCharsets.UTF_8);
Otherwise, you are using the default encoding (which doesn't seem to be UTF-8 on your system) instead of UTF-8.
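For comparison, here is a minimal sketch of the same read and write using java.nio.file.Files (Java 7+), assuming the whole file fits in memory. The class name Utf8RoundTrip, its two helper methods and the temporary file are hypothetical and only there for illustration. The charset is passed explicitly in both directions, so the platform default is never consulted.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; the names are illustrative, not from the question.
final class Utf8RoundTrip {

    // Decodes the whole file as UTF-8, whatever file.encoding happens to be.
    static String readUtf8(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    // Encodes the text as UTF-8, whatever file.encoding happens to be.
    static void writeUtf8(Path path, String text) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Russian and Thai greetings, written as Unicode escapes so that the
        // encoding used to compile this source file cannot affect the literal.
        String text = "Hello \u041F\u0440\u0438\u0432\u0435\u0442 \u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35";
        Path tmp = Files.createTempFile("utf8text", ".txt");
        try {
            writeUtf8(tmp, text);
            System.out.println(text.equals(readUtf8(tmp))); // prints "true"
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The try-with-resources block closes the writer even if the write fails, which replaces the hand-written close helpers.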
Note that Java allows forward slashes in file paths, even on Windows:
File inputUtfFile = new File("C:/sources/TestUtfRead/utf8text.txt");
EDIT:
"Once I have a String in the JVM, I don't need to remember the source byte encoding, right?"
Yes, you're right.
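A small sketch to illustrate this: the same text, encoded as two different byte sequences, decodes to equal Strings as long as each byte array is decoded with the charset that produced it. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
final class DecodedStringsAreEqual {

    // A Russian greeting, written as Unicode escapes so the example does not
    // depend on the encoding of this source file.
    static final String TEXT = "\u041F\u0440\u0438\u0432\u0435\u0442";

    // Returns true when the two byte arrays differ but the decoded Strings are equal.
    static boolean sameAfterDecoding() {
        byte[] utf8 = TEXT.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = TEXT.getBytes(StandardCharsets.UTF_16LE);
        String fromUtf8 = new String(utf8, StandardCharsets.UTF_8);
        String fromUtf16 = new String(utf16, StandardCharsets.UTF_16LE);
        return !Arrays.equals(utf8, utf16) && fromUtf8.equals(fromUtf16);
    }

    public static void main(String[] args) {
        System.out.println(sameAfterDecoding()); // prints "true"
    }
}
```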
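"When I write to a file, it should convert the String to my machine's default charset (be it UTF-8, ASCII or cp1252). That fails."
If you don't specify any encoding, Java does indeed use the platform default encoding to convert characters to bytes. If you specify an encoding (as suggested at the beginning of this answer), it uses the encoding you tell it to use.
But not every encoding can represent all Unicode characters the way UTF-8 can. ASCII, for example, supports only 128 different characters. Cp1252, AFAIK, supports only 256 characters. So the encoding succeeds, but it replaces each unencodable character with a special character (I don't remember which one) that means: I can't encode this Thai or Russian character, because it isn't part of the character set I support.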
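A hedged sketch of that behavior: String.getBytes(Charset) is documented to substitute the charset's default replacement bytes for unmappable characters, and for the two charsets used here that replacement is, as far as I know, a question mark. ISO-8859-1 stands in for a single-byte charset like Cp1252, because Cp1252 is not one of the charsets every JVM is required to provide. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class UnencodableDemo {

    // A Russian greeting, written as Unicode escapes so the example does not
    // depend on the encoding of this source file.
    static final String RUSSIAN = "\u041F\u0440\u0438\u0432\u0435\u0442";

    // Encodes and then decodes with the same charset. Charsets that cannot
    // represent a character silently substitute their replacement bytes.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // Asks the charset up front whether it can represent the whole text.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(canEncode(RUSSIAN, StandardCharsets.US_ASCII));              // false
        System.out.println(roundTrip(RUSSIAN, StandardCharsets.US_ASCII));              // ??????
        System.out.println(roundTrip(RUSSIAN, StandardCharsets.ISO_8859_1));            // ??????
        System.out.println(RUSSIAN.equals(roundTrip(RUSSIAN, StandardCharsets.UTF_8))); // true
    }
}
```

Checking canEncode before writing is one way to detect the loss instead of discovering question marks in the output file.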
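The UTF-16 encoding should be fine. But make sure to configure your text editor to use UTF-16 when reading and displaying the file's content. If it is configured to use another encoding, the displayed content won't be correct.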
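One detail that may matter here, and this is my assumption rather than something verified against your editor: Java's UTF_16 charset writes a big-endian byte order mark when encoding, while UTF_16BE and UTF_16LE write none, so an editor has to guess or be told the byte order for the latter two. The sketch below only prints the bytes each charset produces for the letter "A"; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class Utf16BomDemo {

    // Formats bytes as space-separated, two-digit uppercase hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
    }
}
```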