How to convert a text file with mixture of encodings to a single encoding?


I created a text file by copying its different parts from different
sources (webpages, other text files, pdf files) into gedit and
saving it to the file. I guess that is the reason that I have
multiple encodings in the text file, but I am not sure. How can I
avoid creating a text file with mixed encodings by copying its
different parts from different sources into gedit?

Whenever I open the file in gedit, gedit can always show or decode
every part of the text correctly. It seems that gedit can handle a
text file with mixed encodings, but I am not sure.

But when I open the file in emacs, there will be characters that
can't be shown correctly. (I am not sure why emacs can't do that.)
So I would like to convert the file from mixed encodings to a single
encoding such as utf-8.

Since I think gedit can detect the correct encodings for different
parts of the text file, and I don't know if there are other
applications that can do so, would it be possible to ask gedit to
convert the file to utf-8, or at least tell me what encoding it finds
for which part of the file?

Thanks.

asked Sep 27 '14 at 17:11 by Tim (edited Sep 27 '14 at 17:44)

When you click File > Save As, you should see two options at the bottom of the window: one for the character encoding and one for line endings.

– jeremija
Sep 27 '14 at 17:37

Is that the encoding which gedit used for opening the text file?

– Tim
Sep 27 '14 at 17:39

Most probably it is.

– jeremija
Sep 27 '14 at 17:41

Is that also the encoding which gedit guessed for the text file?

– Tim
Sep 27 '14 at 17:42

I guess so. When you open a file you can also choose an encoding to use, or you can let it auto detect the encoding.

– jeremija
Sep 27 '14 at 17:43

gedit emacs encoding


2 Answers

Hmmm... the concept of a file with various encodings is somewhat wobbly, to be honest. If you have a bit of time, this article (and this one) are worth reading.

For Linux, a file is a sequence of bytes. If you ask a program to interpret it as a text file, it will do so using a mapping between bytes and characters; this mapping is the encoding. Almost all the text editors I know (not word processors!) understand only the concept of one encoding per file.
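To make the "mapping between bytes and characters" concrete, here is a small sketch (not part of the original answer) that decodes one and the same byte under two different encodings with iconv. The helper name decode_as and the choice of the byte 0xA3 are only illustrative assumptions; the result is printed as a hex dump of the UTF-8 output so that it does not depend on the terminal.

```shell
# Illustrative helper (hypothetical name): interpret the single byte 0xA3
# (octal 243) in the encoding given as $1 and print the UTF-8 result as hex.
decode_as() {
    printf '\243' | iconv -f "$1" -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

decode_as ISO-8859-1; echo   # c2a3 = U+00A3, the pound sign
decode_as ISO-8859-2; echo   # c581 = U+0141, capital L with stroke
```

The byte on disk is identical in both calls; only the encoding chosen for reading it changes which character comes out.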

I am not an expert on gedit; maybe it is doing some magic like trying to autodetect the encoding line by line or text block by text block... If that is the case, you can try to do the same using enca(1):

 while IFS= read -r line; do printf '%s\n' "$line" | enconv -L none -x utf8; done < text.mixed > text.utf8

...but it depends on how good enca is at guessing your encoding (it works reasonably well with Eastern European encodings, but not with Latin1, for example).

(1) It's in the repos; just install it with sudo apt-get install enca.
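If the file happens to mix valid UTF-8 with a single legacy 8-bit encoding, a variant of the loop above can avoid guessing altogether: lines that already validate as UTF-8 are kept as they are, and only the remaining lines are converted. This is just a sketch under that assumption, using iconv instead of enca; the function name to_utf8 and its fallback-encoding argument are hypothetical, and it will not work for UTF-16 or other parts that are not ASCII-compatible.

```shell
# to_utf8 FALLBACK_ENCODING < mixed_file > utf8_file   (hypothetical helper)
# Keep lines that are already valid UTF-8; convert every other line from
# the fallback encoding, which the caller has to know or assume.
to_utf8() {
    fallback=$1
    while IFS= read -r line || [ -n "$line" ]; do
        # iconv exits with a non-zero status if the input is not valid UTF-8
        if printf '%s' "$line" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            printf '%s\n' "$line"
        else
            printf '%s\n' "$line" | iconv -f "$fallback" -t UTF-8
        fi
    done
}

# Example call: to_utf8 ISO-8859-1 < text.mixed > text.utf8
```

The trade-off is that nothing is detected here: a line that is not valid UTF-8 is simply assumed to be in the fallback encoding, so a wrong assumption yields wrong characters without any error.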

answered Sep 27 '14 at 17:53 by Rmano (edited Sep 27 '14 at 17:59)

(1) But I may be wrong. I get the impression of mixed encodings from using tools such as chardet to detect the mixed encodings for the file, and from failing to convert all of the text file to another encoding by specifying only one original encoding. (2) Generally, if you copy texts from different sources possibly with different encodings to a gedit window, and save it to a file, will gedit save it with just one encoding? Which encoding does gedit use for saving? Does that involve conversion from the sources' encodings to the file's encoding?

– Tim
Sep 27 '14 at 18:09

I do not know whether the copy/paste protocol can send the encoding together with the data, or whether all copy sources send it. I fear not, but I could be wrong. So there is probably some wrong guessing going on here. I know for sure that copy and paste between different encodings seldom works.

– Rmano
Sep 27 '14 at 22:10


I had the same problem and solved it with Emacs. The solution is quoted from here:

Another possible solution is to mark each region appearing with Chinese characters and recode it with M-x recode-region, giving "Text was really in" as utf-16-le and "But was interpreted as" as utf-16-be.
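The effect described in this quote can be reproduced outside Emacs, as a rough sketch: text that is really UTF-16LE but gets decoded as UTF-16BE turns into CJK-looking characters, and encoding it back with the wrong encoding followed by decoding with the right one restores it. The function names misread and fix_region are made up for this example; only iconv is assumed.

```shell
# "Hi" stored as UTF-16LE is the byte sequence 48 00 69 00.
# Decoding those bytes as UTF-16BE instead yields U+4800 and U+6900,
# which display as CJK ideographs.
misread() {
    printf 'H\000i\000' | iconv -f UTF-16BE -t UTF-8
}

# Undo the wrong decoding: encode back to the bytes that were misread
# (UTF-16BE), then decode them with the encoding the text was really in.
fix_region() {
    iconv -f UTF-8 -t UTF-16BE | iconv -f UTF-16LE -t UTF-8
}

misread | fix_region; echo   # Hi
```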

Another option is to split the file into the parts that have different encodings, copy them into separate files, convert the encoding of one, and append it to the other. In my case this worked with Atom, but not with Notepad++ (utf16-le/be).
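The split-and-rejoin approach can also be scripted when the position where the encoding changes is known. A minimal sketch, assuming the byte offset has been found by inspection (for example with a hex viewer): join_parts and its arguments are hypothetical names, and head -c and tail -c are assumed to be available, as they are in GNU coreutils.

```shell
# join_parts FILE OFFSET ENC1 ENC2   (hypothetical helper)
# Treat the first OFFSET bytes of FILE as ENC1 and the remainder as ENC2,
# and write both parts to standard output as UTF-8.
join_parts() {
    head -c "$2" "$1" | iconv -f "$3" -t UTF-8
    tail -c "+$(( $2 + 1 ))" "$1" | iconv -f "$4" -t UTF-8
}

# Example call: join_parts mixed.txt 4 UTF-16LE UTF-16BE > fixed.txt
```

Working with byte offsets rather than line numbers matters for UTF-16, where a newline is two bytes and line-oriented tools would cut in the wrong place.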

answered Jan 23 at 18:46 by giordano (edited 3 mins ago)
