How to convert a text file with mixture of encodings to a single encoding?


  1. I created a text file by copying parts from different sources
    (web pages, other text files, PDF files) into gedit and saving the
    result. I guess that is why the file contains multiple encodings,
    but I am not sure. How can I avoid creating a file with mixed
    encodings when I paste text from different sources into gedit?

  2. Whenever I open the file in gedit, every part of the text is shown
    (decoded) correctly. It seems that gedit can handle a text file
    with mixed encodings, but I am not sure.

    When I open the file in Emacs, however, some characters are not
    shown correctly. (I am not sure why Emacs can't do that.) So I
    would like to convert the file from mixed encodings to a single
    encoding such as UTF-8.

    Since I think gedit can detect the correct encoding for each part of
    the file, and I don't know of other applications that can, is it
    possible to ask gedit to convert the file to UTF-8, or at least to
    tell me which encoding it finds for which part of the file?

Thanks.










  • When you click File > Save As, you should see two options at the bottom of the window: one for character encoding and one for line endings.

    – jeremija
    Sep 27 '14 at 17:37

  • Is that the encoding which gedit used for opening the text file?

    – Tim
    Sep 27 '14 at 17:39

  • Most probably it is.

    – jeremija
    Sep 27 '14 at 17:41

  • Is that also the encoding which gedit guessed for the text file?

    – Tim
    Sep 27 '14 at 17:42

  • I guess so. When you open a file you can also choose an encoding to use, or you can let gedit auto-detect the encoding.

    – jeremija
    Sep 27 '14 at 17:43
















gedit emacs encoding






asked Sep 27 '14 at 17:11 by Tim; edited Sep 27 '14 at 17:44 by Tim













2 Answers
Hmm... the concept of a single file with several encodings is somewhat shaky, to be honest. If you have a bit of time, this article (and this one) are worth reading.

For Linux, a file is just a sequence of bytes. If you ask a program to interpret it as text, it does so using a mapping between bytes and characters; this mapping is the encoding. Almost all the text editors I know (not word processors!) understand only one encoding per file.
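To make the bytes-versus-characters point concrete, here is a small Python sketch. Python is only an illustration language here, not something the answer relies on, and the sample byte string is made up. It shows one byte sequence yielding different text (or an error) depending on which mapping is applied.

```python
# The same bytes, read through different mappings (encodings).
raw = b"caf\xc3\xa9"  # illustrative sample: "café" encoded as UTF-8

as_utf8 = raw.decode("utf-8")      # the mapping the bytes were written with
as_latin1 = raw.decode("latin-1")  # a wrong guess: every byte becomes one character

print(as_utf8)    # café
print(as_latin1)  # cafÃ©

# Some wrong guesses fail outright instead of producing garbled text.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)   # False
```

A program that assumes one encoding for the whole file therefore shows garbage, or refuses to decode, wherever the bytes were written with a different mapping.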



I am no expert on gedit; maybe it does some magic such as autodetecting the encoding line by line or block by block. If that is the case, you can try to do the same with enca (1):

    while IFS= read -r line; do printf '%s\n' "$line" | enconv -L none -x utf8; done < text.mixed > text.utf8

...but the result depends on how well enca guesses your encoding (it works fairly well with Eastern European encodings, but not with Latin1, for example).

(1) It's in the repos; just install it with sudo apt-get install enca.
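If enca guesses poorly for your text, the same line-by-line idea can be sketched in Python with a simpler rule: treat a line as UTF-8 if it decodes cleanly, otherwise fall back to one legacy encoding. This is a minimal sketch under assumptions, not a general detector. The function name `to_utf8_lines` is hypothetical, the `cp1252` fallback is a guess you would need to adapt, and the approach only suits ASCII-compatible encodings (it would not work for UTF-16 parts).

```python
def to_utf8_lines(data: bytes, fallback: str = "cp1252") -> bytes:
    """Re-encode each line as UTF-8, choosing the source encoding per line."""
    out = []
    # bytes.splitlines splits on \n, \r and \r\n; keepends preserves them.
    for line in data.splitlines(keepends=True):
        try:
            text = line.decode("utf-8")  # strict: raises on bytes that are not UTF-8
        except UnicodeDecodeError:
            # Assumed legacy encoding; undecodable bytes become U+FFFD.
            text = line.decode(fallback, errors="replace")
        out.append(text.encode("utf-8"))
    return b"".join(out)


# Illustrative input: one UTF-8 line followed by one Latin-1 line.
mixed = "na\u00efve\n".encode("utf-8") + "caf\u00e9\n".encode("latin-1")
fixed = to_utf8_lines(mixed)
print(fixed.decode("utf-8"))
```

For a real file you would read the input in binary mode (for example `open("text.mixed", "rb").read()`) and write the returned bytes to a new file, keeping the original until you have checked the result.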






  • (1) But I may be wrong. I got the impression of mixed encodings from using tools such as chardet to detect the encodings in the file, and from failing to convert the whole file to another encoding when specifying only one original encoding. (2) Generally, if you copy text from different sources, possibly with different encodings, into a gedit window and save it to a file, will gedit save it with just one encoding? Which encoding does gedit use for saving? Does that involve conversion from the sources' encodings to the file's encoding?

    – Tim
    Sep 27 '14 at 18:09

  • I do not know whether the copy/paste protocol can send the encoding together with the data, or whether all copy sources send it. I fear not, but I could be wrong. So probably some guessing goes wrong here. I know for sure that copy and paste between different encodings seldom works.

    – Rmano
    Sep 27 '14 at 22:10



















I had the same problem and solved it with Emacs. The solution is quoted from here:




Another possible solution is to mark each region that shows up as Chinese characters and recode it with M-x recode-region, giving utf-16-le for "Text was really in" and utf-16-be for "But was interpreted as".
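The round trip behind that fix can be illustrated in Python. This is a sketch of the idea as I understand it, not Emacs code: re-encode the wrongly decoded text with the encoding it was interpreted as, to recover the original bytes, then decode those bytes with the encoding the text was really in. The helper name `recode` and the sample string are assumptions for illustration.

```python
def recode(text: str, really_in: str, interpreted_as: str) -> str:
    """Undo a wrong decode: recover the bytes, then decode them correctly."""
    return text.encode(interpreted_as).decode(really_in)


original = "Hi"
# Simulate the wrong guess: UTF-16-LE bytes read as UTF-16-BE.
garbled = original.encode("utf-16-le").decode("utf-16-be")

print([hex(ord(c)) for c in garbled])             # ['0x4800', '0x6900']
print(recode(garbled, "utf-16-le", "utf-16-be"))  # Hi
```

The swapped byte order turns plain Latin letters into code points in the CJK ranges, which is why such regions appear as Chinese characters.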




Another one is to split the file into its two differently encoded parts, copy them into separate files, convert the encoding of one and append it to the other. In my case this worked with Atom, but not with Notepad++ (utf16-le/be).
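The split-convert-join approach can also be scripted. The Python sketch below assumes you already know the byte offset where the encoding changes and which encoding each part uses; nothing here detects that for you. The function name `join_as_utf8`, its parameters and the sample data are hypothetical.

```python
def join_as_utf8(data: bytes, split_at: int, first_enc: str, second_enc: str) -> bytes:
    """Decode two byte ranges with their own encodings and return one UTF-8 blob."""
    head = data[:split_at].decode(first_enc)
    tail = data[split_at:].decode(second_enc)
    return (head + tail).encode("utf-8")


# Illustrative input: a UTF-16-LE part followed by a UTF-16-BE part.
part_a = "abc\n".encode("utf-16-le")
part_b = "d\u00e9f\n".encode("utf-16-be")
mixed = part_a + part_b

result = join_as_utf8(mixed, len(part_a), "utf-16-le", "utf-16-be")
print(result.decode("utf-8"))
```

A file with more than two parts could be handled by applying the same step repeatedly, one boundary at a time.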






First answer: answered Sep 27 '14 at 17:53 by Rmano; edited Sep 27 '14 at 17:59













Second answer: answered Jan 23 at 18:46 by giordano; later edited





























