Converting a text document with special format to Pandas DataFrame
I have a text file with the following format:
1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345
I need to convert this text to a DataFrame with the following format:
Id Term weight
1 frack 0.733
1 shale 0.700
10 space 0.645
10 station 0.327
10 nasa 0.258
4 celebr 0.262
4 bahar 0.345
How can I do it?
python pandas
I can only think of regex helping here.
– amanb
10 hours ago

Depending on how large/long your file is, you can loop through the file without pandas to format it properly first.
– Quang Hoang
9 hours ago

It can be done with explode and split (see the sketch after these comments).
– Wen-Ben
9 hours ago

Also, when you read the text into pandas, what is the format of the df?
– Wen-Ben
9 hours ago

The data is in text format.
– Mary
9 hours ago
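A minimal sketch of the explode-and-split idea from the comment above (this assumes pandas >= 0.25, where Series/DataFrame explode was added; the sample text is taken from the question):

import io

import pandas as pd

text = """1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345"""

# Read "Id: rest" rows, turn `rest` into a list of "term weight" strings,
# explode to one string per row, then split into the two final columns.
df = pd.read_csv(io.StringIO(text), sep=":", header=None, names=["Id", "rest"])
df["rest"] = df["rest"].str.strip(" ,").str.split(",")
df = df.explode("rest").reset_index(drop=True)  # explode needs pandas >= 0.25
parts = df.pop("rest").str.strip().str.split(expand=True)
df["Term"] = parts[0]
df["weight"] = parts[1].astype(float)
print(df)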
8 Answers
Here's an optimized way to parse the file with re, first taking the ID and then parsing the data tuples. This takes advantage of the fact that file objects are iterable. When you iterate over an open file, you get the individual lines as strings, from which you can extract the meaningful data elements.

import re
import pandas as pd

SEP_RE = re.compile(r":\s+")
DATA_RE = re.compile(r"(?P<term>[a-z]+)\s+(?P<weight>\d+\.\d+)", re.I)

def parse(filepath: str):
    def _parse(filepath):
        with open(filepath) as f:
            for line in f:
                id, rest = SEP_RE.split(line, maxsplit=1)
                for match in DATA_RE.finditer(rest):
                    yield [int(id), match["term"], float(match["weight"])]

    return list(_parse(filepath))
Example:
>>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
... columns=["Id", "Term", "weight"])
>>>
>>> df
Id Term weight
0 1 frack 0.733
1 1 shale 0.700
2 10 space 0.645
3 10 station 0.327
4 10 nasa 0.258
5 4 celebr 0.262
6 4 bahar 0.345
>>> df.dtypes
Id int64
Term object
weight float64
dtype: object
Walkthrough

SEP_RE looks for an initial separator: a literal ":" followed by one or more whitespace characters. It uses maxsplit=1 to stop once the first split is found. Granted, this assumes your data is strictly formatted: that the format of your entire dataset consistently follows the example format laid out in your question.

After that, DATA_RE.finditer() deals with each (term, weight) pair extracted from rest. The string rest itself will look like "frack 0.733, shale 0.700,". .finditer() gives you multiple match objects, where you can use ["key"] notation to access the element from a given named capture group, such as (?P<term>[a-z]+).
An easy way to visualize this is to use an example line from your file as a string:

>>> line = "1: frack 0.733, shale 0.700,\n"
>>> SEP_RE.split(line, maxsplit=1)
['1', 'frack 0.733, shale 0.700,\n']
Now you have the initial ID and rest of the components, which you can unpack into two identifiers.
>>> id, rest = SEP_RE.split(line, maxsplit=1)
>>> it = DATA_RE.finditer(rest)
>>> match = next(it)
>>> match
<re.Match object; span=(0, 11), match='frack 0.733'>
>>> match["term"]
'frack'
>>> match["weight"]
'0.733'
The better way to visualize it is with pdb. Give it a try if you dare ;)
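For instance (a hypothetical snippet, not part of the original answer), you could drop a breakpoint inside the parsing loop and poke at each match interactively:

import pdb

for match in DATA_RE.finditer(rest):
    pdb.set_trace()  # at the prompt, inspect match, match["term"], match["weight"]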
Disclaimer

This is one of those questions that demands a particular type of solution that may not generalize well if you ease up restrictions on your data format.

For instance, it assumes that each Term can only take upper- or lowercase ASCII letters, nothing else. If you have other Unicode characters as identifiers, you would want to look into other re character classes such as \w.
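For example, a minimal variant (not from the original answer) that loosens the term pattern; in Python 3, \w matches Unicode word characters by default for str patterns:

import re

# \w covers Unicode letters, digits, and underscore, so non-ASCII terms match too.
DATA_RE_W = re.compile(r"(?P<term>\w+)\s+(?P<weight>\d+\.\d+)")
m = DATA_RE_W.search("café 0.812,")
print(m["term"], m["weight"])  # -> café 0.812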
Brilliant answer, I must say.
– amanb
9 hours ago
@amanb Thank you!
– Brad Solomon
9 hours ago
You can use the DataFrame constructor if you massage your input to the appropriate format. Here is one way:
import pandas as pd
from itertools import chain

text = """1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345 """

df = pd.DataFrame(
    list(
        chain.from_iterable(
            map(lambda z: (y[0], *z.strip().split()), y[1].split(",")) for y in
            map(lambda x: x.strip(" ,").split(":"), text.splitlines())
        )
    ),
    columns=["Id", "Term", "weight"]
)
print(df)
#   Id     Term weight
#0   1    frack  0.733
#1   1    shale  0.700
#2  10    space  0.645
#3  10  station  0.327
#4  10     nasa  0.258
#5   4   celebr  0.262
#6   4    bahar  0.345
Explanation

I assume that you've read your file into the string text. The first thing you want to do is strip leading/trailing commas and whitespace before splitting on ":":

print(list(map(lambda x: x.strip(" ,").split(":"), text.splitlines())))
#[['1', ' frack 0.733, shale 0.700'],
# ['10', ' space 0.645, station 0.327, nasa 0.258'],
# ['4', ' celebr 0.262, bahar 0.345']]

The next step is to split on the comma to separate the values, and assign the Id to each set of values:

print(
    [
        list(map(lambda z: (y[0], *z.strip().split()), y[1].split(","))) for y in
        map(lambda x: x.strip(" ,").split(":"), text.splitlines())
    ]
)
#[[('1', 'frack', '0.733'), ('1', 'shale', '0.700')],
# [('10', 'space', '0.645'),
#  ('10', 'station', '0.327'),
#  ('10', 'nasa', '0.258')],
# [('4', 'celebr', '0.262'), ('4', 'bahar', '0.345')]]
Finally, we use itertools.chain.from_iterable to flatten this output, which can then be passed straight to the DataFrame constructor.

Note: the * tuple unpacking is a Python 3 feature (extended unpacking in tuple displays, PEP 448, Python 3.5+).
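As a minimal illustration (not from the original answer) of that unpacking form:

pair = "frack 0.733"
row = ("1", *pair.split())  # extended unpacking inside a tuple display
print(row)                  # -> ('1', 'frack', '0.733')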
Assuming your data (a csv file) looks as given:

import pandas as pd
df = pd.read_csv('untitled.txt', sep=': ', header=None)
df.set_index(0, inplace=True)
# split the `,`
df = df[1].str.strip().str.split(',', expand=True)
# 0 1 2 3
#-- ------------ ------------- ---------- ---
# 1 frack 0.733 shale 0.700
#10 space 0.645 station 0.327 nasa 0.258
# 4 celebr 0.262 bahar 0.345
# stack and drop empty
df = df.stack()
df = df[~df.eq('')]
# split ' '
df = df.str.strip().str.split(' ', expand=True)
# edit to give final expected output:
# rename index and columns for reset_index
df.index.names = ['Id', 'to_drop']
df.columns = ['Term', 'weight']
# final df
final_df = df.reset_index().drop('to_drop', axis=1)
How do you not get an error from sep=': ', which is a 2-character separator?
– Rebin
9 hours ago

@Rebin add engine='python'
– pault
9 hours ago

@pault weird, 'cause I already split by ' '. It yields correct data on my computer.
– Quang Hoang
9 hours ago

I don't know how to add engine python? What is the command?
– Rebin
9 hours ago

@Rebin add it as a param to pd.read_csv: df = pd.read_csv(..., engine='python')
– pault
9 hours ago
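Consolidating the comment thread, a minimal sketch of that call (the filename is the one assumed in the answer):

import pandas as pd

# A separator longer than one character is interpreted as a regular expression
# and requires the Python parser engine, so pass it explicitly:
df = pd.read_csv('untitled.txt', sep=': ', header=None, engine='python')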
Just to put my two cents in: you could write yourself a parser (here with the third-party parsimonious package) and feed the result into pandas:
import pandas as pd
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

file = """1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345
"""

grammar = Grammar(
    r"""
    expr = line+
    line = id colon pair*
    pair = term ws weight sep? ws?

    id = ~"\d+"
    colon = ws? ":" ws?
    sep = ws? "," ws?
    term = ~"[a-zA-Z]+"
    weight = ~"\d+(?:\.\d+)?"
    ws = ~"\s+"
    """
)

tree = grammar.parse(file)

class PandasVisitor(NodeVisitor):
    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_pair(self, node, visited_children):
        term, _, weight, *_ = visited_children
        return (term.text, weight.text)

    def visit_line(self, node, visited_children):
        id, _, pairs = visited_children
        return [(id.text, *pair) for pair in pairs]

    def visit_expr(self, node, visited_children):
        return [item for lst in visited_children for item in lst]

pv = PandasVisitor()
result = pv.visit(tree)
df = pd.DataFrame(result, columns=["Id", "Term", "weight"])
print(df)
This yields
Id Term weight
0 1 frack 0.733
1 1 shale 0.700
2 10 space 0.645
3 10 station 0.327
4 10 nasa 0.258
5 4 celebr 0.262
6 4 bahar 0.345
Here is another take on your question: create a list that will contain an [Id, Term, weight] list for every id and term, then produce the dataframe.

import pandas as pd

file = r"give_your_path".replace('\\', '/')
my_list_of_lists = []  # an empty list which will contain [Id, Term, weight] lists

with open(file, "r+") as f:
    for line in f.readlines():  # loop over every line
        my_id = [line.split(":")[0]]  # store the Id in order to use it with every term
        # skip empty segments left by trailing commas
        for term in [s.strip().split(" ") for s in line[line.find(":") + 1:].split(",") if s.strip()]:
            my_list_of_lists.append(my_id + term)

df = pd.DataFrame.from_records(my_list_of_lists)  # turn the lists into a dataframe
df.columns = ["Id", "Term", "weight"]  # give the columns their names
It is possible to do it entirely with pandas:

from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO(u"""1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345 """), sep=":", header=None)
#df:
0 1
0 1 frack 0.733, shale 0.700,
1 10 space 0.645, station 0.327, nasa 0.258,
2 4 celebr 0.262, bahar 0.345
Turn column 1 into a list and then expand:

df[1] = df[1].str.split(",", expand=False)

dfs = []
for idx, rows in df.iterrows():
    dfslice = pd.DataFrame({"Id": [rows[0]] * len(rows[1]), "terms": rows[1]})
    dfs.append(dfslice)
newdf = pd.concat(dfs, ignore_index=True)
# this creates newdf:
Id terms
0 1 frack 0.733
1 1 shale 0.700
2 1
3 10 space 0.645
4 10 station 0.327
5 10 nasa 0.258
6 10
7 4 celebr 0.262
8 4 bahar 0.345
Now we need to str.split the remaining terms column and drop the empties:
newdf["terms"] = newdf["terms"].str.strip()
newdf = newdf.join(newdf["terms"].str.split(" ", expand=True))
newdf.columns = ["Id", "terms", "Term", "Weights"]
newdf = newdf.drop("terms", axis=1).dropna()
Resulting newdf:
Id Term Weights
0 1 frack 0.733
1 1 shale 0.700
3 10 space 0.645
4 10 station 0.327
5 10 nasa 0.258
7 4 celebr 0.262
8 4 bahar 0.345
Could I assume that there is just 1 space before 'TERM'?
import pandas as pd

df = pd.DataFrame(columns=['ID', 'Term', 'Weight'])

with open('C:/random/d1', 'r') as readObject:
    for line in readObject:
        line = line.rstrip('\n')
        tempList1 = line.split(':')
        tempList2 = tempList1[1]
        tempList2 = tempList2.rstrip(',')
        tempList2 = tempList2.split(',')
        for item in tempList2:
            e = item.strip().split(' ')  # strip the leading space before the term
            tempRow = [tempList1[0], e[0], e[1]]
            df.loc[len(df)] = tempRow

print(df)
1) You can read row by row.

2) Then you can separate by ':' for your index and ',' for the values.

1)

with open('path/filename.txt', 'r') as filename:
    content = filename.readlines()

2)

content = [x.split(':') for x in content]

This will give you the following result:

content = [
    ['1', ' frack 0.733, shale 0.700,\n'],
    ['10', ' space 0.645, station 0.327, nasa 0.258,\n'],
    ['4', ' celebr 0.262, bahar 0.345 ']]
Your result is not the result asked for in the question.
– GiraffeMan91
9 hours ago
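A possible continuation of this approach (a sketch, not part of the original answer), flattening content into the requested frame:

import pandas as pd

rows = []
for idx, rest in content:
    # strip the trailing comma/newline, then split each "term weight" pair
    for pair in rest.strip(" ,\n").split(","):
        term, weight = pair.strip().split()
        rows.append((int(idx), term, float(weight)))

df = pd.DataFrame(rows, columns=["Id", "Term", "weight"])
print(df)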