Converting a text document with special format to Pandas DataFrame

I have a text file with the following format:

1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345

I need to convert this text to a DataFrame with the following format:

Id Term weight
1 frack 0.733
1 shale 0.700
10 space 0.645
10 station 0.327
10 nasa 0.258
4 celebr 0.262
4 bahar 0.345

How can I do it?

asked Apr 22 at 19:10 by Mary (edited Apr 22 at 21:39 by Brad Solomon)

I can only think of regex helping here.

– amanb
Apr 22 at 19:13

Depending on how large/long your file is, you can loop through the file without pandas to format it properly first.

– Quang Hoang
Apr 22 at 19:20

It can be done with explode and split

– Wen-Ben
Apr 22 at 19:24

Also, when you read the text into pandas, what is the format of the df?

– Wen-Ben
Apr 22 at 19:25

The data is in text format.

– Mary
Apr 22 at 19:26

python pandas

8 Answers

Here's an optimized way to parse the file with re, first taking the ID and then parsing the data tuples. This takes advantage of the fact that file objects are iterable. When you iterate over an open file, you get the individual lines as strings, from which you can extract the meaningful data elements.

import re
import pandas as pd

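# a literal colon followed by one or more whitespace characters
SEP_RE = re.compile(r":\s+")
# a term made of ASCII letters, whitespace, then a decimal weight
DATA_RE = re.compile(r"(?P<term>[a-z]+)\s+(?P<weight>\d+\.\d+)", re.I)


def parse(filepath: str):
    def _parse(filepath):
        with open(filepath) as f:
            for line in f:
                # split off the ID once, then scan the rest for (term, weight) pairs
                id, rest = SEP_RE.split(line, maxsplit=1)
                for match in DATA_RE.finditer(rest):
                    yield [int(id), match["term"], float(match["weight"])]
    return list(_parse(filepath))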

Example:

>>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
...                   columns=["Id", "Term", "weight"])
>>>
>>> df
   Id     Term  weight
0   1    frack   0.733
1   1    shale   0.700
2  10    space   0.645
3  10  station   0.327
4  10     nasa   0.258
5   4   celebr   0.262
6   4    bahar   0.345

>>> df.dtypes
Id          int64
Term       object
weight    float64
dtype: object

Walkthrough

SEP_RE looks for an initial separator: a literal : followed by one or more whitespace characters. The call to .split() uses maxsplit=1 to stop once the first split is found. Granted, this assumes your data is strictly formatted: that your entire dataset consistently follows the example format laid out in your question.

After that, DATA_RE.finditer() deals with each (term, weight) pair extracted from rest. The string rest itself will look like frack 0.733, shale 0.700,. .finditer() gives you multiple match objects, where you can use ["key"] notation to access the element from a given named capture group, such as (?P<term>[a-z]+).

An easy way to visualize this is to use an example line from your file as a string:

>>> line = "1: frack 0.733, shale 0.700,\n"
>>> SEP_RE.split(line, maxsplit=1)
['1', 'frack 0.733, shale 0.700,\n']

Now you have the initial ID and rest of the components, which you can unpack into two identifiers.

>>> id, rest = SEP_RE.split(line, maxsplit=1)
>>> it = DATA_RE.finditer(rest)
>>> match = next(it)
>>> match
<re.Match object; span=(0, 11), match='frack 0.733'>
>>> match["term"]
'frack'
>>> match["weight"]
'0.733'

The better way to visualize it is with pdb. Give it a try if you dare ;)

Disclaimer

This is one of those questions that demands a particular type of solution that may not generalize well if you ease up restrictions on your data format.

For instance, it assumes that each Term can only contain upper or lowercase ASCII letters, nothing else. If you have other Unicode characters as identifiers, you would want to look into other re character classes such as \w.
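As a rough sketch of that last point, and assuming terms may contain any Unicode letters, the pattern below swaps [a-z] for [^\W\d_]. The names UNICODE_DATA_RE and parse_pairs and the sample terms are made up for illustration; they are not part of the answer above.

```python
import re

# Hypothetical variant of DATA_RE. In Python 3 str patterns, \w is
# Unicode-aware; [^\W\d_] keeps word characters that are neither digits
# nor underscores, which is roughly "any letter".
UNICODE_DATA_RE = re.compile(r"(?P<term>[^\W\d_]+)\s+(?P<weight>\d+\.\d+)")


def parse_pairs(rest):
    """Return (term, weight) tuples found in the part of a line after the ID."""
    return [(m["term"], float(m["weight"])) for m in UNICODE_DATA_RE.finditer(rest)]


pairs = parse_pairs("straße 0.5, нефть 0.25, shale 0.700,\n")
print(pairs)
# [('straße', 0.5), ('нефть', 0.25), ('shale', 0.7)]
```

Plain \w+ would also work, but it additionally accepts digits and underscores inside a term, which may or may not be what you want.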

answered Apr 22 at 19:35 by Brad Solomon (edited Apr 23 at 2:00)

Brilliant answer, I must say.

– amanb
Apr 22 at 19:42

@amanb Thank you!

– Brad Solomon
Apr 22 at 19:45

You can use the DataFrame constructor if you massage your input to the appropriate format. Here is one way:

import pandas as pd
from itertools import chain

text="""1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 """

df = pd.DataFrame(
    list(
        chain.from_iterable(
            map(lambda z: (y[0], *z.strip().split()), y[1].split(",")) for y in
            map(lambda x: x.strip(" ,").split(":"), text.splitlines())
        )
    ),
    columns=["Id", "Term", "weight"]
)

print(df)
#   Id     Term weight
#0   1    frack  0.733
#1   1    shale  0.700
#2  10    space  0.645
#3  10  station  0.327
#4  10     nasa  0.258
#5   4   celebr  0.262
#6   4    bahar  0.345

Explanation

I assume that you've read your file into the string text. The first thing you want to do is strip leading/trailing commas and whitespace before splitting on :

print(list(map(lambda x: x.strip(" ,").split(":"), text.splitlines())))
#[['1', ' frack 0.733, shale 0.700'], 
# ['10', ' space 0.645, station 0.327, nasa 0.258'], 
# ['4', ' celebr 0.262, bahar 0.345']]

The next step is to split on the comma to separate the values, and assign the Id to each set of values:

print(
    [
        list(map(lambda z: (y[0], *z.strip().split()), y[1].split(","))) for y in
        map(lambda x: x.strip(" ,").split(":"), text.splitlines())
    ]
)
#[[('1', 'frack', '0.733'), ('1', 'shale', '0.700')],
# [('10', 'space', '0.645'),
#  ('10', 'station', '0.327'),
#  ('10', 'nasa', '0.258')],
# [('4', 'celebr', '0.262'), ('4', 'bahar', '0.345')]]

Finally, we use itertools.chain.from_iterable to flatten this output, which can then be passed straight to the DataFrame constructor.

Note: The * unpacking inside a tuple literal requires Python 3.5 or later.
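If the nested map/lambda calls are hard to follow, the same split-then-flatten idea can be sketched with a small helper function. This is only an illustration: rows_for_line is a made-up name, and unlike the code above it also converts Id and weight to numbers.

```python
from itertools import chain

text = """1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 """


def rows_for_line(line):
    """Turn one 'Id: term weight, ...' line into a list of (Id, Term, weight) tuples."""
    id_part, _, rest = line.partition(":")
    # Skip the empty chunk that a trailing comma leaves behind.
    pairs = (chunk.split() for chunk in rest.split(",") if chunk.strip())
    return [(int(id_part), term, float(weight)) for term, weight in pairs]


# chain.from_iterable flattens the per-line lists into one list of rows.
rows = list(chain.from_iterable(rows_for_line(line) for line in text.splitlines()))
print(rows[:3])
# [(1, 'frack', 0.733), (1, 'shale', 0.7), (10, 'space', 0.645)]
```

The resulting rows can be passed to pd.DataFrame(rows, columns=["Id", "Term", "weight"]) in the same way as before.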

answered Apr 22 at 19:39 by pault (edited Apr 22 at 19:44)

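Assuming your data file looks like the sample given:

import pandas as pd

# a multi-character sep is treated as a regex, so pandas falls back to the
# python engine (with a ParserWarning unless engine='python' is passed)
df = pd.read_csv('untitled.txt', sep=': ', header=None)
df.set_index(0, inplace=True)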

# split the `,`
df = df[1].str.strip().str.split(',', expand=True)

#    0             1              2           3
#--  ------------  -------------  ----------  ---
# 1  frack 0.733   shale 0.700
#10  space 0.645   station 0.327  nasa 0.258
# 4  celebr 0.262  bahar 0.345

# stack and drop empty
df = df.stack()
df = df[~df.eq('')]

# split ' '
df = df.str.strip().str.split(' ', expand=True)

# edit to give final expected output:

# rename index and columns for reset_index
df.index.names = ['Id', 'to_drop']
df.columns = ['Term', 'weight']

# final df
final_df = df.reset_index().drop('to_drop', axis=1)
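One of the comments on the question suggests explode and split. A minimal sketch of that route, assuming pandas 0.25 or later (where explode is available) and reading the sample from a string rather than a file, could look like this. The variable names are illustrative only.

```python
import pandas as pd

text = """1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 """

# Split each line once, on the first colon, into the Id and the remaining text.
parts = pd.Series(text.splitlines()).str.split(":", n=1, expand=True)
parts.columns = ["Id", "pairs"]

# Build one list of "term weight" strings per row, then one row per list element.
parts["pairs"] = parts["pairs"].str.strip(" ,").str.split(",")
exploded = parts.explode("pairs").reset_index(drop=True)

# Split "term weight" on whitespace into two columns and convert the numeric ones.
exploded[["Term", "weight"]] = exploded["pairs"].str.split(expand=True)
final_df = exploded.drop(columns="pairs").astype({"Id": int, "weight": float})
print(final_df)
```

Stripping " ," before the split avoids the empty trailing element, so no separate "drop empty" step is needed here.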

answered Apr 22 at 19:43 by Quang Hoang (edited Apr 22 at 19:57)

How do you not get an error from sep=': ', which is a two-character separator?

– Rebin
Apr 22 at 19:55

@Rebin add engine='python'

– pault
Apr 22 at 19:58

@pault weird, 'cause I already split by ' '. It yields correct data on my computer.

– Quang Hoang
Apr 22 at 20:02

I don't know how to add engine='python'. What is the command?

– Rebin
Apr 22 at 20:02

@Rebin add it as a param to pd.read_csv - df = pd.read_csv(..., engine='python')

– pault
Apr 22 at 20:04

Just to put my two cents in: you could write yourself a parser and feed the result into pandas:

import pandas as pd
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

file = """
1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 
"""

grammar = Grammar(
    r"""
    expr    = (garbage / line)+

    line    = id colon pair*
    pair    = term ws weight sep? ws?
    garbage = ws+

    id      = ~"\d+"
    colon   = ws? ":" ws?
    sep     = ws? "," ws?

    term    = ~"[a-zA-Z]+"
    weight  = ~"\d+(?:\.\d+)?"

    ws      = ~"\s+"
    """
)

tree = grammar.parse(file)

class PandasVisitor(NodeVisitor):
    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_pair(self, node, visited_children):
        term, _, weight, *_ = visited_children
        return (term.text, weight.text)

    def visit_line(self, node, visited_children):
        id, _, pairs = visited_children
        return [(id.text, *pair) for pair in pairs]

    def visit_garbage(self, node, visited_children):
        return None

    def visit_expr(self, node, visited_children):
        return [item
                for lst in visited_children
                for sublst in lst if sublst
                for item in sublst]

pv = PandasVisitor()
out = pv.visit(tree)

df = pd.DataFrame(out, columns=["Id", "Term", "weight"])
print(df)

This yields

   Id     Term weight
0   1    frack  0.733
1   1    shale  0.700
2  10    space  0.645
3  10  station  0.327
4  10     nasa  0.258
5   4   celebr  0.262
6   4    bahar  0.345

Here, we are building a grammar with the possible information: either a line or whitespace. The line is built of an id (e.g. 1), followed by a colon (:), whitespace, and pairs of term and weight, each possibly followed by a separator.

Afterwards, we need a NodeVisitor class to actually do something with the retrieved AST.

answered Apr 22 at 20:29 by Jan (edited 2 days ago)

Here is another take on your question: create a list that will contain one list for every Id and term, and then produce the DataFrame from it.

import pandas as pd

file = r"give_your_path".replace('\\', '/')
my_list_of_lists = []  # will contain lists of [Id, Term, weight]
with open(file, "r") as f:
    for line in f:  # loop over every line
        my_id = [line.split(":")[0]]  # store the Id in order to use it for every term
        # take everything after the first colon and split it into "term weight" chunks
        for chunk in line[line.find(":") + 1:].split(","):
            term = chunk.split()
            if term:  # skip the empty chunk left by a trailing comma
                my_list_of_lists.append(my_id + term)
df = pd.DataFrame.from_records(my_list_of_lists)  # turn the lists into a DataFrame
df.columns = ["Id", "Term", "weight"]  # give the columns their names

answered Apr 22 at 19:55 by JoPapou13

It is possible to do this entirely in pandas:

from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO(u"""1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 """), sep=":", header=None)

#df:
    0                                            1
0   1                   frack 0.733, shale 0.700,
1  10    space 0.645, station 0.327, nasa 0.258,
2   4                   celebr 0.262, bahar 0.345

Turn the column 1 into a list and then expand:

df[1] = df[1].str.split(",", expand=False)

dfs = []
for idx, rows in df.iterrows():
    # repeat the Id once for every term in this row's list
    dfslice = pd.DataFrame({"Id": [rows[0]] * len(rows[1]), "terms": rows[1]})
    dfs.append(dfslice)
newdf = pd.concat(dfs, ignore_index=True)

# this creates newdf:
   Id           terms
0   1     frack 0.733
1   1     shale 0.700
2   1
3  10     space 0.645
4  10   station 0.327
5  10      nasa 0.258
6  10
7   4    celebr 0.262
8   4     bahar 0.345

Now we need to split the terms column on the space and drop the empty rows:

newdf["terms"] = newdf["terms"].str.strip()
newdf = newdf.join(newdf["terms"].str.split(" ", expand=True))
newdf.columns = ["Id", "terms", "Term", "Weights"]
newdf = newdf.drop("terms", axis=1).dropna()

Resulting newdf:

   Id     Term Weights
0   1    frack   0.733
1   1    shale   0.700
3  10    space   0.645
4  10  station   0.327
5  10     nasa   0.258
7   4   celebr   0.262
8   4    bahar   0.345
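Note that Weights still holds strings at this point and the index has gaps where rows were dropped. A small follow-up sketch, using a hand-built stand-in for newdf so that it runs on its own, might tidy both up:

```python
import pandas as pd

# Hypothetical stand-in for the string-typed result above, including an index gap.
newdf = pd.DataFrame(
    {
        "Id": [1, 1, 10],
        "Term": ["frack", "shale", "space"],
        "Weights": ["0.733", "0.700", "0.645"],
    },
    index=[0, 1, 3],
)

# Parse the weight strings as floats; by default to_numeric raises on bad input.
newdf["Weights"] = pd.to_numeric(newdf["Weights"])
# Renumber the rows so the index is contiguous again.
newdf = newdf.reset_index(drop=True)
print(newdf.dtypes)
```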

answered Apr 22 at 19:58 by Rocky Li

Could I assume that there is just 1 space before 'TERM'?

import pandas as pd

df = pd.DataFrame(columns=['ID', 'Term', 'Weight'])
with open('C:/random/d1', 'r') as readObject:
    for line in readObject:
        line = line.rstrip('\n')
        tempList1 = line.split(':')
        tempList2 = tempList1[1]
        # drop the trailing comma and any spaces around it
        tempList2 = tempList2.rstrip(', ')
        tempList2 = tempList2.split(',')
        for item in tempList2:
            # split() with no argument ignores leading and repeated spaces
            e = item.split()
            tempRow = [tempList1[0], e[0], e[1]]
            df.loc[len(df)] = tempRow
print(df)

answered Apr 22 at 20:04 by Rebin

-3

1) You can read the file row by row.

2) Then you can split on ':' for your index and on ',' for the values.

with open('path/filename.txt', 'r') as filename:
    content = filename.readlines()

content = [x.split(':') for x in content]

This will give you roughly the following result (surrounding whitespace omitted):

content = [
    ['1', 'frack 0.733, shale 0.700,'],
    ['10', 'space 0.645, station 0.327, nasa 0.258,'],
    ['4', 'celebr 0.262, bahar 0.345 ']]

answered Apr 22 at 19:30 by CedricLy

Your result is not the result asked for in the question.

– GiraffeMan91
Apr 22 at 19:31

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55799784%2fconverting-a-text-document-with-special-format-to-pandas-dataframe%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

8 Answers
8

active

oldest

votes

8 Answers
8

active

oldest

votes

import re
import pandas as pd

SEP_RE = re.compile(r":s+")
DATA_RE = re.compile(r"(?P<term>[a-z]+)s+(?P<weight>d+.d+)", re.I)


def parse(filepath: str):
 def _parse(filepath):
 with open(filepath) as f:
 for line in f:
 id, rest = SEP_RE.split(line, maxsplit=1)
 for match in DATA_RE.finditer(rest):
 yield [int(id), match["term"], float(match["weight"])]
 return list(_parse(filepath))

Example:

>>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
... columns=["Id", "Term", "weight"])
>>> 
>>> df
 Id Term weight
0 1 frack 0.733
1 1 shale 0.700
2 10 space 0.645
3 10 station 0.327
4 10 nasa 0.258
5 4 celebr 0.262
6 4 bahar 0.345

>>> df.dtypes
Id int64
Term object
weight float64
dtype: object

Walkthrough

An easy way to visualize this is to use an example line from your file as a string:

>>> line = "1: frack 0.733, shale 0.700,n"
>>> SEP_RE.split(line, maxsplit=1)
['1', 'frack 0.733, shale 0.700,n']

Now you have the initial ID and rest of the components, which you can unpack into two identifiers.

>>> id, rest = SEP_RE.split(line, maxsplit=1)
>>> it = DATA_RE.finditer(rest)
>>> match = next(it)
>>> match
<re.Match object; span=(0, 11), match='frack 0.733'>
>>> match["term"]
'frack'
>>> match["weight"]
'0.733'

The better way to visualize it is with pdb. Give it a try if you dare ;)

Disclaimer

This is one of those questions that demands a particular type of solution that may not generalize well if you ease up restrictions on your data format.

edited Apr 23 at 2:00

answered Apr 22 at 19:35

Brad Solomon

15k84096

3

Brilliant answer, I must say.

– amanb
Apr 22 at 19:42

@amanb Thank you!

– Brad Solomon
Apr 22 at 19:45

add a comment |

import re
import pandas as pd

SEP_RE = re.compile(r":s+")
DATA_RE = re.compile(r"(?P<term>[a-z]+)s+(?P<weight>d+.d+)", re.I)


def parse(filepath: str):
 def _parse(filepath):
 with open(filepath) as f:
 for line in f:
 id, rest = SEP_RE.split(line, maxsplit=1)
 for match in DATA_RE.finditer(rest):
 yield [int(id), match["term"], float(match["weight"])]
 return list(_parse(filepath))

Example:

>>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
... columns=["Id", "Term", "weight"])
>>> 
>>> df
 Id Term weight
0 1 frack 0.733
1 1 shale 0.700
2 10 space 0.645
3 10 station 0.327
4 10 nasa 0.258
5 4 celebr 0.262
6 4 bahar 0.345

>>> df.dtypes
Id int64
Term object
weight float64
dtype: object

Walkthrough

An easy way to visualize this is to use an example line from your file as a string:

>>> line = "1: frack 0.733, shale 0.700,n"
>>> SEP_RE.split(line, maxsplit=1)
['1', 'frack 0.733, shale 0.700,n']

Now you have the initial ID and rest of the components, which you can unpack into two identifiers.

>>> id, rest = SEP_RE.split(line, maxsplit=1)
>>> it = DATA_RE.finditer(rest)
>>> match = next(it)
>>> match
<re.Match object; span=(0, 11), match='frack 0.733'>
>>> match["term"]
'frack'
>>> match["weight"]
'0.733'

The better way to visualize it is with pdb. Give it a try if you dare ;)

Disclaimer

This is one of those questions that demands a particular type of solution that may not generalize well if you ease up restrictions on your data format.

edited Apr 23 at 2:00

answered Apr 22 at 19:35

Brad Solomon

15k84096

3

Brilliant answer, I must say.

– amanb
Apr 22 at 19:42

@amanb Thank you!

– Brad Solomon
Apr 22 at 19:45

add a comment |

import re
import pandas as pd

SEP_RE = re.compile(r":s+")
DATA_RE = re.compile(r"(?P<term>[a-z]+)s+(?P<weight>d+.d+)", re.I)


def parse(filepath: str):
 def _parse(filepath):
 with open(filepath) as f:
 for line in f:
 id, rest = SEP_RE.split(line, maxsplit=1)
 for match in DATA_RE.finditer(rest):
 yield [int(id), match["term"], float(match["weight"])]
 return list(_parse(filepath))

Example:

>>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
... columns=["Id", "Term", "weight"])
>>> 
>>> df
 Id Term weight
0 1 frack 0.733
1 1 shale 0.700
2 10 space 0.645
3 10 station 0.327
4 10 nasa 0.258
5 4 celebr 0.262
6 4 bahar 0.345

>>> df.dtypes
Id int64
Term object
weight float64
dtype: object

Walkthrough

An easy way to visualize this is to use an example line from your file as a string:

>>> line = "1: frack 0.733, shale 0.700,n"
>>> SEP_RE.split(line, maxsplit=1)
['1', 'frack 0.733, shale 0.700,n']

Now you have the initial ID and rest of the components, which you can unpack into two identifiers.

>>> id, rest = SEP_RE.split(line, maxsplit=1)
>>> it = DATA_RE.finditer(rest)
>>> match = next(it)
>>> match
<re.Match object; span=(0, 11), match='frack 0.733'>
>>> match["term"]
'frack'
>>> match["weight"]
'0.733'

The better way to visualize it is with pdb. Give it a try if you dare ;)

Disclaimer

This is one of those questions that demands a particular type of solution that may not generalize well if you ease up restrictions on your data format.

edited Apr 23 at 2:00

answered Apr 22 at 19:35

Brad Solomon

15k84096

import re
import pandas as pd

SEP_RE = re.compile(r":s+")
DATA_RE = re.compile(r"(?P<term>[a-z]+)s+(?P<weight>d+.d+)", re.I)


def parse(filepath: str):
 def _parse(filepath):
 with open(filepath) as f:
 for line in f:
 id, rest = SEP_RE.split(line, maxsplit=1)
 for match in DATA_RE.finditer(rest):
 yield [int(id), match["term"], float(match["weight"])]
 return list(_parse(filepath))

Example:

>>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
... columns=["Id", "Term", "weight"])
>>> 
>>> df
 Id Term weight
0 1 frack 0.733
1 1 shale 0.700
2 10 space 0.645
3 10 station 0.327
4 10 nasa 0.258
5 4 celebr 0.262
6 4 bahar 0.345

>>> df.dtypes
Id int64
Term object
weight float64
dtype: object

Walkthrough

An easy way to visualize this is to use an example line from your file as a string:

>>> line = "1: frack 0.733, shale 0.700,n"
>>> SEP_RE.split(line, maxsplit=1)
['1', 'frack 0.733, shale 0.700,n']

Now you have the initial ID and rest of the components, which you can unpack into two identifiers.

>>> id, rest = SEP_RE.split(line, maxsplit=1)
>>> it = DATA_RE.finditer(rest)
>>> match = next(it)
>>> match
<re.Match object; span=(0, 11), match='frack 0.733'>
>>> match["term"]
'frack'
>>> match["weight"]
'0.733'

The better way to visualize it is with pdb. Give it a try if you dare ;)

Disclaimer

This is one of those questions that demands a particular type of solution that may not generalize well if you ease up restrictions on your data format.

edited Apr 23 at 2:00

answered Apr 22 at 19:35

Brad Solomon

15k84096

edited Apr 23 at 2:00

answered Apr 22 at 19:35

Brad Solomon

15k84096

answered Apr 22 at 19:35

Brad Solomon

15k84096

answered Apr 22 at 19:35

Brad Solomon

15k84096

3

Brilliant answer, I must say.

– amanb
Apr 22 at 19:42

@amanb Thank you!

– Brad Solomon
Apr 22 at 19:45

add a comment |

3

Brilliant answer, I must say.

– amanb
Apr 22 at 19:42

@amanb Thank you!

– Brad Solomon
Apr 22 at 19:45

Brilliant answer, I must say.

– amanb
Apr 22 at 19:42

@amanb Thank you!

– Brad Solomon
Apr 22 at 19:45

add a comment |

You can use the DataFrame constructor if you massage your input to the appropriate format. Here is one way:

import pandas as pd
from itertools import chain

text="""1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 """

df = pd.DataFrame(
 list(
 chain.from_iterable(
 map(lambda z: (y[0], *z.strip().split()), y[1].split(",")) for y in 
 map(lambda x: x.strip(" ,").split(":"), text.splitlines())
 )
 ), 
 columns=["Id", "Term", "weight"]
)

print(df)
# Id Term weight
#0 4 frack 0.733
#1 4 shale 0.700
#2 4 space 0.645
#3 4 station 0.327
#4 4 nasa 0.258
#5 4 celebr 0.262
#6 4 bahar 0.345

Explanation

I assume that you've read your file into the string text. The first thing you want to do is strip leading/trailing commas and whitespace before splitting on :

print(list(map(lambda x: x.strip(" ,").split(":"), text.splitlines())))
#[['1', ' frack 0.733, shale 0.700'], 
# ['10', ' space 0.645, station 0.327, nasa 0.258'], 
# ['4', ' celebr 0.262, bahar 0.345']]

The next step is to split on the comma to separate the values, and assign the Id to each set of values:

print(
 [
 list(map(lambda z: (y[0], *z.strip().split()), y[1].split(","))) for y in 
 map(lambda x: x.strip(" ,").split(":"), text.splitlines())
 ]
)
#[[('1', 'frack', '0.733'), ('1', 'shale', '0.700')],
# [('10', 'space', '0.645'),
# ('10', 'station', '0.327'),
# ('10', 'nasa', '0.258')],
# [('4', 'celebr', '0.262'), ('4', 'bahar', '0.345')]]

Finally, we use itertools.chain.from_iterable to flatten this output, which can then be passed straight to the DataFrame constructor.

Note: The * tuple unpacking is a python 3 feature.

edited Apr 22 at 19:44

answered Apr 22 at 19:39

pault

17.4k42854

add a comment |

You can use the DataFrame constructor if you massage your input to the appropriate format. Here is one way:

import pandas as pd
from itertools import chain

text="""1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 """

df = pd.DataFrame(
 list(
 chain.from_iterable(
 map(lambda z: (y[0], *z.strip().split()), y[1].split(",")) for y in 
 map(lambda x: x.strip(" ,").split(":"), text.splitlines())
 )
 ), 
 columns=["Id", "Term", "weight"]
)

print(df)
# Id Term weight
#0 4 frack 0.733
#1 4 shale 0.700
#2 4 space 0.645
#3 4 station 0.327
#4 4 nasa 0.258
#5 4 celebr 0.262
#6 4 bahar 0.345

Explanation

I assume that you've read your file into the string text. The first thing you want to do is strip leading/trailing commas and whitespace before splitting on :

print(list(map(lambda x: x.strip(" ,").split(":"), text.splitlines())))
#[['1', ' frack 0.733, shale 0.700'], 
# ['10', ' space 0.645, station 0.327, nasa 0.258'], 
# ['4', ' celebr 0.262, bahar 0.345']]

The next step is to split on the comma to separate the values, and assign the Id to each set of values:

print(
 [
 list(map(lambda z: (y[0], *z.strip().split()), y[1].split(","))) for y in 
 map(lambda x: x.strip(" ,").split(":"), text.splitlines())
 ]
)
#[[('1', 'frack', '0.733'), ('1', 'shale', '0.700')],
# [('10', 'space', '0.645'),
# ('10', 'station', '0.327'),
# ('10', 'nasa', '0.258')],
# [('4', 'celebr', '0.262'), ('4', 'bahar', '0.345')]]

Finally, we use itertools.chain.from_iterable to flatten this output, which can then be passed straight to the DataFrame constructor.

Note: The * tuple unpacking is a python 3 feature.

edited Apr 22 at 19:44

answered Apr 22 at 19:39

pault

17.4k42854

add a comment |

You can use the DataFrame constructor if you massage your input to the appropriate format. Here is one way:

import pandas as pd
from itertools import chain

text="""1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 """

df = pd.DataFrame(
 list(
 chain.from_iterable(
 map(lambda z: (y[0], *z.strip().split()), y[1].split(",")) for y in 
 map(lambda x: x.strip(" ,").split(":"), text.splitlines())
 )
 ), 
 columns=["Id", "Term", "weight"]
)

print(df)
# Id Term weight
#0 4 frack 0.733
#1 4 shale 0.700
#2 4 space 0.645
#3 4 station 0.327
#4 4 nasa 0.258
#5 4 celebr 0.262
#6 4 bahar 0.345

Explanation

I assume that you've read your file into the string text. The first thing you want to do is strip leading/trailing commas and whitespace before splitting on :

print(list(map(lambda x: x.strip(" ,").split(":"), text.splitlines())))
#[['1', ' frack 0.733, shale 0.700'], 
# ['10', ' space 0.645, station 0.327, nasa 0.258'], 
# ['4', ' celebr 0.262, bahar 0.345']]

The next step is to split on the comma to separate the values, and assign the Id to each set of values:

print(
 [
 list(map(lambda z: (y[0], *z.strip().split()), y[1].split(","))) for y in 
 map(lambda x: x.strip(" ,").split(":"), text.splitlines())
 ]
)
#[[('1', 'frack', '0.733'), ('1', 'shale', '0.700')],
# [('10', 'space', '0.645'),
# ('10', 'station', '0.327'),
# ('10', 'nasa', '0.258')],
# [('4', 'celebr', '0.262'), ('4', 'bahar', '0.345')]]

Finally, we use itertools.chain.from_iterable to flatten this output, which can then be passed straight to the DataFrame constructor.

Note: The * tuple unpacking is a python 3 feature.

edited Apr 22 at 19:44

answered Apr 22 at 19:39

pault

Assuming your data is in a text file laid out as shown in the question:

import pandas as pd

# a multi-character separator is treated as a regex and handled by the
# python engine; naming the engine explicitly avoids a ParserWarning
df = pd.read_csv('untitled.txt', sep=': ', header=None, engine='python')
df.set_index(0, inplace=True)

# split on `,`
df = df[1].str.strip().str.split(',', expand=True)

#     0             1              2           3
# --  ------------  -------------  ----------  ---
#  1  frack 0.733   shale 0.700
# 10  space 0.645   station 0.327  nasa 0.258
#  4  celebr 0.262  bahar 0.345

# stack and drop empty
df = df.stack()
df = df[~df.eq('')]

# split on ' '
df = df.str.strip().str.split(' ', expand=True)

# edit to give final expected output:

# rename index and columns for reset_index
df.index.names = ['Id', 'to_drop']
df.columns = ['Term', 'weight']

# final df
final_df = df.reset_index().drop('to_drop', axis=1)
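On pandas 0.25 or later, the stack step can also be written with Series.explode. The sketch below is a variant of the approach above, not the original code: it reads from an in-memory string instead of untitled.txt so it can be run as-is, and the names raw, wide, pairs and long_df are chosen here for illustration.

```python
import io
import pandas as pd

raw = """1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 """

# In-memory stand-in for the file (assumption: same layout as in the question).
wide = pd.read_csv(io.StringIO(raw), sep=": ", header=None,
                   names=["Id", "pairs"], engine="python")

# One list of "term weight" strings per row, then one row per list element.
pairs = wide["pairs"].str.split(",").explode().str.strip()
pairs = pairs[pairs != ""]  # drop the empties left by trailing commas

# Split each "term weight" string on whitespace into columns 0 and 1.
parts = pairs.str.split(expand=True)

long_df = pd.DataFrame({
    "Id": wide.loc[parts.index, "Id"].to_numpy(),
    "Term": parts[0].to_numpy(),
    "weight": parts[1].astype(float).to_numpy(),
})
print(long_df)
```

explode keeps the original row label for every element, which is what lets the Id be looked up again with wide.loc.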

edited Apr 22 at 19:57

answered Apr 22 at 19:43

Quang Hoang

4,03611020

How are you not getting an error with sep=': ', which is a two-character separator?

– Rebin
Apr 22 at 19:55

@Rebin add engine='python'

– pault
Apr 22 at 19:58

@pault weird, 'cause I already split by ' '. It yields correct data on my computer.

– Quang Hoang
Apr 22 at 20:02

I don't know how to add engine python. What is the command?

– Rebin
Apr 22 at 20:02

@Rebin add it as a param to pd.read_csv - df = pd.read_csv(..., engine='python')

– pault
Apr 22 at 20:04

Just to put my two cents in: you could write yourself a parser and feed the result into pandas:

import pandas as pd
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

file = """
1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 
"""

grammar = Grammar(
    r"""
    expr        = (garbage / line)+

    line        = id colon pair*
    pair        = term ws weight sep? ws?
    garbage     = ws+

    id          = ~"\d+"
    colon       = ws? ":" ws?
    sep         = ws? "," ws?

    term        = ~"[a-zA-Z]+"
    weight      = ~"\d+(?:\.\d+)?"

    ws          = ~"\s+"
    """
)

tree = grammar.parse(file)

class PandasVisitor(NodeVisitor):
    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_pair(self, node, visited_children):
        # keep only the term and weight text, ignore whitespace and separators
        term, _, weight, *_ = visited_children
        return (term.text, weight.text)

    def visit_line(self, node, visited_children):
        # prefix every (term, weight) pair with the id of its line
        id, _, pairs = visited_children
        return [(id.text, *pair) for pair in pairs]

    def visit_garbage(self, node, visited_children):
        return None

    def visit_expr(self, node, visited_children):
        # flatten the per-line lists into one list of tuples
        return [item
                for lst in visited_children
                for sublst in lst if sublst
                for item in sublst]

pv = PandasVisitor()
out = pv.visit(tree)

df = pd.DataFrame(out, columns=["Id", "Term", "weight"])
print(df)

This yields

   Id     Term weight
0   1    frack  0.733
1   1    shale  0.700
2  10    space  0.645
3  10  station  0.327
4  10     nasa  0.258
5   4   celebr  0.262
6   4    bahar  0.345

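The grammar only describes how to parse the text. Afterwards, we need a NodeVisitor class to actually do something with the resulting parse tree, here collecting one (Id, Term, weight) tuple per pair.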
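If installing a parser library is not an option, the same token patterns from the grammar (digits for the id, letters for the term, a decimal number for the weight) can be applied with the standard re module. Note that this swaps the PEG parser for plain regular expressions, so it is a lighter sketch rather than an equivalent of the parsimonious solution; LINE_RE, PAIR_RE and parse are names invented for this example.

```python
import re
import pandas as pd

sample = """
1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 
"""

# One line: an integer id, a colon, then the rest of the line.
LINE_RE = re.compile(r"^\s*(\d+)\s*:(.*)$", re.MULTILINE)
# One pair: a term made of letters followed by a decimal weight.
PAIR_RE = re.compile(r"([a-zA-Z]+)\s+(\d+(?:\.\d+)?)")


def parse(text):
    """Return (id, term, weight) string tuples, in input order."""
    rows = []
    for line_match in LINE_RE.finditer(text):
        line_id, rest = line_match.groups()
        for term, weight in PAIR_RE.findall(rest):
            rows.append((line_id, term, weight))
    return rows


df = pd.DataFrame(parse(sample), columns=["Id", "Term", "weight"])
print(df)
```

Unlike the grammar, this silently skips anything that does not match the patterns, so malformed input goes unnoticed instead of raising a parse error.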

edited 2 days ago

answered Apr 22 at 20:29

Jan

Here is another take on your question: build a list that contains one [Id, Term, weight] list per term, then create the DataFrame from it.

import pandas as pd
file = r"give_your_path".replace('\\', '/')
my_list_of_lists = []  # will contain one [Id, Term, weight] list per term
with open(file, "r") as f:
    for line in f:  # loop over every line
        my_id = [line.split(":")[0]]  # store the Id so it can be reused for every term
        rest = line[line.find(":") + 1:]
        # skip empty chunks instead of dropping the last one, so a line
        # without a trailing comma keeps its final term
        for term in [s.split() for s in rest.split(",") if s.strip()]:
            my_list_of_lists.append(my_id + term)
df = pd.DataFrame.from_records(my_list_of_lists)  # turn the lists into a DataFrame
df.columns = ["Id", "Term", "weight"]  # give the columns their names

answered Apr 22 at 19:55

JoPapou13

It is possible to do this entirely in pandas:

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(u"""1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 """), sep=":", header=None)

#df:
 0 1
0 1 frack 0.733, shale 0.700, 
1 10 space 0.645, station 0.327, nasa 0.258, 
2 4 celebr 0.262, bahar 0.345

Split column 1 into lists, then expand each list into one row per element:

df[1] = df[1].str.split(",", expand=False)

dfs = []
for idx, rows in df.iterrows():
    # repeat the Id once for every entry in this row's list
    dfslice = pd.DataFrame({"Id": [rows[0]]*len(rows[1]), "terms": rows[1]})
    dfs.append(dfslice)
newdf = pd.concat(dfs, ignore_index=True)

# this creates newdf:
   Id          terms
0   1    frack 0.733
1   1    shale 0.700
2   1
3  10    space 0.645
4  10  station 0.327
5  10     nasa 0.258
6  10
7   4   celebr 0.262
8   4    bahar 0.345

Now we need to split the terms column on the space and drop the empty rows:

newdf["terms"] = newdf["terms"].str.strip()
newdf = newdf.join(newdf["terms"].str.split(" ", expand=True))
newdf.columns = ["Id", "terms", "Term", "Weights"]
newdf = newdf.drop("terms", axis=1).dropna()

Resulting newdf:

   Id     Term Weights
0   1    frack   0.733
1   1    shale   0.700
3  10    space   0.645
4  10  station   0.327
5  10     nasa   0.258
7   4   celebr   0.262
8   4    bahar   0.345
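The iterrows loop can probably be avoided altogether. As a sketch under the assumption that every term consists of letters and every weight is a plain decimal number, Series.str.extractall pulls out all term/weight pairs per row in one call; the column names Id and terms and the variable pattern are chosen here for illustration.

```python
from io import StringIO
import pandas as pd

raw = """1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 """

df = pd.read_csv(StringIO(raw), sep=":", header=None, names=["Id", "terms"])

# Named groups become the column names of the extractall result.
pattern = r"(?P<Term>[A-Za-z]+)\s+(?P<Weights>\d+(?:\.\d+)?)"
matches = df["terms"].str.extractall(pattern)

# extractall indexes rows by (original row label, match number);
# dropping the match level lets the Id be joined back on the row label.
newdf = (
    matches.reset_index(level="match", drop=True)
           .join(df["Id"])
           [["Id", "Term", "Weights"]]
           .reset_index(drop=True)
)
newdf["Weights"] = newdf["Weights"].astype(float)
print(newdf)
```

Because only matching pairs are extracted, there are no empty rows to drop afterwards.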

answered Apr 22 at 19:58

Rocky Li

Here is a line-by-line version. Each item is split on whitespace, so it does not matter how many spaces come before each term:

import pandas as pd

df = pd.DataFrame(columns=['ID', 'Term', 'Weight'])
with open('C:/random/d1', 'r') as readObject:
    for line in readObject:
        line = line.rstrip('\n')
        tempList1 = line.split(':')
        # drop surrounding spaces and the trailing comma before splitting
        tempList2 = tempList1[1].strip(' ,').split(',')
        for item in tempList2:
            e = item.split()  # whitespace split ignores leading spaces
            tempRow = [tempList1[0], e[0], e[1]]
            df.loc[len(df)] = tempRow
print(df)

answered Apr 22 at 20:04

Rebin

1) Read the file line by line.

2) Split each line on ':' to separate the Id from the values; the values can then be split on ','.

with open('path/filename.txt', 'r') as filename:
    content = filename.readlines()

# split on ':' and strip the surrounding whitespace and newline from each part
content = [[part.strip() for part in x.split(':')] for x in content]

This will give you the following result:

content = [
    ['1', 'frack 0.733, shale 0.700,'],
    ['10', 'space 0.645, station 0.327, nasa 0.258,'],
    ['4', 'celebr 0.262, bahar 0.345']]
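To get from this intermediate list to the table asked for in the question, the values still have to be split on ',' and paired with their Id. A minimal sketch of that remaining step, assuming content looks like the list above (the variable names rows and df are illustrative):

```python
import pandas as pd

content = [
    ['1', 'frack 0.733, shale 0.700,'],
    ['10', 'space 0.645, station 0.327, nasa 0.258,'],
    ['4', 'celebr 0.262, bahar 0.345'],
]

# One (Id, Term, weight) tuple per non-empty "term weight" chunk.
rows = [
    (row_id, *pair.split())
    for row_id, values in content
    for pair in values.split(',')
    if pair.strip()  # skip the empty chunk after a trailing comma
]

df = pd.DataFrame(rows, columns=["Id", "Term", "weight"])
print(df)
```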

answered Apr 22 at 19:30

CedricLy

Your result is not the result asked for in the question.

– GiraffeMan91
Apr 22 at 19:31
