How to slice a string input at a certain unknown index

A string is given as input, and it always contains a question that I want to extract (e.g. "What is your name?"). The problem I am trying to solve is that the input also always contains unneeded text around the question.

So the input could be (but is not limited to) the following:

1- "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjas\n"
2- "What is your\nlastname and email?\ndasf?lkjas"
3- "askjdmk.\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"

(Notice that in the third input, the question starts with the word "Given" and ends with "yourself?")

The above input examples were generated by scanning an image with the pytesseract OCR library and converting it into text.

I only want to extract the question from the garbage input and nothing else.

I tried to use the find('?', 1) string method to get the index of the end of the question (assuming for now that the first question mark always ends the question and is not part of the text I want to discard). But I can't figure out how to get the index of the first letter of the question. I tried looping in reverse to find the first \n before the question mark, but the question doesn't always have a \n before its first letter.

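def extractQuestion(input):
    # Index of the first '?' (searching from position 1), assumed to end the question
    index_end_q = input.find('?', 1)
    index_first_letter_of_q = 0  # TODO: find where the question starts
    # Slice up to and including the question mark
    question = input[index_first_letter_of_q:index_end_q + 1]
    return question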

python

edited Jul 6 at 20:29 by Peter Mortensen

asked Jul 6 at 9:29 by LinkCoder

I think more examples may help determine if there is any invariant property about the start of the question to hook into.

– Andrew Allen, Jul 6 at 9:45

I think this TODO is the TODO of humanity right now because you'll need to make your program understand human language in order to properly solve this, and this task remains largely unsolved now.

– ForceBru, Jul 6 at 9:45

You said the input could be, but not limited to, the following. Well, what is it limited to? Maybe you can tell us where you're getting these inputs and we may be able to provide a solution that navigates around having messy inputs in the first place.

– user10987432, Jul 6 at 9:55


2 Answers

A way to find the question's first word index would be to search for the first word that has an actual meaning (you're interested in English words I suppose). A way to do that would be using pyenchant:

#!/usr/bin/env python

import enchant

GLOSSARY = enchant.Dict("en_US")

def isWord(word):
    # True if the en_US dictionary recognizes the word
    return True if GLOSSARY.check(word) else False

sentences = [
    "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjas\n",
    "What is your\nlastname and email?\ndasf?lkjas",
    "\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]

for sentence in sentences:
    # Report the index of the first whitespace-separated token that is a real word
    for i, w in enumerate(sentence.split()):
        if isWord(w):
            print('index: {} => {}'.format(i, w))
            break

The above piece of code gives as a result:

index: 3 => What
index: 0 => What
index: 0 => Given
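The loop above yields a word index, while slicing needs a character index. One way to bridge that gap is to scan the tokens together with their offsets. This is only a minimal sketch: `KNOWN_WORDS` is a tiny hard-coded stand-in for the pyenchant dictionary (an assumption made so the example runs without extra packages), and `first_word_offset` and `slice_question` are hypothetical helper names.

```python
import re

# Stand-in vocabulary (assumption); in practice pass GLOSSARY.check from pyenchant
KNOWN_WORDS = {"what", "is", "your", "name", "given", "skills"}

def is_word(token):
    # Strip surrounding punctuation before the lookup
    return token.strip("?.,!").lower() in KNOWN_WORDS

def first_word_offset(text, check=is_word):
    # Return the character offset of the first recognized word, or -1 if none
    for m in re.finditer(r"\S+", text):
        if check(m.group()):
            return m.start()
    return -1

def slice_question(text, check=is_word):
    # Slice from the first recognized word up to and including the next '?'
    start = first_word_offset(text, check)
    if start == -1:
        return None
    end = text.find("?", start)
    if end == -1:
        return None
    return text[start:end + 1]

print(slice_question("eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjas\n"))
```

As the comments below point out, this still breaks down when the text before the question happens to contain a valid dictionary word.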

edited Jul 6 at 18:50 by Peter Mortensen

answered Jul 6 at 9:42 by game0ver

Yes, this would be a great solution for the problem, but what if the input BEFORE the question is also a valid English word? I updated the question btw.

– LinkCoder, Jul 6 at 9:49

@LinkCoder Then the problem is much more complicated than the one you initially described. Maybe NLTK could be of some help in recognizing "logical" sentences, but as far as I know that's not an easy problem to solve (and I think it hadn't been solved at the time I posted this answer).

– game0ver, Jul 6 at 9:53

Alright, then assuming the input before the question is garbage, how could I find the index of the first letter of the question within the whole input? Then I can slice it.

– LinkCoder, Jul 6 at 9:59

@LinkCoder that's easy, you can use the Python built-in find() string method.

– game0ver, Jul 6 at 10:04

Ok I get it now, thanks for the effort that you have put into your answer :)

– LinkCoder, Jul 6 at 10:19

You could try a regular expression like \b[A-Z][a-z][^?]+\?, meaning:

The start of a word \b with an upper case letter [A-Z] followed by a lower case letter [a-z],

then a sequence of non-question-mark characters [^?]+,

followed by a literal question mark \?.

This can still have some false positives or misses, e.g. if a question actually starts with an acronym, or if there is a name in the middle of the question, but for your examples it works quite well.

>>> tests = ["eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjas\n",
...          "What is your\nlastname and email?\ndasf?lkjas",
...          "\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]

>>> import re
>>> p = r"\b[A-Z][a-z][^?]+\?"
>>> [re.search(p, t).group() for t in tests]
['What is your name?',
 'What is your\nlastname and email?',
 'Given your skills\nhow would you rate yourself?']
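Since the question asks for indices to slice at, it may help to note that a match object also exposes the start and end offsets of the match. A minimal sketch, assuming the same pattern with its backslashes written out; `question_span` is a hypothetical helper name:

```python
import re

# Capitalized word start, then anything up to and including the first '?'
pattern = re.compile(r"\b[A-Z][a-z][^?]+\?")

def question_span(text):
    # Return (start, end) offsets of the first match, or None if nothing matches
    m = pattern.search(text)
    if m is None:
        return None
    return m.start(), m.end()

text = "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjas\n"
span = question_span(text)
if span is not None:
    start, end = span
    # The offsets can be used directly for slicing
    print(start, end, repr(text[start:end]))
```

Checking for None before slicing avoids the AttributeError that `re.search(...).group()` raises when the input contains no match.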

If that's one blob of text, you can use findall instead of search:

>>> text = "\n".join(tests)
>>> re.findall(p, text)
['What is your name?',
 'What is your\nlastname and email?',
 'Given your skills\nhow would you rate yourself?']

Actually, this also seems to work reasonably well for questions with names in them:

>>> t = "asdGARBAGEasd\nHow did you like St. Petersburg? more stuff with ?"
>>> re.search(p, t).group()
'How did you like St. Petersburg?'
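For use on a single OCR string, the pieces above could be wrapped into one function. This is a sketch under the same assumptions as the pattern (the question starts at the first capitalized word and ends at the first question mark after it); `extract_question` is a hypothetical name, and collapsing line breaks into spaces is an optional cleanup step rather than part of the regex approach itself.

```python
import re

QUESTION_RE = re.compile(r"\b[A-Z][a-z][^?]+\?")

def extract_question(text):
    # Return the first question-like span with whitespace normalized, or None
    m = QUESTION_RE.search(text)
    if m is None:
        return None
    # split() + join collapses newlines and repeated spaces into single spaces
    return " ".join(m.group().split())

print(extract_question("What is your\nlastname and email?\ndasf?lkjas"))
```

Because `[^?]` already matches newline characters, the pattern spans line breaks without needing the `re.S` flag.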

edited Jul 6 at 11:10

answered Jul 6 at 9:43 by tobias_k

Thanks for the effort that you put into your answer. I have a question though: How can I do the exact thing that you did in your answer on a single string variable?

– LinkCoder, Jul 6 at 10:19

@LinkCoder Wouldn't that be exactly what I did in the lower code with text?

– tobias_k, Jul 6 at 10:59

Yes I figured it out. I just used re.search(regex_pattern, input, flags=re.S).group().replace("\n", " ") so thanks for that. Btw is there a regex that doesn't miss names but does the same thing as what you did above?

– LinkCoder, Jul 6 at 11:04

Actually, I think it should work okay if there are names in the question, as long as the start of the question is the first word in the sentence starting with an upper-case letter.

– tobias_k, Jul 6 at 11:06

Ok I think I understand it thanks very much for your effort and time :)

– LinkCoder, Jul 6 at 11:11

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f56912823%2fhow-to-slice-a-string-input-at-a-certain-unknown-index%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

#!/usr/bin/env python

import enchant

GLOSSARY = enchant.Dict("en_US")

def isWord(word):
 return True if GLOSSARY.check(word) else False

sentences = [
"eo000 ATATAT EGnnWhat is your name?nkgda dasflkjasn",
"What is yournlastname and email?ndasf?lkjas",
"nGiven your skillsnhow would you rate yourself?nand your name? dasf?"]

for sentence in sentences:
 for i,w in enumerate(sentence.split()):
 if isWord(w):
 print('index: => '.format(i, w))
 break

The above piece of code gives as a result:

index: 3 => What
index: 0 => What
index: 0 => Given

edited Jul 6 at 18:50

Peter Mortensen

14.2k19 gold badges88 silver badges115 bronze badges

answered Jul 6 at 9:42

game0ver

8585 silver badges19 bronze badges

Yes this would be a great solution for the problem, but what if the input BEFORE the question is also a valid english word? I updated the question btw

– LinkCoder
Jul 6 at 9:49

@LinkCoder Then the problem is much more complicated from the one you initially described. Then maybe NLTK be of some help to help you recognize "logical" sentences, but as far as I know that's not an easy problem to solve (and I think it hasn't been solved by the time I posted this answer).

– game0ver
Jul 6 at 9:53

Alright then assuming the input before question is garbage, then how could I find the index of the first letter of the question of the whole input because then I can slice it.

– LinkCoder
Jul 6 at 9:59

@LinkCoder that's easy, you can use the python built-in find() function.

– game0ver
Jul 6 at 10:04

Ok I get it now thanks for the effort that you have put into your answer :)

– LinkCoder
Jul 6 at 10:19

|
show 1 more comment

#!/usr/bin/env python

import enchant

GLOSSARY = enchant.Dict("en_US")

def isWord(word):
 return True if GLOSSARY.check(word) else False

sentences = [
"eo000 ATATAT EGnnWhat is your name?nkgda dasflkjasn",
"What is yournlastname and email?ndasf?lkjas",
"nGiven your skillsnhow would you rate yourself?nand your name? dasf?"]

for sentence in sentences:
 for i,w in enumerate(sentence.split()):
 if isWord(w):
 print('index: => '.format(i, w))
 break

The above piece of code gives as a result:

index: 3 => What
index: 0 => What
index: 0 => Given

edited Jul 6 at 18:50

Peter Mortensen

14.2k19 gold badges88 silver badges115 bronze badges

answered Jul 6 at 9:42

game0ver

8585 silver badges19 bronze badges

Yes this would be a great solution for the problem, but what if the input BEFORE the question is also a valid english word? I updated the question btw

– LinkCoder
Jul 6 at 9:49

@LinkCoder Then the problem is much more complicated from the one you initially described. Then maybe NLTK be of some help to help you recognize "logical" sentences, but as far as I know that's not an easy problem to solve (and I think it hasn't been solved by the time I posted this answer).

– game0ver
Jul 6 at 9:53

Alright then assuming the input before question is garbage, then how could I find the index of the first letter of the question of the whole input because then I can slice it.

– LinkCoder
Jul 6 at 9:59

@LinkCoder that's easy, you can use the python built-in find() function.

– game0ver
Jul 6 at 10:04

Ok I get it now thanks for the effort that you have put into your answer :)

– LinkCoder
Jul 6 at 10:19

|
show 1 more comment

#!/usr/bin/env python

import enchant

GLOSSARY = enchant.Dict("en_US")

def isWord(word):
 return True if GLOSSARY.check(word) else False

sentences = [
"eo000 ATATAT EGnnWhat is your name?nkgda dasflkjasn",
"What is yournlastname and email?ndasf?lkjas",
"nGiven your skillsnhow would you rate yourself?nand your name? dasf?"]

for sentence in sentences:
 for i,w in enumerate(sentence.split()):
 if isWord(w):
 print('index: => '.format(i, w))
 break

The above piece of code gives as a result:

index: 3 => What
index: 0 => What
index: 0 => Given

edited Jul 6 at 18:50

Peter Mortensen

14.2k19 gold badges88 silver badges115 bronze badges

answered Jul 6 at 9:42

game0ver

8585 silver badges19 bronze badges

#!/usr/bin/env python

import enchant

GLOSSARY = enchant.Dict("en_US")

def isWord(word):
 return True if GLOSSARY.check(word) else False

sentences = [
"eo000 ATATAT EGnnWhat is your name?nkgda dasflkjasn",
"What is yournlastname and email?ndasf?lkjas",
"nGiven your skillsnhow would you rate yourself?nand your name? dasf?"]

for sentence in sentences:
 for i,w in enumerate(sentence.split()):
 if isWord(w):
 print('index: => '.format(i, w))
 break

The above piece of code gives as a result:

index: 3 => What
index: 0 => What
index: 0 => Given

edited Jul 6 at 18:50

Peter Mortensen

14.2k19 gold badges88 silver badges115 bronze badges

answered Jul 6 at 9:42

game0ver

8585 silver badges19 bronze badges

edited Jul 6 at 18:50

Peter Mortensen

14.2k19 gold badges88 silver badges115 bronze badges

edited Jul 6 at 18:50

Peter Mortensen

14.2k19 gold badges88 silver badges115 bronze badges

edited Jul 6 at 18:50

Peter Mortensen

14.2k19 gold badges88 silver badges115 bronze badges

answered Jul 6 at 9:42

game0ver

8585 silver badges19 bronze badges

answered Jul 6 at 9:42

game0ver

8585 silver badges19 bronze badges

answered Jul 6 at 9:42

game0ver

8585 silver badges19 bronze badges

Yes this would be a great solution for the problem, but what if the input BEFORE the question is also a valid english word? I updated the question btw

– LinkCoder
Jul 6 at 9:49

@LinkCoder Then the problem is much more complicated from the one you initially described. Then maybe NLTK be of some help to help you recognize "logical" sentences, but as far as I know that's not an easy problem to solve (and I think it hasn't been solved by the time I posted this answer).

– game0ver
Jul 6 at 9:53

Alright then assuming the input before question is garbage, then how could I find the index of the first letter of the question of the whole input because then I can slice it.

– LinkCoder
Jul 6 at 9:59

@LinkCoder that's easy, you can use the python built-in find() function.

– game0ver
Jul 6 at 10:04

Ok I get it now thanks for the effort that you have put into your answer :)

– LinkCoder
Jul 6 at 10:19

|
show 1 more comment

Yes this would be a great solution for the problem, but what if the input BEFORE the question is also a valid english word? I updated the question btw

– LinkCoder
Jul 6 at 9:49

@LinkCoder Then the problem is much more complicated from the one you initially described. Then maybe NLTK be of some help to help you recognize "logical" sentences, but as far as I know that's not an easy problem to solve (and I think it hasn't been solved by the time I posted this answer).

– game0ver
Jul 6 at 9:53

Alright then assuming the input before question is garbage, then how could I find the index of the first letter of the question of the whole input because then I can slice it.

– LinkCoder
Jul 6 at 9:59

@LinkCoder that's easy, you can use the python built-in find() function.

– game0ver
Jul 6 at 10:04

Ok I get it now thanks for the effort that you have put into your answer :)

– LinkCoder
Jul 6 at 10:19

Yes this would be a great solution for the problem, but what if the input BEFORE the question is also a valid english word? I updated the question btw

– LinkCoder
Jul 6 at 9:49

@LinkCoder Then the problem is much more complicated from the one you initially described. Then maybe NLTK be of some help to help you recognize "logical" sentences, but as far as I know that's not an easy problem to solve (and I think it hasn't been solved by the time I posted this answer).

– game0ver
Jul 6 at 9:53

Alright then assuming the input before question is garbage, then how could I find the index of the first letter of the question of the whole input because then I can slice it.

– LinkCoder
Jul 6 at 9:59

@LinkCoder that's easy, you can use the python built-in find() function.

– game0ver
Jul 6 at 10:04

Ok I get it now thanks for the effort that you have put into your answer :)

– LinkCoder
Jul 6 at 10:19

|
show 1 more comment

You could try a regular expression like b[A-Z][a-z][^?]+?, meaning:

The start of a word b with an upper case letter [A-Z] followed by a lower case letter [a-z],

then a sequence of non-questionmark-characters [^?]+,

followed by a literal question mark ?.

This can still have some false positives or misses, e.g. if a question actually starts with an acronym, or if there is a name in the middle of the question, but for you examples it works quite well.

>>> tests = ["eo000 ATATAT EGnnWhat is your name?nkgda dasflkjasn",
 "What is yournlastname and email?ndasf?lkjas",
 "nGiven your skillsnhow would you rate yourself?nand your name? dasf?"]

>>> import re
>>> p = r"b[A-Z][a-z][^?]+?"
>>> [re.search(p, t).group() for t in tests]
['What is your name?',
 'What is yournlastname and email?',
 'Given your skillsnhow would you rate yourself?']

If that's one blob of text, you can use findall instead of search:

>>> text = "n".join(tests)
>>> re.findall(p, text)
['What is your name?',
 'What is yournlastname and email?',
 'Given your skillsnhow would you rate yourself?']

Actually, this also seems to work reasonably well for questions with names in them:

>>> t = "asdGARBAGEasdnHow did you like St. Petersburg? more stuff with ?" 
>>> re.search(p, t).group()
'How did you like St. Petersburg?'

edited Jul 6 at 11:10

answered Jul 6 at 9:43

tobias_k

61.3k9 gold badges73 silver badges116 bronze badges

Thanks for the effort that you put into your answer. I have a question though: How can I do the exact thing that you did in your answer on a single string variable?

– LinkCoder
Jul 6 at 10:19

@LinkCoder Wouldn't that be exactly what I did in the lower code with text?

– tobias_k
Jul 6 at 10:59

Yes I figured it out. I just used re.search(regex_pattern, input, flags=re.S).group().replace("n", " ") so thanks for that. Btw is there a regex that doesn't miss names but does the same thing as what you did above?

– LinkCoder
Jul 6 at 11:04

Actually, I think it should work okay if there are names in the question, as long as the start of the question is the first word in the sentence starting with an upper-case letter.

– tobias_k
Jul 6 at 11:06

Ok I think I understand it thanks very much for your effort and time :)

– LinkCoder
Jul 6 at 11:11

add a comment |

You could try a regular expression like b[A-Z][a-z][^?]+?, meaning:

The start of a word b with an upper case letter [A-Z] followed by a lower case letter [a-z],

then a sequence of non-questionmark-characters [^?]+,

followed by a literal question mark ?.

This can still have some false positives or misses, e.g. if a question actually starts with an acronym, or if there is a name in the middle of the question, but for you examples it works quite well.

>>> tests = ["eo000 ATATAT EGnnWhat is your name?nkgda dasflkjasn",
 "What is yournlastname and email?ndasf?lkjas",
 "nGiven your skillsnhow would you rate yourself?nand your name? dasf?"]

>>> import re
>>> p = r"b[A-Z][a-z][^?]+?"
>>> [re.search(p, t).group() for t in tests]
['What is your name?',
 'What is yournlastname and email?',
 'Given your skillsnhow would you rate yourself?']

If that's one blob of text, you can use findall instead of search:

>>> text = "n".join(tests)
>>> re.findall(p, text)
['What is your name?',
 'What is yournlastname and email?',
 'Given your skillsnhow would you rate yourself?']

Actually, this also seems to work reasonably well for questions with names in them:

>>> t = "asdGARBAGEasdnHow did you like St. Petersburg? more stuff with ?" 
>>> re.search(p, t).group()
'How did you like St. Petersburg?'

edited Jul 6 at 11:10

answered Jul 6 at 9:43

tobias_k

61.3k9 gold badges73 silver badges116 bronze badges

Thanks for the effort that you put into your answer. I have a question though: How can I do the exact thing that you did in your answer on a single string variable?

– LinkCoder
Jul 6 at 10:19

@LinkCoder Wouldn't that be exactly what I did in the lower code with text?

– tobias_k
Jul 6 at 10:59

Yes I figured it out. I just used re.search(regex_pattern, input, flags=re.S).group().replace("n", " ") so thanks for that. Btw is there a regex that doesn't miss names but does the same thing as what you did above?

– LinkCoder
Jul 6 at 11:04

Actually, I think it should work okay if there are names in the question, as long as the start of the question is the first word in the sentence starting with an upper-case letter.

– tobias_k
Jul 6 at 11:06

Ok I think I understand it thanks very much for your effort and time :)

– LinkCoder
Jul 6 at 11:11

add a comment |

You could try a regular expression like b[A-Z][a-z][^?]+?, meaning:

The start of a word b with an upper case letter [A-Z] followed by a lower case letter [a-z],

then a sequence of non-questionmark-characters [^?]+,

followed by a literal question mark ?.

This can still have some false positives or misses, e.g. if a question actually starts with an acronym, or if there is a name in the middle of the question, but for you examples it works quite well.

>>> tests = ["eo000 ATATAT EGnnWhat is your name?nkgda dasflkjasn",
 "What is yournlastname and email?ndasf?lkjas",
 "nGiven your skillsnhow would you rate yourself?nand your name? dasf?"]

>>> import re
>>> p = r"b[A-Z][a-z][^?]+?"
>>> [re.search(p, t).group() for t in tests]
['What is your name?',
 'What is yournlastname and email?',
 'Given your skillsnhow would you rate yourself?']

If that's one blob of text, you can use findall instead of search:

>>> text = "n".join(tests)
>>> re.findall(p, text)
['What is your name?',
 'What is yournlastname and email?',
 'Given your skillsnhow would you rate yourself?']

Actually, this also seems to work reasonably well for questions with names in them:

>>> t = "asdGARBAGEasdnHow did you like St. Petersburg? more stuff with ?" 
>>> re.search(p, t).group()
'How did you like St. Petersburg?'

edited Jul 6 at 11:10

answered Jul 6 at 9:43

tobias_k

61.3k9 gold badges73 silver badges116 bronze badges

You could try a regular expression like \b[A-Z][a-z][^?]+\?, meaning:

The start of a word \b with an upper-case letter [A-Z] followed by a lower-case letter [a-z],

then a sequence of non-question-mark characters [^?]+,

followed by a literal question mark \?.

This can still produce some false positives or misses, e.g. if a question actually starts with an acronym, or if there is a name in the middle of the question, but for your examples it works quite well.

>>> tests = ["eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjas\n",
...          "What is your\nlastname and email?\ndasf?lkjas",
...          "\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]

>>> import re
>>> p = r"\b[A-Z][a-z][^?]+\?"
>>> [re.search(p, t).group() for t in tests]
['What is your name?',
 'What is your\nlastname and email?',
 'Given your skills\nhow would you rate yourself?']

If that's one blob of text, you can use findall instead of search:

>>> text = "\n".join(tests)
>>> re.findall(p, text)
['What is your name?',
 'What is your\nlastname and email?',
 'Given your skills\nhow would you rate yourself?']

Actually, this also seems to work reasonably well for questions with names in them:

>>> t = "asdGARBAGEasd\nHow did you like St. Petersburg? more stuff with ?"
>>> re.search(p, t).group()
'How did you like St. Petersburg?'
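
The caveats mentioned earlier can be reproduced with a short sketch. The two inputs, acronym and early, are made up for illustration and are not from the original question: the first starts its question with an acronym, the second has a capitalized word in the garbage before the question.

```python
import re

p = r"\b[A-Z][a-z][^?]+\?"

# Hypothetical input: the question starts with an acronym, so no word
# begins with an upper-case letter followed by a lower-case one.
acronym = "junk\nHTML is what exactly?"
missed = re.search(p, acronym)
print(missed)  # None: the question is missed entirely

# Hypothetical input: a capitalized word appears before the question,
# so the match starts too early and drags the garbage along.
early = "Some junk\nWhat is this?"
too_long = re.search(p, early).group()
print(repr(too_long))
```

Because search returns None when nothing matches, calling .group() directly on the result raises an AttributeError for inputs like the first one, so it may be worth checking the match object before using it.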

edited Jul 6 at 11:10

answered Jul 6 at 9:43

tobias_k



Thanks for the effort that you put into your answer. I have a question though: How can I do the exact thing that you did in your answer on a single string variable?

– LinkCoder
Jul 6 at 10:19

@LinkCoder Wouldn't that be exactly what I did in the lower code with text?

– tobias_k
Jul 6 at 10:59

Yes I figured it out. I just used re.search(regex_pattern, input, flags=re.S).group().replace("\n", " ") so thanks for that. Btw is there a regex that doesn't miss names but does the same thing as what you did above?

– LinkCoder
Jul 6 at 11:04

Actually, I think it should work okay if there are names in the question, as long as the start of the question is the first word in the sentence starting with an upper-case letter.

– tobias_k
Jul 6 at 11:06

Ok I think I understand it thanks very much for your effort and time :)

– LinkCoder
Jul 6 at 11:11
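
For reference, the single-string approach from the comments could look roughly like the sketch below. The helper name first_question and the sample input are assumptions made for illustration. The re.S flag is left out here because it only changes how . behaves, and this pattern contains no .; the negated class [^?] already matches newlines.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
QUESTION = re.compile(r"\b[A-Z][a-z][^?]+\?")

def first_question(text):
    """Return the first question found in text on one line, or None."""
    match = QUESTION.search(text)
    if match is None:
        return None
    # Flatten line breaks inside the question into spaces.
    return match.group().replace("\n", " ")

raw = "eo000 ATATAT EG\n\nWhat is your\nlastname and email?\ndasf?lkjas"
print(first_question(raw))  # What is your lastname and email?
```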
