How to slice a string input at a certain unknown index
A string is given as input (e.g. "What is your name?"). The input always contains a question, which I want to extract. The problem is that the input always comes with unneeded text around it.
The input could be (but is not limited to) the following:
1- "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjas\n"
2- "What is your\nlastname and email?\ndasf?lkjas"
3- "askjdmk.\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"
(Notice that in the third input, the question starts with the word "Given" and ends with "yourself?")
The above input examples are generated by the pytesseract OCR library when scanning an image and converting it into text.
I only want to extract the question from the garbage input and nothing else.
I tried to use the string method find('?', 1) to get the index of the last character of the question (assuming for now that the first question mark is always the end of the question and not part of the input that I don't want). But I can't figure out how to get the index of the first letter of the question. I tried to loop in reverse and take the first \n I find in the input, but the question doesn't always have a \n before its first letter.
    def extractQuestion(input):
        index_end_q = input.find('?', 1)
        index_first_letter_of_q = 0  # TODO
        question = '\n '.join(input[index_first_letter_of_q:index_end_q])
python
I think more examples may help determine if there is any invariant property about the start of the question to hook into.
– Andrew Allen
Jul 6 at 9:45
2
I think this TODO is the TODO of humanity right now because you'll need to make your program understand human language in order to properly solve this, and this task remains largely unsolved.
– ForceBru
Jul 6 at 9:45
2
You said the input could be, but not limited to, the following. Well, what is it limited to? Maybe you can tell us where you're getting these inputs and we may be able to provide a solution that navigates around having messy inputs in the first place.
– user10987432
Jul 6 at 9:55
Question: edited Jul 6 at 20:29 by Peter Mortensen; asked Jul 6 at 9:29 by LinkCoder
2 Answers
One way to find the index of the question's first word is to search for the first word that has an actual meaning (I assume you're interested in English words). This can be done with pyenchant:
    #!/usr/bin/env python
    import enchant

    GLOSSARY = enchant.Dict("en_US")

    def isWord(word):
        # True if the token is found in the en_US dictionary
        return bool(GLOSSARY.check(word))

    sentences = [
        "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjas\n",
        "What is your\nlastname and email?\ndasf?lkjas",
        "\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]

    for sentence in sentences:
        # Report the first whitespace-separated token that is a dictionary word
        for i, w in enumerate(sentence.split()):
            if isWord(w):
                print('index: {} => {}'.format(i, w))
                break
The above piece of code gives as a result:
index: 3 => What
index: 0 => What
index: 0 => Given
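The loop above reports a word index, while slicing the original string needs a character offset. The following is only a rough sketch of how that could look. It assumes a hypothetical `KNOWN_WORDS` set as a stand-in for the pyenchant dictionary (so it runs without extra packages), and the helper names `is_word`, `question_start` and `extract_question` are made up for illustration.

```python
import re

# Hypothetical stand-in for enchant.Dict("en_US").check;
# replace is_word with the real dictionary check if pyenchant is available.
KNOWN_WORDS = {"what", "is", "your", "name", "given", "skills", "how"}


def is_word(token):
    """Return True if the token, ignoring surrounding punctuation, is a known word."""
    return token.strip("?.,!").lower() in KNOWN_WORDS


def question_start(text):
    """Return the character offset of the first recognised word, or -1 if there is none."""
    for match in re.finditer(r"\S+", text):
        if is_word(match.group()):
            return match.start()
    return -1


def extract_question(text):
    """Slice from the first recognised word up to and including the next '?'."""
    start = question_start(text)
    if start == -1:
        return None
    end = text.find("?", start)
    if end == -1:
        return None
    return text[start:end + 1]


sample = "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjas\n"
print(extract_question(sample))
```

`re.finditer` is used here instead of `split()` because each match carries its own offset, so no second search is needed to map a word back to its position in the string.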
Yes, this would be a great solution for the problem, but what if the input BEFORE the question is also a valid English word? I updated the question, by the way.
– LinkCoder
Jul 6 at 9:49
@LinkCoder Then the problem is much more complicated than the one you initially described. Maybe NLTK could be of some help in recognizing "logical" sentences, but as far as I know that's not an easy problem to solve (and I think it hadn't been solved at the time I posted this answer).
– game0ver
Jul 6 at 9:53
Alright, then assuming the input before the question is garbage, how could I find the index of the first letter of the question within the whole input? Then I can slice it.
– LinkCoder
Jul 6 at 9:59
@LinkCoder that's easy, you can use the Python built-in find() function.
– game0ver
Jul 6 at 10:04
Ok I get it now thanks for the effort that you have put into your answer :)
– LinkCoder
Jul 6 at 10:19
You could try a regular expression like \b[A-Z][a-z][^?]+\?, meaning:
- the start of a word \b with an upper case letter [A-Z] followed by a lower case letter [a-z],
- then a sequence of non-question-mark characters [^?]+,
- followed by a literal question mark \?.
This can still have some false positives or misses, e.g. if a question actually starts with an acronym, or if there is a name in the middle of the question, but for your examples it works quite well.
    >>> tests = ["eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjas\n",
    ...          "What is your\nlastname and email?\ndasf?lkjas",
    ...          "\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]
    >>> import re
    >>> p = r"\b[A-Z][a-z][^?]+\?"
    >>> [re.search(p, t).group() for t in tests]
    ['What is your name?',
     'What is your\nlastname and email?',
     'Given your skills\nhow would you rate yourself?']
If that's one blob of text, you can use findall instead of search:
    >>> text = "\n".join(tests)
    >>> re.findall(p, text)
    ['What is your name?',
     'What is your\nlastname and email?',
     'Given your skills\nhow would you rate yourself?']
Actually, this also seems to work reasonably well for questions with names in them:
    >>> t = "asdGARBAGEasd\nHow did you like St. Petersburg? more stuff with ?"
    >>> re.search(p, t).group()
    'How did you like St. Petersburg?'
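Since the question asks for indices to slice with, the match object's span can supply them. This is a minimal sketch, not a definitive implementation: `find_question` is a hypothetical helper name, and collapsing the OCR line breaks into spaces is an assumption about the desired output.

```python
import re

# Same pattern as in the answer: a capitalised word start, then everything up to the first '?'.
QUESTION_RE = re.compile(r"\b[A-Z][a-z][^?]+\?")


def find_question(text):
    """Return (start, end, question) for the first match, or None if nothing matches."""
    match = QUESTION_RE.search(text)
    if match is None:
        return None
    start, end = match.span()
    # Collapse line breaks and repeated whitespace inside the question into single spaces.
    question = " ".join(text[start:end].split())
    return start, end, question


raw = "askjdmk.\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"
print(find_question(raw))
```

Checking for `None` before calling `span()` avoids the `AttributeError` that `re.search(...).group()` raises when the input contains no match.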
Thanks for the effort that you put into your answer. I have a question though: How can I do the exact thing that you did in your answer on a single string variable?
– LinkCoder
Jul 6 at 10:19
@LinkCoder Wouldn't that be exactly what I did in the lower code with text?
– tobias_k
Jul 6 at 10:59
Yes, I figured it out. I just used re.search(regex_pattern, input, flags=re.S).group().replace("\n", " "), so thanks for that. Btw, is there a regex that doesn't miss names but does the same thing as what you did above?
– LinkCoder
Jul 6 at 11:04
Actually, I think it should work okay if there are names in the question, as long as the start of the question is the first word in the sentence starting with an upper-case letter.
– tobias_k
Jul 6 at 11:06
Ok I think I understand it thanks very much for your effort and time :)
– LinkCoder
Jul 6 at 11:11
First answer (pyenchant): edited Jul 6 at 18:50 by Peter Mortensen; answered Jul 6 at 9:42 by game0ver
edited Jul 6 at 11:10
answered Jul 6 at 9:43
tobias_k
61.3k, 9 gold badges, 73 silver badges, 116 bronze badges
Thanks for the effort that you put into your answer. I have a question though: How can I do the exact thing that you did in your answer on a single string variable?
– LinkCoder
Jul 6 at 10:19
@LinkCoder Wouldn't that be exactly what I did in the lower code with text?
– tobias_k
Jul 6 at 10:59
Yes I figured it out. I just used re.search(regex_pattern, input, flags=re.S).group().replace("\n", " ") so thanks for that. Btw is there a regex that doesn't miss names but does the same thing as what you did above?
– LinkCoder
Jul 6 at 11:04
Actually, I think it should work okay if there are names in the question, as long as the start of the question is the first word in the sentence starting with an upper-case letter.
– tobias_k
Jul 6 at 11:06
Ok I think I understand it thanks very much for your effort and time :)
– LinkCoder
Jul 6 at 11:11
I think more examples may help determine if there is any invariant property about the start of the question to hook into.
– Andrew Allen
Jul 6 at 9:45
2
I think this TODO is the TODO of humanity right now because you'll need to make your program understand human language in order to properly solve this, and this task remains largely unsolved now.
– ForceBru
Jul 6 at 9:45
2
You said the input could be, but not limited to, the following. Well, what is it limited to? Maybe you can tell us where you're getting these inputs and we may be able to provide a solution that navigates around having messy inputs in the first place.
– user10987432
Jul 6 at 9:55