Find all the numbers in one file that are not in another file in python
There are two files, say FileA and FileB, and we need to find all the numbers that are in FileA but not in FileB. The numbers in FileA are sorted and the numbers in FileB are sorted. For example,
Input:
FileA = [1, 2, 3, 4, 5, ...]
FileB = [1, 3, 4, 6, ...]
Output:
[2, 5, ...]
Memory is very limited: not even one entire file can be loaded into memory at a time. Linear or better time complexity is also required.
If the files were small enough to fit in memory, we could load both, initialize their contents as two sets, and take a set difference, solving the problem in linear time:
set(contents_of_FileA) - set(contents_of_FileB)
But since the files are so big, they cannot be loaded entirely into memory, so this is not possible.
Another approach would be brute force with batch processing: load a chunk of FileA, then compare it against FileB chunk by chunk; once the FileA chunk has been checked against all of FileB, load the next chunk of FileA, and so on. But this is O(n^2), or quadratic, time complexity, which is not efficient for very large files with many entries.
The problem needs to be solved in linear or better time complexity, without loading the entire files into memory. Any help?
python arrays file out-of-memory
Are the files already in sorted order?
– Chris Doyle
Jul 31 at 9:24
Is there a possibility to load at least one file fully into memory and process the other file in batches?
– venkata krishnan
Jul 31 at 9:24
@ChrisDoyle The files are sorted separately.
– Arjun Muraleedharan
Jul 31 at 9:30
Are all numbers in file B guaranteed to be in file A? If they are, then you can just read them in parallel and continue the read on file B until you get to the next number that's in file A.
– Sayse
Jul 31 at 9:31
Excellent, so they are in sorted order when you come to process them? Do you need to list numbers that are missing in file A from file B and numbers missing from file B in file A?
– Chris Doyle
Jul 31 at 9:31
asked Jul 31 at 9:21 by Arjun Muraleedharan
5 Answers
If you want to read the files line by line, since you don't have much memory and you need a linear solution, you can do this with iter, provided your files are line-based:
First in your terminal you can do this to generate some test files:
seq 0 3 100 > 3k.txt
seq 0 2 100 > 2k.txt
Then you run this code:
i1 = iter(open("3k.txt"))
i2 = iter(open("2k.txt"))
a = int(next(i1))
b = int(next(i2))
aNotB = []
# bNotA = []
while True:
    try:
        if a < b:
            aNotB += [a]
            a = int(next(i1, None))
        elif a > b:
            # bNotA += [b]
            b = int(next(i2, None))
        elif a == b:
            a = int(next(i1, None))
            b = int(next(i2, None))
    except TypeError:
        if not b:
            aNotB += list(i1)
            break
        else:
            # bNotA += list(i2)
            break
print(aNotB)
Output:
[3, 9, 15, 21, 27, 33, 39, 45, 51, 57, 63, 69, 75, 81, 87, 93, 99]
If you want both the result for aNotB and bNotA you can uncomment those two lines.
Timing comparison with Andrej Kesely's answer:
$ seq 0 3 1000000 > 3k.txt
$ seq 0 2 1000000 > 2k.txt
$ time python manual_iter.py
python manual_iter.py 0.38s user 0.00s system 99% cpu 0.387 total
$ time python heapq_groupby.py
python heapq_groupby.py 1.11s user 0.00s system 99% cpu 1.116 total
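One caveat with the snippet above: aNotB accumulates every result in a list, which can itself exceed the memory budget if the difference is large. A generator variant keeps everything streaming end to end; this is a sketch, not part of the original answer, reusing the 3k.txt/2k.txt test files generated above:

```python
def diff_sorted(path_a, path_b):
    """Yield the ints that appear in path_a but not in path_b.
    Assumes one integer per line and both files sorted ascending."""
    with open(path_a) as fa, open(path_b) as fb:
        it_b = (int(line) for line in fb)
        b = next(it_b, None)
        for line in fa:
            a = int(line)
            while b is not None and b < a:
                b = next(it_b, None)   # skip B values below a
            if b is None or a < b:
                yield a                # no match for a in B

# Streaming usage (with the files generated by seq above):
#   for x in diff_sorted("3k.txt", "2k.txt"):
#       print(x)
```

Because results are yielded as they are found, memory stays O(1) regardless of how large the difference is.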
I guess this is one way to do it. Basically, you are using two pointers and iterating over the arrays once, so it is O(n), and since the input is sorted this will work. But aren't we still loading the whole file?
– Arjun Muraleedharan
Jul 31 at 10:49
@ArjunMuraleedharan No, we are loading the files line by line; iter lazily loads the file.
– yukashima huksay
Jul 31 at 10:51
@ArjunMuraleedharan No, we aren't loading the whole file, that's the point of this algorithm.
– Andrej Kesely
Jul 31 at 10:51
This should be the accepted answer - it's fastest :)
– Andrej Kesely
Jul 31 at 11:02
@ArjunMuraleedharan I think you should post a whole new question, because that's really a different issue, which I believe is impossible to solve in O(n) with your memory limits.
– yukashima huksay
Jul 31 at 12:06
As the files are sorted, you can iterate through them one line at a time. If the line of file A is less than the line of file B, you know that A's line is not in B, so you increment file A only and check again. If the line in A is greater than the line in B, you know that B's line is not in A, so you increment file B only. If A and B are equal, the line is in both, so you increment both files. While your original question asked only for entries that are in A but not B, this answer extends that and also gives entries in B but not A. That adds flexibility but still allows you to print just those in A but not B.
def strip_read(file):
    return file.readline().rstrip()

in_a_not_b = []
in_b_not_a = []
with open("fileA") as A:
    with open("fileB") as B:
        Aline = strip_read(A)
        Bline = strip_read(B)
        while Aline or Bline:
            if Aline < Bline and Aline:
                in_a_not_b.append(Aline)
                Aline = strip_read(A)
            elif Aline > Bline and Bline:
                in_b_not_a.append(Bline)
                Bline = strip_read(B)
            else:
                Aline = strip_read(A)
                Bline = strip_read(B)

print("in A not in B", in_a_not_b, "\nin B not in A", in_b_not_a)
Output for my sample files:
in A not in B ['2', '5', '7']
in B not in A ['6']
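Note that this version compares the lines as strings, which matches numeric order only while the numbers have the same digit count ("9" > "10" lexicographically). A sketch of the same two-file walk comparing integers instead (my own variant, not from the answer; it wraps the logic in a function and uses the same fileA/fileB naming):

```python
def read_int(f):
    """Read one line as an int, or None at end of file."""
    line = f.readline().rstrip()
    return int(line) if line else None

def diff_both(path_a, path_b):
    """Walk two sorted number-per-line files in lockstep.
    Returns (in_A_not_B, in_B_not_A) as lists of ints."""
    in_a_not_b, in_b_not_a = [], []
    with open(path_a) as A, open(path_b) as B:
        a, b = read_int(A), read_int(B)
        while a is not None or b is not None:
            if b is None or (a is not None and a < b):
                in_a_not_b.append(a)           # a has no match in B
                a = read_int(A)
            elif a is None or b < a:
                in_b_not_a.append(b)           # b has no match in A
                b = read_int(B)
            else:                               # a == b: in both files
                a, b = read_int(A), read_int(B)
    return in_a_not_b, in_b_not_a
```

Using None as the end-of-file sentinel also makes the tail handling explicit: whichever file runs out first, the remainder of the other file is drained into the right list.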
How is this in any way a new answer? There are already two answers doing exactly this.
– yukashima huksay
Jul 31 at 10:07
I only see one answer doing this by reading the files line by line, and it makes the assumption that it's only interested in entries which are in A but not in B. This code will loop through files of varying length and actually list items which are in A and not in B, and also items which are in B but not in A.
– Chris Doyle
Jul 31 at 10:08
For example, in Mattias's case it will stop comparing when it reaches the end of either file, even if the other file still has entries in it, which means the entries which have not been read yet won't be listed as not being in the other file.
– Chris Doyle
Jul 31 at 10:09
Thanks for pointing out my mistake, I have now corrected my answer.
– Mattias
Jul 31 at 10:16
It was certainly a nice question and it's good to see it inspiring minds.
– Chris Doyle
Jul 31 at 10:18
You can combine itertools.groupby and heapq.merge to iterate through FileA and FileB lazily (this works as long as the files are sorted!):
FileA = [1, 1, 2, 3, 4, 5]
FileB = [1, 3, 4, 6]

from itertools import groupby
from heapq import merge

gen_a = ((v, 'FileA') for v in FileA)
gen_b = ((v, 'FileB') for v in FileB)

for v, g in groupby(merge(gen_a, gen_b, key=lambda k: int(k[0])), lambda k: int(k[0])):
    if any(v[1] == 'FileB' for v in g):
        continue
    print(v)
Prints:
2
5
EDIT (Reading from files):
from itertools import groupby
from heapq import merge

gen_a = ((int(v.strip()), 1) for v in open('3k.txt'))
gen_b = ((int(v.strip()), 2) for v in open('2k.txt'))

for v, g in groupby(merge(gen_a, gen_b, key=lambda k: k[0]), lambda k: k[0]):
    if any(v[1] == 2 for v in g):
        continue
    print(v)
Benchmark:
Generating files with 10_000_000 items:
seq 0 3 10000000 > 3k.txt
seq 0 2 10000000 > 2k.txt
The script takes ~10sec to complete:
real 0m10,656s
user 0m10,557s
sys 0m0,076s
@AndrejKesely Thank you so much, man. But one thing I don't understand: why is it taking so much time for very big inputs if this is also linear time complexity?
– Arjun Muraleedharan
Jul 31 at 11:09
@ArjunMuraleedharan Yes, the complexity is O(n). But in my answer I create a tuple for each element, for example (int(v.strip()), 1). Each tuple is an object, and creating each object takes a little time. For long files, this time adds up.
– Andrej Kesely
Jul 31 at 11:11
@AndrejKesely Yes, that makes sense. But even when I try it with some sample array, it takes more time than the other answer. Any idea?
– Arjun Muraleedharan
Jul 31 at 11:56
A simple solution based on file reading (assuming that each line holds a number):
results = []
with open('file1.csv') as file1, open('file2.csv') as file2:
    var1 = file1.readline()
    var2 = file2.readline()
    while var1:
        while var1 and var2:
            if int(var1) < int(var2):
                results.append(int(var1))
                var1 = file1.readline()
            elif int(var1) > int(var2):
                var2 = file2.readline()
            elif int(var1) == int(var2):
                var1 = file1.readline()
                var2 = file2.readline()
        if var1:
            results.append(int(var1))
            var1 = file1.readline()

print(results)
Output: [2, 5, 7, 9]
You are missing the calls to int before the comparisons. Otherwise you get problems like "9" > "10". Also, you can open multiple files in one with statement.
– Graipher
Jul 31 at 18:37
Ok, I've fixed my code to handle that case as well. Thanks for the input! @Graipher
– Mattias
Aug 1 at 6:04
This is similar to classic Knuth sorting and searching.
You may wish to consider reading this Stack Overflow question, these online lecture notes (PDF), and Wikipedia.
The Stack Overflow question mentions something I agree with, which is using the unix sort command. Always, always test with your own data to ensure that the chosen method is the most efficient for your data, because some of these algorithms are data-dependent.
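For the unix-command route, the set difference of two sorted files is exactly what comm computes. A sketch of driving it from Python (an assumption of this example: GNU coreutils are installed, and both files are sorted in the byte order comm expects, e.g. by plain `sort` rather than `sort -n`):

```python
import subprocess

def comm_diff(path_a, path_b):
    """Return the lines unique to path_a, using `comm -23`:
    -2 suppresses lines unique to path_b, -3 suppresses lines
    common to both, leaving only lines unique to path_a.
    Both inputs must be sorted lexicographically (zero-pad
    numbers first if they were sorted numerically)."""
    out = subprocess.run(
        ["comm", "-23", path_a, path_b],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.split()
```

Like the Python merge-based answers, sort and comm stream through the files with bounded memory; sort even spills to temporary files on disk when inputs exceed RAM, which is why the classic pipeline scales out of core.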
If you want to read the files line by line since you don't have so much memory and you need a linear solution you can do this with iter if your files are line based, otherwise see this:
First in your terminal you can do this to generate some test files:
seq 0 3 100 > 3k.txt
seq 0 2 100 > 2k.txt
Then you run this code:
i1 = iter(open("3k.txt"))
i2 = iter(open("2k.txt"))
a = int(next(i1))
b = int(next(i2))
aNotB = []
# bNotA = []
while True:
try:
if a < b:
aNotB += [a]
a = int(next(i1, None))
elif a > b:
# bNotA += [a]
b = int(next(i2, None))
elif a == b:
a = int(next(i1, None))
b = int(next(i2, None))
except TypeError:
if not b:
aNotB += list(i1)
break
else:
# bNotA += list(i1)
break
print(aNotB)
Output:
[3, 9, 15, 21, 27, 33, 39, 45, 51, 57, 63, 69, 75, 81, 87, 93, 99]
If you want both the result for aNotB and bNotA you can uncomment those two lines.
Timing comparison with Andrej Kesely's answer:
$ seq 0 3 1000000 > 3k.txt
$ seq 0 2 1000000 > 2k.txt
$ time python manual_iter.py
python manual_iter.py 0.38s user 0.00s system 99% cpu 0.387 total
$ time python heapq_groupby.py
python heapq_groupby.py 1.11s user 0.00s system 99% cpu 1.116 total
I guess this is one way to do it. Basically, you are using two-pointers and iterating over the arrays once so it is in O(n) and since it is sorted I guess this will work. But we are still loading the whole file, isn't it?
– Arjun Muraleedharan
Jul 31 at 10:49
1
@ArjunMuraleedharan no we are loading the files line by line. Iter lazy loads the file.
– yukashima huksay
Jul 31 at 10:51
@ArjunMuraleedharan No, we aren't loading the whole file, that's the point of this algorithm
– Andrej Kesely
Jul 31 at 10:51
1
This should be the accepted answer - it's fastest :)
– Andrej Kesely
Jul 31 at 11:02
1
@ArjunMuraleedharan I think you should post a whole new question because that's really a whole different issue which I believe is impossible to solve in O(n) with your memory limits.
– yukashima huksay
Jul 31 at 12:06
|
show 3 more comments
If you want to read the files line by line since you don't have so much memory and you need a linear solution you can do this with iter if your files are line based, otherwise see this:
First in your terminal you can do this to generate some test files:
seq 0 3 100 > 3k.txt
seq 0 2 100 > 2k.txt
Then you run this code:
i1 = iter(open("3k.txt"))
i2 = iter(open("2k.txt"))
a = int(next(i1))
b = int(next(i2))
aNotB = []
# bNotA = []
while True:
try:
if a < b:
aNotB += [a]
a = int(next(i1, None))
elif a > b:
# bNotA += [a]
b = int(next(i2, None))
elif a == b:
a = int(next(i1, None))
b = int(next(i2, None))
except TypeError:
if not b:
aNotB += list(i1)
break
else:
# bNotA += list(i1)
break
print(aNotB)
Output:
[3, 9, 15, 21, 27, 33, 39, 45, 51, 57, 63, 69, 75, 81, 87, 93, 99]
If you want both the result for aNotB and bNotA you can uncomment those two lines.
Timing comparison with Andrej Kesely's answer:
$ seq 0 3 1000000 > 3k.txt
$ seq 0 2 1000000 > 2k.txt
$ time python manual_iter.py
python manual_iter.py 0.38s user 0.00s system 99% cpu 0.387 total
$ time python heapq_groupby.py
python heapq_groupby.py 1.11s user 0.00s system 99% cpu 1.116 total
I guess this is one way to do it. Basically, you are using two-pointers and iterating over the arrays once so it is in O(n) and since it is sorted I guess this will work. But we are still loading the whole file, isn't it?
– Arjun Muraleedharan
Jul 31 at 10:49
1
@ArjunMuraleedharan no we are loading the files line by line. Iter lazy loads the file.
– yukashima huksay
Jul 31 at 10:51
@ArjunMuraleedharan No, we aren't loading the whole file, that's the point of this algorithm
– Andrej Kesely
Jul 31 at 10:51
1
This should be the accepted answer - it's fastest :)
– Andrej Kesely
Jul 31 at 11:02
1
@ArjunMuraleedharan I think you should post a whole new question because that's really a whole different issue which I believe is impossible to solve in O(n) with your memory limits.
– yukashima huksay
Jul 31 at 12:06
|
show 3 more comments
If you want to read the files line by line since you don't have so much memory and you need a linear solution you can do this with iter if your files are line based, otherwise see this:
First in your terminal you can do this to generate some test files:
seq 0 3 100 > 3k.txt
seq 0 2 100 > 2k.txt
Then you run this code:
i1 = iter(open("3k.txt"))
i2 = iter(open("2k.txt"))
a = int(next(i1))
b = int(next(i2))
aNotB = []
# bNotA = []
while True:
try:
if a < b:
aNotB += [a]
a = int(next(i1, None))
elif a > b:
# bNotA += [a]
b = int(next(i2, None))
elif a == b:
a = int(next(i1, None))
b = int(next(i2, None))
except TypeError:
if not b:
aNotB += list(i1)
break
else:
# bNotA += list(i1)
break
print(aNotB)
Output:
[3, 9, 15, 21, 27, 33, 39, 45, 51, 57, 63, 69, 75, 81, 87, 93, 99]
If you want both the result for aNotB and bNotA you can uncomment those two lines.
Timing comparison with Andrej Kesely's answer:
$ seq 0 3 1000000 > 3k.txt
$ seq 0 2 1000000 > 2k.txt
$ time python manual_iter.py
python manual_iter.py 0.38s user 0.00s system 99% cpu 0.387 total
$ time python heapq_groupby.py
python heapq_groupby.py 1.11s user 0.00s system 99% cpu 1.116 total
If you want to read the files line by line since you don't have so much memory and you need a linear solution you can do this with iter if your files are line based, otherwise see this:
First in your terminal you can do this to generate some test files:
seq 0 3 100 > 3k.txt
seq 0 2 100 > 2k.txt
Then you run this code:
i1 = iter(open("3k.txt"))
i2 = iter(open("2k.txt"))
a = int(next(i1))
b = int(next(i2))
aNotB = []
# bNotA = []
while True:
try:
if a < b:
aNotB += [a]
a = int(next(i1, None))
elif a > b:
# bNotA += [a]
b = int(next(i2, None))
elif a == b:
a = int(next(i1, None))
b = int(next(i2, None))
except TypeError:
if not b:
aNotB += list(i1)
break
else:
# bNotA += list(i1)
break
print(aNotB)
Output:
[3, 9, 15, 21, 27, 33, 39, 45, 51, 57, 63, 69, 75, 81, 87, 93, 99]
If you want both the result for aNotB and bNotA you can uncomment those two lines.
Timing comparison with Andrej Kesely's answer:
$ seq 0 3 1000000 > 3k.txt
$ seq 0 2 1000000 > 2k.txt
$ time python manual_iter.py
python manual_iter.py 0.38s user 0.00s system 99% cpu 0.387 total
$ time python heapq_groupby.py
python heapq_groupby.py 1.11s user 0.00s system 99% cpu 1.116 total
edited Aug 7 at 10:50
answered Jul 31 at 9:37
yukashima huksayyukashima huksay
1,1142 gold badges14 silver badges38 bronze badges
1,1142 gold badges14 silver badges38 bronze badges
I guess this is one way to do it. Basically, you are using two-pointers and iterating over the arrays once so it is in O(n) and since it is sorted I guess this will work. But we are still loading the whole file, isn't it?
– Arjun Muraleedharan
Jul 31 at 10:49
1
@ArjunMuraleedharan no we are loading the files line by line. Iter lazy loads the file.
– yukashima huksay
Jul 31 at 10:51
@ArjunMuraleedharan No, we aren't loading the whole file, that's the point of this algorithm
– Andrej Kesely
Jul 31 at 10:51
1
This should be the accepted answer - it's fastest :)
– Andrej Kesely
Jul 31 at 11:02
1
@ArjunMuraleedharan I think you should post a whole new question because that's really a whole different issue which I believe is impossible to solve in O(n) with your memory limits.
– yukashima huksay
Jul 31 at 12:06
|
show 3 more comments
I guess this is one way to do it. Basically, you are using two-pointers and iterating over the arrays once so it is in O(n) and since it is sorted I guess this will work. But we are still loading the whole file, isn't it?
– Arjun Muraleedharan
Jul 31 at 10:49
1
@ArjunMuraleedharan no we are loading the files line by line. Iter lazy loads the file.
– yukashima huksay
Jul 31 at 10:51
@ArjunMuraleedharan No, we aren't loading the whole file, that's the point of this algorithm
– Andrej Kesely
Jul 31 at 10:51
1
This should be the accepted answer - it's fastest :)
– Andrej Kesely
Jul 31 at 11:02
1
@ArjunMuraleedharan I think you should post a whole new question because that's really a whole different issue which I believe is impossible to solve in O(n) with your memory limits.
– yukashima huksay
Jul 31 at 12:06
I guess this is one way to do it. Basically, you are using two-pointers and iterating over the arrays once so it is in O(n) and since it is sorted I guess this will work. But we are still loading the whole file, isn't it?
– Arjun Muraleedharan
Jul 31 at 10:49
I guess this is one way to do it. Basically, you are using two-pointers and iterating over the arrays once so it is in O(n) and since it is sorted I guess this will work. But we are still loading the whole file, isn't it?
– Arjun Muraleedharan
Jul 31 at 10:49
1
1
@ArjunMuraleedharan no we are loading the files line by line. Iter lazy loads the file.
– yukashima huksay
Jul 31 at 10:51
@ArjunMuraleedharan no we are loading the files line by line. Iter lazy loads the file.
– yukashima huksay
Jul 31 at 10:51
@ArjunMuraleedharan No, we aren't loading the whole file, that's the point of this algorithm
– Andrej Kesely
Jul 31 at 10:51
@ArjunMuraleedharan No, we aren't loading the whole file, that's the point of this algorithm
– Andrej Kesely
Jul 31 at 10:51
1
1
This should be the accepted answer - it's fastest :)
– Andrej Kesely
Jul 31 at 11:02
This should be the accepted answer - it's fastest :)
– Andrej Kesely
Jul 31 at 11:02
1
1
@ArjunMuraleedharan I think you should post a whole new question because that's really a whole different issue which I believe is impossible to solve in O(n) with your memory limits.
– yukashima huksay
Jul 31 at 12:06
@ArjunMuraleedharan I think you should post a whole new question because that's really a whole different issue which I believe is impossible to solve in O(n) with your memory limits.
– yukashima huksay
Jul 31 at 12:06
|
show 3 more comments
As files are sorted you can just iterate through each line at a time, if the line of file A is less than the line of file B then you know that A is not in B so you then increment file A only and then check again. If the line in A is greater than the line in B then you know that B is not in A so you increment file B only. If A and B are equal then you know line is in both so increment both files. while in your original question you stated you were interested in entries which are in A but not B, this answer will extend that and also give entries in B not A. This extends the flexability but still allows you so print just those in A not B.
def strip_read(file):
    return file.readline().rstrip()

in_a_not_b = []
in_b_not_a = []

with open("fileA") as A:
    with open("fileB") as B:
        Aline = strip_read(A)
        Bline = strip_read(B)
        while Aline or Bline:
            # an exhausted file reads as "", so guard on non-empty lines
            if Aline and (not Bline or Aline < Bline):
                in_a_not_b.append(Aline)
                Aline = strip_read(A)
            elif Bline and (not Aline or Bline < Aline):
                in_b_not_a.append(Bline)
                Bline = strip_read(B)
            else:
                Aline = strip_read(A)
                Bline = strip_read(B)

print("in A not in B", in_a_not_b, "\nin B not in A", in_b_not_a)
Output for my sample files:
in A not in B ['2', '5', '7']
in B not in A ['6']
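Note that the code above compares lines as strings, which only orders correctly while all numbers have the same digit count ("10" sorts before "9" lexicographically). A sketch of the same merge walk done on integers, with io.StringIO standing in for the real files and a hypothetical read_int helper (my own names, not from the answer):

```python
import io

def read_int(f):
    """Return the next integer from stream f, or None at end of input."""
    line = f.readline()
    return int(line) if line.strip() else None

def diff_sorted(file_a, file_b):
    """Merge-walk two sorted one-number-per-line streams."""
    in_a_not_b, in_b_not_a = [], []
    a, b = read_int(file_a), read_int(file_b)
    while a is not None or b is not None:
        if b is None or (a is not None and a < b):
            in_a_not_b.append(a)           # a cannot appear in B any more
            a = read_int(file_a)
        elif a is None or b < a:
            in_b_not_a.append(b)           # b cannot appear in A any more
            b = read_int(file_b)
        else:                              # a == b: in both, advance both
            a, b = read_int(file_a), read_int(file_b)
    return in_a_not_b, in_b_not_a

# io.StringIO stands in for open("fileA") / open("fileB")
only_a, only_b = diff_sorted(io.StringIO("1\n2\n3\n5\n7\n"),
                             io.StringIO("1\n3\n4\n6\n"))
print(only_a)  # [2, 5, 7]
print(only_b)  # [4, 6]
```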
How is this in any way a new answer? there are already two answers exactly doing this.
– yukashima huksay
Jul 31 at 10:07
1
I only see one answer doing this by reading the files line by line, and it assumes you are only interested in entries which are in A but not in B. This code will loop through files of varying length and actually list items which are in A and not in B, and also list items which are in B but not in A.
– Chris Doyle
Jul 31 at 10:08
1
For example, in Mattias' case it will stop comparing when it reaches the end of either file, even if the other file still has entries in it, which means those entries which have not been read yet won't be listed as not being in the other file.
– Chris Doyle
Jul 31 at 10:09
1
Thanks for pointing out my mistake, I have now corrected my answer.
– Mattias
Jul 31 at 10:16
2
It was certainly a nice question and it's good to see it inspiring minds
– Chris Doyle
Jul 31 at 10:18
edited Jul 31 at 10:17
answered Jul 31 at 10:01
Chris Doyle
1,605 reputation • 1 gold badge • 12 silver badges • 25 bronze badges
You can combine itertools.groupby (doc) and heapq.merge (doc) to iterate through FileA and FileB lazily (it works as long as the files are sorted!)
FileA = [1, 1, 2, 3, 4, 5]
FileB = [1, 3, 4, 6]
from itertools import groupby
from heapq import merge

gen_a = ((v, 'FileA') for v in FileA)
gen_b = ((v, 'FileB') for v in FileB)

for v, g in groupby(merge(gen_a, gen_b, key=lambda k: int(k[0])), lambda k: int(k[0])):
    if any(v[1] == 'FileB' for v in g):
        continue
    print(v)
Prints:
2
5
EDIT (Reading from files):
from itertools import groupby
from heapq import merge

gen_a = ((int(v.strip()), 1) for v in open('3k.txt'))
gen_b = ((int(v.strip()), 2) for v in open('2k.txt'))

for v, g in groupby(merge(gen_a, gen_b, key=lambda k: k[0]), lambda k: k[0]):
    if any(v[1] == 2 for v in g):
        continue
    print(v)
Benchmark:
Generating files with 10_000_000 items:
seq 0 3 10000000 > 3k.txt
seq 0 2 10000000 > 2k.txt
The script takes ~10sec to complete:
real 0m10,656s
user 0m10,557s
sys 0m0,076s
@AndrejKesely Thank you so much, man. But one thing I don't understand is why it's taking so much time for very, very big numbers if this is also linear time complexity?
– Arjun Muraleedharan
Jul 31 at 11:09
@ArjunMuraleedharan Yes, the complexity is O(n). But in my answer I create lots of tuples, one for each element, for example (int(v.strip()), 1).
Each tuple is an object, and the creation of this object takes some "little" time. For long files, this time adds up.
– Andrej Kesely
Jul 31 at 11:11
@AndrejKesely yes that makes sense. But even when I try it with some sample array, it is taking more time than the other answer. Any idea?
– Arjun Muraleedharan
Jul 31 at 11:56
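One way to avoid the per-element tuples discussed above is to merge-walk two integer generators directly. This is my own sketch, not code from the answer, with io.StringIO standing in for the real files:

```python
import io

def in_a_not_b(lines_a, lines_b):
    """Yield integers present in sorted stream A but absent from sorted
    stream B, without allocating a wrapper tuple per element."""
    gen_a = (int(s) for s in lines_a)
    gen_b = (int(s) for s in lines_b)
    sentinel = float("inf")            # compares greater than any real entry
    b = next(gen_b, sentinel)
    for a in gen_a:
        while b < a:                   # discard B entries smaller than a
            b = next(gen_b, sentinel)
        if a != b:                     # a never matched anything in B
            yield a

# io.StringIO stands in for open('3k.txt') / open('2k.txt')
result = list(in_a_not_b(io.StringIO("1\n1\n2\n3\n4\n5\n"),
                         io.StringIO("1\n3\n4\n6\n")))
print(result)  # [2, 5]
```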
edited Jul 31 at 10:47
answered Jul 31 at 9:36
Andrej Kesely
18k reputation • 2 gold badges • 10 silver badges • 34 bronze badges
A simple solution based on file reading (assuming that each line holds a number):
results = []
with open('file1.csv') as file1, open('file2.csv') as file2:
    var1 = file1.readline()
    var2 = file2.readline()
    while var1:
        while var1 and var2:
            if int(var1) < int(var2):
                results.append(int(var1))
                var1 = file1.readline()
            elif int(var1) > int(var2):
                var2 = file2.readline()
            elif int(var1) == int(var2):
                var1 = file1.readline()
                var2 = file2.readline()
        if var1:
            results.append(int(var1))
            var1 = file1.readline()
print(results)
output = [2, 5, 7, 9]
You are missing the calls to int before the comparisons. Otherwise you get problems like "9" > "10". Also, you can open multiple files in one with statement.
– Graipher
Jul 31 at 18:37
Ok, I've fixed my code to handle that case as well. Thanks for the input! @Graipher
– Mattias
Aug 1 at 6:04
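The pitfall Graipher describes is easy to reproduce: Python compares strings lexicographically, character by character, so numeric order only falls out when all numbers have the same digit count.

```python
# Strings compare lexicographically, character by character,
# so "9" sorts after "10" even though 9 < 10 numerically.
print("9" > "10")                               # True

as_strings = sorted(["9", "10", "2"])           # ['10', '2', '9']
as_numbers = sorted(["9", "10", "2"], key=int)  # ['2', '9', '10']
```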
edited Aug 1 at 6:03
answered Jul 31 at 9:50
Mattias
346 reputation • 1 silver badge • 11 bronze badges
This is similar to the classic Knuth sorting and searching problem.
You may wish to consider reading this Stack question,
these on-line lecture notes (PDF), and Wikipedia.
The Stack question mentions something that I agree with, which is using the Unix sort command. Always, always test with your own data to ensure the method chosen is the most efficient for your data, because some of these algorithms are data-dependent.
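For files that are already sorted, the Unix toolchain can do this whole job directly: comm -23 prints the lines unique to its first input. A sketch with illustrative file names; both inputs must be in the lexicographic order comm expects (run them through sort first if unsure):

```shell
# Lines unique to fileA, given both files sorted in comm's
# (lexicographic) order; the file names here are illustrative.
printf '1\n2\n3\n5\n7\n' > fileA.sorted
printf '1\n3\n4\n6\n'    > fileB.sorted
comm -23 fileA.sorted fileB.sorted    # prints 2, 5, 7
```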
answered Aug 3 at 23:26
punchcard
329 reputation • 1 silver badge • 5 bronze badges
1
Are the files already in sorted order?
– Chris Doyle
Jul 31 at 9:24
Is there a possibility to load at least one file fully into memory and process the other file in batches?
– venkata krishnan
Jul 31 at 9:24
@ChrisDoyle The files are sorted separately
– Arjun Muraleedharan
Jul 31 at 9:30
2
Are all numbers in file B guaranteed to be in file A? If they are, then you can just read them in parallel and continue the read on file B until you get to the next number that's in file A
– Sayse
Jul 31 at 9:31
1
Excellent, so they are in sorted order when you come to process them? Do you need to list numbers from file B that are missing in file A, and numbers from file A that are missing in file B?
– Chris Doyle
Jul 31 at 9:31