Find all the numbers in one file that are not in another file in Python




14















There are two files, say FileA and FileB, and we need to find all the numbers that are in FileA but not in FileB. The numbers in FileA are sorted, and the numbers in FileB are sorted. For example,



Input:



FileA = [1, 2, 3, 4, 5, ...]
FileB = [1, 3, 4, 6, ...]


Output:



[2, 5, ...]


Memory is very limited: not even one entire file can be loaded into memory at a time. Linear or better time complexity is also required.



So if the files were small enough to fit in memory, we could load them, initialize their contents as two sets, and take a set difference, solving the problem in O(n) linear time (building the sets and taking their difference are both linear on average):



set(contents_of_FileA) - set(contents_of_FileB)


But since the files are so big, they cannot be loaded entirely into memory, so this approach is not possible.



Another approach would be brute force with batch processing: load a chunk of data from FileA, then a chunk from FileB, compare them, then load the next chunk from FileB, and so on. After the FileA chunk has been checked against all of FileB, load the next chunk from FileA and repeat. Because every chunk of FileA is compared against all of FileB, this is O(n^2) quadratic time, which is not efficient for very large files.



The problem needs to be solved in linear or better time complexity and without loading the entire files into memory. Any help?
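Since both files are sorted, the quadratic re-scanning above can be avoided with a single merge-style co-scan: advance whichever side currently holds the smaller value, and emit a value from FileA whenever FileB has moved past it without matching. A minimal sketch of the idea (the in-memory iterators are illustrative stand-ins for lazily read files):

```python
def sorted_difference(iter_a, iter_b):
    """Yield values of sorted stream A that do not appear in sorted stream B."""
    b = next(iter_b, None)
    for a in iter_a:
        # advance B until it catches up with (or passes) a
        while b is not None and b < a:
            b = next(iter_b, None)
        if b is None or a < b:  # B skipped past a, so a is not in B
            yield a

# stand-ins for the two sorted files
print(list(sorted_difference(iter([1, 2, 3, 4, 5]), iter([1, 3, 4, 6]))))  # [2, 5]
```

Each stream is consumed once, so the scan is O(n + m) in time and O(1) in extra memory.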





























  • 1





    Are the files already in sorted order?

    – Chris Doyle
    Jul 31 at 9:24











  • Is there a possibility to load at least one file fully into memory and process the other file in batches?

    – venkata krishnan
    Jul 31 at 9:24











  • @ChrisDoyle The files are sorted separately

    – Arjun Muraleedharan
    Jul 31 at 9:30






  • 2





    Are all numbers in file B guaranteed to be in file A? If they are, then you can just read them in parallel and continue the read on file B until you get to the next number that's in file A

    – Sayse
    Jul 31 at 9:31






  • 1





    Excellent, so they are in sorted order when you come to process them? Do you need to list numbers that are missing in file A from file B, and numbers missing from file B in file A?

    – Chris Doyle
    Jul 31 at 9:31

















python arrays file out-of-memory






asked Jul 31 at 9:21









Arjun Muraleedharan

766 bronze badges





5 Answers


















11














Since you don't have much memory and need a linear solution, you can read the files line by line with a pair of iterators, as long as your files are line-based:



First in your terminal you can do this to generate some test files:



seq 0 3 100 > 3k.txt
seq 0 2 100 > 2k.txt


Then you run this code:



def next_int(it):
    # next line as an int, or None once that file is exhausted
    line = next(it, None)
    return int(line) if line is not None else None

i1 = iter(open("3k.txt"))
i2 = iter(open("2k.txt"))
a = next_int(i1)
b = next_int(i2)
aNotB = []
# bNotA = []
while a is not None and b is not None:
    if a < b:
        aNotB.append(a)
        a = next_int(i1)
    elif a > b:
        # bNotA.append(b)
        b = next_int(i2)
    else:  # a == b
        a = next_int(i1)
        b = next_int(i2)
if b is None and a is not None:
    # file B ran out first: everything left in file A is missing from B
    aNotB.append(a)
    aNotB.extend(int(line) for line in i1)
print(aNotB)


Output:




[3, 9, 15, 21, 27, 33, 39, 45, 51, 57, 63, 69, 75, 81, 87, 93, 99]

If you want the result for bNotA as well, you can uncomment the bNotA lines.




Timing comparison with Andrej Kesely's answer:



$ seq 0 3 1000000 > 3k.txt
$ seq 0 2 1000000 > 2k.txt
$ time python manual_iter.py
python manual_iter.py 0.38s user 0.00s system 99% cpu 0.387 total
$ time python heapq_groupby.py
python heapq_groupby.py 1.11s user 0.00s system 99% cpu 1.116 total
































  • I guess this is one way to do it. Basically, you are using two pointers and iterating over the arrays once, so it is O(n), and since the input is sorted this will work. But we are still loading the whole file, aren't we?

    – Arjun Muraleedharan
    Jul 31 at 10:49






  • 1





    @ArjunMuraleedharan No, we are loading the files line by line; iterating over a file object lazily loads it.

    – yukashima huksay
    Jul 31 at 10:51











  • @ArjunMuraleedharan No, we aren't loading the whole file, that's the point of this algorithm

    – Andrej Kesely
    Jul 31 at 10:51






  • 1





    This should be the accepted answer - it's fastest :)

    – Andrej Kesely
    Jul 31 at 11:02






  • 1





    @ArjunMuraleedharan I think you should post a whole new question because that's really a whole different issue which I believe is impossible to solve in O(n) with your memory limits.

    – yukashima huksay
    Jul 31 at 12:06


















7














As the files are sorted, you can iterate through them one line at a time. If the current line of file A is less than the line of file B, you know that A's line is not in B, so record it and advance file A only, then compare again. If the line in A is greater than the line in B, then B's line is not in A, so advance file B only. If A and B are equal, the line is in both, so advance both files. While your original question said you were interested in entries which are in A but not B, this answer extends that and also gives entries in B but not A. That adds flexibility but still lets you print just those in A but not B.



def strip_read(file):
    return file.readline().rstrip()

in_a_not_b = []
in_b_not_a = []
with open("fileA") as A, open("fileB") as B:
    Aline = strip_read(A)
    Bline = strip_read(B)
    while Aline or Bline:
        # an empty string means that file is exhausted
        if Aline and (not Bline or int(Aline) < int(Bline)):
            in_a_not_b.append(Aline)
            Aline = strip_read(A)
        elif Bline and (not Aline or int(Aline) > int(Bline)):
            in_b_not_a.append(Bline)
            Bline = strip_read(B)
        else:
            Aline = strip_read(A)
            Bline = strip_read(B)

print("in A not in B", in_a_not_b, "\nin B not in A", in_b_not_a)


OUTPUT for my sample Files



in A not in B ['2', '5', '7'] 
in B not in A ['6']
































  • How is this in any way a new answer? there are already two answers exactly doing this.

    – yukashima huksay
    Jul 31 at 10:07






  • 1





    I only see one answer doing this by reading the files line by line, and it makes the assumption that it's only interested in entries which are in A but not in B. This code will loop through files of varying length and actually list items which are in A and not in B, and also items which are in B but not in A

    – Chris Doyle
    Jul 31 at 10:08






  • 1





    For example, in Mattias's case it will stop comparing when it reaches the end of either file, even if the other file still has entries in it, which means those entries which have not been read yet won't be listed as not being in the other file

    – Chris Doyle
    Jul 31 at 10:09






  • 1





    Thanks for pointing out my mistake, I have now corrected my answer.

    – Mattias
    Jul 31 at 10:16






  • 2





    It was certainly a nice question and its good to see it inspiring minds

    – Chris Doyle
    Jul 31 at 10:18


















5














You can combine itertools.groupby (doc) and heapq.merge (doc) to iterate through FileA and FileB lazily (it works as long as the files are sorted!)



FileA = [1, 1, 2, 3, 4, 5]
FileB = [1, 3, 4, 6]

from itertools import groupby
from heapq import merge

gen_a = ((v, 'FileA') for v in FileA)
gen_b = ((v, 'FileB') for v in FileB)

for v, g in groupby(merge(gen_a, gen_b, key=lambda k: int(k[0])), lambda k: int(k[0])):
    if any(v[1] == 'FileB' for v in g):
        continue
    print(v)


Prints:



2
5



EDIT (Reading from files):



from itertools import groupby
from heapq import merge

gen_a = ((int(v.strip()), 1) for v in open('3k.txt'))
gen_b = ((int(v.strip()), 2) for v in open('2k.txt'))

for v, g in groupby(merge(gen_a, gen_b, key=lambda k: k[0]), lambda k: k[0]):
    if any(v[1] == 2 for v in g):
        continue
    print(v)


Benchmark:



Generating files with 10_000_000 items:



seq 0 3 10000000 > 3k.txt
seq 0 2 10000000 > 2k.txt


The script takes ~10sec to complete:



real 0m10,656s
user 0m10,557s
sys 0m0,076s
































  • @AndrejKesely Thank you so much. But one thing I don't understand: why does it take so much time for very big inputs if this also has linear time complexity?

    – Arjun Muraleedharan
    Jul 31 at 11:09











  • @ArjunMuraleedharan Yes, the complexity is O(n). But in my answer I create lots of tuples, one for each element, for example (int(v.strip()), 1). Each tuple is an object, and creating that object takes some "little" time. For long files, this time adds up.

    – Andrej Kesely
    Jul 31 at 11:11











  • @AndrejKesely yes that makes sense. But even when I try it with some sample array, it is taking more time than the other answer. Any idea?

    – Arjun Muraleedharan
    Jul 31 at 11:56


















1














A simple solution based on file reading (assuming that each line holds a number):



results = []
with open('file1.csv') as file1, open('file2.csv') as file2:
    var1 = file1.readline()
    var2 = file2.readline()
    while var1:
        while var1 and var2:
            if int(var1) < int(var2):
                results.append(int(var1))
                var1 = file1.readline()
            elif int(var1) > int(var2):
                var2 = file2.readline()
            elif int(var1) == int(var2):
                var1 = file1.readline()
                var2 = file2.readline()
        if var1:
            results.append(int(var1))
            var1 = file1.readline()
print(results)

Output:

[2, 5, 7, 9]
































  • You are missing the calls to int before the comparisons. Otherwise you get problems like "9" > "10". Also, you can open multiple files in one with statement.

    – Graipher
    Jul 31 at 18:37











  • Ok, I've fixed my code to handle that case as well. Thanks for the input! @Graipher

    – Mattias
    Aug 1 at 6:04


















1














This is similar to the classic Knuth sorting and searching. You may wish to consider reading the related Stack Overflow question, the on-line lecture notes (PDF), and Wikipedia. The Stack Overflow question mentions something that I agree with, which is using the Unix sort command. Always, always test with your own data to ensure the method chosen is the most efficient for your data, because some of these algorithms are data-dependent.
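If shelling out is acceptable, the Unix route suggested above can be sketched with sort and comm, both of which stream their input. One caveat: comm expects the default lexicographic order, so numerically sorted files with multi-digit values would need re-sorting first (and the output comes back in lexicographic order). A small illustrative example with placeholder file names:

```shell
# build two small sorted inputs (comm needs the default, lexicographic order)
printf '1\n2\n3\n4\n5\n' | sort > fileA.sorted
printf '1\n3\n4\n6\n' | sort > fileB.sorted

# -2 suppresses lines unique to fileB, -3 suppresses lines common to both,
# leaving only lines of fileA that are not in fileB
comm -23 fileA.sorted fileB.sorted   # prints 2 and 5
```

Any pair of sorted text files works in place of the generated ones.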

































    edited Aug 7 at 10:50

























    answered Jul 31 at 9:37









yukashima huksay
1,114 · 2 gold badges · 14 silver badges · 38 bronze badges




    1,1142 gold badges14 silver badges38 bronze badges















    • I guess this is one way to do it. Basically, you are using two pointers and iterating over the arrays once, so it is O(n), and since the input is sorted it will work. But we are still loading the whole file, aren't we?
      – Arjun Muraleedharan, Jul 31 at 10:49

    • @ArjunMuraleedharan No, we are loading the files line by line; iter lazily reads the file.
      – yukashima huksay, Jul 31 at 10:51

    • @ArjunMuraleedharan No, we aren't loading the whole file; that's the point of this algorithm.
      – Andrej Kesely, Jul 31 at 10:51

    • This should be the accepted answer - it's fastest :)
      – Andrej Kesely, Jul 31 at 11:02

    • @ArjunMuraleedharan I think you should post a whole new question, because that's really a different issue, which I believe is impossible to solve in O(n) with your memory limits.
      – yukashima huksay, Jul 31 at 12:06




    7














    As the files are sorted, you can iterate through them one line at a time. If the current line of file A is less than that of file B, you know A's line is not in B, so you advance file A only and compare again. If A's line is greater than B's, you know B's line is not in A, so you advance file B only. If the lines are equal, the value is in both files, so you advance both. Your original question asked only for entries that are in A but not in B; this answer extends that and also collects entries in B that are not in A. That adds flexibility while still letting you print just those in A not in B.



    def strip_read(file):
        return file.readline().rstrip()

    in_a_not_b = []
    in_b_not_a = []
    with open("fileA") as A, open("fileB") as B:
        Aline = strip_read(A)
        Bline = strip_read(B)
        while Aline or Bline:
            # An empty string means that file is exhausted, so the other
            # file's remaining lines are unique by definition.
            if Aline and (not Bline or Aline < Bline):
                in_a_not_b.append(Aline)
                Aline = strip_read(A)
            elif Bline and (not Aline or Aline > Bline):
                in_b_not_a.append(Bline)
                Bline = strip_read(B)
            else:
                Aline = strip_read(A)
                Bline = strip_read(B)

    print("in A not in B", in_a_not_b, "\nin B not in A", in_b_not_a)


    OUTPUT for my sample Files



    in A not in B ['2', '5', '7'] 
    in B not in A ['6']
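The sample files themselves were not shown; hypothetical contents consistent with the output above are fileA = 1, 2, 3, 4, 5, 7 and fileB = 1, 3, 4, 6. The same two-pointer logic can be sketched over in-memory lists to check this (`compare_sorted` is an assumed name; like the answer's code, it compares strings, so it assumes equal-width, here single-digit, entries):

```python
def compare_sorted(lines_a, lines_b):
    """Two-pointer walk over two sorted lists of strings.

    Returns (in_a_not_b, in_b_not_a), mirroring the file-based loop above.
    """
    in_a_not_b, in_b_not_a = [], []
    i = j = 0
    while i < len(lines_a) or j < len(lines_b):
        a = lines_a[i] if i < len(lines_a) else None
        b = lines_b[j] if j < len(lines_b) else None
        if b is None or (a is not None and a < b):
            in_a_not_b.append(a)      # a can no longer appear in B
            i += 1
        elif a is None or a > b:
            in_b_not_a.append(b)      # b can no longer appear in A
            j += 1
        else:                         # equal: present in both
            i += 1
            j += 1
    return in_a_not_b, in_b_not_a

print(compare_sorted(["1", "2", "3", "4", "5", "7"], ["1", "3", "4", "6"]))
# (['2', '5', '7'], ['6'])
```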





    • How is this in any way a new answer? There are already two answers doing exactly this.
      – yukashima huksay, Jul 31 at 10:07

    • I only see one answer doing this by reading the files line by line, and it assumes you are only interested in the entries which are in A but not in B. This code will loop through files of varying length and list the items which are in A and not in B, and also the items which are in B but not in A.
      – Chris Doyle, Jul 31 at 10:08

    • For example, in Mattias's case it will stop comparing when it reaches the end of either file, even if the other file still has entries in it, which means the entries that have not been read yet won't be listed as not being in the other file.
      – Chris Doyle, Jul 31 at 10:09

    • Thanks for pointing out my mistake, I have now corrected my answer.
      – Mattias, Jul 31 at 10:16

    • It was certainly a nice question and it's good to see it inspiring minds.
      – Chris Doyle, Jul 31 at 10:18










    answered Jul 31 at 10:01, edited Jul 31 at 10:17
    Chris Doyle (1,605 reputation; 1 gold, 12 silver, 25 bronze badges)







    5














    You can combine itertools.groupby (doc) and heapq.merge (doc) to iterate through FileA and FileB lazily (this works as long as the files are sorted!)



    FileA = [1, 1, 2, 3, 4, 5]
    FileB = [1, 3, 4, 6]

    from itertools import groupby
    from heapq import merge

    gen_a = ((v, 'FileA') for v in FileA)
    gen_b = ((v, 'FileB') for v in FileB)

    for v, g in groupby(merge(gen_a, gen_b, key=lambda k: int(k[0])), lambda k: int(k[0])):
        if any(v[1] == 'FileB' for v in g):
            continue
        print(v)


    Prints:



    2
    5



    EDIT (Reading from files):



    from itertools import groupby
    from heapq import merge

    gen_a = ((int(v.strip()), 1) for v in open('3k.txt'))
    gen_b = ((int(v.strip()), 2) for v in open('2k.txt'))

    for v, g in groupby(merge(gen_a, gen_b, key=lambda k: k[0]), lambda k: k[0]):
        if any(v[1] == 2 for v in g):
            continue
        print(v)


    Benchmark:



    Generating files with 10_000_000 items:



    seq 0 3 10000000 > 3k.txt
    seq 0 2 10000000 > 2k.txt


    The script takes ~10sec to complete:



    real 0m10,656s
    user 0m10,557s
    sys 0m0,076s
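As a usage sketch, the tagging trick above can be wrapped in a reusable generator (`only_in_a` is a hypothetical name; integer tags 0/1 replace the 'FileA'/'FileB' strings):

```python
from heapq import merge
from itertools import groupby

def only_in_a(iter_a, iter_b):
    """Yield the values of sorted iter_a that never occur in sorted iter_b."""
    # Tag each value with its source: 0 for A, 1 for B.
    tagged = merge(((v, 0) for v in iter_a),
                   ((v, 1) for v in iter_b),
                   key=lambda t: t[0])
    # groupby collapses runs of equal values; keep a value only if
    # every occurrence came from A (all tags are 0).
    for value, group in groupby(tagged, key=lambda t: t[0]):
        if all(tag == 0 for _, tag in group):
            yield value

print(list(only_in_a([1, 1, 2, 3, 4, 5], [1, 3, 4, 6])))  # [2, 5]
```

For files, pass lazy generators such as `(int(line) for line in open('3k.txt'))` so nothing is held in memory.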





    • @AndrejKesely Thank you so much, man. But one thing I don't understand is why it takes so much time for very, very big inputs if this is also linear time complexity?
      – Arjun Muraleedharan, Jul 31 at 11:09

    • @ArjunMuraleedharan Yes, the complexity is O(n). But in my answer I create lots of tuples, one for each element, for example (int(v.strip()), 1). Each tuple is an object, and creating that object takes a "little" time. For long files, this time adds up.
      – Andrej Kesely, Jul 31 at 11:11

    • @AndrejKesely Yes, that makes sense. But even when I try it with some sample array, it takes more time than the other answer. Any idea?
      – Arjun Muraleedharan, Jul 31 at 11:56















    answered Jul 31 at 9:36, edited Jul 31 at 10:47
    Andrej Kesely (18k reputation; 2 gold, 10 silver, 34 bronze badges)







    1














    A simple solution based on file reading (assuming that each line holds a number):



    results = []
    with open('file1.csv') as file1, open('file2.csv') as file2:
        var1 = file1.readline()
        var2 = file2.readline()
        while var1:
            while var1 and var2:
                if int(var1) < int(var2):
                    results.append(int(var1))
                    var1 = file1.readline()
                elif int(var1) > int(var2):
                    var2 = file2.readline()
                elif int(var1) == int(var2):
                    var1 = file1.readline()
                    var2 = file2.readline()
            if var1:
                results.append(int(var1))
                var1 = file1.readline()
    print(results)

    output = [2, 5, 7, 9]
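The nested loop above can also be flattened into a single loop: when file2 runs out, everything left in file1 belongs in the result. A sketch (`numbers_in_first_only` is an assumed name; `file1`/`file2` can be any objects with a `readline()` method):

```python
import io

def numbers_in_first_only(file1, file2):
    """Numbers in numerically sorted file1 that are absent from sorted file2."""
    results = []
    var1, var2 = file1.readline(), file2.readline()
    while var1:
        if not var2 or int(var1) < int(var2):
            # file2 is exhausted, or var1 is smaller than everything left in it.
            results.append(int(var1))
            var1 = file1.readline()
        elif int(var1) > int(var2):
            var2 = file2.readline()
        else:  # equal: the number is in both files, skip it
            var1, var2 = file1.readline(), file2.readline()
    return results

# StringIO stands in for real files here.
print(numbers_in_first_only(io.StringIO("1\n2\n3\n5\n7\n9\n"),
                            io.StringIO("1\n3\n8\n")))  # [2, 5, 7, 9]
```

With real files: `with open('file1.csv') as f1, open('file2.csv') as f2: print(numbers_in_first_only(f1, f2))`.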





    • You are missing the calls to int before the comparisons; otherwise you get problems like "9" > "10". Also, you can open multiple files in one with statement.
      – Graipher, Jul 31 at 18:37

    • Ok, I've fixed my code to handle that case as well. Thanks for the input! @Graipher
      – Mattias, Aug 1 at 6:04
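The pitfall behind Graipher's first point is that string comparison is lexicographic, character by character, so multi-digit numbers sort wrongly unless converted:

```python
# '9' > '1' decides the comparison before '10' is even finished being read:
assert "9" > "10"
# Converting to int restores numeric order:
assert int("9") < int("10")
```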















    answered Jul 31 at 9:50, edited Aug 1 at 6:03
    Mattias (346 reputation; 1 silver, 11 bronze badges)







    1














    This is similar to the classic Knuth sorting-and-searching problem.
    You may wish to read this Stack question,
    these on-line lecture notes (pdf), and Wikipedia.
    The Stack question mentions something that I agree with, which is using the Unix sort command. Always, always test with your own data to ensure the method chosen is the most efficient, because some of these algorithms are data-dependent.
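For files that are already sorted (or that you sort first), the whole task can be done with the standard Unix tools this answer alludes to: `comm -23` prints the lines unique to its first argument. A sketch with hypothetical sample data (note that `sort` and `comm` use lexicographic byte order, so equal-width entries, or zero-padded numbers, are assumed):

```shell
# Hypothetical sample data, one number per line.
printf '1\n2\n3\n5\n7\n9\n' > fileA.txt
printf '1\n3\n8\n'          > fileB.txt

# Sort both inputs into the byte order comm expects.
sort fileA.txt -o fileA.sorted
sort fileB.txt -o fileB.sorted

# -2 drops lines unique to fileB.sorted, -3 drops lines common to both,
# leaving only the lines that appear in fileA.sorted alone.
comm -23 fileA.sorted fileB.sorted
# prints: 2 5 7 9 (one per line)
```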






        answered Aug 3 at 23:26









        punchcard

        329 reputation, 1 silver badge, 5 bronze badges
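As a rough Python counterpart to the unix-sort approach above, once both files are sorted the difference can be computed in one streaming pass. The function name and file layout here are my own illustration, not code from the answer:

```python
# Streaming difference of two files of sorted numbers, one per line.
# Constant memory: only the current line of each file is held at a time.
def sorted_difference(path_a, path_b):
    """Yield every number in path_a that does not appear in path_b.

    Both files must already be numerically sorted, one integer per line.
    """
    with open(path_a) as fa, open(path_b) as fb:
        nums_b = (int(line) for line in fb)
        b = next(nums_b, None)
        for line in fa:
            a = int(line)
            while b is not None and b < a:   # advance b until it catches up
                b = next(nums_b, None)
            if b is None or a != b:
                yield a
```

On pre-sorted inputs this does the same job as `comm -23`, but works on ints directly, so "9" vs "10" ordering problems cannot occur.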



