Finding dictionary keys whose values are duplicates

I currently have a dictionary (Duplicate_combos) keyed by a unique identifying number, where each value is a two-element list: a company code and either 'Yes' or 'No' (both stored as strings). I am essentially trying to find the keys where the company code is equal and the second element is 'No' for both.
So if this was my dictionary:



{1234: ['123', 'No'], 1235: ['123', 'No'], 1236: ['123', 'Yes'], 1237: [124, 'No']}


I would only want to return 1234 and 1235. The code below is what I currently have, and I really need to optimize it: while it works on a small data set, I will need to run it on a much larger one (43,000 lines), and in early testing it has been running for 45+ minutes with no sign of finishing.



def open_file():
    in_file = open("./Data.csv", "r")
    blank = in_file.readline()
    titles = in_file.readline()
    titles = titles.strip()
    titles = titles.split(',')

    cost_center = []         # 0
    cost_center_name = []    # 1
    management_site = []     # 15
    sub_function = []        # 19
    LER = []                 # 41
    Company_name = []        # 3
    Business_group = []      # 7
    Value_center = []        # 9
    Performance_center = []  # 10
    Profit_center = []       # 11

    total_lines = {}

    for line in in_file:
        line = line.strip()
        line = line.split(',')
        cost_center.append(line[0])
        cost_center_name.append(line[1])
        management_site.append(line[15])
        sub_function.append(line[19])
        LER.append(line[41])
        Company_name.append(line[3])
        Business_group.append(line[7])
        Value_center.append(line[9])
        Performance_center.append(line[10])
        Profit_center.append(line[11])

        # create a dictionary of all the lines with the key being the unique cost center number (cost_center list)
        total_lines[line[0]] = line[1:]

    return (cost_center, cost_center_name, management_site, sub_function, LER,
            Company_name, Business_group, total_lines, titles, Value_center,
            Performance_center, Profit_center)


def find_duplicates(Duplicate_combos):
    Real_duplicates = []
    archive_duplicates = []

    # loop through the dictionary of duplicate combos by the keys
    for key in Duplicate_combos:
        code = Duplicate_combos[key][0]
        for key2 in Duplicate_combos:
            # if the two keys are equal, we are comparing the key to itself, which we don't want, so continue
            if key == key2:
                continue
            # if the company codes are the same and they are BOTH NOT going to be consolidated, we have found a real duplicate
            elif Duplicate_combos[key2][0] == code and Duplicate_combos[key2][1] == 'No' and Duplicate_combos[key][1] == 'No':
                # make sure that we haven't already dealt with this key before
                if key not in archive_duplicates:
                    Real_duplicates.append(key)
                    archive_duplicates.append(key)

                if key2 not in archive_duplicates:
                    Real_duplicates.append(key2)
                    archive_duplicates.append(key2)
                continue
    return Real_duplicates









Comments:

Where does the data for Duplicate_combos come from? The right performance fix would likely involve putting that data into a more appropriate data structure for this task. – 200_success, Jun 3 at 19:57

The data comes from a CSV file that I read in as part of earlier functions. Based on my runs, this function seems to be the one that is taking significantly longer. – Ben Naylor, Jun 3 at 20:00

In that case, I recommend including the CSV-reading code, as well as an excerpt from the CSV file, so that we can give you the proper advice. Also, please fix your indentation. One easy way to post code is to paste it into the question editor, highlight it, and press Ctrl-K to mark it as a code block. – 200_success, Jun 3 at 20:09

I added the open_file function; a lot of what it returns is used elsewhere, so I don't know if it helps. As for the data, I can't share it, but from my testing I know that everything is being read in correctly. At this point the code works, just really not optimally, and that's the main thing I was looking for. I haven't had much experience with optimization, so I was hoping to get ideas on how to do that. – Ben Naylor, Jun 3 at 20:17

Interesting! That is a very unconventional way to read a CSV, and now I'm intrigued as to how you make use of those weird lists. You could probably benefit a lot from putting your entire program up for review. – 200_success, Jun 3 at 20:20







Tags: python, time-limit-exceeded, dictionary






edited Jun 3 at 20:14







asked Jun 3 at 19:52 by Ben Naylor (385)

3 Answers

Answer 1 (score 7)

  1. It's easier to read code that tuple-unpacks the values in the for loop, using dict.items():



    for key1, (code1, option1) in Duplicate_combos.items():



  2. archive_duplicates is a duplicate of Real_duplicates. There's no need for it.


  3. It doesn't seem like the output needs to be ordered, and so you can just make Real_duplicates a set. This means it won't have duplicates, and you don't have to loop through it twice each time you want to add a value.



    This alone speeds up your program from $O(n^3)$ to $O(n^2)$.



  4. Your variable names are quite poor, and don't adhere to PEP8. I have changed them to somewhat generic names, but it'd be better if you replace, say, items with what it actually is.


def find_duplicates(items):
    duplicates = set()
    for key1, (code1, option1) in items.items():
        for key2, (code2, option2) in items.items():
            if key1 == key2:
                continue
            elif code1 == code2 and option1 == option2 == 'No':
                duplicates.add(key1)
                duplicates.add(key2)
    return list(duplicates)
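As a quick sanity check, here is a self-contained run of the set-based version on the sample dictionary from the question (the function is repeated so the snippet runs on its own):

```python
def find_duplicates(items):
    # Set-based version: O(n^2) comparisons, but O(1) membership checks.
    duplicates = set()
    for key1, (code1, option1) in items.items():
        for key2, (code2, option2) in items.items():
            if key1 == key2:
                continue
            elif code1 == code2 and option1 == option2 == 'No':
                duplicates.add(key1)
                duplicates.add(key2)
    return list(duplicates)

# Sample data from the question (codes kept as strings).
sample = {1234: ['123', 'No'], 1235: ['123', 'No'],
          1236: ['123', 'Yes'], 1237: ['124', 'No']}
print(sorted(find_duplicates(sample)))  # [1234, 1235]
```

Sorting is only for a stable display; the set itself is unordered.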



  1. You don't need to loop over Duplicate_combos twice.

     To do this you need to build a new dictionary grouping keys by code, adding a key only if its option is 'No'.

     After building the new dictionary, you can iterate over its values and return the keys from any group containing two or more entries.



def find_duplicates(items):
    by_code = {}
    for key, (code, option) in items.items():
        if option == 'No':
            by_code.setdefault(code, []).append(key)

    return [
        key
        for keys in by_code.values()
        if len(keys) >= 2
        for key in keys
    ]


This now runs in $O(n)$ time rather than $O(n^3)$ time.



>>> find_duplicates({
...     101: ['1', 'No'],  102: ['1', 'No'],
...     103: ['1', 'Yes'], 104: ['1', 'No'],
...     201: ['2', 'No'],  202: ['2', 'No'],
...     301: ['3', 'No'],  401: ['4', 'No'],
... })
[101, 102, 104, 201, 202]
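As a stylistic alternative (not part of the answer above), the same O(n) grouping can be sketched with collections.defaultdict, which stands in for calling dict.setdefault on every iteration:

```python
from collections import defaultdict

def find_duplicates(items):
    # Group keys by company code, keeping only entries flagged 'No'.
    by_code = defaultdict(list)
    for key, (code, option) in items.items():
        if option == 'No':
            by_code[code].append(key)
    # Keep every key from any group of two or more.
    return [key for keys in by_code.values() if len(keys) >= 2 for key in keys]

sample = {101: ['1', 'No'], 102: ['1', 'No'], 103: ['1', 'Yes'],
          104: ['1', 'No'], 201: ['2', 'No'], 202: ['2', 'No'],
          301: ['3', 'No'], 401: ['4', 'No']}
print(find_duplicates(sample))  # [101, 102, 104, 201, 202]
```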






Comments:

So this would output all of the keys that have the duplicates, not just one? I was iterating twice in order to compare each element to all the others, so I would get all of the keys that share the duplicate values. – Ben Naylor, Jun 3 at 20:34

@BenNaylor Yes, this would do that. Please see the update with the example showing this. – Peilonrayz, Jun 3 at 20:38

Thank you so much, this really helps! – Ben Naylor, Jun 4 at 12:20




Answer 2 (score 4)

When reading your data, you open a file but never .close() it. Get into the habit of using the with keyword to avoid this issue.



You should also benefit from the csv module to read this file as it will remove boilerplate and handle special cases for you:



import csv

def open_file(filename='./Data.csv'):
    cost_center = []         # 0
    cost_center_name = []    # 1
    management_site = []     # 15
    sub_function = []        # 19
    LER = []                 # 41
    Company_name = []        # 3
    Business_group = []      # 7
    Value_center = []        # 9
    Performance_center = []  # 10
    Profit_center = []       # 11
    total_lines = {}

    with open(filename) as in_file:
        next(in_file)  # skip blank line
        reader = csv.reader(in_file, delimiter=',')
        titles = next(reader)  # header row

        for line in reader:
            cost_center.append(line[0])
            cost_center_name.append(line[1])
            management_site.append(line[15])
            sub_function.append(line[19])
            LER.append(line[41])
            Company_name.append(line[3])
            Business_group.append(line[7])
            Value_center.append(line[9])
            Performance_center.append(line[10])
            Profit_center.append(line[11])

            # create a dictionary of all the lines with the key being the unique cost center number (cost_center list)
            total_lines[line[0]] = line[1:]

    return cost_center, cost_center_name, management_site, sub_function, LER, Company_name, Business_group, total_lines, titles, Value_center, Performance_center, Profit_center
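Going a step further than the answer above, csv.DictReader lets you address columns by header name instead of numeric index, which would make the 42-column indexing less error-prone. A minimal sketch (the two column names here are hypothetical stand-ins for the real file's headers):

```python
import csv
import io

# Hypothetical two-column sample standing in for the real 42-column file;
# note the leading blank line, mirroring the question's format.
sample = "\ncost_center,cost_center_name\n1001,Alpha\n1002,Beta\n"

with io.StringIO(sample) as in_file:
    next(in_file)                     # skip the leading blank line
    reader = csv.DictReader(in_file)  # header row becomes the dict keys
    rows = list(reader)

print(rows[0]['cost_center_name'])           # Alpha
print([row['cost_center'] for row in rows])  # ['1001', '1002']
```

With a real file, io.StringIO would simply be replaced by open(filename).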






Comments:

I'd personally use something like columns = zip(*reader) and then define each value once, e.g. cost_center = columns[0]. This would make total_lines a bit more finicky, though. – Peilonrayz, Jun 4 at 10:39

@Peilonrayz When I read LER.append(line[41]) and there are only 10 columns of interest, I'm not sure this is really worth it. – Mathias Ettinger, Jun 4 at 12:46





Answer 3 (score 0)

Doing



def get_dupes(df):
    if sum(df.loc[1] == 'No') < 2:
        return None
    else:
        return list(df.loc[:, df.loc[1] == 'No'].columns)

df.groupby(axis=1, by=df.loc[0]).apply(get_dupes)


Got me



0
124            None
123    [1234, 1235]
dtype: object


Your question wasn't quite clear on what you want the output to be when multiple company codes have duplicate values (e.g. if the input is {1234: ['123', 'No'], 1235: ['123', 'No'], 1236: ['123', 'Yes'], 1237: [124, 'No'], 1238: [124, 'No']}, do you want [1234, 1235, 1237, 1238] or [[1234, 1235], [1237, 1238]]?), so you can modify this code accordingly.
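For context, a sketch of how the DataFrame this answer operates on might be built from the question's dictionary (the construction is an assumption, since the answer doesn't show it): each dictionary key becomes a column, with row 0 holding the company code and row 1 the 'Yes'/'No' flag. The duplicate lookup below is an equivalent check without groupby, shown only to make the layout concrete:

```python
import pandas as pd

# Hypothetical construction: columns are the dictionary keys,
# row 0 is the company code, row 1 is the 'Yes'/'No' flag.
duplicate_combos = {1234: ['123', 'No'], 1235: ['123', 'No'],
                    1236: ['123', 'Yes'], 1237: ['124', 'No']}
df = pd.DataFrame(duplicate_combos)

# Codes of the columns flagged 'No', then the columns sharing a code.
codes_with_no = df.loc[0][df.loc[1] == 'No']
dupes = sorted(codes_with_no[codes_with_no.duplicated(keep=False)].index)
print(dupes)  # [1234, 1235]
```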







Comments:

You could just take a look at how the current code behaves to understand what output is expected... – Vogel612, Jun 4 at 10:05

You have presented an alternative solution, but haven't reviewed the code. Please edit to show what aspects of the question code prompted you to write this version, and in what ways it's an improvement over the original. It may be worth (re-)reading How to Answer. – Toby Speight, Jun 4 at 10:07











Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);






Ben Naylor is a new contributor. Be nice, and check out our Code of Conduct.









3 Answers
7














  1. It's easier to read code that tuple-unpacks the values in the for loop over dict.items():



    for key1, (code1, option1) in Duplicate_combos.items():



  2. archive_duplicates is a duplicate of Real_duplicates. There's no need for it.


  3. It doesn't seem like the output needs to be ordered, and so you can just make Real_duplicates a set. This means it won't have duplicates, and you don't have to loop through it twice each time you want to add a value.



    This alone speeds up your program from $O(n^3)$ to $O(n^2)$.



  4. Your variable names are quite poor, and don't adhere to PEP8. I have changed them to somewhat generic names, but it'd be better if you replace, say, items with what it actually is.


def find_duplicates(items):
    duplicates = set()
    for key1, (code1, option1) in items.items():
        for key2, (code2, option2) in items.items():
            if key1 == key2:
                continue
            elif code1 == code2 and option1 == option2 == 'No':
                duplicates.add(key1)
                duplicates.add(key2)
    return list(duplicates)
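As a quick sanity check, the set-based version above can be exercised like this (the function is repeated so the snippet runs on its own; the sample data is invented):

```python
def find_duplicates(items):
    duplicates = set()
    for key1, (code1, option1) in items.items():
        for key2, (code2, option2) in items.items():
            if key1 == key2:
                continue
            elif code1 == code2 and option1 == option2 == 'No':
                duplicates.add(key1)
                duplicates.add(key2)
    return list(duplicates)

# Invented sample: 1234 and 1235 share company code '123' and are both 'No'.
sample = {1234: ['123', 'No'], 1235: ['123', 'No'], 1236: ['123', 'Yes']}
print(sorted(find_duplicates(sample)))  # [1234, 1235]
```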



  1. You don't need to loop over Duplicate_combos twice.



    To do this you need to make a new dictionary grouping keys by their code, adding a key only if its option is 'No'.



    After building the new dictionary you can iterate over its values and return the keys from every group whose length is at least two.



def find_duplicates(items):
    by_code = {}
    for key, (code, option) in items.items():
        if option == 'No':
            by_code.setdefault(code, []).append(key)

    return [
        key
        for keys in by_code.values()
        if len(keys) >= 2
        for key in keys
    ]


This now runs in $O(n)$ time rather than $O(n^3)$ time.



>>> find_duplicates({
...     101: ['1', 'No'], 102: ['1', 'No'],
...     103: ['1', 'Yes'], 104: ['1', 'No'],
...     201: ['2', 'No'], 202: ['2', 'No'],
...     301: ['3', 'No'], 401: ['4', 'No'],
... })
[101, 102, 104, 201, 202]

















  • so this would output all of the keys that have the duplicates not just one? I was iterating twice in order to compare each element to all the others so I would get all of the keys that share the duplicate values
    – Ben Naylor
    Jun 3 at 20:34











  • @BenNaylor Yes this would do that. Please see the update with the example showing this.
    – Peilonrayz
    Jun 3 at 20:38










  • Thank you so much, this really really helps!
    – Ben Naylor
    Jun 4 at 12:20















answered Jun 3 at 20:24, last edited Jun 4 at 10:07 – Peilonrayz
4













When reading your data, you open a file but never .close() it. You should get into the habit of using the with keyword to avoid this issue.



You should also benefit from the csv module to read this file as it will remove boilerplate and handle special cases for you:



def open_file(filename='./Data.csv'):
    cost_center = []         # 0
    cost_center_name = []    # 1
    management_site = []     # 15
    sub_function = []        # 19
    LER = []                 # 41
    Company_name = []        # 3
    Business_group = []      # 7
    Value_center = []        # 9
    Performance_center = []  # 10
    Profit_center = []       # 11
    total_lines = {}

    with open(filename) as in_file:
        next(in_file)  # skip blank line
        reader = csv.reader(in_file, delimiter=',')

        for line in reader:
            cost_center.append(line[0])
            cost_center_name.append(line[1])
            management_site.append(line[15])
            sub_function.append(line[19])
            LER.append(line[41])
            Company_name.append(line[3])
            Business_group.append(line[7])
            Value_center.append(line[9])
            Performance_center.append(line[10])
            Profit_center.append(line[11])

            # create a dictionary of all the lines with the key being the
            # unique cost center number (cost_center list)
            total_lines[line[0]] = line[1:]

    return (cost_center, cost_center_name, management_site, sub_function,
            LER, Company_name, Business_group, total_lines,
            Value_center, Performance_center, Profit_center)
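The with + csv.reader pattern can be illustrated in isolation like this (the file name and two-column data here are invented, not the OP's real layout):

```python
import csv
import os
import tempfile

# Create a tiny throwaway CSV to read back (purely illustrative data).
path = os.path.join(tempfile.mkdtemp(), 'Data.csv')
with open(path, 'w', newline='') as f:
    f.write('\n')  # the blank first line that the reader skips
    csv.writer(f).writerows([['C100', 'Widgets'], ['C200', 'Gadgets']])

cost_center = []
with open(path) as in_file:  # the file is closed automatically on exit
    next(in_file)            # skip the blank first line
    for line in csv.reader(in_file, delimiter=','):
        cost_center.append(line[0])

print(cost_center)  # ['C100', 'C200']
```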

















  • I'd personally use something like columns = zip(*reader) and then define each value once. cost_center = columns[0]. This would make total_lines a bit more finicky tho.
    – Peilonrayz
    Jun 4 at 10:39










  • @Peilonrayz When I read LER.append(line[41]) and there is only 10 columns of interest, I’m not sure this is really worth it.
    – Mathias Ettinger
    Jun 4 at 12:46
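The zip(*reader) idea from the comment above can be sketched like this (rows and column meanings invented for illustration):

```python
# Transposing rows into columns with zip(*rows): each column comes out
# as a tuple, so every column can be bound to a name in one step.
rows = [
    ['C100', 'Widgets'],
    ['C200', 'Gadgets'],
]
columns = list(zip(*rows))
cost_center = columns[0]

print(cost_center)  # ('C100', 'C200')
```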















answered Jun 4 at 8:15 – Mathias Ettinger
0













Doing



def get_dupes(df):
    if sum(df.loc[1] == 'No') < 2:
        return None
    else:
        return list(df.loc[:, df.loc[1] == 'No'].columns)

df.groupby(axis=1, by=df.loc[0]).apply(get_dupes)


Got me



 0
124 None
123 [1234, 1235]
dtype: object


Your question wasn't quite clear on what you want the output to be if there are multiple company values with duplicate values (e.g. if the input is {1234: ['123', 'No'], 1235: ['123', 'No'], 1236: ['123', 'Yes'], 1237: ['124', 'No'], 1238: ['124', 'No']},
do you want [1234, 1235, 1237, 1238] or [[1234, 1235], [1237, 1238]]), so you can modify this code accordingly.
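For context, the snippet above assumes the question's dictionary was already loaded into a pandas DataFrame, where each key becomes a column and the [code, flag] list becomes rows 0 and 1 (data invented):

```python
import pandas as pd

# Hypothetical input: key -> [company code, Yes/No flag]
Duplicate_combos = {1234: ['123', 'No'], 1235: ['123', 'No'], 1236: ['123', 'Yes']}
df = pd.DataFrame(Duplicate_combos)  # keys become columns, list items become rows

print(df.loc[0].tolist())  # ['123', '123', '123']  (company codes, row 0)
print(df.loc[1].tolist())  # ['No', 'No', 'Yes']    (flags, row 1)
```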














  • 1
    You could just take a look at how the current code behaves to understand what output is expected...
    – Vogel612
    Jun 4 at 10:05






  • 2
    You have presented an alternative solution, but haven't reviewed the code. Please edit to show what aspects of the question code prompted you to write this version, and in what ways it's an improvement over the original. It may be worth (re-)reading How to Answer.
    – Toby Speight
    Jun 4 at 10:07















answered Jun 3 at 23:16 – Acccumulation