Finding dictionary keys whose values are duplicates

I currently have a dictionary (Duplicate_combos) keyed by a unique identifying number, where each value is a two-element list: a company code and either 'Yes' or 'No' (both stored as strings). I am essentially trying to find the keys where the company code is equal and the second element is 'No' for both.
So if this was my dictionary:



{1234: ['123', 'No'], 1235: ['123', 'No'], 1236: ['123', 'Yes'], 1237: [124, 'No']}


I would only want to return 1234 and 1235. The code below is what I currently have, and I really need to optimize it: while it works on a small data set, I will need to run it on a much larger one (43,000 lines), and in early testing it has been running for 45+ minutes with no sign of finishing.



def open_file():
    in_file = open("./Data.csv", "r")
    blank = in_file.readline()
    titles = in_file.readline()
    titles = titles.strip()
    titles = titles.split(',')

    cost_center = []         # 0
    cost_center_name = []    # 1
    management_site = []     # 15
    sub_function = []        # 19
    LER = []                 # 41
    Company_name = []        # 3
    Business_group = []      # 7
    Value_center = []        # 9
    Performance_center = []  # 10
    Profit_center = []       # 11

    total_lines = {}

    for line in in_file:
        line = line.strip()
        line = line.split(',')
        cost_center.append(line[0])
        cost_center_name.append(line[1])
        management_site.append(line[15])
        sub_function.append(line[19])
        LER.append(line[41])
        Company_name.append(line[3])
        Business_group.append(line[7])
        Value_center.append(line[9])
        Performance_center.append(line[10])
        Profit_center.append(line[11])

        # create a dictionary of all the lines with the key being the unique cost center number (cost_center list)
        total_lines[line[0]] = line[1:]

    return (cost_center, cost_center_name, management_site, sub_function, LER,
            Company_name, Business_group, total_lines, titles, Value_center,
            Performance_center, Profit_center)


def find_duplicates(Duplicate_combos):
    Real_duplicates = []
    archive_duplicates = []

    # loop through the dictionary of duplicate combos by the keys
    for key in Duplicate_combos:
        code = Duplicate_combos[key][0]
        for key2 in Duplicate_combos:
            # if the two keys are equal, we are comparing the key to itself, which we don't want, so continue
            if key == key2:
                continue
            # if the company codes are the same and they are BOTH NOT going to be consolidated, we have found a real duplicate
            elif Duplicate_combos[key2][0] == code and Duplicate_combos[key2][1] == 'No' and Duplicate_combos[key][1] == 'No':
                # make sure that we haven't already dealt with this key before
                if key not in archive_duplicates:
                    Real_duplicates.append(key)
                    archive_duplicates.append(key)

                if key2 not in archive_duplicates:
                    Real_duplicates.append(key2)
                    archive_duplicates.append(key2)
                continue
    return Real_duplicates









Comments:

Where does the data for Duplicate_combos come from? The right performance fix would likely involve putting that data into a more appropriate data structure for this task. – 200_success, Jun 3 at 19:57

The data comes from a CSV file that I read in as part of earlier functions. Based on my runs, this function seems to be the one that is taking significantly longer. – Ben Naylor, Jun 3 at 20:00

In that case, I recommend including the CSV-reading code, as well as an excerpt from the CSV file, so that we can give you the proper advice. Also, please fix your indentation. One easy way to post code is to paste it into the question editor, highlight it, and press Ctrl-K to mark it as a code block. – 200_success, Jun 3 at 20:09

I added the open_file function; a lot of what it returns is used elsewhere, so I don't know if it helps. As for the data, I can't share it, but from my testing I know that everything is being read in correctly. At this point the code works, just really not optimally, and that's the main thing I was looking for. I haven't had much experience with optimization, so I was hoping to get ideas on how to do that. – Ben Naylor, Jun 3 at 20:17

Interesting! That is a very unconventional way to read a CSV, and now I'm intrigued as to how you make use of those weird lists. You could probably benefit a lot from putting your entire program up for review. – 200_success, Jun 3 at 20:20







Tags: python, time-limit-exceeded, dictionary






edited Jun 3 at 20:14







asked Jun 3 at 19:52 by Ben Naylor (385)

3 Answers

Answer 1 (score 7)

  1. It's easier to read code that tuple-unpacks the values in the for loop, using dict.items():



    for key1, (code1, option1) in Duplicate_combos.items():



  2. archive_duplicates is a duplicate of Real_duplicates. There's no need for it.


  3. It doesn't seem like the output needs to be ordered, and so you can just make Real_duplicates a set. This means it won't have duplicates, and you don't have to loop through it twice each time you want to add a value.



    This alone speeds up your program from $O(n^3)$ to $O(n^2)$.



  4. Your variable names are quite poor, and don't adhere to PEP8. I have changed them to somewhat generic names, but it'd be better if you replace, say, items with what it actually is.


def find_duplicates(items):
    duplicates = set()
    for key1, (code1, option1) in items.items():
        for key2, (code2, option2) in items.items():
            if key1 == key2:
                continue
            elif code1 == code2 and option1 == option2 == 'No':
                duplicates.add(key1)
                duplicates.add(key2)
    return list(duplicates)
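As a quick sanity check, here is a self-contained run of the set-based version on the sample dictionary from the question (the function is repeated so the snippet runs on its own):

```python
def find_duplicates(items):
    # Set-based version: O(n^2) comparisons, but O(1) membership checks.
    duplicates = set()
    for key1, (code1, option1) in items.items():
        for key2, (code2, option2) in items.items():
            if key1 == key2:
                continue
            elif code1 == code2 and option1 == option2 == 'No':
                duplicates.add(key1)
                duplicates.add(key2)
    return list(duplicates)

# Sample data from the question (codes kept as strings).
sample = {1234: ['123', 'No'], 1235: ['123', 'No'],
          1236: ['123', 'Yes'], 1237: ['124', 'No']}
print(sorted(find_duplicates(sample)))  # [1234, 1235]
```

Sorting is only for a stable display; the set itself is unordered.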



  1. You don't need to loop over Duplicate_combos twice.

     To do this you need to build a new dictionary grouping keys by code, adding a key only if its option is 'No'.

     After building the new dictionary, you can iterate over its values and return the keys from any group containing two or more entries.



def find_duplicates(items):
    by_code = {}
    for key, (code, option) in items.items():
        if option == 'No':
            by_code.setdefault(code, []).append(key)

    return [
        key
        for keys in by_code.values()
        if len(keys) >= 2
        for key in keys
    ]


This now runs in $O(n)$ time rather than $O(n^3)$ time.



>>> find_duplicates({
...     101: ['1', 'No'],  102: ['1', 'No'],
...     103: ['1', 'Yes'], 104: ['1', 'No'],
...     201: ['2', 'No'],  202: ['2', 'No'],
...     301: ['3', 'No'],  401: ['4', 'No'],
... })
[101, 102, 104, 201, 202]
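As a stylistic alternative (not part of the answer above), the same O(n) grouping can be sketched with collections.defaultdict, which stands in for calling dict.setdefault on every iteration:

```python
from collections import defaultdict

def find_duplicates(items):
    # Group keys by company code, keeping only entries flagged 'No'.
    by_code = defaultdict(list)
    for key, (code, option) in items.items():
        if option == 'No':
            by_code[code].append(key)
    # Keep every key from any group of two or more.
    return [key for keys in by_code.values() if len(keys) >= 2 for key in keys]

sample = {101: ['1', 'No'], 102: ['1', 'No'], 103: ['1', 'Yes'],
          104: ['1', 'No'], 201: ['2', 'No'], 202: ['2', 'No'],
          301: ['3', 'No'], 401: ['4', 'No']}
print(find_duplicates(sample))  # [101, 102, 104, 201, 202]
```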






Comments:

So this would output all of the keys that have the duplicates, not just one? I was iterating twice in order to compare each element to all the others, so I would get all of the keys that share the duplicate values. – Ben Naylor, Jun 3 at 20:34

@BenNaylor Yes, this would do that. Please see the update with the example showing this. – Peilonrayz, Jun 3 at 20:38

Thank you so much, this really helps! – Ben Naylor, Jun 4 at 12:20




Answer 2 (score 4)

When reading your data, you open a file but never .close() it. Get into the habit of using the with keyword to avoid this issue.



You should also benefit from the csv module to read this file as it will remove boilerplate and handle special cases for you:



import csv

def open_file(filename='./Data.csv'):
    cost_center = []         # 0
    cost_center_name = []    # 1
    management_site = []     # 15
    sub_function = []        # 19
    LER = []                 # 41
    Company_name = []        # 3
    Business_group = []      # 7
    Value_center = []        # 9
    Performance_center = []  # 10
    Profit_center = []       # 11
    total_lines = {}

    with open(filename) as in_file:
        next(in_file)  # skip blank line
        reader = csv.reader(in_file, delimiter=',')
        titles = next(reader)  # header row

        for line in reader:
            cost_center.append(line[0])
            cost_center_name.append(line[1])
            management_site.append(line[15])
            sub_function.append(line[19])
            LER.append(line[41])
            Company_name.append(line[3])
            Business_group.append(line[7])
            Value_center.append(line[9])
            Performance_center.append(line[10])
            Profit_center.append(line[11])

            # create a dictionary of all the lines with the key being the unique cost center number (cost_center list)
            total_lines[line[0]] = line[1:]

    return cost_center, cost_center_name, management_site, sub_function, LER, Company_name, Business_group, total_lines, titles, Value_center, Performance_center, Profit_center
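Going a step further than the answer above, csv.DictReader lets you address columns by header name instead of numeric index, which would make the 42-column indexing less error-prone. A minimal sketch (the two column names here are hypothetical stand-ins for the real file's headers):

```python
import csv
import io

# Hypothetical two-column sample standing in for the real 42-column file;
# note the leading blank line, mirroring the question's format.
sample = "\ncost_center,cost_center_name\n1001,Alpha\n1002,Beta\n"

with io.StringIO(sample) as in_file:
    next(in_file)                     # skip the leading blank line
    reader = csv.DictReader(in_file)  # header row becomes the dict keys
    rows = list(reader)

print(rows[0]['cost_center_name'])           # Alpha
print([row['cost_center'] for row in rows])  # ['1001', '1002']
```

With a real file, io.StringIO would simply be replaced by open(filename).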






Comments:

I'd personally use something like columns = zip(*reader) and then define each value once, e.g. cost_center = columns[0]. This would make total_lines a bit more finicky, though. – Peilonrayz, Jun 4 at 10:39

@Peilonrayz When I read LER.append(line[41]) and there are only 10 columns of interest, I'm not sure this is really worth it. – Mathias Ettinger, Jun 4 at 12:46





Answer 3 (score 0)

Doing



def get_dupes(df):
    if sum(df.loc[1] == 'No') < 2:
        return None
    else:
        return list(df.loc[:, df.loc[1] == 'No'].columns)

df.groupby(axis=1, by=df.loc[0]).apply(get_dupes)


Got me



0
124            None
123    [1234, 1235]
dtype: object


Your question wasn't quite clear on what you want the output to be when multiple company codes have duplicate values (e.g. if the input is {1234: ['123', 'No'], 1235: ['123', 'No'], 1236: ['123', 'Yes'], 1237: [124, 'No'], 1238: [124, 'No']}, do you want [1234, 1235, 1237, 1238] or [[1234, 1235], [1237, 1238]]?), so you can modify this code accordingly.
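For context, a sketch of how the DataFrame this answer operates on might be built from the question's dictionary (the construction is an assumption, since the answer doesn't show it): each dictionary key becomes a column, with row 0 holding the company code and row 1 the 'Yes'/'No' flag. The duplicate lookup below is an equivalent check without groupby, shown only to make the layout concrete:

```python
import pandas as pd

# Hypothetical construction: columns are the dictionary keys,
# row 0 is the company code, row 1 is the 'Yes'/'No' flag.
duplicate_combos = {1234: ['123', 'No'], 1235: ['123', 'No'],
                    1236: ['123', 'Yes'], 1237: ['124', 'No']}
df = pd.DataFrame(duplicate_combos)

# Codes of the columns flagged 'No', then the columns sharing a code.
codes_with_no = df.loc[0][df.loc[1] == 'No']
dupes = sorted(codes_with_no[codes_with_no.duplicated(keep=False)].index)
print(dupes)  # [1234, 1235]
```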







Comments:

You could just take a look at how the current code behaves to understand what output is expected... – Vogel612, Jun 4 at 10:05

You have presented an alternative solution, but haven't reviewed the code. Please edit to show what aspects of the question code prompted you to write this version, and in what ways it's an improvement over the original. It may be worth (re-)reading How to Answer. – Toby Speight, Jun 4 at 10:07











Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);






Ben Naylor is a new contributor. Be nice, and check out our Code of Conduct.









3 Answers
7














  1. It's easier to read code that tuple-unpacks the values in the for loop over dict.items():



    for key1, (code1, option1) in Duplicate_combos.items():



  2. archive_duplicates is a duplicate of Real_duplicates. There's no need for it.


  3. It doesn't seem like the output needs to be ordered, and so you can just make Real_duplicates a set. This means it won't have duplicates, and you don't have to loop through it twice each time you want to add a value.



    This alone speeds up your program from $O(n^3)$ to $O(n^2)$.



  4. Your variable names are quite poor, and don't adhere to PEP8. I have changed them to somewhat generic names, but it'd be better if you replace, say, items with what it actually is.


def find_duplicates(items):
    duplicates = set()
    for key1, (code1, option1) in items.items():
        for key2, (code2, option2) in items.items():
            if key1 == key2:
                continue
            elif code1 == code2 and option1 == option2 == 'No':
                duplicates.add(key1)
                duplicates.add(key2)
    return list(duplicates)
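As a quick sanity check, the set-based version above can be exercised like this (the function is repeated so the snippet runs on its own; the sample data is invented):

```python
def find_duplicates(items):
    duplicates = set()
    for key1, (code1, option1) in items.items():
        for key2, (code2, option2) in items.items():
            if key1 == key2:
                continue
            elif code1 == code2 and option1 == option2 == 'No':
                duplicates.add(key1)
                duplicates.add(key2)
    return list(duplicates)

# Invented sample: 1234 and 1235 share company code '123' and are both 'No'.
sample = {1234: ['123', 'No'], 1235: ['123', 'No'], 1236: ['123', 'Yes']}
print(sorted(find_duplicates(sample)))  # [1234, 1235]
```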



  1. You don't need to loop over Duplicate_combos twice.



    To do this you need to make a new dictionary grouping keys by their code, adding a key only if its option is 'No'.



    After building the new dictionary you can iterate over its values and return the keys from every group whose length is at least two.



def find_duplicates(items):
    by_code = {}
    for key, (code, option) in items.items():
        if option == 'No':
            by_code.setdefault(code, []).append(key)

    return [
        key
        for keys in by_code.values()
        if len(keys) >= 2
        for key in keys
    ]


This now runs in $O(n)$ time rather than $O(n^3)$ time.



>>> find_duplicates({
...     101: ['1', 'No'], 102: ['1', 'No'],
...     103: ['1', 'Yes'], 104: ['1', 'No'],
...     201: ['2', 'No'], 202: ['2', 'No'],
...     301: ['3', 'No'], 401: ['4', 'No'],
... })
[101, 102, 104, 201, 202]

















  • so this would output all of the keys that have the duplicates not just one? I was iterating twice in order to compare each element to all the others so I would get all of the keys that share the duplicate values
    – Ben Naylor
    Jun 3 at 20:34











  • @BenNaylor Yes this would do that. Please see the update with the example showing this.
    – Peilonrayz
    Jun 3 at 20:38










  • Thank you so much, this really really helps!
    – Ben Naylor
    Jun 4 at 12:20















answered Jun 3 at 20:24, last edited Jun 4 at 10:07 – Peilonrayz
4













When reading your data, you open a file but never .close() it. You should get into the habit of using the with keyword to avoid this issue.



You should also benefit from the csv module to read this file as it will remove boilerplate and handle special cases for you:



def open_file(filename='./Data.csv'):
    cost_center = []         # 0
    cost_center_name = []    # 1
    management_site = []     # 15
    sub_function = []        # 19
    LER = []                 # 41
    Company_name = []        # 3
    Business_group = []      # 7
    Value_center = []        # 9
    Performance_center = []  # 10
    Profit_center = []       # 11
    total_lines = {}

    with open(filename) as in_file:
        next(in_file)  # skip blank line
        reader = csv.reader(in_file, delimiter=',')

        for line in reader:
            cost_center.append(line[0])
            cost_center_name.append(line[1])
            management_site.append(line[15])
            sub_function.append(line[19])
            LER.append(line[41])
            Company_name.append(line[3])
            Business_group.append(line[7])
            Value_center.append(line[9])
            Performance_center.append(line[10])
            Profit_center.append(line[11])

            # create a dictionary of all the lines with the key being the
            # unique cost center number (cost_center list)
            total_lines[line[0]] = line[1:]

    return (cost_center, cost_center_name, management_site, sub_function,
            LER, Company_name, Business_group, total_lines,
            Value_center, Performance_center, Profit_center)
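The with + csv.reader pattern can be illustrated in isolation like this (the file name and two-column data here are invented, not the OP's real layout):

```python
import csv
import os
import tempfile

# Create a tiny throwaway CSV to read back (purely illustrative data).
path = os.path.join(tempfile.mkdtemp(), 'Data.csv')
with open(path, 'w', newline='') as f:
    f.write('\n')  # the blank first line that the reader skips
    csv.writer(f).writerows([['C100', 'Widgets'], ['C200', 'Gadgets']])

cost_center = []
with open(path) as in_file:  # the file is closed automatically on exit
    next(in_file)            # skip the blank first line
    for line in csv.reader(in_file, delimiter=','):
        cost_center.append(line[0])

print(cost_center)  # ['C100', 'C200']
```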

















  • I'd personally use something like columns = zip(*reader) and then define each value once. cost_center = columns[0]. This would make total_lines a bit more finicky tho.
    – Peilonrayz
    Jun 4 at 10:39










  • @Peilonrayz When I read LER.append(line[41]) and there is only 10 columns of interest, I’m not sure this is really worth it.
    – Mathias Ettinger
    Jun 4 at 12:46
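The zip(*reader) idea from the comment above can be sketched like this (rows and column meanings invented for illustration):

```python
# Transposing rows into columns with zip(*rows): each column comes out
# as a tuple, so every column can be bound to a name in one step.
rows = [
    ['C100', 'Widgets'],
    ['C200', 'Gadgets'],
]
columns = list(zip(*rows))
cost_center = columns[0]

print(cost_center)  # ('C100', 'C200')
```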















answered Jun 4 at 8:15 – Mathias Ettinger
0













Doing



def get_dupes(df):
    if sum(df.loc[1] == 'No') < 2:
        return None
    else:
        return list(df.loc[:, df.loc[1] == 'No'].columns)

df.groupby(axis=1, by=df.loc[0]).apply(get_dupes)


Got me



 0
124 None
123 [1234, 1235]
dtype: object


Your question wasn't quite clear on what you want the output to be if there are multiple company values with duplicate values (e.g. if the input is {1234: ['123', 'No'], 1235: ['123', 'No'], 1236: ['123', 'Yes'], 1237: ['124', 'No'], 1238: ['124', 'No']},
do you want [1234, 1235, 1237, 1238] or [[1234, 1235], [1237, 1238]]), so you can modify this code accordingly.
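For context, the snippet above assumes the question's dictionary was already loaded into a pandas DataFrame, where each key becomes a column and the [code, flag] list becomes rows 0 and 1 (data invented):

```python
import pandas as pd

# Hypothetical input: key -> [company code, Yes/No flag]
Duplicate_combos = {1234: ['123', 'No'], 1235: ['123', 'No'], 1236: ['123', 'Yes']}
df = pd.DataFrame(Duplicate_combos)  # keys become columns, list items become rows

print(df.loc[0].tolist())  # ['123', '123', '123']  (company codes, row 0)
print(df.loc[1].tolist())  # ['No', 'No', 'Yes']    (flags, row 1)
```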














  • 1
    You could just take a look at how the current code behaves to understand what output is expected...
    – Vogel612
    Jun 4 at 10:05






  • 2
    You have presented an alternative solution, but haven't reviewed the code. Please edit to show what aspects of the question code prompted you to write this version, and in what ways it's an improvement over the original. It may be worth (re-)reading How to Answer.
    – Toby Speight
    Jun 4 at 10:07















answered Jun 3 at 23:16 – Acccumulation