What is an entropy graph?
I am new to reversing, and I see a tool called Detect It Easy that has a feature called Entropy. I want to know what it is used for.
entropy
asked Jun 27 at 2:21 by Suman Mandal
edited Jun 27 at 11:56 by 0xC0000022L♦
5 Answers
it has a feature called Entropy. I want to know what it is used for?
For our purposes, entropy can be thought of as information density, or as a measure of randomness in information, which is what makes it useful in the context of reverse engineering and binary analysis.
Compressed and encrypted data have higher entropy than, e.g., code or text data. In fact, compressed and encrypted data have close to the maximum possible level of entropy, which can be used as a heuristic to identify them as such and differentiate them from non-compressed/non-encrypted data.
Example use cases in reverse engineering:
Malware Analysis - If we have an executable whose header can be parsed successfully and which loads and runs without error, but the overall entropy level of the file is very high and the code can't be analyzed statically because the data outside of the file header and program headers looks random (hence the high entropy), it probably means that the executable is in fact compressed on disk and is decompressed at runtime. Executable compression complicates analysis, so it is a relatively common feature of programs developed for criminal purposes. If we want to analyze the code, its decompressed form needs to be recovered somehow.
Firmware Analysis - In systems with relatively severe hardware constraints, such as embedded systems, firmware updates are often delivered in compressed form to save space. To analyse the firmware, you first need to determine whether it is encrypted or compressed; one way to do this is to perform an entropy analysis of the file. If the entropy is very high, it is a good sign that the file is indeed compressed or encrypted, and to proceed with analysis of the actual firmware, it must first be decompressed/decrypted. If we have a block of data with very high entropy (i.e. close to random), it makes no sense to try to treat it as code and disassemble it, because the results will be meaningless nonsense.
File Type Identification - Some file types can be identified on the basis of their overall entropy. For example, we can usually differentiate between image files (png, jpeg, etc) and compiled binaries (ELF, PE) because image files consist of compressed data and therefore (generally) have much higher entropy than compiled binaries.
Besides "Detect It Easy", tools such as binwalk
, ent
and binvis.io can assist with calculating file entropy. You can also build your own tools that do this.
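To make that concrete, here is a minimal sketch of a rolling Shannon-entropy scan over a file. This is not any particular tool's implementation; the window size, the 7.5 bits/byte threshold, and the "firmware.bin" path are arbitrary placeholder choices:

import math
from collections import Counter

def shannon_entropy(data):
    # Shannon entropy in bits per byte: 0.0 for constant data, up to 8.0 for uniform random bytes
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def entropy_profile(path, window=1024):
    # yield (offset, entropy) for consecutive fixed-size windows of the file
    with open(path, "rb") as f:
        offset = 0
        while True:
            chunk = f.read(window)
            if not chunk:
                break
            yield offset, shannon_entropy(chunk)
            offset += len(chunk)

# flag windows that look compressed/encrypted ("firmware.bin" and 7.5 are placeholders)
for off, ent in entropy_profile("firmware.bin"):
    if ent > 7.5:
        print("0x%08x: entropy %.2f - possibly compressed/encrypted" % (off, ent))

Regions where the profile stays near 8 bits/byte are candidates for packed or encrypted data; code and text typically sit well below that.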
answered Jun 27 at 14:10 by julian♦
Entropy is interpreted as the degree of disorder or randomness:
a high entropy means a highly disordered set of data;
a low entropy means an ordered set of data.
To address the comments: order here does not mean an 'a' following 'a' kind of order; it is to be interpreted as the random / non-random state of the data.
"aaaabbbbccccdddd", "abcdabcdabcdabcd", or "adbcadbcadbcadbc" is a repetitive string whose entropy will be greater than that of "aaaaaaaabbbbcccd" or any shuffled representation of that string. In the first string and its shuffled clones, all 4 characters have equal probability (4/16 = 1/4 = 25%), but in the second string the character 'a' (8/16), half of the data set, has the highest probability, while 'd' (1/16) has the least, a very minuscule probability.
Entropy is a thermodynamic concept that was introduced to digital science (information theory) as a means to calculate how random a set of data is. Simply put, the most highly compressed data will have the highest entropy, where all the 255 possible bytes will have equal frequencies: i.e., if 0x00 is seen 10 times in a blob, then 0x10 or 0x80 or 0xff will each also be seen 10 times in the same blob; that is, the blob will be a repeated sequence comprising all bytes between 0x00 and 0xff. A low-entropy blob, by contrast, will be a repeated sequence comprising only a certain byte like 0x00 or 0x55, or two bytes like 0x0d 0x0a or 0x22 0x2e, or any series using fewer than the full range of possible byte values.
Taking an algorithm from here and modifying it a little:
import math
from collections import Counter

# logarithm bases for the different entropy units
base = {
    'shannon' : 2.,            # bits
    'natural' : math.exp(1),   # nats
    'hartley' : 10.,           # hartleys
    'somrand' : 256.           # normalised per byte (0..1)
}

def eta(data, unit):
    # entropy of `data`, using the logarithm base selected by `unit`
    if len(data) <= 1:
        return 0
    counts = Counter()
    for d in data:
        counts[d] += 1
    ent = 0
    probs = [float(c) / len(data) for c in counts.values()]
    for p in probs:
        if p > 0.:
            ent -= p * math.log(p, base[unit])
    return ent

hes = "abcde\x80\x90\xff\xfe\xde"  # high-entropy string: 10 distinct characters
les = "aaaaa\x61\x61\x61\x61\x61"  # low-entropy string: '\x61' is just 'a' again

print("=" * 103)
print(" type            ent for hes          hes                              ent for les    les")
print("=" * 103)
for i in base:
    for j in range(1, 4, 1):
        print(i, ' ', eta(j * hes, i), '\t', (hes * j + (30 - j * 10) * " "),
              ' ', eta(j * les, i), '\t', ("%s" % (les * j)))
You can see that 'abcde\x80...' is high entropy while 'aaaaa\x61...' is low entropy:
:>python foo.py
=======================================================================================================
type ent for hes hes ent for les les
=======================================================================================================
shannon 3.321928094887362 abcdeÿþÞ 0.0 aaaaaaaaaa
shannon 3.321928094887362 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
shannon 3.321928094887362 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
natural 2.3025850929940455 abcdeÿþÞ 0.0 aaaaaaaaaa
natural 2.3025850929940455 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
natural 2.3025850929940455 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
hartley 0.9999999999999998 abcdeÿþÞ 0.0 aaaaaaaaaa
hartley 0.9999999999999998 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
hartley 0.9999999999999998 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
somrand 0.4152410118609203 abcdeÿþÞ 0.0 aaaaaaaaaa
somrand 0.4152410118609203 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
somrand 0.4152410118609203 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
edited Jun 27 at 17:59, answered Jun 27 at 7:19 by blabb
"a high entropy means a highly disordered set of data a low entropy means an ordered set of data" <- This is a false statement. Order is not relevant, because entropy is calculated over a distribution where each value in that distribution has a probability associated with it. Compressed and encrypted data have high entropy because the probability associated with each byte value in the distribution is roughly equal (the distribution of byte values in the data is close to uniform), not because of the order the byte values appear in the bytestream.
– julian♦
Jun 27 at 13:09
"where all the 255 possible bytes will have equal frequencies" <- You probably meant "where all byte values between 0 and 255 (256 total) have an equal probability in the overall distribution" (the frequency of each byte value between 0-255 is the same in the distribution).
– julian♦
Jun 27 at 13:12
@julian what do you mean by order? Like "a follows a, b => b" kind of order? Order in my answer does not mean sorted / sequential / non-sequential data; I meant it as an orderly / non-random state. The most random repetitive data, where the count of each value tends to be equal, has the highest entropy. It may be a military-type ordered / sorted set like aaaabbbbccccdddd (4[a,b,c,d]), but this will tend to have an entropy greater than aaaaaaaabbbbbccd (8[a],4[b],2[c],1[d]). Here is a theory link using hard technical words: en.wikipedia.org/wiki/Entropy_(information_theory)
– blabb
Jun 27 at 16:47
You said “ordered set”, so maybe I misunderstood your meaning. Anyway, your main point about the relationship between entropy and randomness is correct.
– julian♦
Jun 27 at 16:54
@julian: What are you saying??? I do think that blabb is correct. You seem to be stuck on only one definition of entropy, but there are two... which are both perfectly valid. So, stop yelling at everyone please.
– perror
Jun 28 at 7:50
Just to add a (small) piece of information to @blabb's and @Johann Aydinbas's answers, here is a quote from the Practical Malware Analysis book regarding your question:
Packed executables can also be detected via a technique known as entropy calculation. Entropy is a measure of the disorder in a system or program [...] Compressed or encrypted data more closely resembles random data, and therefore has high entropy; executables that are not encrypted or compressed have lower entropy. Automated tools for detecting packed programs often use heuristics like entropy.
You can find additional information here, under the "Increased entropy" header.
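As a quick sketch of that packed-executable heuristic in practice, the third-party pefile library exposes a per-section entropy helper. The 7.2 threshold and "sample.exe" path below are arbitrary assumptions for illustration, not values from the book:

import pefile  # third-party: pip install pefile

# print each PE section's Shannon entropy and flag suspiciously high ones,
# a common packed-executable heuristic ("sample.exe" is a placeholder path)
pe = pefile.PE("sample.exe")
for section in pe.sections:
    name = section.Name.rstrip(b"\x00").decode(errors="replace")
    entropy = section.get_entropy()  # Shannon entropy in bits per byte, 0..8
    flag = " <- possibly packed" if entropy > 7.2 else ""
    print("%-10s entropy=%.2f%s" % (name, entropy, flag))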
answered Jun 27 at 8:30 by bart1e
Shannon's entropy comes from information theory. It is a measure of the degree of randomness of text; for example, a string with greater Shannon entropy tends to make a stronger password. Principally, the Shannon entropy equation provides a way to predict the average minimum number of bits required to encode a string of symbols, based on the frequency of the symbols.
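For reference, the standard form of the equation this refers to, where p_i is the relative frequency of symbol i, is:

H = -\sum_{i} p_i \log_2 p_i

H is then the average minimum number of bits per symbol; replacing \log_2 with a logarithm in another base only changes the unit.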
Note that the base of the logarithm represents the number of possible symbols. Base 2 can be replaced by any base, as can be seen in this code, where it is replaced by 255.
This link has a simple implementation of the algorithm, used for calculating the entropy of novels and religious books. It tells us a lot: for example, that all human-generated books have a nearly identical degree of fluctuation in disorder, which is a nice property of the data.
This is the link to the code mentioned above: Information Entropy of different Books.
answered Jun 27 at 14:09 by Random Science Stuff
First, you have to know that the term entropy is used to refer to two different concepts, which are somehow related if you think twice; but as the relation is really not obvious at first sight, you should prefer to consider these two as different concepts.
Defining Entropy?
The entropy that you want to know about can be defined as the amount of order, disorder, or chaos in a thermodynamic system.
On the other hand, the other entropy comes from information theory and can be seen as a measure of the amount of information that can be stored in a system.
Why is it useful in RE?
An entropy graph (which evaluates the amount of disorder) can be useful to detect the parts of a file that come close to random data. It allows you to detect the parts that have been encrypted/compressed and the parts that appear to have been left untouched.
Indeed, high disorder in data is exactly what you want to achieve when encrypting data. And, as I said, the two entropy definitions are related: if you store a lot of information in a minimum of bytes, the result appears to have a high level of disorder; so does compression...
That is why we use entropy graphs of files: to be able to distinguish raw parts from encrypted/compressed sub-parts without any prior information about the file format.
An Example
For example, here is an entropy graph from the tool binwalk, coming from another question here:
[binwalk entropy graph: entropy vs. file offset]
Directly from this graph we can see a first part that appears to be raw (probably asm opcodes, judging by the shape of the curve), then a part which is most likely encrypted (compression does not usually reach an entropy of 1 with such regularity), and finally padding with a single repeated byte (e.g. 0x00 or 0xff).
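To illustrate the point that compression usually does not reach the per-byte entropy maximum the way encryption does, here is a small sketch comparing zlib-compressed text against random bytes standing in for well-encrypted data. The sample text is an arbitrary choice and the exact numbers will vary:

import math, os, zlib
from collections import Counter

def entropy_bits_per_byte(data):
    # Shannon entropy over byte values, in bits per byte (maximum 8.0)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

text = b"the quick brown fox jumps over the lazy dog " * 500
compressed = zlib.compress(text, 9)        # compressed: high, but typically below the maximum
random_like = os.urandom(len(compressed))  # random bytes, standing in for encrypted data

print("plain text :", round(entropy_bits_per_byte(text), 3))        # low
print("compressed :", round(entropy_bits_per_byte(compressed), 3))  # high
print("random/enc :", round(entropy_bits_per_byte(random_like), 3)) # close to 8.0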
"The entropy that you want to know about can be defined as the amount of order, disorder, or chaos in a thermodynamic system." <- This is a false statement. Software is not a thermodynamic system and does not possesses the property "energy" (heat). Software consists of encoded information, therefore to measure its properties - such as its Shannon entropy - the tools provided by information theory are appropriate.
– julian♦
Jun 27 at 12:02
If you don't believe me, please examine the code that was used to generate the entropy plot in your post. Binwalk calculates information entropy level in terms of either zlib compression ratio or Shannon entropy.
– julian♦
Jun 27 at 12:05
"Your definition of entropy is perfectly valid if you are considering information theory. But, unfortunately, the entropy referred here is much likely the one coming from thermodynamics (i.e. the degree of disorder)." <- The meaning seems quite clear.
– julian♦
Jun 27 at 13:14
it’s nothing personal. It’s just not correct.
– julian♦
Jun 27 at 16:08
It looks personal, just by the way you are harassing me with that.
– perror
Jun 27 at 16:09
5 Answers
5
active
oldest
votes
5 Answers
5
active
oldest
votes
active
oldest
votes
active
oldest
votes
it has a feature called Entropy. I want to know what it is used for?
For our purposes, entropy can be though of as information density or as a measure of randomness in information, which is what makes it useful in the context of reverse engineering and binary analysis.
Compressed and encrypted data have higher entropy than e.g. code or text data. In fact, compressed and encrypted data have close to the maximum possible level of entropy, which can be used as a heuristic to identify it as such in order to differentiate it from non-compressed/non-encrypted data.
Example use cases in reverse engineering:
Malware Analysis - If we have an executable which has a header that can be parsed successfully and the program loads and runs without error, but the overall entropy level of the file is very high and the code can't be analyzed statically because the data outside of the file header and program headers looks random (hence the high entropy), it probably means that the executable is in fact compressed on disk and is decompressed at runtime. Executable compression complicates analysis, so it is a relatively common feature of programs developed for criminal purposes. If we want to analyze the code, its decompressed form need to be recovered somehow.
Firmware Analysis - In systems with relatively severe hardware constraints, such as embedded systems, firmware updates are often delivered in compressed form in order to save space. In order to analyse the firmware, it first needs to be determined whether it is encrypted or compressed. One way to determine this is through performing an entropy analysis of the file. If the entropy is very high, it is a good sign that the file is indeed compressed or encrypted. To proceed with analysis of the actual firmware, it must first be decompressed/decrypted. If we have a block of data with very high entropy (i.e. close to random), it makes no sense to try to treat it as code and disassemble it, because the results will be meaningless nonsense.
File Type Identification - Some file types can be identified on the basis of their overall entropy. For example, we can usually differentiate between image files (png, jpeg, etc) and compiled binaries (ELF, PE) because image files consist of compressed data and therefore (generally) have much higher entropy than compiled binaries.
Besides "Detect It Easy", tools such as binwalk
, ent
and binvis.io can assist with calculating file entropy. You can also build your own tools that do this.
add a comment |
it has a feature called Entropy. I want to know what it is used for?
For our purposes, entropy can be though of as information density or as a measure of randomness in information, which is what makes it useful in the context of reverse engineering and binary analysis.
Compressed and encrypted data have higher entropy than e.g. code or text data. In fact, compressed and encrypted data have close to the maximum possible level of entropy, which can be used as a heuristic to identify it as such in order to differentiate it from non-compressed/non-encrypted data.
Example use cases in reverse engineering:
Malware Analysis - If we have an executable which has a header that can be parsed successfully and the program loads and runs without error, but the overall entropy level of the file is very high and the code can't be analyzed statically because the data outside of the file header and program headers looks random (hence the high entropy), it probably means that the executable is in fact compressed on disk and is decompressed at runtime. Executable compression complicates analysis, so it is a relatively common feature of programs developed for criminal purposes. If we want to analyze the code, its decompressed form need to be recovered somehow.
Firmware Analysis - In systems with relatively severe hardware constraints, such as embedded systems, firmware updates are often delivered in compressed form in order to save space. In order to analyse the firmware, it first needs to be determined whether it is encrypted or compressed. One way to determine this is through performing an entropy analysis of the file. If the entropy is very high, it is a good sign that the file is indeed compressed or encrypted. To proceed with analysis of the actual firmware, it must first be decompressed/decrypted. If we have a block of data with very high entropy (i.e. close to random), it makes no sense to try to treat it as code and disassemble it, because the results will be meaningless nonsense.
File Type Identification - Some file types can be identified on the basis of their overall entropy. For example, we can usually differentiate between image files (png, jpeg, etc) and compiled binaries (ELF, PE) because image files consist of compressed data and therefore (generally) have much higher entropy than compiled binaries.
Besides "Detect It Easy", tools such as binwalk
, ent
and binvis.io can assist with calculating file entropy. You can also build your own tools that do this.
add a comment |
it has a feature called Entropy. I want to know what it is used for?
For our purposes, entropy can be though of as information density or as a measure of randomness in information, which is what makes it useful in the context of reverse engineering and binary analysis.
Compressed and encrypted data have higher entropy than e.g. code or text data. In fact, compressed and encrypted data have close to the maximum possible level of entropy, which can be used as a heuristic to identify it as such in order to differentiate it from non-compressed/non-encrypted data.
Example use cases in reverse engineering:
Malware Analysis - If we have an executable which has a header that can be parsed successfully and the program loads and runs without error, but the overall entropy level of the file is very high and the code can't be analyzed statically because the data outside of the file header and program headers looks random (hence the high entropy), it probably means that the executable is in fact compressed on disk and is decompressed at runtime. Executable compression complicates analysis, so it is a relatively common feature of programs developed for criminal purposes. If we want to analyze the code, its decompressed form need to be recovered somehow.
Firmware Analysis - In systems with relatively severe hardware constraints, such as embedded systems, firmware updates are often delivered in compressed form in order to save space. In order to analyse the firmware, it first needs to be determined whether it is encrypted or compressed. One way to determine this is through performing an entropy analysis of the file. If the entropy is very high, it is a good sign that the file is indeed compressed or encrypted. To proceed with analysis of the actual firmware, it must first be decompressed/decrypted. If we have a block of data with very high entropy (i.e. close to random), it makes no sense to try to treat it as code and disassemble it, because the results will be meaningless nonsense.
File Type Identification - Some file types can be identified on the basis of their overall entropy. For example, we can usually differentiate between image files (png, jpeg, etc) and compiled binaries (ELF, PE) because image files consist of compressed data and therefore (generally) have much higher entropy than compiled binaries.
Besides "Detect It Easy", tools such as binwalk
, ent
and binvis.io can assist with calculating file entropy. You can also build your own tools that do this.
it has a feature called Entropy. I want to know what it is used for?
For our purposes, entropy can be though of as information density or as a measure of randomness in information, which is what makes it useful in the context of reverse engineering and binary analysis.
Compressed and encrypted data have higher entropy than e.g. code or text data. In fact, compressed and encrypted data have close to the maximum possible level of entropy, which can be used as a heuristic to identify it as such in order to differentiate it from non-compressed/non-encrypted data.
Example use cases in reverse engineering:
Malware Analysis - If we have an executable which has a header that can be parsed successfully and the program loads and runs without error, but the overall entropy level of the file is very high and the code can't be analyzed statically because the data outside of the file header and program headers looks random (hence the high entropy), it probably means that the executable is in fact compressed on disk and is decompressed at runtime. Executable compression complicates analysis, so it is a relatively common feature of programs developed for criminal purposes. If we want to analyze the code, its decompressed form need to be recovered somehow.
Firmware Analysis - In systems with relatively severe hardware constraints, such as embedded systems, firmware updates are often delivered in compressed form in order to save space. In order to analyse the firmware, it first needs to be determined whether it is encrypted or compressed. One way to determine this is through performing an entropy analysis of the file. If the entropy is very high, it is a good sign that the file is indeed compressed or encrypted. To proceed with analysis of the actual firmware, it must first be decompressed/decrypted. If we have a block of data with very high entropy (i.e. close to random), it makes no sense to try to treat it as code and disassemble it, because the results will be meaningless nonsense.
File Type Identification - Some file types can be identified on the basis of their overall entropy. For example, we can usually differentiate between image files (png, jpeg, etc) and compiled binaries (ELF, PE) because image files consist of compressed data and therefore (generally) have much higher entropy than compiled binaries.
Besides "Detect It Easy", tools such as binwalk
, ent
and binvis.io can assist with calculating file entropy. You can also build your own tools that do this.
answered Jun 27 at 14:10
julian♦julian
4,4092 gold badges11 silver badges42 bronze badges
4,4092 gold badges11 silver badges42 bronze badges
add a comment |
add a comment |
Entropy is interpreted as the Degree of Disorder or Randomness
a high entropy means a highly disordered set of data
a low entropy means an ordered set of data
to address the comments
order here does not mean 'a' following 'a' kind of order it is to be interpreted as random / non random state of certain data
aaaabbbbccccdddd or "abcdabcdabcdabcd" or "adbcadbcadbcadbc" is a repetitive string whose entropy will be greater than
aaaaaaaabbbbcccd or any shuffled representation of this string
in the first string and its shuffled clones all have 4 chars with equal probability 4/16 or 1/4 or 25%
but in the second string char 'a' (8/16 ) or half of the data set has the highest probability
while 'c' (1/16) has the least or a very minuscule probability
entropy is a thermodynamic concept that was introduced to digital science (information theory)
as a means to calculate how random a set of data is
simply put the highest compressed data will have the highest entropy
where all the 255 possible bytes will have equal frequencies
ie if 0x00 was seen 10 times in a blob
0x10 or 0x80 or 0xff will all be seen 10 times in the same blob
that is the blob will be a repeated sequence comprising of all bytes between of 0x0..0xff
while a low entropy blob will have a repeated sequence comprising only of a certain byte like 0x00 0r 0x55 or two bytes 0x0d0a ox222e etc or any series one less than 255 possible byte sequences
taking an algo from here and modifying it a little
import math
from collections import Counter
base =
'shannon' : 2.,
'natural' : math.exp(1),
'hartley' : 10.,
'somrand' : 256.
def eta(data, unit):
if len(data) <= 1:
return 0
counts = Counter()
for d in data:
counts[d] += 1
ent = 0
probs = [float(c) / len(data) for c in counts.values()]
for p in probs:
if p > 0.:
ent -= p * math.log(p, base[unit])
return ent
hes = "abcdex80x90xffxfexde"
les = "aaaaax61x61x61x61x61"
print ("=======================================================================================================")
print (" type ent for hes hes ent for les les")
print ("=======================================================================================================")
for i in base:
for j in range(1,4,1):
print (i ,' ', eta( j*hes,i) , 't', (hes*j + (30 -j *10) *" " ) , ' ' , eta (j*les , i) ,'t', ("%s" % les*j ))
you can see 'abcdex80.....' is high entropy while 'aaaaax61...' is low entropy
:>python foo.py
=======================================================================================================
type ent for hes hes ent for les les
=======================================================================================================
shannon 3.321928094887362 abcdeÿþÞ 0.0 aaaaaaaaaa
shannon 3.321928094887362 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
shannon 3.321928094887362 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
natural 2.3025850929940455 abcdeÿþÞ 0.0 aaaaaaaaaa
natural 2.3025850929940455 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
natural 2.3025850929940455 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
hartley 0.9999999999999998 abcdeÿþÞ 0.0 aaaaaaaaaa
hartley 0.9999999999999998 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
hartley 0.9999999999999998 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
somrand 0.4152410118609203 abcdeÿþÞ 0.0 aaaaaaaaaa
somrand 0.4152410118609203 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
somrand 0.4152410118609203 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3
"a high entropy means a highly disordered set of data a low entropy means an ordered set of data" <- This is a false statement. Order is not relevant, because entropy is calculated over a distribution where each value in that distribution has a probability associated with it. Compressed and encrypted data have high entropy because the probability associated with each byte value in the distribution is roughly equal (the distribution of byte values in the data is close to uniform), not because of the order the byte values appear in the bytestream.
– julian♦
Jun 27 at 13:09
"where all the 255 possible bytes will have equal frequencies" <- You probably meant "where all byte values between 0 and 255 (256 total) have an equal probability in the overall distribution" (the frequency of each byte value between 0-255 is the same in the distribution).
– julian♦
Jun 27 at 13:12
1
@julian what do you mean by order ? like a follows a b => b kind of order ? order in my answer does not mean a sorted / sequential / non sequential data i meant it as in an orderly / non random state the most random repetitive data where the count of each value tends to be equal has the highest entropy it may be a (military type ordered / sorted set like aaaabbbbccccdddd 4[a,b,c,d] but this will tend to have an entropy greater than aaaaaaaabbbbbccd 8[a],4[b],2[c],1[d] here is a theory link using hard technical words en.wikipedia.org/wiki/Entropy_(information_theory)
– blabb
Jun 27 at 16:47
You said “ordered set”, so maybe I misunderstood your meaning. Anyway, your main point about the relationship between entropy and randomness is correct.
– julian♦
Jun 27 at 16:54
@julian: What are you saying ??? I do think that blabb is correct. You seems to be stuck on only one definition of entropy, but they are two... which are both perfectly valid. So, stop yelling at everyone please.
– perror
Jun 28 at 7:50
add a comment |
Entropy is interpreted as the Degree of Disorder or Randomness
a high entropy means a highly disordered set of data
a low entropy means an ordered set of data
to address the comments
order here does not mean 'a' following 'a' kind of order it is to be interpreted as random / non random state of certain data
aaaabbbbccccdddd or "abcdabcdabcdabcd" or "adbcadbcadbcadbc" is a repetitive string whose entropy will be greater than
aaaaaaaabbbbcccd or any shuffled representation of this string
in the first string and its shuffled clones all have 4 chars with equal probability 4/16 or 1/4 or 25%
but in the second string char 'a' (8/16 ) or half of the data set has the highest probability
while 'c' (1/16) has the least or a very minuscule probability
entropy is a thermodynamic concept that was introduced to digital science (information theory)
as a means to calculate how random a set of data is
simply put the highest compressed data will have the highest entropy
where all the 255 possible bytes will have equal frequencies
ie if 0x00 was seen 10 times in a blob
0x10 or 0x80 or 0xff will all be seen 10 times in the same blob
that is the blob will be a repeated sequence comprising of all bytes between of 0x0..0xff
while a low entropy blob will have a repeated sequence comprising only of a certain byte like 0x00 0r 0x55 or two bytes 0x0d0a ox222e etc or any series one less than 255 possible byte sequences
taking an algo from here and modifying it a little
import math
from collections import Counter
base =
'shannon' : 2.,
'natural' : math.exp(1),
'hartley' : 10.,
'somrand' : 256.
def eta(data, unit):
if len(data) <= 1:
return 0
counts = Counter()
for d in data:
counts[d] += 1
ent = 0
probs = [float(c) / len(data) for c in counts.values()]
for p in probs:
if p > 0.:
ent -= p * math.log(p, base[unit])
return ent
hes = "abcdex80x90xffxfexde"
les = "aaaaax61x61x61x61x61"
print ("=======================================================================================================")
print (" type ent for hes hes ent for les les")
print ("=======================================================================================================")
for i in base:
for j in range(1,4,1):
print (i ,' ', eta( j*hes,i) , 't', (hes*j + (30 -j *10) *" " ) , ' ' , eta (j*les , i) ,'t', ("%s" % les*j ))
you can see 'abcdex80.....' is high entropy while 'aaaaax61...' is low entropy
:>python foo.py
=======================================================================================================
type ent for hes hes ent for les les
=======================================================================================================
shannon 3.321928094887362 abcdeÿþÞ 0.0 aaaaaaaaaa
shannon 3.321928094887362 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
shannon 3.321928094887362 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
natural 2.3025850929940455 abcdeÿþÞ 0.0 aaaaaaaaaa
natural 2.3025850929940455 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
natural 2.3025850929940455 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
hartley 0.9999999999999998 abcdeÿþÞ 0.0 aaaaaaaaaa
hartley 0.9999999999999998 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
hartley 0.9999999999999998 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
somrand 0.4152410118609203 abcdeÿþÞ 0.0 aaaaaaaaaa
somrand 0.4152410118609203 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
somrand 0.4152410118609203 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3
"a high entropy means a highly disordered set of data a low entropy means an ordered set of data" <- This is a false statement. Order is not relevant, because entropy is calculated over a distribution where each value in that distribution has a probability associated with it. Compressed and encrypted data have high entropy because the probability associated with each byte value in the distribution is roughly equal (the distribution of byte values in the data is close to uniform), not because of the order the byte values appear in the bytestream.
– julian♦
Jun 27 at 13:09
"where all the 255 possible bytes will have equal frequencies" <- You probably meant "where all byte values between 0 and 255 (256 total) have an equal probability in the overall distribution" (the frequency of each byte value between 0-255 is the same in the distribution).
– julian♦
Jun 27 at 13:12
1
@julian what do you mean by order ? like a follows a b => b kind of order ? order in my answer does not mean a sorted / sequential / non sequential data i meant it as in an orderly / non random state the most random repetitive data where the count of each value tends to be equal has the highest entropy it may be a (military type ordered / sorted set like aaaabbbbccccdddd 4[a,b,c,d] but this will tend to have an entropy greater than aaaaaaaabbbbbccd 8[a],4[b],2[c],1[d] here is a theory link using hard technical words en.wikipedia.org/wiki/Entropy_(information_theory)
– blabb
Jun 27 at 16:47
You said “ordered set”, so maybe I misunderstood your meaning. Anyway, your main point about the relationship between entropy and randomness is correct.
– julian♦
Jun 27 at 16:54
@julian: What are you saying ??? I do think that blabb is correct. You seems to be stuck on only one definition of entropy, but they are two... which are both perfectly valid. So, stop yelling at everyone please.
– perror
Jun 28 at 7:50
add a comment |
Entropy is interpreted as the Degree of Disorder or Randomness
a high entropy means a highly disordered set of data
a low entropy means an ordered set of data
to address the comments
order here does not mean 'a' following 'a' kind of order it is to be interpreted as random / non random state of certain data
aaaabbbbccccdddd or "abcdabcdabcdabcd" or "adbcadbcadbcadbc" is a repetitive string whose entropy will be greater than
aaaaaaaabbbbcccd or any shuffled representation of this string
in the first string and its shuffled clones all have 4 chars with equal probability 4/16 or 1/4 or 25%
but in the second string char 'a' (8/16 ) or half of the data set has the highest probability
while 'c' (1/16) has the least or a very minuscule probability
entropy is a thermodynamic concept that was introduced to digital science (information theory)
as a means to calculate how random a set of data is
simply put the highest compressed data will have the highest entropy
where all the 255 possible bytes will have equal frequencies
ie if 0x00 was seen 10 times in a blob
0x10 or 0x80 or 0xff will all be seen 10 times in the same blob
that is the blob will be a repeated sequence comprising of all bytes between of 0x0..0xff
while a low entropy blob will have a repeated sequence comprising only of a certain byte like 0x00 0r 0x55 or two bytes 0x0d0a ox222e etc or any series one less than 255 possible byte sequences
taking an algo from here and modifying it a little
import math
from collections import Counter
base =
'shannon' : 2.,
'natural' : math.exp(1),
'hartley' : 10.,
'somrand' : 256.
def eta(data, unit):
if len(data) <= 1:
return 0
counts = Counter()
for d in data:
counts[d] += 1
ent = 0
probs = [float(c) / len(data) for c in counts.values()]
for p in probs:
if p > 0.:
ent -= p * math.log(p, base[unit])
return ent
hes = "abcdex80x90xffxfexde"
les = "aaaaax61x61x61x61x61"
print ("=======================================================================================================")
print (" type ent for hes hes ent for les les")
print ("=======================================================================================================")
for i in base:
for j in range(1,4,1):
print (i ,' ', eta( j*hes,i) , 't', (hes*j + (30 -j *10) *" " ) , ' ' , eta (j*les , i) ,'t', ("%s" % les*j ))
you can see 'abcdex80.....' is high entropy while 'aaaaax61...' is low entropy
:>python foo.py
=======================================================================================================
type ent for hes hes ent for les les
=======================================================================================================
shannon 3.321928094887362 abcdeÿþÞ 0.0 aaaaaaaaaa
shannon 3.321928094887362 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
shannon 3.321928094887362 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
natural 2.3025850929940455 abcdeÿþÞ 0.0 aaaaaaaaaa
natural 2.3025850929940455 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
natural 2.3025850929940455 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
hartley 0.9999999999999998 abcdeÿþÞ 0.0 aaaaaaaaaa
hartley 0.9999999999999998 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
hartley 0.9999999999999998 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
somrand 0.4152410118609203 abcdeÿþÞ 0.0 aaaaaaaaaa
somrand 0.4152410118609203 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
somrand 0.4152410118609203 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Entropy is interpreted as the Degree of Disorder or Randomness
a high entropy means a highly disordered set of data
a low entropy means an ordered set of data
to address the comments
order here does not mean 'a' following 'a' kind of order it is to be interpreted as random / non random state of certain data
aaaabbbbccccdddd or "abcdabcdabcdabcd" or "adbcadbcadbcadbc" is a repetitive string whose entropy will be greater than
aaaaaaaabbbbcccd or any shuffled representation of this string
in the first string and its shuffled clones all have 4 chars with equal probability 4/16 or 1/4 or 25%
but in the second string char 'a' (8/16 ) or half of the data set has the highest probability
while 'c' (1/16) has the least or a very minuscule probability
entropy is a thermodynamic concept that was introduced to digital science (information theory)
as a means to calculate how random a set of data is
simply put the highest compressed data will have the highest entropy
where all the 255 possible bytes will have equal frequencies
ie if 0x00 was seen 10 times in a blob
0x10 or 0x80 or 0xff will all be seen 10 times in the same blob
that is the blob will be a repeated sequence comprising of all bytes between of 0x0..0xff
while a low entropy blob will have a repeated sequence comprising only of a certain byte like 0x00 0r 0x55 or two bytes 0x0d0a ox222e etc or any series one less than 255 possible byte sequences
taking an algo from here and modifying it a little
import math
from collections import Counter
base =
'shannon' : 2.,
'natural' : math.exp(1),
'hartley' : 10.,
'somrand' : 256.
def eta(data, unit):
if len(data) <= 1:
return 0
counts = Counter()
for d in data:
counts[d] += 1
ent = 0
probs = [float(c) / len(data) for c in counts.values()]
for p in probs:
if p > 0.:
ent -= p * math.log(p, base[unit])
return ent
hes = "abcdex80x90xffxfexde"
les = "aaaaax61x61x61x61x61"
print ("=======================================================================================================")
print (" type ent for hes hes ent for les les")
print ("=======================================================================================================")
for i in base:
for j in range(1,4,1):
print (i ,' ', eta( j*hes,i) , 't', (hes*j + (30 -j *10) *" " ) , ' ' , eta (j*les , i) ,'t', ("%s" % les*j ))
you can see 'abcdex80.....' is high entropy while 'aaaaax61...' is low entropy
:>python foo.py
=======================================================================================================
type ent for hes hes ent for les les
=======================================================================================================
shannon 3.321928094887362 abcdeÿþÞ 0.0 aaaaaaaaaa
shannon 3.321928094887362 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
shannon 3.321928094887362 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
natural 2.3025850929940455 abcdeÿþÞ 0.0 aaaaaaaaaa
natural 2.3025850929940455 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
natural 2.3025850929940455 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
hartley 0.9999999999999998 abcdeÿþÞ 0.0 aaaaaaaaaa
hartley 0.9999999999999998 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
hartley 0.9999999999999998 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
somrand 0.4152410118609203 abcdeÿþÞ 0.0 aaaaaaaaaa
somrand 0.4152410118609203 abcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaa
somrand 0.4152410118609203 abcdeÿþÞabcdeÿþÞabcdeÿþÞ 0.0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
edited Jun 27 at 17:59
answered Jun 27 at 7:19
blabbblabb
9,7731 gold badge7 silver badges24 bronze badges
9,7731 gold badge7 silver badges24 bronze badges
3
"a high entropy means a highly disordered set of data a low entropy means an ordered set of data" <- This is a false statement. Order is not relevant, because entropy is calculated over a distribution where each value in that distribution has a probability associated with it. Compressed and encrypted data have high entropy because the probability associated with each byte value in the distribution is roughly equal (the distribution of byte values in the data is close to uniform), not because of the order the byte values appear in the bytestream.
– julian♦
Jun 27 at 13:09
"where all the 255 possible bytes will have equal frequencies" <- You probably meant "where all byte values between 0 and 255 (256 total) have an equal probability in the overall distribution" (the frequency of each byte value between 0-255 is the same in the distribution).
– julian♦
Jun 27 at 13:12
1
@julian what do you mean by order ? like a follows a b => b kind of order ? order in my answer does not mean a sorted / sequential / non sequential data i meant it as in an orderly / non random state the most random repetitive data where the count of each value tends to be equal has the highest entropy it may be a (military type ordered / sorted set like aaaabbbbccccdddd 4[a,b,c,d] but this will tend to have an entropy greater than aaaaaaaabbbbbccd 8[a],4[b],2[c],1[d] here is a theory link using hard technical words en.wikipedia.org/wiki/Entropy_(information_theory)
– blabb
Jun 27 at 16:47
You said “ordered set”, so maybe I misunderstood your meaning. Anyway, your main point about the relationship between entropy and randomness is correct.
– julian♦
Jun 27 at 16:54
@julian: What are you saying ??? I do think that blabb is correct. You seems to be stuck on only one definition of entropy, but they are two... which are both perfectly valid. So, stop yelling at everyone please.
– perror
Jun 28 at 7:50
add a comment |
3
"a high entropy means a highly disordered set of data a low entropy means an ordered set of data" <- This is a false statement. Order is not relevant, because entropy is calculated over a distribution where each value in that distribution has a probability associated with it. Compressed and encrypted data have high entropy because the probability associated with each byte value in the distribution is roughly equal (the distribution of byte values in the data is close to uniform), not because of the order the byte values appear in the bytestream.
– julian♦
Jun 27 at 13:09
"where all the 255 possible bytes will have equal frequencies" <- You probably meant "where all byte values between 0 and 255 (256 total) have an equal probability in the overall distribution" (the frequency of each byte value between 0-255 is the same in the distribution).
– julian♦
Jun 27 at 13:12
1
@julian what do you mean by order ? like a follows a b => b kind of order ? order in my answer does not mean a sorted / sequential / non sequential data i meant it as in an orderly / non random state the most random repetitive data where the count of each value tends to be equal has the highest entropy it may be a (military type ordered / sorted set like aaaabbbbccccdddd 4[a,b,c,d] but this will tend to have an entropy greater than aaaaaaaabbbbbccd 8[a],4[b],2[c],1[d] here is a theory link using hard technical words en.wikipedia.org/wiki/Entropy_(information_theory)
– blabb
Jun 27 at 16:47
You said “ordered set”, so maybe I misunderstood your meaning. Anyway, your main point about the relationship between entropy and randomness is correct.
– julian♦
Jun 27 at 16:54
@julian: What are you saying ??? I do think that blabb is correct. You seems to be stuck on only one definition of entropy, but they are two... which are both perfectly valid. So, stop yelling at everyone please.
– perror
Jun 28 at 7:50
3
3
"a high entropy means a highly disordered set of data a low entropy means an ordered set of data" <- This is a false statement. Order is not relevant, because entropy is calculated over a distribution where each value in that distribution has a probability associated with it. Compressed and encrypted data have high entropy because the probability associated with each byte value in the distribution is roughly equal (the distribution of byte values in the data is close to uniform), not because of the order the byte values appear in the bytestream.
– julian♦
Jun 27 at 13:09
"a high entropy means a highly disordered set of data a low entropy means an ordered set of data" <- This is a false statement. Order is not relevant, because entropy is calculated over a distribution where each value in that distribution has a probability associated with it. Compressed and encrypted data have high entropy because the probability associated with each byte value in the distribution is roughly equal (the distribution of byte values in the data is close to uniform), not because of the order the byte values appear in the bytestream.
– julian♦
Jun 27 at 13:09
"where all the 255 possible bytes will have equal frequencies" <- You probably meant "where all byte values between 0 and 255 (256 total) have an equal probability in the overall distribution" (the frequency of each byte value between 0-255 is the same in the distribution).
– julian♦
Jun 27 at 13:12
"where all the 255 possible bytes will have equal frequencies" <- You probably meant "where all byte values between 0 and 255 (256 total) have an equal probability in the overall distribution" (the frequency of each byte value between 0-255 is the same in the distribution).
– julian♦
Jun 27 at 13:12
1
1
@julian what do you mean by order ? like a follows a b => b kind of order ? order in my answer does not mean a sorted / sequential / non sequential data i meant it as in an orderly / non random state the most random repetitive data where the count of each value tends to be equal has the highest entropy it may be a (military type ordered / sorted set like aaaabbbbccccdddd 4[a,b,c,d] but this will tend to have an entropy greater than aaaaaaaabbbbbccd 8[a],4[b],2[c],1[d] here is a theory link using hard technical words en.wikipedia.org/wiki/Entropy_(information_theory)
– blabb
Jun 27 at 16:47
@julian what do you mean by order ? like a follows a b => b kind of order ? order in my answer does not mean a sorted / sequential / non sequential data i meant it as in an orderly / non random state the most random repetitive data where the count of each value tends to be equal has the highest entropy it may be a (military type ordered / sorted set like aaaabbbbccccdddd 4[a,b,c,d] but this will tend to have an entropy greater than aaaaaaaabbbbbccd 8[a],4[b],2[c],1[d] here is a theory link using hard technical words en.wikipedia.org/wiki/Entropy_(information_theory)
– blabb
Jun 27 at 16:47
You said “ordered set”, so maybe I misunderstood your meaning. Anyway, your main point about the relationship between entropy and randomness is correct.
– julian♦
Jun 27 at 16:54
You said “ordered set”, so maybe I misunderstood your meaning. Anyway, your main point about the relationship between entropy and randomness is correct.
– julian♦
Jun 27 at 16:54
@julian: What are you saying ??? I do think that blabb is correct. You seems to be stuck on only one definition of entropy, but they are two... which are both perfectly valid. So, stop yelling at everyone please.
– perror
Jun 28 at 7:50
@julian: What are you saying ??? I do think that blabb is correct. You seems to be stuck on only one definition of entropy, but they are two... which are both perfectly valid. So, stop yelling at everyone please.
– perror
Jun 28 at 7:50
add a comment |
Just to add (small) piece of information to @blabb and @Johann Aydinbas answers, here is the cite from Practical Malware Analysis book regarding your question:
Packed executables can also be detected via a technique known as entropy
calculation. Entropy is a measure of the disorder in a system or program [...]
Compressed or encrypted data more closely resembles random data,
and therefore has high entropy; executables that are not encrypted or compressed have lower entropy. Automated tools for detecting packed programs often use heuristics like entropy.
You can find additional information here, under Increased entropy header.
add a comment |
Just to add (small) piece of information to @blabb and @Johann Aydinbas answers, here is the cite from Practical Malware Analysis book regarding your question:
Packed executables can also be detected via a technique known as entropy
calculation. Entropy is a measure of the disorder in a system or program [...]
Compressed or encrypted data more closely resembles random data,
and therefore has high entropy; executables that are not encrypted or compressed have lower entropy. Automated tools for detecting packed programs often use heuristics like entropy.
You can find additional information here, under Increased entropy header.
add a comment |
Just to add (small) piece of information to @blabb and @Johann Aydinbas answers, here is the cite from Practical Malware Analysis book regarding your question:
Packed executables can also be detected via a technique known as entropy
calculation. Entropy is a measure of the disorder in a system or program [...]
Compressed or encrypted data more closely resembles random data,
and therefore has high entropy; executables that are not encrypted or compressed have lower entropy. Automated tools for detecting packed programs often use heuristics like entropy.
You can find additional information here, under Increased entropy header.
Just to add (small) piece of information to @blabb and @Johann Aydinbas answers, here is the cite from Practical Malware Analysis book regarding your question:
Packed executables can also be detected via a technique known as entropy
calculation. Entropy is a measure of the disorder in a system or program [...]
Compressed or encrypted data more closely resembles random data,
and therefore has high entropy; executables that are not encrypted or compressed have lower entropy. Automated tools for detecting packed programs often use heuristics like entropy.
You can find additional information here, under Increased entropy header.
answered Jun 27 at 8:30
bart1ebart1e
8621 gold badge1 silver badge12 bronze badges
8621 gold badge1 silver badge12 bronze badges
add a comment |
add a comment |
Shannon's entropy comes from information theory. It is the measure of degree of randomness of text. If a string has greater Shannon's entropy it means it's a strong password. Principally, Shannon entropy equation provides a way to predict the average minimum number of bits required to encode a string of symbols, based on the frequency of the symbols.
Note that the base represents the number of possible characters. Base 2 can be replaced by any base. As can be seen in this code where it's replaced by 255.
This link has a simplest implementation of the algorithm for calculating entropy of novels and religious books. It tells us a lot. For example, that all the human generated books have nearly identical degree of fluctuation between disorder. It is a good feature of data.
This is the link to code mentioned above.
Information Entropy of different Books
answered Jun 27 at 14:09
Random Science Stuff
31 • 1 bronze badge
add a comment |
First, you have to know that the term entropy is used to refer to two different concepts. They are related if you think about it, but since the connection is not obvious at first sight, it is easier to treat them as distinct.
Defining Entropy?
The entropy that you want to know about can be defined as the amount of order, disorder, or chaos in a thermodynamic system.
On the other hand, the other entropy comes from information theory and can be seen as a measure of the amount of information that can be stored in a system.
Why is it useful in RE?
An entropy graph (which evaluates the amount of disorder) can be useful to detect the parts of a file that come close to random data. It allows you to spot the parts that have been encrypted or compressed and the parts that appear to have been left untouched.
Indeed, high disorder in the data is exactly what you want to achieve when encrypting it. And, as said above, the two entropy definitions are related: if you store a lot of information in a minimum of bytes, the result appears highly disordered, and the same goes for compression.
That is why we use entropy graphs of files: to distinguish raw parts from encrypted/compressed sub-parts without any prior knowledge of the file format.
An Example
For example, here is an entropy graph from the tool binwalk, taken from another question on this site:
Directly from this graph we can see a first part that appears to be raw (probably asm opcodes, judging by the shape of the curve), then a part which is most likely encrypted (compression does not usually reach an entropy of 1 with such regularity), and finally padding made of a single repeated byte (e.g. 0x00 or 0xff).
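For a hands-on approximation of such a graph, here is a minimal sliding-window sketch in Python. It is not binwalk's implementation; the window size and the file name firmware.bin are assumptions, and `binwalk -E firmware.bin` should produce a comparable plot directly.
```python
import math
from collections import Counter

def window_entropy(data: bytes, window: int = 1024):
    """Yield (offset, entropy) pairs, with entropy normalized to 0..1."""
    for off in range(0, len(data), window):
        chunk = data[off:off + window]
        n = len(chunk)
        h = -sum((c / n) * math.log2(c / n) for c in Counter(chunk).values())
        yield off, h / 8.0  # 8 bits/byte is the maximum, so divide to normalize

# "firmware.bin" is a placeholder; plot offset vs. entropy to reproduce
# the kind of curve shown above (code ~0.6-0.8, encrypted ~1.0, padding ~0).
with open("firmware.bin", "rb") as f:
    for off, h in window_entropy(f.read()):
        print(f"0x{off:08x}  {h:.3f}")
```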
2
"The entropy that you want to know about can be defined as the amount of order, disorder, or chaos in a thermodynamic system." <- This is a false statement. Software is not a thermodynamic system and does not possesses the property "energy" (heat). Software consists of encoded information, therefore to measure its properties - such as its Shannon entropy - the tools provided by information theory are appropriate.
– julian♦
Jun 27 at 12:02
2
If you don't believe me, please examine the code that was used to generate the entropy plot in your post. Binwalk calculates information entropy level in terms of either zlib compression ratio or Shannon entropy.
– julian♦
Jun 27 at 12:05
1
"Your definition of entropy is perfectly valid if you are considering information theory. But, unfortunately, the entropy referred here is much likely the one coming from thermodynamics (i.e. the degree of disorder)." <- The meaning seems quite clear.
– julian♦
Jun 27 at 13:14
1
it’s nothing personal. It’s just not correct.
– julian♦
Jun 27 at 16:08
1
It looks personal, just by the way you are harassing me with that.
– perror
Jun 27 at 16:09
show 3 more comments
edited Jun 28 at 12:16
answered Jun 27 at 8:55
perror
11.7k • 18 gold badges • 71 silver badges • 131 bronze badges