File / Data / Script management as a PhD


I started my PhD 7 months ago, and as I generate more and more data and do more and more analysis, my folder structure is getting out of hand. I would like to ask for best practices and opinions on how to organize my files so that I don't lose track of everything and can quickly find what I need. I am a geoscientist and have lots of analytical data as well as programming scripts. Alongside those, I also have README.md files for some analyses and snippets of MS Word or plain text where I write something down to remember it, possibly for a paper.

I've made it a habit not to edit raw data at all and to back up regularly, for obvious reasons.

Right now my simplified general structure looks a bit like this:

├── data
│   ├── analysis
│   │   ├── isotope_temperature_reconstruction
│   │   │   ├── report.ipynb
│   │   │   └── script523.py
│   │   └── light_micro_growth-line-analysis
│   │       ├── img1.svg
│   │       └── regression.py
│   └── raw
│       ├── isotopes
│       │   ├── run2019_02_19
│       │   └── run2019_02_24
│       ├── light_microscope
│       │   ├── sample_xy123123
│       │   └── sample_xy123124
│       └── sem
│           ├── sample_xy123123
│           └── sample_xy123124
└── documents
    └── paper1

I know the system my files are organized in doesn't really matter as long as it is consistent. However, I am facing some struggles:

The data usually varies in "quality" and "ripeness". I have data that:
- resulted from some trivial test -> won't ever be used again
- is for calibration -> doesn't belong to any project, but matters in many cases
- is directly and only needed for a certain publication

My problems with this are:

I often find myself linking and copy-pasting my data all over the place, because it is not where it's currently needed. As a consequence, I also make edits only in certain places and not everywhere, and lose track of which file is the most recent. I also lose track of which files are handled programmatically and where I edited something by hand.

I have a lot of duplicate Python/R/whatever scripts that I copy-paste wherever needed. I think this is the easier part to resolve, by modularizing the code and putting it into version-controlled, system-wide libraries.

I sometimes have snippets of Word or plain text that contain relevant research insights, but they are scattered all over the place, because they are not directly related to a paper or dataset.

So I am looking for suggestions to address these problems, as well as suggestions for general file and data management, and for general organization on a researcher's main desktop machine.

The only problem that I feel is adequately solved is my literature management, because I just use Zotero and let it organize all my papers in a coherent folder structure. (It also makes it easy to search via tags, which would be super cool for data files.)

asked Apr 30 at 12:57 by cripcate

I organize most of the smaller data on a per-paper basis. Larger things (gigabytes, terabytes) are typically somewhere else anyway.

– Oleg Lobachev
Apr 30 at 17:35

2 Answers

In general, I would distinguish between two approaches, or a combination of both, before you end up in a very time-consuming categorization of all your files:

top-down

This means you use intelligent software to find the files you are looking for again and don't worry too much about where they are stored. Examples are a reference manager (Mendeley, Zotero, ...) or a desktop search engine (Copernic). In the desktop search engine, or with Windows search (indexing turned on), you should be able to see which file was changed most recently. Most of your ideas and sketches you save in OneNote or something similar that can link to files and web links.
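If you don't have a desktop search engine at hand, the "which file was changed most recently" part can be approximated with a few lines of Python. This is only a minimal sketch: the function name `most_recent_files` and its parameters are made up for illustration, and it relies on the file system's modification time, which copying or syncing tools may alter.

```python
from pathlib import Path


def most_recent_files(root, pattern="*", limit=5):
    """Return up to `limit` files under `root` matching `pattern`, newest first."""
    files = [p for p in Path(root).rglob(pattern) if p.is_file()]
    # Sort by modification time, most recently changed file first.
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]
```

For example, `most_recent_files("data", "*.py")` would list the five most recently edited Python scripts below the `data` folder from the question.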

bottom-up

This means you come up with a distinct structure yourself, like you did with a rough categorization (raw data, devices, projects). But don't overcomplicate this: when most of the files can be identified by their file type/extension, it may be more convenient to order them by project and keep all file types within those subfolders.

Additionally, one trick I use is tagging files in their filename like "#phdthesis" or "#interesting" or "#cite" or "#collaboration".
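Filename tags like these are also easy to query from a script. The following is a rough sketch, assuming tags are written as `#word` somewhere in the filename; `tags_in_name` and `find_by_tag` are hypothetical helper names, not part of any tool mentioned here.

```python
import re
from pathlib import Path

# A tag is assumed to be "#" followed by letters, digits or underscores.
TAG_PATTERN = re.compile(r"#(\w+)")


def tags_in_name(path):
    """Return the set of tags found in the filename (without the '#')."""
    return set(TAG_PATTERN.findall(Path(path).name))


def find_by_tag(root, tag):
    """Return all files under `root` whose filename carries the given tag."""
    wanted = tag.lstrip("#")
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and wanted in tags_in_name(p)
    )
```

Calling `find_by_tag("documents", "#cite")` would then return every file below `documents` whose name contains `#cite`, regardless of which subfolder it sits in.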

Sample files (data, images of samples) always get a dated name of the form ProjectnameDayMonthYear, independent of their file type.
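A naming convention is easiest to keep consistent if a small helper builds the names. The sketch below assumes that "DayMonthYear" means two-digit day, two-digit month and four-digit year with no separators; that reading, and the function name `sample_filename`, are my assumptions.

```python
from datetime import date


def sample_filename(project, extension, day=None):
    """Build a name like 'Isotopes19022019.csv' (ProjectnameDayMonthYear)."""
    day = day or date.today()
    # %d%m%Y gives day, month, year without separators (assumed format).
    return f"{project}{day.strftime('%d%m%Y')}.{extension.lstrip('.')}"
```

One thing to be aware of: names with the year first (as in the question's `run2019_02_19`) sort chronologically in a file browser, while day-first names do not.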

Don't make your system too complex or too time-consuming. For instance, most of the PDFs I read and download end up in one folder. There is no point in categorizing them; this will not work or help over years and decades. Defining content-related filenames takes too much time. I find most files again months or years later via my desktop search engine, provided they are memorable through keywords and search operators or self-created #tags.

For files that the PI or team members must be able to retrace, you have to agree on a common plan anyway for how to structure and save the files, one that everyone understands and can contribute to. I would suggest either cloud storage or a local wiki here (MoinMoin wiki, for instance).

The most productive system is probably the one "invented" by Stephen Wolfram, but setting it up probably took even more time than writing his article about it. You do get a lot of ideas from it, though I'm not sure it is future-proof to organize yourself with proprietary software.

A personal, free, open-source wiki like MoinMoin wiki or TiddlyWiki is also an alternative; it combines bottom-up, top-down and the file system. In my experience, though, a wiki is only worth the additional time and effort with two or more team members. Otherwise, your personal choice and mixture of bottom-up and top-down is more time-efficient. Single researchers in the humanities, experimental research or programming will end up with very different management systems, depending on available time, number of files, workflow and the need to document and retrace everything.

File systems and desktop search engines will never die out, so my general tip for a future-proof system is not to rely too much on fancy software like OneNote, Mathematica or their alternatives.

answered Apr 30 at 13:54 by user847982 (edited Apr 30 at 14:36)

One of the more important goals when organizing your scientific data is to make it reproducible. At some point someone will have to look at the data and analysis again and try to understand it. This will often be you, e.g. when writing papers or your thesis, but also other scientists who later continue to work on and expand your projects. If you create a figure and your PI or a reviewer asks how exactly it was created and which datasets it's based on, you should be able to figure that out from your data.

This requirement means that copying and pasting is not necessarily a bad thing. You want to be able to reproduce your analysis with exactly the flawed scripts you used the first time, not the updated version that might fix bugs there or might have introduced new ones. But you also want to be able to update your analyses if you find a problem in one of your common scripts, and check whether your results changed because of it.

If your data isn't very large, I'd simply duplicate it in every analysis folder. That way you have both the code and the data in the same place when you have to revisit it. If it is large, you should take care to create an immutable raw data storage that you link to, and make sure those links are never broken by renaming or removing data.
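One way to check that a raw data store really stays immutable is to record a checksum for every file once and compare against it later. This is just a sketch of that idea, not a replacement for backups; the names `build_manifest` and `changed_files` are made up for illustration, and where you keep the resulting manifest is up to you.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_root):
    """Map each file's path (relative to raw_root) to its checksum."""
    raw_root = Path(raw_root)
    return {
        path.relative_to(raw_root).as_posix(): sha256_of(path)
        for path in sorted(raw_root.rglob("*"))
        if path.is_file()
    }


def changed_files(raw_root, manifest):
    """List manifest entries that were modified, renamed or removed."""
    raw_root = Path(raw_root)
    changed = []
    for relative, checksum in manifest.items():
        path = raw_root / relative
        if not path.is_file() or sha256_of(path) != checksum:
            changed.append(relative)
    return changed
```

Running `changed_files` before an analysis tells you whether any link into the raw folder would now point at missing or altered data.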

For your scripts, it seems like you are experiencing issues because you haven't centralized your common code. Creating a modular version of your common code, like you suggested, is certainly a good idea. But I would set this up in a way that copies your common code, as it is at that time, into the current analysis folder. That way you can easily rerun the analysis with the old code, but still have a way to run it with the updated common code if desired.
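Copying the common code into the analysis folder can itself be scripted, so it happens the same way every time. A minimal sketch, assuming the common code lives in one directory; the function name `snapshot_common_code` and the `common_snapshot_<date>` folder name are hypothetical choices of mine.

```python
import shutil
from datetime import date
from pathlib import Path


def snapshot_common_code(common_dir, analysis_dir, day=None):
    """Copy the common code, as it is right now, into the analysis folder."""
    day = day or date.today()
    target = Path(analysis_dir) / f"common_snapshot_{day.isoformat()}"
    # copytree refuses to overwrite: an existing snapshot raises FileExistsError.
    shutil.copytree(common_dir, target)
    return target
```

The analysis scripts then import from the snapshot folder, while the central copy of the common code can keep evolving independently.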

If you have problems distinguishing where you have edited stuff by hand or where you copied stuff, it seems like you're doing this in multiple directions. It's always easier to follow if your changes only flow in one direction. It's also a good idea to put as much of the data and code as is reasonable into a real version control system like git, so you can use e.g. "git blame" to figure out if you edited some parts of a script manually.

For finding snippets of text, I'd probably just use some kind of full-text search. Maybe use a naming convention for this kind of file, so that you can distinguish these files from your data and code files if you want to search only in them.
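A full-text search restricted by naming convention could look roughly like this. It is a sketch under the assumption that note files are plain text and start with `note_`; that prefix and the function name `search_notes` are invented for the example. Word documents are not plain text and would need a proper search tool or conversion first.

```python
from pathlib import Path


def search_notes(root, keyword, patterns=("note_*.md", "note_*.txt")):
    """Return (path, line number, line) for each note line containing keyword."""
    needle = keyword.lower()
    hits = []
    for pattern in patterns:
        for path in sorted(Path(root).rglob(pattern)):
            text = path.read_text(encoding="utf-8", errors="ignore")
            for number, line in enumerate(text.splitlines(), start=1):
                # Case-insensitive substring match.
                if needle in line.lower():
                    hits.append((path, number, line.strip()))
    return hits
```

Because only files matching the note patterns are opened, scripts and data files that happen to contain the same keyword don't clutter the results.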

answered 2 days ago by Mad Scientist

Upvoted for make it reproducible.

– Oleg Lobachev
2 days ago
