Difference between revisions of "Dropbox Crawler"

From SimpleWiki
 
(118 intermediate revisions by 3 users not shown)
Line 1: Line 1:
We are crawling DropBox information to help our research.
+
Personal cloud storage is becoming more and more popular, with Dropbox certainly being the best known example. It generates a huge amount of Internet traffic, but how does it work? How is it used? What are the possible improvements?
  
It is very important for us to know the file patterns of Dropbox users: for example, how big the files are and which kinds of files users store on Dropbox.
+
In this experiment, we collected basic statistics of what files are stored in Dropbox folders.
  
To run our crawler, you may try to load it directly from our page by clicking here:
+
== Datasets ==
  
 +
Download our datasets:
  
Alternatively, you may download the Jar package and run it (double-click it on most operating systems, or run java -jar HelpOurResearch.jar)
+
{| class="wikitable" style="text-align: center; width: 400px; height: 40px;"
 +
|-
 +
! scope="col" | Name
 +
! scope="col" | File Size
 +
! scope="col" | Volunteers
 +
|-
 +
! scope="row" | [http://traces.simpleweb.org/dropbox/crawler/dropbox_crawler.tar.gz Crawler Dataset]
 +
| 219M || 333
 +
|}
  
Launch Java Application
+
Some results derived from these data can be found in [http://eprints.eemcs.utwente.nl/24136/01/2013_drago_thesis.pdf here].
  
 +
In particular, the figures presented in Sect. 5.3 of the linked document are obtained using the
 +
scripts available in the folder "scripts" inside the tarball.
  
We ensure that:
+
== How does our data collection work? ==
  
All data we collect are anonymized.
+
* It scans Dropbox folders
We do not copy any file content.
+
* Calculates basic statistics
We do not collect any personal information and file/dir names.
+
* Shows what has been collected for approval
 +
* Sends the statistics to us
  
 +
== What has been logged? ==
  
We will also make our data public in the near future, so that anyone will be able to use this important data source.
+
For each file/folder in a Dropbox, the program collects:
 +
<pre>
 +
* Size in bytes
 +
* Last modification time
 +
* Mime type of the file
 +
* File extension
 +
* MD5 Hash of both initial and final 8 kbytes of the file
 +
* MD5 Hash of the file name/path
 +
</pre>
  
What we do:
+
The program also sends to us:
 +
<pre>
 +
* MD5 Hash of Dropbox configuration files (or MAC address if we cannot read the former)
 +
* MD5 Hash of the path of your Dropbox home folder
 +
* Your IP address and operating system version
 +
* Error logs, in case something goes wrong during the data collection
 +
</pre>
  
We will read all your DropBox Folder;
+
Collected information is sent via plain HTTP to a centralized collection server.
We will collect basic statistics (the log format is described below);
 
We will send these statistics to our web server.
 
  
 +
== Client source code ==
  
What we DO NOT do:
+
Download the source code by clicking [http://www.simpleweb.org/dropbox/source_python.zip here] for the native versions (you will need Python 2.7 and [http://www.pyinstaller.org/ PyInstaller] for building these versions), or [http://www.simpleweb.org/dropbox/source_java.zip here] for the Java version.
  
We do not copy any file content;
+
== More information ==
We do not copy file or folder name;
 
We do not copy any personal information;
 
We do not install or store anything in your computer.
 
  
 +
The dataset in this page is used in the following publications:
  
 +
  @phdthesis{drago_understanding_2013,
 +
          author      = {Idilio Drago},
 +
          title        = {Understanding and Monitoring Cloud Services},
 +
          school      = {University of Twente},
 +
          url          = {<nowiki>\url{http://eprints.eemcs.utwente.nl/24136/</nowiki>}},
 +
          year        = {2013},
 +
  },
  
Traces
+
  @inproceedings{drago_caracterizacao_2013,
As soon as possible, we will make our logs public.
+
          author      = {Idilio Drago and Alex Borges Vieira and Ana Paula Couto da Silva},
 +
          title        = {Caracteriza{\c c}{\~a}o dos Arquivos Armazenados no Dropbox},
 +
          booktitle    = {Anais do Workshop de Redes {P2P}, Din{\^a}micas, Sociais e Orientadas a Conte{\'u}do},
 +
          series      = <nowiki>{{WP2P+}}</nowiki>,
 +
          pages        = {109--114},
 +
          year        = {2013},
 +
  },
  
These datasets were captured from Jan. 3, 2013 to (not yet defined).
+
More information about our previous work can be found in these papers:
  
Acceptable Use Policy (to use our logs in future)
+
* [http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf '''Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012''']
  
The user must not attempt to reverse engineer the anonymization procedure used to protect the data.
+
* [http://eprints.eemcs.utwente.nl/23674/01/cloud_storage.pdf '''Drago, I. and Bocchi, E. and Mellia, M. and Slatman, H. and Pras, A. (2013) Benchmarking personal cloud storage. In: Proceedings of the 13th ACM Internet Measurement Conference, IMC 2013, 23-25 Oct 2013, Barcelona, Spain. pp. 205-212.''']
  
Users who notice vulnerabilities in the anonymization procedure are kindly asked to inform the repository administrators.
+
* [[Dropbox Traces|This page]] and [[Cloud benchmarks | this page]] have more traces we used in other papers.
  
When writing a paper using this data, please cite:
+
== External Links ==
  
@inproceedings{
+
These institutes are involved in this research:
 
+
* [http://www.utwente.nl/ewi/dacs/ DACS - University of Twente] - Contact: Idilio Drago - idilio.drago@polito.it
}
+
* [http://www.ufjf.br/portal/ Universidade Federal de Juiz de Fora] Contact: Alex Vieira - alex.borges@ufjf.edu.br
 
+
* [http://www.tlc-networks.polito.it/ Telecommunication Networks Group - Politecnico di Torino] - Marco Mellia - mellia@tlc.polito.it
 
 
Format
 
 
 
All files are in a simple format. Each line holds the attributes of one file, separated by #.
 
 
 
The following columns are found in these traces:
 
 
 
############################################################################
#    #    # Short description      # Unit  # Long description            #
############################################################################
#  1  #    # Length                 # -     # File size in bytes
#  2  #    # Modified               # -     # Last modification time of the file (Unix date/time format)
#  3  #    # MIME                   # -     # File MIME type, detected using Magic Java Unit
#  4  #    # EXTENSION              # -     # File extension (substring after the last "." in the string)
#  5  #    # MD5                    # -     # MD5 hash code of the initial/final 8 bytes of the file
#  6  #    # MD5 of the name        # -     # MD5 hash code of the file name string
############################################################################
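
The column layout above could be read with a few lines of Python. This is only a sketch under stated assumptions: the names `FileRecord` and `parse_line` are made up for this example, the modification time is assumed to be an integer number of seconds, and no field is assumed to contain a literal # character.

```python
from typing import NamedTuple


class FileRecord(NamedTuple):
    """One trace line, following the six columns listed above (hypothetical type)."""
    length: int        # column 1: file size in bytes
    modified: int      # column 2: Unix time, assumed to be whole seconds
    mime: str          # column 3: MIME type
    extension: str     # column 4: file extension
    content_md5: str   # column 5: MD5 of the initial/final part of the file
    name_md5: str      # column 6: MD5 of the file name string


def parse_line(line):
    """Split one '#'-separated trace line into a FileRecord."""
    fields = line.rstrip("\n").split("#")
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    length, modified, mime, ext, content_md5, name_md5 = fields
    return FileRecord(int(length), int(modified), mime, ext, content_md5, name_md5)
```

A whole trace file can then be read with `[parse_line(l) for l in open(path)]`, where `path` points to one of the trace files.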
 
 
 
 
 
 
 
Crawler Source Code (java)
 
 
 
Download the Java Source Code to Capture Files Information
 
The project can be opened directly in NetBeans, version 7.2.1
 
 
 
 
 
Previous Work
 
You may find DropBox information on our previous work
 
 
 
Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012
 
 
 
As described in the paper, the data was captured at 4 vantage points in 2 European countries. The first 4 files were collected from March 24, 2012 to May 5, 2012. A second dataset was collected in Campus 1 in June and July 2012 to complement the analysis.
 
 
 
The data was captured using Tstat, an open source monitoring tool developed at Politecnico di Torino. Tstat exports flow data containing more than 100 metrics. The source code of Tstat can be obtained from here. More information about the DN-Hunter version of Tstat, needed for some experiments, can be found here. Note that all IP addresses in the datasets are anonymized.
 
 
 
 
 
External Link
 
 
 
    Conference Website
 

Latest revision as of 09:44, 9 May 2014

Personal cloud storage is becoming more and more popular, with Dropbox certainly being the best known example. It generates a huge amount of Internet traffic, but how does it work? How is it used? What are the possible improvements?

In this experiment, we collected basic statistics of what files are stored in Dropbox folders.

Datasets

Download our datasets:

Name             File Size   Volunteers
Crawler Dataset  219M        333

Crawler Dataset: http://traces.simpleweb.org/dropbox/crawler/dropbox_crawler.tar.gz

Some results derived from these data can be found in the thesis at http://eprints.eemcs.utwente.nl/24136/01/2013_drago_thesis.pdf.

In particular, the figures presented in Sect. 5.3 of the linked document are obtained using the scripts available in the folder "scripts" inside the tarball.

How does our data collection work?

  • It scans Dropbox folders
  • Calculates basic statistics
  • Shows what has been collected for approval
  • Sends the statistics to us
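
The first two steps above (scanning and computing statistics) can be sketched in Python as follows. This is an illustration only, not the crawler's actual code: the function names `scan_folder` and `summarize` and the particular statistics are assumptions made for this example, and the approval and upload steps are left out.

```python
import os
from collections import Counter


def scan_folder(root):
    """Walk `root` and return a (size_in_bytes, extension) pair per regular file."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            # Lower-cased extension without the dot; "" when there is none.
            ext = os.path.splitext(name)[1].lstrip(".").lower()
            entries.append((size, ext))
    return entries


def summarize(entries):
    """Aggregate the scanned entries into a few basic statistics."""
    sizes = [size for size, _ext in entries]
    return {
        "files": len(sizes),
        "total_bytes": sum(sizes),
        "largest_bytes": max(sizes, default=0),
        "by_extension": Counter(ext for _size, ext in entries),
    }
```

Calling `summarize(scan_folder(path))` on a Dropbox home folder would give the kind of summary a volunteer could review before anything is sent.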

What has been logged?

For each file/folder in a Dropbox, the program collects:

* Size in bytes
* Last modification time
* Mime type of the file
* File extension
* MD5 Hash of both initial and final 8 kbytes of the file
* MD5 Hash of the file name/path
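
A minimal sketch of how one such per-file record could be built, written for Python 3 (the native client itself targets Python 2.7). Several details are assumptions, not taken from the client: "8 kbytes" is read as 8192 bytes, the head and tail are concatenated into a single MD5 (so they overlap for files smaller than 16 kbytes), the MIME type is guessed from the file name with the standard `mimetypes` module rather than from the content, and the function name `describe_file` and the dictionary keys are hypothetical.

```python
import hashlib
import mimetypes
import os

CHUNK = 8 * 1024  # "8 kbytes", read here as 8192 bytes (assumption)


def describe_file(path):
    """Return the per-file fields listed above without keeping any file content."""
    st = os.stat(path)
    with open(path, "rb") as fh:
        head = fh.read(CHUNK)                # initial 8 kbytes
        fh.seek(max(st.st_size - CHUNK, 0))  # jump to the final 8 kbytes
        tail = fh.read(CHUNK)
    mime, _encoding = mimetypes.guess_type(path)  # name-based guess (assumption)
    return {
        "size": st.st_size,
        "mtime": int(st.st_mtime),
        "mime": mime or "application/octet-stream",
        "extension": os.path.splitext(path)[1].lstrip("."),
        # One hash over head + tail; only the digest leaves the machine.
        "content_md5": hashlib.md5(head + tail).hexdigest(),
        # The path itself is never stored, only its digest.
        "path_md5": hashlib.md5(os.fsencode(path)).hexdigest(),
    }
```

Hashing only the two ends of the file keeps the scan fast on large files, and neither the content nor the name appears in the record in readable form.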

The program also sends to us:

* MD5 Hash of Dropbox configuration files (or MAC address if we cannot read the former)
* MD5 Hash of the path of your Dropbox home folder
* Your IP address and operating system version
* Error logs, in case something goes wrong during the data collection

Collected information is sent via plain HTTP to a centralized collection server.
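
As an illustration of what an upload over plain HTTP could look like, the sketch below builds (but does not send) a POST request with Python's standard `urllib.request`. The URL `http://collector.example.org/upload`, the JSON payload layout, and the function name `build_upload_request` are all hypothetical; this page does not document the real endpoint or wire format.

```python
import json
import urllib.request

# Hypothetical endpoint; the real collection server is not documented here.
COLLECTOR_URL = "http://collector.example.org/upload"


def build_upload_request(records, url=COLLECTOR_URL):
    """Return a POST request carrying `records` as a JSON body (nothing is sent)."""
    body = json.dumps({"records": records}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the upload. Because the transport is plain HTTP, the payload is not encrypted in transit.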

Client source code

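Download the source code from http://www.simpleweb.org/dropbox/source_python.zip for the native versions (you will need Python 2.7 and PyInstaller, http://www.pyinstaller.org/, for building these versions), or from http://www.simpleweb.org/dropbox/source_java.zip for the Java version.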

More information

The dataset in this page is used in the following publications:

 @phdthesis{drago_understanding_2013,
         author       = {Idilio Drago},
         title        = {Understanding and Monitoring Cloud Services},
         school       = {University of Twente},
         url          = {\url{http://eprints.eemcs.utwente.nl/24136/}},
         year         = {2013},
 },
 @inproceedings{drago_caracterizacao_2013,
         author       = {Idilio Drago and Alex Borges Vieira and Ana Paula Couto da Silva},
         title        = {Caracteriza{\c c}{\~a}o dos Arquivos Armazenados no Dropbox},
         booktitle    = {Anais do Workshop de Redes {P2P}, Din{\^a}micas, Sociais e Orientadas a Conte{\'u}do},
         series       = {{WP2P+}},
         pages        = {109--114},
         year         = {2013},
 },

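More information about our previous work can be found in these papers:

  • Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012. http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf
  • Drago, I. and Bocchi, E. and Mellia, M. and Slatman, H. and Pras, A. (2013) Benchmarking personal cloud storage. In: Proceedings of the 13th ACM Internet Measurement Conference, IMC 2013, 23-25 Oct 2013, Barcelona, Spain. pp. 205-212. http://eprints.eemcs.utwente.nl/23674/01/cloud_storage.pdf
  • The Dropbox Traces and Cloud benchmarks pages have more traces we used in other papers.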

External Links

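These institutes are involved in this research:

  • DACS - University of Twente (http://www.utwente.nl/ewi/dacs/) - Contact: Idilio Drago - idilio.drago@polito.it
  • Universidade Federal de Juiz de Fora (http://www.ufjf.br/portal/) - Contact: Alex Vieira - alex.borges@ufjf.edu.br
  • Telecommunication Networks Group - Politecnico di Torino (http://www.tlc-networks.polito.it/) - Contact: Marco Mellia - mellia@tlc.polito.it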