Dropbox Traces

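You can download from this page the flow data used in the paper "Inside Dropbox: Understanding Personal Cloud Storage Services" (IMC'12). The full citation is given in the Acceptable Use Policy below.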

As described in the paper, the data was captured at 4 vantage points in 2 European countries. The first 4 files were collected from March 24, 2012 to May 5, 2012. A second dataset was collected in Campus 1 in June and July 2012 to complement the analysis.

The data was captured using Tstat, an open-source monitoring tool developed at Politecnico di Torino. Tstat exports flow data containing more than 100 metrics. The source code of Tstat can be obtained from here. More information about the DN-Hunter version of Tstat, needed for some experiments, can be found here. Note that all IP addresses in the datasets are anonymized.

Traces

First Data Capture

These datasets were captured from March 24, 2012 to May 5, 2012.

Name      File Size  Flows      Devices
Campus 1  21 MB      167,189    283
Campus 2  262 MB     1,902,824  6,609
Home 1    181 MB     1,438,369  3,350
Home 2    82 MB      693,086    1,313

Second Data Capture

This dataset was captured from June 01, 2012 to July 31, 2012.

Name      File Size  Flows    Devices
Campus 1  32 MB      264,131  270

Acceptable Use Policy

  • The user must not attempt to reverse engineer the anonymization procedure used to protect the data.
  • Users who notice vulnerabilities in the anonymization procedure are kindly asked to inform the repository administrators.
  • When writing a paper using this data, please cite:
@inproceedings{drago2012_dropbox,
  author        = {Idilio Drago and Marco Mellia and Maurizio M. Munaf\`{o} and Anna Sperotto and Ramin Sadre and Aiko Pras},
  title         = {{I}nside {D}ropbox: {U}nderstanding {P}ersonal {C}loud {S}torage {S}ervices},
  booktitle     = {Proceedings of the 12th ACM SIGCOMM Conference on Internet Measurement},
  series        = {IMC'12},
  pages         = {481--494},
  year          = {2012}
}

Format

All files are in a format similar to the log_tcp_complete saved by Tstat.

The following columns are found in these traces:

############################################################################
# C2S # S2C # Short description      # Unit  # Long description            #
############################################################################
#  1  # 45  # Client/Server IP addr  # -     # Anonymized IP addresses of the client/server
#  2  # 46  # Client/Server TCP port # -     # TCP port numbers of the client/server
#  3  # 47  # packets                # -     # total number of packets observed from the client/server
#  4  # 48  # RST sent               # 0/1   # 0 = no RST segment has been sent by the client/server
#  5  # 49  # ACK sent               # -     # number of segments with the ACK field set to 1
#  6  # 50  # PURE ACK sent          # -     # number of segments with ACK field set to 1 and no data
#  7  # 51  # unique bytes           # bytes # number of bytes sent in the payload
#  8  # 52  # data pkts              # -     # number of segments with payload
#  9  # 53  # data bytes             # bytes # number of bytes transmitted in the payload, including retransmissions
# 10  # 54  # rexmit pkts            # -     # number of retransmitted segments
# 11  # 55  # rexmit bytes           # bytes # number of retransmitted bytes
# 12  # 56  # out seq pkts           # -     # number of segments observed out of sequence
# 13  # 57  # SYN count              # -     # number of SYN segments observed (including rtx)
# 14  # 58  # FIN count              # -     # number of FIN segments observed (including rtx)
# 15  # 59  # RFC1323 ws             # 0/1   # Window scale option sent
# 16  # 60  # RFC1323 ts             # 0/1   # Timestamp option sent
# 17  # 61  # window scale           # -     # Scaling values negotiated [scale factor]
# 18  # 62  # SACK req               # 0/1   # SACK option set
# 19  # 63  # SACK sent              # -     # number of SACK messages sent
# 20  # 64  # MSS                    # bytes # MSS declared
# 21  # 65  # max seg size           # bytes # Maximum segment size observed
# 22  # 66  # min seg size           # bytes # Minimum segment size observed
# 23  # 67  # win max                # bytes # Maximum receiver window announced (already scaled by the window scale factor)
# 24  # 68  # win min                # bytes # Minimum receiver window announced (already scaled by the window scale factor)
# 25  # 69  # win zero               # -     # Total number of segments declaring zero as receiver window
# 26  # 70  # cwin max               # bytes # Maximum in-flight-size (see Tstat docs)
# 27  # 71  # cwin min               # bytes # Minimum in-flight-size
# 28  # 72  # initial cwin           # bytes # First in-flight size, or total number of unack-ed bytes sent before receiving the first ACK segment
# 29  # 73  # Average rtt            # ms    # Average RTT computed measuring the time elapsed between the data segment and the corresponding ACK
# 30  # 74  # rtt min                # ms    # Minimum RTT observed during connection lifetime
# 31  # 75  # rtt max                # ms    # Maximum RTT observed during connection lifetime
# 32  # 76  # Stdev rtt              # ms    # Standard deviation of the RTT
# 33  # 77  # rtt count              # -     # Number of valid RTT observations
# 34  # 78  # ttl_min                # -     # Minimum Time To Live
# 35  # 79  # ttl_max                # -     # Maximum Time To Live
# 36  # 80  # rtx RTO                # -     # Number of retransmitted segments due to timeout expiration
# 37  # 81  # rtx FR                 # -     # Number of retransmitted segments due to Fast Retransmit (three dup-ack)
# 38  # 82  # reordering             # -     # Number of packet reordering events observed
# 39  # 83  # net dup                # -     # Number of network duplicates observed
# 40  # 84  # unknown                # -     # Number of segments not in sequence or duplicate which are not classified as specific events
# 41  # 85  # flow control           # -     # Number of retransmitted segments to probe the receiver window
# 42  # 86  # unnece rtx RTO         # -     # Number of unnecessary transmissions following a timeout expiration
# 43  # 87  # unnece rtx FR          # -     # Number of unnecessary transmissions following a fast retransmit
# 44  # 88  # != SYN seqno           # 0/1   # 1 = retransmitted SYN segments have different initial seqno
############################################################################
# 89        # Completion time        # ms    # Flow duration from first packet to last packet
# 90        # First time             # ms    # Flow first packet since first segment ever
# 91        # Last time              # ms    # Flow last segment since first segment ever
# 92        # C first payload        # ms    # Client first segment with payload since the first flow segment
# 93        # S first payload        # ms    # Server first segment with payload since the first flow segment
# 94        # C last payload         # ms    # Client last segment with payload since the first flow segment
# 95        # S last payload         # ms    # Server last segment with payload since the first flow segment
# 96        # C first ack            # ms    # Client first ACK segment (without SYN) since the first flow segment
# 97        # S first ack            # ms    # Server first ACK segment (without SYN) since the first flow segment
# 98        # First time abs         # ms    # Flow first packet absolute time (epoch)
# 99        # C Internal             # 0/1   # 1 = client has internal IP, 0 = client has external IP
# 100       # S Internal             # 0/1   # 1 = server has internal IP, 0 = server has external IP
############################################################################
# 101       # Connection type        # -     # Bitmask stating the connection type as identified by TCPL7 inspection engine (see protocol.h)
############################################################################
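
The column numbers in the table above are 1-based, and for every per-direction metric the S2C column is the C2S column plus 44. The following Python sketch shows one way to read fields by column number. It assumes that fields are separated by whitespace, as in Tstat's log_tcp_complete; the names C2S_COLUMNS, field and parse_flow are illustrative and are not part of Tstat or of the scripts released with the paper.

```python
from datetime import datetime, timezone

# 1-based column numbers of a few client-to-server (C2S) metrics,
# copied from the table above. The selection is illustrative.
C2S_COLUMNS = {"ip": 1, "port": 2, "packets": 3, "unique_bytes": 7, "data_bytes": 9}
S2C_OFFSET = 44       # S2C column = C2S column + 44 (columns 45 to 88)
FIRST_TIME_ABS = 98   # flow start in milliseconds since the epoch


def field(fields, column):
    """Return the value stored at a 1-based column number."""
    return fields[column - 1]


def parse_flow(line):
    """Split one log line (assumed whitespace-separated) into a small dict."""
    fields = line.split()
    flow = {}
    for name, column in C2S_COLUMNS.items():
        flow["c_" + name] = field(fields, column)
        flow["s_" + name] = field(fields, column + S2C_OFFSET)
    # Column 98 is an absolute timestamp in milliseconds.
    start_ms = float(field(fields, FIRST_TIME_ABS))
    flow["start"] = datetime.fromtimestamp(start_ms / 1000.0, tz=timezone.utc)
    return flow
```

Values are kept as strings here, so callers convert only the columns they need.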

Note that the last columns of the current log_tcp_complete of Tstat are not included. Specifically for this analysis, the following extra columns were added:

############################################################################
# 102       # C2S messages           # -     # PSH-separated "messages" C2S
# 103       # S2C messages           # -     # PSH-separated "messages" S2C
# 104       # DB host_int            # -     # Anonymized Dropbox device ID
# 105       # DB service             # -     # Dropbox service inferred from the FQDN requested by the user or from server IP addresses
############################################################################

Columns 102 and 103 were added some weeks after the data capture started in some vantage points. The columns are filled with a "-" for the period in which the value was not yet captured.
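
Scripts that read columns 102 and 103 need to handle the "-" placeholder before doing arithmetic. A minimal Python sketch, assuming that a captured value is an integer message count (the helper name parse_messages is hypothetical):

```python
def parse_messages(value):
    """Return the message count as int, or None for the "-" placeholder."""
    if value == "-":
        return None  # value was not yet captured for this flow
    return int(value)
```

Returning None rather than 0 keeps flows without a measurement distinguishable from flows with zero messages.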

Column 105 holds a string identifying a Dropbox service. See Sec. 2 and Tab. 1 in the paper for details. The scripts below (e.g. traffic_share.py) also contain more information on how those values were interpreted in the paper. Flows marked as "Unknown" were identified as related to Dropbox, but the destination service was unclear.
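
As an illustration of how column 105 can be used, the Python sketch below sums the unique bytes (columns 7 and 51) per Dropbox service and converts the totals into shares. It is not the released traffic_share.py and does not reproduce how the paper groups services. It assumes whitespace-separated fields and skips empty lines, lines starting with "#", and lines with fewer than 105 fields. All function names are illustrative.

```python
import gzip
from collections import defaultdict

# 1-based column numbers from the format tables.
C_UNIQUE_BYTES, S_UNIQUE_BYTES, DB_SERVICE = 7, 51, 105


def bytes_per_service(lines):
    """Sum client and server unique bytes for each value of column 105."""
    totals = defaultdict(int)
    for line in lines:
        if line.startswith("#"):
            continue  # assumption: possible header or comment lines
        fields = line.split()
        if len(fields) < DB_SERVICE:
            continue  # empty or truncated line
        service = fields[DB_SERVICE - 1]
        totals[service] += int(fields[C_UNIQUE_BYTES - 1])
        totals[service] += int(fields[S_UNIQUE_BYTES - 1])
    return dict(totals)


def traffic_share(totals):
    """Convert byte totals into fractions of the overall volume."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {service: count / grand_total for service, count in totals.items()}


def bytes_per_service_in_file(path):
    """Aggregate one gzip-compressed trace file, streaming it line by line."""
    with gzip.open(path, "rt") as handle:
        return bytes_per_service(handle)
```

For example, traffic_share(bytes_per_service_in_file(path)) returns a dictionary mapping each service string found in the file at path to its fraction of the total bytes.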

Sample Scripts

The scripts are written in bash, awk, or Python. Gnuplot is required, and each figure is created by a separate bash script. For example, after unpacking the files and downloading the data, the following command creates Fig. 05; the output EPS file is written to a sub-directory called 'figs':

$ ./fig05_ts_contacted_servers.sh campus1_dataset1.log.gz campus2_dropbox.log.gz home1_dropbox.log.gz home2_dropbox.log.gz

List of server IPs

Dropbox server IPs are public. To protect privacy, we had to anonymize both client and server IPs in our datasets, using distinct methods. However, we release the list of Dropbox server IPs that can be obtained from the DNS.

External Links