Dropbox Traces

Revision as of 09:10, 7 September 2012 by Idiliod (talk | contribs)

You can download from this page the flow data used in the following paper:

  • Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. In: Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012

As described in the paper, the data were captured at 4 vantage points in 2 European countries. Most of the data were collected from March 24, 2012 to May 5, 2012. A second dataset was collected at Campus 1 in June and July 2012 to complement the analysis.

All data were captured using Tstat, an open-source monitoring tool developed at Politecnico di Torino. Tstat exports flow data containing more than 100 metrics. The source code of Tstat can be obtained from here.

Note that ***all IP addresses*** are anonymized.

Traces

First data capture

  • Campus 1
  • Campus 2 (soon)
  • Home 1 (soon)
  • Home 2 (soon)

Second data capture

  • Campus 1

Acceptable Use Policy

  • The user must not attempt to reverse engineer the anonymization procedure used to protect the data.
  • Users who notice vulnerabilities in the anonymization procedure are kindly asked to inform the repository administrators.
  • Users who write a paper based on this data are asked to cite:
@inproceedings{drago2012_dropbox,
  author        = {Idilio Drago and Marco Mellia and Maurizio M. Munaf\`{o} and Anna Sperotto and Ramin Sadre and Aiko Pras},
  title         = {{I}nside {D}ropbox: {U}nderstanding {P}ersonal {C}loud {S}torage {S}ervices},
  booktitle     = {Proceedings of the 12th ACM SIGCOMM Conference on Internet Measurement},
  series        = {IMC'12},
  pages         = {},
  year          = {2012}
}

Format

All files are in a format similar to the log_tcp_complete saved by Tstat.

The following columns are found in these traces:

############################################################################
# C2S # S2C # Short description      # Unit  # Long description            #
############################################################################
#  1  # 45  # Client/Server IP addr  # -     # IP addresses of the client/server
#  2  # 46  # Client/Server TCP port # -     # TCP port addresses for the client/server
#  3  # 47  # packets                # -     # total number of packets observed from the client/server
#  4  # 48  # RST sent               # 0/1   # 0 = no RST segment has been sent by the client/server
#  5  # 49  # ACK sent               # -     # number of segments with the ACK field set to 1
#  6  # 50  # PURE ACK sent          # -     # number of segments with ACK field set to 1 and no data
#  7  # 51  # unique bytes           # bytes # number of bytes sent in the payload
#  8  # 52  # data pkts              # -     # number of segments with payload
#  9  # 53  # data bytes             # bytes # number of bytes transmitted in the payload, including retransmissions
# 10  # 54  # rexmit pkts            # -     # number of retransmitted segments
# 11  # 55  # rexmit bytes           # bytes # number of retransmitted bytes
# 12  # 56  # out seq pkts           # -     # number of segments observed out of sequence
# 13  # 57  # SYN count              # -     # number of SYN segments observed (including rtx)
# 14  # 58  # FIN count              # -     # number of FIN segments observed (including rtx)
# 15  # 59  # RFC1323 ws             # 0/1   # Window scale option sent
# 16  # 60  # RFC1323 ts             # 0/1   # Timestamp option sent
# 17  # 61  # window scale           # -     # Scaling values negotiated [scale factor]
# 18  # 62  # SACK req               # 0/1   # SACK option set
# 19  # 63  # SACK sent              # -     # number of SACK messages sent
# 20  # 64  # MSS                    # bytes # MSS declared
# 21  # 65  # max seg size           # bytes # Maximum segment size observed
# 22  # 66  # min seg size           # bytes # Minimum segment size observed
# 23  # 67  # win max                # bytes # Maximum receiver window announced (already scaled by the window scale factor)
# 24  # 68  # win min                # bytes # Minimum receiver window announced (already scaled by the window scale factor)
# 25  # 69  # win zero               # -     # Total number of segments declaring zero as receiver window
# 26  # 70  # cwin max               # bytes # Maximum in-flight-size (see Tstat docs)
# 27  # 71  # cwin min               # bytes # Minimum in-flight-size
# 28  # 72  # initial cwin           # bytes # First in-flight size, or total number of unack-ed bytes sent before receiving the first ACK segment
# 29  # 73  # Average rtt            # ms    # Average RTT computed measuring the time elapsed between the data segment and the corresponding ACK
# 30  # 74  # rtt min                # ms    # Minimum RTT observed during connection lifetime
# 31  # 75  # rtt max                # ms    # Maximum RTT observed during connection lifetime
# 32  # 76  # Stdev rtt              # ms    # Standard deviation of the RTT
# 33  # 77  # rtt count              # -     # Number of valid RTT observations
# 34  # 78  # ttl_min                # -     # Minimum Time To Live
# 35  # 79  # ttl_max                # -     # Maximum Time To Live
# 36  # 80  # rtx RTO                # -     # Number of retransmitted segments due to timeout expiration
# 37  # 81  # rtx FR                 # -     # Number of retransmitted segments due to Fast Retransmit (three dup-ack)
# 38  # 82  # reordering             # -     # Number of packet reordering events observed
# 39  # 83  # net dup                # -     # Number of network duplicates observed
# 40  # 84  # unknown                # -     # Number of segments not in sequence or duplicate which are not classified as specific events
# 41  # 85  # flow control           # -     # Number of retransmitted segments to probe the receiver window
# 42  # 86  # unnece rtx RTO         # -     # Number of unnecessary transmissions following a timeout expiration
# 43  # 87  # unnece rtx FR          # -     # Number of unnecessary transmissions following a fast retransmit
# 44  # 88  # != SYN seqno           # 0/1   # 1 = retransmitted SYN segments have different initial seqno
############################################################################
# 89        # Completion time        # ms    # Flow duration from first packet to last packet
# 90        # First time             # ms    # Time of the flow's first packet, relative to the first segment ever observed
# 91        # Last time              # ms    # Time of the flow's last segment, relative to the first segment ever observed
# 92        # C first payload        # ms    # Client first segment with payload since the first flow segment
# 93        # S first payload        # ms    # Server first segment with payload since the first flow segment
# 94        # C last payload         # ms    # Client last segment with payload since the first flow segment
# 95        # S last payload         # ms    # Server last segment with payload since the first flow segment
# 96        # C first ack            # ms    # Client first ACK segment (without SYN) since the first flow segment
# 97        # S first ack            # ms    # Server first ACK segment (without SYN) since the first flow segment
# 98        # First time abs         # ms    # Flow first packet absolute time (epoch)
# 99        # C Internal             # 0/1   # 1 = client has internal IP, 0 = client has external IP
# 100       # S Internal             # 0/1   # 1 = server has internal IP, 0 = server has external IP
############################################################################
# 101       # Connection type        # -     # Bitmask stating the connection type as identified by TCPL7 inspection engine (see protocol.h)
############################################################################
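
As an illustration of how the column numbers above map onto a record, here is a minimal Python sketch. It assumes that each flow is one line of whitespace-separated fields, as in Tstat's log_tcp_complete; that assumption, the function name parse_flow, and the dictionary keys are hypothetical and not part of this page. The sketch relies on the fact, visible in the table, that each S2C column number is the corresponding C2S column number plus 44.

```python
from datetime import datetime, timezone

# 1-based C2S column numbers for a few metrics, copied from the table above.
C2S_COLUMNS = {
    "ip": 1,
    "port": 2,
    "packets": 3,
    "unique_bytes": 7,
    "data_bytes": 9,
    "rexmit_bytes": 11,
}

# The S2C copy of every per-direction column sits 44 columns later.
S2C_OFFSET = 44


def parse_flow(line):
    """Split one record and pick out a few columns by their 1-based number.

    Assumption: fields are separated by whitespace.
    """
    fields = line.split()

    def get(col):
        # The table numbers columns from 1; Python lists index from 0.
        return fields[col - 1]

    flow = {}
    for name, col in C2S_COLUMNS.items():
        flow["c2s_" + name] = get(col)
        flow["s2c_" + name] = get(col + S2C_OFFSET)

    # Column 89: flow duration in milliseconds.
    flow["completion_ms"] = float(get(89))
    # Column 98: absolute start time, in milliseconds since the epoch.
    flow["start_utc"] = datetime.fromtimestamp(float(get(98)) / 1000.0,
                                               tz=timezone.utc)
    return flow
```

Per-direction values are kept as strings here because the IP addresses are anonymized and their exact representation is not described on this page; numeric columns can be converted with int() or float() as needed.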

Note that the last columns differ from those of the current stable Tstat version. The following extra columns were added specifically for this analysis:

############################################################################
# 102       # C2S messages           # -     # PSH-separated "messages" C2S 
# 103       # S2C messages           # -     # PSH-separated "messages" S2C 
# 104       # DB host_int            # -     # Anonymized Dropbox device ID
# 105       # DB service             # -     # Dropbox service inferred from the FQDN requested by the user
############################################################################

Note that columns 102 and 103 were added at some vantage points a few weeks after the data capture started. These columns contain a "-" for the period in which the value was not yet captured.
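
Code that reads the extra columns needs to handle the "-" placeholder. The following Python sketch shows one way to do that; it assumes whitespace-separated fields and that columns 102 and 103 hold integer counts when present. Both points are assumptions, and the function names are hypothetical.

```python
def parse_optional_int(value):
    """Return None for the "-" placeholder, otherwise the integer value.

    Assumption: when present, the message columns are integer counts.
    """
    return None if value == "-" else int(value)


def extra_columns(fields):
    """Extract columns 102-105 from an already split record.

    `fields` is a list of strings; column N of the table is fields[N - 1].
    """
    return {
        "c2s_messages": parse_optional_int(fields[101]),  # column 102
        "s2c_messages": parse_optional_int(fields[102]),  # column 103
        "device_id": fields[103],                         # column 104
        "service": fields[104],                           # column 105
    }
```

Returning None rather than 0 keeps "not yet captured" distinct from "zero messages observed", so early records can be excluded from per-message statistics instead of skewing them.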

Sample Scripts

External Links