I’m keeping public track of what I read. Today I read three papers using the three-pass method.
I did a first pass over two papers on distributed file systems. Both papers are by Sage A. Weil, Scott A. Brandt, Ethan L. Miller and Darrell D. E. Long of the University of California, Santa Cruz.
- Ceph: A Scalable, High-Performance Distributed File System
- CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data
I gave a Facebook paper a second pass. The paper describes a system for efficient photo storage and retrieval. Among other things, it minimizes metadata so that it can be kept in memory, and it stores a large number of photos in a single physical file together with pointers into that file.
- Finding a needle in Haystack: Facebook’s photo storage
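To make the idea concrete for myself, here is a minimal sketch of "many photos in one file, pointers kept in memory". It is a toy of my own, not Facebook's implementation: the class name `TinyHaystack`, its methods, and the use of an in-memory `BytesIO` in place of a real file on disk are all assumptions made for illustration.

```python
import io


class TinyHaystack:
    """Toy append-only store: many blobs in one file, index kept in memory.

    Hypothetical illustration only; the real system has a richer on-disk
    format (headers, checksums, deletion flags) that is not modelled here.
    """

    def __init__(self, fileobj):
        self._file = fileobj   # stands in for the single large physical file
        self._index = {}       # photo_id -> (offset, size), the in-memory metadata

    def put(self, photo_id, data):
        # Always append at the end and remember where the bytes landed.
        self._file.seek(0, io.SEEK_END)
        offset = self._file.tell()
        self._file.write(data)
        self._index[photo_id] = (offset, len(data))

    def get(self, photo_id):
        # One dictionary lookup, then one seek and one read.
        offset, size = self._index[photo_id]
        self._file.seek(offset)
        return self._file.read(size)


store = TinyHaystack(io.BytesIO())
store.put(1, b"first photo bytes")
store.put(2, b"second photo")
first = store.get(1)
second = store.get(2)
```

The point of the design, as I understand it, is that a lookup needs no per-photo file system metadata: the small index entry tells you exactly where to read.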
Find all of these papers on Google Scholar.
Below are four different (partially overlapping) lists of papers one should read in distributed systems.
I have found some tutorials (1 and 2) for using the hstore column type in PostgreSQL. This blog post is specific to querying OpenStreetMap data using the tags column in relations, ways, and nodes (the tags column is an hstore).
This blog post is made as a “note-to-self” so that I can remember the procedure. You are of course welcome to read along. It does nothing fancy; it simply imports the planet.osm file into PostGIS using Osmosis with the Snapshot Schema.
Step by step
This assumes Osmosis is installed (if not, download Osmosis) and that a planet.osm file has been downloaded.
Situation: You have a large pile of computer science papers in front of you. You want to read them all. What to do?
My suggestion is that you read the two guides below. They are really short and helpful. I’m one year into my CS PhD, and I still find reading a large pile of papers to be quite hard, especially if the papers explore problems within a field that I’m not super familiar with.
General benchmarking tools
Phoronix Test Suite is a testing and benchmarking platform.
This is easy (using pip):
This tutorial uses an in-memory SQLite database, which is cool in itself.
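For reference, this is roughly what an in-memory SQLite database looks like with nothing but Python's standard `sqlite3` module. It is a small sketch, not the tutorial's own code; the `papers` table and its rows are made up for the example.

```python
import sqlite3

# ":memory:" asks SQLite for a database that lives only in RAM
# and disappears when the connection is closed.
conn = sqlite3.connect(":memory:")

# Hypothetical table and data, used only to have something to query.
conn.execute("CREATE TABLE papers (title TEXT, passes INTEGER)")
conn.executemany(
    "INSERT INTO papers VALUES (?, ?)",
    [("Ceph", 1), ("CRUSH", 1), ("Haystack", 2)],
)

rows = conn.execute(
    "SELECT title FROM papers WHERE passes >= 2"
).fetchall()
conn.close()
```

Nothing is written to disk, which makes this handy for experiments and tests.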
Apache Cassandra is a column-store with a p2p distribution architecture. Yesterday I did a presentation about Cassandra at Grontmij, Glostrup, Denmark.
I made the slides in Google Docs, and they’re public: slides.
I tried the Tor browser today, and was amazed at how slow it was. As Tor’s user base has grown, the performance of the Tor network has suffered. This document describes the current understanding of why Tor is slow, and lays out the options for fixing it.
In this post I’ll compare the running time of reading uncompressed and compressed files from disk.
I’ll run a test using two files, data.txt (858M) and data.txt.gz (83M), that have the same content.
About cat and zcat
The well-known command cat prints the contents of a file. The lesser-known zcat prints the contents of a GZIP’ed file.
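The same kind of comparison can be sketched in Python, with `open` playing the role of cat and `gzip.open` the role of zcat. This is only an illustration under my own assumptions: it generates a small, highly repetitive temporary file rather than using the real data.txt, and because the files have just been written they are most likely served from the page cache, so the timings say little about actual disk reads. The helper name `time_read` is hypothetical.

```python
import gzip
import os
import tempfile
import time


def time_read(path, opener):
    """Read a file to the end in 1 MiB chunks; return (bytes_read, seconds)."""
    start = time.perf_counter()
    total = 0
    with opener(path, "rb") as f:
        while True:
            chunk = f.read(1 << 20)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


# Made-up test data; the real experiment uses data.txt and data.txt.gz.
payload = b"some repetitive line of text\n" * 100_000

with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "data.txt")
    packed = plain + ".gz"
    with open(plain, "wb") as f:
        f.write(payload)
    with gzip.open(packed, "wb") as f:
        f.write(payload)

    plain_size = os.path.getsize(plain)
    packed_size = os.path.getsize(packed)
    plain_bytes, plain_secs = time_read(plain, open)        # like cat
    gz_bytes, gz_secs = time_read(packed, gzip.open)        # like zcat

print(f"plain: {plain_size} bytes on disk, read in {plain_secs:.4f}s")
print(f"gzip:  {packed_size} bytes on disk, read in {gz_secs:.4f}s")
```

Both reads return the same number of uncompressed bytes; what differs is how many bytes come off the disk and how much CPU time goes into decompression.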