Friday, January 2, 2015

MongoDB Notes Final

Aggregation Introduction

Aggregations are operations that process data records and return computed results.

Aggregation Pipelines
  • Documents enter a multi-stage pipeline that transforms them into an aggregated result.
  • A pipeline consists of stages.
  • Some stages take an aggregation expression as input.
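
As a rough illustration of the stage idea, the sketch below models a pipeline as an array of plain JavaScript functions applied in order. It is an in-memory analogy, not the MongoDB API: the helper names (`match`, `groupSum`, `runPipeline`) and the sample `orders` data are made up for this example.

```javascript
// In-memory sketch of a multi-stage pipeline (not the MongoDB API).
// Each stage takes an array of documents and returns a new array.

// Hypothetical stage: keep only documents that satisfy a predicate.
const match = (predicate) => (docs) => docs.filter(predicate);

// Hypothetical stage: group by a key field and sum a value field.
const groupSum = (keyField, valueField) => (docs) => {
  const totals = new Map();
  for (const doc of docs) {
    totals.set(doc[keyField], (totals.get(doc[keyField]) || 0) + doc[valueField]);
  }
  return [...totals].map(([key, total]) => ({ _id: key, total }));
};

// Feed the output of each stage into the next one.
const runPipeline = (docs, stages) => stages.reduce((acc, stage) => stage(acc), docs);

// Sample data, invented for illustration.
const orders = [
  { cust_id: "A", status: "done", amount: 50 },
  { cust_id: "A", status: "done", amount: 25 },
  { cust_id: "B", status: "done", amount: 10 },
  { cust_id: "B", status: "open", amount: 99 },
];

const pipelineResult = runPipeline(orders, [
  match((doc) => doc.status === "done"),
  groupSum("cust_id", "amount"),
]);
console.log(pipelineResult);
```

In the mongo shell, the corresponding stages would be written as `$match` and `$group` documents inside `db.collection.aggregate([...])`.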

Map-Reduce
  • Phases: map, reduce, and an optional finalize.
  • Uses custom JavaScript functions to map values to a key.

Single Purpose Aggregation Operations
  • Returning a count of matching documents
    • collection.count()
  • Returning the distinct values for a field
    • collection.distinct()
  • Grouping data based on the values of a field
    • collection.group()
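
To make the three operations concrete, here is a minimal sketch of what each one computes, written against an in-memory array. These are plain JavaScript analogues, not the shell helpers themselves; the function names and the sample `items` data are assumptions for this example.

```javascript
// Plain-JavaScript analogues of the three single-purpose operations.
// They run on an in-memory array; they are not the MongoDB shell helpers.

// Sample documents, invented for illustration.
const items = [
  { dept: "A", qty: 5 },
  { dept: "B", qty: 10 },
  { dept: "A", qty: 20 },
];

// Like count: the number of documents that match a condition.
const countMatching = (docs, predicate) => docs.filter(predicate).length;

// Like distinct: the unique values of one field.
const distinctValues = (docs, field) => [...new Set(docs.map((d) => d[field]))];

// Like group: collect documents under each value of a field.
const groupBy = (docs, field) => {
  const groups = {};
  for (const doc of docs) {
    (groups[doc[field]] = groups[doc[field]] || []).push(doc);
  }
  return groups;
};

console.log(countMatching(items, (d) => d.qty > 5));
console.log(distinctValues(items, "dept"));
console.log(Object.keys(groupBy(items, "dept")));
```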


Aggregation Pipeline on Sharded Collections
  • The pipeline is split into two parts.
    • The first part runs on each shard, or only on the relevant shards when an early match on the shard key excludes the others.
    • The second part runs on the primary shard, which collects the cursors from each shard, merges them, and forwards the final result to mongos.
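
The two-part split can be pictured with a small in-memory sketch: each shard computes partial totals, and a merge step combines them. This is only an analogy under assumed data; the `shards` array, the field names, and the helper functions are invented here and do not reflect how mongos or the shards are actually implemented.

```javascript
// Sketch of a two-part pipeline over shards (in-memory, not the MongoDB API).
// Part 1 runs on every shard; part 2 merges the partial results.

// Hypothetical shards, each holding a slice of the collection.
const shards = [
  [{ cust_id: "A", amount: 5 }, { cust_id: "B", amount: 7 }],
  [{ cust_id: "A", amount: 3 }],
  [{ cust_id: "B", amount: 1 }, { cust_id: "C", amount: 4 }],
];

// Part 1: per-shard partial totals by customer.
const partialTotals = (docs) => {
  const totals = {};
  for (const doc of docs) {
    totals[doc.cust_id] = (totals[doc.cust_id] || 0) + doc.amount;
  }
  return totals;
};

// Part 2: merge the partial totals collected from each shard.
const mergeTotals = (partials) => {
  const merged = {};
  for (const partial of partials) {
    for (const [key, value] of Object.entries(partial)) {
      merged[key] = (merged[key] || 0) + value;
    }
  }
  return merged;
};

const shardedTotals = mergeTotals(shards.map(partialTotals));
console.log(shardedTotals);
```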

Map-Reduce Example
  • Define the map function to process each input document.
  • Define the corresponding reduce function with two arguments: the key and the array of values emitted for that key.
  • Perform the map-reduce on all documents in the orders collection using the map function and the reduce function.
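
The three steps above can be sketched as an in-memory simulation. The notes do not preserve the original functions, so everything here is an assumption: the field names `cust_id` and `price`, the sample documents, and the `simulateMapReduce` driver, which stands in for the server and for the `emit()` function that MongoDB supplies to map functions.

```javascript
// In-memory simulation of map-reduce; not MongoDB's implementation.
// The field names cust_id and price are assumptions made for this example.
const orderDocs = [
  { cust_id: "A", price: 25 },
  { cust_id: "B", price: 70 },
  { cust_id: "A", price: 50 },
];

// Stand-in for the emit() function that MongoDB provides to map functions.
const emittedValues = new Map();
const emit = (key, value) => {
  if (!emittedValues.has(key)) emittedValues.set(key, []);
  emittedValues.get(key).push(value);
};

// Map: runs once per document, with `this` bound to that document.
function mapFunction() {
  emit(this.cust_id, this.price);
}

// Reduce: receives a key and the array of values emitted for it.
function reduceFunction(key, values) {
  return values.reduce((sum, price) => sum + price, 0);
}

// Hypothetical driver that plays the role of the server.
function simulateMapReduce(docs, map, reduce) {
  emittedValues.clear();
  for (const doc of docs) map.call(doc);
  const results = {};
  for (const [key, values] of emittedValues) {
    // A key with a single value is passed through without calling reduce.
    results[key] = values.length === 1 ? values[0] : reduce(key, values);
  }
  return results;
}

const totalsPerCustomer = simulateMapReduce(orderDocs, mapFunction, reduceFunction);
console.log(totalsPerCustomer);
```

In the mongo shell, functions of this shape are passed to `db.orders.mapReduce()` together with an `out` option that says where the results go.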


Replication Introduction

Replication is the process of synchronizing data across multiple servers.
Replication provides redundancy and increases data availability. It also allows you to recover from hardware failure and service interruption.

A replica set is a group of mongod instances that host the same data. One mongod, called the primary, receives all write operations. All other instances, called secondaries, apply operations from the primary so that they hold the same data. The primary logs all operations to its oplog. Only the primary can accept write operations; read operations can be served by any member.

The secondaries apply the oplog to themselves. If the primary becomes unavailable, the set elects one of the secondaries as the new primary: the secondary that receives a majority of the votes wins.

An arbiter can be added to break ties during an election when the set would otherwise have an even number of voting members. The arbiter does not hold any data and is used only for elections.
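
The arithmetic behind the tie-breaking role can be shown in a few lines. This is a minimal sketch of the majority rule only, with hypothetical helper names; it does not model the real election protocol.

```javascript
// Votes needed to win an election: a strict majority of voting members.
const majorityOf = (votingMembers) => Math.floor(votingMembers / 2) + 1;

// A candidate wins only if its votes reach the majority threshold.
const winsElection = (votes, votingMembers) => votes >= majorityOf(votingMembers);

console.log(majorityOf(3)); // 2
console.log(majorityOf(4)); // 3
console.log(winsElection(2, 4)); // false: a 2-2 split among four voters elects nobody
console.log(winsElection(3, 5)); // true: with an arbiter as fifth voter, 3 votes win
```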

An arbiter always remains an arbiter, whereas a primary can step down to become a secondary, and a secondary can be elected primary.

In MongoDB releases before 3.0, each replica set has at most 12 members; at most 7 members can vote in an election.

A priority 0 member is a secondary that cannot become primary and cannot trigger elections. It can function as a standby.
A hidden member maintains a copy of the primary’s data but is invisible to client applications. It must have priority 0, so it cannot become primary.

A delayed member contains a copy of the replica set’s data that reflects an earlier, or delayed, state of the set. A delayed member must be priority 0 and should be hidden.
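
The member types above differ only in a few settings of the replica set configuration. The sketch below shows what such member entries might look like and checks the priority rules from the notes. The host names are placeholders, `configProblems` is a hypothetical helper, and `slaveDelay` is the field name used by MongoDB releases of this era (newer releases call it `secondaryDelaySecs`).

```javascript
// Sketch of member settings in a replica set configuration document.
// Host names are placeholders, not real servers.
const memberConfigs = [
  // Ordinary member that can become primary.
  { _id: 0, host: "db0.example.net:27017", priority: 1 },
  // Priority 0 member: a standby that can never become primary.
  { _id: 1, host: "db1.example.net:27017", priority: 0 },
  // Hidden member: invisible to client applications.
  { _id: 2, host: "db2.example.net:27017", priority: 0, hidden: true },
  // Delayed member: applies operations one hour (3600 seconds) late.
  { _id: 3, host: "db3.example.net:27017", priority: 0, hidden: true, slaveDelay: 3600 },
];

// Hypothetical checker for the priority rules stated in the notes.
const configProblems = (member) => {
  const problems = [];
  if (member.hidden && member.priority !== 0) {
    problems.push("hidden member must have priority 0");
  }
  if (member.slaveDelay > 0 && member.priority !== 0) {
    problems.push("delayed member must have priority 0");
  }
  return problems;
};

for (const member of memberConfigs) {
  console.log(member.host, configProblems(member));
}
```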

Architecture 
  • Three-member replica sets
    • The minimum architecture of a replica set
  • Replica sets with four or more members
    • Ensure the set has an odd number of voting members
  • Geographically distributed replica sets

Failover
Heartbeats: Replica set members send heartbeats (pings) to each other every two seconds. If a heartbeat does not return within 10 seconds, the other members mark the unresponsive member as inaccessible.
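
The timing rule can be expressed as a small check. This is only a sketch of the bookkeeping, using the two intervals from the notes; the helper name and the sample timestamps are invented for illustration.

```javascript
// Timing values taken from the notes above.
const HEARTBEAT_INTERVAL_MS = 2000; // a heartbeat is sent every two seconds
const HEARTBEAT_TIMEOUT_MS = 10000; // no reply within ten seconds means inaccessible

// Hypothetical helper: decide whether a peer should be marked inaccessible.
const isInaccessible = (lastReplyAtMs, nowMs) => nowMs - lastReplyAtMs > HEARTBEAT_TIMEOUT_MS;

// Hypothetical bookkeeping: when each peer last answered a heartbeat.
const lastReplyAt = { db1: 19000, db2: 8000 };
const now = 20000;

const inaccessiblePeers = Object.keys(lastReplyAt).filter((peer) =>
  isInaccessible(lastReplyAt[peer], now)
);
console.log(inaccessiblePeers); // db2 only: its last reply is 12 seconds old
```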

Members prefer to vote for members with a high priority.

Optime: the timestamp of the last operation that a member applied from the oplog. A replica set member cannot become primary unless it has the highest optime of any visible member in the set.

A replica set member cannot become primary unless it can connect to a majority of the members in the set. In a three-member architecture, a secondary cannot become primary when the other two members are down, because it cannot reach a majority of the set. Likewise, when both secondaries are down, the primary steps down to become a secondary.
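
The optime and majority rules can be combined into one simplified eligibility check. This is a sketch under assumptions, not the real election logic: the member shape (`name`, `priority`, `optime`, `canReach`) and the function name are invented for this example.

```javascript
// Simplified eligibility check based on the rules described above.
// The member shape (name, priority, optime, canReach) is invented for this sketch.
const canBecomePrimary = (candidate, allMembers) => {
  // Members the candidate can see, counting itself.
  const visible = allMembers.filter(
    (m) => m.name === candidate.name || candidate.canReach.includes(m.name)
  );
  const seesMajority = visible.length > allMembers.length / 2;
  const highestOptime = Math.max(...visible.map((m) => m.optime));
  return candidate.priority > 0 && seesMajority && candidate.optime >= highestOptime;
};

const replicaMembers = [
  { name: "m1", priority: 1, optime: 100, canReach: [] },
  { name: "m2", priority: 1, optime: 100, canReach: ["m3"] },
  { name: "m3", priority: 1, optime: 90, canReach: ["m2"] },
];

console.log(canBecomePrimary(replicaMembers[0], replicaMembers)); // false: sees only itself, 1 of 3
console.log(canBecomePrimary(replicaMembers[1], replicaMembers)); // true: sees 2 of 3, highest optime
console.log(canBecomePrimary(replicaMembers[2], replicaMembers)); // false: m2 has a higher optime
```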

Read Preference
  • primary
  • primaryPreferred
  • secondary
  • secondaryPreferred
  • nearest
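
As a rough guide to what each mode means, the sketch below picks a member for a read from an in-memory list. It is a simplification with invented names and data: real drivers apply further rules (for example, choosing among several members within a latency window), whereas this version always takes the lowest-latency candidate.

```javascript
// Simplified member selection for each read preference mode.
// The member shape (name, role, pingMs) is invented for this sketch.
const pickMember = (mode, members) => {
  const byLatency = (a, b) => a.pingMs - b.pingMs;
  const primary = members.find((m) => m.role === "primary") || null;
  const secondary = members.filter((m) => m.role === "secondary").sort(byLatency)[0] || null;
  switch (mode) {
    case "primary":            // only the primary; no fallback
      return primary;
    case "primaryPreferred":   // the primary, else a secondary
      return primary || secondary;
    case "secondary":          // only a secondary; no fallback
      return secondary;
    case "secondaryPreferred": // a secondary, else the primary
      return secondary || primary;
    case "nearest":            // lowest latency, regardless of role
      return [...members].sort(byLatency)[0] || null;
    default:
      throw new Error(`unknown read preference mode: ${mode}`);
  }
};

const setMembers = [
  { name: "p", role: "primary", pingMs: 30 },
  { name: "s1", role: "secondary", pingMs: 10 },
  { name: "s2", role: "secondary", pingMs: 50 },
];

for (const mode of ["primary", "primaryPreferred", "secondary", "secondaryPreferred", "nearest"]) {
  console.log(mode, pickMember(mode, setMembers).name);
}
```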

The oplog is a special capped collection that keeps a rolling record of all operations that modify the data stored in your databases. All replica set members maintain a copy of the oplog. Any member can import oplog entries from any other member.
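
The "rolling record" behaviour of a capped collection can be illustrated with a small fixed-capacity log. This is a sketch only: the `CappedLog` class is hypothetical, it caps by entry count whereas the real oplog is capped by size, and the entries carry just a simplified `ts` and `op`.

```javascript
// Sketch of a capped, rolling log: once full, the oldest entry is dropped.
class CappedLog {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.entries = [];
  }

  // Append an entry and evict the oldest one if the cap is exceeded.
  append(entry) {
    this.entries.push(entry);
    if (this.entries.length > this.maxEntries) this.entries.shift();
  }

  // Entries newer than a given timestamp, as a syncing member would request.
  entriesAfter(ts) {
    return this.entries.filter((entry) => entry.ts > ts);
  }
}

const oplogSketch = new CappedLog(3);
for (const ts of [1, 2, 3, 4]) {
  oplogSketch.append({ ts, op: "i" }); // "i" marks an insert in this simplified entry
}

console.log(oplogSketch.entries.map((entry) => entry.ts));
console.log(oplogSketch.entriesAfter(2).map((entry) => entry.ts));
```

A member that has fallen behind by more than the log still holds cannot catch up from it, which is why the size of the rolling window matters.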

Data Synchronization
  • Initial Sync: when a member has no data
    • Clones all data.
    • Applies all changes to the data set.
    • Builds all indexes on all collections.
  • Replication: runs continuously after the initial sync



