Welcome to CIS 455 / 555 –
Internet and Web Systems
Zachary G. Ives
University of Pennsylvania
January 14, 2015
© 2013 A. Haeberlen, Z. Ives
What this Course Is About
• How do we build services like Google, Akamai, iTunes,
Facebook, eBay, …?
• What are the principles behind them?
(This is NOT a course on building Web sites! See CIS 450/550…)
• How do “cloud computing,” P2P, and Web services relate?
• The main themes of the course:
• Distributed systems concepts, with emphasis on data, scalability
and interoperability (including “the cloud”)
• Data representation fundamentals, with emphasis on XML
• Information retrieval concepts, including ranking and indexing
• It’s a course that involves building software using the
principles learned, evaluating it, and programming in teams
© 2011-14 A. Haeberlen, Z. Ives
How Does this Relate to
Other CIS Courses?
NETS 212
• Cloud service layers
• Key/value stores, in particular
• MapReduce, Spark, and data-parallel programming basics
CIS 450/550
• Data representation and management
• Relational querying with SQL; XML querying with XQuery
• DBMS-backed web sites
• 455/555 focuses on data with respect to interoperability
CIS 350/573: software engineering and mashups
CIS 505: focuses on distributed systems and algorithms
• CIS 505 is less project-oriented than CIS 555
• CIS 555 covers Web services, cloud architectures in more detail
Some Things We’ll Look at
• What are the principles behind building systems that work on
the Internet?
• How do these relate to many of today’s hot technologies?
• Web servers, DHTML, Servlets, JSP, …
• XML
• Web services
• Peer-to-peer
• Application servers
• Cloud computing environments
• Content distribution networks
• Web search
• Mash-ups
• The cloud
•…
Staff
• Instructor: Zack Ives, [email protected]
• Office: 576 Levine North
• Office hours W 1:30-2:30 (and by arrangement)
• TAs:
• Avani Deshpande
• Mounica Maddela
• Shenga Ding
• Akshay Hegde
• Shruthi Gorantala
• Piazza: piazza.com/upenn/spring2015/cis455555
• Will have custom homework submission platform (coming soon)
Textbooks
• Distributed Systems: Principles and Paradigms, 2nd ed,
Tanenbaum and van Steen
• We’ll read from the book ~50% of the time
• Frequent supplementary handouts
• Excerpts from several books
• Many recent research papers
• Your first one, which you should read by Wed:
http://research.microsoft.com/en-us/um/people/blampson/33Hints/Acrobat.pdf (linked off the CIS 555 page)
Prerequisites, Workload, etc.
Necessary skills:
• Ability to code in Java: there is a substantial implementation project
• Good debugging skills – this will be the biggest time sink!
• The ability to work as a team with classmates (towards the end)
• A willingness to learn how to read API documentation
• Some exposure to threads and concurrent programming
• A willingness to “push the envelope”
Workload:
• Several programming/debugging-based homework assignments
• A substantial term project with experimental evaluation and a report
• Two midterms
Payoff:
• Lots of practical development and debugging experience
• A good working knowledge of the fundamentals behind scalable systems
• A working “academic clone of Google,” hosted on Amazon EC2!
WARNING: this course should be considered 1.5 CU!
A Disclaimer…
• This remains a “bleeding edge” course!
• Goal 0: an understanding of scalable distributed data-centric systems
• Goal 1: a look under the covers of today’s hottest topics – in lectures and in projects
• Goal 2: a level of comfort in managing large, complex software development with others’ code
• Part of this means doing a substantial implementation project
• As in the real world: learning APIs, dealing with inadequate tools
• Most of you will find this a struggle! You’ll spend many hours debugging!
• We will be using some immature technology
• Not everything will have been validated ahead of time
• We’ll do the best we can to smooth over the bugs!
• We hope it will be a fun course, though…
… And an interesting one!
A Bit of Context for the Course
What Exactly Is the Web?
• The Web consists of HTTP servers that publish HTML, XML, and
a few other content types
• These are hyperlinked via URLs (a subset of URIs)
• Plus there are a huge number of web clients
• The Web is built on a number of Internet protocols:
• DNS, TCP, IP
• Other Internet services use other protocols
• SMTP, IMAP, POP, AIM, FTP, …
• Streaming media, music swapping protocols, …
• Web services, custom applications may actually also use HTTP in
ways it wasn’t designed for
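To make the relationship between URLs, DNS, TCP and HTTP concrete, here is a minimal sketch of what a web client writes onto a TCP connection once the host name has been resolved. The class and method names (HttpRequestSketch, buildGet) are made up for illustration and are not part of the course material; only java.net.URI from the standard library is used.

```java
import java.net.URI;

// Illustrative helper (an assumption, not from the slides): builds the text of
// a minimal HTTP/1.1 GET request for a URL.
final class HttpRequestSketch {
    static String buildGet(String url) {
        URI uri = URI.create(url);
        // Path and query form the request target; an empty path means "/".
        String target = uri.getRawPath();
        if (target == null || target.isEmpty()) {
            target = "/";
        }
        if (uri.getRawQuery() != null) {
            target += "?" + uri.getRawQuery();
        }
        // The host part is what DNS resolves to an IP address before TCP connects.
        String host = uri.getHost();
        if (uri.getPort() != -1) {
            host += ":" + uri.getPort();
        }
        return "GET " + target + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }
}
```

A real client would open a socket to the resolved address and write this text to it; that step is left out so the sketch runs without network access.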
The Internet is Built in Layers
• Your Application
• …
• Middleware: Web Services, distributed transactions, …
• Session: SSH, FTP, HTTP, IM, P2P, …; lightweight streaming, etc.
• Transport: TCP (session-based), UDP (sessionless)
• IP: IPv4, IPv6; unicast, (multicast)
• Link: WiFi, ZigBee, Ethernet, WiMax
• …
What Is an Internet System?
• Not just a web server or web application…
• An application built over the Internet, whose functionality is
distributed across more than one machine
• Typically, at least in a client-server or server-to-server fashion, but
may have many more participants
• Typically, data and/or code must be exchanged in distributed
fashion for the functioning of the application
• Often, the data must be partitioned, replicated, translated, etc.
(“shards” in Google-speak)
• Often, the code is written in multiple different environments,
languages, etc.
• Often, there are concerns about handling failures, firewalls, attacks,
…
Why Are Internet System
Topics Interesting?
• Understanding what’s underneath today’s Web
• How does it work?
• What are its shortcomings?
• What are its strengths?
• Understanding distributed algorithms
• Using the right approach when designing new protocols and web
systems
• Being able to anticipate what’s actually possible in the future
Example: Web Search,
a Cloud Service
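[Diagram: clients send HTML forms to the Search Interface Servers and receive results. The Search Interface Servers send queries to the Index Servers and get query results back. Crawlers fetch pages from Web Pages and supply keywords + locations for the index. Annotation: uses a model of document/word similarity to rank matches.]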
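The crawl/index/query pipeline in this example can be sketched as a toy inverted index. This is only an illustration under simplifying assumptions: the class name InvertedIndexSketch is hypothetical, everything lives in one process, and the score is a plain count of matching words rather than a real document/word similarity model.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy inverted index (illustrative; not the course's search engine design).
final class InvertedIndexSketch {
    // word -> (document id -> number of occurrences of the word)
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    // What a crawler would hand to the indexer: a document id and its text.
    void addDocument(String docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            postings.computeIfAbsent(word, w -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Returns matching document ids, highest toy score first (ties by id).
    List<String> search(String query) {
        Map<String, Integer> scores = new HashMap<>();
        for (String word : query.toLowerCase().split("\\W+")) {
            Map<String, Integer> docs =
                postings.getOrDefault(word, Collections.<String, Integer>emptyMap());
            for (Map.Entry<String, Integer> e : docs.entrySet()) {
                scores.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        List<String> result = new ArrayList<>(scores.keySet());
        result.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return result;
    }
}
```

In a deployed service the postings map would be partitioned across many index servers; the single HashMap here stands in for that.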
Example: Social Networking (Facebook /
Twitter), a Cloud Service
[Diagram: clients send clicks to the User Page Servers and receive pages & notifications. A Recommender supplies suggestions. Updates and posts flow into a store of Users & entities. Further labels: common properties, usage logs, …]
Example: Enterprise (or Web) Information
Integration
[Diagram: clients send queries to a Mediator System and receive results in a “mediated schema”. The mediator reaches XML sources with XQuery + XPath over XML, relational sources with SQL over ODBC, and HTML sources with HTTP POST, whose results come back as HTML. Annotation: maps all data into a single format and virtual schema.]
Example: [email protected]
[Diagram: problem data is partitioned into new subproblems that are sent to the clients; the clients return computed subresults, which are aggregated. Annotation: breaks computation into many parts and distributes them to the clients.]
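The partition / distribute / aggregate pattern in this example can be sketched in a few lines. This is a simplified stand-in, not how any real volunteer-computing system works: local worker threads play the role of the remote clients, the "problem" is just summing an array, and the name PartitionAggregateSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: threads stand in for remote clients.
final class PartitionAggregateSketch {
    static long parallelSum(int[] data, int clients) {
        ExecutorService workers = Executors.newFixedThreadPool(clients);
        try {
            // Partitioning: split the problem data into roughly equal chunks.
            int chunk = (data.length + clients - 1) / clients;
            List<Future<Long>> subresults = new ArrayList<>();
            for (int start = 0; start < data.length; start += chunk) {
                final int from = start;
                final int to = Math.min(start + chunk, data.length);
                Callable<Long> subproblem = () -> {
                    long sum = 0;
                    for (int i = from; i < to; i++) sum += data[i];
                    return sum;
                };
                subresults.add(workers.submit(subproblem));
            }
            // Aggregation: combine the computed subresults.
            long total = 0;
            for (Future<Long> f : subresults) total += f.get();
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("a subproblem failed", e);
        } finally {
            workers.shutdown();
        }
    }
}
```

A real deployment also has to cope with clients that never answer or return wrong results, which this sketch ignores.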
Example: P2P File Sharing
[Diagram: four clients connected to one another, exchanging request and data messages hop by hop. Annotation: processes name-based requests for data; each node can make requests, forward requests, return data.]
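A minimal sketch of name-based request forwarding between peers, assuming a simple flooding scheme with a hop limit. The class PeerSketch, its fields and the ttl parameter are illustrative assumptions; real P2P systems use a variety of routing schemes, and peers here are in-memory objects rather than networked nodes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative peer: can answer a request locally or forward it to neighbors.
final class PeerSketch {
    final String id;
    final Map<String, String> localData = new HashMap<>();
    final List<PeerSketch> neighbors = new ArrayList<>();

    PeerSketch(String id) {
        this.id = id;
    }

    // Returns the data stored under 'name', or null if no peer within
    // 'ttl' hops has it. 'visited' prevents asking the same peer twice.
    String request(String name, int ttl, Set<String> visited) {
        if (!visited.add(id)) return null;      // this peer was already asked
        String hit = localData.get(name);
        if (hit != null) return hit;            // return data
        if (ttl == 0) return null;              // hop limit reached
        for (PeerSketch n : neighbors) {        // forward request
            String result = n.request(name, ttl - 1, visited);
            if (result != null) return result;
        }
        return null;
    }
}
```

The hop limit shows one trade-off of flooding: a small ttl keeps traffic down but can miss data that exists further away.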
What are the Hard Problems?
• Disclaimer: most of the hard problems AREN’T solved (or
solvable) – and there often isn’t any single BEST solution
• Much of systems design is about finding the right compromise for each
specific problem
• We can divide them into:
• Scalability
• Availability / reliability
• Consistency
• Interoperability
• Location and resource discovery
Scalability
• How do we support a large number of clients or requests?
• Distribute work!
• Challenges:
• Coordination – takes significant overhead in the general case
• Load balancing – avoid having bottlenecks
• Parts of the solution:
• Client-server, multi-tier, P2P architectures
• Restricted programming models, e.g., MapReduce
• Data partitioning, replication, remote procedure calls, …
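As an illustration of a restricted programming model, here is a single-process sketch of the MapReduce pattern applied to word counting. It is not Hadoop or any real framework's API: the class WordCountSketch and its map, reduce and run methods are hypothetical names, and the "shuffle" is just an in-memory grouping step.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of map -> shuffle (group by key) -> reduce.
final class WordCountSketch {
    // map: one input line -> list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // reduce: all counts emitted for one word -> total for that word
    static int reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    static Map<String, Integer> run(List<String> lines) {
        // shuffle: group the mapped values by key
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
            }
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            output.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return output;
    }
}
```

Because map looks at one line at a time and reduce at one key at a time, a framework is free to run both on many machines, which is where the scalability comes from.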
Availability/Reliability
• How do we ensure the system is “up” when we want it to be, and
doing the “right” thing?
• Replication and redundancy
• Security measures against attacks
• Ability to undo/redo
• Challenges:
• Keeping things consistent
• Performance vs. security
• Acknowledgments
• Parts of the solution:
• Data partitioning, replication, …
• Logging, transactions, …
• Redundant hardware, multiple sites, …
• Quorum and consensus algorithms
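One way to see why quorums help: with N replicas, if every write reaches W of them and every read consults R of them, and R + W > N, then each read set overlaps the latest write set. The sketch below simulates that in a single thread with no failures; the class QuorumRegisterSketch and the 'first' parameter (which replica a client happens to contact first) are illustrative assumptions, not a real replication protocol.

```java
import java.util.Arrays;

// Single-threaded simulation of quorum reads and writes over N replicas.
final class QuorumRegisterSketch {
    static final class Versioned {
        final long version;
        final String value;

        Versioned(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    private final Versioned[] replicas;
    private final int r;
    private final int w;

    QuorumRegisterSketch(int n, int r, int w) {
        if (r < 1 || w < 1 || r > n || w > n || r + w <= n) {
            throw new IllegalArgumentException("need 1 <= R, W <= N and R + W > N");
        }
        this.replicas = new Versioned[n];
        Arrays.fill(this.replicas, new Versioned(0, null));
        this.r = r;
        this.w = w;
    }

    // Reads R consecutive replicas starting at 'first'; newest version wins.
    Versioned read(int first) {
        Versioned newest = replicas[Math.floorMod(first, replicas.length)];
        for (int i = 1; i < r; i++) {
            Versioned v = replicas[Math.floorMod(first + i, replicas.length)];
            if (v.version > newest.version) newest = v;
        }
        return newest;
    }

    // Learns the latest version from a read quorum, then writes W replicas.
    void write(String value, int first) {
        long next = read(first).version + 1;
        for (int i = 0; i < w; i++) {
            replicas[Math.floorMod(first + i, replicas.length)] = new Versioned(next, value);
        }
    }
}
```

With N = 3 and R = W = 2, a read still returns the latest value even though one replica is stale after each write.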
Consistency / Consensus
• Replication, distribution, and failures make it difficult to keep a
unified, consistent view of the world – how do we combat this?
• Locking, concurrency control, and invalidation schemes
• Clock synchronization
• Challenges:
• Locking has huge performance overhead
• Network partitions, disconnected operation
• Parts of the solution:
• Optimistic concurrency control, 2-phase locking
• Distributed clock sync
• Conflict resolvers
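Optimistic concurrency control avoids holding locks while a client works: the client reads a value together with its version, computes an update, and the commit succeeds only if nobody else has committed in the meantime. The sketch below shows that validate-at-commit idea for a single in-memory cell; the names OptimisticCellSketch, Snapshot and commit are hypothetical.

```java
// Illustrative single-value store with version-checked commits.
final class OptimisticCellSketch {
    static final class Snapshot {
        final long version;
        final String value;

        Snapshot(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    private long version = 0;
    private String value = "";

    // Read phase: no lock is held after this call returns.
    synchronized Snapshot read() {
        return new Snapshot(version, value);
    }

    // Validate-and-write phase: fails if another commit happened since
    // the snapshot was taken. The caller decides whether to retry.
    synchronized boolean commit(Snapshot basedOn, String newValue) {
        if (basedOn.version != version) {
            return false; // conflict detected
        }
        value = newValue;
        version++;
        return true;
    }
}
```

When conflicts are rare this is cheaper than locking; when they are frequent, the retries can cost more than the locks would have.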
Interoperability
• How do we coordinate the efforts of components that have
different data formats and/or source languages, and are on
different machines?
• Standardization!
• Challenges:
• Everything has different semantics!
• Parts of the solution:
• Standard data formats: XML, XML schemas
• “Schema mediation” and data translation
• Remote procedure calls: CORBA, XML-RPC, …
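A small sketch of why a standard data format helps: any component that can parse XML can read data produced by another, whatever language wrote it. The example uses the JDK's built-in DOM parser; the class XmlExchangeSketch and the "title" element name are illustrative assumptions, not a schema from the course.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Illustrative consumer of XML data produced by some other component.
final class XmlExchangeSketch {
    // Returns the text of every <title> element in the document.
    static List<String> titles(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("title");
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (Exception e) {
            // Covers parser configuration, I/O and well-formedness errors.
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }
}
```

Agreeing on the syntax does not settle the semantics: both sides still have to agree on what a title means.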
Location & Resource Discovery
• How do you find what you’re looking for?
• Naming
• Declarative queries over standard schemas
• Advertisements
• Challenges:
• Naming has implicit semantics
• What do you do when you don’t know what to call something?
• Parts of the solution:
• Directory systems – DNS, LDAP, etc.
• Resource discovery and advertising protocols
• Overlay networks, sharding schemes
• Standardized schemas
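One sharding scheme often used to locate data in overlay networks is consistent hashing: nodes and keys are hashed onto the same ring, and a key lives on the first node at or after its position. The sketch below is a bare-bones version under stated assumptions: the class HashRingSketch is hypothetical, String.hashCode stands in for a proper hash function, and there are no virtual nodes or replication.

```java
import java.util.Map;
import java.util.TreeMap;

// Bare-bones consistent hashing ring (illustrative only).
final class HashRingSketch {
    // ring position -> node name, kept sorted by position
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node) {
        ring.put(position(node), node);
    }

    void removeNode(String node) {
        ring.remove(position(node));
    }

    // The owner of a key is the first node at or after the key's position,
    // wrapping around to the start of the ring if necessary.
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes in the ring");
        }
        Map.Entry<Integer, String> owner = ring.ceilingEntry(position(key));
        return (owner != null ? owner : ring.firstEntry()).getValue();
    }

    // Assumption: String.hashCode is good enough for a demonstration.
    private static int position(String s) {
        return s.hashCode() & 0x7fffffff;
    }
}
```

The useful property is that removing a node only moves the keys that node owned; every other key keeps its owner.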
Our First Focus:
Single Machines, aka Servers
• How do you handle large numbers of concurrent users?
• Processes
• Threads
• Events
• Hybrids (e.g., thread pools)
• Staged architectures
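To illustrate the thread-pool hybrid, here is a minimal sketch of a fixed set of worker threads pulling tasks from a shared queue, so the number of threads stays bounded no matter how many requests arrive. The class TinyThreadPoolSketch is a hypothetical teaching example, not the course's assignment code, and it omits error handling: a task that throws an unchecked exception would end its worker.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal hand-rolled thread pool: N workers share one task queue.
final class TinyThreadPoolSketch {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Thread[] workers;
    private volatile boolean running = true;

    TinyThreadPoolSketch(int size) {
        workers = new Thread[size];
        for (int i = 0; i < size; i++) {
            workers[i] = new Thread(this::workLoop, "worker-" + i);
            workers[i].setDaemon(true); // do not keep the JVM alive
            workers[i].start();
        }
    }

    // Called by the thread that accepts requests; never blocks on the work.
    void submit(Runnable task) {
        queue.add(task);
    }

    // Stops the workers; tasks still waiting in the queue are dropped.
    void shutdown() {
        running = false;
        for (Thread t : workers) {
            t.interrupt();
        }
    }

    private void workLoop() {
        while (running) {
            try {
                queue.take().run(); // blocks until a task is available
            } catch (InterruptedException e) {
                return; // interrupted while waiting: exit the worker
            }
        }
    }
}
```

Compared with one thread per request, the queue absorbs bursts while the pool size caps memory use and context switching. In production code, java.util.concurrent.ExecutorService offers the same idea with far more care.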
Next Time…
• We’ll look under the covers of an HTTP server
• Key ideas in building scalable systems
• Principles of HTTP and web servers
• Management of concurrent sessions
• To read by next Wednesday:
• Lampson and Saltzer paper
http://research.microsoft.com/en-us/um/people/blampson/33Hints/Acrobat.pdf
• Tanenbaum Ch. 3.1
• If necessary: Review Tanenbaum “Modern OS,” Ch. 2.3 or a similar
OS book on interprocess communication