close

Вход

Забыли?

вход по аккаунту

код для вставкиСкачать
DISTRIBUTED DATA FLOW WEB-SERVICES
FOR ACCESSING AND PROCESSING OF BIG
DATA SETS IN EARTH SCIENCES
A.A. Poyda1, M.N. Zhizhin1, D.P. Medvedev2, D.Y. Mishin3
1NRC
"Kurchatov Institute", Moscow, Russia
2Geophysical Center RAS, Moscow, Russia
3Johns Hopkins University, Baltimore, USA
The Big Data problem in Earth sciences
160
Current Estimate of NOAA NESDIS DATA
ARCHIVE VOLUME PROJECTIONS
(under CLASS Environment - 2 site concept)
August 2006
Model Data
140
NEXRAD
NPOESS
NPP
GOES
100
)NASA EOS (MODIS
METOP
80
Ocean Related Data
DMSP
60
& IN-SITU (Weather
)Climate
CORS
40
POES
.Misc
20
Sorted by year 2020 volumes
0
20
04
20
05
20
06
20
07
20
08
20
09
20
10
20
11
20
12
20
13
20
14
20
15
20
16
20
17
20
18
20
19
20
20
PETABYTES
120
YEAR
Big Data problem in Earth sciences
• Storage problem: remote access is required.
• Data request problem: timeout or insufficient
memory when requesting big data blocks.
• Data processing problem: processing of big
data volumes may lead to disk swapping
resulting in dramatic performance decrease.
• Optimization of data access and processing is
required.
Data model for Earth sciences
Vis5D time-space-parameter animation
Data access and processing
optimizations in Earth sciences
• Data access parallelization
• Migration to data-flow / block-stream data
access
• Data store optimization
• Migration to distributed data-flow processing
Data access parallelization
OpenStack Swift
Fault-tolerant, distributed object or blob storage with continuity support
• Works as data container
• Supports fault-tolerance and data
replication
• Data backup
• Scalability
• RESTful S3-like interface
• Supports users authorization and
authentication (swauth, keystone)
Openstack SWIFT performance
Rate (MB/s)
250
200
150
100
50
0
0
2
4
6
8
10
12
14
16
Data-flow / block-stream data access
Scientific data arrays
•
Arrays are widely used in environmental sciences to store modelling results,
satellite observations, raster maps, etc.
•
Datasets can be quite large, up to several terabytes.
•
Most data are stored as file collections in proprietary formats or universally
adopted formats like netCDF, GRIB, HDF5.
•
File access can be problematic:
 Scientists need to know about too many file formats
 Usually files must be completely downloaded before they can be used
 Thousands of files can be processed in one data request; only a small
portion of their contents appears in the result set
•
Currently available database solutions do not have convenient array storage
capabilities.
Data store optimization. Cloud-based Active
Storage for multidimensional arrays.
•
•
Active Storage is a new way in database design used for storing multidimensional numeric arrays containing space, terrestrial weather data
archives and large scaled images.
Special features of Active Storage are:
– Universal architecture capable to store different data types in one
system.
– Effective index creating for large data (tens and hundreds Tb).
– Can do basic data transformations directly on storage nodes
(arithmetic operations, statistical operations, linear convolution).
– Metadata integrated with data.
– Can distribute data automatically on several computer nodes (also
can distribute computations).
– Can be used in Grid infrastructure using OGSA-DAI services.
Splitting an array into chunks
Non-chunked array
1 seek
8 seeks
Chunked array
• We store chunks in BLOB fields of a database table
• Chunks do not need to be the same size
4 seeks
4 seeks
chunk_key
chunk
0
<Chunk0>
1
<Chunk1>
2
<Chunk2>
3
<Chunk3>
ActiveStorage performance
Request
number
Request form
(time Х
latitude Х
longitude)
1
8 х 64 х 128
2
32 х 32 х 64
3
128 х 16 х 32
4
512 х 8 х16
5
2048 х 4 х8
6
8192 х 2 х 4
7
32768 х 1 х 2
Distributed data-flow processing
Distributed data-flow processing organization
problems:
• Data communication support between activity;
• Load balancing and parallelization management;
• Fault-tolerance and error processing support;
• Activity management.
At present, several frameworks of distributed dataflow processing exist: Yahoo S4, Twitter Storm,
Taverna, Kepler, OGSA DAI.
Twitter Storm
Wind speed calculation workflow
example
GetData
(U-component)
GetData
(V-component)
processing
GetData
(U-component)
GetData
(V-component)
processing
Output Block
GetData
(U-component)
RESTful data
service
processing
GetData
(V-component)
GetData
(U-component)
processing
GetData
(V-component)
Wind speed calculation:
U V
2
2
Dependence of data-flow processing
time from data volume
Problems that are not solved by
frameworks
• Automatic partitioning of source data space.
• Flooding and synchronization management in
case of data flow merging.
• Data flow routing in case of parallel processing
activity and data flow merging.
Current work
• Twitter Storm data request block-stream
activity supporting block geometry and array
priority direction properties, and automatic
partitioning of source data space.
• Twitter Storm data processing activity
supporting automatic data flow merging,
generalized array processing language, and
flooding management.
Results
A framework has been developed, having the
following features:
• cloud storage with data reservation and access
acceleration;
• designed for large multidimentional data arrays;
• request shape flexibility;
• flow-based system for access and processing;
• high scalability.
Applications
• High-resolution 3D models of the Earth based
on large number of observations.
• Climate modeling and analysis tasks.
• Multispectral satellite and geological imagery
processing.
1/--страниц
Пожаловаться на содержимое документа