select * from StockTick
select * from StockTick(symbol='IBM').win:time(30 sec)
select a.custId, sum(a.price + b.price)
from pattern [every a=ServiceOrder ->
              b=ProductOrder(custId = a.custId) where timer:within(1 min)].win:time(2 hour)
where a.name in ('Repair', b.name)
group by a.custId
having sum(a.price + b.price) > 100
[...] we’re measuring data less in terms of scale and more in terms of bandwidth.
select avg(price) from StockTick(symbol='IBM').win:time(1 min)
select avg(price) from StockTick(symbol='IBM').win:time(1 hour)
StockTick symbol:IBM => price#avg every minute
StockTick symbol:IBM => price#stdev every hour
Online update (one pass):
\begin{equation*} \bar x_n = \bar x_{n-1} + \frac{x_n - \bar x_{n-1}}{n} \end{equation*}
\begin{equation*} M_{2,n} = \sum_{i=1}^n (x_i - \bar x_n)^2 = M_{2,n-1} + (x_n - \bar x_{n-1})(x_n - \bar x_n) \end{equation*}
\begin{equation*} \sigma_n^2 = \frac{M_{2,n}}{n} \end{equation*}
Merging two partitions $A$ and $B$ into $X$:
\begin{equation*} \delta = \bar x_B - \bar x_A \end{equation*}
\begin{equation*} \bar x_X = \bar x_A + \delta\cdot\frac{n_B}{n_X} \end{equation*}
\begin{equation*} M_{2,X} = M_{2,A} + M_{2,B} + \delta^2\cdot\frac{n_A n_B}{n_X} \end{equation*}
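Both recurrences translate directly to code. A minimal Python sketch (the class name OnlineStats and the merge method are illustrative, not from the talk):

import math

class OnlineStats:
    # streaming mean/variance (Welford); mergeable for parallel aggregation
    def __init__(self, n=0, mean=0.0, m2=0.0):
        self.n, self.mean, self.m2 = n, mean, m2

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n          # x̄_n = x̄_{n-1} + (x_n - x̄_{n-1})/n
        self.m2 += delta * (x - self.mean)   # M_{2,n} = M_{2,n-1} + (x_n - x̄_{n-1})(x_n - x̄_n)

    def variance(self):
        return self.m2 / self.n if self.n else 0.0   # σ² = M_2 / n

    def merge(self, other):
        # combine two partitions A and B into X, no raw data needed
        n = self.n + other.n
        if n == 0:
            return OnlineStats()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return OnlineStats(n, mean, m2)

The merge step is what makes these aggregations distributable: each node summarizes its own stream, and the partial (n, mean, M2) triples combine exactly.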
StockTick symbol:IBM => 42 * price#avg * price#stdev every hour
PageViews page:home => user_id:dcount every hour
Problem                   Structure           Who / when
Set cardinality           HyperLogLog         Philippe Flajolet (2007)
Multiset summarization    Count-Min Sketch    Graham Cormode (2003)
Set membership            Bloom Filters       Burton Bloom (1970)
It is theoretically impossible to define a hash function that creates random data from non-random data. But in practice it is not difficult to produce a pretty good imitation.
Downloading a few gigabytes of data before you can start using Bitcoin is not a good first user experience. (...) added support for bloom filters to get just the transactions relevant to your wallet.
a small amount of tablet server memory used for storing Bloom Filters drastically reduces the number of disk seeks required for read operations.
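None of these structures is hard to build. As a taste, a minimal Bloom filter sketch in Python, assuming the mmh3 murmur-hash package (illustrative, not code from the talk's repo):

import mmh3

class BloomFilter:
    # m bits, k hash functions; membership test with tunable false positives
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _index(self, element, i):
        return mmh3.hash(element, i) % self.m

    def add(self, element):
        for i in range(self.k):
            h = self._index(element, i)
            self.bits[h >> 3] |= 1 << (h & 7)

    def __contains__(self, element):
        # all k bits set -> "probably present"; any bit clear -> definitely absent
        return all(self.bits[h >> 3] & (1 << (h & 7))
                   for h in (self._index(element, i) for i in range(self.k)))

A lookup can return a false positive but never a false negative, which is exactly why it can safely gate disk seeks and transaction downloads.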
Width ($w$) and depth ($d$) are chosen so that:
$w = \lceil e/\epsilon \rceil$ and $d = \lceil \ln{(1/\delta)} \rceil$.
This makes it possible to answer point queries ($\hat{a}_i$) where:
$a_i \leq \hat{a}_i \leq a_i + \epsilon \|a\|_1$
The right-hand inequality holds with probability at least $1 - \delta$.
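For example, with $\epsilon = 0.001$ and $\delta = 0.01$: $w = \lceil e/0.001 \rceil = 2719$ and $d = \lceil \ln 100 \rceil = 5$, so the sketch holds $2719 \times 5 = 13595$ counters, no matter how many elements the stream contains.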
It consists of estimating $\sum_{i=l}^{r} a_i$.
The trick is to treat the structure like a Fenwick tree over a probabilistic array: keep one sketch per dyadic level, so any range decomposes into $O(\log n)$ blocks (see the sketch after the code below).
import mmh3

class CountMin:
    def __init__(self, m, k):
        # k rows (one hash function each) of m counters
        self.data = [[0] * m for _ in range(k)]
        self.m = m
        self.k = k

    def add(self, element, count=1):
        # increment one counter per row, indexed by that row's hash
        for i in range(self.k):
            h = mmh3.hash(element, i)
            self.data[i][h % self.m] += count

    def estimate(self, element):
        # point query: the minimum across rows never underestimates a_i
        return min(self.data[i][mmh3.hash(element, i) % self.m]
                   for i in range(self.k))
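A hypothetical usage of the class above, counting word frequencies:

cm = CountMin(m=2719, k=5)
for word in "to be or not to be".split():
    cm.add(word)
print(cm.estimate("to"))  # 2: may overestimate on collisions, never underestimates

And a minimal sketch of the dyadic-range trick described above, assuming indices in $[0, 2^{levels})$; the name RangeCountMin and its methods are illustrative, not from the talk's repo:

class RangeCountMin:
    def __init__(self, m, k, levels):
        # sketches[L] counts dyadic blocks of size 2^L
        self.levels = levels
        self.sketches = [CountMin(m, k) for _ in range(levels + 1)]

    def add(self, i, count=1):
        for L in range(self.levels + 1):
            self.sketches[L].add(str(i >> L), count)  # block containing i at level L

    def range_sum(self, l, r):
        # greedily cover [l, r] with O(log n) dyadic blocks, one estimate each
        total = 0
        while l <= r:
            L = 0
            while (L < self.levels and l % (1 << (L + 1)) == 0
                   and l + (1 << (L + 1)) - 1 <= r):
                L += 1
            total += self.sketches[L].estimate(str(l >> L))
            l += 1 << L
        return total

Each query touches at most $O(\log n)$ blocks, each contributing its own point-query error, so the additive error grows only logarithmically with the range length.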
The math establishing error bounds or other properties can be somewhat hairy, but the structures themselves are not too complicated.
$\log_2 m$ bits of the hash space are chosen to map to $m$ substreams ($|M| = m$). The scan for the leading 1-bit is performed on the remaining bits.
For example, for $\log_2 m = 3$ ($m = 8$), as in the toy split below:
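A toy illustration of the split in Python (hypothetical 8-bit hashes for readability; the real code below uses 32):

x = 0b101_01100          # hypothetical 8-bit hash, log2(m) = 3
i = x >> 5               # top 3 bits: register index 5 (0b101)
w = (x << 3) & 0xFF      # remaining bits: 0b01100000
# rank = position of the leftmost 1-bit in w = 2
# register update: M[5] = max(M[5], 2)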
Estimate from each register:
$\hat{E}_i = 2^{M(i)}$
Overall estimate (harmonic mean of the register estimates, scaled by $m$):
$\hat{E} = \frac{m^2}{\sum_{i=1}^{m} \hat{E}_i^{-1}} = \frac{m^2}{\sum_{i=1}^{m} 2^{-M(i)}}$
Overall estimate (correcting the multiplicative bias):
$\hat{E}' = \frac{\alpha_m m^2}{\sum_{i=1}^{m} 2^{-M(i)}}$, where $\alpha_m = \left(m \int_0^{\infty} \left(\log_2 \frac{2+u}{1+u}\right)^m \, du\right)^{-1}$
import math
import mmh3

class HyperLogLog:
    def __init__(self, log2m):
        self.log2m = log2m
        self.m = 1 << log2m
        self.data = [0] * self.m
        # alpha_m * m^2, using the usual closed-form approximation of alpha_m
        self.alphaMM = (0.7213 / (1 + 1.079 / self.m)) * self.m * self.m

    def _bitscan(self, x, limit):
        # rank: position of the leftmost 1-bit in the 32-bit word x,
        # or limit + 1 when all `limit` scanned bits are zero
        v = 1
        while v <= limit and not (x & 0x80000000):
            x = (x << 1) & 0xFFFFFFFF
            v += 1
        return v

    def add(self, element):
        x = mmh3.hash(str(element), 0) & 0xFFFFFFFF   # unsigned 32-bit hash
        a, b = 32 - self.log2m, self.log2m
        i = x >> a                                    # top log2m bits: register index
        v = self._bitscan((x << b) & 0xFFFFFFFF, a)   # rank of the remaining bits
        self.data[i] = max(self.data[i], v)

    def cardinality(self):
        estimate = self.alphaMM / sum(2.0 ** -v for v in self.data)
        zeros = self.data.count(0)
        if estimate <= 2.5 * self.m and zeros > 0:
            # small-range correction: linear counting on empty registers
            return round(-self.m * math.log(zeros / self.m))
        return round(estimate)
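A quick check of the estimate quality (hypothetical usage of the class above):

hll = HyperLogLog(10)      # m = 1024 registers; typical error ~1.04/sqrt(m) ≈ 3%
for i in range(100000):
    hll.add(i)
print(hll.cardinality())   # close to 100000, using only 1024 small registers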
The best algorithms live in the smallest code...
...but in the most obscure papers.
Probabilistic data structures will play a fundamental role in the evolution of Big Data.
Academia has already realized this and is ten years ahead.
Industry is still taking its first steps.
We're hiring!
intelie.com.br/trabalhe
Slides at juanlopes.net/qconsp2013.
Examples at github.com/juanplopes/sketches.
Contact: