nevesnunes/aggregables


Aggregables

Snippets and scripts to parse and manipulate data patterns.

Install

pip install -e .

Tasks

Time Series

Compare deviations of two time spans in logs, grouped by captured variables

Use cases:

  • Analyzing logs when we are not certain which variables to observe, but know a point in time to compare against (e.g. just before an exception was thrown). The assumption is that variables whose values deviate more between the two time spans are more likely to be worth observing.
    • e.g. to understand why an exception was thrown: if all requests across the full time span (i.e. all logged requests) use the verb GET, then the verb doesn't offer any clues; however, if the user making requests only appears in the second time span and not in the first, we should investigate what is special about that user session

Usage:

# Split time span at point where timestamps occurred after '1 week ago'
./measure_deviating_groups.py access.log.1 access.log.rules '1 week ago'

In this case, assuming the current date is "08/Aug/2020", log lines will be split into two sets for analysis:

set 1 | 109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "GET /administrator/ HTTP/1.1" [...]
      | [...]
      | 109.184.11.34 - - [12/Dec/2015:18:32:56 +0100] "GET /administrator/ HTTP/1.1" [...]
     ---
set 2 | 165.225.8.79 - - [06/Aug/2020:12:47:50 +0200] "GET /foo.com/cpg/displayimage.php?album=1&pos=40 HTTP/1.0" [...]
      | [...]

Variables (e.g. ip, date, verb...) are matched against regex patterns containing named capture groups. For each variable, we identify values and count their occurrences.
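
The counting step described above can be sketched as follows. This is a minimal illustration, not the code of measure_deviating_groups.py: the `RULE` pattern and the `count_values` helper are hypothetical names, and the real rules file may use different patterns.

```python
import re
from collections import Counter

# Hypothetical rule for an access log line; the real rules file may differ.
RULE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] "(?P<verb>[A-Z]+) (?P<path>\S+)'
)


def count_values(lines):
    """Map each captured variable to a Counter of its observed values."""
    counts = {}
    for line in lines:
        match = RULE.match(line)
        if not match:
            continue  # lines that no rule matches are ignored
        for name, value in match.groupdict().items():
            counts.setdefault(name, Counter())[value] += 1
    return counts


sample = [
    '109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "GET /administrator/ HTTP/1.1"',
    '109.184.11.34 - - [12/Dec/2015:18:32:56 +0100] "GET /administrator/ HTTP/1.1"',
]
counts = count_values(sample)
print(counts["verb"].most_common())
print(counts["ip"].most_common())
```

Running the same count once per time span gives the two lists of (value, occurrences) pairs that each output block compares.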

Output (sorted by standard deviation of values and occurrences):

  1. Low deviation: identical values or similar distribution of occurrences:
virtual_host (std_dev: 0.0)
        [(None, 9)]
        [(None, 5)]
---
request_method (std_dev: 0.0162962962962963)
        [('GET', 5), ('POST', 4)]
        [('GET', 4), ('POST', 1)]
[...]
  2. High deviation: all values are distinct:
path (std_dev: 0.06666666666666667)
        [('/administrator/', 5), ('/administrator/index.php', 4)]
        [('/index.php?option=com_contact&view=contact&id=1', 2), ('/foo.com/cpg/displayimage.php?album=1&pos=40', 1), ('/', 1), ('/index.php?option=com_content&view=article&id=50&Itemid=56', 1)]

Caption (for each block):

Line | Description
1    | captured variable
2    | time span 1, observed values and their occurrences
3    | time span 2, observed values and their occurrences

Related work:

Sort logs by timestamps, including non-timestamped lines

# Non-timestamped lines will use the last parsed timestamp
awk '
    timestamp {
        if(/^([0-9-]* [0-9,:]* ).*/) { print $0 }
        else { print timestamp $0 }
    }
    match($0, /^([0-9-]* [0-9,:]* ).*/, e) {
        timestamp=e[1]
    }
    NR==1 { print }
' *.log *.log.1 \
    | sort \
    | awk '{
        gsub("^[0-9-]*[[:space:]]*[0-9,:]*", "")
        if(!x[$0]++) { print }
    }' \
    | vim -

Alternatives:

Captures

Visualize co-occurrences

Usage:

./heatmap.py <(printf '%s\n' \
    'a b 41' \
    'a c 12' \
    'b c 10' \
    'c b 1' \
    'b e 1' \
    'b f 99')

Output:

|b
|O|f
|o| |a
|.| |.|c
|.| | | |e

99 ('b', 'f')
41 ('a', 'b')
12 ('a', 'c')
11 ('b', 'c')
 1 ('b', 'e')

Caption:

Symbol | Description
O      | counts > max_counts / 2
o      | counts > max_counts / 3
u      | counts > max_counts / 4
.      | counts > 0
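
As a rough sketch of how the listing and symbols above could be derived: counts are summed per unordered pair (so 'b c 10' and 'c b 1' merge into 11), then each total is mapped to a symbol using the caption's thresholds. The helper names `pair_counts` and `symbol` are illustrative and are not taken from heatmap.py.

```python
from collections import Counter


def pair_counts(rows):
    """Sum counts per unordered pair, so 'b c 10' and 'c b 1' are merged."""
    totals = Counter()
    for row in rows:
        left, right, count = row.split()
        totals[tuple(sorted((left, right)))] += int(count)
    return totals


def symbol(count, max_count):
    """Pick the heatmap symbol using the thresholds from the caption."""
    if count > max_count / 2:
        return "O"
    if count > max_count / 3:
        return "o"
    if count > max_count / 4:
        return "u"
    if count > 0:
        return "."
    return " "


rows = ["a b 41", "a c 12", "b c 10", "c b 1", "b e 1", "b f 99"]
totals = pair_counts(rows)
max_count = max(totals.values())
for pair, count in totals.most_common():
    print(f"{count:2d} {pair} {symbol(count, max_count)}")
```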

Alternatives:

Related work:

Histogram

#!/usr/bin/awk -f

{
    out[$0]++
    total++
}
END {
    for (key in out) {
        h = ""
        max_h = 8 * out[key] / total
        for (i=0; i<max_h; i++) {
            h = h "="
        }
        printf "%16s | %8s %.2f | %s\n", out[key], h, (out[key] / total), key
    }
}

Usage:

printf '%s\n' 1 1 1 2 3 | ./histogram.awk

Output (occurrences, distribution, value):

3 |    ===== 0.60 | 1
1 |       == 0.20 | 2
1 |       == 0.20 | 3

Alternatives:

Related work:

Example: Verify /dev/urandom

Input (using filled_uniq_count.py to add zeroes for missing values):

./bar.py <(head -c100000 /dev/urandom \
  | od -tuC -An -v \
  | sed 's/ /\n/g' \
  | ./filled_uniq_count.py)

Output (de-skewed distribution):

image
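
The zero-filling step mentioned in the input description can be sketched like this. It is only an assumption about what filled_uniq_count.py does, since its output format is not shown here; `filled_counts` and its `lowest`/`highest` parameters are hypothetical.

```python
from collections import Counter


def filled_counts(values, lowest=0, highest=255):
    """Count each value, reporting 0 for in-range values that never occur."""
    seen = Counter(values)
    return [(value, seen[value]) for value in range(lowest, highest + 1)]


# Stand-in for the decimal byte values produced by `od -tuC`.
tokens = "3 0 3 255".split()
counts = filled_counts(int(token) for token in tokens)
print(counts[0], counts[1], counts[3], counts[255])
```

Without the filled-in zeroes, byte values that never occur would be missing from the chart's x-axis, which would hide gaps in the distribution.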

Small multiple charts

  • multiple_bar.py
    • Interpolates bar color to make value differences across multiple scales more explicit
    • Sorts by Tukey's fences and standard deviation for faster detection of anomalies
    • Outputs to pdf to handle large numbers of charts
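
For reference, Tukey's fences flag values outside [Q1 - k*IQR, Q3 + k*IQR], conventionally with k = 1.5. The sketch below shows only that test; how multiple_bar.py combines it with standard deviation to order charts is not shown here, and `tukey_outliers` is an illustrative name.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [value for value in values if value < low or value > high]


print(tukey_outliers([10, 11, 12, 11, 10, 12, 11, 95]))
```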

Usage: paste -d ',' 1.csv 3.csv 12.csv | ./multiple_bar.py

Output: pdf

Example: Side-Channel Statistical Analysis

Line chart

Example: Instruction trace of an executable

This program takes the "else" branch in the first iteration, then the "if" branch in the remaining iterations. We can observe in the line chart that there are two blocks of repeated patterns, with the second block taking significantly more instructions.

# Generate trace file `instrace.loops.log`
~/opt/dynamorio/build/bin64/drrun \
  -c ~/opt/dynamorio/build/api/bin/libinstrace_x86_text.so \
  -- ./loops

# Filter out addresses from shared library modules
awk '
match($0, /^0x4[0-9a-f]+/) {
  print substr($0, RSTART, RLENGTH)
}
' instrace.loops.log \
  > instrace-filtered.loops.log

# Add csv header,
# convert hex values to integers,
# then format label values back to hex
cat \
  <(echo "foo") \
  <(python -c 'import sys; [print(int(x,16)) for x in sys.stdin.read().strip().split("\n")]' \
    < ../../sequences/instrace-filtered.loops.log) \
  | ./line.py --hex

Output:

image

Proximity search for two or more substrings

Usage: ./magrep.py test1 'brown.*quick'

Output:

test1:1-1:quick brown

Usage: ./magrep.sh brown quick test1

Output:

test1[1,5]:
the quick brown fox
was quick
and also a fox
bla bla bla
bbbbbbbbbbb
test1[11,12]:
the fox
was quick

Alternatives: grep --color=always -Hin -C 2 quick test1 | grep 'quick\|fox'

Output:

test1:1:the quick brown fox
test1:2:was quick
test1-3-and also a fox
test1:6:it was quick
test1-11-the fox
test1:12:was quick

Trace patterns while preserving full output

Example: timestamp each line matching 1, flushing output on each match

(echo 1 && sleep 1 && echo 1 && sleep 1 && echo 2) \
    | tee /tmp/a \
    | awk '/1/ {
        cmd = "date +%s%N"
        cmd | getline d
        close(cmd)
        print $0 " " d
        system("")
    }' \
    | tee /tmp/b

Alternatives:

Differences

Summarize distinct bytes in two files

Benchmarking:

# Given:
# - CPU: Intel i5-4200U
# - RAM: 12GiB DDR3 1600 MT/s
# - Input: 2 files with size ~= 481M
seq 1 5 \
  | while read -r i; do
    sudo sh -c 'free && sync && echo 3 > /proc/sys/vm/drop_caches && free' \
      && time ./hexdiff.py foo bar
  done
# 21.2406 seconds = (24.555 + 19.692 + 19.115 + 23.204 + 19.637) / 5

Alternatives: GNU diffutils contains cmp, which outputs offsets and byte values in a byte-by-byte manner:

10  24 ^T    25 ^U
11  14 ^L    35 ^]
25  41 !    226 M-^V
26  42 "    252 M-*
27 226 M-^V  41 !
28 252 M-*   42 "

hexdiff.py adds context by outputting in unified diff format, uses hex values, and joins differences using semantic cleanup:

./hexdiff.py test-bytes1 test-bytes2-added
--- test-bytes1
+++ test-bytes2-added
      0x0: 7071a42f707170716d | b'pq\xa4/pqpqm'
-    0x12: 140c | b'\x14\x0c'
+    0x12: 151d | b'\x15\x1d'
     0x12: 6996aa191a1b1c1d771e772122 | b'i\x96\xaa\x19\x1a\x1b\x1c\x1dw\x1ew!"'
-    0x2c: 212296aa9ff3 | b'!"\x96\xaa\x9f\xf3'
+    0x2c: 96aa21229ff31234 | b'\x96\xaa!"\x9f\xf3\x124'
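
A much simpler approximation of this kind of output can be built with Python's standard difflib. Note that this swaps in `difflib.SequenceMatcher` and does not perform the semantic cleanup that hexdiff.py applies; `byte_diff` is a hypothetical helper, and the input bytes are made up for illustration.

```python
import difflib


def byte_diff(old, new):
    """Yield diff-style lines (offset in hex, bytes in hex) for two byte strings."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            yield f" {i1:#x}: {old[i1:i2].hex()}"
            continue
        if i2 > i1:  # bytes removed from, or replaced in, the first input
            yield f"-{i1:#x}: {old[i1:i2].hex()}"
        if j2 > j1:  # bytes added to, or replacing in, the second input
            yield f"+{j1:#x}: {new[j1:j2].hex()}"


lines = list(byte_diff(b"pq\x14\x0ci", b"pq\x15\x1di"))
print("\n".join(lines))
```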

  • Comparing files recursively:

diff -aurwq dir1/ dir2/ | grep '^Only'

# Apply pair-wise process substitution recursively
# Alternative: `... | xargs eval "$(printf 'echo %s %s')"`
diff -aurwq dir1/ dir2/ | \
    gawk 'match($0, /Files (.*) and (.*) differ/, matches) {
        print matches[1] "\n" matches[2]
    }' | \
    xargs -n2 bash -c 'echo "$1 $2"; diff -auw \
        <(gawk "/^[[:space:]]*#|\/\/|<!--/{next} {print}" "$1") \
        <(gawk "/^[[:space:]]*#|\/\/|<!--/{next} {print}" "$2")' _

# diff on distinct keys
p='^\('$(diff -Naurw \
        <(grep -o '^[^=]*' ~/f1) \
        <(grep -o '^[^=]*' ~/f2) | \
    awk '
        NR <= 3 || /^[^+-]/ {next}
        {if (a) {a = a "\\|"} a = a substr($0, 2, length($0) + 1)}
        END {print a}
    ')'\)' && \
diff -Naurw <(grep "$p" ~/f1) <(grep "$p" ~/f2)

Trace changes in variables

Usage:

printf '%s\n' 'a 1' 'a 2' 'b 2' 'a 1' 'c 3' \
    | ./trace.py \
    | vim -c 'set ft=diff' -

Output (count of variable changes; variable; value):

-[0]       a: None
+[1]       a: 1
 [0]       b: None
 [0]       c: None
~~~
-[1]       a: 1
+[2]       a: 2
 [0]       b: None
 [0]       c: None
~~~
 [2]       a: 2
-[0]       b: None
+[1]       b: 2
 [0]       c: None
~~~
-[2]       a: 2
+[3]       a: 1
 [1]       b: 2
 [0]       c: None
~~~
 [3]       a: 1
 [1]       b: 2
-[0]       c: None
+[1]       c: 3
~~~
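
The bookkeeping behind this output can be sketched as below. This is not the code of trace.py: `render` and `trace` are illustrative names, and skipping assignments that leave a value unchanged is an assumption, since the sample input never repeats a value consecutively for the same variable.

```python
def render(name, entry):
    """Format one variable as '[changes]       name: value'."""
    changes, value = entry
    return f"[{changes}]       {name}: {value}"


def trace(pairs):
    """Return one diff-style block per assignment that changes a variable."""
    names = sorted({name for name, _ in pairs})
    state = {name: (0, None) for name in names}  # name -> (changes, value)
    blocks = []
    for name, value in pairs:
        changes, current = state[name]
        if value == current:
            continue  # assumption: unchanged values are not reported
        updated = (changes + 1, value)
        block = []
        for other in names:
            if other == name:
                block.append("-" + render(other, state[other]))
                block.append("+" + render(other, updated))
            else:
                block.append(" " + render(other, state[other]))
        state[name] = updated
        blocks.append(block)
    return blocks


pairs = [line.split() for line in ["a 1", "a 2", "b 2", "a 1", "c 3"]]
blocks = trace(pairs)
for block in blocks:
    print("\n".join(block))
    print("~~~")
```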

Apply ignore filters to output

Usage:

./filterdiff.py <(printf '%s\n' '([0-9]+)') test1-text1-filterdiff test1-text2-filterdiff

Output (includes filtered value 123 from first file as context, not as difference):

--- base
+++ derivative
@@ -1,4 +1,4 @@
 apple
 banana 123
 orange
-papaia
+pear

Compare with diff -u test1-text1-filterdiff test1-text2-filterdiff:

--- test1-text1-filterdiff
+++ test1-text2-filterdiff
@@ -1,4 +1,4 @@
 apple
-banana 123
+banana 456
 orange
-papaia
+pear
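
One way to get this behavior is to compare masked copies of the lines while printing the original text. The sketch below uses difflib and a hypothetical `filtered_diff` helper; it omits hunk headers and is not necessarily how filterdiff.py is implemented.

```python
import difflib
import re


def filtered_diff(base, derivative, pattern):
    """Diff two lists of lines, ignoring any text matched by `pattern`."""

    def mask(line):
        return re.sub(pattern, "", line)

    matcher = difflib.SequenceMatcher(
        None,
        [mask(line) for line in base],
        [mask(line) for line in derivative],
        autojunk=False,
    )
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Context lines keep the unfiltered text of the first input.
            out += [" " + line for line in base[i1:i2]]
        else:
            out += ["-" + line for line in base[i1:i2]]
            out += ["+" + line for line in derivative[j1:j2]]
    return out


text1 = ["apple", "banana 123", "orange", "papaia"]
text2 = ["apple", "banana 456", "orange", "pear"]
result = filtered_diff(text1, text2, r"([0-9]+)")
print("\n".join(result))
```

Unlike replacing matches with sed before diffing, the original values stay visible in the context lines.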

Example: strace diff

Consider the following diff between 2 programs:

--- loops.c
+++ loops.with_access.c
@@ -1,5 +1,6 @@
 #include "stdio.h"
 #include "stdlib.h"
+#include "unistd.h"

 void output(char *msg) { printf("%s\n", msg); }

@@ -16,5 +17,6 @@
             }
         }
     }
+    access("/tmp/1", F_OK);
     printf("%d", k);
 }

Input (filtering out any hex or decimal numbers):

./filterdiff.py \
  <(printf '%s\n' '((0x[0-9a-f]+)|([0-9]+))') \
  <(strace ./loops 2>&1 | sort -u) \
  <(strace ./loops.with_access 2>&1 | sort -u)

Output:

--- base
+++ derivative
@@ -1,12 +1,13 @@
 28) = 304
 access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
+access("/tmp/1", F_OK)                  = 0
 arch_prctl(0x3001 /* ARCH_??? */, 0x7fff943f5cc0) = -1 EINVAL (Invalid argument)
 arch_prctl(ARCH_SET_FS, 0x7fb1cd9e9540) = 0
 brk(0x118b000)                          = 0x118b000
 brk(NULL)                               = 0x116a000
 brk(NULL)                               = 0x118b000
 close(3)                                = 0
-execve("./loops", ["./loops"], 0x7fff3350eb00 /* 119 vars */) = 0
+execve("./loops.with_access", ["./loops.with_access"], 0x7ffcfe61feb0 /* 119 vars */) = 0
 +++ exited with 0 +++
 exit_group(0)                           = ?
 fstat(1, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0

Compare with diff -u <(strace ./loops 2>&1 | sort -u) <(strace ./loops.with_access 2>&1 | sort -u):

--- /proc/self/fd/11	2021-03-04 09:00:58.068761187 +0000
+++ /proc/self/fd/13	2021-03-04 09:00:58.069761198 +0000
@@ -1,29 +1,30 @@
 28) = 304
 access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
-arch_prctl(0x3001 /* ARCH_??? */, 0x7ffc5e62e920) = -1 EINVAL (Invalid argument)
-arch_prctl(ARCH_SET_FS, 0x7f76fbd8b540) = 0
-brk(0xdd7000)                           = 0xdd7000
-brk(NULL)                               = 0xdb6000
-brk(NULL)                               = 0xdd7000
+access("/tmp/1", F_OK)                  = 0
+arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe11326df0) = -1 EINVAL (Invalid argument)
+arch_prctl(ARCH_SET_FS, 0x7f0aa1ec1540) = 0
+brk(0x1b4b000)                          = 0x1b4b000
+brk(NULL)                               = 0x1b2a000
+brk(NULL)                               = 0x1b4b000
 close(3)                                = 0
-execve("./loops", ["./loops"], 0x7ffc4a29ea20 /* 119 vars */) = 0
+execve("./loops.with_access", ["./loops.with_access"], 0x7ffe745a6fc0 /* 119 vars */) = 0
 +++ exited with 0 +++
 exit_group(0)                           = ?
 fstat(1, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
 fstat(3, {st_mode=S_IFREG|0644, st_size=301428, ...}) = 0
 fstat(3, {st_mode=S_IFREG|0755, st_size=3183216, ...}) = 0
 if
-mmap(0x7f76fbbe5000, 1376256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f76fbbe5000
-mmap(0x7f76fbd35000, 307200, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x175000) = 0x7f76fbd35000
-mmap(0x7f76fbd80000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bf000) = 0x7f76fbd80000
-mmap(0x7f76fbd86000, 13160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f76fbd86000
-mmap(NULL, 1872744, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f76fbbc0000
-mmap(NULL, 301428, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f76fbd8c000
-mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f76fbd8a000
+mmap(0x7f0aa1d1b000, 1376256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f0aa1d1b000
+mmap(0x7f0aa1e6b000, 307200, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x175000) = 0x7f0aa1e6b000
+mmap(0x7f0aa1eb6000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bf000) = 0x7f0aa1eb6000
+mmap(0x7f0aa1ebc000, 13160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0aa1ebc000
+mmap(NULL, 1872744, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0aa1cf6000
+mmap(NULL, 301428, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0aa1ec2000
+mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0aa1ec0000
 mprotect(0x403000, 4096, PROT_READ)     = 0
-mprotect(0x7f76fbd80000, 12288, PROT_READ) = 0
-mprotect(0x7f76fbe02000, 4096, PROT_READ) = 0
-munmap(0x7f76fbd8c000, 301428)          = 0
+mprotect(0x7f0aa1eb6000, 12288, PROT_READ) = 0
+mprotect(0x7f0aa1f38000, 4096, PROT_READ) = 0
+munmap(0x7f0aa1ec2000, 301428)          = 0
 openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
 openat(AT_FDCWD, "/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
 pread64(3, "\4\0\0\0 \0\0\0\5\0\0\0GNU\0\1\0\0\300\4\0\0\0\330\1\0\0\0\0\0\0"..., 48, 848) = 48

Compare with diff -u <(strace ./loops 2>&1 | sed 's/\(0x[0-9a-f]\+\)\|\([0-9]\+\)/_/g' | sort -u) <(strace ./loops.with_access 2>&1 | sed 's/\(0x[0-9a-f]\+\)\|\([0-9]\+\)/_/g' | sort -u) (loss of original context, e.g. name of accessed file):

--- /proc/self/fd/11    2021-03-04 09:20:23.754515183 +0000
+++ /proc/self/fd/12    2021-03-04 09:20:23.755515196 +0000
@@ -1,11 +1,12 @@
 _) = _
 access("/etc/ld.so.preload", R_OK)      = -_ ENOENT (No such file or directory)
+access("/tmp/_", F_OK)                  = _
 arch_prctl(_ /* ARCH_??? */, _) = -_ EINVAL (Invalid argument)
 arch_prctl(ARCH_SET_FS, _) = _
 brk(_)                           = _
 brk(NULL)                               = _
 close(_)                                = _
-execve("./loops", ["./loops"], _ /* _ vars */) = _
+execve("./loops.with_access", ["./loops.with_access"], _ /* _ vars */) = _
 +++ exited with _ +++
 exit_group(_)                           = ?
 fstat(_, {st_mode=S_IFIFO|_, st_size=_, ...}) = _

Example: function disassembly diff between 2 executables

Consider the following diff between 2 programs:

--- loops.c
+++ loops.with_access.with_unused.c
@@ -1,5 +1,10 @@
 #include "stdio.h"
 #include "stdlib.h"
+#include "unistd.h"
+
+int unused() {
+    return 1;
+}
 
 void output(char *msg) { printf("%s\n", msg); }
 
@@ -16,5 +21,6 @@
             }
         }
     }
+    access("/tmp/1", F_OK);
     printf("%d", k);
 }

Input:

./funcdiff_tui.py ../sequences/loops ../sequences/loops.with_access.with_unused

Output (interactive interface with a preview of function diffs; offsets don't contribute to the diff; entries are sorted by similarity ratio):

image

References:

Sequences

Summarize matched bytes in file

Usage:

# hex encoded
./hexmatch.py <(printf '%s\n' foo bar) 6f

# literal
./hexmatch.py <(printf '%s\n' foo bar) $(printf '%s' o | xxd -p)

# little-endian
./hexmatch.py <(printf '%s\n' DCBA) $(printf '%s' BC | xxd -p) -e le

# up to off-by-2 values
./hexmatch.py <(printf '%s\n' AAA BBB ZZZ) $(printf '%s' C | xxd -p) -k 2

Output (0x[...]: offset in hex, e: endianness, k: off-by-k, b'[...]': matched bytes):

# hex encoded / literal
/proc/self/fd/11:1(0x1):b'o'
/proc/self/fd/11:2(0x2):b'o'

# little-endian
/proc/self/fd/11:1(0x1):e=le,k=0:4342 b'CB'

# up to off-by-2 values
/proc/self/fd/11:0(0x0):e=be,k=-2:41 b'A'
/proc/self/fd/11:1(0x1):e=be,k=-2:41 b'A'
/proc/self/fd/11:2(0x2):e=be,k=-2:41 b'A'
/proc/self/fd/11:4(0x4):e=be,k=-1:42 b'B'
/proc/self/fd/11:5(0x5):e=be,k=-1:42 b'B'
/proc/self/fd/11:6(0x6):e=be,k=-1:42 b'B'
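
The little-endian and off-by-k results above can be reproduced with a short sketch. It assumes that the same offset k is added to every byte of the pattern and that little-endian means the pattern bytes are reversed; both are inferred from the sample output rather than from hexmatch.py itself, and `find_off_by_k` is a hypothetical name.

```python
def find_off_by_k(data, pattern, max_k=0, endian="be"):
    """Yield (offset, k, matched bytes) with every pattern byte shifted by k."""
    if endian == "le":
        pattern = pattern[::-1]
    for k in range(-max_k, max_k + 1):
        if any(not 0 <= byte + k <= 255 for byte in pattern):
            continue  # shifted value would not fit in a byte
        shifted = bytes(byte + k for byte in pattern)
        start = data.find(shifted)
        while start != -1:
            yield start, k, shifted
            start = data.find(shifted, start + 1)


off_by_2 = list(find_off_by_k(b"AAA\nBBB\nZZZ\n", b"C", max_k=2))
little = list(find_off_by_k(b"DCBA\n", b"BC", endian="le"))
for offset, k, matched in off_by_2 + little:
    print(f"{offset}({offset:#x}):k={k}:{matched.hex()} {matched!r}")
```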

Related work:

Filter out repeated k-line patterns in a plaintext stream

Usage:

printf '%s\n' 1 2 1 2 3 3 4 | ./multi_line-uniq.sh

Output (single occurrences of '1 2' and '3'):

1
2
3
4
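
A greedy version of this filter can be sketched as follows: whenever a block of k lines is immediately followed by an identical block, the first copy is dropped. `collapse_repeats` and the `max_k` bound are illustrative assumptions, and multi_line-uniq.sh may handle edge cases differently.

```python
def collapse_repeats(lines, max_k=4):
    """Drop a k-line block when it is immediately repeated, keeping one copy."""
    out = []
    i = 0
    while i < len(lines):
        for k in range(max_k, 0, -1):
            block = lines[i:i + k]
            if len(block) == k and block == lines[i + k:i + 2 * k]:
                i += k  # skip this copy; the following copy is handled next
                break
        else:
            # No repetition starts here, so the line is kept.
            out.append(lines[i])
            i += 1
    return out


collapsed = collapse_repeats(["1", "2", "1", "2", "3", "3", "4"])
print("\n".join(collapsed))
```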

Find longest k-repeating substrings in byte stream

Input (hex dump of file):

00000000: 7071 a42f 7071 7071 6d14 0c69 96aa 191a  pq./pqpqm..i....
00000010: 1b1c 1d77 1e77 2122 2122 96aa 9ff3       ...w.w!"!"....

Output (longest 2-repeating substrings with total count):

b'pq'
b'\x96\xaa'
b'!"'
3
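
A brute-force sketch of the idea, assuming "k-repeating" means "occurring at least k times" and that overlapping occurrences count: try each substring length from longest to shortest and stop at the first length where some substring repeats. `longest_repeating` is a hypothetical helper that runs in quadratic time, so it only suits small inputs.

```python
from collections import Counter


def longest_repeating(data, k=2):
    """Return the longest substrings seen at least k times, in first-seen order."""
    for length in range(len(data) - k + 1, 0, -1):
        seen = Counter(
            data[i:i + length] for i in range(len(data) - length + 1)
        )
        found = [chunk for chunk, count in seen.items() if count >= k]
        if found:
            return found
    return []


# Same bytes as the hex dump above.
data = bytes.fromhex(
    "7071a42f707170716d140c6996aa191a"
    "1b1c1d771e772122212296aa9ff3"
)
found = longest_repeating(data, k=2)
for chunk in found:
    print(chunk)
print(len(found))
```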

Alternatives (with filter for numeric patterns): ./reducer_tui.py test-reducer1 <(printf '%s\n' '([0-9]+)')

Input (test-reducer1 file contents):

xyz
abc
abc
foo 123
bar baz
foo 456
bar baz
123

Output (interactive interface with preview for expanded unfiltered substrings):

image

Related work:

References:

Colorize contiguous longest k-repeating substrings

Usage:

printf '%s\n' 00 111 12 13 111 12 13 14 | ./repeated-sum.py

Output (count of contiguous occurrences in [...] + single substring):

            00
colorized | [2]
          | 111
          | 12
          | 13
            14

TODO
