An Analysis Framework for KCDC

Frank Polgart DLC 08.06.2020

Motivation

  • KCDC already provides public datasets
  • analysis frameworks and tools can be specialised and cumbersome to set up
  • for large datasets: bring the user to the data instead of the data to the user

Requirements

  • accessibility: we want people to actually use it, not another prestige project
  • usability: provide analysis framework
  • administration: the less effort, the better
  • don't reinvent the wheel, most of the work has already been done

Solution

  • accessibility: jupyterhub / notebooks
    almost anyone who has done some data analysis has used a Jupyter (formerly IPython) notebook before

  • usability: python + ROOT
    the modern go-to for data analysis; ROOT is needed to read the files

  • administration: docker
    light-weight, robust, fits our service needs

  • integration requires some custom code

this is by no means a novel approach; expected maintenance is low and the setup is reasonably "future-proof"
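
As a rough illustration of how the jupyterhub and docker pieces could fit together, the sketch below sets a few hub options from Python. It assumes the DockerSpawner package starts one notebook container per user, which these slides do not confirm. The image name and network name are hypothetical placeholders, and `configure` is a helper invented for this example.

```python
from types import SimpleNamespace


def configure(c):
    """Apply hub settings to a JupyterHub-style config object `c`."""
    # Assumption: one docker container per user via the DockerSpawner package.
    c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
    # The hub has to be reachable from inside the user containers.
    c.JupyterHub.hub_ip = "0.0.0.0"
    # Hypothetical image bundling Python, ROOT and the notebook server.
    c.DockerSpawner.image = "example/kcdc-notebook:latest"
    # Hypothetical docker network shared by the hub and the user containers.
    c.DockerSpawner.network_name = "kcdc-hub-net"
    # Remove user containers once they stop.
    c.DockerSpawner.remove = True
    return c


# Stand-in config object so the sketch runs without JupyterHub installed.
# In an actual jupyterhub_config.py one would call configure(get_config()).
demo = configure(SimpleNamespace(JupyterHub=SimpleNamespace(),
                                 DockerSpawner=SimpleNamespace()))
print(demo.JupyterHub.spawner_class)
```

Keeping the settings in a plain function makes them easy to put under version control and to check without a running hub.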

Architecture

[Architecture diagram. Labels: user, proxy, jupyterhub, notebook server, docker, KCDC, web UI, FTP, authentication, file listing, data download]

Features

from the user's point of view

  • authenticates against KCDC, no extra account necessary
  • transparent access to datashop download area
  • interactive kernels for python and C++
  • loads of extensions available, for example this presentation module

from the provider's point of view

  • standard components with active development and documentation
  • if need be, scales to many users & hosts with minimal effort
  • setup with docker compose is pretty much automatic
  • infrastructure description under version control for free
  • it's easy to add more datashops
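
The "transparent access to datashop download area" feature could, in principle, be built on plain FTP calls such as the ones below. This is a minimal sketch using Python's standard `ftplib`, not the custom integration code itself. The helper names are invented for this example, and host and credentials are left to the caller.

```python
from ftplib import FTP
from pathlib import Path


def connect(host, user, password):
    """Open an FTP session; host and credentials are supplied by the caller."""
    ftp = FTP(host)
    ftp.login(user=user, passwd=password)
    return ftp


def list_download_area(ftp, directory="."):
    """Return the sorted file names found in `directory` on the connection."""
    return sorted(ftp.nlst(directory))


def download(ftp, remote_name, target_dir="."):
    """Fetch one file in binary mode and return the local path."""
    local_path = Path(target_dir) / Path(remote_name).name
    with open(local_path, "wb") as handle:
        # retrbinary passes each received chunk to handle.write
        ftp.retrbinary(f"RETR {remote_name}", handle.write)
    return local_path
```

Because the helpers only rely on `nlst` and `retrbinary`, they can be tried out against a small fake object before pointing them at a real server.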

Example

get some data!

In [3]:
import os
from zipfile import ZipFile
ZipFile('KASCADE_SmallDataSample_wA_runs_0877-7417_ROOT.zip').extractall()
os.listdir()
Out[3]:
['.ipynb_checkpoints',
 'KCDC_analyze_example.C',
 'slides.ipynb',
 'KASCADE_SmallDataSample_wA_runs_0877-7417_ROOT.zip',
 'info.txt',
 'events.root',
 'EULA.pdf']
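
A slightly more defensive variant of the cell above is sketched here: it opens the archive in a `with` block so the file is closed again, and returns the member names instead of relying on `os.listdir()`. The function name `extract_archive` is made up for this example, and the demo runs on a throwaway archive rather than on KCDC data.

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile


def extract_archive(archive, target_dir="."):
    """Extract `archive` into `target_dir` and return the member names."""
    with ZipFile(archive) as zf:  # the context manager closes the file again
        names = zf.namelist()
        zf.extractall(target_dir)
    return names


# Self-contained demo on a temporary archive (placeholder content only).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with ZipFile(demo_zip, "w") as zf:
        zf.writestr("info.txt", "placeholder")
    extracted = extract_archive(demo_zip, tmp)
    info_exists = (Path(tmp) / "info.txt").exists()

print(extracted, info_exists)
```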

Example

switch kernels and run some C++

In [1]:
.L KCDC_analyze_example.C
In [2]:
run()
Input file:events.root
KCDC-Entries read from files: 1080295
KCDCM-Entries:     1080295
Array Entries:     986577
Calor Entries:     250981
Grande Entries:    88259
General Entries:   1080295
KCDCN-Entries to be evaluated: 1080295
 processing event No: 0  of 1080295
 processing event No: 100000  of 1080295
 processing event No: 200000  of 1080295
 processing event No: 300000  of 1080295
 processing event No: 400000  of 1080295
 processing event No: 500000  of 1080295
 processing event No: 600000  of 1080295
 processing event No: 700000  of 1080295
 processing event No: 800000  of 1080295
 processing event No: 900000  of 1080295
 processing event No: 1000000  of 1080295
Entries survived:: 1080295 out of 1080295
general_id >0   :: 1080295
array_id >0     :: 986577
calorimter_id >0:: 250981
grande_id >0    :: 88259
(int) 0
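
Back in a Python cell, the counts printed above can be turned into per-component shares of all events. This is a small sketch with the numbers copied by hand from the output. Reading each "Entries" line as the number of events with data from that component is an assumption.

```python
# Counts copied from the run() output above.
total = 1080295
entries = {"Array": 986577, "Calor": 250981, "Grande": 88259}

# Share of all events per detector component (assumed meaning of the counts).
fractions = {name: count / total for name, count in entries.items()}

for name, share in fractions.items():
    print(f"{name:<7}{share:6.1%}")
```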

Example

switch back to Python and look at the result

In [1]:
import ROOT

f = ROOT.TFile('KCDC_Test.root')
keys = [key.GetName() for key in f.GetListOfKeys()]
# two columns of pads; round the row count up so an odd number of histograms still fits
rows = (len(keys) + 1) // 2
c = ROOT.TCanvas("foo", "bar", 1920, 1080*len(keys)//4)
c.Divide(2, rows)
# histograms to be drawn with a logarithmic y-axis
logspectra = ['h6202', 'h6302', 'h7202']
for pad, key in enumerate(keys, start=1):
    c.cd(pad)
    if key in logspectra:
        ROOT.gPad.SetLogy()
    f.Get(key).Draw()
Welcome to JupyROOT 6.20/04
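
The canvas size and pad grid in the cell above follow from the number of histograms. The arithmetic can be isolated in a small helper that runs without ROOT. The name `pad_layout` and its defaults are assumptions for this sketch: two columns, 1920 pixels wide, 540 pixels per row.

```python
def pad_layout(n_hist, columns=2, width=1920, row_height=540):
    """Return (columns, rows, width, height) for a grid holding n_hist pads."""
    # Ceiling division: a partially filled last row still needs its own row.
    rows = -(-n_hist // columns)
    return columns, rows, width, rows * row_height


layout_even = pad_layout(6)
layout_odd = pad_layout(7)
print(layout_even, layout_odd)
```

The result could then be passed on as `ROOT.TCanvas("foo", "bar", width, height)` followed by `c.Divide(columns, rows)`.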

Example

In [2]:
c.Draw()

What's next

  • this is a tech-demo
  • still needs a little work before it can be made accessible to the public
  • maybe have more than one datashop (AstroDS?)
  • explore viability of built-in IPython clusters for analysis
  • improve with user feedback

The End