Validating CMS event display tools with real data
This is a report on the work done for a talk at the CHEP09 Conference, Prague, 21-27 March 2009.
Abstract
An ideal general purpose event display for HEP experiments should:
- completely shield the user from data complexity by providing a user-friendly GUI
- be fast enough to allow scanning of hundreds, if not thousands, of events in a few minutes
- be light enough to allow access from anywhere, even from an Internet café
- produce quality images and animations for publishing and press
- highlight the most relevant event features
This is indeed a complex task, and what is presented here is how well the
tools now available in CMS meet this goal. To find out, they were used to access,
scan and visually analyze cosmic runs containing millions of events.
The results of these tests provide a picture of what to expect in the LHC environment
and give feedback for the improvement of these tools.
Introduction
At the beginning, my idea for this validation was to look at some events with the available event
display tools and at the end give a kind of score card saying this tool is good for this but not for that, and so on. But how many events, and which ones, should I look at
out of the millions taken? It then became clear to me that the main difference between the LHC experiments and
the previous ones, as far as event display is concerned, is that we have to deal with a huge number of events. At this
point I found the perfect use case for this validation.
Looking at the data taken in the last cosmic runs, I found some strange events containing a peculiar pattern
in the tracker endcaps: strings of clusters aligned with the beam line and apparently unconnected to the
cosmic track. I found only one or two in every 10,000 events. Since most of the time these aligned clusters weren't
reconstructed as tracks, they couldn't be studied with the usual analysis methods (i.e. without using the
event display).
At this point I wanted, as I said in the abstract,
to "access, scan and analyze visually cosmic runs containing millions of events" in order to answer
questions like these:
- Were these events present in all runs?
- Were these events present in equal percentage during the whole run?
- Were these strings distributed uniformly in space (i.e. were all tracker modules affected or only some of them)?
- Were these events present without magnetic field?
- Were these events present in Monte Carlo samples?
To answer these questions I had to develop a system to help me select runs and events inside a run: this required around two months.
In the end this presentation has become the story of this quest to look at all events in order to study this pattern.
So you won't find here the score card you expected: the available event display tools are all fine programs, but
in the brave new world of LHC
experiments, if we want to use these fine tools we have to develop a system to assist them, and the main purpose
of this presentation is to say what we need and why.
(The answer to the riddle of the aligned clusters is at the end).
The Challenge: some numbers about CRAFT
CRAFT was the last data-taking period with cosmics and the complete CMS detector. It has provided the biggest and most complete data
collection so far and is used in this study of CMS event display tools.
- Number of events taken with cosmics: around 300M, divided among around 140 good runs; but the number of events with tracks is only 7M, so expect around 1 event with tracks every 50 events.
- Events taken in around 20 days at a rate of around 300/sec.
- Each event is replicated at least 3 times in around 6 different PDS.
- Many thousands of DQM plots are available for each run and directly accessible from disk (AFS area). These plots are filled for all events in a run.
- Average number of events per run: 2.5M, stored in around 100 files of 20,000 events each, but with some
very big runs (up to 15M) and some very small ones (<1M). Each file is available on the CERN cluster lxplus from Castor mass
storage.
- You can access a single DQM plot for all runs and process them in around 1 minute (300M events processed).
- You can plot quantities already contained in a file in 1 minute (20,000 events processed, but sometimes you need to load the file from Castor to disk first).
- Time needed to make a simple plot on all events of a 6M-event run: 10 hours inside CMSSW with maximum storage 1500MB; 5 hours with bare Root with maximum
storage 350MB.
Three programs and a Framework
CMSSW is the framework used to process CMS events. It is a very complex piece of software
that stores events in Root format.
So the minimum software necessary to process CMS events is Root. But Root alone knows only the bare CMS event layout and can represent a quantity only if it is physically
stored on disk.
Root + FWLite allows the user
to visualize quantities (like track pt) that can be computed with a simple method from stored quantities. FWLite is an integral part of CMSSW and is distributed with it. It is a kind of programming user interface created to shield
the user from CMSSW complexity, allowing data to be analyzed with simple scripts
in CINT or Python. Although it can in principle do everything that CMSSW can do, in practice (at the time of writing) it is used only for simple interactive analyses of data.
Then you have the full framework (CMSSW itself), which allows the user to access any class and any CMS environment data stored in CMS databases.
An event display completely embedded inside CMSSW has the advantage that it can
use any CMSSW class and access any data contained in CMSSW databases. Unfortunately this also means a heavy use of computer and network resources, which slows
down the visualization considerably and degrades the user experience.
All four event display tools I used for this test have dealt with this
problem in some way.
First of all we have the two CMS general purpose event display tools:
- Iguana can run either completely embedded in CMSSW or as a so-called
"iView browser" that displays events in a special format created by
normal CMSSW tasks with no graphics. Events in this special format can
be read from a normal file or provided by a server using the VisNet protocol.
- Fireworks uses FWLite and in this way
shields the user from CMSSW complexity. FWLite is also used
in a way that optimizes the use of computer resources, and the result is
an easier to use and faster program.
Then there is another general purpose event display tool, Frog. It isn't supported
by CMS and is essentially the result of the efforts of a single person, Loic Quertenmont, who allowed
me to use it for this test. His tool is experiment independent and is used in many experiments.
Frog works only outside CMSSW. It is based on a custom event format
completely independent from CMS (it can be used to display any HEP experimental data). In this way it can be used on many platforms (Windows, Linux, Mac).
The CMS specific part is a normal CMSSW task that accesses CMS events
and stores them in the frog format. The data stored contain not only geometric
data but also any other kind of data that can be used later in the visualizer,
either to be displayed as they are (when picking an object, e.g. track pt) or
to apply cuts.
The last tool used is the trackermap, a visualization tool specialized for the tracker. It is a synoptic
view of the tracker and is implemented both in "Iguana embedded" and as a single class that uses no other
CMSSW services, so it can be used like a Root histogram.
Because of the use case I was validating, what mattered most to me in a tool was speed and access
to a 3D view (I had to check tracks in space).
A speed test, done by visually scanning 551 events (same machine, local access to events), gave the following
result:
Tool | Time to scan 551 events |
fireworks | 2 min |
iguana embedded | 8 min 30 sec |
iguana iView browser | 30 sec |
frog | 30 sec |
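To put these timings in perspective, the short Python sketch below converts them into events per second and extrapolates to a 20,000-event file. The extrapolation assumes that the scan rate stays constant and ignores the time needed to load or convert the file; both are assumptions of this sketch, not measurements.

```python
# Timings from the speed test above: seconds needed to scan 551 events.
SCANNED_EVENTS = 551
SCAN_SECONDS = {
    "fireworks": 2 * 60,
    "iguana embedded": 8 * 60 + 30,
    "iguana iView browser": 30,
    "frog": 30,
}


def events_per_second(tool):
    """Average scan rate measured in the 551-event test."""
    return SCANNED_EVENTS / SCAN_SECONDS[tool]


def minutes_for(tool, n_events):
    """Extrapolated scan time in minutes (assumption: constant rate)."""
    return n_events / events_per_second(tool) / 60.0


if __name__ == "__main__":
    for tool in SCAN_SECONDS:
        print(f"{tool:22s} {events_per_second(tool):6.2f} events/s, "
              f"{minutes_for(tool, 20000):7.1f} min for 20,000 events")
```

The pure display time is only part of the picture: for the embedded tools, the cost of working inside CMSSW comes on top of it.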
These results, and the fact that when I started the study fireworks didn't have a 3D view and iguana iView was
not yet available, forced me to
use frog. I also made extensive use of the trackermap outside iguana.
Note: at this point people who were expecting a score card will be extremely disappointed with how I am proceeding. I have discarded from the validation
the two official CMS programs because they are not ready. Unfortunately this
is the truth. Try to look with Iguana or Fireworks at a 20,000-event sample
as I did with frog: it is impossible unless you want to spend days on it.
And the reason isn't that iguana or fireworks are slow as display
programs, but that they are embedded in CMSSW. In fact iguana outside CMSSW
is as fast as frog! So at this time I had two possibilities: use frog or wait
until iguana outside CMSSW was ready. I chose of course the first approach, since
my experience would benefit the iguana and fireworks developers. CMSSW
developers could also use my experience to see whether CMSSW can be optimized for
this use case.
What looking at events means in an experiment like CMS
Although the use of an event display for monitoring or other applications was quite obvious in old
experiments, with CMS and similar new generation experiments its use should
be rethought. We take so many events, and most of them are garbage, that looking
at them to help in monitoring or other tasks may seem like the proverbial
search for a needle in a haystack. Looking at a few random events isn't
a big help: you will easily miss the few interesting events.
So, if we still want to use an event display to look at ALL events in a run, the only solution is
to select the possibly interesting events and look only at them. Taking into
account that we would like to cover all events and that a person, even with
a fast program, cannot scan more than a few hundred events in an hour,
an optimum cut should keep only around 1 event in every 100 in a run.
For example, for tracker monitoring, we could request only events with
number of tracks > 0. If the filter is fast, this will allow us to scan one million
events in a few hours.
The user should also be able to change the selection
strategy easily, based on the features of the selected events.
For example, while scanning the sample with tracks the user could find some events with
a very big number of clusters and want to know if there are events with
a large number of clusters but NO TRACKS.
Another important thing is that the user should be able to move easily
from run to run without bothering with lists of filenames.
Since you have so many events that you cannot look at them all even in principle, it would be nice to have access to synoptic views that show in a single image the main features of a run. This image should be created by processing all events and used to give the user a rough idea of what's happening in the run. By putting together all 150 miniatures of these images (one for each run), you have a kind of visual summary of your data.
None of these new requirements is met by the event display programs; they
require the use, and sometimes the development, of new tools.
Trying to use event display to monitor CMS tracker: my experience
The first thing was to look AT ALL EVENTS in a file (around 20,000 events).
This may seem trivial, but if you look at the times reported above you see
that iguana embedded cannot be used for this purpose (too slow).
Iguana light wasn't yet available. Fireworks wasn't a good tool for this purpose
since it still had no 3D display when I started this exercise, so it was impossible to look at tracks in space. This left only frog
available. Creating a file in frog format for 20,000 events requires only an hour, and
in the meantime you can already scan the events quickly.
The result of this scan was that the events with tracks in the tracker are very
few (around 500 out of 20,000) and that there were around 10 events with some
very obvious noise.
A quick scan of other runs was made possible in frog by an automatic procedure
set up by Nikita Beliy and showed that this happened in all files examined.
At this point a more thorough study was necessary. First of all I tried to
classify the noisy events. There were two obvious noise patterns.
- Events with a very big number of clusters positioned randomly in one or more of the 4 tracker parts named TIB/TOB/TID/TEC. Those events had no tracks only because the track finding algorithm protects itself in these cases by doing nothing.
- Events with a peculiar pattern of clusters, mostly in TEC. The clusters seem to be organized
in a "string" parallel to the beam line. Although the modules in different layers
are aligned, the clusters aren't. But the track finding algorithm sometimes manages
to fit one or more tracks completely contained in TEC.
The first noise pattern has already been studied in detail by other people
and a cause has been found (it depends on the trigger), so no further study was
necessary. The second type of noise was new and required a more exhaustive visual study of events to try to characterize it. First of all it was necessary to study it over the whole run and from run to run.
We found the following possible three-step strategy to do this and make some progress:
- Using a synoptic view of the tracker (trackermap), make a kind of tracker
x-ray showing for each module the total number of rechits during the run. This would
be done for ALL EVENTS IN A RUN, for ALL RUNS and also for MC data. Comparison of these images would clarify
whether some anomaly was apparent in TEC and how this anomaly changed in time.
- While this visual summary of all CRAFT data is being created, also classify ALL EVENTS, creating small ad-hoc Root
trees for each run. For each event a "classification ntuple" is stored containing around 10 quantities necessary to characterize the event (for example: number of clusters, number of tracks). This information will be used for further fast study of the problem and as the basis for the third step. (What I did here is to create so-called "tags". Nevertheless
I wouldn't use this name, since it refers to a global strategy to access data.)
- Select a relatively small subsample of events presenting anomalies. These subsamples should contain MOSTLY
noise events that we could inspect visually one by one.
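The first two steps can be pictured as a single pass over a run that fills both products at once. The following Python sketch is only an illustration under stated assumptions: the real implementation was a CMSSW task writing a Root tree, while here plain Python containers stand in for the trackermap and the tree, and the input field names (`rechit_modules`, `n_clusters`, `n_tracks`) are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class EventSummary:
    """One row of the "classification ntuple" (only a few of the ~10 quantities)."""
    run: int
    event: int
    n_clusters: int
    n_tracks: int


def process_run(events):
    """Single pass over all events of a run.

    `events` is any iterable of dicts with the hypothetical keys
    run, event, n_clusters, n_tracks and rechit_modules
    (a list holding one module id per rechit).
    Returns (occupancy, summaries):
      occupancy - total number of rechits per module (the tracker "x-ray")
      summaries - one EventSummary per event
    """
    occupancy = Counter()
    summaries = []
    for ev in events:
        occupancy.update(ev["rechit_modules"])  # step 1: per-module rechit count
        summaries.append(                       # step 2: classify the event
            EventSummary(ev["run"], ev["event"], ev["n_clusters"], ev["n_tracks"])
        )
    return occupancy, summaries
```

Filling both products in the same pass means that the expensive read of the full events happens only once per run; everything after that works on the small summary.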
The implementation of this strategy required around a month, mostly to learn the new "tricks" necessary to process such a large amount of data, and then another month to do
the analysis. It is incredible how difficult it is to scale from the single-file analysis to the single-run
analysis (a run contains up to 1,000 files). Considering that the single-file analysis requires less
than a minute with Root using an interpreted program, one would naively think that scaling to one hundred
files using a compiled program would require less than an hour.
Unfortunately here too the complexity of CMSSW transforms this simple task
into a gargantuan one requiring a ridiculous amount of resources.
It is clear that some optimization is needed here if we would like to debug
the detector quickly when we start taking real data. If you need at least a day
to obtain some simple histograms on a complete run, you react very slowly to new problems.
Implementing the strategy
The result of step 1 can be seen in this visual summary of CRAFT data.
The first row shows the rechit trackermaps for MC data samples with the number of tracks increasing from
16K to 800K. Each run image should be compared with the MC image whose number of tracks is nearest to the number of tracks
in the run. To represent the number of events with rechits in a given module a "rainbow" palette is used, with
blue indicating 1 event and red 255 or more events. This palette and other information about the trackermap are
shown in the first image: the legend.
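A palette of this kind can be sketched in a few lines of Python. The actual colour table used by the trackermap is not given here, so the linear interpolation of the hue from blue to red in HSV space is an assumption of this sketch; only the two end points (blue for 1 event, red for 255 or more) come from the text.

```python
import colorsys


def rainbow_rgb(n_events, saturate_at=255):
    """Map a per-module event count to an (r, g, b) tuple of 0-255 ints.

    Returns None for a module with no events, so the caller can leave it
    uncoloured. Counts at or above `saturate_at` are all drawn in red.
    """
    if n_events < 1:
        return None
    n = min(n_events, saturate_at)
    frac = (n - 1) / (saturate_at - 1)       # 0.0 at 1 event, 1.0 at saturation
    hue = (1.0 - frac) * 240.0 / 360.0       # 240 degrees = blue, 0 degrees = red
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return tuple(round(c * 255) for c in (r, g, b))
```

Saturating at a fixed count keeps the colours comparable from one run image to the next, which is what makes the side-by-side comparison with the MC images possible.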
The image for each run is created by a CMSSW task that requires around 1 day. It processes
all events, filling the trackermap. At the same time it stores a small amount of information for each event in
a Root tree. Although it is in principle possible to run all tasks in parallel, one for each run, in practice
the scarce disk resources available to a normal user on AFS won't allow me to process more than a few runs in parallel.
For this reason the complete processing of all runs, which could in principle be carried out in a week, required
around one month.
Interestingly enough, building this summary led to the discovery of new problems
in the tracker: for example, the two blue wedges that you see in two of the discs in the bottom layer were traced to wrong cabling. The same visualization also showed clearly when problems
appeared and when they stopped: for example, you can see clearly that the hole present in all rectangular layers disappears starting from this run. It also gave quick feedback in cases where the
problem could be solved by reprocessing the data.
It would help a lot if this procedure could be completely integrated in normal tracker data DQM processing.
Both the trackermap and the "event classification tree" could be added as additional material to the thousands
of plots already automatically available.
Once the "classification ntuple" is available, selecting the interesting events from the castor dataset can
be performed automatically by running a root macro that will process the "classification tree" , create a list
of events and then run the CMSSW task that creates the file in frog format reading only the selected events.
This operation is relatively fast (compared to full run processing in CMSSW) and can be performed on a typical
run in only an hour.
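The selection step itself is little more than a filter over the per-event classification. The sketch below illustrates it in plain Python instead of a Root macro; the dictionary keys, the threshold on the number of clusters and the `run:event` text format of the list are all hypothetical choices made for this illustration.

```python
def select_events(summaries, cut):
    """Return the sorted (run, event) pairs whose summary passes `cut`.

    `summaries` is an iterable of dicts, one per event, standing in for
    the rows of the "classification tree"; `cut` is any function that
    takes one such dict and returns True or False.
    """
    return sorted((s["run"], s["event"]) for s in summaries if cut(s))


def format_event_list(pairs):
    """Render the selection as text, one 'run:event' per line (assumed format)."""
    return "".join(f"{run}:{event}\n" for run, event in pairs)


def many_clusters_no_tracks(s, min_clusters=1000):
    """Example cut: a large number of clusters but NO TRACKS.

    The threshold of 1000 clusters is an arbitrary illustrative value.
    """
    return s["n_tracks"] == 0 and s["n_clusters"] >= min_clusters
```

Because the cut is just a function of the stored quantities, trying a different selection means scanning the small classification data again, not the full events.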
The results
Comparison of the resources used to answer the query “events having aligned strings of clusters in endcaps” in run 66714,
which has 5.9M events in 197 files.
209 events were selected in the whole run. We use three methods:
- Full CMSSW: build a skim of the events presenting this pattern and then look at them. Because of the amount of resources needed, this can be run only in batch using the LSF service.
- Root + FWLite without the classification ntuples. This can be run interactively on lxplus.
- We create the selection using the classification ntuples. Then we
use this list of events to look directly at the events (or to produce a skim).
This second step requires a CMSSW task but uses very little CPU time
and so can be run interactively on lxplus.
| 1 - Skim with full CMSSW | 2 - Selection with FWLite | 3 - Use of selection ntuples |
Runs | In batch on LSF | Interactively on lxplus | Interactively on lxplus |
CPU time | 14,881 sec | 3,546 sec | 860 sec |
Memory max/swap | 1286/1599 MB | 250/360 MB | 1200/1400 MB |
Wall-clock time | 10 hours | 4 hours | 1 hour |
Can look at first event | After minutes or hours (depends on how long the job waits before starting) | After 10 min | After 1 min |
Time to look at all 200 events | 10 hours + time the job waits in the queue before starting | 4 hours | 1 hour |
Methods 2 and 3 are so fast because
they use 3 jobs that run in parallel on three different computers. The 3 jobs are optimized to use fewer resources and to be run interactively. Each job is also specialized for one task.
- The Selector: this job is specialized for selecting events. This can be done using the events directly (method 2) or using the event classification (method 3). With the classification, interactively making a list of all events in a run that pass an arbitrary cut (e.g. tracks in TID and in pixel) takes less than 1 minute. Otherwise, using FWLite, it needs 4 hours.
- The Analyzer: a CMSSW task that quickly extracts from the selected events the information needed by the Visualizer and writes it in the format the Visualizer requires. As soon as a new event is ready it is made available to the Visualizer. In method 2 this job works in parallel with the Selector and so adds no extra time.
- The Visualizer: after a few minutes it lets me look at the first selected events. In an hour I can look at all selected events in a 6M-event run with method 3 (4 hours with method 2). This job can run on any computer with an Internet connection.
At any moment I can stop the process and try a new selection.
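The way the three jobs hand events to each other can be sketched as a small producer/consumer chain. In the study the jobs ran on three different computers; in the Python sketch below, threads and in-memory queues stand in for them, and `convert` and `show` are hypothetical placeholders for the CMSSW task that writes the display format and for the event display itself.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def selector(summaries, cut, out_q):
    """Job 1: push the (run, event) pairs that pass the cut."""
    for s in summaries:
        if cut(s):
            out_q.put((s["run"], s["event"]))
    out_q.put(_DONE)


def analyzer(in_q, out_q, convert):
    """Job 2: convert each selected event as soon as it arrives."""
    while True:
        item = in_q.get()
        if item is _DONE:
            out_q.put(_DONE)
            return
        out_q.put(convert(item))


def run_pipeline(summaries, cut, convert, show):
    """Run Selector and Analyzer in background threads; the calling
    thread plays the Visualizer and shows events as they become ready."""
    selected_q, display_q = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=selector, args=(summaries, cut, selected_q)),
        threading.Thread(target=analyzer, args=(selected_q, display_q, convert)),
    ]
    for w in workers:
        w.start()
    while True:
        item = display_q.get()
        if item is _DONE:
            break
        show(item)
    for w in workers:
        w.join()
```

The point of the arrangement is that the first event reaches the display as soon as it has been selected and converted, without waiting for the whole run to be processed.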
Note that both the Selector and the Analyzer could be implemented as centralized services available to
the CMS community through the Web, if the CMS management decides to do so. This would clearly benefit the CMS
community, making the task of displaying any selection of events incredibly easy in comparison to what we have now.
The Selector would be a central service to select events with arbitrary queries, returning the list of events
satisfying the query. The Analyzer would instead digest these selections, returning the events in a format ready
to be seen by the Visualizer. Note that the current attempts at an "Event Server" returning complete events
aren't suitable for event display, for obvious reasons: the complete event needs all the CMSSW software to be visualized
(whereas the Visualizer can be downloaded and used even in an Internet café) and also needs a fast
link because of the size of the data to be sent.
Conclusion
During data taking in CRAFT, event display use for tracker monitoring was almost absent. This was something completely
new compared to old HEP experiments. The reason could be traced to the big number of events taken,
to the difficulty of selecting the interesting ones from all these events, and last but not least
to the slowness of the available event display tools. I present here a possible strategy that could overcome
these difficulties and in principle allow us to look at all interesting events. I have considered the practical case
in which the interesting events are noise events in the tracker. A first step creates a visual summary of each run
and a classification
of the events based on their most relevant features. This classification is an event database stored as
a simple Root tree. This database is then harvested to quickly create subsamples of interesting events in a format
that can be scanned fast. This strategy was successfully tested using frog, one of the three event display tools
available. But in future it should also be possible to use iguana with its light version. Since I started this study fireworks has added a 3D display, and it is
possible that we could somehow adapt this strategy for use with it as well.
The proposed strategy requires a huge amount of resources. Its use by a
normal user is impractical, and
for this reason I propose to include it in normal DQM monitoring, adding to the thousands of plots already
built for each run the following new items:
- synoptic views of subdetectors
- a "database" of all events with the few pieces of information necessary to classify them.
The synoptic view will show the user which runs are to be visualized. The database will then make possible the fast creation of subsamples of interesting events in a format ready for fast scanning.
The answer to the riddle
Once the system was working it was easy to solve the riddle of these aligned
strings of clusters in the tracker endcaps:
- Processing a run without magnetic field showed that this pattern is absent.
- Processing a run with MC generated data showed that the pattern was present,
although in a slightly smaller percentage of events.
- Processing 3 complete runs, I could look at a sample of 1000 events containing
this pattern. It was clear that, in almost all cases when the primary cosmic was
present in the tracker, the line along the string of clusters
was connected in 3D to the cosmic track. Sometimes, as in this event,
you could see the complete track originating from the primary
track.
- So these strings of aligned clusters are secondary tracks generated by the
interaction of the primary track with the detector material. These low
momentum tracks spiral along the magnetic field, generating the pattern.
- A plot of the cluster charge confirmed this explanation.
See this post to commissioning hypernews
References
Giuseppe Zito:
info@zitogiuseppe.com