
Classes | Functions | Variables
cmsHarvester Namespace Reference

Classes

class  CMSHarvester
 CMSHarvester class. More...
 
class  CMSHarvesterHelpFormatter
 Helper class: CMSHarvesterHelpFormatter. More...
 
class  DBSXMLHandler
 Helper class: DBSXMLHandler. More...
 
class  Error
 Helper class: Error exception. More...
 
class  Usage
 Helper class: Usage exception. More...
 

Functions

def build_dataset_ignore_list (self)
 
def build_dataset_list (self, input_method, input_name)
 class Handler(xml.sax.handler.ContentHandler): def startElement(self, name, attrs): if name == "result": site_name = str(attrs["STORAGEELEMENT_SENAME"])

TODO TODO TODO

Ugly hack to get around cases like this:

$ dbs search --query="find dataset, site, file.count where dataset=/RelValQCD_Pt_3000_3500/CMSSW_3_3_0_pre1-STARTUP31X_V4-v1/GEN-SIM-RECO"

Using DBS instance at: http://cmsdbsprod.cern.ch/cms_dbs_prod_global/servlet/DBSServlet

Processing ...

More...
 
def build_dataset_use_list (self)
 
def build_datasets_information (self)
 
def build_runs_ignore_list (self)
 
def build_runs_list (self, input_method, input_name)
 
def build_runs_use_list (self)
 
def check_cmssw (self)
 
def check_dataset_list (self)
 
def check_dbs (self)
 
def check_globaltag (self, globaltag=None)
 

CRAB

More...
 
def check_globaltag_contains_ref_hist_key (self, globaltag, connect_name)
 
def check_globaltag_exists (self, globaltag, connect_name)
 
def check_input_status (self)
 
def check_ref_hist_mappings (self)
 
def check_ref_hist_tag (self, tag_name)
 
def create_and_check_castor_dir (self, castor_dir)
 
def create_and_check_castor_dirs (self)
 
def create_castor_path_name_common (self, dataset_name)
 
def create_castor_path_name_special (self, dataset_name, run_number, castor_path_common)
 
def create_config_file_name (self, dataset_name, run_number)
 
def create_crab_config (self)
 
def create_es_prefer_snippet (self, dataset_name)
 
def create_harvesting_config (self, dataset_name)
 
def create_harvesting_config_file_name (self, dataset_name)
 

Only add the alarming piece to the file name if this is

a spread-out dataset.

More...
 
def create_harvesting_output_file_name (self, dataset_name, run_number)
 
def create_me_extraction_config (self, dataset_name)
 

In case this file is the second step (the real harvesting

step) of the two-step harvesting we have to tell it to use

our local files.

More...
 
def create_me_summary_config_file_name (self, dataset_name)
 
def create_me_summary_output_file_name (self, dataset_name)
 
def create_multicrab_block_name (self, dataset_name, run_number, index)
 
def create_multicrab_config (self)
 

CRAB

More...
 
def create_output_file_name (self, dataset_name, run_number=None)
 
def dbs_check_dataset_spread (self, dataset_name)
 def dbs_resolve_dataset_number_of_sites(self, dataset_name): """Ask DBS across how many sites this dataset has been spread out. More...
 
def dbs_resolve_cmssw_version (self, dataset_name)
 
def dbs_resolve_dataset_name (self, dataset_name)
 
def dbs_resolve_datatype (self, dataset_name)
 
def dbs_resolve_globaltag (self, dataset_name)
 
def dbs_resolve_number_of_events (self, dataset_name, run_number=None)
 
def dbs_resolve_runs (self, dataset_name)
 def dbs_resolve_dataset_number_of_events(self, dataset_name): """Ask DBS across how many events this dataset has been spread out. More...
 
def escape_dataset_name (self, dataset_name)
 if self.datasets_information[dataset_name]["num_events"][run_number] != 0: pdb.set_trace() DEBUG DEBUG DEBUG end More...
 
def load_ref_hist_mappings (self)
 
def option_handler_caf_access (self, option, opt_str, value, parser)
 
def option_handler_castor_dir (self, option, opt_str, value, parser)
 def option_handler_dataset_name(self, option, opt_str, value, parser): """Specify the name(s) of the dataset(s) to be processed. More...
 
def option_handler_crab_submission (self, option, opt_str, value, parser)
 
def option_handler_list_types (self, option, opt_str, value, parser)
 
def option_handler_no_t1access (self, option, opt_str, value, parser)
 
def option_handler_preferred_site (self, option, opt_str, value, parser)
 
def option_handler_saveByLumiSection (self, option, opt_str, value, parser)
 
def option_handler_sites (self, option, opt_str, value, parser)
 
def parse_cmd_line_options (self)
 
def pick_a_site (self, sites, cmssw_version)
 self.logger.debug("Checking CASTOR path piece `%s'" % \ piece) More...
 
def process_dataset_ignore_list (self)
 
def process_runs_use_and_ignore_lists (self)
 
def ref_hist_mappings_needed (self, dataset_name=None)
 
def run (self)
 
def setup_dbs (self)
 

Now we try to do a very simple DBS search.

More...
 
def setup_harvesting_info (self)
 
def show_exit_message (self)
 

DEBUG DEBUG DEBUG

This is probably only useful to make sure we don't muck

things up, right?

Figure out across how many sites this sample has been spread.

More...
 
def singlify_datasets (self)
 
def write_crab_config (self)
 def create_harvesting_config(self, dataset_name): """Create the Python harvesting configuration for a given job. More...
 
def write_harvesting_config (self, dataset_name)
 
def write_me_extraction_config (self, dataset_name)
 
def write_multicrab_config (self)
 

Variables

 caf_access
 
 castor_base_dir
 
 cmssw_version
 
 crab_submission
 
 datasets_information
 
 datasets_to_ignore
 
 datasets_to_use
 
 dbs_api
 
 globaltag
 
 harvesting_info
 
 harvesting_mode
 
 harvesting_type
 
 Jsonfilename
 
 Jsonlumi
 
 non_t1access
 
 nr_max_sites
 
 option_parser
 
 preferred_site
 
 ref_hist_mappings_file_name
 
 runs_to_ignore
 
 runs_to_use
 
 saveByLumiSection
 

Function Documentation

def cmsHarvester.build_dataset_ignore_list (   self)
Build a list of datasets to ignore.

NOTE: We should always have a list of datasets to process, but
it may be that we don't have a list of datasets to ignore.

Definition at line 3442 of file cmsHarvester.py.

3443  """Build a list of datasets to ignore.
3444 
3445  NOTE: We should always have a list of datasets to process, but
3446  it may be that we don't have a list of datasets to ignore.
3447 
3448  """
3449 
3450  self.logger.info("Building list of datasets to ignore...")
3451 
3452  input_method = self.input_method["datasets"]["ignore"]
3453  input_name = self.input_name["datasets"]["ignore"]
3454  dataset_names = self.build_dataset_list(input_method,
3455  input_name)
3456  self.datasets_to_ignore = dict(list(zip(dataset_names,
3457  [None] * len(dataset_names))))
3458 
3459  self.logger.info(" found %d dataset(s) to ignore:" % \
3460  len(dataset_names))
3461  for dataset in dataset_names:
3462  self.logger.info(" `%s'" % dataset)
3463 
3464  # End of build_dataset_ignore_list.
3465 
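The listing stores the result as a dictionary mapping each dataset name to a None placeholder. A minimal, standalone sketch of what the dict(list(zip(...))) idiom produces; the dataset names below are invented for illustration:

    dataset_names = ["/Cosmics/Example-v1/RECO",
                     "/MinimumBias/Example-v1/RECO"]
    # Same construct as in the listing: map each name to a None placeholder.
    datasets_to_ignore = dict(list(zip(dataset_names,
                                       [None] * len(dataset_names))))
    # -> {'/Cosmics/Example-v1/RECO': None, '/MinimumBias/Example-v1/RECO': None}
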
def cmsHarvester.build_dataset_list (   self,
  input_method,
  input_name 
)

class Handler(xml.sax.handler.ContentHandler): def startElement(self, name, attrs): if name == "result": site_name = str(attrs["STORAGEELEMENT_SENAME"])

TODO TODO TODO

Ugly hack to get around cases like this:

$ dbs search --query="find dataset, site, file.count where dataset=/RelValQCD_Pt_3000_3500/CMSSW_3_3_0_pre1-STARTUP31X_V4-v1/GEN-SIM-RECO"

Using DBS instance at: http://cmsdbsprod.cern.ch/cms_dbs_prod_global/servlet/DBSServlet

Processing ...

\

PATH STORAGEELEMENT_SENAME COUNT_FILES

_________________________________________________________________________________

/RelValQCD_Pt_3000_3500/CMSSW_3_3_0_pre1-STARTUP31X_V4-v1/GEN-SIM-RECO 1

/RelValQCD_Pt_3000_3500/CMSSW_3_3_0_pre1-STARTUP31X_V4-v1/GEN-SIM-RECO cmssrm.fnal.gov 12

/RelValQCD_Pt_3000_3500/CMSSW_3_3_0_pre1-STARTUP31X_V4-v1/GEN-SIM-RECO srm-cms.cern.ch 12

if len(site_name) < 1: return

TODO TODO TODO end

run_number = int(attrs["RUNS_RUNNUMBER"]) file_name = str(attrs["FILES_LOGICALFILENAME"]) nevents = int(attrs["FILES_NUMBEROFEVENTS"])

I know, this is a bit of a kludge.

if not files_info.has_key(run_number):

New run.

files_info[run_number] = {} files_info[run_number][file_name] = (nevents, [site_name]) elif not files_info[run_number].has_key(file_name):

New file for a known run.

files_info[run_number][file_name] = (nevents, [site_name]) else:

New entry for a known file for a known run.

DEBUG DEBUG DEBUG

Each file should have the same number of

events independent of the site it's at.

assert nevents == files_info[run_number][file_name][0]

DEBUG DEBUG DEBUG end

files_info[run_number][file_name][1].append(site_name) OBSOLETE OBSOLETE OBSOLETE end site_names_ref = set(files_info[run_number].values()[0][1]) for site_names_tmp in files_info[run_number].values()[1:]: if set(site_names_tmp[1]) != site_names_ref: mirrored = False break def dbs_check_dataset_num_events(self, dataset_name): """Figure out the number of events in each run of this dataset. This is a more efficient way of doing this than calling dbs_resolve_number_of_events for each run. # BUG BUG BUG

This might very well not work at all for spread-out samples. (?)

BUG BUG BUG end

""" # DEBUG DEBUG DEBUG

If we get here DBS should have been set up already.

assert not self.dbs_api is None

DEBUG DEBUG DEBUG end

api = self.dbs_api dbs_query = "find run.number, file.name, file.numevents where dataset = %s " \ "and dataset.status = VALID" % \ dataset_name try: api_result = api.executeQuery(dbs_query) except DbsApiException: msg = "ERROR: Could not execute DBS query" self.logger.fatal(msg) raise Error(msg) try: files_info = {} class Handler(xml.sax.handler.ContentHandler): def startElement(self, name, attrs): if name == "result": run_number = int(attrs["RUNS_RUNNUMBER"]) file_name = str(attrs["FILES_LOGICALFILENAME"]) nevents = int(attrs["FILES_NUMBEROFEVENTS"]) try: files_info[run_number][file_name] = nevents except KeyError: files_info[run_number] = {file_name: nevents} xml.sax.parseString(api_result, Handler()) except SAXParseException: msg = "ERROR: Could not parse DBS server output" self.logger.fatal(msg) raise Error(msg) num_events_catalog = {} for run_number in files_info.keys(): num_events_catalog[run_number] = sum(files_info[run_number].values()) # End of dbs_check_dataset_num_events. return num_events_catalog End of old version.

Build a list of all datasets to be processed.

Definition at line 3356 of file cmsHarvester.py.

References dbs_resolve_dataset_name().

3356  def build_dataset_list(self, input_method, input_name):
3357  """Build a list of all datasets to be processed.
3358 
3359  """
3360 
3361  dataset_names = []
3362 
3363  # It may be, but only for the list of datasets to ignore, that
3364  # the input method and name are None because nothing was
3365  # specified. In that case just an empty list is returned.
3366  if input_method is None:
3367  pass
3368  elif input_method == "dataset":
3369  # Input comes from a dataset name directly on the command
3370  # line. But, this can also contain wildcards so we need
3371  # DBS to translate it conclusively into a list of explicit
3372  # dataset names.
3373  self.logger.info("Asking DBS for dataset names")
3374  dataset_names = self.dbs_resolve_dataset_name(input_name)
3375  elif input_method == "datasetfile":
3376  # In this case a file containing a list of dataset names
3377  # is specified. Still, each line may contain wildcards so
3378  # this step also needs help from DBS.
3379  # NOTE: Lines starting with a `#' are ignored.
3380  self.logger.info("Reading input from list file `%s'" % \
3381  input_name)
3382  try:
3383  listfile = open("/afs/cern.ch/cms/CAF/CMSCOMM/COMM_DQM/harvesting/bin/%s" %input_name, "r")
3384  print "open listfile"
3385  for dataset in listfile:
3386  # Skip empty lines.
3387  dataset_stripped = dataset.strip()
3388  if len(dataset_stripped) < 1:
3389  continue
3390  # Skip lines starting with a `#'.
3391  if dataset_stripped[0] != "#":
3392  dataset_names.extend(self. \
3393  dbs_resolve_dataset_name(dataset_stripped))
3394  listfile.close()
3395  except IOError:
3396  msg = "ERROR: Could not open input list file `%s'" % \
3397  input_name
3398  self.logger.fatal(msg)
3399  raise Error(msg)
3400  else:
3401  # DEBUG DEBUG DEBUG
3402  # We should never get here.
3403  assert False, "Unknown input method `%s'" % input_method
3404  # DEBUG DEBUG DEBUG end
3405 
3406  # Remove duplicates from the dataset list.
3407  # NOTE: There should not be any duplicates in any list coming
3408  # from DBS, but maybe the user provided a list file with less
3409  # care.
3410  # Store for later use.
3411  dataset_names = sorted(set(dataset_names))
3412 
3413 
3414  # End of build_dataset_list.
3415  return dataset_names
3416 
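For the `datasetfile' input method the expected layout follows from the listing: one dataset name (or wildcard pattern, resolved through DBS) per line, blank lines skipped, and lines starting with `#' treated as comments. An illustrative list file (the second entry is a made-up wildcard pattern):

    # Datasets to be harvested; wildcards are resolved through DBS.
    /RelValQCD_Pt_3000_3500/CMSSW_3_3_0_pre1-STARTUP31X_V4-v1/GEN-SIM-RECO
    /RelValTTbar/*/GEN-SIM-RECO

    # Blank lines and comment lines like this one are ignored.
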
def cmsHarvester.build_dataset_use_list (   self)
Build a list of datasets to process.

Definition at line 3419 of file cmsHarvester.py.

3420  """Build a list of datasets to process.
3421 
3422  """
3423 
3424  self.logger.info("Building list of datasets to consider...")
3425 
3426  input_method = self.input_method["datasets"]["use"]
3427  input_name = self.input_name["datasets"]["use"]
3428  dataset_names = self.build_dataset_list(input_method,
3429  input_name)
3430  self.datasets_to_use = dict(list(zip(dataset_names,
3431  [None] * len(dataset_names))))
3432 
3433  self.logger.info(" found %d dataset(s) to process:" % \
3434  len(dataset_names))
3435  for dataset in dataset_names:
3436  self.logger.info(" `%s'" % dataset)
3437 
3438  # End of build_dataset_use_list.
3439 
def cmsHarvester.build_datasets_information (   self)
Obtain all information on the datasets that we need to run.

Use DBS to figure out all required information on our
datasets, like the run numbers and the GlobalTag. All
information is stored in the datasets_information member
variable.

Definition at line 5319 of file cmsHarvester.py.

5320  """Obtain all information on the datasets that we need to run.
5321 
5322  Use DBS to figure out all required information on our
5323  datasets, like the run numbers and the GlobalTag. All
5324  information is stored in the datasets_information member
5325  variable.
5326 
5327  """
5328 
5329  # Get a list of runs in the dataset.
5330  # NOTE: The harvesting has to be done run-by-run, so we
5331  # split up datasets based on the run numbers. Strictly
5332  # speaking this is not (yet?) necessary for Monte Carlo
5333  # since all those samples use run number 1. Still, this
5334  # general approach should work for all samples.
5335 
5336  # Now loop over all datasets in the list and process them.
5337  # NOTE: This processing has been split into several loops
5338  # to be easier to follow, sacrificing a bit of efficiency.
5339  self.datasets_information = {}
5340  self.logger.info("Collecting information for all datasets to process")
5341  dataset_names = sorted(self.datasets_to_use.keys())
5342  for dataset_name in dataset_names:
5343 
5344  # Tell the user which dataset: nice with many datasets.
5345  sep_line = "-" * 30
5346  self.logger.info(sep_line)
5347  self.logger.info(" `%s'" % dataset_name)
5348  self.logger.info(sep_line)
5349 
5350  runs = self.dbs_resolve_runs(dataset_name)
5351  self.logger.info(" found %d run(s)" % len(runs))
5352  if len(runs) > 0:
5353  self.logger.debug(" run number(s): %s" % \
5354  ", ".join([str(i) for i in runs]))
5355  else:
5356  # DEBUG DEBUG DEBUG
5357  # This should never happen after the DBS checks.
5358  self.logger.warning(" --> skipping dataset "
5359  "without any runs")
5360  assert False, "Panic: found a dataset without runs " \
5361  "after DBS checks!"
5362  # DEBUG DEBUG DEBUG end
5363 
5364  cmssw_version = self.dbs_resolve_cmssw_version(dataset_name)
5365  self.logger.info(" found CMSSW version `%s'" % cmssw_version)
5366 
5367  # Figure out if this is data or MC.
5368  datatype = self.dbs_resolve_datatype(dataset_name)
5369  self.logger.info(" sample is data or MC? --> %s" % \
5370  datatype)
5371 
5372  ###
5373 
5374  # Try and figure out the GlobalTag to be used.
5375  if self.globaltag is None:
5376  globaltag = self.dbs_resolve_globaltag(dataset_name)
5377  else:
5378  globaltag = self.globaltag
5379 
5380  self.logger.info(" found GlobalTag `%s'" % globaltag)
5381 
5382  # DEBUG DEBUG DEBUG
5383  if globaltag == "":
5384  # Actually we should not even reach this point, after
5385  # our dataset sanity checks.
5386  assert datatype == "data", \
5387  "ERROR Empty GlobalTag for MC dataset!!!"
5388  # DEBUG DEBUG DEBUG end
5389 
5390  ###
5391 
5392  # DEBUG DEBUG DEBUG
5393  #tmp = self.dbs_check_dataset_spread_old(dataset_name)
5394  # DEBUG DEBUG DEBUG end
5395  sites_catalog = self.dbs_check_dataset_spread(dataset_name)
5396 
5397  # Extract the total event counts.
5398  num_events = {}
5399  for run_number in sites_catalog.keys():
5400  num_events[run_number] = sites_catalog \
5401  [run_number]["all_sites"]
5402  del sites_catalog[run_number]["all_sites"]
5403 
5404  # Extract the information about whether or not datasets
5405  # are mirrored.
5406  mirror_catalog = {}
5407  for run_number in sites_catalog.keys():
5408  mirror_catalog[run_number] = sites_catalog \
5409  [run_number]["mirrored"]
5410  del sites_catalog[run_number]["mirrored"]
5411 
5412  # BUG BUG BUG
5413  # I think I could now get rid of that and just fill the
5414  # "sites" entry with the `inverse' of this
5415  # num_events_catalog(?).
5416  #num_sites = self.dbs_resolve_dataset_number_of_sites(dataset_name)
5417  #sites_catalog = self.dbs_check_dataset_spread(dataset_name)
5418  #sites_catalog = dict(zip(num_events_catalog.keys(),
5419  # [[j for i in num_events_catalog.values() for j in i.keys()]]))
5420  # BUG BUG BUG end
5421 
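The listing above is truncated, but the information collected per dataset ends up in the datasets_information member variable, keyed by dataset name. An illustrative (and not necessarily complete) sketch of the layout, based on the fields accessed elsewhere on this page; the event count and site names are invented and some key names are assumptions:

    datasets_information = {
        "/RelValQCD_Pt_3000_3500/CMSSW_3_3_0_pre1-STARTUP31X_V4-v1/GEN-SIM-RECO": {
            "cmssw_version": "CMSSW_3_3_0_pre1",
            "datatype": "mc",                 # "data" or "mc"
            "globaltag": "STARTUP31X_V4::All",
            "runs": [1],                      # harvesting is done run-by-run
            "num_events": {1: 25000},         # per-run event count (illustrative)
            "sites": {1: ["srm-cms.cern.ch", "cmssrm.fnal.gov"]},
            "mirrored": {1: True},            # run fully available at more than one site?
            "castor_path": {1: "/castor/cern.ch/..."},  # filled in later
        },
    }
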
def cmsHarvester.build_runs_ignore_list (   self)
Build a list of runs to ignore.

NOTE: We should always have a list of runs to process, but
it may be that we don't have a list of runs to ignore.

Definition at line 3540 of file cmsHarvester.py.

3541  """Build a list of runs to ignore.
3542 
3543  NOTE: We should always have a list of runs to process, but
3544  it may be that we don't have a list of runs to ignore.
3545 
3546  """
3547 
3548  self.logger.info("Building list of runs to ignore...")
3549 
3550  input_method = self.input_method["runs"]["ignore"]
3551  input_name = self.input_name["runs"]["ignore"]
3552  runs = self.build_runs_list(input_method, input_name)
3553  self.runs_to_ignore = dict(list(zip(runs, [None] * len(runs))))
3554 
3555  self.logger.info(" found %d run(s) to ignore:" % \
3556  len(runs))
3557  if len(runs) > 0:
3558  self.logger.info(" %s" % ", ".join([str(i) for i in runs]))
3559 
3560  # End of build_runs_ignore_list().
3561 
def cmsHarvester.build_runs_list (   self,
  input_method,
  input_name 
)

Definition at line 3468 of file cmsHarvester.py.

References createfilelist.int, and list().

3468  def build_runs_list(self, input_method, input_name):
3469 
3470  runs = []
3471 
3472  # A list of runs (either to use or to ignore) is not
3473  # required. This protects against `empty cases.'
3474  if input_method is None:
3475  pass
3476  elif input_method == "runs":
3477  # A list of runs was specified directly from the command
3478  # line.
3479  self.logger.info("Reading list of runs from the " \
3480  "command line")
3481  runs.extend([int(i.strip()) \
3482  for i in input_name.split(",") \
3483  if len(i.strip()) > 0])
3484  elif input_method == "runslistfile":
3485  # We were passed a file containing a list of runs.
3486  self.logger.info("Reading list of runs from file `%s'" % \
3487  input_name)
3488  try:
3489  listfile = open(input_name, "r")
3490  for run in listfile:
3491  # Skip empty lines.
3492  run_stripped = run.strip()
3493  if len(run_stripped) < 1:
3494  continue
3495  # Skip lines starting with a `#'.
3496  if run_stripped[0] != "#":
3497  runs.append(int(run_stripped))
3498  listfile.close()
3499  except IOError:
3500  msg = "ERROR: Could not open input list file `%s'" % \
3501  input_name
3502  self.logger.fatal(msg)
3503  raise Error(msg)
3504 
3505  else:
3506  # DEBUG DEBUG DEBUG
3507  # We should never get here.
3508  assert False, "Unknown input method `%s'" % input_method
3509  # DEBUG DEBUG DEBUG end
3510 
3511  # Remove duplicates, sort and done.
3512  runs = list(set(runs))
3513 
3514  # End of build_runs_list().
3515  return runs
3516 
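Two input methods are handled by the listing: a comma-separated list of run numbers straight from the command line, and a file with one run number per line (blank lines and `#' comments ignored). A minimal sketch of the command-line parsing, with made-up run numbers:

    input_name = "123456, 123789,124120"
    runs = [int(i.strip())
            for i in input_name.split(",")
            if len(i.strip()) > 0]
    # -> [123456, 123789, 124120]
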
def cmsHarvester.build_runs_use_list (   self)
Build a list of runs to process.

Definition at line 3519 of file cmsHarvester.py.

3520  """Build a list of runs to process.
3521 
3522  """
3523 
3524  self.logger.info("Building list of runs to consider...")
3525 
3526  input_method = self.input_method["runs"]["use"]
3527  input_name = self.input_name["runs"]["use"]
3528  runs = self.build_runs_list(input_method, input_name)
3529  self.runs_to_use = dict(list(zip(runs, [None] * len(runs))))
3530 
3531  self.logger.info(" found %d run(s) to process:" % \
3532  len(runs))
3533  if len(runs) > 0:
3534  self.logger.info(" %s" % ", ".join([str(i) for i in runs]))
3535 
3536  # End of build_runs_list().
3537 
def cmsHarvester.check_cmssw (   self)
Check if CMSSW is setup.

Definition at line 2332 of file cmsHarvester.py.

2332  def check_cmssw(self):
2333  """Check if CMSSW is setup.
2334 
2335  """
2336 
2337  # Try to access the CMSSW_VERSION environment variable. If
2338  # it's something useful we consider CMSSW to be set up
2339  # properly. Otherwise we raise an error.
2340  cmssw_version = os.getenv("CMSSW_VERSION")
2341  if cmssw_version is None:
2342  self.logger.fatal("It seems CMSSW is not setup...")
2343  self.logger.fatal("($CMSSW_VERSION is empty)")
2344  raise Error("ERROR: CMSSW needs to be setup first!")
2345 
2346  self.cmssw_version = cmssw_version
2347  self.logger.info("Found CMSSW version %s properly set up" % \
2348  self.cmssw_version)
2349 
2350  # End of check_cmsssw.
2351  return True
2352 
def cmsHarvester.check_dataset_list (   self)
Check list of dataset names for impossible ones.

Two kinds of checks are done:
- Checks for things that do not make sense. These lead to
  errors and skipped datasets.
- Sanity checks. For these warnings are issued but the user is
  considered to be the authoritative expert.

Checks performed:
- The CMSSW version encoded in the dataset name should match
  self.cmssw_version. This is critical.
- There should be some events in the dataset/run. This is
  critical in the sense that CRAB refuses to create jobs for
  zero events. And yes, this does happen in practice. E.g. the
  reprocessed CRAFT08 datasets contain runs with zero events.
- A cursory check is performed to see if the harvesting type
  makes sense for the data type. This should prevent the user
  from inadvertently running RelVal for data.
- It is not possible to run single-step harvesting jobs on
  samples that are not fully contained at a single site.
- Each dataset/run has to be available at at least one site.

Definition at line 3793 of file cmsHarvester.py.

3794  """Check list of dataset names for impossible ones.
3795 
3796  Two kinds of checks are done:
3797  - Checks for things that do not make sense. These lead to
3798  errors and skipped datasets.
3799  - Sanity checks. For these warnings are issued but the user is
3800  considered to be the authoritative expert.
3801 
3802  Checks performed:
3803  - The CMSSW version encoded in the dataset name should match
3804  self.cmssw_version. This is critical.
3805  - There should be some events in the dataset/run. This is
3806  critical in the sense that CRAB refuses to create jobs for
3807  zero events. And yes, this does happen in practice. E.g. the
3808  reprocessed CRAFT08 datasets contain runs with zero events.
3809  - A cursory check is performed to see if the harvesting type
3810  makes sense for the data type. This should prevent the user
3811  from inadvertently running RelVal for data.
3812  - It is not possible to run single-step harvesting jobs on
3813  samples that are not fully contained at a single site.
3814  - Each dataset/run has to be available at at least one site.
3815 
3816  """
3817 
3818  self.logger.info("Performing sanity checks on dataset list...")
3819 
3820  dataset_names_after_checks = copy.deepcopy(self.datasets_to_use)
3821 
3822  for dataset_name in self.datasets_to_use.keys():
3823 
3824  # Check CMSSW version.
3825  version_from_dataset = self.datasets_information[dataset_name] \
3826  ["cmssw_version"]
3827  if version_from_dataset != self.cmssw_version:
3828  msg = " CMSSW version mismatch for dataset `%s' " \
3829  "(%s vs. %s)" % \
3830  (dataset_name,
3831  self.cmssw_version, version_from_dataset)
3832  if self.force_running:
3833  # Expert mode: just warn, then continue.
3834  self.logger.warning("%s " \
3835  "--> `force mode' active: " \
3836  "run anyway" % msg)
3837  else:
3838  del dataset_names_after_checks[dataset_name]
3839  self.logger.warning("%s " \
3840  "--> skipping" % msg)
3841  continue
3842 
3843  ###
3844 
3845  # Check that the harvesting type makes sense for the
3846  # sample. E.g. normally one would not run the DQMOffline
3847  # harvesting on Monte Carlo.
3848  # TODO TODO TODO
3849  # This should be further refined.
3850  suspicious = False
3851  datatype = self.datasets_information[dataset_name]["datatype"]
3852  if datatype == "data":
3853  # Normally only DQM harvesting is run on data.
3854  if self.harvesting_type != "DQMOffline":
3855  suspicious = True
3856  elif datatype == "mc":
3857  if self.harvesting_type == "DQMOffline":
3858  suspicious = True
3859  else:
3860  # Doh!
3861  assert False, "ERROR Impossible data type `%s' " \
3862  "for dataset `%s'" % \
3863  (datatype, dataset_name)
3864  if suspicious:
3865  msg = " Normally one does not run `%s' harvesting " \
3866  "on %s samples, are you sure?" % \
3867  (self.harvesting_type, datatype)
3868  if self.force_running:
3869  self.logger.warning("%s " \
3870  "--> `force mode' active: " \
3871  "run anyway" % msg)
3872  else:
3873  del dataset_names_after_checks[dataset_name]
3874  self.logger.warning("%s " \
3875  "--> skipping" % msg)
3876  continue
3877 
3878  # TODO TODO TODO end
3879 
3880  ###
3881 
3882  # BUG BUG BUG
3883  # For the moment, due to a problem with DBS, I cannot
3884  # figure out the GlobalTag for data by myself. (For MC
3885  # it's no problem.) This means that unless a GlobalTag was
3886  # specified from the command line, we will have to skip
3887  # any data datasets.
3888 
3889  if datatype == "data":
3890  if self.globaltag is None:
3891  msg = "For data datasets (like `%s') " \
3892  "we need a GlobalTag" % \
3893  dataset_name
3894  del dataset_names_after_checks[dataset_name]
3895  self.logger.warning("%s " \
3896  "--> skipping" % msg)
3897  continue
3898 
3899  # BUG BUG BUG end
3900 
3901  ###
3902 
3903  # Check if the GlobalTag exists and (if we're using
3904  # reference histograms) if it's ready to be used with
3905  # reference histograms.
3906  globaltag = self.datasets_information[dataset_name]["globaltag"]
3907  if not globaltag in self.globaltag_check_cache:
3908  if self.check_globaltag(globaltag):
3909  self.globaltag_check_cache.append(globaltag)
3910  else:
3911  msg = "Something is wrong with GlobalTag `%s' " \
3912  "used by dataset `%s'!" % \
3913  (globaltag, dataset_name)
3914  if self.use_ref_hists:
3915  msg += "\n(Either it does not exist or it " \
3916  "does not contain the required key to " \
3917  "be used with reference histograms.)"
3918  else:
3919  msg += "\n(It probably just does not exist.)"
3920  self.logger.fatal(msg)
3921  raise Usage(msg)
3922 
3923  ###
3924 
3925  # Require that each run is available at least somewhere.
3926  runs_without_sites = [i for (i, j) in \
3927  self.datasets_information[dataset_name] \
3928  ["sites"].items() \
3929  if len(j) < 1 and \
3930  i in self.datasets_to_use[dataset_name]]
3931  if len(runs_without_sites) > 0:
3932  for run_without_sites in runs_without_sites:
3933  try:
3934  dataset_names_after_checks[dataset_name].remove(run_without_sites)
3935  except KeyError:
3936  pass
3937  self.logger.warning(" removed %d unavailable run(s) " \
3938  "from dataset `%s'" % \
3939  (len(runs_without_sites), dataset_name))
3940  self.logger.debug(" (%s)" % \
3941  ", ".join([str(i) for i in \
3942  runs_without_sites]))
3943 
3944  ###
3945 
3946  # Unless we're running two-step harvesting: only allow
3947  # samples located on a single site.
3948  if not self.harvesting_mode == "two-step":
3949  for run_number in self.datasets_to_use[dataset_name]:
3950  # DEBUG DEBUG DEBUG
def cmsHarvester.check_dbs (   self)
Check if DBS is setup.

Definition at line 2355 of file cmsHarvester.py.

2355  def check_dbs(self):
2356  """Check if DBS is setup.
2357 
2358  """
2359 
2360  # Try to access the DBSCMD_HOME environment variable. If this
2361  # looks useful we consider DBS to be set up
2362  # properly. Otherwise we raise an error.
2363  dbs_home = os.getenv("DBSCMD_HOME")
2364  if dbs_home is None:
2365  self.logger.fatal("It seems DBS is not setup...")
2366  self.logger.fatal(" $DBSCMD_HOME is empty")
2367  raise Error("ERROR: DBS needs to be setup first!")
2368 
def cmsHarvester.check_globaltag (   self,
  globaltag = None 
)

CRAB

GRID

USER

CMSSW

CAF

Check if globaltag exists.

Check if globaltag exists as GlobalTag in the database given
by self.frontier_connection_name['globaltag']. If globaltag is
None, self.globaltag is used instead.

If we're going to use reference histograms this method also
checks for the existence of the required key in the GlobalTag.

Definition at line 4499 of file cmsHarvester.py.

4499  def check_globaltag(self, globaltag=None):
4500  """Check if globaltag exists.
4501 
4502  Check if globaltag exists as GlobalTag in the database given
4503  by self.frontier_connection_name['globaltag']. If globaltag is
4504  None, self.globaltag is used instead.
4505 
4506  If we're going to use reference histograms this method also
4507  checks for the existence of the required key in the GlobalTag.
4508 
4509  """
4510 
4511  if globaltag is None:
4512  globaltag = self.globaltag
4513 
4514  # All GlobalTags should end in `::All', right?
4515  if globaltag.endswith("::All"):
4516  globaltag = globaltag[:-5]
4517 
4518  connect_name = self.frontier_connection_name["globaltag"]
4519  # BUG BUG BUG
4520  # There is a bug in cmscond_tagtree_list: some magic is
4521  # missing from the implementation requiring one to specify
4522  # explicitly the name of the squid to connect to. Since the
4523  # cmsHarvester can only be run from the CERN network anyway,
4524  # cmsfrontier:8000 is hard-coded in here. Not nice but it
4525  # works.
4526  connect_name = connect_name.replace("frontier://",
4527  "frontier://cmsfrontier:8000/")
4528  # BUG BUG BUG end
4529  connect_name += self.db_account_name_cms_cond_globaltag()
4530 
4531  tag_exists = self.check_globaltag_exists(globaltag, connect_name)
4532 
4533  #----------
4534 
4535  tag_contains_ref_hist_key = False
4536  if self.use_ref_hists and tag_exists:
4537  # Check for the key required to use reference histograms.
4538  tag_contains_ref_hist_key = self.check_globaltag_contains_ref_hist_key(globaltag, connect_name)
4539 
4540  #----------
4541 
4542  if self.use_ref_hists:
4543  ret_val = tag_exists and tag_contains_ref_hist_key
4544  else:
4545  ret_val = tag_exists
4546 
4547  #----------
4548 
4549  # End of check_globaltag.
4550  return ret_val
4551 
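Note the `::All' handling: check_input_status() appends the suffix when the user omits it, while this method strips it again before querying the database. A small sketch of that round trip (the GlobalTag name is only an example):

    globaltag = "GR09_P_V1"
    # check_input_status(): make sure the tag ends in `::All'.
    if not globaltag.endswith("::All"):
        globaltag = "%s::All" % globaltag    # -> "GR09_P_V1::All"
    # check_globaltag(): strip the suffix again for the database lookup.
    if globaltag.endswith("::All"):
        globaltag = globaltag[:-5]           # -> "GR09_P_V1"
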
def cmsHarvester.check_globaltag_contains_ref_hist_key (   self,
  globaltag,
  connect_name 
)
Check if globaltag contains the required RefHistos key.

Definition at line 4596 of file cmsHarvester.py.

4596  def check_globaltag_contains_ref_hist_key(self, globaltag, connect_name):
4597  """Check if globaltag contains the required RefHistos key.
4598 
4599  """
4600 
4601  # Check for the key required to use reference histograms.
4602  tag_contains_key = None
4603  ref_hist_key = "RefHistos"
4604  self.logger.info("Checking existence of reference " \
4605  "histogram key `%s' in GlobalTag `%s'" % \
4606  (ref_hist_key, globaltag))
4607  self.logger.debug(" (Using database connection `%s')" % \
4608  connect_name)
4609  cmd = "cmscond_tagtree_list -c %s -T %s -n %s" % \
4610  (connect_name, globaltag, ref_hist_key)
4611  (status, output) = commands.getstatusoutput(cmd)
4612  if status != 0 or \
4613  output.find("error") > -1:
4614  msg = "Could not check existence of key `%s'" % \
4615  (ref_hist_key, connect_name)
4616  self.logger.fatal(msg)
4617  self.logger.debug("Command used:")
4618  self.logger.debug(" %s" % cmd)
4619  self.logger.debug("Output received:")
4620  self.logger.debug(" %s" % output)
4621  raise Error(msg)
4622  if len(output) < 1:
4623  self.logger.debug("Required key for use of reference " \
4624  "histograms `%s' does not exist " \
4625  "in GlobalTag `%s':" % \
4626  (ref_hist_key, globaltag))
4627  self.logger.debug("Output received:")
4628  self.logger.debug(output)
4629  tag_contains_key = False
4630  else:
4631  tag_contains_key = True
4632 
4633  self.logger.info(" GlobalTag contains `%s' key? -> %s" % \
4634  (ref_hist_key, tag_contains_key))
4635 
4636  # End of check_globaltag_contains_ref_hist_key.
4637  return tag_contains_key
4638 
def cmsHarvester.check_globaltag_exists (   self,
  globaltag,
  connect_name 
)
Check if globaltag exists.

Definition at line 4554 of file cmsHarvester.py.

References split.

4554  def check_globaltag_exists(self, globaltag, connect_name):
4555  """Check if globaltag exists.
4556 
4557  """
4558 
4559  self.logger.info("Checking existence of GlobalTag `%s'" % \
4560  globaltag)
4561  self.logger.debug(" (Using database connection `%s')" % \
4562  connect_name)
4563 
4564  cmd = "cmscond_tagtree_list -c %s -T %s" % \
4565  (connect_name, globaltag)
4566  (status, output) = commands.getstatusoutput(cmd)
4567  if status != 0 or \
4568  output.find("error") > -1:
4569  msg = "Could not check existence of GlobalTag `%s' in `%s'" % \
4570  (globaltag, connect_name)
4571  if output.find(".ALL_TABLES not found") > -1:
4572  msg = "%s\n" \
4573  "Missing database account `%s'" % \
4574  (msg, output.split(".ALL_TABLES")[0].split()[-1])
4575  self.logger.fatal(msg)
4576  self.logger.debug("Command used:")
4577  self.logger.debug(" %s" % cmd)
4578  self.logger.debug("Output received:")
4579  self.logger.debug(output)
4580  raise Error(msg)
4581  if output.find("does not exist") > -1:
4582  self.logger.debug("GlobalTag `%s' does not exist in `%s':" % \
4583  (globaltag, connect_name))
4584  self.logger.debug("Output received:")
4585  self.logger.debug(output)
4586  tag_exists = False
4587  else:
4588  tag_exists = True
4589  self.logger.info(" GlobalTag exists? -> %s" % tag_exists)
4590 
4591  # End of check_globaltag_exists.
4592  return tag_exists
4593 
def cmsHarvester.check_input_status (   self)
Check completeness and correctness of input information.

Check that all required information has been specified and
that, at least as far as can be easily checked, it makes
sense.

NOTE: This is also where any default values are applied.

Definition at line 2191 of file cmsHarvester.py.

References join().

2192  """Check completeness and correctness of input information.
2193 
2194  Check that all required information has been specified and
2195  that, at least as far as can be easily checked, it makes
2196  sense.
2197 
2198  NOTE: This is also where any default values are applied.
2199 
2200  """
2201 
2202  self.logger.info("Checking completeness/correctness of input...")
2203 
2204  # The cmsHarvester does not take (i.e. understand) any
2205  # arguments so there should not be any.
2206  if len(self.args) > 0:
2207  msg = "Sorry but I don't understand `%s'" % \
2208  (" ".join(self.args))
2209  self.logger.fatal(msg)
2210  raise Usage(msg)
2211 
2212  # BUG BUG BUG
2213  # While we wait for some bugs left and right to get fixed, we
2214  # disable two-step.
2215  if self.harvesting_mode == "two-step":
2216  msg = "--------------------\n" \
2217  " Sorry, but for the moment (well, till it works)" \
2218  " the two-step mode has been disabled.\n" \
2219  "--------------------\n"
2220  self.logger.fatal(msg)
2221  raise Error(msg)
2222  # BUG BUG BUG end
2223 
2224  # We need a harvesting method to be specified
2225  if self.harvesting_type is None:
2226  msg = "Please specify a harvesting type"
2227  self.logger.fatal(msg)
2228  raise Usage(msg)
2229  # as well as a harvesting mode.
2230  if self.harvesting_mode is None:
2231  self.harvesting_mode = self.harvesting_mode_default
2232  msg = "No harvesting mode specified --> using default `%s'" % \
2233  self.harvesting_mode
2234  self.logger.warning(msg)
2235  #raise Usage(msg)
2236 
2237  ###
2238 
2239  # We need an input method so we can find the dataset name(s).
2240  if self.input_method["datasets"]["use"] is None:
2241  msg = "Please specify an input dataset name " \
2242  "or a list file name"
2243  self.logger.fatal(msg)
2244  raise Usage(msg)
2245 
2246  # DEBUG DEBUG DEBUG
2247  # If we get here, we should also have an input name.
2248  assert not self.input_name["datasets"]["use"] is None
2249  # DEBUG DEBUG DEBUG end
2250 
2251  ###
2252 
2253  # The same holds for the reference histogram mapping file (if
2254  # we're using references).
2255  if self.use_ref_hists:
2256  if self.ref_hist_mappings_file_name is None:
2257  self.ref_hist_mappings_file_name = self.ref_hist_mappings_file_name_default
2258  msg = "No reference histogram mapping file specified --> " \
2259  "using default `%s'" % \
2260  self.ref_hist_mappings_file_name
2261  self.logger.warning(msg)
2262 
2263  ###
2264 
2265  # We need to know where to put the stuff (okay, the results)
2266  # on CASTOR.
2267  if self.castor_base_dir is None:
2268  self.castor_base_dir = self.castor_base_dir_default
2269  msg = "No CASTOR area specified -> using default `%s'" % \
2270  self.castor_base_dir
2271  self.logger.warning(msg)
2272  #raise Usage(msg)
2273 
2274  # Only the CERN CASTOR area is supported.
2275  if not self.castor_base_dir.startswith(self.castor_prefix):
2276  msg = "CASTOR area does not start with `%s'" % \
2277  self.castor_prefix
2278  self.logger.fatal(msg)
2279  if self.castor_base_dir.find("castor") > -1 and \
2280  not self.castor_base_dir.find("cern.ch") > -1:
2281  self.logger.fatal("Only CERN CASTOR is supported")
2282  raise Usage(msg)
2283 
2284  ###
2285 
2286  # TODO TODO TODO
2287  # This should be removed in the future, once I find out how to
2288  # get the config file used to create a given dataset from DBS.
2289 
2290  # For data we need to have a GlobalTag. (For MC we can figure
2291  # it out by ourselves.)
2292  if self.globaltag is None:
2293  self.logger.warning("No GlobalTag specified. This means I cannot")
2294  self.logger.warning("run on data, only on MC.")
2295  self.logger.warning("I will skip all data datasets.")
2296 
2297  # TODO TODO TODO end
2298 
2299  # Make sure the GlobalTag ends with `::All'.
2300  if not self.globaltag is None:
2301  if not self.globaltag.endswith("::All"):
2302  self.logger.warning("Specified GlobalTag `%s' does " \
2303  "not end in `::All' --> " \
2304  "appending this missing piece" % \
2305  self.globaltag)
2306  self.globaltag = "%s::All" % self.globaltag
2307 
2308  ###
2309 
2310  # Dump some info about the Frontier connections used.
2311  for (key, value) in six.iteritems(self.frontier_connection_name):
2312  frontier_type_str = "unknown"
2313  if key == "globaltag":
2314  frontier_type_str = "the GlobalTag"
2315  elif key == "refhists":
2316  frontier_type_str = "the reference histograms"
2317  non_str = None
2318  if self.frontier_connection_overridden[key] == True:
2319  non_str = "non-"
2320  else:
2321  non_str = ""
2322  self.logger.info("Using %sdefault Frontier " \
2323  "connection for %s: `%s'" % \
2324  (non_str, frontier_type_str, value))
2325 
2326  ###
2327 
2328  # End of check_input_status.
2329 
def cmsHarvester.check_ref_hist_mappings (   self)
Make sure all necessary reference histograms exist.

Check that for each of the datasets to be processed a
reference histogram is specified and that that histogram
exists in the database.

NOTE: There's a little complication here. Since this whole
thing was designed to allow (in principle) harvesting of both
data and MC datasets in one go, we need to be careful to check
the availability of reference mappings only for those
datasets that need it.

Definition at line 5279 of file cmsHarvester.py.

5280  """Make sure all necessary reference histograms exist.
5281 
5282  Check that for each of the datasets to be processed a
5283  reference histogram is specified and that that histogram
5284  exists in the database.
5285 
5286  NOTE: There's a little complication here. Since this whole
5287  thing was designed to allow (in principle) harvesting of both
5288  data and MC datasets in one go, we need to be careful to check
5289  the availability fof reference mappings only for those
5290  datasets that need it.
5291 
5292  """
5293 
5294  self.logger.info("Checking reference histogram mappings")
5295 
5296  for dataset_name in self.datasets_to_use:
5297  try:
5298  ref_hist_name = self.ref_hist_mappings[dataset_name]
5299  except KeyError:
5300  msg = "ERROR: No reference histogram mapping found " \
5301  "for dataset `%s'" % \
5302  dataset_name
5303  self.logger.fatal(msg)
5304  raise Error(msg)
5305 
5306  if not self.check_ref_hist_tag(ref_hist_name):
5307  msg = "Reference histogram tag `%s' " \
5308  "(used for dataset `%s') does not exist!" % \
5309  (ref_hist_name, dataset_name)
5310  self.logger.fatal(msg)
5311  raise Usage(msg)
5312 
5313  self.logger.info(" Done checking reference histogram mappings.")
5314 
5315  # End of check_ref_hist_mappings.
5316 
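The mapping consulted here is the dictionary filled by load_ref_hist_mappings(): dataset name to reference histogram tag. An illustrative sketch (both the dataset name and the tag name are invented):

    ref_hist_mappings = {
        "/Cosmics/Example-v1/RECO": "RefHistTag_Cosmics_Example",
    }
    ref_hist_name = ref_hist_mappings["/Cosmics/Example-v1/RECO"]
    # check_ref_hist_tag(ref_hist_name) then verifies that this tag exists
    # in the `refhists' Frontier database.
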
def cmsHarvester.check_ref_hist_tag (   self,
  tag_name 
)
Check the existence of tag_name in database connect_name.

Check if tag_name exists as a reference histogram tag in the
database given by self.frontier_connection_name['refhists'].

Definition at line 4641 of file cmsHarvester.py.

References join().

4641  def check_ref_hist_tag(self, tag_name):
4642  """Check the existence of tag_name in database connect_name.
4643 
4644  Check if tag_name exists as a reference histogram tag in the
4645  database given by self.frontier_connection_name['refhists'].
4646 
4647  """
4648 
4649  connect_name = self.frontier_connection_name["refhists"]
4650  connect_name += self.db_account_name_cms_cond_dqm_summary()
4651 
4652  self.logger.debug("Checking existence of reference " \
4653  "histogram tag `%s'" % \
4654  tag_name)
4655  self.logger.debug(" (Using database connection `%s')" % \
4656  connect_name)
4657 
4658  cmd = "cmscond_list_iov -c %s" % \
4659  connect_name
4660  (status, output) = commands.getstatusoutput(cmd)
4661  if status != 0:
4662  msg = "Could not check existence of tag `%s' in `%s'" % \
4663  (tag_name, connect_name)
4664  self.logger.fatal(msg)
4665  self.logger.debug("Command used:")
4666  self.logger.debug(" %s" % cmd)
4667  self.logger.debug("Output received:")
4668  self.logger.debug(output)
4669  raise Error(msg)
4670  if not tag_name in output.split():
4671  self.logger.debug("Reference histogram tag `%s' " \
4672  "does not exist in `%s'" % \
4673  (tag_name, connect_name))
4674  self.logger.debug(" Existing tags: `%s'" % \
4675  "', `".join(output.split()))
4676  tag_exists = False
4677  else:
4678  tag_exists = True
4679  self.logger.debug(" Reference histogram tag exists? " \
4680  "-> %s" % tag_exists)
4681 
4682  # End of check_ref_hist_tag.
4683  return tag_exists
4684 
def cmsHarvester.create_and_check_castor_dir (   self,
  castor_dir 
)
Check existence of the given CASTOR dir, if necessary create it.

Some special care has to be taken with several things like
setting the correct permissions such that CRAB can store the
output results. Of course this means that things like
/castor/cern.ch/ and user/j/ have to be recognised and treated
properly.

NOTE: Only CERN CASTOR area (/castor/cern.ch/) supported for
the moment.

NOTE: This method uses some slightly tricky caching to make
sure we don't keep over and over checking the same base paths.

Definition at line 1489 of file cmsHarvester.py.

References spr.find(), createfilelist.int, join(), SiStripPI.max, str, and ComparisonHelper.zip().

1489  def create_and_check_castor_dir(self, castor_dir):
1490  """Check existence of the give CASTOR dir, if necessary create
1491  it.
1492 
1493  Some special care has to be taken with several things like
1494  setting the correct permissions such that CRAB can store the
1495  output results. Of course this means that things like
1496  /castor/cern.ch/ and user/j/ have to be recognised and treated
1497  properly.
1498 
1499  NOTE: Only CERN CASTOR area (/castor/cern.ch/) supported for
1500  the moment.
1501 
1502  NOTE: This method uses some slightly tricky caching to make
1503  sure we don't keep over and over checking the same base paths.
1504 
1505  """
1506 
1507  ###
1508 
1509  # Local helper function to fully split a path into pieces.
1510  def split_completely(path):
1511  (parent_path, name) = os.path.split(path)
1512  if name == "":
1513  return (parent_path, )
1514  else:
1515  return split_completely(parent_path) + (name, )
1516 
1517  ###
1518 
1519  # Local helper function to check rfio (i.e. CASTOR)
1520  # directories.
1521  def extract_permissions(rfstat_output):
1522  """Parse the output from rfstat and return the
1523  5-digit permissions string."""
1524 
1525  permissions_line = [i for i in output.split("\n") \
1526  if i.lower().find("protection") > -1]
1527  regexp = re.compile(".*\(([0123456789]{5})\).*")
1528  match = regexp.search(rfstat_output)
1529  if not match or len(match.groups()) != 1:
1530  msg = "Could not extract permissions " \
1531  "from output: %s" % rfstat_output
1532  self.logger.fatal(msg)
1533  raise Error(msg)
1534  permissions = match.group(1)
1535 
1536  # End of extract_permissions.
1537  return permissions
1538 
1539  ###
1540 
1541  # These are the pieces of CASTOR directories that we do not
1542  # want to touch when modifying permissions.
1543 
1544  # NOTE: This is all a bit involved, basically driven by the
1545  # fact that one wants to treat the `j' directory of
1546  # `/castor/cern.ch/user/j/jhegeman/' specially.
1547  # BUG BUG BUG
1548  # This should be simplified, for example by comparing to the
1549  # CASTOR prefix or something like that.
1550  # BUG BUG BUG end
1551  castor_paths_dont_touch = {
1552  0: ["/", "castor", "cern.ch", "cms", "store", "temp",
1553  "dqm", "offline", "user"],
1554  -1: ["user", "store"]
1555  }
1556 
1557  self.logger.debug("Checking CASTOR path `%s'" % castor_dir)
1558 
1559  ###
1560 
1561  # First we take the full CASTOR path apart.
1562  castor_path_pieces = split_completely(castor_dir)
1563 
1564  # Now slowly rebuild the CASTOR path and see if a) all
1565  # permissions are set correctly and b) the final destination
1566  # exists.
1567  path = ""
1568  check_sizes = sorted(castor_paths_dont_touch.keys())
1569  len_castor_path_pieces = len(castor_path_pieces)
1570  for piece_index in xrange (len_castor_path_pieces):
1571  skip_this_path_piece = False
1572  piece = castor_path_pieces[piece_index]
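The split_completely() helper in the listing decomposes a CASTOR path into its individual pieces so that permissions can be checked (and directories created) level by level. For illustration, applied to the example path mentioned in the comments (without a trailing slash):

    import os.path

    def split_completely(path):
        (parent_path, name) = os.path.split(path)
        if name == "":
            return (parent_path, )
        else:
            return split_completely(parent_path) + (name, )

    split_completely("/castor/cern.ch/user/j/jhegeman")
    # -> ('/', 'castor', 'cern.ch', 'user', 'j', 'jhegeman')
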
def cmsHarvester.create_and_check_castor_dirs (   self)
Make sure all required CASTOR output dirs exist.

This checks the CASTOR base dir specified by the user as well
as all the subdirs required by the current set of jobs.

Definition at line 1430 of file cmsHarvester.py.

References SiStripPI.max.

1431  """Make sure all required CASTOR output dirs exist.
1432 
1433  This checks the CASTOR base dir specified by the user as well
1434  as all the subdirs required by the current set of jobs.
1435 
1436  """
1437 
1438  self.logger.info("Checking (and if necessary creating) CASTOR " \
1439  "output area(s)...")
1440 
1441  # Call the real checker method for the base dir.
1442  self.create_and_check_castor_dir(self.castor_base_dir)
1443 
1444  # Now call the checker for all (unique) subdirs.
1445  castor_dirs = []
1446  for (dataset_name, runs) in six.iteritems(self.datasets_to_use):
1447 
1448  for run in runs:
1449  castor_dirs.append(self.datasets_information[dataset_name] \
1450  ["castor_path"][run])
1451  castor_dirs_unique = sorted(set(castor_dirs))
1452  # This can take some time. E.g. CRAFT08 has > 300 runs, each
1453  # of which will get a new directory. So we show some (rough)
1454  # info in between.
1455  ndirs = len(castor_dirs_unique)
1456  step = max(ndirs / 10, 1)
1457  for (i, castor_dir) in enumerate(castor_dirs_unique):
1458  if (i + 1) % step == 0 or \
1459  (i + 1) == ndirs:
1460  self.logger.info(" %d/%d" % \
1461  (i + 1, ndirs))
1462  self.create_and_check_castor_dir(castor_dir)
1463 
1464  # Now check if the directory is empty. If (an old version
1465  # of) the output file already exists CRAB will run new
1466  # jobs but never copy the results back. We assume the user
1467  # knows what they are doing and only issue a warning in
1468  # case the directory is not empty.
1469  self.logger.debug("Checking if path `%s' is empty" % \
1470  castor_dir)
1471  cmd = "rfdir %s" % castor_dir
1472  (status, output) = commands.getstatusoutput(cmd)
1473  if status != 0:
1474  msg = "Could not access directory `%s'" \
1475  " !!! This is bad since I should have just" \
1476  " created it !!!" % castor_dir
1477  self.logger.fatal(msg)
1478  raise Error(msg)
1479  if len(output) > 0:
1480  self.logger.warning("Output directory `%s' is not empty:" \
1481  " new jobs will fail to" \
1482  " copy back output" % \
1483  castor_dir)
1484 
1485  # End of create_and_check_castor_dirs.
1486 
def cmsHarvester.create_castor_path_name_common (   self,
  dataset_name 
)
Build the common part of the output path to be used on
CASTOR.

This consists of the CASTOR area base path specified by the
user and a piece depending on the data type (data vs. MC), the
harvesting type and the dataset name followed by a piece
containing the run number and event count. (See comments in
create_castor_path_name_special for details.) This method
creates the common part, without run number and event count.

Definition at line 1326 of file cmsHarvester.py.

References create_castor_path_name_special(), python.rootplot.root2matplotlib.replace(), and digitizers_cfi.strip.

1326  def create_castor_path_name_common(self, dataset_name):
1327  """Build the common part of the output path to be used on
1328  CASTOR.
1329 
1330  This consists of the CASTOR area base path specified by the
1331  user and a piece depending on the data type (data vs. MC), the
1332  harvesting type and the dataset name followed by a piece
1333  containing the run number and event count. (See comments in
1334  create_castor_path_name_special for details.) This method
1335  creates the common part, without run number and event count.
1336 
1337  """
1338 
1339  castor_path = self.castor_base_dir
1340 
1341  ###
1342 
1343  # The data type: data vs. mc.
1344  datatype = self.datasets_information[dataset_name]["datatype"]
1345  datatype = datatype.lower()
1346  castor_path = os.path.join(castor_path, datatype)
1347 
1348  # The harvesting type.
1349  harvesting_type = self.harvesting_type
1350  harvesting_type = harvesting_type.lower()
1351  castor_path = os.path.join(castor_path, harvesting_type)
1352 
1353  # The CMSSW release version (only the `digits'). Note that the
1354  # CMSSW version used here is the version used for harvesting,
1355  # not the one from the dataset. This does make the results
1356  # slightly harder to find. On the other hand it solves
1357  # problems in case one re-harvests a given dataset with a
1358  # different CMSSW version, which would lead to ambiguous path
1359  # names. (Of course for many cases the harvesting is done with
1360  # the same CMSSW version the dataset was created with.)
1361  release_version = self.cmssw_version
1362  release_version = release_version.lower(). \
1363  replace("cmssw", ""). \
1364  strip("_")
1365  castor_path = os.path.join(castor_path, release_version)
1366 
1367  # The dataset name.
1368  dataset_name_escaped = self.escape_dataset_name(dataset_name)
1369  castor_path = os.path.join(castor_path, dataset_name_escaped)
1370 
1371  ###
1372 
1373  castor_path = os.path.normpath(castor_path)
1374 
1375  # End of create_castor_path_name_common.
1376  return castor_path
1377 
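As an illustration, with hypothetical values for the user-supplied base directory and the DBS-derived pieces, the common path is assembled roughly like this:

import os

castor_base_dir = "/castor/cern.ch/cms/store/temp/dqm/offline/user/jdoe"  # hypothetical
datatype = "data"                 # from DBS, lower-cased
harvesting_type = "dqmoffline"    # the lower-cased harvesting type
release_version = "3_3_0"         # self.cmssw_version with "CMSSW" stripped
dataset_name_escaped = "Cosmics__Commissioning09-v1__RECO"

castor_path = os.path.normpath(os.path.join(
    castor_base_dir, datatype, harvesting_type,
    release_version, dataset_name_escaped))
# castor_path ->
# "/castor/cern.ch/cms/store/temp/dqm/offline/user/jdoe/data/dqmoffline/3_3_0/Cosmics__Commissioning09-v1__RECO"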
def cmsHarvester.create_castor_path_name_special (   self,
  dataset_name,
  run_number,
  castor_path_common 
)
Create the specialised part of the CASTOR output dir name.

NOTE: To avoid clashes with `incremental harvesting'
(re-harvesting when a dataset grows) we have to include the
event count in the path name. The underlying `problem' is that
CRAB does not overwrite existing output files so if the output
file already exists CRAB will fail to copy back the output.

NOTE: It's not possible to create different kinds of
harvesting jobs in a single call to this tool. However, in
principle it could be possible to create both data and MC jobs
in a single go.

NOTE: The number of events used in the path name is the
_total_ number of events in the dataset/run at the time of
harvesting. If we're doing partial harvesting the final
results will reflect lower statistics. This is a) the easiest
to code and b) the least likely to lead to confusion if
someone ever decides to swap/copy around file blocks between
sites.

Definition at line 1382 of file cmsHarvester.py.

Referenced by create_castor_path_name_common().

1382  castor_path_common):
1383  """Create the specialised part of the CASTOR output dir name.
1384 
1385  NOTE: To avoid clashes with `incremental harvesting'
1386  (re-harvesting when a dataset grows) we have to include the
1387  event count in the path name. The underlying `problem' is that
1388  CRAB does not overwrite existing output files so if the output
1389  file already exists CRAB will fail to copy back the output.
1390 
1391  NOTE: It's not possible to create different kinds of
1392  harvesting jobs in a single call to this tool. However, in
1393  principle it could be possible to create both data and MC jobs
1394  in a single go.
1395 
1396  NOTE: The number of events used in the path name is the
1397  _total_ number of events in the dataset/run at the time of
1398  harvesting. If we're doing partial harvesting the final
1399  results will reflect lower statistics. This is a) the easiest
1400  to code and b) the least likely to lead to confusion if
1401  someone ever decides to swap/copy around file blocks between
1402  sites.
1403 
1404  """
1405 
1406  castor_path = castor_path_common
1407 
1408  ###
1409 
1410  # The run number part.
1411  castor_path = os.path.join(castor_path, "run_%d" % run_number)
1412 
1413  ###
1414 
1415  # The event count (i.e. the number of events we currently see
1416  # for this dataset).
1417  #nevents = self.datasets_information[dataset_name] \
1418  # ["num_events"][run_number]
1419  castor_path = os.path.join(castor_path, "nevents")
1420 
1421  ###
1422 
1423  castor_path = os.path.normpath(castor_path)
1424 
1425  # End of create_castor_path_name_special.
1426  return castor_path
1427 
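Continuing the hypothetical example from create_castor_path_name_common(), the specialised part appends the run number and, since the event count lookup is commented out in the code above, the literal string `nevents':

import os

castor_path_common = "/castor/cern.ch/cms/store/temp/dqm/offline/user/jdoe/" \
                     "data/dqmoffline/3_3_0/Cosmics__Commissioning09-v1__RECO"  # hypothetical
run_number = 123456

castor_path = os.path.join(castor_path_common, "run_%d" % run_number)
castor_path = os.path.join(castor_path, "nevents")
castor_path = os.path.normpath(castor_path)
# -> ".../Cosmics__Commissioning09-v1__RECO/run_123456/nevents"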
def cmsHarvester.create_config_file_name (   self,
  dataset_name,
  run_number 
)
Generate the name of the configuration file to be run by
CRAB.

Depending on the harvesting mode (single-step or two-step)
this is the name of the real harvesting configuration or the
name of the first-step ME summary extraction configuration.

Definition at line 4063 of file cmsHarvester.py.

4063  def create_config_file_name(self, dataset_name, run_number):
4064  """Generate the name of the configuration file to be run by
4065  CRAB.
4066 
4067  Depending on the harvesting mode (single-step or two-step)
4068  this is the name of the real harvesting configuration or the
4069  name of the first-step ME summary extraction configuration.
4070 
4071  """
4072 
4073  if self.harvesting_mode == "single-step":
4074  config_file_name = self.create_harvesting_config_file_name(dataset_name)
4075  elif self.harvesting_mode == "single-step-allow-partial":
4076  config_file_name = self.create_harvesting_config_file_name(dataset_name)
def cmsHarvester.create_crab_config (   self)
Create a CRAB configuration for a given job.

NOTE: This is _not_ a complete (as in: submittable) CRAB
configuration. It is used to store the common settings for the
multicrab configuration.

NOTE: Only CERN CASTOR area (/castor/cern.ch/) is supported.

NOTE: According to CRAB, you `Must define exactly two of
total_number_of_events, events_per_job, or
number_of_jobs.'. For single-step harvesting we force one job,
for the rest we don't really care.

# BUG BUG BUG
# With the current version of CRAB (2.6.1), in which Daniele
# fixed the behaviour of no_block_boundary for me, one _has to
# specify_ the total_number_of_events and one single site in
# the se_white_list.
# BUG BUG BUG end

Definition at line 4231 of file cmsHarvester.py.

References join().

4232  """Create a CRAB configuration for a given job.
4233 
4234  NOTE: This is _not_ a complete (as in: submittable) CRAB
4235  configuration. It is used to store the common settings for the
4236  multicrab configuration.
4237 
4238  NOTE: Only CERN CASTOR area (/castor/cern.ch/) is supported.
4239 
4240  NOTE: According to CRAB, you `Must define exactly two of
4241  total_number_of_events, events_per_job, or
4242  number_of_jobs.'. For single-step harvesting we force one job,
4243  for the rest we don't really care.
4244 
4245  # BUG BUG BUG
4246  # With the current version of CRAB (2.6.1), in which Daniele
4247  # fixed the behaviour of no_block_boundary for me, one _has to
4248  # specify_ the total_number_of_events and one single site in
4249  # the se_white_list.
4250  # BUG BUG BUG end
4251 
4252  """
4253 
4254  tmp = []
4255 
4256  # This is the stuff we will need to fill in.
4257  castor_prefix = self.castor_prefix
4258 
4259  tmp.append(self.config_file_header())
4260  tmp.append("")
4261 
def cmsHarvester.create_es_prefer_snippet (   self,
  dataset_name 
)
Build the es_prefer snippet for the reference histograms.

The building of the snippet is wrapped in some care-taking
code that figures out the name of the reference histogram set
and makes sure the corresponding tag exists.

Definition at line 4687 of file cmsHarvester.py.

References join().

4687  def create_es_prefer_snippet(self, dataset_name):
4688  """Build the es_prefer snippet for the reference histograms.
4689 
4690  The building of the snippet is wrapped in some care-taking
4691  code that figures out the name of the reference histogram set
4692  and makes sure the corresponding tag exists.
4693 
4694  """
4695 
4696  # Figure out the name of the reference histograms tag.
4697  # NOTE: The existence of these tags has already been checked.
4698  ref_hist_tag_name = self.ref_hist_mappings[dataset_name]
4699 
4700  connect_name = self.frontier_connection_name["refhists"]
4701  connect_name += self.db_account_name_cms_cond_dqm_summary()
4702  record_name = "DQMReferenceHistogramRootFileRcd"
4703 
4704  # Build up the code snippet.
4705  code_lines = []
4706  code_lines.append("from CondCore.DBCommon.CondDBSetup_cfi import *")
4707  code_lines.append("process.ref_hist_source = cms.ESSource(\"PoolDBESSource\", CondDBSetup,")
4708  code_lines.append(" connect = cms.string(\"%s\")," % connect_name)
4709  code_lines.append(" toGet = cms.VPSet(cms.PSet(record = cms.string(\"%s\")," % record_name)
4710  code_lines.append(" tag = cms.string(\"%s\"))," % ref_hist_tag_name)
4711  code_lines.append(" )")
4712  code_lines.append(" )")
4713  code_lines.append("process.es_prefer_ref_hist_source = cms.ESPrefer(\"PoolDBESSource\", \"ref_hist_source\")")
4714 
4715  snippet = "\n".join(code_lines)
4716 
4717  # End of create_es_prefer_snippet.
4718  return snippet
4719 
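With a hypothetical connect string and reference histogram tag, the snippet returned by this method reads roughly as follows (it is meant to be appended to a harvesting configuration in which `process' and `cms' already exist):

from CondCore.DBCommon.CondDBSetup_cfi import *
process.ref_hist_source = cms.ESSource("PoolDBESSource", CondDBSetup,
    connect = cms.string("frontier://FrontierProd/CMS_COND_31X_DQM_SUMMARY"),  # hypothetical
    toGet = cms.VPSet(cms.PSet(record = cms.string("DQMReferenceHistogramRootFileRcd"),
                               tag = cms.string("some_ref_hist_tag"))),        # hypothetical
    )
process.es_prefer_ref_hist_source = cms.ESPrefer("PoolDBESSource", "ref_hist_source")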
def cmsHarvester.create_harvesting_config (   self,
  dataset_name 
)
Create the Python harvesting configuration for harvesting.

The basic configuration is created by
Configuration.PyReleaseValidation.ConfigBuilder. (This mimics
what cmsDriver.py does.) After that we add some specials
ourselves.

NOTE: On one hand it may not be nice to circumvent
cmsDriver.py, on the other hand cmsDriver.py does not really
do anything itself. All the real work is done by the
ConfigBuilder so there is not much risk that we miss out on
essential developments of cmsDriver in the future.

Definition at line 4722 of file cmsHarvester.py.

References join().

4722  def create_harvesting_config(self, dataset_name):
4723  """Create the Python harvesting configuration for harvesting.
4724 
4725  The basic configuration is created by
4726  Configuration.PyReleaseValidation.ConfigBuilder. (This mimics
4727  what cmsDriver.py does.) After that we add some specials
4728  ourselves.
4729 
4730  NOTE: On one hand it may not be nice to circumvent
4731  cmsDriver.py, on the other hand cmsDriver.py does not really
4732  do anything itself. All the real work is done by the
4733  ConfigBuilder so there is not much risk that we miss out on
4734  essential developments of cmsDriver in the future.
4735 
4736  """
4737 
4738  # Setup some options needed by the ConfigBuilder.
4739  config_options = defaultOptions
4740 
4741  # These are fixed for all kinds of harvesting jobs. Some of
4742  # them are not needed for the harvesting config, but to keep
4743  # the ConfigBuilder happy.
4744  config_options.name = "harvesting"
4745  config_options.scenario = "pp"
4746  config_options.number = 1
4747  config_options.arguments = self.ident_string()
4748  config_options.evt_type = config_options.name
4749  config_options.customisation_file = None
4750  config_options.filein = "dummy_value"
4751  config_options.filetype = "EDM"
4752  # This seems to be new in CMSSW 3.3.X, no clue what it does.
4753  config_options.gflash = "dummy_value"
4754  # This seems to be new in CMSSW 3.3.0.pre6, no clue what it
4755  # does.
4756  #config_options.himix = "dummy_value"
def cmsHarvester.create_harvesting_config_file_name (   self,
  dataset_name 
)

Only add the alarming piece to the file name if this is

a spread-out dataset.

pdb.set_trace() if self.datasets_information[dataset_name] \ ["mirrored"][run_number] == False: config_file_name = config_file_name.replace(".py", "_partial.py")

Definition at line 4095 of file cmsHarvester.py.

Referenced by write_harvesting_config().

4095  def create_harvesting_config_file_name(self, dataset_name):
4096  "Generate the name to be used for the harvesting config file."
4097 
4098  file_name_base = "harvesting.py"
4099  dataset_name_escaped = self.escape_dataset_name(dataset_name)
4100  config_file_name = file_name_base.replace(".py",
4101  "_%s.py" % \
4102  dataset_name_escaped)
4103 
4104  # End of create_harvesting_config_file_name.
4105  return config_file_name
4106 
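For example, for a hypothetical escaped dataset name the generated file name becomes:

file_name_base = "harvesting.py"
dataset_name_escaped = "Cosmics__Commissioning09-v1__RECO"  # hypothetical
config_file_name = file_name_base.replace(".py", "_%s.py" % dataset_name_escaped)
# -> "harvesting_Cosmics__Commissioning09-v1__RECO.py"
# (create_me_summary_config_file_name() does the same starting from "me_extraction.py")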
def cmsHarvester.create_harvesting_output_file_name (   self,
  dataset_name,
  run_number 
)
Generate the name to be used for the harvesting output file.

This harvesting output file is the _final_ ROOT output file
containing the harvesting results. In case of two-step
harvesting there is an intermediate ME output file as well.

Definition at line 4167 of file cmsHarvester.py.

4167  def create_harvesting_output_file_name(self, dataset_name, run_number):
4168  """Generate the name to be used for the harvesting output file.
4169 
4170  This harvesting output file is the _final_ ROOT output file
4171  containing the harvesting results. In case of two-step
4172  harvesting there is an intermediate ME output file as well.
4173 
4174  """
4175 
4176  dataset_name_escaped = self.escape_dataset_name(dataset_name)
4177 
4178  # Hmmm, looking at the code for the DQMFileSaver this might
4179  # actually be the place where the first part of this file
4180  # naming scheme comes from.
4181  # NOTE: It looks like the `V0001' comes from the DQM
4182  # version. This is something that cannot be looked up from
4183  # here, so let's hope it does not change too often.
4184  output_file_name = "DQM_V0001_R%09d__%s.root" % \
4185  (run_number, dataset_name_escaped)
4186  if self.harvesting_mode.find("partial") > -1:
4187  # Only add the alarming piece to the file name if this is
4188  # a spread-out dataset.
4189  if self.datasets_information[dataset_name] \
4190  ["mirrored"][run_number] == False:
4191  output_file_name = output_file_name.replace(".root", \
4192  "_partial.root")
4193 
4194  # End of create_harvesting_output_file_name.
4195  return output_file_name
4196 
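For a hypothetical run and dataset the resulting file name looks like this:

run_number = 123456
dataset_name_escaped = "Cosmics__Commissioning09-v1__RECO"  # hypothetical
output_file_name = "DQM_V0001_R%09d__%s.root" % (run_number, dataset_name_escaped)
# -> "DQM_V0001_R000123456__Cosmics__Commissioning09-v1__RECO.root"
# In a "partial" harvesting mode, ".root" becomes "_partial.root" for runs
# that are not mirrored in full at a single site.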
def cmsHarvester.create_me_extraction_config (   self,
  dataset_name 
)

In case this file is the second step (the real harvesting step) of
two-step harvesting, it has to be told to use our local files.

(The source keeps a commented-out block of customisations for the
"two-step" mode: it looks up the run's castor_path, lists that
directory with `rfdir' via commands.getstatusoutput(), collects all
files starting with `EDM_summary' and ending in `.root' as
"rfio:<path>/<file>" entries, and assigns them to
process.source.fileNames. A stray `import pdb', marked "BUG BUG BUG,
to be removed in production version", is part of that block, as is the
skeleton of a create_harvesting_config_two_step() method that for now
simply delegates to create_harvesting_config_single_step().)

Definition at line 4948 of file cmsHarvester.py.

References create_output_file_name(), and join().

4948  def create_me_extraction_config(self, dataset_name):
4949  """
4950 
4951  """
4952 
4953  # Big chunk of hard-coded Python. Not such a big deal since
4954  # this does not do much and is not likely to break.
4955  tmp = []
4956  tmp.append(self.config_file_header())
4957  tmp.append("")
4958  tmp.append("import FWCore.ParameterSet.Config as cms")
4959  tmp.append("")
4960  tmp.append("process = cms.Process(\"ME2EDM\")")
4961  tmp.append("")
4962  tmp.append("# Import of standard configurations")
4963  tmp.append("process.load(\"Configuration/EventContent/EventContent_cff\")")
4964  tmp.append("")
4965  tmp.append("# We don't really process any events, just keep this set to one to")
4966  tmp.append("# make sure things work.")
4967  tmp.append("process.maxEvents = cms.untracked.PSet(")
4968  tmp.append(" input = cms.untracked.int32(1)")
4969  tmp.append(" )")
4970  tmp.append("")
4971  tmp.append("process.options = cms.untracked.PSet(")
4972  tmp.append(" Rethrow = cms.untracked.vstring(\"ProductNotFound\")")
4973  tmp.append(" )")
4974  tmp.append("")
4975  tmp.append("process.source = cms.Source(\"PoolSource\",")
4976  tmp.append(" processingMode = \\")
4977  tmp.append(" cms.untracked.string(\"RunsAndLumis\"),")
4978  tmp.append(" fileNames = \\")
4979  tmp.append(" cms.untracked.vstring(\"no_file_specified\")")
4980  tmp.append(" )")
4981  tmp.append("")
4982  tmp.append("# Output definition: drop everything except for the monitoring.")
4983  tmp.append("process.output = cms.OutputModule(")
4984  tmp.append(" \"PoolOutputModule\",")
4985  tmp.append(" outputCommands = \\")
4986  tmp.append(" cms.untracked.vstring(\"drop *\", \\")
4987  tmp.append(" \"keep *_MEtoEDMConverter_*_*\"),")
4988  output_file_name = self. \
4989  create_output_file_name(dataset_name)
4990  tmp.append(" fileName = \\")
4991  tmp.append(" cms.untracked.string(\"%s\")," % output_file_name)
4992  tmp.append(" dataset = cms.untracked.PSet(")
4993  tmp.append(" dataTier = cms.untracked.string(\"RECO\"),")
4994  tmp.append(" filterName = cms.untracked.string(\"\")")
4995  tmp.append(" )")
4996  tmp.append(" )")
4997  tmp.append("")
4998  tmp.append("# Additional output definition")
4999  tmp.append("process.out_step = cms.EndPath(process.output)")
5000  tmp.append("")
5001  tmp.append("# Schedule definition")
5002  tmp.append("process.schedule = cms.Schedule(process.out_step)")
5003  tmp.append("")
5004 
5005  config_contents = "\n".join(tmp)
5006 
5007  # End of create_me_extraction_config.
5008  return config_contents
5009 
def cmsHarvester.create_me_summary_config_file_name (   self,
  dataset_name 
)

Definition at line 4109 of file cmsHarvester.py.

Referenced by write_me_extraction_config().

4109  def create_me_summary_config_file_name(self, dataset_name):
4110  "Generate the name of the ME summary extraction config file."
4111 
4112  file_name_base = "me_extraction.py"
4113  dataset_name_escaped = self.escape_dataset_name(dataset_name)
4114  config_file_name = file_name_base.replace(".py",
4115  "_%s.py" % \
4116  dataset_name_escaped)
4117 
4118  # End of create_me_summary_config_file_name.
4119  return config_file_name
4120 
def cmsHarvester.create_me_summary_output_file_name (   self,
  dataset_name 
)
Generate the name of the intermediate ME file name to be
used in two-step harvesting.

Definition at line 4199 of file cmsHarvester.py.

4199  def create_me_summary_output_file_name(self, dataset_name):
4200  """Generate the name of the intermediate ME file name to be
4201  used in two-step harvesting.
4202 
4203  """
4204 
4205  dataset_name_escaped = self.escape_dataset_name(dataset_name)
4206  output_file_name = "me_summary_%s.root" % \
4207  dataset_name_escaped
4208 
4209  # End of create_me_summary_output_file_name.
4210  return output_file_name
4211 
def cmsHarvester.create_multicrab_block_name (   self,
  dataset_name,
  run_number,
  index 
)
Create the block name to use for this dataset/run number.

This is what appears in the brackets `[]' in multicrab.cfg. It
is used as the name of the job and to create output
directories.

Definition at line 4214 of file cmsHarvester.py.

4214  def create_multicrab_block_name(self, dataset_name, run_number, index):
4215  """Create the block name to use for this dataset/run number.
4216 
4217  This is what appears in the brackets `[]' in multicrab.cfg. It
4218  is used as the name of the job and to create output
4219  directories.
4220 
4221  """
4222 
4223  dataset_name_escaped = self.escape_dataset_name(dataset_name)
4224  block_name = "%s_%09d_%s" % (dataset_name_escaped, run_number, index)
4225 
4226  # End of create_multicrab_block_name.
4227  return block_name
4228 
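For a hypothetical dataset, run and index the block name comes out as:

dataset_name_escaped = "Cosmics__Commissioning09-v1__RECO"  # hypothetical
run_number = 123456
index = 3
block_name = "%s_%09d_%s" % (dataset_name_escaped, run_number, index)
# -> "Cosmics__Commissioning09-v1__RECO_000123456_3"
# i.e. the name inside the `[]' brackets of this job's multicrab.cfg block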
def cmsHarvester.create_multicrab_config (   self)

CRAB, GRID, USER, CMSSW, CAF (the sections of the underlying CRAB configuration).

Create a multicrab.cfg file for all samples.

This creates the contents for a multicrab.cfg file that uses
the crab.cfg file (generated elsewhere) for the basic settings
and contains blocks for each run of each dataset.

# BUG BUG BUG
# The fact that it's necessary to specify the se_white_list
# and the total_number_of_events is due to our use of CRAB
# version 2.6.1. This should no longer be necessary in the
# future.
# BUG BUG BUG end

Definition at line 4311 of file cmsHarvester.py.

4312  """Create a multicrab.cfg file for all samples.
4313 
4314  This creates the contents for a multicrab.cfg file that uses
4315  the crab.cfg file (generated elsewhere) for the basic settings
4316  and contains blocks for each run of each dataset.
4317 
4318  # BUG BUG BUG
4319  # The fact that it's necessary to specify the se_white_list
4320  # and the total_number_of_events is due to our use of CRAB
4321  # version 2.6.1. This should no longer be necessary in the
4322  # future.
4323  # BUG BUG BUG end
4324 
4325  """
4326 
def cmsHarvester.create_output_file_name (   self,
  dataset_name,
  run_number = None 
)
Create the name of the output file name to be used.

This is the name of the output file of the `first step'. In
the case of single-step harvesting this is already the final
harvesting output ROOT file. In the case of two-step
harvesting it is the name of the intermediary ME summary
file.

Definition at line 4123 of file cmsHarvester.py.

Referenced by create_me_extraction_config().

4123  def create_output_file_name(self, dataset_name, run_number=None):
4124  """Create the name of the output file name to be used.
4125 
4126  This is the name of the output file of the `first step'. In
4127  the case of single-step harvesting this is already the final
4128  harvesting output ROOT file. In the case of two-step
4129  harvesting it is the name of the intermediary ME summary
4130  file.
4131 
4132  """
4133 
4134  # BUG BUG BUG
4135  # This method has become a bit of a mess. Originally it was
4136  # nice to have one entry point for both single- and two-step
4137  # output file names. However, now the former needs the run
4138  # number, while the latter does not even know about run
4139  # numbers. This should be fixed up a bit.
4140  # BUG BUG BUG end
4141 
4142  if self.harvesting_mode == "single-step":
4143  # DEBUG DEBUG DEBUG
4144  assert not run_number is None
4145  # DEBUG DEBUG DEBUG end
4146  output_file_name = self.create_harvesting_output_file_name(dataset_name, run_number)
4147  elif self.harvesting_mode == "single-step-allow-partial":
4148  # DEBUG DEBUG DEBUG
4149  assert not run_number is None
4150  # DEBUG DEBUG DEBUG end
4151  output_file_name = self.create_harvesting_output_file_name(dataset_name, run_number)
4152  elif self.harvesting_mode == "two-step":
4153  # DEBUG DEBUG DEBUG
4154  assert run_number is None
4155  # DEBUG DEBUG DEBUG end
4156  output_file_name = self.create_me_summary_output_file_name(dataset_name)
4157  else:
4158  # This should not be possible, but hey...
4159  assert False, "ERROR Unknown harvesting mode `%s'" % \
4160  self.harvesting_mode
4161 
 4162  # End of create_output_file_name.
4163  return output_file_name
4164 
def cmsHarvester.dbs_check_dataset_spread (   self,
  dataset_name 
)

Two older, now unused versions are kept commented out in the source:

dbs_resolve_dataset_number_of_sites(): asked DBS across how many sites
a dataset had been spread out, via a `find count(site) where dataset =
...' query. This was mainly useful to check that a job meant to run
over a complete sample is not sent somewhere that only holds part of
it.

dbs_check_dataset_spread_old(): figured out, per run, across how many
sites the dataset is spread. Basically three things can happen with a
given dataset:

  • the whole dataset is available on a single site,
  • the whole dataset is available (mirrored) at multiple sites,
  • the dataset is spread across multiple sites and there is no single
    site containing the full dataset in one place.

If all goes well it should not be possible that anything but a full
dataset is mirrored, so the mixed case (one complete copy plus partial
copies elsewhere) is not designed for; at worst it ends up as a false
negative and the dataset is treated, unnecessarily, as spread-out. The
first two cases need no special treatment, but in the third case the
harvesting has to be run in two-step mode. The check works by asking
DBS for the file count per site and per run: in the first case there
is only one site, in the second case all sites report the same number
of files (the total number of files in the dataset), and in the third
case the per-site file counts only add up to the total when summed.
For mirrored runs one site is picked via pick_a_site(), avoiding
things like a T0, a T1 or CAF; for spread-out runs all sites are kept.
The result is a per-run list of (site, number of events) pairs, which
is also written to the debug log. Note the BUG comment in the old
code: DBS never returned anything but zero for the per-run event
count.
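A minimal sketch of the mirrored-versus-spread-out decision just described (pick_a_site() is the real helper from cmsHarvester; the tuple layout is assumed for illustration, the real method works on the parsed DBS output instead):

def classify_run(site_info, pick_a_site):
    """site_info: list of (site_name, file_count, event_count) tuples for one run."""
    unique_file_counts = set([count for (site, count, events) in site_info])
    if len(unique_file_counts) == 1:
        # Identical file counts everywhere: this run is mirrored.
        # Pick one site we are actually allowed to submit to
        # (i.e. not a T0, a T1, or CAF).
        site_names = [pick_a_site([site for (site, count, events) in site_info])]
        nevents = [site_info[0][2]]
    else:
        # Differing file counts: this run is spread out over several sites.
        site_names = [site for (site, count, events) in site_info]
        nevents = [events for (site, count, events) in site_info]
    return zip(site_names, nevents)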

Figure out the number of events in each run of this dataset.

This is a more efficient way of doing this than calling
dbs_resolve_number_of_events for each run.

Definition at line 3075 of file cmsHarvester.py.

References mps_setup.append, createfilelist.int, mps_monitormerge.items, relativeConstraints.keys, list(), and MuonErrorMatrixValues_cff.values.

3075  def dbs_check_dataset_spread(self, dataset_name):
3076  """Figure out the number of events in each run of this dataset.
3077 
3078  This is a more efficient way of doing this than calling
3079  dbs_resolve_number_of_events for each run.
3080 
3081  """
3082 
3083  self.logger.debug("Checking spread of dataset `%s'" % dataset_name)
3084 
3085  # DEBUG DEBUG DEBUG
3086  # If we get here DBS should have been set up already.
3087  assert not self.dbs_api is None
3088  # DEBUG DEBUG DEBUG end
3089 
3090  api = self.dbs_api
3091  dbs_query = "find run.number, site, file.name, file.numevents " \
3092  "where dataset = %s " \
3093  "and dataset.status = VALID" % \
3094  dataset_name
3095  try:
3096  api_result = api.executeQuery(dbs_query)
3097  except DBSAPI.dbsApiException.DbsApiException:
3098  msg = "ERROR: Could not execute DBS query"
3099  self.logger.fatal(msg)
3100  raise Error(msg)
3101 
3102  handler = DBSXMLHandler(["run.number", "site", "file.name", "file.numevents"])
3103  parser = xml.sax.make_parser()
3104  parser.setContentHandler(handler)
3105 
3106  try:
3107  # OBSOLETE OBSOLETE OBSOLETE
def cmsHarvester.dbs_resolve_cmssw_version (   self,
  dataset_name 
)
Ask DBS for the CMSSW version used to create this dataset.

Definition at line 2474 of file cmsHarvester.py.

2474  def dbs_resolve_cmssw_version(self, dataset_name):
2475  """Ask DBS for the CMSSW version used to create this dataset.
2476 
2477  """
2478 
2479  # DEBUG DEBUG DEBUG
2480  # If we get here DBS should have been set up already.
2481  assert not self.dbs_api is None
2482  # DEBUG DEBUG DEBUG end
2483 
2484  api = self.dbs_api
2485  dbs_query = "find algo.version where dataset = %s " \
2486  "and dataset.status = VALID" % \
2487  dataset_name
2488  try:
2489  api_result = api.executeQuery(dbs_query)
2490  except DBSAPI.dbsApiException.DbsApiException:
2491  msg = "ERROR: Could not execute DBS query"
2492  self.logger.fatal(msg)
2493  raise Error(msg)
2494 
2495  handler = DBSXMLHandler(["algo.version"])
2496  parser = xml.sax.make_parser()
2497  parser.setContentHandler(handler)
2498 
2499  try:
2500  xml.sax.parseString(api_result, handler)
2501  except SAXParseException:
2502  msg = "ERROR: Could not parse DBS server output"
2503  self.logger.fatal(msg)
2504  raise Error(msg)
2505 
2506  # DEBUG DEBUG DEBUG
2507  assert(handler.check_results_validity()), "ERROR The DBSXMLHandler screwed something up!"
2508  # DEBUG DEBUG DEBUG end
2509 
2510  cmssw_version = handler.results.values()[0]
2511 
2512  # DEBUG DEBUG DEBUG
2513  assert len(cmssw_version) == 1
2514  # DEBUG DEBUG DEBUG end
2515 
2516  cmssw_version = cmssw_version[0]
2517 
2518  # End of dbs_resolve_cmssw_version.
2519  return cmssw_version
2520 
def cmsHarvester.dbs_resolve_dataset_name (   self,
  dataset_name 
)
Use DBS to resolve a wildcarded dataset name.

Definition at line 2418 of file cmsHarvester.py.

Referenced by build_dataset_list().

2418  def dbs_resolve_dataset_name(self, dataset_name):
2419  """Use DBS to resolve a wildcarded dataset name.
2420 
2421  """
2422 
2423  # DEBUG DEBUG DEBUG
2424  # If we get here DBS should have been set up already.
2425  assert not self.dbs_api is None
2426  # DEBUG DEBUG DEBUG end
2427 
2428  # Some minor checking to make sure that whatever we've been
2429  # given as dataset name actually sounds like a dataset name.
2430  if not (dataset_name.startswith("/") and \
2431  dataset_name.endswith("RECO")):
2432  self.logger.warning("Dataset name `%s' does not sound " \
2433  "like a valid dataset name!" % \
2434  dataset_name)
2435 
2436  #----------
2437 
2438  api = self.dbs_api
2439  dbs_query = "find dataset where dataset like %s " \
2440  "and dataset.status = VALID" % \
2441  dataset_name
2442  try:
2443  api_result = api.executeQuery(dbs_query)
2444  except DBSAPI.dbsApiException.DbsApiException:
2445  msg = "ERROR: Could not execute DBS query"
2446  self.logger.fatal(msg)
2447  raise Error(msg)
2448 
2449  # Setup parsing.
2450  handler = DBSXMLHandler(["dataset"])
2451  parser = xml.sax.make_parser()
2452  parser.setContentHandler(handler)
2453 
2454  # Parse.
2455  try:
2456  xml.sax.parseString(api_result, handler)
2457  except SAXParseException:
2458  msg = "ERROR: Could not parse DBS server output"
2459  self.logger.fatal(msg)
2460  raise Error(msg)
2461 
2462  # DEBUG DEBUG DEBUG
2463  assert(handler.check_results_validity()), "ERROR The DBSXMLHandler screwed something up!"
2464  # DEBUG DEBUG DEBUG end
2465 
2466  # Extract the results.
2467  datasets = handler.results.values()[0]
2468 
2469  # End of dbs_resolve_dataset_name.
2470  return datasets
2471 
def cmsHarvester.dbs_resolve_datatype (   self,
  dataset_name 
)
Ask DBS for the data type (data or mc) of a given
dataset.

Definition at line 2681 of file cmsHarvester.py.

2681  def dbs_resolve_datatype(self, dataset_name):
 2682  """Ask DBS for the data type (data or mc) of a given
2683  dataset.
2684 
2685  """
2686 
2687  # DEBUG DEBUG DEBUG
2688  # If we get here DBS should have been set up already.
2689  assert not self.dbs_api is None
2690  # DEBUG DEBUG DEBUG end
2691 
2692  api = self.dbs_api
2693  dbs_query = "find datatype.type where dataset = %s " \
2694  "and dataset.status = VALID" % \
2695  dataset_name
2696  try:
2697  api_result = api.executeQuery(dbs_query)
2698  except DBSAPI.dbsApiException.DbsApiException:
2699  msg = "ERROR: Could not execute DBS query"
2700  self.logger.fatal(msg)
2701  raise Error(msg)
2702 
2703  handler = DBSXMLHandler(["datatype.type"])
2704  parser = xml.sax.make_parser()
2705  parser.setContentHandler(handler)
2706 
2707  try:
2708  xml.sax.parseString(api_result, handler)
2709  except SAXParseException:
2710  msg = "ERROR: Could not parse DBS server output"
2711  self.logger.fatal(msg)
2712  raise Error(msg)
2713 
2714  # DEBUG DEBUG DEBUG
2715  assert(handler.check_results_validity()), "ERROR The DBSXMLHandler screwed something up!"
2716  # DEBUG DEBUG DEBUG end
2717 
2718  datatype = handler.results.values()[0]
2719 
2720  # DEBUG DEBUG DEBUG
2721  assert len(datatype) == 1
2722  # DEBUG DEBUG DEBUG end
2723 
2724  datatype = datatype[0]
2725 
2726  # End of dbs_resolve_datatype.
2727  return datatype
2728 
def cmsHarvester.dbs_resolve_globaltag (   self,
  dataset_name 
)
Ask DBS for the globaltag corresponding to a given dataset.

# BUG BUG BUG
# This does not seem to work for data datasets? E.g. for
# /Cosmics/Commissioning08_CRAFT0831X_V1_311_ReReco_FromSuperPointing_v1/RAW-RECO
# Probably due to the fact that the GlobalTag changed during
# data-taking...
BUG BUG BUG end

Definition at line 2625 of file cmsHarvester.py.

2625  def dbs_resolve_globaltag(self, dataset_name):
2626  """Ask DBS for the globaltag corresponding to a given dataset.
2627 
2628  # BUG BUG BUG
2629  # This does not seem to work for data datasets? E.g. for
2630  # /Cosmics/Commissioning08_CRAFT0831X_V1_311_ReReco_FromSuperPointing_v1/RAW-RECO
 2631  # Probably due to the fact that the GlobalTag changed during
 2632  # data-taking...
2633  BUG BUG BUG end
2634 
2635  """
2636 
2637  # DEBUG DEBUG DEBUG
2638  # If we get here DBS should have been set up already.
2639  assert not self.dbs_api is None
2640  # DEBUG DEBUG DEBUG end
2641 
2642  api = self.dbs_api
2643  dbs_query = "find dataset.tag where dataset = %s " \
2644  "and dataset.status = VALID" % \
2645  dataset_name
2646  try:
2647  api_result = api.executeQuery(dbs_query)
2648  except DBSAPI.dbsApiException.DbsApiException:
2649  msg = "ERROR: Could not execute DBS query"
2650  self.logger.fatal(msg)
2651  raise Error(msg)
2652 
2653  handler = DBSXMLHandler(["dataset.tag"])
2654  parser = xml.sax.make_parser()
 2655  parser.setContentHandler(handler)
2656 
2657  try:
2658  xml.sax.parseString(api_result, handler)
2659  except SAXParseException:
2660  msg = "ERROR: Could not parse DBS server output"
2661  self.logger.fatal(msg)
2662  raise Error(msg)
2663 
2664  # DEBUG DEBUG DEBUG
2665  assert(handler.check_results_validity()), "ERROR The DBSXMLHandler screwed something up!"
2666  # DEBUG DEBUG DEBUG end
2667 
2668  globaltag = handler.results.values()[0]
2669 
2670  # DEBUG DEBUG DEBUG
2671  assert len(globaltag) == 1
2672  # DEBUG DEBUG DEBUG end
2673 
2674  globaltag = globaltag[0]
2675 
2676  # End of dbs_resolve_globaltag.
2677  return globaltag
2678 
def cmsHarvester.dbs_resolve_number_of_events (   self,
  dataset_name,
  run_number = None 
)
Determine the number of events in a given dataset (and run).

Ask DBS for the number of events in a dataset. If a run number
is specified the number of events returned is that in that run
of that dataset. If problems occur we throw an exception.

# BUG BUG BUG
# Since DBS does not return the number of events correctly,
# neither for runs nor for whole datasets, we have to work
# around that a bit...
# BUG BUG BUG end

Definition at line 2734 of file cmsHarvester.py.

2734  def dbs_resolve_number_of_events(self, dataset_name, run_number=None):
2735  """Determine the number of events in a given dataset (and run).
2736 
2737  Ask DBS for the number of events in a dataset. If a run number
2738  is specified the number of events returned is that in that run
2739  of that dataset. If problems occur we throw an exception.
2740 
2741  # BUG BUG BUG
2742  # Since DBS does not return the number of events correctly,
2743  # neither for runs nor for whole datasets, we have to work
2744  # around that a bit...
2745  # BUG BUG BUG end
2746 
2747  """
2748 
2749  # DEBUG DEBUG DEBUG
2750  # If we get here DBS should have been set up already.
2751  assert not self.dbs_api is None
2752  # DEBUG DEBUG DEBUG end
2753 
2754  api = self.dbs_api
2755  dbs_query = "find file.name, file.numevents where dataset = %s " \
2756  "and dataset.status = VALID" % \
2757  dataset_name
2758  if not run_number is None:
 2759  dbs_query = dbs_query + (" and run = %d" % run_number)
2760  try:
2761  api_result = api.executeQuery(dbs_query)
2762  except DBSAPI.dbsApiException.DbsApiException:
2763  msg = "ERROR: Could not execute DBS query"
2764  self.logger.fatal(msg)
2765  raise Error(msg)
2766 
2767  handler = DBSXMLHandler(["file.name", "file.numevents"])
2768  parser = xml.sax.make_parser()
2769  parser.setContentHandler(handler)
2770 
2771  try:
2772  xml.sax.parseString(api_result, handler)
2773  except SAXParseException:
2774  msg = "ERROR: Could not parse DBS server output"
2775  self.logger.fatal(msg)
2776  raise Error(msg)
2777 
2778  # DEBUG DEBUG DEBUG
2779  assert(handler.check_results_validity()), "ERROR The DBSXMLHandler screwed something up!"
2780  # DEBUG DEBUG DEBUG end
2781 
2782  num_events = sum(handler.results["file.numevents"])
2783 
2784  # End of dbs_resolve_number_of_events.
2785  return num_events
2786 
def cmsHarvester.dbs_resolve_runs (   self,
  dataset_name 
)

(An older helper, dbs_resolve_dataset_number_of_events(), is kept
commented out in the source. Despite its name it is a leftover copy of
the site-counting code: it still issues a `find count(site) where
dataset = ...' query and reads COUNT_STORAGEELEMENT from the result,
so it never actually returned an event count and is not used.)

Ask DBS for the list of runs in a given dataset.

# NOTE: This does not (yet?) skip/remove empty runs. There is
# a bug in the DBS entry run.numevents (i.e. it always returns
# zero) which should be fixed in the `next DBS release'.
# See also:
#   https://savannah.cern.ch/bugs/?53452
#   https://savannah.cern.ch/bugs/?53711

Definition at line 2568 of file cmsHarvester.py.

References createfilelist.int.

2568  def dbs_resolve_runs(self, dataset_name):
2569  """Ask DBS for the list of runs in a given dataset.
2570 
2571  # NOTE: This does not (yet?) skip/remove empty runs. There is
2572  # a bug in the DBS entry run.numevents (i.e. it always returns
2573  # zero) which should be fixed in the `next DBS release'.
2574  # See also:
2575  # https://savannah.cern.ch/bugs/?53452
2576  # https://savannah.cern.ch/bugs/?53711
2577 
2578  """
2579 
2580  # TODO TODO TODO
2581  # We should remove empty runs as soon as the above mentioned
2582  # bug is fixed.
2583  # TODO TODO TODO end
2584 
2585  # DEBUG DEBUG DEBUG
2586  # If we get here DBS should have been set up already.
2587  assert not self.dbs_api is None
2588  # DEBUG DEBUG DEBUG end
2589 
2590  api = self.dbs_api
2591  dbs_query = "find run where dataset = %s " \
2592  "and dataset.status = VALID" % \
2593  dataset_name
2594  try:
2595  api_result = api.executeQuery(dbs_query)
2596  except DBSAPI.dbsApiException.DbsApiException:
2597  msg = "ERROR: Could not execute DBS query"
2598  self.logger.fatal(msg)
2599  raise Error(msg)
2600 
2601  handler = DBSXMLHandler(["run"])
2602  parser = xml.sax.make_parser()
2603  parser.setContentHandler(handler)
2604 
2605  try:
2606  xml.sax.parseString(api_result, handler)
2607  except SAXParseException:
2608  msg = "ERROR: Could not parse DBS server output"
2609  self.logger.fatal(msg)
2610  raise Error(msg)
2611 
2612  # DEBUG DEBUG DEBUG
2613  assert(handler.check_results_validity()), "ERROR The DBSXMLHandler screwed something up!"
2614  # DEBUG DEBUG DEBUG end
2615 
2616  runs = handler.results.values()[0]
2617  # Turn strings into integers.
2618  runs = sorted([int(i) for i in runs])
2619 
2620  # End of dbs_resolve_runs.
2621  return runs
2622 
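A hypothetical call (assuming the DBS API has already been set up and `harvester' is the CMSHarvester instance):

runs = harvester.dbs_resolve_runs("/Cosmics/Commissioning09-v1/RECO")  # hypothetical dataset
# -> e.g. [66615, 66733, 67085]  (a sorted list of integer run numbers)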
def cmsHarvester.escape_dataset_name (   self,
  dataset_name 
)

(The source keeps a commented-out debugging check near this method: it
would call pdb.set_trace() whenever
self.datasets_information[dataset_name]["num_events"][run_number] is
non-zero.)

Escape a DBS dataset name.

Escape a DBS dataset name such that it does not cause trouble
with the file system. This means turning each `/' into `__',
except for the first one which is just removed.

Definition at line 4044 of file cmsHarvester.py.

4044  def escape_dataset_name(self, dataset_name):
4045  """Escape a DBS dataset name.
4046 
4047  Escape a DBS dataset name such that it does not cause trouble
4048  with the file system. This means turning each `/' into `__',
4049  except for the first one which is just removed.
4050 
4051  """
4052 
4053  escaped_dataset_name = dataset_name
4054  escaped_dataset_name = escaped_dataset_name.strip("/")
4055  escaped_dataset_name = escaped_dataset_name.replace("/", "__")
4056 
4057  return escaped_dataset_name
4058 
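For example, the escaping turns a hypothetical dataset name into:

dataset_name = "/Cosmics/Commissioning09-v1/RECO"  # hypothetical
escaped = dataset_name.strip("/").replace("/", "__")
# -> "Cosmics__Commissioning09-v1__RECO"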
def cmsHarvester.load_ref_hist_mappings (   self)
Load the reference histogram mappings from file.

The dataset name to reference histogram name mappings are read
from a text file specified in self.ref_hist_mappings_file_name.

Definition at line 5203 of file cmsHarvester.py.

References FrontierConditions_GlobalTag_cff.file, SiStripPI.max, and digitizers_cfi.strip.

5204  """Load the reference histogram mappings from file.
5205 
5206  The dataset name to reference histogram name mappings are read
5207  from a text file specified in self.ref_hist_mappings_file_name.
5208 
5209  """
5210 
5211  # DEBUG DEBUG DEBUG
5212  assert len(self.ref_hist_mappings) < 1, \
5213  "ERROR Should not be RE-loading " \
5214  "reference histogram mappings!"
5215  # DEBUG DEBUG DEBUG end
5216 
5217  self.logger.info("Loading reference histogram mappings " \
5218  "from file `%s'" % \
5219  self.ref_hist_mappings_file_name)
5220 
5221  mappings_lines = None
5222  try:
5223  mappings_file = file(self.ref_hist_mappings_file_name, "r")
5224  mappings_lines = mappings_file.readlines()
5225  mappings_file.close()
5226  except IOError:
5227  msg = "ERROR: Could not open reference histogram mapping "\
5228  "file `%s'" % self.ref_hist_mappings_file_name
5229  self.logger.fatal(msg)
5230  raise Error(msg)
5231 
5232  ##########
5233 
5234  # The format we expect is: two white-space separated pieces
5235  # per line. The first the dataset name for which the reference
5236  # should be used, the second one the name of the reference
5237  # histogram in the database.
5238 
5239  for mapping in mappings_lines:
5240  # Skip comment lines.
5241  if not mapping.startswith("#"):
5242  mapping = mapping.strip()
5243  if len(mapping) > 0:
5244  mapping_pieces = mapping.split()
5245  if len(mapping_pieces) != 2:
5246  msg = "ERROR: The reference histogram mapping " \
5247  "file contains a line I don't " \
5248  "understand:\n %s" % mapping
5249  self.logger.fatal(msg)
5250  raise Error(msg)
5251  dataset_name = mapping_pieces[0].strip()
5252  ref_hist_name = mapping_pieces[1].strip()
5253  # We don't want people to accidentally specify
5254  # multiple mappings for the same dataset. Just
5255  # don't accept those cases.
5256  if dataset_name in self.ref_hist_mappings:
5257  msg = "ERROR: The reference histogram mapping " \
5258  "file contains multiple mappings for " \
 5259  "dataset `%s'." % dataset_name
5260  self.logger.fatal(msg)
5261  raise Error(msg)
5262 
5263  # All is well that ends well.
5264  self.ref_hist_mappings[dataset_name] = ref_hist_name
5265 
5266  ##########
5267 
5268  self.logger.info(" Successfully loaded %d mapping(s)" % \
5269  len(self.ref_hist_mappings))
5270  max_len = max([len(i) for i in self.ref_hist_mappings.keys()])
5271  for (map_from, map_to) in six.iteritems(self.ref_hist_mappings):
5272  self.logger.info(" %-*s -> %s" % \
5273  (max_len, map_from, map_to))
5274 
5275  # End of load_ref_hist_mappings.
5276 
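A hypothetical mappings file, in the two-column format described above, could look like this (dataset names and tag names are made up; lines starting with `#' are skipped):

# dataset name                                 reference histogram tag
/Cosmics/Commissioning09-v1/RECO               ref_hist_cosmics_v1
/MinimumBias/BeamCommissioning09-v2/RECO       ref_hist_minbias_v2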
def cmsHarvester.option_handler_caf_access (   self,
  option,
  opt_str,
  value,
  parser 
)
Set the self.caf_access flag to try and create jobs that
run on the CAF.

Definition at line 1102 of file cmsHarvester.py.

1102  def option_handler_caf_access(self, option, opt_str, value, parser):
1103  """Set the self.caf_access flag to try and create jobs that
1104  run on the CAF.
1105 
1106  """
1107  self.caf_access = True
1108 
1109  self.logger.warning("Running in `caf_access' mode. " \
1110  "Will try to create jobs that run " \
 1111  "on CAF but no " \
 1112  "further promises...")
1113 
1114  # End of option_handler_caf_access.
1115 
def cmsHarvester.option_handler_castor_dir (   self,
  option,
  opt_str,
  value,
  parser 
)

(The source defines option_handler_dataset_name() and
option_handler_listfile_name() just before this method. Both record
the input method, `dataset' or `listfile', and the corresponding input
name for later use, and raise a Usage exception if a dataset
specification and an input list file are mixed or if more than one of
either is given.)

Specify where on CASTOR the output should go.

At the moment only output to CERN CASTOR is
supported. Eventually the harvested results should go into the
central place for DQM on CASTOR anyway.

Definition at line 1060 of file cmsHarvester.py.

1060  def option_handler_castor_dir(self, option, opt_str, value, parser):
1061  """Specify where on CASTOR the output should go.
1062 
1063  At the moment only output to CERN CASTOR is
1064  supported. Eventually the harvested results should go into the
1065  central place for DQM on CASTOR anyway.
1066 
1067  """
1068 
1069  # Check format of specified CASTOR area.
1070  castor_dir = value
1071  #castor_dir = castor_dir.lstrip(os.path.sep)
1072  castor_prefix = self.castor_prefix
1073 
1074  # Add a leading slash if necessary and clean up the path.
1075  castor_dir = os.path.join(os.path.sep, castor_dir)
1076  self.castor_base_dir = os.path.normpath(castor_dir)
1077 
1078  self.logger.info("CASTOR (base) area to be used: `%s'" % \
1079  self.castor_base_dir)
1080 
1081  # End of option_handler_castor_dir.
1082 
def option_handler_castor_dir(self, option, opt_str, value, parser)
def option_handler_dataset_name(self, option, opt_str, value, parser): """Specify the name(s) of the ...
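As an illustration of the path clean-up performed above, a minimal sketch (the user-supplied directory below is made up):

    import os.path

    # Hypothetical value passed on the command line.
    castor_dir = "castor/cern.ch/user/j/jdoe//harvesting/"

    # Add a leading slash if necessary and clean up the path.
    castor_dir = os.path.join(os.path.sep, castor_dir)
    castor_base_dir = os.path.normpath(castor_dir)

    print(castor_base_dir)
    # Prints: /castor/cern.ch/user/j/jdoe/harvesting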
def cmsHarvester.option_handler_crab_submission (   self,
  option,
  opt_str,
  value,
  parser 
)
By default CRAB jobs are only written, not submitted; this option sets the
self.crab_submission flag so that the created CRAB jobs are also submitted automatically.

Definition at line 1130 of file cmsHarvester.py.

1130  def option_handler_crab_submission(self, option, opt_str, value, parser):
1131  """Crab jobs are not created and
1132  "submitted automatically",
1133  """
1134  self.crab_submission = True
1135 
1136  # End of option_handler_crab_submission.
1137 
def option_handler_crab_submission(self, option, opt_str, value, parser)
def cmsHarvester.option_handler_list_types (   self,
  option,
  opt_str,
  value,
  parser 
)
List all harvesting types and their mappings.

This lists all implemented harvesting types with their
corresponding mappings to sequence names. This had to be
separated out from the help since it depends on the CMSSW
version and was making things a bit of a mess.

NOTE: There is no way (at least not that I could come up with)
to code this in a neat generic way that can be read both by
this method and by setup_harvesting_info(). Please try hard to
keep these two methods in sync!

Definition at line 1152 of file cmsHarvester.py.

1152  def option_handler_list_types(self, option, opt_str, value, parser):
1153  """List all harvesting types and their mappings.
1154 
1155  This lists all implemented harvesting types with their
1156  corresponding mappings to sequence names. This had to be
1157  separated out from the help since it depends on the CMSSW
1158  version and was making things a bit of a mess.
1159 
1160  NOTE: There is no way (at least not that I could come up with)
1161  to code this in a neat generic way that can be read both by
1162  this method and by setup_harvesting_info(). Please try hard to
1163  keep these two methods in sync!
1164 
1165  """
1166 
1167  sep_line = "-" * 50
1168  sep_line_short = "-" * 20
1169 
1170  print sep_line
1171  print "The following harvesting types are available:"
1172  print sep_line
1173 
1174  print "`RelVal' maps to:"
1175  print " pre-3_3_0 : HARVESTING:validationHarvesting"
1176  print " 3_4_0_pre2 and later: HARVESTING:validationHarvesting+dqmHarvesting"
1177  print " Exceptions:"
1178  print " 3_3_0_pre1-4 : HARVESTING:validationHarvesting"
1179  print " 3_3_0_pre6 : HARVESTING:validationHarvesting"
1180  print " 3_4_0_pre1 : HARVESTING:validationHarvesting"
1181 
1182  print sep_line_short
1183 
1184  print "`RelValFS' maps to:"
1185  print " always : HARVESTING:validationHarvestingFS"
1186 
1187  print sep_line_short
1188 
1189  print "`MC' maps to:"
1190  print " always : HARVESTING:validationprodHarvesting"
1191 
1192  print sep_line_short
1193 
1194  print "`DQMOffline' maps to:"
1195  print " always : HARVESTING:dqmHarvesting"
1196 
1197  print sep_line
1198 
1199  # We're done, let's quit. (This is the same thing optparse
1200  # does after printing the help.)
1201  raise SystemExit
1202 
1203  # End of option_handler_list_types.
1204 
def option_handler_list_types(self, option, opt_str, value, parser)
def cmsHarvester.option_handler_no_t1access (   self,
  option,
  opt_str,
  value,
  parser 
)
Set the self.non_t1access flag to try and create jobs that
run without the special `t1access' role.

Definition at line 1085 of file cmsHarvester.py.

1085  def option_handler_no_t1access(self, option, opt_str, value, parser):
1086  """Set the self.no_t1access flag to try and create jobs that
1087  run without special `t1access' role.
1088 
1089  """
1090 
1091  self.non_t1access = True
1092 
1093  self.logger.warning("Running in `non-t1access' mode. " \
1094  "Will try to create jobs that run " \
1095  "without special rights but no " \
1096  "further promises...")
1097 
1098  # End of option_handler_no_t1access.
1099 
def option_handler_no_t1access(self, option, opt_str, value, parser)
def cmsHarvester.option_handler_preferred_site (   self,
  option,
  opt_str,
  value,
  parser 
)

Definition at line 1146 of file cmsHarvester.py.

1146  def option_handler_preferred_site(self, option, opt_str, value, parser):
1147 
1148  self.preferred_site = value
1149 
def option_handler_preferred_site(self, option, opt_str, value, parser)
def cmsHarvester.option_handler_saveByLumiSection (   self,
  option,
  opt_str,
  value,
  parser 
)
Set process.dqmSaver.saveByLumiSection=1 in the harvesting cfg file

Definition at line 1118 of file cmsHarvester.py.

1118  def option_handler_saveByLumiSection(self, option, opt_str, value, parser):
1119  """Set process.dqmSaver.saveByLumiSectiont=1 in cfg harvesting file
1120  """
1121  self.saveByLumiSection = True
1122 
1123  self.logger.warning("waning concerning saveByLumiSection option")
1124 
1125  # End of option_handler_saveByLumiSection.
1126 
1127 
def option_handler_saveByLumiSection(self, option, opt_str, value, parser)
def cmsHarvester.option_handler_sites (   self,
  option,
  opt_str,
  value,
  parser 
)

Definition at line 1140 of file cmsHarvester.py.

1140  def option_handler_sites(self, option, opt_str, value, parser):
1141 
1142  self.nr_max_sites = value
1143 
def option_handler_sites(self, option, opt_str, value, parser)
def cmsHarvester.parse_cmd_line_options (   self)

Definition at line 1869 of file cmsHarvester.py.

1870 
1871  # Set up the command line parser. Note that we fix up the help
1872  # formatter so that we can add some text pointing people to
1873  # the Twiki etc.
1874  parser = optparse.OptionParser(version="%s %s" % \
1875  ("%prog", self.version),
1876  formatter=CMSHarvesterHelpFormatter())
1877 
1878  self.option_parser = parser
1879 
1880  # The debug switch.
1881  parser.add_option("-d", "--debug",
1882  help="Switch on debug mode",
1883  action="callback",
1884  callback=self.option_handler_debug)
1885 
1886  # The quiet switch.
1887  parser.add_option("-q", "--quiet",
1888  help="Be less verbose",
1889  action="callback",
1890  callback=self.option_handler_quiet)
1891 
1892  # The force switch. If this switch is used sanity checks are
1893  # performed but failures do not lead to aborts. Use with care.
1894  parser.add_option("", "--force",
1895  help="Force mode. Do not abort on sanity check "
1896  "failures",
1897  action="callback",
1898  callback=self.option_handler_force)
1899 
1900  # Choose between the different kinds of harvesting.
1901  parser.add_option("", "--harvesting_type",
1902  help="Harvesting type: %s" % \
1903  ", ".join(self.harvesting_types),
1904  action="callback",
1905  callback=self.option_handler_harvesting_type,
1906  type="string",
1907  metavar="HARVESTING_TYPE")
1908 
1909  # Choose between single-step and two-step mode.
1910  parser.add_option("", "--harvesting_mode",
1911  help="Harvesting mode: %s (default = %s)" % \
1912  (", ".join(self.harvesting_modes),
1913  self.harvesting_mode_default),
1914  action="callback",
1915  callback=self.option_handler_harvesting_mode,
1916  type="string",
1917  metavar="HARVESTING_MODE")
1918 
1919  # Override the GlobalTag chosen by the cmsHarvester.
1920  parser.add_option("", "--globaltag",
1921  help="GlobalTag to use. Default is the ones " \
1922  "the dataset was created with for MC, for data" \
1923  "a GlobalTag has to be specified.",
1924  action="callback",
1925  callback=self.option_handler_globaltag,
1926  type="string",
1927  metavar="GLOBALTAG")
1928 
1929  # Allow switching off of reference histograms.
1930  parser.add_option("", "--no-ref-hists",
1931  help="Don't use any reference histograms",
1932  action="callback",
1933  callback=self.option_handler_no_ref_hists)
1934 
1935  # Allow the default (i.e. the one that should be used)
1936  # Frontier connection to be overridden.
1937  parser.add_option("", "--frontier-connection",
1938  help="Use this Frontier connection to find " \
1939  "GlobalTags and LocalTags (for reference " \
1940  "histograms).\nPlease only use this for " \
1941  "testing.",
1942  action="callback",
1943  callback=self.option_handler_frontier_connection,
1944  type="string",
1945  metavar="FRONTIER")
1946 
1947  # Similar to the above but specific to the Frontier connection
1948  # to be used for the GlobalTag.
1949  parser.add_option("", "--frontier-connection-for-globaltag",
1950  help="Use this Frontier connection to find " \
1951  "GlobalTags.\nPlease only use this for " \
1952  "testing.",
1953  action="callback",
1954  callback=self.option_handler_frontier_connection,
1955  type="string",
1956  metavar="FRONTIER")
1957 
1958  # Similar to the above but specific to the Frontier connection
1959  # to be used for the reference histograms.
1960  parser.add_option("", "--frontier-connection-for-refhists",
1961  help="Use this Frontier connection to find " \
1962  "LocalTags (for reference " \
1963  "histograms).\nPlease only use this for " \
1964  "testing.",
1965  action="callback",
1966  callback=self.option_handler_frontier_connection,
1967  type="string",
1968  metavar="FRONTIER")
1969 
1970  # Option to specify the name (or a regexp) of the dataset(s)
1971  # to be used.
1972  parser.add_option("", "--dataset",
1973  help="Name (or regexp) of dataset(s) to process",
1974  action="callback",
1975  #callback=self.option_handler_dataset_name,
1976  callback=self.option_handler_input_spec,
1977  type="string",
1978  #dest="self.input_name",
1979  metavar="DATASET")
1980 
1981  # Option to specify the name (or a regexp) of the dataset(s)
1982  # to be ignored.
1983  parser.add_option("", "--dataset-ignore",
1984  help="Name (or regexp) of dataset(s) to ignore",
1985  action="callback",
1986  callback=self.option_handler_input_spec,
1987  type="string",
1988  metavar="DATASET-IGNORE")
1989 
1990  # Option to specify the name (or a regexp) of the run(s)
1991  # to be used.
1992  parser.add_option("", "--runs",
1993  help="Run number(s) to process",
1994  action="callback",
1995  callback=self.option_handler_input_spec,
1996  type="string",
1997  metavar="RUNS")
1998 
1999  # Option to specify the name (or a regexp) of the run(s)
2000  # to be ignored.
2001  parser.add_option("", "--runs-ignore",
2002  help="Run number(s) to ignore",
2003  action="callback",
2004  callback=self.option_handler_input_spec,
2005  type="string",
2006  metavar="RUNS-IGNORE")
2007 
2008  # Option to specify a file containing a list of dataset names
2009  # (or regexps) to be used.
2010  parser.add_option("", "--datasetfile",
2011  help="File containing list of dataset names " \
2012  "(or regexps) to process",
2013  action="callback",
2014  #callback=self.option_handler_listfile_name,
2015  callback=self.option_handler_input_spec,
2016  type="string",
2017  #dest="self.input_name",
2018  metavar="DATASETFILE")
2019 
2020  # Option to specify a file containing a list of dataset names
2021  # (or regexps) to be ignored.
2022  parser.add_option("", "--datasetfile-ignore",
2023  help="File containing list of dataset names " \
2024  "(or regexps) to ignore",
2025  action="callback",
2026  callback=self.option_handler_input_spec,
2027  type="string",
2028  metavar="DATASETFILE-IGNORE")
2029 
2030  # Option to specify a file containing a list of runs to be
2031  # used.
2032  parser.add_option("", "--runslistfile",
2033  help="File containing list of run numbers " \
2034  "to process",
2035  action="callback",
2036  callback=self.option_handler_input_spec,
2037  type="string",
2038  metavar="RUNSLISTFILE")
2039 
2040  # Option to specify a file containing a list of runs
2041  # to be ignored.
2042  parser.add_option("", "--runslistfile-ignore",
2043  help="File containing list of run numbers " \
2044  "to ignore",
2045  action="callback",
2046  callback=self.option_handler_input_spec,
2047  type="string",
2048  metavar="RUNSLISTFILE-IGNORE")
2049 
2050  # Option to specify a Jsonfile containing a list of runs
def parse_cmd_line_options(self)
Helper class: CMSHarvesterHelpFormatter.
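All options above use optparse callbacks rather than plain store actions. For reference, a minimal self-contained sketch of that pattern (the option name and handler below are made up, not part of cmsHarvester.py):

    import optparse

    def option_handler_example(option, opt_str, value, parser):
        # Callback invoked by optparse when the option is seen.
        print("Saw option %s with value %s" % (opt_str, value))

    parser = optparse.OptionParser()
    parser.add_option("", "--example",
                      help="Example option handled by a callback",
                      action="callback",
                      callback=option_handler_example,
                      type="string",
                      metavar="EXAMPLE")

    (options, args) = parser.parse_args(["--example", "some_value"])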
def cmsHarvester.pick_a_site (   self,
  sites,
  cmssw_version 
)

self.logger.debug("Checking CASTOR path piece `%s'" % \ piece)

self.logger.debug("Checking `%s' against `%s'" % \ (castor_path_pieces[piece_index + check_size], castor_paths_dont_touch[check_size])) self.logger.debug(" skipping") else:

Piece not in the list, fine.

self.logger.debug(" accepting") Add piece to the path we're building. self.logger.debug("!!! Skip path piece `%s'? %s" % \ (piece, str(skip_this_path_piece))) self.logger.debug("Adding piece to path...") self.logger.debug("Path is now `%s'" % \ path)

Definition at line 1705 of file cmsHarvester.py.

1705  def pick_a_site(self, sites, cmssw_version):
1706 
def pick_a_site(self, sites, cmssw_version)
self.logger.debug("Checking CASTOR path piece `%s&#39;" % \ piece)
def cmsHarvester.process_dataset_ignore_list (   self)
Update the list of datasets taking into account the ones to
ignore.

Both lists have been generated before from DBS and both are
assumed to be unique.

NOTE: The advantage of creating the ignore list from DBS (in
case a regexp is given) and matching that instead of directly
matching the ignore criterion against the list of datasets (to
consider) built from DBS is that in the former case we're sure
that all regexps are treated exactly as DBS would have done
without the cmsHarvester.

NOTE: This only removes complete samples. Exclusion of single
runs is done by the book keeping. So the assumption is that a
user never wants to harvest just part (i.e. n out of N runs)
of a sample.

Definition at line 3564 of file cmsHarvester.py.

3565  """Update the list of datasets taking into account the ones to
3566  ignore.
3567 
3568  Both lists have been generated before from DBS and both are
3569  assumed to be unique.
3570 
3571  NOTE: The advantage of creating the ignore list from DBS (in
3572  case a regexp is given) and matching that instead of directly
3573  matching the ignore criterion against the list of datasets (to
3574  consider) built from DBS is that in the former case we're sure
3575  that all regexps are treated exactly as DBS would have done
3576  without the cmsHarvester.
3577 
3578  NOTE: This only removes complete samples. Exclusion of single
3579  runs is done by the book keeping. So the assumption is that a
3580  user never wants to harvest just part (i.e. n out of N runs)
3581  of a sample.
3582 
3583  """
3584 
3585  self.logger.info("Processing list of datasets to ignore...")
3586 
3587  self.logger.debug("Before processing ignore list there are %d " \
3588  "datasets in the list to be processed" % \
3589  len(self.datasets_to_use))
3590 
3591  # Simple approach: just loop and search.
3592  dataset_names_filtered = copy.deepcopy(self.datasets_to_use)
3593  for dataset_name in self.datasets_to_use.keys():
3594  if dataset_name in self.datasets_to_ignore.keys():
3595  del dataset_names_filtered[dataset_name]
3596 
3597  self.logger.info(" --> Removed %d dataset(s)" % \
3598  (len(self.datasets_to_use) -
3599  len(dataset_names_filtered)))
3600 
3601  self.datasets_to_use = dataset_names_filtered
3602 
3603  self.logger.debug("After processing ignore list there are %d " \
3604  "datasets in the list to be processed" % \
3605  len(self.datasets_to_use))
3606 
def process_dataset_ignore_list(self)
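The same filtering can be expressed compactly with a dictionary comprehension; a minimal equivalent sketch (dataset names below are made up):

    datasets_to_use = {"/Sample/A/RECO": [1, 2, 3],
                       "/Sample/B/RECO": [4, 5]}
    datasets_to_ignore = {"/Sample/B/RECO": None}

    # Drop every dataset that also appears in the ignore list.
    datasets_to_use = dict((name, runs)
                           for (name, runs) in datasets_to_use.items()
                           if name not in datasets_to_ignore)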
def cmsHarvester.process_runs_use_and_ignore_lists (   self)

Definition at line 3611 of file cmsHarvester.py.

3612 
3613  self.logger.info("Processing list of runs to use and ignore...")
3614 
3615  # This basically adds all runs in a dataset to be processed,
3616  # except for any runs that are not specified in the `to use'
3617  # list and any runs that are specified in the `to ignore'
3618  # list.
3619 
3620  # NOTE: It is assumed that those lists make sense. The input
3621  # should be checked against e.g. overlapping `use' and
3622  # `ignore' lists.
3623 
3624  runs_to_use = self.runs_to_use
3625  runs_to_ignore = self.runs_to_ignore
3626 
3627  for dataset_name in self.datasets_to_use:
3628  runs_in_dataset = self.datasets_information[dataset_name]["runs"]
3629 
3630  # First some sanity checks.
3631  runs_to_use_tmp = []
3632  for run in runs_to_use:
3633  if not run in runs_in_dataset:
3634  self.logger.warning("Dataset `%s' does not contain " \
3635  "requested run %d " \
3636  "--> ignoring `use' of this run" % \
3637  (dataset_name, run))
3638  else:
3639  runs_to_use_tmp.append(run)
3640 
3641  if len(runs_to_use) > 0:
3642  runs = runs_to_use_tmp
3643  self.logger.info("Using %d out of %d runs " \
3644  "of dataset `%s'" % \
3645  (len(runs), len(runs_in_dataset),
3646  dataset_name))
3647  else:
3648  runs = runs_in_dataset
3649 
3650  if len(runs_to_ignore) > 0:
3651  runs_tmp = []
3652  for run in runs:
3653  if not run in runs_to_ignore:
3654  runs_tmp.append(run)
3655  self.logger.info("Ignoring %d out of %d runs " \
3656  "of dataset `%s'" % \
3657  (len(runs)- len(runs_tmp),
3658  len(runs_in_dataset),
3659  dataset_name))
3660  runs = runs_tmp
3661 
3662  if self.todofile != "YourToDofile.txt":
3663  runs_todo = []
3664  print "Reading runs from file /afs/cern.ch/cms/CAF/CMSCOMM/COMM_DQM/harvesting/%s" %self.todofile
3665  cmd="grep %s /afs/cern.ch/cms/CAF/CMSCOMM/COMM_DQM/harvesting/%s | cut -f5 -d' '" %(dataset_name,self.todofile)
3666  (status, output)=commands.getstatusoutput(cmd)
3667  for run in runs:
3668  run_str="%s" %run
3669  if run_str in output:
3670  runs_todo.append(run)
3671  self.logger.info("Using %d runs " \
3672  "of dataset `%s'" % \
3673  (len(runs_todo),
3674  dataset_name))
3675  runs=runs_todo
3676 
3677  Json_runs = []
3678  if self.Jsonfilename != "YourJSON.txt":
3679  good_runs = []
3680  self.Jsonlumi = True
3681  # We were passed a Jsonfile containing a dictionary of
3682  # run/lumisection pairs
3683  self.logger.info("Reading runs and lumisections from file `%s'" % \
3684  self.Jsonfilename)
3685  try:
3686  Jsonfile = open(self.Jsonfilename, "r")
3687  for names in Jsonfile:
3688  dictNames= eval(str(names))
3689  for key in dictNames:
3690  intkey=int(key)
3691  Json_runs.append(intkey)
3692  Jsonfile.close()
3693  except IOError:
3694  msg = "ERROR: Could not open Jsonfile `%s'" % \
3695  self.Jsonfilename
3696  self.logger.fatal(msg)
3697  raise Error(msg)
3698  for run in runs:
3699  if run in Json_runs:
3700  good_runs.append(run)
3701  self.logger.info("Using %d runs " \
3702  "of dataset `%s'" % \
3703  (len(good_runs),
3704  dataset_name))
3705  runs=good_runs
3706  if (self.Jsonrunfilename != "YourJSON.txt") and (self.Jsonfilename == "YourJSON.txt"):
3707  good_runs = []
3708  # We were passed a Jsonfile containing a dictionary of
3709  # run/lumisection pairs
3710  self.logger.info("Reading runs from file `%s'" % \
3711  self.Jsonrunfilename)
3712  try:
3713  Jsonfile = open(self.Jsonrunfilename, "r")
3714  for names in Jsonfile:
3715  dictNames= eval(str(names))
3716  for key in dictNames:
3717  intkey=int(key)
3718  Json_runs.append(intkey)
3719  Jsonfile.close()
3720  except IOError:
3721  msg = "ERROR: Could not open Jsonfile `%s'" % \
3722  self.Jsonrunfilename
3723  self.logger.fatal(msg)
3724  raise Error(msg)
3725  for run in runs:
3726  if run in Json_runs:
3727  good_runs.append(run)
3728  self.logger.info("Using %d runs " \
3729  "of dataset `%s'" % \
3730  (len(good_runs),
3731  dataset_name))
3732  runs=good_runs
3733 
3734  self.datasets_to_use[dataset_name] = runs
3735 
3736  # End of process_runs_use_and_ignore_lists().
3737 
def process_runs_use_and_ignore_lists(self)
Helper class: Error exception.
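The JSON handling above eval()s each line of the input file. A minimal alternative sketch, assuming the file is a standard DQM-style JSON dictionary mapping run numbers to lumi-section ranges (the file name and run numbers below are made up), using the json module to collect the run numbers instead:

    import json

    with open("Cert_goodruns.json") as json_file:
        run_lumi_map = json.load(json_file)

    # Keys are run numbers (as strings), values are lists of
    # [first_lumi, last_lumi] ranges.
    json_runs = [int(run) for run in run_lumi_map]

    runs = [140059, 140123]  # runs found in DBS for this dataset (made up)
    good_runs = [run for run in runs if run in json_runs]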
def cmsHarvester.ref_hist_mappings_needed (   self,
  dataset_name = None 
)
Check if we need to load and check the reference mappings.

For data the reference histograms should be taken
automatically from the GlobalTag, so we don't need any
mappings. For RelVals we need to know a mapping to be used in
the es_prefer code snippet (different references for each of
the datasets.)

WARNING: This implementation is a bit convoluted.

Definition at line 5169 of file cmsHarvester.py.

5169  def ref_hist_mappings_needed(self, dataset_name=None):
5170  """Check if we need to load and check the reference mappings.
5171 
5172  For data the reference histograms should be taken
5173  automatically from the GlobalTag, so we don't need any
5174  mappings. For RelVals we need to know a mapping to be used in
5175  the es_prefer code snippet (different references for each of
5176  the datasets.)
5177 
5178  WARNING: This implementation is a bit convoluted.
5179 
5180  """
5181 
5182  # If no dataset name given, do everything, otherwise check
5183  # only this one dataset.
5184  if not dataset_name is None:
5185  data_type = self.datasets_information[dataset_name] \
5186  ["datatype"]
5187  mappings_needed = (data_type == "mc")
5188  # DEBUG DEBUG DEBUG
5189  if not mappings_needed:
5190  assert data_type == "data"
5191  # DEBUG DEBUG DEBUG end
5192  else:
5193  tmp = [self.ref_hist_mappings_needed(dataset_name) \
5194  for dataset_name in \
5195  self.datasets_information.keys()]
5196  mappings_needed = (True in tmp)
5197 
5198  # End of ref_hist_mappings_needed.
5199  return mappings_needed
5200 
def ref_hist_mappings_needed(self, dataset_name=None)
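Stripped of the class context, the decision is: a mapping is needed for MC datasets only, and when no dataset name is given the answer is `any dataset is MC'. A minimal sketch (the dictionary below is a made-up stand-in for self.datasets_information):

    datasets_information = {
        "/RelValTTbar/FAKE_CMSSW_3_3_0-v1/GEN-SIM-RECO": {"datatype": "mc"},
        "/MinimumBias/FAKE_Run2009-v1/RECO":             {"datatype": "data"},
        }

    def ref_hist_mappings_needed(dataset_name=None):
        if dataset_name is not None:
            # A single dataset: mappings are needed only for MC.
            return datasets_information[dataset_name]["datatype"] == "mc"
        # All datasets: needed as soon as any one of them is MC.
        return any(info["datatype"] == "mc"
                   for info in datasets_information.values())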
def cmsHarvester.run (   self)

Definition at line 5520 of file cmsHarvester.py.

References str, and update.

5520  def run(self):
5521  "Main entry point of the CMS harvester."
5522 
5523  # Start with a positive thought.
5524  exit_code = 0
5525 
5526  try:
5527 
5528  try:
5529 
5530  # Parse all command line options and arguments
5531  self.parse_cmd_line_options()
5532  # and check that they make sense.
5533  self.check_input_status()
5534 
5535  # Check if CMSSW is setup.
5536  self.check_cmssw()
5537 
5538  # Check if DBS is setup,
5539  self.check_dbs()
5540  # and if all is fine setup the Python side.
5541  self.setup_dbs()
5542 
5543  # Fill our dictionary with all the required info we
5544  # need to understand harvesting jobs. This needs to be
5545  # done after the CMSSW version is known.
5546  self.setup_harvesting_info()
5547 
5548  # Obtain list of dataset names to consider
5549  self.build_dataset_use_list()
5550  # and the list of dataset names to ignore.
5551  self.build_dataset_ignore_list()
5552 
5553  # The same for the runs lists (if specified).
5554  self.build_runs_use_list()
5555  self.build_runs_ignore_list()
5556 
5557  # Process the list of datasets to ignore and fold that
5558  # into the list of datasets to consider.
5559  # NOTE: The run-based selection is done later since
5560  # right now we don't know yet which runs a dataset
5561  # contains.
5562  self.process_dataset_ignore_list()
5563 
5564  # Obtain all required information on the datasets,
5565  # like run numbers and GlobalTags.
5566  self.build_datasets_information()
5567 
5568  if self.use_ref_hists and \
5569  self.ref_hist_mappings_needed():
5570  # Load the dataset name to reference histogram
5571  # name mappings from file.
5572  self.load_ref_hist_mappings()
5573  # Now make sure that for all datasets we want to
5574  # process there is a reference defined. Otherwise
5575  # just bomb out before wasting any more time.
5576  self.check_ref_hist_mappings()
5577  else:
5578  self.logger.info("No need to load reference " \
5579  "histogram mappings file")
5580 
5581  # OBSOLETE OBSOLETE OBSOLETE
def run(self)
def cmsHarvester.setup_dbs (   self)

Now we try to do a very simple DBS search. If that works, instead of
giving us the `Unsupported API call' crap, we should be good to go.
NOTE: Not ideal, I know, but it reduces the amount of complaints I get...

    cmd = "dbs search --query=\"find dataset where dataset = impossible\""
    (status, output) = commands.getstatusoutput(cmd)
    pdb.set_trace()
    if status != 0 or \
       output.lower().find("unsupported api call") > -1:
        self.logger.fatal("It seems DBS is not setup...")
        self.logger.fatal("  %s returns crap:" % cmd)
        for line in output.split("\n"):
            self.logger.fatal("  %s" % line)
        raise Error("ERROR: DBS needs to be setup first!")

Setup the Python side of DBS.

For more information see the DBS Python API documentation:
https://twiki.cern.ch/twiki/bin/view/CMS/DBSApiDocumentation

Definition at line 2392 of file cmsHarvester.py.

2392  def setup_dbs(self):
2393  """Setup the Python side of DBS.
2394 
2395  For more information see the DBS Python API documentation:
2396  https://twiki.cern.ch/twiki/bin/view/CMS/DBSApiDocumentation
2397 
2398  """
2399 
2400  try:
2401  args={}
2402  args["url"]= "http://cmsdbsprod.cern.ch/cms_dbs_prod_global/" \
2403  "servlet/DBSServlet"
2404  api = DbsApi(args)
2405  self.dbs_api = api
2406 
2407  except DBSAPI.dbsApiException.DbsApiException as ex:
2408  self.logger.fatal("Caught DBS API exception %s: %s " % \
2409  (ex.getClassName(), ex.getErrorMessage()))
2410  if ex.getErrorCode() not in (None, ""):
2411  logger.debug("DBS exception error code: ", ex.getErrorCode())
2412  raise
2413 
2414  # End of setup_dbs.
2415 
def setup_dbs(self)
Now we try to do a very simple DBS search.
def cmsHarvester.setup_harvesting_info (   self)
Fill our dictionary with all info needed to understand
harvesting.

This depends on the CMSSW version since at some point the
names and sequences were modified.

NOTE: There is no way (at least not that I could come up with)
to code this in a neat generic way that can be read both by
this method and by option_handler_list_types(). Please try
hard to keep these two methods in sync!

Definition at line 1207 of file cmsHarvester.py.

1208  """Fill our dictionary with all info needed to understand
1209  harvesting.
1210 
1211  This depends on the CMSSW version since at some point the
1212  names and sequences were modified.
1213 
1214  NOTE: There is no way (at least not that I could come up with)
1215  to code this in a neat generic way that can be read both by
1216  this method and by option_handler_list_types(). Please try
1217  hard to keep these two methods in sync!
1218 
1219  """
1220 
1221  assert not self.cmssw_version is None, \
1222  "ERROR setup_harvesting() requires " \
1223  "self.cmssw_version to be set!!!"
1224 
1225  harvesting_info = {}
1226 
1227  # This is the version-independent part.
1228  harvesting_info["DQMOffline"] = {}
1229  harvesting_info["DQMOffline"]["beamspot"] = None
1230  harvesting_info["DQMOffline"]["eventcontent"] = None
1231  harvesting_info["DQMOffline"]["harvesting"] = "AtRunEnd"
1232 
1233  harvesting_info["RelVal"] = {}
1234  harvesting_info["RelVal"]["beamspot"] = None
1235  harvesting_info["RelVal"]["eventcontent"] = None
1236  harvesting_info["RelVal"]["harvesting"] = "AtRunEnd"
1237 
1238  harvesting_info["RelValFS"] = {}
1239  harvesting_info["RelValFS"]["beamspot"] = None
1240  harvesting_info["RelValFS"]["eventcontent"] = None
1241  harvesting_info["RelValFS"]["harvesting"] = "AtRunEnd"
1242 
1243  harvesting_info["MC"] = {}
1244  harvesting_info["MC"]["beamspot"] = None
1245  harvesting_info["MC"]["eventcontent"] = None
1246  harvesting_info["MC"]["harvesting"] = "AtRunEnd"
1247 
1248  # This is the version-dependent part. And I know, strictly
1249  # speaking it's not necessary to fill in all three types since
1250  # in a single run we'll only use one type anyway. This does
1251  # look more readable, however, and required less thought from
1252  # my side when I put this together.
1253 
1254  # DEBUG DEBUG DEBUG
1255  # Check that we understand our own version naming.
1256  assert self.cmssw_version.startswith("CMSSW_")
1257  # DEBUG DEBUG DEBUG end
1258 
1259  version = self.cmssw_version[6:]
1260 
1261  #----------
1262 
1263  # RelVal
1264  step_string = None
1265  if version < "3_3_0":
1266  step_string = "validationHarvesting"
1267  elif version in ["3_3_0_pre1", "3_3_0_pre2",
1268  "3_3_0_pre3", "3_3_0_pre4",
1269  "3_3_0_pre6", "3_4_0_pre1"]:
1270  step_string = "validationHarvesting"
1271  else:
1272  step_string = "validationHarvesting+dqmHarvesting"
1273 
1274  harvesting_info["RelVal"]["step_string"] = step_string
1275 
1276  # DEBUG DEBUG DEBUG
1277  # Let's make sure we found something.
1278  assert not step_string is None, \
1279  "ERROR Could not decide a RelVal harvesting sequence " \
1280  "for CMSSW version %s" % self.cmssw_version
1281  # DEBUG DEBUG DEBUG end
1282 
1283  #----------
1284 
1285  # RelValFS
1286  step_string = "validationHarvestingFS"
1287 
1288  harvesting_info["RelValFS"]["step_string"] = step_string
1289 
1290  #----------
1291 
1292  # MC
1293  step_string = "validationprodHarvesting"
1294 
1295  harvesting_info["MC"]["step_string"] = step_string
1296 
1297  # DEBUG DEBUG DEBUG
1298  # Let's make sure we found something.
1299  assert not step_string is None, \
1300  "ERROR Could not decide a MC harvesting " \
1301  "sequence for CMSSW version %s" % self.cmssw_version
1302  # DEBUG DEBUG DEBUG end
1303 
1304  #----------
1305 
1306  # DQMOffline
1307  step_string = "dqmHarvesting"
1308 
1309  harvesting_info["DQMOffline"]["step_string"] = step_string
1310 
1311  #----------
1312 
1313  self.harvesting_info = harvesting_info
1314 
1315  self.logger.info("Based on the CMSSW version (%s) " \
1316  "I decided to use the `HARVESTING:%s' " \
1317  "sequence for %s harvesting" % \
1318  (self.cmssw_version,
1319  self.harvesting_info[self.harvesting_type]["step_string"],
1320  self.harvesting_type))
1321 
1322  # End of setup_harvesting_info.
1323 
def setup_harvesting_info(self)
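Once self.harvesting_info is filled, the harvesting sequence is a simple lookup keyed on the harvesting type. A minimal sketch of building the `HARVESTING:<sequence>' step string (the literal below just mirrors two of the entries filled above):

    harvesting_info = {
        "RelVal":     {"step_string": "validationHarvesting+dqmHarvesting"},
        "DQMOffline": {"step_string": "dqmHarvesting"},
        }

    harvesting_type = "DQMOffline"
    step = "HARVESTING:%s" % harvesting_info[harvesting_type]["step_string"]
    print(step)
    # Prints: HARVESTING:dqmHarvesting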
def cmsHarvester.show_exit_message (   self)

DEBUG DEBUG DEBUG
This is probably only useful to make sure we don't muck things up, right?
Figure out across how many sites this sample has been spread.

    if num_sites == 1:
        self.logger.info("    sample is contained at a single site")
    else:
        self.logger.info("    sample is spread across %d sites" % \
                         num_sites)
    if num_sites < 1:
        # NOTE: This should not happen with any valid dataset.
        self.logger.warning("    --> skipping dataset which is not " \
                            "hosted anywhere")

DEBUG DEBUG DEBUG end

Tell the user what to do now, after this part is done.

This should provide the user with some (preferably
copy-pasteable) instructions on what to do now with the setups
and files that have been created.

Definition at line 5467 of file cmsHarvester.py.

5468  """Tell the user what to do now, after this part is done.
5469 
5470  This should provide the user with some (preferably
5471  copy-pasteable) instructions on what to do now with the setups
5472  and files that have been created.
5473 
5474  """
5475 
5476  # TODO TODO TODO
5477  # This could be improved a bit.
5478  # TODO TODO TODO end
5479 
5480  sep_line = "-" * 60
5481 
5482  self.logger.info("")
5483  self.logger.info(sep_line)
5484  self.logger.info(" Configuration files have been created.")
5485  self.logger.info(" From here on please follow the usual CRAB instructions.")
5486  self.logger.info(" Quick copy-paste instructions are shown below.")
5487  self.logger.info(sep_line)
5488 
5489  self.logger.info("")
5490  self.logger.info(" Create all CRAB jobs:")
5491  self.logger.info(" multicrab -create")
5492  self.logger.info("")
5493  self.logger.info(" Submit all CRAB jobs:")
5494  self.logger.info(" multicrab -submit")
5495  self.logger.info("")
5496  self.logger.info(" Check CRAB status:")
5497  self.logger.info(" multicrab -status")
5498  self.logger.info("")
5499 
5500  self.logger.info("")
5501  self.logger.info(" For more information please see the CMS Twiki:")
5502  self.logger.info(" %s" % twiki_url)
5503  self.logger.info(sep_line)
5504 
5505  # If there were any jobs for which we could not find a
5506  # matching site show a warning message about that.
5507  if not self.all_sites_found:
5508  self.logger.warning(" For some of the jobs no matching " \
5509  "site could be found")
5510  self.logger.warning(" --> please scan your multicrab.cfg" \
5511  "for occurrences of `%s'." % \
5512  self.no_matching_site_found_str)
5513  self.logger.warning(" You will have to fix those " \
5514  "by hand, sorry.")
5515 
5516  # End of show_exit_message.
5517 
def show_exit_message(self)
DEBUG DEBUG DEBUG This is probably only useful to make sure we don't muck things up, right? Figure out across how many sites this sample has been spread.
def cmsHarvester.singlify_datasets (   self)
Remove all but the largest part of all datasets.

This allows us to harvest at least part of these datasets
using single-step harvesting until the two-step approach
works.

Definition at line 3740 of file cmsHarvester.py.

References mps_monitormerge.items, SiStripPI.max, and MuonErrorMatrixValues_cff.values.

3741  """Remove all but the largest part of all datasets.
3742 
3743  This allows us to harvest at least part of these datasets
3744  using single-step harvesting until the two-step approach
3745  works.
3746 
3747  """
3748 
3749  # DEBUG DEBUG DEBUG
3750  assert self.harvesting_mode == "single-step-allow-partial"
3751  # DEBUG DEBUG DEBUG end
3752 
3753  for dataset_name in self.datasets_to_use:
3754  for run_number in self.datasets_information[dataset_name]["runs"]:
3755  max_events = max(self.datasets_information[dataset_name]["sites"][run_number].values())
3756  sites_with_max_events = [i[0] for i in self.datasets_information[dataset_name]["sites"][run_number].items() if i[1] == max_events]
3757  self.logger.warning("Singlifying dataset `%s', " \
3758  "run %d" % \
3759  (dataset_name, run_number))
3760  cmssw_version = self.datasets_information[dataset_name] \
3761  ["cmssw_version"]
3762  selected_site = self.pick_a_site(sites_with_max_events,
3763  cmssw_version)
3764 
3765  # Let's tell the user that we're manhandling this dataset.
3766  nevents_old = self.datasets_information[dataset_name]["num_events"][run_number]
3767  self.logger.warning(" --> " \
3768  "only harvesting partial statistics: " \
3769  "%d out of %d events (5.1%f%%) " \
3770  "at site `%s'" % \
3771  (max_events,
3772  nevents_old,
3773  100. * max_events / nevents_old,
3774  selected_site))
3775  self.logger.warning("!!! Please note that the number of " \
3776  "events in the output path name will " \
3777  "NOT reflect the actual statistics in " \
3778  "the harvested results !!!")
3779 
3780  # We found the site with the highest statistics and
3781  # the corresponding number of events. (CRAB gets upset
3782  # if we ask for more events than there are at a given
3783  # site.) Now update this information in our main
3784  # datasets_information variable.
3785  self.datasets_information[dataset_name]["sites"][run_number] = {selected_site: max_events}
3786  self.datasets_information[dataset_name]["num_events"][run_number] = max_events
3787  #self.datasets_information[dataset_name]["sites"][run_number] = [selected_site]
3788 
3789  # End of singlify_datasets.
3790 
def singlify_datasets(self)
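Per run, the singlification boils down to keeping only the site hosting the largest number of events. A minimal standalone sketch of that selection (site names and event counts below are made up):

    # Events available per site for one run of one dataset.
    sites = {"T2_CH_CERN": 120000,
             "T1_US_FNAL": 95000,
             "T2_DE_DESY": 120000}

    max_events = max(sites.values())
    sites_with_max_events = [site for (site, nevents) in sites.items()
                             if nevents == max_events]
    # cmsHarvester would now call pick_a_site() on these candidates;
    # here we simply keep the first one.
    selected_site = sites_with_max_events[0]
    print("%s: %d events" % (selected_site, max_events))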
def cmsHarvester.write_crab_config (   self)

def create_harvesting_config(self, dataset_name): """Create the Python harvesting configuration for a given job.

NOTE: The reason to have a single harvesting configuration per sample is to be able to specify the GlobalTag corresponding to each sample. Since it has been decided that (apart from the prompt reco) datasets cannot contain runs with different GlobalTags, we don't need a harvesting config per run. NOTE: This is the place where we distinguish between single-step and two-step harvesting modes (at least for the Python job configuration). """ ### if self.harvesting_mode == "single-step": config_contents = self.create_harvesting_config_single_step(dataset_name) elif self.harvesting_mode == "two-step": config_contents = self.create_harvesting_config_two_step(dataset_name) else:

Impossible harvesting mode, we should never get here.

assert False, "ERROR: unknown harvesting mode `%s'" % \ self.harvesting_mode ### # End of create_harvesting_config. return config_contents

Write a CRAB job configuration Python file.

Definition at line 5045 of file cmsHarvester.py.

References FrontierConditions_GlobalTag_cff.file.

5046  """Write a CRAB job configuration Python file.
5047 
5048  """
5049 
5050  self.logger.info("Writing CRAB configuration...")
5051 
5052  file_name_base = "crab.cfg"
5053 
5054  # Create CRAB configuration.
5055  crab_contents = self.create_crab_config()
5056 
5057  # Write configuration to file.
5058  crab_file_name = file_name_base
5059  try:
5060  crab_file = file(crab_file_name, "w")
5061  crab_file.write(crab_contents)
5062  crab_file.close()
5063  except IOError:
5064  self.logger.fatal("Could not write " \
5065  "CRAB configuration to file `%s'" % \
5066  crab_file_name)
5067  raise Error("ERROR: Could not write to file `%s'!" % \
5068  crab_file_name)
5069 
5070  # End of write_crab_config.
5071 
def write_crab_config(self)
def create_harvesting_config(self, dataset_name): """Create the Python harvesting configuration for a...
Helper class: Error exception.
def cmsHarvester.write_harvesting_config (   self,
  dataset_name 
)
Write a harvesting job configuration Python file.

NOTE: This knows nothing about single-step or two-step
harvesting. That's all taken care of by
create_harvesting_config.

Definition at line 5103 of file cmsHarvester.py.

References create_harvesting_config_file_name(), and FrontierConditions_GlobalTag_cff.file.

5103  def write_harvesting_config(self, dataset_name):
5104  """Write a harvesting job configuration Python file.
5105 
5106  NOTE: This knows nothing about single-step or two-step
5107  harvesting. That's all taken care of by
5108  create_harvesting_config.
5109 
5110  """
5111 
5112  self.logger.debug("Writing harvesting configuration for `%s'..." % \
5113  dataset_name)
5114 
5115  # Create Python configuration.
5116  config_contents = self.create_harvesting_config(dataset_name)
5117 
5118  # Write configuration to file.
5119  config_file_name = self. \
5120  create_harvesting_config_file_name(dataset_name)
5121  try:
5122  config_file = file(config_file_name, "w")
5123  config_file.write(config_contents)
5124  config_file.close()
5125  except IOError:
5126  self.logger.fatal("Could not write " \
5127  "harvesting configuration to file `%s'" % \
5128  config_file_name)
5129  raise Error("ERROR: Could not write to file `%s'!" % \
5130  config_file_name)
5131 
5132  # End of write_harvesting_config.
5133 
def write_harvesting_config(self, dataset_name)
Helper class: Error exception.
def create_harvesting_config_file_name(self, dataset_name)
Only add the alarming piece to the file name if this is a spread-out dataset.
def cmsHarvester.write_me_extraction_config (   self,
  dataset_name 
)
Write an ME-extraction configuration Python file.

This `ME-extraction' (ME = Monitoring Element) is the first
step of the two-step harvesting.

Definition at line 5136 of file cmsHarvester.py.

References create_me_summary_config_file_name(), and FrontierConditions_GlobalTag_cff.file.

5136  def write_me_extraction_config(self, dataset_name):
5137  """Write an ME-extraction configuration Python file.
5138 
5139  This `ME-extraction' (ME = Monitoring Element) is the first
5140  step of the two-step harvesting.
5141 
5142  """
5143 
5144  self.logger.debug("Writing ME-extraction configuration for `%s'..." % \
5145  dataset_name)
5146 
5147  # Create Python configuration.
5148  config_contents = self.create_me_extraction_config(dataset_name)
5149 
5150  # Write configuration to file.
5151  config_file_name = self. \
5152  create_me_summary_config_file_name(dataset_name)
5153  try:
5154  config_file = file(config_file_name, "w")
5155  config_file.write(config_contents)
5156  config_file.close()
5157  except IOError:
5158  self.logger.fatal("Could not write " \
5159  "ME-extraction configuration to file `%s'" % \
5160  config_file_name)
5161  raise Error("ERROR: Could not write to file `%s'!" % \
5162  config_file_name)
5163 
5164  # End of write_me_extraction_config.
5165 
def write_me_extraction_config(self, dataset_name)
Helper class: Error exception.
def create_me_summary_config_file_name(self, dataset_name)
def cmsHarvester.write_multicrab_config (   self)
Write a multi-CRAB job configuration Python file.

Definition at line 5074 of file cmsHarvester.py.

References FrontierConditions_GlobalTag_cff.file.

5075  """Write a multi-CRAB job configuration Python file.
5076 
5077  """
5078 
5079  self.logger.info("Writing multi-CRAB configuration...")
5080 
5081  file_name_base = "multicrab.cfg"
5082 
5083  # Create multi-CRAB configuration.
5084  multicrab_contents = self.create_multicrab_config()
5085 
5086  # Write configuration to file.
5087  multicrab_file_name = file_name_base
5088  try:
5089  multicrab_file = file(multicrab_file_name, "w")
5090  multicrab_file.write(multicrab_contents)
5091  multicrab_file.close()
5092  except IOError:
5093  self.logger.fatal("Could not write " \
5094  "multi-CRAB configuration to file `%s'" % \
5095  multicrab_file_name)
5096  raise Error("ERROR: Could not write to file `%s'!" % \
5097  multicrab_file_name)
5098 
5099  # End of write_multicrab_config.
5100 
def write_multicrab_config(self)
Helper class: Error exception.
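All three write_*_config() methods above use the Python 2 file() builtin. A minimal equivalent sketch using a context manager (the file name and contents below are placeholders); this form also works on Python 3, where file() no longer exists:

    config_file_name = "multicrab.cfg"
    config_contents = "# generated configuration goes here\n"

    try:
        with open(config_file_name, "w") as config_file:
            config_file.write(config_contents)
    except IOError:
        raise RuntimeError("ERROR: Could not write to file `%s'!" % \
                           config_file_name)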

Variable Documentation

cmsHarvester.caf_access

Definition at line 1107 of file cmsHarvester.py.

cmsHarvester.castor_base_dir

Definition at line 1076 of file cmsHarvester.py.

cmsHarvester.cmssw_version

Definition at line 2346 of file cmsHarvester.py.

cmsHarvester.crab_submission

Definition at line 1134 of file cmsHarvester.py.

cmsHarvester.datasets_information

Definition at line 5339 of file cmsHarvester.py.

cmsHarvester.datasets_to_ignore

Definition at line 3456 of file cmsHarvester.py.

cmsHarvester.datasets_to_use

Definition at line 3430 of file cmsHarvester.py.

cmsHarvester.dbs_api

Definition at line 2405 of file cmsHarvester.py.

cmsHarvester.globaltag

Definition at line 2306 of file cmsHarvester.py.

cmsHarvester.harvesting_info

Definition at line 1313 of file cmsHarvester.py.

cmsHarvester.harvesting_mode

Definition at line 2215 of file cmsHarvester.py.

cmsHarvester.harvesting_type

Definition at line 3857 of file cmsHarvester.py.

cmsHarvester.Jsonfilename

Definition at line 3706 of file cmsHarvester.py.

cmsHarvester.Jsonlumi

Definition at line 3680 of file cmsHarvester.py.

cmsHarvester.non_t1access

Definition at line 1091 of file cmsHarvester.py.

cmsHarvester.nr_max_sites

Definition at line 1142 of file cmsHarvester.py.

cmsHarvester.option_parser

Definition at line 1878 of file cmsHarvester.py.

cmsHarvester.preferred_site

Definition at line 1148 of file cmsHarvester.py.

cmsHarvester.ref_hist_mappings_file_name

Definition at line 2257 of file cmsHarvester.py.

cmsHarvester.runs_to_ignore

Definition at line 3553 of file cmsHarvester.py.

cmsHarvester.runs_to_use

Definition at line 3529 of file cmsHarvester.py.

cmsHarvester.saveByLumiSection

Definition at line 1121 of file cmsHarvester.py.