How to find a SAM_PROJECT name for a job:
$ vi /work/landshark-clued0/weigang/Winter2010/check_data.log
"Reading /prj_root/2671/top_write/weigang/SingleTop2010Summer2009Extended/RunIIa/data/p17_CC_Data_PreTag_0_20091231120843-2953530.d0cabsrv1.fnal.gov/cafe.out
Number of processed events: 38749635
Reading /prj_root/2671/top_write/weigang/SingleTop2010Summer2009Extended/RunIIa/data/p17_CC_Data_PreTag_0_20091231120843-2953564.d0cabsrv1.fnal.gov/cafe.out
Number of processed events: 39353873
Processed[expected] events: 39353873[39381649] BAD"
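Scanning the check log by hand works, but the BAD groups can also be pulled out with a small filter. This is only a sketch: the function name `bad_job_ids` is made up here, and it assumes the log always looks like the excerpt above, with each group of "Reading ..." lines closed by one "Processed[expected] events:" line.

```shell
# Print the job ID of the last job read in every group flagged BAD.
# Assumption: the log format matches the excerpt above.
bad_job_ids() {
    # $1: path to the check log
    awk '
        /^"?Reading / { last = $2 }          # remember the latest cafe.out path
        /^Processed\[expected\] events:/ {
            if ($0 ~ /BAD/ && last != "") print last
            last = ""                        # start a new group
        }
    ' "$1" | sed -E 's#.*-([0-9]+)\.[^/]*/cafe\.out$#\1#'
}
```

Called as `bad_job_ids check_data.log`, it prints one numeric job ID per BAD group; any job of that group would do for looking up the project.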
Look for any job that belongs to a BAD project (here job 2953564), find its batch log, and read the SAM_PROJECT line in it:
$ find /d0mino/weigang -name "*2953564*"
$ vi /d0mino/weigang/p17_CC_Data_PreTag_0.o2953564
"SAM_PROJECT = weigang_13105_20091231120843"
Liang showed an easier way:
$ cat /d0mino/weigang/*.o3447239 | grep SAM_PROJECT
SAM_PROJECT = weigang_4789_20100227000742
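If only the project name is wanted (for example to paste into the next sam command), the grep can be wrapped so that it prints just the value. A minimal sketch, assuming the batch logs contain a line of the form "SAM_PROJECT = <name>"; the function name `sam_project_of_job` is hypothetical.

```shell
# Print the SAM_PROJECT value found in the batch log(s) of one job.
# Assumption: the log has a line like "SAM_PROJECT = <name>".
sam_project_of_job() {
    # $1: directory holding the *.o<jobid> logs
    # $2: numeric job ID
    cat "$1"/*.o"$2" \
        | sed -n 's/^.*SAM_PROJECT *= *\([^ "]*\).*$/\1/p' \
        | sort -u
}
```

For the example above this would be called as `sam_project_of_job /d0mino/weigang 3447239`.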
($ setup sam)
$ sam generate strict recovery project --project=weigang_13105_20091231120843 --printQuery
(snapshot_id 2600391 minus (consumer_id 4024850 and consumed_status consumed)) or (consumer_id 4024850 and cf_pid 36006704)
$ sam translate constraints --dim="(snapshot_id 2600391 minus (consumer_id 4024850 and consumed_status consumed)) or (consumer_id 4024850 and cf_pid 36006704)" | grep "Total Event Count"
Check whether this number + the number of processed events = the expected total number of events.
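The check itself is plain integer arithmetic. A small sketch follows; the function name `events_add_up` is made up, the processed and expected counts are the ones from the log excerpt, and the recovery count 27776 is simply their difference, used for illustration rather than a value reported by SAM.

```shell
# Succeed when processed + recoverable events equals the expected total.
events_add_up() {
    # $1: processed events (from the check log)
    # $2: "Total Event Count" of the recovery query
    # $3: expected events (from the check log)
    [ $(( $1 + $2 )) -eq "$3" ]
}

# 27776 is an illustrative value, not taken from SAM output.
if events_add_up 39353873 27776 39381649; then
    echo "counts match: create the recovery definition"
else
    echo "counts differ: inspect the project summary"
fi
```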
If yes, create a recovery definition:
$ sam create definition --defname="singletop_recovery_p20mu_201002261404" --dim="(snapshot_id 2136982 minus (consumer_id 4125778 and consumed_status consumed)) or (consumer_id 4125778 and cf_pid 36666193,36666198,36666273,36666274,36666283,36666307)"
DatasetDefinition saved with definitionId = 1948007
Find the BAD definition and print its submission command with SingleTopDriver.sh:
$ top_cafe/configs/singletop/SingleTopDriver.sh -r p20 -a MU -o . WPlusJets | grep CSG_alpgenpythia_w+0lp_lnu+0lp_excl_p211100_v3
runcafe -no-outdir-check -jobs=70 -outdir=. -name=p20_MU_w+0lp_lnu+0lp_excl_PreTag -def=CSG_alpgenpythia_w+0lp_lnu+0lp_excl_p211100_v3 -tar=./jobtarball_26-Feb-2010.tar.gz -- top_cafe/configs/singletop/mujets_p21/MC_SingleTopMuJets_Signal_RunIIb_recaf.config VJets.Analysis: MU VJets.TempTree: TMBTree +USER.Run: Group\(VJetsWPtReWeight\) VJets.WMCType: AlpgenToNLO +USER.Run: TopHistos\(histos_pretag\) histos_pretag.DoMC: true cafe.Output: p20_MU_w+0lp_lnu+0lp_excl_Topo.root mcweight.CSGAlpgenNLONorm: true VJets.IsHeavyFlavorSkimmed: true VJets.SAMDefName: CSG_alpgenpythia_w+0lp_lnu+0lp_excl_p211100_v3 csg_sample.fwk_pXX: p211100 csg_sample.version: v3
Change the def name to the recovery def name you just created, reduce the number of jobs to 5 (or as needed), and run it:
$ runcafe -cabsrv1 -no-outdir-check -jobs=5 -outdir=. -name=p20_MU_w+0lp_lnu+0lp_excl_PreTag -def=singletop_recovery_p20mu_201002261404 -tar=./jobtarball_26-Feb-2010.tar.gz -- top_cafe/configs/singletop/mujets_p21/MC_SingleTopMuJets_Signal_RunIIb_recaf.config VJets.Analysis: MU VJets.TempTree: TMBTree +USER.Run: Group\(VJetsWPtReWeight\) VJets.WMCType: AlpgenToNLO +USER.Run: TopHistos\(histos_pretag\) histos_pretag.DoMC: true cafe.Output: p20_MU_w+0lp_lnu+0lp_excl_Topo.root mcweight.CSGAlpgenNLONorm: true VJets.IsHeavyFlavorSkimmed: true VJets.SAMDefName: CSG_alpgenpythia_w+0lp_lnu+0lp_excl_p211100_v3 csg_sample.fwk_pXX: p211100 csg_sample.version: v3
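Editing the long runcafe line by hand is error-prone, so the two substitutions can be done with sed instead. This is a sketch under assumptions: `to_recovery_cmd` is a made-up name, the command is expected on stdin as a single line, and the new definition name must not contain `/` or `&`. It deliberately leaves VJets.SAMDefName untouched, as in the command above.

```shell
# Rewrite the -def= and -jobs= options of a runcafe command read from stdin.
to_recovery_cmd() {
    # $1: recovery definition name (no "/" or "&" allowed)
    # $2: new number of jobs
    sed -e "s/ -def=[^ ]*/ -def=$1/" \
        -e "s/ -jobs=[0-9]*/ -jobs=$2/"
}
```

The SingleTopDriver.sh output can be piped through it, e.g. `... | grep CSG_alpgenpythia_w+0lp_lnu+0lp_excl_p211100_v3 | to_recovery_cmd singletop_recovery_p20mu_201002261404 5`, and the result inspected before running it.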
If not, get the summary of the project:
$ sam get project summary --project=weigang_13105_20091231120843 >& summary.log
$ vi summary.log
search:
/:\ delivered
" Consumer process ID: 36006704
Process description: 2953566.d0cabsrv1.fnal.gov
Process status : completed
Number of files consumed : 6
Number of files delivered : 1
Number of files failed : 0
Number of files skipped : 0
Number of files unknownStatus : 0
Last consumed file : vjets-recaffed-ejets-fall2008_CAF-CSGv1-CSskim-EMinclusive-20060209-003700-2023451_p17.09.03_130302_p18.13.00.root
Last consumed file status : delivered"
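Instead of searching in vi, the processes whose last file was delivered but never marked consumed can be listed in one pass. A rough sketch, assuming the summary has the same "label : value" layout as the excerpt above; `delivered_not_consumed` is a hypothetical name.

```shell
# List "<consumer process ID> <process description>" for every process
# whose last consumed file status is still "delivered".
# Assumption: summary.log follows the layout of the excerpt above.
delivered_not_consumed() {
    # $1: path to the project summary
    awk -F': *' '
        /Consumer process ID/       { pid = $2 }
        /Process description/       { desc = $2 }
        /Last consumed file status/ && $2 ~ /^delivered/ { print pid, desc }
    ' "$1"
}
```

The process description starts with the batch job ID, which is what the next step uses to find the job output directory.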
go to the job output directory:
$ cd /prj_root/2671/top_write/weigang/SingleTop2010Summer2009Extended/RunIIa/data/
look for that job output:
$ ls -d *2953566*
$ cd p17_CC_Data_PreTag_0_20091231120843-2953566.d0cabsrv1.fnal.gov/
check the number of events of that job:
$ vi cafe.out
SAM considers these events not consumed (missing), while CAFE counts them as processed, so:
==> # of events processed by CAFE - # of events of this job + # of events missing according to SAM = total expected # of events
where the total expected # of events is the event count of the snapshot (snapshot_id).
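The bookkeeping above can be written as a one-line calculation. In this sketch `expected_events` is a made-up name; 39353873 is the processed count from the log excerpt, while the per-job count 500000 and the SAM missing count 527776 are hypothetical values chosen only so the arithmetic can be followed.

```shell
# processed by CAFE - events of the suspect job + events missing in SAM
expected_events() {
    # $1: total events processed by CAFE
    # $2: events processed by the suspect job (from its cafe.out)
    # $3: events SAM reports as not consumed
    echo $(( $1 - $2 + $3 ))
}

# 500000 and 527776 are hypothetical, for illustration only.
expected_events 39353873 500000 527776   # prints 39381649
```

If the printed number equals the snapshot's event count, the mismatch is fully explained by that one job.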
****** Open question: then remove this job's output and run the recovery script?
If # processed events > # expected events, just rerun the definition:
--> find the bad definition in check_output.log
--> delete all outputs related to that definition, using the pattern in check_output.log:
The matching outputs are easy to spot by eye, but you can double-check by comparing the ls count for the pattern with the grep count for the same pattern in the check log, e.g.
$ ls p20_MU_w+w_* | wc
$ grep p20_MU_w+w_ check_diboson.log | wc
The grep count should be the ls count + 1, because the log contains one extra line that states the pattern.
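The same comparison can be scripted so it returns a yes/no answer before anything is deleted. A minimal sketch with a made-up function name `counts_consistent`; it uses `ls -d` so that each output directory counts as one entry, and a fixed-string grep so that the `+` in the pattern is taken literally. Both choices are assumptions about how the outputs and the log are laid out.

```shell
# Succeed when the check log mentions the prefix exactly once more
# than there are output entries starting with that prefix.
counts_consistent() {
    # $1: directory with the job outputs
    # $2: name prefix, e.g. p20_MU_w+w_
    # $3: path to the check log
    n_ls=$(ls -d "$1"/"$2"* 2>/dev/null | wc -l | tr -d ' ')
    n_grep=$(grep -c -F -- "$2" "$3")
    [ "$n_grep" -eq $(( n_ls + 1 )) ]
}
```

For the example above: `counts_consistent . p20_MU_w+w_ check_diboson.log && echo "safe to delete"`.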
--> create the submission command with SingleTopDriver.sh:
$SRT_LOCAL/top_cafe/configs/singletop/SingleTopDriver.sh -r p20 -a MU -o . -s cabsrv1 Diboson | grep CSG_pythia_w+w_incl_p211100_v3
runcafe -no-outdir-check -jobs=6 -outdir=. -name=p20_MU_w+w_incl_PreTag -def=CSG_pythia_w+w_incl_p211100_v3 -tar=./jobtarball_26-Feb-2010.tar.gz -- top_cafe/configs/singletop/mujets_p21/MC_SingleTopMuJets_Signal_RunIIb_recaf.config VJets.Analysis: MU VJets.TempTree: TMBTree +USER.Run: TopHistos\(histos_pretag\) histos_pretag.DoMC: true cafe.Output: p20_MU_w+w_incl_Topo.root mcweight.CSGAlpgenNLONorm: true VJets.SAMDefName: CSG_pythia_w+w_incl_p211100_v3 csg_sample.fwk_pXX: p211100 csg_sample.version: v
https://gengwg.blogspot.com/