/* coffee-script example usage - at https://github.com/johan/dotjs/commits/johan

    given path_re: ['^/([^/]+)/([^/]+)(/?.*)', 'user', 'repo', 'rest']
          query: true
          dom:
            keyboard: 'css .keyboard-shortcuts'
            branches: 'css+ .js-filter-branches h4 a'
            dates: 'css* .commit-group-heading'
            tracker: 'css? #gauges-tracker[defer]'
            johan_ci: 'xpath* //li[contains(@class,"commit")][.//a[.="johan"]]'
          ready: (path, query, dom) ->

...would make something like this call, as the path regexp matched, and there
were DOM matches for the two mandatory "keyboard" and "branches" selectors:

    ready( { user: 'johan', repo: 'dotjs', rest: '/commits/johan' }
         , {} // would contain all query args (if any were present)
         , { keyboard: Node<a href="#keyboard_shortcuts_pane">
           , branches: [ Node<a href="/johan/dotjs/commits/coffee">
                       , Node<a href="/johan/dotjs/commits/dirs">
                       , Node<a href="/johan/dotjs/commits/gh-pages">
                       , Node<a href="/johan/dotjs/commits/johan">
                       , Node<a href="/johan/dotjs/commits/jquery-1.8.2">
                       , Node<a href="/johan/dotjs/commits/master">
                       ]
           , dates: [ Node<h3 class="commit-group-heading">Oct 07, 2012</h3>
                    , Node<h3 class="commit-group-heading">Aug 29, 2012</h3>
                    , ...
                    ]
           , tracker: null
           , johan_ci: [ Node<li class="commit">, ... ]
           }
         )
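
As a rough illustration of how the path_re names line up with the regexp's
capture groups, here is a minimal standalone sketch. The helper name
namePathGroups is hypothetical (it is not part of this library), and the
pathname is passed in explicitly instead of being read from location.pathname:

```javascript
// Hypothetical helper, for illustration only: pair each capture group of a
// path regexp with the name listed after it in the path_re array.
function namePathGroups(pathRe, pathname) {
  var names = pathRe.slice(1)             // e.g. ['user', 'repo', 'rest']
    , match = new RegExp(pathRe[0]).exec(pathname)
    , named = {}
    ;
  if (match === null) return undefined;   // no match: the callback would not run
  names.forEach(function(name, i) {
    named[name] = match[i + 1];           // group 0 is the whole match; skip it
  });
  return named;
}

var path = namePathGroups(
  ['^/([^/]+)/([^/]+)(/?.*)', 'user', 'repo', 'rest'],
  '/johan/dotjs/commits/johan'
);
console.log(path);
```

Passing the pathname as an argument keeps the sketch runnable outside a
browser; the path object it logs has the same three properties as the first
argument in the ready() call shown above.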

A selector returns an array of matches if it is prefixed "css*" or "css+"
(ditto xpath), and a single result if it is prefixed "css" or "css?".

If your script should only run on pages with a particular DOM node (or set of
nodes), use the 'css' or 'css+' (ditto xpath) forms - and your callback won't
get fired on pages that lack them. The 'css?' and 'css*' forms would run your
callback but pass null or [] respectively, on not finding such nodes. You may
recognize the semantics of x, x?, x* and x+ from regular expressions.

(see http://goo.gl/ejtMD for a more thorough discussion of something similar)
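
The four prefixes can be sketched without a DOM. In this minimal illustration,
findAll and findOne are stand-ins for a selector engine working over a plain
array, and FAIL is a local sentinel; all three are assumptions made for the
example, not this library's internals:

```javascript
// Illustrative sketch: regexp-style quantifiers applied to a stand-in finder.
var FAIL = { failed: true };                 // sentinel: "do not run the callback"
var page = ['a.keyboard', 'h4.branch', 'h4.branch'];

function findAll(sel) { return page.filter(function(n) { return n === sel; }); }
function findOne(sel) { var all = findAll(sel); return all.length ? all[0] : null; }

var select =
  { 'x':  function(sel) { var n = findOne(sel); return n !== null ? n : FAIL; }
  , 'x?': findOne                            // single match or null
  , 'x*': findAll                            // zero or more: array, maybe empty
  , 'x+': function(sel) { var a = findAll(sel); return a.length ? a : FAIL; }
  };

console.log(select['x']('a.keyboard'));          // mandatory single match
console.log(select['x?']('#tracker'));           // optional, absent: null
console.log(select['x*']('#tracker'));           // optional list, absent: []
console.log(select['x+']('#tracker') === FAIL);  // mandatory list, absent: true
```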

The dom property is recursively defined so you can make nested structures.
If you want a property that itself is an object full of matched things, pass
an object of sub-dom-spec:s, instead of a string selector:

    given dom:
            meta:
              base: 'xpath? /head/base'
              title: 'xpath string(/head/title)'
            commits: 'css* li.commit'
          ready: (dom) ->

You can also deconstruct repeated templated sections of a page into subarrays
scraped as per your specs, by picking a context node for a dom spec. This is
done by passing a two-element array: a selector resolving what node/nodes you
look at and a dom spec describing how you want it/them deconstructed for you:

    given dom:
            meta:
              [ 'xpath /head',
                base: 'xpath? base'
                title: 'xpath string(title)'
              ]
            commits:
              [ 'css* li.commit',
                avatar_url: ['css img.gravatar', 'xpath string(@src)']
                author_name: 'xpath string(.//*[@class="author-name"])'
              ]
          ready: (dom) ->

The mandatory/optional selector rules defined above behave as you'd expect
when used as context selectors too: a mandatory node or array of nodes will
limit what pages your script gets called on to those that match it, so your
code is free to assume it will always be there when it runs. An optional
context node that is not found will instead result in that part of your DOM
being null, or an empty array, in the case of a * selector.

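The recursion over strings, [context, spec] pairs and objects can be sketched
without a DOM. In this minimal illustration the "page" is a plain object, and
the 'get name' / 'get? name' rules, lookup, resolve and FAIL are all
hypothetical stand-ins for real selectors, not this library's API:

```javascript
var FAIL = { failed: true };               // sentinel for a failed mandatory match

// Stand-in selector lookup: 'get name' is mandatory, 'get? name' is optional.
function lookup(context, rule) {
  var m = /^(get\??)\s+(.*)$/.exec(rule), found;
  if (!m) throw new Error('unknown rule: ' + rule);
  found = context[m[2]];
  if (found !== undefined) return found;
  return m[1] === 'get?' ? null : FAIL;
}

// Resolve a spec (string, [context rule, sub-spec] pair, or object of specs).
function resolve(spec, context) {
  var results, name, result;
  if (context === null || context === FAIL) return context;
  if (Array.isArray(context))              // many context nodes: one result each
    return context.map(function(c) { return resolve(spec, c); })
                  .filter(function(r) { return r !== FAIL; });
  if (typeof spec === 'string') return lookup(context, spec);
  if (Array.isArray(spec))                 // pick the context, then recurse into it
    return resolve(spec[1], lookup(context, spec[0]));
  results = {};
  for (name in spec) {
    result = resolve(spec[name], context);
    if (result === FAIL) return FAIL;      // a mandatory part is missing
    results[name] = result;
  }
  return results;
}

var page =
  { head: { title: 'Commits' }
  , commits: [ { author: 'johan', avatar: 'a.png' }
             , { author: 'defunkt', avatar: 'b.png' }
             ]
  };
var dom = resolve(
  { meta: ['get head', { title: 'get title', base: 'get? base' }]
  , commits: ['get commits', { author_name: 'get author', avatar_url: 'get avatar' }]
  , tracker: 'get? tracker'
  }, page);
console.log(JSON.stringify(dom, null, 2));
```

A missing optional part shows up as null in the result, while a missing
mandatory part makes the whole resolve() call return the FAIL sentinel.
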
Finally, there is the xpath! keyword, which is similar to xpath, but it also
mandates that whatever is returned is truthy. This is useful when you use the
xpath functions returning strings, numbers and of course booleans, to assert
things about the pages you want to run on, like 'xpath! count(//img) = 0', if
you never want the script to run on pages with inline images, say.

Once you have called given(), you may call given.dom to do page scraping later
on, returning whatever matched the selector(s) you passed. Mandatory selectors
which fail to match at this point return undefined, optional selectors null:

    given.dom('xpath //a[@id]') => undefined or <a id="...">
    given.dom('xpath? //a[@id]') => null or <a id="...">
    given.dom('xpath+ //a[@id]') => undefined or [<a id="...">, <a id>, ...]
    given.dom('xpath* //a[@id]') => [] or [<a id="...">, <a id>, ...]

To detect a failed mandatory match, you can use given.dom(...) === given.FAIL

GitHub pjax hook: to re-run the script's given() block for every pjax request
to a site - add a pushstate hook as per http://goo.gl/LNSv1 -- and be sure to
make your script reentrant, so that it won't try to process the same elements
again, if they are still sitting around in the page (see ':not([augmented])')
*/
function given(opts, plugins) {
  var Object_toString = Object.prototype.toString
    , Array_slice = Array.prototype.slice
    , FAIL = 'dom' in given ? undefined : (function() {
        var tests =
              { path_re: { fn: test_regexp }
              , query: { fn: test_query }
              , dom: { fn: test_dom
                     , my: { 'css*': $c
                           , 'css+': one_or_more($c)
                           , 'css?': $C
                           , 'css': not_null($C)
                           , 'xpath*': $x
                           , 'xpath+': one_or_more($x)
                           , 'xpath?': $X
                           , 'xpath!': truthy($x)
                           , 'xpath': not_null($X)
                           }
                     }
              , inject: { fn: inject }
              }
          , name, test, me, my, mine
          ;

        for (name in tests) {
          test = tests[name];
          me = test.fn;
          if ((my = test.my))
            for (mine in my)
              me[mine] = my[mine];
          given[name] = me;
        }
      })()

    , input = [] // args for the callback(s?) the script wants to run
    , rules = Object.create(opts) // wraps opts in a pokeable inherit layer
    , debug = get('debug')
    , script = get('name')
    , ready = get('ready')
    , load = get('load')
    , pushState = get('pushstate')
    , pjax_event = get('pjaxevent')
    , name, rule, test, result, retry, plugin
    ;

  if (typeof ready !== 'function' &&
      typeof load !== 'function' &&
      typeof pushState !== 'function') {
    alert('no given function');
    throw new Error('given() needs at least a "ready" or "load" function!');
  }

  if (plugins)
    for (name in plugins)
      if ((rule = plugins[name]) && (test = given[name]))
        for (plugin in rule)
          if (!(test[plugin])) {
            given._parse_dom_rule = null;
            test[plugin] = rule[plugin];
          }

  if (pushState && history.pushState &&
      (given.pushState = given.pushState || []).indexOf(opts) === -1) {
    given.pushState.push(opts); // make sure we don't reregister post-navigation
    initPushState(pushState, pjax_event);
  }

  try {
    for (name in rules) {
      rule = rules[name];
      if (rule === undefined) continue; // was some callback or other non-rule
      test = given[name];
      if (!test) throw new Error('did not grok rule "'+ name +'"!');
      result = test(rule);
      if (result === FAIL) return false; // the page doesn't satisfy all rules
      input.push(result);
    }
  }
  catch(e) {
    if (debug) console.warn("given(debug): we didn't run because " + e.message);
    return false;
  }

  if (ready) {
    ready.apply(opts, input.concat());
  }
  if (load) window.addEventListener('load', function() {
    load.apply(opts, input.concat());
  });
  return input.concat(opts);

  function get(x) { rules[x] = undefined; return opts[x]; }
  function isArray(x) { return Object_toString.call(x) === '[object Array]'; }
  function isObject(x) { return Object_toString.call(x) === '[object Object]'; }
  function array(a) { return Array_slice.call(a, 0); } // array:ish => Array
  function arrayify(x) { return isArray(x) ? x : [x]; } // non-array? => Array

  function inject(fn, args) {
    var script = document.createElement('script')
      , parent = document.documentElement;
    args = JSON.stringify(args || []).slice(1, -1);
    script.textContent = '('+ fn +')('+ args +');';
    parent.appendChild(script);
    parent.removeChild(script);
  }

  function initPushState(callback, pjax_event) {
    if (!history.pushState.armed) {
      inject(function(pjax_event) {
        function reportBack() {
          var e = document.createEvent('Events');
          e.initEvent('history.pushState', !'bubbles', !'cancelable');
          document.dispatchEvent(e);
        }
        var pushState = history.pushState;
        history.pushState = function given_pushState() {
          if (pjax_event && window.$ && $.pjax)
            $(document).one(pjax_event, reportBack);
          else
            setTimeout(reportBack, 0);
          return pushState.apply(this, arguments);
        };
      }, [pjax_event]);
      history.pushState.armed = pjax_event;
    }

    retry = function after_pushState() {
      rules = Object.create(opts);
      rules.load = rules.pushstate = undefined;
      rules.ready = callback;
      given(rules);
    };

    document.addEventListener('history.pushState', function() {
      if (debug) console.log('given.pushstate', location.pathname);
      retry();
    }, false);
  }

  function test_query(spec) {
    var q = unparam(this === given || this === window ? location.search : this);
    if (spec === true || spec == null) return q; // decode the query for me!
    throw new Error('bad query type '+ (typeof spec) +': '+ spec);
  }

  function unparam(query) {
    var data = {};
    (query || '').replace(/\+/g, '%20').split('&').forEach(function(kv) {
      kv = /^\??([^=&]*)(?:=(.*))?/.exec(kv);
      if (!kv) return;
      var prop, val, k = kv[1], v = kv[2], e, m;
      try { prop = decodeURIComponent(k); } catch (e) { prop = unescape(k); }
      if ((val = v) != null)
        try { val = decodeURIComponent(v); } catch (e) { val = unescape(v); }
      data[prop] = val;
    });
    return data;
  }

  function test_regexp(spec) {
    if (!isArray(spec)) spec = arrayify(spec);
    var re = spec.shift();
    if (typeof re === 'string') re = new RegExp(re);
    if (!(re instanceof RegExp))
      throw new Error((typeof re) +' was not a regexp: '+ re);

    var ok = re.exec(this===given || this===window ? location.pathname : this);
    if (ok === null) return FAIL;
    if (!spec.length) return ok;
    var named = {};
    ok.shift(); // drop matching-whole-regexp part
    while (spec.length) named[spec.shift()] = ok.shift();
    return named;
  }

  function truthy(fn) { return function(s) {
    var x = fn.apply(this, arguments); return x || FAIL;
  }; }

  function not_null(fn) { return function(s) {
    var x = fn.apply(this, arguments); return x !== null ? x : FAIL;
  }; }

  function one_or_more(fn) { return function(s) {
    var x = fn.apply(this, arguments); return x.length ? x : FAIL;
  }; }

  function $c(css) { return array(this.querySelectorAll(css)); }
  function $C(css) { return this.querySelector(css); }

  function $x(xpath) {
    var doc = this.evaluate ? this : this.ownerDocument, next;
    var got = doc.evaluate(xpath, this, null, 0, null), all = [];
    switch (got.resultType) {
      case 1/*XPathResult.NUMBER_TYPE*/: return got.numberValue;
      case 2/*XPathResult.STRING_TYPE*/: return got.stringValue;
      case 3/*XPathResult.BOOLEAN_TYPE*/: return got.booleanValue;
      default: while ((next = got.iterateNext())) all.push(next); return all;
    }
  }

  function $X(xpath) {
    var got = $x.call(this, xpath);
    return got instanceof Array ? got[0] || null : got;
  }

  function quoteRe(s) { return (s+'').replace(/([-$(-+.?[-^{|}])/g, '\\$1'); }

  // DOM constraint tester / scraper facility:
  // "this" is the context Node(s) - initially the document
  // "spec" is either of:
  //   * css / xpath Selector "selector_type selector"
  //   * resolved for context [ context Selector, spec ]
  //   * an Object of spec(s) { property_name: spec, ... }
  function test_dom(spec, context) {
    // returns FAIL if it turned out it wasn't a mandated match at this level
    // returns null if it didn't find optional matches at this level
    // returns Node or an Array of nodes, or a basic type from some XPath query
    function lookup(rule) {
      switch (typeof rule) {
        case 'string': break; // main case - rest of function
        case 'object': if ('nodeType' in rule || rule.length) return rule;
          // fall-through
        default: throw new Error('non-String dom match rule: '+ rule);
      }
      if (!given._parse_dom_rule) given._parse_dom_rule = new RegExp('^(' +
        Object.keys(given.dom).map(quoteRe).join('|') + ')\\s*(.*)');
      var match = given._parse_dom_rule.exec(rule), type, func;
      if (match) {
        type = match[1];
        rule = match[2];
        func = test_dom[type];
      }
      if (!func) throw new Error('unknown dom match rule '+ type +': '+ rule);
      return func.call(this, rule);
    }

    var results, result, i, property_name;
    if (context === undefined) {
      context = this === given || this === window ? document : this;
    }

    // validate context:
    if (context === null || context === FAIL) return FAIL;
    if (isArray(context)) {
      for (results = [], i = 0; i < context.length; i++) {
        result = test_dom.call(context[i], spec);
        if (result !== FAIL)
          results.push(result);
      }
      return results;
    }
    if (typeof context !== 'object' || !('nodeType' in context))
      throw new Error('illegal context: '+ context);

    // handle input spec format:
    if (typeof spec === 'string') return lookup.call(context, spec);
    if (isArray(spec)) {
      context = lookup.call(context, spec[0]);
      if (context === null || context === FAIL) return context;
      return test_dom.call(context, spec[1]);
    }
    if (isObject(spec)) {
      results = {};
      for (property_name in spec) {
        result = test_dom.call(context, spec[property_name]);
        if (result === FAIL) return FAIL;
        results[property_name] = result;
      }
      return results;
    }

    throw new Error("dom spec was neither a String, Object nor Array: "+ spec);
  }
};

if ('module' in this) module.exports = given;