Automated monthly statistics calculation
Below is a rather simple PHP program that fetches the monthly page view statistics provided by stats.grok.se.
As with pretty much everything else here on Wikipedia, I'm releasing this program code under the terms of the CC BY-SA 3.0 license, so please feel free to use it and modify it according to your needs.
Source code
Before running the program, you need to modify the list of articles contained in the $articles variable, and the month and year for which statistics are to be fetched and calculated, which are specified through the STATS_MONTH and STATS_YEAR constants. When the program is configured to calculate statistics for the current month, it takes into account only the whole (elapsed) days; as a result, running the program on the first of the month to calculate current-month statistics isn't supported. Also, there are sometimes whole days missing in the statistics data available from stats.grok.se. The STATS_LANG constant selects the encyclopedia: en is for the English Wikipedia, de is for the German Wikipedia, etc.
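The whole-days rule and the handling of missing days can be illustrated with a small standalone sketch. It assumes the data arrives as a date => views map, which is the shape the program reads from the daily_views key. The function name average_daily_views and the sample numbers are made up for illustration and aren't part of the program, which tracks the available days across all articles rather than for a single one.

```php
<?php

// Hypothetical helper, not part of the program: averages the views over the
// days that actually have data, optionally skipping one (incomplete) day.
function average_daily_views(array $daily_views, $skip_day = null) {
    $views_total    = 0;
    $days_available = 0;

    foreach ($daily_views as $day => $views) {
        if ($day === $skip_day)     // today is not a whole day yet
            continue;

        $views_total += abs($views);

        if ($views > 0)             // a zero is treated as a missing day
            $days_available++;
    }

    return (($days_available > 0) ? intval($views_total / $days_available) : 0);
}

// made-up sample data: the 2nd is missing, the 4th plays the role of "today"
$sample = array('2015-01-01' => 120,
                '2015-01-02' => 0,
                '2015-01-03' => 80,
                '2015-01-04' => 15);

echo(average_daily_views($sample, '2015-01-04') . "\n");    // prints 100
```

With the 4th skipped and the 2nd counted as missing, 200 views are spread over two days, which is why the average comes out as 100 rather than 66.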
Just as a note, getting ready-to-run PHP code of this program is as easy as viewing the wiki code of this page and copying what's between the <syntaxhighlight lang="php" line> and </syntaxhighlight> tags. The program code below is the latest available version, and it is updated on this page after any improvements or bugfixes are implemented.
<?php

define('STATS_MONTH', '01');    // MM
define('STATS_YEAR',  '2015');  // YYYY
define('STATS_LANG',  'en');    // "en", "de", "fr", etc.

$articles = array('Stagefright (bug)',
                  'Row hammer',
                  'Address generation unit',
                  'UniDIMM',
                  'kdump (Linux)',
                  'kernfs (BSD)',
                  'kernfs (Linux)',
                  'ftrace',
                  'Android Runtime',
                  'WebScaleSQL',
                  'Intel X99',
                  'HipHop Virtual Machine',
                  'kpatch',
                  'kGraft',
                  'CoreOS',
                  'ARM Cortex-A17',
                  'Solid-state storage',
                  'Port Control Protocol',
                  'zswap',
                  'Emdebian Grip',
                  'ThinkPad 8',
                  'Laravel',
                  'OpenLMI',
                  'Open vSwitch',
                  'Distributed Overlay Virtual Ethernet',
                  'Management Component Transport Protocol',
                  'Buildroot',
                  'dm-cache',
                  'bcache',
                  'SATA Express',
                  'OpenZFS',
                  'List of Eurocrem packages',
                  'M.2',
                  'Eurocrem');

// ---------------------------------------------
//  obviously, configurable stuff ends here
// ---------------------------------------------

define('CHUNK_SIZE',  50);      // articles
define('CHUNK_SLEEP', 3);       // seconds

define('EXIT_SUCCESS', 0);      // program exit codes
define('EXIT_FAILURE', 1);

set_time_limit(0);
ini_set('memory_limit', '64M');
ini_set('default_socket_timeout', 90);

// a few small helper functions

function plural_output($value, $unit) {
    return (number_format($value) . " {$unit}" . ((abs($value) != 1) ? 's' : ''));
}

function progress_message($message = '.') {
    static $last_message = null;

    $now     = microtime(true);
    $ret_val = false;

    if (($last_message === null) ||
        (($now - $last_message) > 0.5)) {       // one message every 0.5 seconds
        echo($message);

        $last_message = $now;
        $ret_val      = true;                   // the message was printed
    }

    return ($ret_val);
}

// prepare the cURL handles for all articles

echo("\nFetching statistics data: ");

$start_time     = microtime(true);
$handles        = array();
$articles_total = count($articles);
$current_month  = ((STATS_YEAR == @date('Y')) &&    // the year has to match as well
                   (STATS_MONTH == @date('m')));

if ($articles_total == 0) {                     // a small sanity check
    echo("no articles specified!\n");
    exit(EXIT_FAILURE);
}

if ($current_month && (@date('j') == 1)) {      // only the whole days are accounted
    echo("no elapsed days in current month!\n");
    exit(EXIT_FAILURE);
}

for ($id = 0; $id < $articles_total; $id++) {
    $handles[$id] = curl_init();

    // rawurlencode() also encodes slashes, which have to stay intact in the URL
    curl_setopt($handles[$id], CURLOPT_URL, 'http://stats.grok.se/json/' . STATS_LANG .
                                            '/' . STATS_YEAR . STATS_MONTH .
                                            '/' . str_replace('%2F', '/', rawurlencode($articles[$id])));

    curl_setopt($handles[$id], CURLOPT_HEADER, false);
    curl_setopt($handles[$id], CURLOPT_RETURNTRANSFER, true);

    curl_setopt($handles[$id], CURLOPT_CONNECTTIMEOUT, 20);
    curl_setopt($handles[$id], CURLOPT_TIMEOUT, 60);
    curl_setopt($handles[$id], CURLOPT_DNS_CACHE_TIMEOUT, 3600);

    curl_setopt($handles[$id], CURLOPT_FORBID_REUSE, false);
    curl_setopt($handles[$id], CURLOPT_FRESH_CONNECT, false);
    curl_setopt($handles[$id], CURLOPT_MAXCONNECTS, 10);
}

progress_message();

// run the cURL handles in chunks; otherwise, fetching data for a large number
// of articles at once causes stats.grok.se to start refusing HTTP connections

$handle_all     = curl_multi_init();
$chunks         = ceil(1.0 * $articles_total / CHUNK_SIZE);
$output         = array();
$error_messages = array('Parsing JSON data failed' => -1);

$views_total    = 0;
$failures       = 0;
$days_available = array();
$today          = @date('Y-m-d');

if (version_compare(PHP_VERSION, '5.5.0', '>=')) {      // curl_multi_setopt() is
    curl_multi_setopt($handle_all, CURLMOPT_PIPELINING, true);  // available since PHP 5.5.0
    curl_multi_setopt($handle_all, CURLMOPT_MAXCONNECTS, 10);
}

for ($chunk = 0; $chunk < $chunks; $chunk++) {          // fetch one chunk at a time
    $id_limit = min(($chunk + 1) * CHUNK_SIZE, $articles_total);

    for ($id = $chunk * CHUNK_SIZE; $id < $id_limit; $id++)     // all articles in this chunk
        curl_multi_add_handle($handle_all, $handles[$id]);

    do {                                                // fetch the articles stats data in JSON format...
        $status = curl_multi_exec($handle_all, $running);
        progress_message();

        if ($running > 0)                               // wait for activity instead of spinning
            curl_multi_select($handle_all, 0.25);
    } while (($status == CURLM_CALL_MULTI_PERFORM) ||
             ($running > 0));

    for ($id = $chunk * CHUNK_SIZE; $id < $id_limit; $id++) {   // ... and process it
        $json = curl_multi_getcontent($handles[$id]);

        if (($json == '') ||                            // is the JSON Ok?
            (($json = json_decode($json, true)) === null) ||
            !array_key_exists('daily_views', $json) ||
            !is_array($json['daily_views'])) {

            ++$failures;

            if (($message = curl_error($handles[$id])) != '') {         // for some reason, curl_errno()
                if (!array_key_exists($message, $error_messages)) {     // always returns zero here
                    $errno = -1 * count($error_messages) - 1;
                    $error_messages[$message] = $errno;
                }
                else                                    // already seen
                    $errno = $error_messages[$message];
            }
            else                                        // below -1 are the cURL errors
                $errno = -1;

            $output[$id] = $errno;
        }
        else {                                          // fetched JSON data is Ok
            $views = 0;

            foreach ($json['daily_views'] as $key => $value)
                if (!$current_month || ($key != $today)) {      // account only the whole days
                    $views += abs($value);              // just in case, should never be negative

                    if ($value > 0)                     // sometimes there are complete days missing
                        $days_available[$key] = true;
                }

            $views_total += $views;
            $output[$id]  = $views;
        }

        curl_multi_remove_handle($handle_all, $handles[$id]);
        curl_close($handles[$id]);

        progress_message();                             // done with this article
    }

    if ($chunk != ($chunks - 1)) {                      // don't sleep after the last chunk
        $message = '#';                                 // all this results in smooth progress messages
        $limit   = CHUNK_SLEEP * 4;

        for ($i = 0; $i <= $limit; $i++) {
            if (progress_message($message) === true)    // print only one "marker"
                $message = '.';

            usleep(250000);
        }
    }
}

curl_multi_close($handle_all);
echo(" done.\n\n");

// done fetching all chunks of the stats data, generate and print the output...

arsort($output, SORT_NUMERIC);

$error_messages = array_flip($error_messages);
$first_error    = true;

foreach ($output as $id => $views)
    if ($views >= 0)
        echo("- {$articles[$id]}: total " . plural_output($views, 'view') . "\n");
    else {
        if ($first_error === true) {                    // display an empty line before
            echo("\n");                                 // the first failure message
            $first_error = false;
        }

        echo("> {$articles[$id]}: failure ({$error_messages[$views]})\n");
    }

// ... and the final summary

$days_total   = !$current_month
                ? cal_days_in_month(CAL_GREGORIAN, (int) STATS_MONTH, (int) STATS_YEAR)
                : (@date('j') - 1);
$days_missing = $days_total - count($days_available);

$articles_ok  = $articles_total - $failures;
$month_name   = @date('F', @strtotime(STATS_YEAR . '-' . STATS_MONTH . '-01'));

$elapsed_time = microtime(true) - $start_time;
$elapsed_min  = intval($elapsed_time / 60);
$elapsed_sec  = round($elapsed_time - $elapsed_min * 60);

echo("\nDone, {$month_name} " . STATS_YEAR . ' statistics for ' . plural_output($articles_ok, 'article') .
     ' fetched in ' . (($elapsed_min > 0)
                       ? (plural_output($elapsed_min, 'minute') . ' and ')
                       : '') .
     plural_output($elapsed_sec, 'second') . ".\n" .
     (($failures > 0)
      ? ('Fetching the views statistics failed for ' . plural_output($failures, 'article') . ".\n")
      : ''));

if ($days_total > $days_missing) {                      // it's entirely possible that
    $views_daily = intval($views_total / ($days_total - $days_missing));    // all days were missing

    echo('Total ' . plural_output($views_total, 'view') . ', averaging in ' .
         plural_output($views_daily, 'view') . ' per day (' .
         plural_output($days_total, ($current_month ? 'whole ' : '') . 'day') .
         ' in ' . ($current_month ? 'the current' : 'that') . ' month' .
         (($days_missing > 0)
          ? (', with the statistics unavailable for ' . plural_output($days_missing, 'day'))
          : '') .
         ").\n");
} else {                                                // no statistics data
    echo('Sorry, no statistics data is available at the moment for ' .
         ($current_month ? 'the current' : 'that') . " month.\n");

    $errno = ((($days_total != $days_missing) ? 10 : 0) +       // just in case, perform some additional
              (($views_total != 0) ? 20 : 0));                  // sanity checks on the internal logic

    if ($errno > 0) {
        echo("\nInternal errors detected (error code: {$errno}), please report on " .
             "https://en.wikipedia.org/wiki/User_talk:Dsimic by providing complete program output.\n");

        exit(EXIT_FAILURE);
    }
}

exit(EXIT_SUCCESS);

?>
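To make the request URL construction in the program easier to follow, here is the same expression wrapped in a small standalone function. The function name stats_url and its parameters are hypothetical and exist only for this illustration; the URL layout itself is taken from the program above.

```php
<?php

// Hypothetical wrapper around the URL expression used in the program.
function stats_url($lang, $year, $month, $title) {
    // rawurlencode() also encodes slashes, which have to stay intact
    $encoded = str_replace('%2F', '/', rawurlencode($title));

    return ("http://stats.grok.se/json/{$lang}/{$year}{$month}/{$encoded}");
}

echo(stats_url('en', '2015', '01', 'kdump (Linux)') . "\n");
// prints http://stats.grok.se/json/en/201501/kdump%20%28Linux%29
```

Spaces and parentheses in an article title end up percent-encoded, while a slash in a title passes through unchanged.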
Output example
Below is an example of the output produced when the program above is run. The program sorts the articles by their total page views in descending order, so the article that has received the largest number of page views is at the top of the list.
Fetching statistics data: ....... done.
- M.2: total 43,038 views
- SATA Express: total 18,979 views
- Android Runtime: total 14,322 views
- CoreOS: total 9,897 views
- HipHop Virtual Machine: total 9,443 views
- Laravel: total 8,891 views
- Open vSwitch: total 3,240 views
- dm-cache: total 2,233 views
- ARM Cortex-A17: total 2,209 views
- OpenZFS: total 2,194 views
- UniDIMM: total 1,923 views
- bcache: total 1,470 views
- Port Control Protocol: total 1,143 views
- zswap: total 1,090 views
- Management Component Transport Protocol: total 942 views
- WebScaleSQL: total 891 views
- ftrace: total 848 views
- Intel X99: total 806 views
- kpatch: total 783 views
- Address generation unit: total 777 views
- kdump (Linux): total 713 views
- Solid-state storage: total 640 views
- Buildroot: total 639 views
- kernfs (Linux): total 634 views
- Eurocrem: total 633 views
- kGraft: total 580 views
- ThinkPad 8: total 460 views
- Distributed Overlay Virtual Ethernet: total 395 views
- OpenLMI: total 389 views
- Emdebian Grip: total 349 views
- kernfs (BSD): total 231 views
- List of Eurocrem packages: total 98 views
- Stagefright (bug): total 0 views
- Row hammer: total 0 views
Done, January 2015 statistics for 34 articles fetched in 3 seconds.
Total 130,880 views, averaging in 4,221 views per day (31 days in that month).
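The ordering in the listing above relies on a sort that keeps each total attached to its article, and the daily average in the summary line is a truncated division. Here is a minimal sketch of both steps, using three totals copied from the example output. Keying the array by article title is a simplification made for this sketch; the program keys its totals by numeric article ID.

```php
<?php

// three totals copied from the example output, keyed by article title
// (a simplification; the program itself uses numeric article IDs as keys)
$output = array('bcache'       => 1470,
                'M.2'          => 43038,
                'SATA Express' => 18979);

arsort($output, SORT_NUMERIC);      // descending by value, keys preserved

foreach ($output as $title => $views)
    echo("- {$title}: total " . number_format($views) . " views\n");

// the daily average from the summary line: 130,880 views over 31 days
echo(number_format(intval(130880 / 31)) . "\n");    // prints 4,221
```

Because arsort() preserves the keys, each total stays tied to its article after sorting, and intval() drops the fractional part of 4,221.93 instead of rounding it up.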