Trial 1 discussion

I fixed a botched attempt at a redirect by a noob[1] and got reverted by this bot. Thanks, Jon. — Preceding unsigned comment added by 81.145.247.25 (talk • contribs) 19:26, 2 November 2010

Thanks for pointing this out - "REDIRECT" was not included in the list of wiki markup to ignore. We are adding it now. Crispy1989 (talk) 19:30, 2 November 2010 (UTC)[reply]

That's not the only error so far. [2] is a rather inexplicable reversion. Also, the bot seems to have a bad habit of reverting itself [3] [4] for no apparent reason (or perhaps the means of identifying vandalism is so problematic that the bot really is marking its own edits as vandalism.) Peter Karlsen (talk) 19:37, 2 November 2010 (UTC)[reply]

Reverting itself has been fixed. The other errors are due to the dataset not being broad enough (adding these edits to the dataset and retraining should rectify this) and REDIRECT not being in the list of wiki markup (being fixed right now). Crispy1989 (talk) 19:43, 2 November 2010 (UTC)[reply]

Thanks. [5] appears to be an incorrect reversion as well. Peter Karlsen (talk) 19:46, 2 November 2010 (UTC)[reply]

I blanked an attack page and got a warning. Not good. Carl Sixsmith (talk) 19:45, 2 November 2010 (UTC)[reply]

This also seems to be an issue with dataset completeness. There are no instances of complete page blanking in the dataset that are legitimate. As soon as these are added, this will correct itself. Also, users above a certain threshold number of edits should be ignored (and you should qualify). We'll look into lowering the threshold. Crispy1989 (talk) 19:49, 2 November 2010 (UTC)[reply]

I placed a speedy tag {{db-redirnone}} on a page with a broken redirect (to a page that has been deleted) and was reverted.[6] Thanks, Jon. — Preceding unsigned comment added by 81.145.247.25 (talk • contribs) 20:05, 2 November 2010
Thanks, will also add this tag to markup list and edit to the dataset. Crispy1989 (talk) 20:38, 2 November 2010 (UTC)[reply]

So far, the false positive rate seems to fall right into the expected range of around 0.25%. It also seems to be reverting more than half of all vandalism on Wikipedia, also as expected. The false positives that do exist seem to be primarily problems with redirects (which has been fixed in the code and is being tested before restarting the process running the trial) and problems that can be solved by increasing dataset size. Please continue to report any issues so they can be added to the training dataset, and so I can add tags and other things (like redirects) that I may have missed to the special text handling code. Crispy1989 (talk) 21:35, 2 November 2010 (UTC)[reply]

It would be better if you assigned a special page for reporting FPs instead of using this one. Also, what is the number of edits above which a registered user will be ignored? Sole Soul (talk) 21:58, 2 November 2010 (UTC)[reply]

Users with more than 50 edits are now ignored. This is what the old ClueBot was set at. Crispy1989 (talk) 22:13, 2 November 2010 (UTC)[reply]

I changed a stub article, flagged as poor, into a redirect and got a false positive. See The Secret Peacemaker for details. Jim no.6 (talk) 22:02, 2 November 2010 (UTC)[reply]

The robot is automatically reverting my edit to the page Ideologia, but "Ideologia" could refer either to the album Ideología or to the album Ideologia (Cazuza's album). The robot needs some adjustment. — Preceding unsigned comment added by 187.68.100.92 (talk • contribs) 23:45, 2 November 2010

Reverted [7]. In the absence of a better solution, I suggest that article => redirect and redirect => article conversions be removed from the "vandalism" dataset. Peter Karlsen (talk) 23:54, 2 November 2010 (UTC)[reply]

The problem was not that it thought redirects were vandalism, but that there was no special handling for "REDIRECT". This has been fixed (special handling added). Testing the fixes right now, bot will be restarted with updates soon (it takes some time to retrain and test it). I also added special handling for disambiguation tags. Crispy1989 (talk) 00:02, 3 November 2010 (UTC)[reply]
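
The special handling for redirects mentioned above could start from a check like the one below. This is a minimal sketch and not the bot's actual code: the helper name `redirect_target` and the regular expression are assumptions for illustration, based only on the standard `#REDIRECT [[Target]]` wiki syntax.

```python
import re

# Hypothetical pattern: matches wikitext that begins with "#REDIRECT [[Target]]".
# The keyword is matched case-insensitively.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(wikitext):
    """Return the redirect target title, or None if the text is not a redirect."""
    match = REDIRECT_RE.match(wikitext)
    return match.group(1).strip() if match else None
```

An edit whose new revision yields a target here could then be routed around the normal text statistics, which is roughly what "special handling" suggests.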

I changed the Greatest Hits So Far redirect page into a disambiguation page in order to include both The Greatest Hits, So Far and Greatest Hits... So Far!!!, since it is an upcoming release. Yet it has since been reverted and I got a vandalism warning. The same thing happened when I corrected the UK edition track listing on the Greatest Hits... So Far!!! page, which is currently incorrect: I replaced it with the track listing cited in the reference, yet it was reverted again. My edits should stop being reverted, as they are neither unconstructive nor vandalism. Imjayyy (talk) 00:04, 3 November 2010 (UTC)[reply]

The issue with redirects and disambiguation pages is known and is being corrected. Cluebot-NG has not edited Greatest Hits... So Far!!!; it did not consider these edits vandalism. They were reverted by another user. Crispy1989 (talk) 00:18, 3 November 2010 (UTC)[reply]

Thank you! Imjayyy (talk) 00:21, 3 November 2010 (UTC)[reply]

Is it possible to generate new false positive vs catch rate charts once the bot has a few days under its belt so that we can see the progress? Gigs (talk) 01:12, 3 November 2010 (UTC)[reply]
The reports and charts are generated from the trial dataset. I can regenerate them (from the same trial dataset) after making these modifications, but it wouldn't look much different - the changes I'm making now are to prevent false positives from things which aren't well-represented in the dataset (seemingly, redirects and disambiguation pages). We will add these reported false positives to the training dataset, and add special handling code for these, but it won't greatly affect the results of a run with the trial dataset. What I think you want is statistics on exactly how much vandalism is really being caught, versus the false positive rate. For this, we need to have a human go through all the edits that bot has seen, and manually classify them to check for accuracy. We have already designed an interface exactly for this purpose (see User:Cluebot_NG for details), but have not gotten any volunteers yet. If we get enough volunteers, then we'll add all the edits the bot has seen to the review queue, and generate a separate trial dataset from this (then, we can generate the stats and graphs). Crispy1989 (talk) 01:23, 3 November 2010 (UTC)[reply]

80.192.184.101 (talk · contribs) made a test edit to thin layer chromatography, which I reverted. When the user made the same edit again, it was reverted by ClueBot NG. Does the bot assume that reverted edits are vandalism? --Ixfd64 (talk) 06:31, 4 November 2010 (UTC)[reply]

Well, one of the parameters into the ANN is number of times reverted. So, not directly, but the bot probably did pick up on the fact that that user had been reverted before. -- Cobi^(t|c|b) 06:56, 4 November 2010 (UTC)[reply]

Actually, the parameter is number of times user has been warned. This is one factor out of over a hundred, and alone, is not sufficient to indicate that an edit is vandalism. You can think of it more as an "estimation of good faith" for borderline edits. Crispy1989 (talk) 07:15, 4 November 2010 (UTC)[reply]

False positive reports

I made an edit to Shell game that was reverted, despite being a genuine edit. 75.67.220.101 (talk) 04:00, 22 November 2010 (UTC)[reply]
An attempt to remove the blatant advertising from the Lemsip article was reverted.

It's pretty obvious the article was written by someone working for the producer of the medicine.

I added some variations to the information for Kings, the drinking game. The article actually has a section for variations, yet the bot removed the suggestions I added. Any suggestions on how to make it stick?
Tried to edit "Hoy_Me_Voy" to change sentences to a more appropriate tone, but ClueBot flagged the change as vandalism, put a notice on my talk page, and reverted the change. What do I do? (talk)
Tried to make a change on Fernando_Garibay, adding "So Happy I Could Die" to his 2010 productions, but it got removed. The proof is on his website, and it's already a source, so what do I do? 70.173.230.88 (talk)
The following is the log entry regarding this warning: Four (energy drink) was changed by 109.46.144.246 (u) (t) ANN scored at 0.817485 on 2010-11-18T07:44:16+00:00 . Thank you. ClueBot NG (talk) 07:44, 18 November 2010 (UTC)
I attempted to revert an instance of vandalism to this page; ClueBot flagged my change as vandalism and reverted my change to the previously vandalized version. --24.72.122.184 (talk) 22:22, 15 November 2010 (UTC)[reply]

Okay, after this happened the first time, an automated message appeared on my talk page stating, "ClueBot NG produces very few false positives, but it does happen. If you believe the change you made should not have been detected as unconstructive, please report it here, remove this warning from your talk page, and then make the edit again." I did this, and it re-reverted the change again and placed another warning on my talk page. What is this? --24.72.122.184 (talk) 22:31, 15 November 2010 (UTC)[reply]

Okay, it was in fact another user that re-reverted the change, and not ClueBot. Please disregard that follow-up comment. --24.72.122.184 (talk) 22:55, 15 November 2010 (UTC)[reply]

This looks wrong. http://en.wikipedia.org/w/index.php?title=Kamen_Rider_OOO_%28character%29&curid=28242801&diff=396938724&oldid=396938713 How would it even know if this is vandalism? Stupid bot. 131.202.131.250 (talk) 17:04, 15 November 2010 (UTC)[reply]
I followed the link provided to give a false positive report, but as far as I can tell, there's not actually a section set aside here for that purpose. I apologize for any intrusion in my making this section. —Bill Price (nyb) 03:28, 3 November 2010 (UTC)[reply]

Sorry for the confusion - that page is currently "left-over" from the original ClueBot. We thought it would be good to keep false positives here while the BRFA is open, so all reviewers can get a good idea of its accuracy.

Same thing. Happybunny95 (talk) 04:03, 15 November 2010 (UTC)[reply]

This edit of Bed Intruder Song made by Allyisaunicorn (talk · contribs). The edit needed to be reverted due to copyright issues, but it was not an act of vandalism. —Bill Price (nyb) 03:28, 3 November 2010 (UTC)[reply]
Edits like this are very difficult to distinguish from vandalism from a bot's point of view. The bot does specially handle things within quotes, but these lyrics were presented as normal page content, so they were handled normally. They contain unterminated sentences, slang, Bayesian keywords, abnormal use of capitalization, and other things which are fed into the neural network. It may be possible to add a special case for lyrics, but this might require that the training dataset contain several examples of lyrics being added. I'll look into the feasibility of this. Crispy1989 (talk) 03:52, 3 November 2010 (UTC)[reply]

You should note that the account that made that edit seems to be a single purpose vandalism only account. So the number of past edits that were vandalism statistic would likely be higher. As such I'm not sure this specific edit is really a false positive due to the user's obvious bad faith. --nn123645 (talk) 15:10, 3 November 2010 (UTC)[reply]

Thanks for pointing this out. Yes, the bot does take this into account. It may be a factor into why this particular edit was reverted. Crispy1989 (talk) 17:46, 3 November 2010 (UTC)[reply]

Here's another false positive: [8] (it's not immediately clear to me why the edit might have seemed like vandalism at all; the bot is surely not policing the addition of unreferenced material to Wikipedia.) Peter Karlsen (talk) 03:35, 3 November 2010 (UTC)[reply]
One unique thing about using a neural network as the core detection engine is that it's a bit of a "black box". Sometimes it's not immediately apparent why an error occurred. Usually it's because the dataset just isn't large enough, and as it grows, these will disappear. Currently, the dataset is large enough for the statistics mentioned above (about 60% of vandalism caught at 0.25% false positives), and just by estimating, it looks like these are approximately correct for the live run as well. As the dataset grows, these kinds of false positives will be eliminated. Crispy1989 (talk) 03:52, 3 November 2010 (UTC)[reply]
Do you intend to keep the target false positive rate at 0.25%? (For editors new to this discussion, that's 0.25% of every edit examined; the number of incorrect reversions will be well over the two and a half per 1000 rollbacks by the bot that the raw percentage might seem to indicate.) If so, then as the dataset improves, the threshold for reversion will simply be lowered to continue to meet the 0.25% target, resulting in more vandalism reverted, but with new and exciting false positives to replace the ones that have been eliminated. This is why, in the discussion above, I suggested that a 0.1% false positive target would be more conducive to community acceptance of the bot, and ultimate approval. Peter Karlsen (talk) 05:07, 3 November 2010 (UTC)[reply]
It can be adjusted to whatever people want. As the dataset improves, I'll update the graphs. The 0.25% was determined by looking for a sharp dropoff point below 0.5% on the graph. As the dataset improves, the false positive rate will be lowered as well. Crispy1989 (talk) 15:56, 3 November 2010 (UTC)[reply]
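
The trade-off discussed here, where a fixed false positive target determines the reversion threshold, can be sketched as follows. The function names, the 0-to-1 score range, and the simple grid search are assumptions for illustration; how the bot really picks its threshold is not described on this page.

```python
def false_positive_rate(good_scores, threshold):
    """Fraction of known-good edits that would be reverted at this threshold."""
    if not good_scores:
        return 0.0
    flagged = sum(1 for s in good_scores if s >= threshold)
    return flagged / len(good_scores)


def catch_rate(vandal_scores, threshold):
    """Fraction of known-vandalism edits that would be reverted at this threshold."""
    if not vandal_scores:
        return 0.0
    caught = sum(1 for s in vandal_scores if s >= threshold)
    return caught / len(vandal_scores)


def pick_threshold(good_scores, target_fp_rate, steps=1000):
    """Lowest threshold on a 0..1 grid whose false positive rate meets the target.

    Returns None if no grid point meets the target.
    """
    for i in range(steps + 1):
        threshold = i / steps
        if false_positive_rate(good_scores, threshold) <= target_fp_rate:
            return threshold
    return None
```

Under these assumptions, a better-trained network pushes good edits toward lower scores, so the same target is met at a lower threshold and more vandalism is caught. Tightening the target (say from 0.25% to 0.1%) raises the threshold instead.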

I too followed a link here to report a false positive; I really have no idea why it auto-reverted my changes. The trigger that seems to have set off the bot was my use of the word "homosexual", although I was simply substituting it for the word "gay" in a sentence because it seemed more appropriate in the context, and created a better flow in the prose. Oddly, the bot also reverted a number of other minor changes I had made, which were merely filling in missing words, and could in no way be construed as vandalism. Here are the changes I made[9] — Preceding unsigned comment added by Carensdp (talk • contribs) 05:13, 3 November 2010

I've taken the liberty of reverting the bot [10]. One thing that ought to be apparent from the false positives up till now is the bot's persistent homophobia - any mention of "gay" or "homosexual" seems to be enough to trigger it (for instance, [11].) Perhaps it might be advisable to add some edits to LGBT-related articles to the dataset of legitimate contributions, so the bot might (hopefully) be able to distinguish between references to actual homosexuality, and "gay" in the pejorative sense as a generalized insult. Peter Karlsen (talk) 05:35, 3 November 2010 (UTC)[reply]

Yeah, this is what needs to happen. In the current dataset, there are no instances of these words being used correctly. As soon as these edits are added, this problem should correct itself.

Cluebot seems to be immediately reverting all contributions made by IP's to the Reference Desk, eg. [12]. WikiDao ☯ (talk) 11:41, 3 November 2010 (UTC)[reply]
(edit conflict) here's another, maybe keep it to article space, or at least out of the discussion space? - Kingpin¹³ (talk) 11:42, 3 November 2010 (UTC)[reply]
The neural network is only trained on articles in the main namespace. It is not (currently) meant to handle any other articles. I was unaware that articles from other namespaces were fed to the core. I'll tell the person running the interface code to exclude any edits not in the main namespace. Crispy1989 (talk) 15:56, 3 November 2010 (UTC)[reply]
One thing this has brought to attention, is that the exclusion compliance is apparently not working, see here. - Kingpin¹³ (talk) 16:35, 3 November 2010 (UTC)[reply]
I'll tell the developer of the Wikipedia interface. It handles all whitelists and exclusions. Crispy1989 (talk) 16:49, 3 November 2010 (UTC)[reply]
Also, are you aware that it's not currently warning users? - Kingpin¹³ (talk) 16:50, 3 November 2010 (UTC)[reply]

I am confused by what you mean by "developer of the Wikipedia interface"? Exclusion compliant means following the {{bots}} template in this case, such as this. — HELLKNOWZ ▎TALK 16:55, 3 November 2010 (UTC)[reply]
The bot's code is created primarily by two people - myself and User:Cobi. I wrote the core which does the main vandalism detection with the machine learning techniques. Cobi wrote the interface to Wikipedia, which handles everything that's not machine-learning (exclusions, whitelists, etc). The interface was largely taken from the existing Cluebot. Crispy1989 (talk) 17:40, 3 November 2010 (UTC)[reply]

Exclusion compliance fixed. The {{nobots}} was working, but not {{bots}}. Not warning users was due to someone setting the bot's shutoff page, and due to a bug that has now been fixed, it only honored that page for warns. -- Cobi^(t|c|b) 17:47, 3 November 2010 (UTC)[reply]
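
For readers unfamiliar with exclusion compliance, the check amounts to reading `{{nobots}}` and `{{bots|allow=...}}` / `{{bots|deny=...}}` from the page text before editing. The sketch below covers only that simplified subset of the template's options; the function name `bot_may_edit` is hypothetical and this is not the interface code referred to above.

```python
import re

NOBOTS_RE = re.compile(r"\{\{\s*nobots\s*\}\}", re.IGNORECASE)
BOTS_RE = re.compile(
    r"\{\{\s*bots\s*\|\s*(allow|deny)\s*=\s*([^}|]*)", re.IGNORECASE
)


def bot_may_edit(wikitext, bot_name):
    """Return False if the page text asks this bot to stay away."""
    if NOBOTS_RE.search(wikitext):
        return False
    match = BOTS_RE.search(wikitext)
    if match is None:
        return True  # no exclusion template present
    kind = match.group(1).lower()
    names = [n.strip().lower() for n in match.group(2).split(",")]
    listed = "all" in names or bot_name.lower() in names
    # allow=...: only listed bots may edit; deny=...: listed bots may not.
    return listed if kind == "allow" else not listed
```

A bug like the one described, where `{{nobots}}` worked but `{{bots}}` did not, would correspond to the second branch being skipped.
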
ClueBot NG reverted a speedy deletion tag db-vandalism which was added by Uncle Milty. Minimac (talk) 14:23, 3 November 2010 (UTC)[reply]
This is being fixed. Crispy1989 (talk) 15:56, 3 November 2010 (UTC)[reply]

Note: Cluebot-NG has reviewed over 70,000 edits so far, resulting in a handful of false positives, which are either being fixed now programmatically, or will be fixed with the growing of the dataset. Crispy1989 (talk) 17:00, 3 November 2010 (UTC)[reply]

I should also note that, while Cluebot-NG has a false positive rate comparable to some humans (if a human were to review every single edit made to Wikipedia), the false positives are not always the same ones that you might expect a human to make. Crispy1989 (talk) 19:39, 3 November 2010 (UTC)[reply]

On Ajharper18 -- the content of the page was "test" so I tagged as {{db-g2}} and was reverted by ClueBot NG. There's no way that should have happened. I've never had any bot revert any of my speedy taggings previously. 174.109.197.174 (talk) 11:36, 4 November 2010 (UTC)[reply]
This seems to be an issue in some cases because the current dataset does not contain instances of speedy deletion tags being added. We are generating a new dataset now which should solve the issue. Crispy1989 (talk) 14:17, 4 November 2010 (UTC)[reply]

This is probably a silly question, but what does the "NG" stand for? New Generation? --Ixfd64 (talk) 20:22, 4 November 2010 (UTC)[reply]

Our intent was Next Generation. Crispy1989 (talk) 20:33, 4 November 2010 (UTC)[reply]

This edit on Franco Selleri was made by an inexperienced user, and seems to have been reverted just because he accidentally added his signature ~~~~ to his edit. -- Crowsnest (talk) 10:24, 5 November 2010 (UTC)[reply]
The signature probably is the primary reason it was reverted - the training set doesn't include talk pages or areas where signatures are used, so without seeing a signature before, it probably seems like a random mashup of punctuation by a new user. As the dataset grows, and it sees instances of accidental signatures classified as constructive, this type of thing won't happen. In addition to the signature, a possible complicating factor is that the bot can detect common vandal grammatical errors, such as unterminated sentences - and the user's edit, in this case, adds one. Again, as the dataset grows, and there are instances of where edits like this are not classified as vandalism, the bot will score these lower. Crispy1989 (talk) 16:33, 5 November 2010 (UTC)[reply]

[13] Can't see why this revert was made. Philip Trueman (talk) 15:05, 5 November 2010 (UTC)[reply]
This is an instance where the dataset isn't large enough. For some reason, the only edits the bot has learned from with similar statistics have been vandalism. With a larger and more complete dataset, as is being generated now by volunteers, there will be fewer gaps in its training. Crispy1989 (talk) 16:38, 5 November 2010 (UTC)[reply]

This revert [14], while not referenced, is definitely not vandalism. Not sure why it was labelled as such (use of ball?). AIRcorn (talk) 20:53, 5 November 2010 (UTC)[reply]
This is purely a case of a gap in the dataset. The Bayesian classifier (i.e., the words) was not what caused it, at least not alone - "ball" isn't even in the Bayesian database (the bot learned that it occurs about equally in vandalism and nonvandalism). A few words may have contributed ("you" occurs in 548 vandalism articles, and 45 good articles), but this should have been counterbalanced by other words ("22" occurs in 82 good articles, and 22 vandalism articles). With an increase in dataset size, this should stop. Crispy1989 (talk) 00:26, 6 November 2010 (UTC)[reply]
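
To make the word counts quoted above concrete, here is a rough sketch of how such counts translate into a per-word lean toward vandalism. The function name and the plain ratio are assumptions for illustration; the bot's Bayesian classifier may weight words differently (for example by accounting for how many edits of each class exist).

```python
def word_vandalism_ratio(vandal_count, good_count):
    """Share of a word's occurrences that were in vandalism edits.

    Returns 0.5 (no lean either way) for a word that was never seen.
    """
    total = vandal_count + good_count
    if total == 0:
        return 0.5
    return vandal_count / total


# Counts quoted in the comment above.
you_ratio = word_vandalism_ratio(548, 45)  # about 0.92, leans vandalism
num_ratio = word_vandalism_ratio(22, 82)   # about 0.21, leans constructive
```

A word such as "ball" that occurs about equally in both classes would sit near 0.5 and carry little information, which fits with it being left out of the database.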

[15] looks like a revert of a perfectly legitimate and correct edit (see definition of ionization). PleaseStand ^(talk) 05:37, 6 November 2010 (UTC)[reply]
This appears to be because the bot was counting "i" and "e" both as uncapitalized sentences, and "i" as an uncapitalized 'I'. Thanks for pointing out this special case. It is now fixed. Crispy1989 (talk) 07:07, 6 November 2010 (UTC)[reply]
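
The special case described here can be illustrated with a small sentence counter that masks abbreviations before splitting on sentence-ending punctuation. This is only a sketch: the abbreviation list, the function name, and the splitting rule are assumptions, not the bot's actual fix.

```python
import re

# Assumption: a tiny illustrative list; a real list would be longer.
ABBREVIATIONS = ("i.e.", "e.g.")


def count_uncapitalized_sentences(text):
    """Count sentences that start with a lowercase letter."""
    masked = text
    for abbr in ABBREVIATIONS:
        # Drop the periods inside abbreviations so they do not end a sentence.
        masked = re.sub(
            re.escape(abbr), abbr.replace(".", ""), masked, flags=re.IGNORECASE
        )
    sentences = re.split(r"(?<=[.!?])\s+", masked.strip())
    return sum(1 for s in sentences if s and s[0].islower())
```

Without the masking step, "i.e. the loss of electrons" would be split after "i.e." and the remainder would be counted as a sentence starting in lowercase.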

I made some improvements to the English translation on Du, du liegst mir im Herzen (here), however my changes were reverted as vandalism... why? 71.38.118.252 (talk) 06:51, 6 November 2010 (UTC)[reply]
Occasionally it has issues dealing with song lyrics because they do not follow standard acceptable wiki formatting. We're looking into adding special cases in code, and increasing dataset size should help as well. Crispy1989 (talk) 07:09, 6 November 2010 (UTC)[reply]

[16] I don't like the use of the word 'your', but to call the original edit vandalism is a stretch. Philip Trueman (talk) 11:22, 6 November 2010 (UTC)[reply]
This is the kind of false positive I'd expect - poor edits with borderline vandalism qualities. Even these should be reduced with a larger dataset (containing constructive edits with these traits). In addition to the word "your", the lack of a space after the previous sentence was also a factor - it registers punctuation present in the middle of words (other than things like apostrophes). Crispy1989 (talk) 23:00, 6 November 2010 (UTC)[reply]

[17] (filling in the chronology part of an infobox with a link) definitely isn't vandalism. PleaseStand ^(talk) 19:21, 6 November 2010 (UTC)[reply]
Definitely something that could be fixed with a larger dataset. Crispy1989 (talk) 23:01, 6 November 2010 (UTC)[reply]

[18] Maybe it was a poor, unsourced edit to content about a living person, but it's not vandalism. PleaseStand ^(talk) 19:59, 6 November 2010 (UTC)[reply]
The bot most likely figured it was a poor/borderline edit based on statistics, and perhaps the word "estained". The previous warning for vandalism was used as an estimation of good faith (1/3 of all previous edits made were vandalism at the time of the edit). As with the other similar false positives here, increasing dataset size and including cases where previous vandals make constructive edits should help. Crispy1989 (talk) 23:06, 6 November 2010 (UTC)[reply]

[19] Again, not vandalism. PleaseStand ^(talk) 20:11, 6 November 2010 (UTC)[reply]
The bot is failing to recognize this as a link. It currently recognizes external links with either [blah or <a href=blah syntax. I'll correct it to look for more general forms. Crispy1989 (talk) 23:10, 6 November 2010 (UTC)[reply]
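
A more general form of link detection might simply look for the URL itself, whether it is bracketed, inside an `<a href=...>` tag, or bare. The sketch below is an assumption about what "more general forms" could mean; the names `URL_RE` and `find_external_links` are hypothetical.

```python
import re

# Match http/https URLs up to whitespace, brackets, angle brackets, or quotes.
URL_RE = re.compile(r"https?://[^\s\[\]<>\"']+", re.IGNORECASE)
TRAILING_PUNCTUATION = ".,;:!?)"


def find_external_links(text):
    """Return external URLs found in the text, regardless of surrounding markup."""
    return [url.rstrip(TRAILING_PUNCTUATION) for url in URL_RE.findall(text)]
```

Stripping trailing punctuation is a heuristic so that a bare URL at the end of a sentence does not swallow the full stop.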

[20] I first corrected the non-working external links pointing to Finnish Army insignia on the Finnish Defence Forces' website. Next I added a few spaces to the links' text to correct their appearance, and the bot reported this as unconstructive. I forgot to mark the last change as minor, which may have affected the bot's report. Kime79 ^(talk) 14:57, 9 November 2010 (UTC)[reply]
Thanks for pointing this out - this brings to my attention that, although links are removed (and analyzed separately) before being input to the neural net, the total size difference includes the links. Because links are very rarely this long, this threw off the neural net. I'll look into modifying it to remove links in a preprocessing step instead. Crispy1989 (talk) 15:02, 9 November 2010 (UTC)[reply]
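
The preprocessing change proposed here can be sketched as stripping links from both revisions before measuring the size difference. The regular expression and the names `strip_links` and `size_delta` are assumptions for illustration only.

```python
import re

# Bracketed external links ("[http://... label]") first, then bare URLs.
LINK_RE = re.compile(r"\[https?://[^\]\s]+(?:\s+[^\]]*)?\]|https?://\S+")


def strip_links(text):
    """Remove external links so they do not influence text statistics."""
    return LINK_RE.sub("", text)


def size_delta(old_text, new_text):
    """Size difference between revisions, measured after links are removed."""
    return len(strip_links(new_text)) - len(strip_links(old_text))
```

With this ordering, replacing a short broken URL with a long working one changes the raw page size but leaves `size_delta` at zero.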

[21] Teach your bot what wikify and Manual of Style are.
What sort of a word is 'indiscovered'? Philip Trueman (talk) 03:22, 11 November 2010 (UTC)[reply]
The bot may have been picking up on the direct replacements of formal terms with informal terms (i.e. your replacement of "large" with "big"), and the replacements of words with incorrect spellings of the words (i.e. your replacement of "undiscovered" with "indiscovered"). If enough edits like this are added to the dataset and classified as constructive, the bot will stop recognizing it as vandalism. But it seems to me that this kind of edit is very borderline - adding misspelled words is one thing, but replacing correct words with misspelled ones, and formal words with informal ones, in multiple places, is another thing. Crispy1989 (talk) 03:32, 11 November 2010 (UTC)[reply]

[22] Edit incorrectly reverted for not being 'constructive'.
Considering that the article is about the TLD and not the slang word, this edit seems very borderline. Crispy1989 (talk) 16:09, 11 November 2010 (UTC)[reply]
But the bot actually reinstated the piece about the slang term. Ucucha 16:11, 11 November 2010 (UTC)[reply]
I can't believe I missed that - wow. Yeah, it's the same problem that has caused a few other issues. Not enough reverts in the dataset. Crispy1989 (talk) 16:14, 11 November 2010 (UTC)[reply]
[23] The 0.25% false positive statistic doesn't seem correct; if you're calculating it by taking the number of people who take the time to post on this page divided by its total amount of edits, you're going to get a very skewed "statistic". Shubinator (talk) 05:09, 14 November 2010 (UTC)[reply]
Just looked over the last 50 of the bot's contribs (for the record, that's these), and found two more: [24] [25]. By my (very informal) data, ClueBot NG has a 4% false positive rate. Don't get me wrong, the bot is unique and the work you're doing is great, but the bot definitely needs some tweaking before being let loose unmonitored. Shubinator (talk) 05:22, 14 November 2010 (UTC)[reply]
The false positive rate is the percentage of good edits it classifies as bad, i.e., it classifies 25 out of every 10,000 good edits as bad. And, yes, we realize there needs to be work done to tweak it -- that is why we have a review interface so we can create a better dataset. We calculated 0.25% by training with 20,000 edits in our current 30,000 edit dataset, and then having it classify the remaining 10,000, and seeing how many it said are vandalism, when our dataset said they were good. -- Cobi^(t|c|b) 06:20, 14 November 2010 (UTC)[reply]
Just to clarify what Cobi said: He's correct about how false positive rate is determined. To accurately determine what it is during a live run, you have to count the number of false positives in a time period, and divide that by total number of legitimate edits that were made in that time period. Also as Cobi said, the false positive rate is not determined by false positive reports. We divide our dataset up into two parts - 2/3 to use for training and 1/3 for trialing. That 1/3 is run through the network and is used for rate calculations. This should be a very accurate way of calculating it, assuming a representative dataset. Crispy1989 (talk) 07:41, 14 November 2010 (UTC)[reply]
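
The calculation described in the two replies above (hold out a third of the labelled dataset, classify it, and divide the wrongly flagged good edits by all good edits) can be sketched like this. The dictionary keys, function names, and the stand-in `classify` callable are assumptions; the real network and dataset format are not shown on this page.

```python
import random


def split_dataset(edits, seed=0):
    """Shuffle and split into 2/3 for training and 1/3 for trialing."""
    shuffled = list(edits)
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


def false_positive_rate(trial_edits, classify):
    """Good edits flagged as vandalism, divided by all good edits.

    Each edit is a dict with a boolean "is_vandalism" label (hypothetical
    format). classify(edit) returns True if the edit would be reverted.
    """
    good = [e for e in trial_edits if not e["is_vandalism"]]
    if not good:
        return 0.0
    flagged = sum(1 for e in good if classify(e))
    return flagged / len(good)
```

Note that the denominator is the number of good edits examined, not the number of reverts, which is why sampling the bot's recent contributions gives a different and much larger figure.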

This edit was reverted within seconds for no obvious reason. -- Smjg (talk) 16:36, 16 November 2010 (UTC)[reply]

[26] the improvement of a poor quote translation was reverted within seconds (see http://www.nybooks.com/books/imprints/classics/the-way-of-the-world/ for a source of the correct quote translation)

Reverting page deletion by author

This edit (now deleted) reverted the blanking of a page by its author. It is very confusing for an author who realises his page is inappropriate and blanks it, which is a frequent occurrence, when the inappropriate page is restored instead of being tagged db-g7. JohnCD (talk) 10:09, 3 November 2010 (UTC)[reply]

We'll add an exemption for the author of the page. Crispy1989 (talk) 16:31, 3 November 2010 (UTC)[reply]

False positives

See User_talk:ClueBot_Commons#Cluebot_-too_many_false_positives: too many false positives on the Wikipedia science reference desk, and the error report ID function seems broken. Sf5xeplus (talk) 13:25, 3 November 2010 (UTC)[reply]

Cluebot NG is not meant to edit anything outside of the main namespace. This is apparently a misunderstanding between the developer of the core and the developer of the Wikipedia interface. The interface will be changed to ignore edits not in the main namespace, unless at some point in the future we train separate neural networks for separate namespaces. Crispy1989 (talk) 16:33, 3 November 2010 (UTC)[reply]

That page was on the optin list. I've removed everything from the optin list, for now. Keep in mind, when users add pages there, they are inviting the bot somewhere where it has not been tested or designed for. It may work well. It may not. -- Cobi^(t|c|b) 17:33, 3 November 2010 (UTC)[reply]
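
The scoping rule discussed in this thread, main namespace only plus any explicitly opted-in pages, amounts to a filter like the one below. The function name and the shape of the opt-in list are assumptions; the only fixed fact is that MediaWiki numbers the main (article) namespace as 0.

```python
MAIN_NAMESPACE = 0  # MediaWiki's numeric ID for the article namespace


def should_review(namespace, title, optin_titles=frozenset()):
    """Review main-namespace edits, plus pages explicitly opted in."""
    return namespace == MAIN_NAMESPACE or title in optin_titles
```

With an empty opt-in list, as after the cleanup described above, only article edits reach the classifier.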

Thanks. Sf5xeplus (talk) 19:00, 3 November 2010 (UTC)[reply]

Why did you populate User:ClueBot NG/Optin from User:ClueBot/Optin? No one requested that ClueBot NG revert pages outside of the article namespace. Peter Karlsen (talk) 20:16, 3 November 2010 (UTC)[reply]

I copied all of the control pages from ClueBot's userspace. I forgot to remove all but the comment at the top. -- Cobi^(t|c|b) 20:30, 3 November 2010 (UTC)[reply]

Then I'm concerned. Can you at least consider widening the scope to include the Template namespace? Philip Trueman (talk) 18:37, 5 November 2010 (UTC)[reply]

The method can be expanded to work with pretty much any namespace or content, but it should use a separate neural network, and must be trained on a training set from that namespace. I'd like to get the core perfected and approved for the main namespace first, then we'll look into generating datasets for other namespaces. If necessary, while getting it to work with other namespaces, the old heuristics-based cluebot could be run just on those namespaces. Crispy1989 (talk) 00:35, 6 November 2010 (UTC)[reply]

Review Interface

I already mentioned this, but it's important, so I'll bring it to attention again. Cluebot NG's accuracy depends almost entirely on its dataset. By fixing its current dataset, and helping to classify new edits, you can help to greatly improve its performance. We have an interface specifically designed for this, which should make it easy for volunteers to help out. The interface can be found at this link. You need a Google account to use it, and we need to authorize you to access it. If you'd like to help out, please follow the link and go to the signup section. Help is needed, and greatly appreciated! Crispy1989 (talk) 22:07, 3 November 2010 (UTC)[reply]

Thank you to all the people helping with dataset classification! We've added some stats for who's doing what right here.

We're looking to double our current dataset size (currently a little over 30,000 edits) and replace it with a model closer to reality by using a truly random sampling of data. The interface is currently loaded with around 70,000 edits - about a day's worth. Each edit must be reviewed by at least two different people (more if the first two disagree). If we can get this data, I believe the bot's performance can significantly improve, even from what it's at right now. Crispy1989 (talk) 17:17, 4 November 2010 (UTC)[reply]
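
The review rule stated above (at least two reviewers per edit, more if the first two disagree) could be applied with a small consensus check. How disagreements are actually resolved is not stated here, so the strict-majority rule and the function name below are assumptions.

```python
from collections import Counter


def consensus(labels, min_reviews=2):
    """Return the agreed label, or None if more reviews are needed.

    Assumption: a label is accepted once at least min_reviews exist and
    one label holds a strict majority of them.
    """
    if len(labels) < min_reviews:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes * 2 > len(labels) else None
```

Under this rule two matching reviews settle an edit immediately, while a split pair stays in the queue until a third reviewer breaks the tie.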

Is there any, umm, help or documentation for this interface? I've activated my Google account, I've got as far as the screen that asks for my Wikipedia username to match my Google email id, and I'm looking at a page that says "Stored.". Now what do I do? Philip Trueman (talk) 18:48, 5 November 2010 (UTC)[reply]

Sorry, I need to fix that message to be more intuitive. It means you were added to the list of users for admins to review. I've approved you. You should be getting an e-mail about it. -- Cobi^(t|c|b) 19:40, 5 November 2010 (UTC)[reply]

Thanks - it's working now, and I've done a few. A few comments: If the dataset is aimed at mainspace articles only, why was I offered a User talk space edit? I classified it as per its space - in an article it wouldn't've been good but it was fine as part of an attempt at dialogue. Also, some of the edits were edits made by approved bots that might equally well have been made by a human (e.g. RjwilmsiBot adding {{Persondata}}). Couldn't these have been automatically classified as OK? Finally, if there's any disagreement with another reviewer about any of my classifications then I'd appreciate learning about it, if only to improve my own performance. Philip Trueman (talk) 01:32, 6 November 2010 (UTC)[reply]

Thanks for your help. Although the bot currently is only being trained on mainspace articles, a few edits from other namespaces may have made their way into the random edit set. Classify these as you would normally (constructive, vandalism, skip). They simply won't be used for the main namespace training. We plan on expanding in the future to handle other namespaces as well, in which case, classifications from other namespaces will be used. We really don't want to assume any bot always makes good edits. Although this is usually the case in practice, we'd prefer to have every edit verified. Just classify these as constructive as usual (unless it's another anti-vandal bot with a false positive or something - in this case, it should probably be skipped). As for the question about being notified of any disagreement, I'll defer that to Cobi (the developer of the interface). Crispy1989 (talk) 01:43, 6 November 2010 (UTC)[reply]

I use Windows 7 and IE8. I had an edit come up in the review interface that caused IE to go into Compatibility View, and the diff it showed was blank. Sorry, can't remember which edit it was, but I marked the edit as 'Skip' (because I couldn't classify it) with a comment. Philip Trueman (talk) 12:35, 6 November 2010 (UTC)[reply]

Quiff

ClueBot NG gave a final warning to an IP editor for this edit (2 diffs). Is it just detecting that he's restoring content that was reverted by someone else, or is there something about the edit itself that's triggering ClueBot? Also, the warning given on the user's talk page suggests that ClueBot NG was giving the final warning simply because of the addition of the word "an" ... does ClueBot by default only give a diff link for one diff, or is it actually only the second diff that triggered ClueBot NG?—Soap— 22:40, 4 November 2010 (UTC)[reply]

Feel free to move this up to Section 2 if it fits better ... I'm not meaning to make my edit stand out from all the others. —Soap— 22:49, 4 November 2010 (UTC)[reply]

The only concept the bot has of restoring old content is if the edit summary says so. In this case, the edit was probably identified as vandalism because it had borderline statistics (but would not ordinarily be considered vandalism), combined with the fact that the user had vandalized a number of times before. Statistically, if a large portion of a user's previous edits have been vandalism, it's much more likely for their current edits to be vandalism. Alone this is not enough to trigger a vandalism classification, but it can push over the edge what might otherwise be a borderline edit. As the dataset grows, this will become more fine-tuned and less likely to be identified as vandalism, and the percentage of past edits that have been vandalism will remain a useful statistic in estimating good faith/bad intentions. Crispy1989 (talk) 00:25, 5 November 2010 (UTC)[reply]

Could you also clarify why the user was warned for adding "an"? - Kingpin¹³ (talk) 15:09, 5 November 2010 (UTC)[reply]

The neural network functions by analyzing statistics. Because "an" is a common word, word-based statistics do not apply. What the neural network sees is a user inserting a short word into the middle of an article - a user that already has several warnings. Without the existing warnings, the score would end up being 0.5 or less, well below the 0.95 threshold it's currently at. Multiple previous warnings significantly increase the probability that a given edit is classified as vandalism. Increasing dataset size and including instances where users with multiple warnings made constructive edits will decrease this kind of occurrence. Note: Removing this statistic from the neural network decreased catch rate when normalized to the same false positive rate, so this statistic is helpful overall to the performance of the method. Crispy1989 (talk) 16:46, 5 November 2010 (UTC)[reply]

Helping to classify ..

This [27] appears in the Dry Run but does not seem to be clear vandalism to me. Maybe greater weight needs to be given to the context of the change?

Is it actually worthwhile for humans to review the whole of the Dry Run? If so, what's the best way to flag what's been reviewed?

BTW, I tried to get myself a Google id to help out with reviewing the dataset and ended up writing a scathing comment about the user hostility of the application process. Philip Trueman (talk) 06:40, 5 November 2010 (UTC)[reply]

There are no (preset) weights. Statistics are combined using a neural network. To correct outlying datapoints like this, the dataset must grow. It's not really worthwhile to review the entire dry run - particularly since it's with an older version. The dataset review interface combines edits randomly from several sources - one of these sources is edits that the bot is unsure of. So the dataset review interface is by far the best way to help. Crispy1989 (talk) 06:51, 5 November 2010 (UTC)[reply]

possible false positive

I may be wrong, but this edit didn't seem like vandalism to me. --Ixfd64 (talk) 03:55, 8 November 2010 (UTC)[reply]

This can be fixed by enlarging the dataset, and by fine-tuning word categories. Crispy1989 (talk) 12:21, 8 November 2010 (UTC)[reply]

[28] You'll understand I'm a bit miffed. Philip Trueman (talk) 09:06, 8 November 2010 (UTC)[reply]

Exactly the false positive I was coming to report. I reverted ClueBot's reversion. Curious to see if I get a warning too. :D Millahnna (talk) 09:08, 8 November 2010 (UTC)[reply]

Ouch. This shouldn't be happening at all. The real issue can be fixed by enlarging the dataset (the current dataset doesn't contain many vandalism reversions) ... but there should be a hard threshold of edits per user. Users with more than 50 edits shouldn't be reverted at all - this is a bug in the Wikipedia interface code. We'll correct it ASAP. Crispy1989 (talk) 12:25, 8 November 2010 (UTC)[reply]

It's OK, I'm not offended. Well, not much. Strangely, I've just been presented in the review interface with one of my own reversions. I added a comment asking for a 'Recuse' button ... Philip Trueman (talk) 14:45, 8 November 2010 (UTC)[reply]

[29], [30]

Poor quality edits to poor quality article rather than deliberate vandalism. Philip Trueman (talk) 05:36, 9 November 2010 (UTC)[reply]

In my opinion, false positives of poor quality edits aren't quite as bad as false positives of good quality edits - but they still shouldn't happen. These should also be able to be prevented by expanding the dataset. The second of these two even looks like it's so poor quality that it could be borderline vandalism. Crispy1989 (talk) 11:20, 9 November 2010 (UTC)[reply]

[31] Not vandalism. Not enough good edits in the database with the word 'toilets', right? Philip Trueman (talk) 08:52, 10 November 2010 (UTC)[reply]

[32] Certainly not vandalism; an improvement if anything. Philip Trueman (talk) 13:19, 10 November 2010 (UTC)[reply]

Both of these can only be explained by the dataset not being large enough. I'm not really sure why the second one was misclassified - it must just be a hole in the training data. Crispy1989 (talk) 13:23, 10 November 2010 (UTC)[reply]

[33] Ooops! Just asking, but is this a case where the bot would have reverted itself back again? The word 'iincluding' is presumably rare in good edits. If a diff counts as vandalism both ways then surely it should hold off. Also, how much does the bot know about article categories? Words like 'love' and 'hate' and 'pregnant' are normal in, say, Category:Serial drama television series when they're not in, say, Category:Chemical elements. Philip Trueman (talk) 08:46, 13 November 2010 (UTC)[reply]
The word "iincluding" is not present in the dataset at all, so it would not contribute at all to the Bayesian score. If a word has never been seen before, it is not assumed to be good or bad, beyond a few basic things to detect if it's gibberish or leetspeak. Also, for a word to contribute to the score at all, it has to appear in a certain minimum number of articles total (currently 6). You bring up a good point about the words in the categories. Right now, it assesses which words belong in context by checking added words against words that already appear on the page. This is usually sufficient, but as you pointed out, does have some holes. I may be able to figure out a way to determine statistical word relations - not as in a Markov chain, or a Bayesian classifier, but in a broader sense to sort-of automatically categorize an article. Crispy1989 (talk) 16:05, 13 November 2010 (UTC)[reply]
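The minimum-article rule described above can be illustrated with a small sketch. This is not the bot's actual code: the function names and the data layout (a list of article title and word list pairs) are hypothetical, and only the cutoff of 6 articles comes from the comment above.

```python
from collections import defaultdict

MIN_ARTICLES = 6  # cutoff quoted in the comment above


def article_counts(edits):
    """Count, for each word, the number of distinct articles it appears in.

    `edits` is an iterable of (article_title, list_of_words) pairs.
    This layout is an assumption made for the example.
    """
    seen = defaultdict(set)
    for article, words in edits:
        for word in words:
            seen[word.lower()].add(article)
    return {word: len(articles) for word, articles in seen.items()}


def scoring_words(added_words, counts, min_articles=MIN_ARTICLES):
    """Keep only the words that are allowed to contribute to a score.

    Words never seen, or seen in too few articles, are dropped rather
    than being treated as good or bad.
    """
    return [w for w in added_words if counts.get(w.lower(), 0) >= min_articles]
```

Under this sketch a never-seen word such as "iincluding" is simply filtered out before scoring, so it neither raises nor lowers the result.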

Riding an old hobby-horse

This edit [34] is fine as far as it goes, but clearly needed to go one revision further back. I don't know whether the 100 or so things the neural network takes into account include the identity of the editor of the version that would be rolled back to. In my experience if reverting a bad edit by an IP would mean rolling back to a version last edited by a similar IP then it's worth digging deeper, and I've modified my anti-vandal tool to warn the user in this case. I seem to recall that at least one anti-vandal bot had a rule not to revert in such a case, so as not to 'lock in' an earlier bad edit. Perhaps what's really needed here is a semi-protected page the bot can write to flagging up articles it thinks need human attention. Philip Trueman (talk) 10:25, 9 November 2010 (UTC)[reply]

There are a few things involved here in figuring out how to handle these situations. While it's true that if an edit is vandalism, immediately previous edits by the same user on the same article are probably also vandalism, it begs the question, why didn't the bot catch the earlier edit? The best thing to do is just to keep improving the bot (which I'm doing) and the dataset. It would definitely be possible to post borderline edits somewhere - the neural net generates a score which is compared against a threshold. The threshold (currently around 0.95) is calculated from a given false positive rate at dataset training/trial time. A second threshold could be set somewhat below this, where edits falling into that group could be posted somewhere. At the ~0.95 threshold it's currently catching 60% of vandalism with 0.25% false positives (calculated from the trial dataset). At a threshold of around 0.65, it gets over 90% of vandalism (with about 3% false positives). Maybe a threshold around 0.65 would be useful. Crispy1989 (talk) 11:38, 9 November 2010 (UTC)[reply]
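To make the threshold idea above concrete, here is a minimal sketch of how a threshold could be derived from a target false positive rate on a trial set, and how a second, lower threshold could route borderline edits to a review page. The function names and the "revert / review / ignore" labels are hypothetical; the bot's real calibration code may work differently.

```python
def threshold_for_fp_rate(scores, labels, max_fp_rate):
    """Pick a threshold so that at most `max_fp_rate` of constructive
    edits in the trial set score strictly above it.

    scores: network outputs in [0, 1]; labels: True where the edit is vandalism.
    """
    good = sorted((s for s, is_vandalism in zip(scores, labels) if not is_vandalism),
                  reverse=True)
    if not good:
        return 0.0
    allowed = int(max_fp_rate * len(good))  # constructive edits we may misclassify
    if allowed >= len(good):
        return 0.0
    # Edits are flagged when score > threshold, so at most `allowed`
    # constructive edits can end up above this value.
    return good[allowed]


def catch_rate(scores, labels, threshold):
    """Fraction of vandalism edits that score strictly above the threshold."""
    bad = [s for s, is_vandalism in zip(scores, labels) if is_vandalism]
    return sum(s > threshold for s in bad) / len(bad) if bad else 0.0


def triage(score, revert_threshold, review_threshold):
    """Two-threshold routing: revert, list for human review, or ignore."""
    if score > revert_threshold:
        return "revert"
    if score > review_threshold:
        return "review"
    return "ignore"
```

With the figures quoted above, `triage(score, 0.95, 0.65)` would revert only the highest-scoring edits and list the band between 0.65 and 0.95 for human attention.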

I thought ClueBot used rollback anyway, but that's not important. It's not the same editor, so that's probably why it didn't analyze. (X! · talk) · @538 · 11:54, 9 November 2010 (UTC)[reply]

Perhaps I didn't make myself clear. It's precisely because rollback only rolls back consecutive edits by exactly the same editor that this case needs to be trapped. It is frequently the case that when the same article has consecutive edits by different IPs that are in the same narrow range that in fact they were made by the same person. If the latest edit is vandalism then the earlier ones are suspect. Using rollback in this case runs the risk that earlier vandalism may become locked in - further vandalisms will show up, to bots and in anti-vandalism tools, as bad diffs, but the earlier vandalism might remain in place for some time. Philip Trueman (talk) 13:11, 9 November 2010 (UTC)[reply]

If you have a good suggestion as to how to reliably determine when an IP is sufficiently similar to warrant reverting it as part of the rollback, I'd like to hear it, but I don't think that can be done without adding more false positives. As for the idea of a page for review, that could be done. Do you just want it for any previous IP who is in the same /24? -- Cobi^(t|c|b) 21:19, 9 November 2010 (UTC)[reply]

The only other piece of information I can think of that a bot could go on is how recent the previous edit was. If the same article gets hit several times in a short period by several IPs in a narrow range, and one edit is clearly vandalism, then (in my experience) all those edits are suspect, especially if it's the same line that's been edited. If there's a longish gap then it sometimes turns out that it's a case of different editors at an educational establishment editing a page about that establishment and the previous edit was in good faith. That's why I have my anti-vandal tool warn the user rather than revert back further. I have the range set at /16 and (I'm guessing here, I don't have any statistics) I'd say the previous edit is also bad at least 80% of the time. That's nowhere near enough for a bot, of course. BTW, I also have the tool hold off if it would revert to a version by a previously reverted vandal - is that worth considering? My preferred solution would be to hold off if the IPs are in the same /16 range, list the article for attention by a human, and (ultimately) give the bot sysop rights so it can briefly semi-protect the page if it considers that more than one of the IPs in that range has recently made a bad edit to the article.

Slightly separately, the idea of a page to log edits that are just below the threshold is attractive on the face of it, but in practice it may prove difficult to make it useful - many edits that are just below the threshold will be bad and will have been reverted by humans with good anti-vandal tools almost immediately, so it'll be out-of-date almost immediately. Philip Trueman (talk) 02:42, 10 November 2010 (UTC)[reply]

Interesting ideas. I'll leave it up to Cobi whether it's feasible or not to put additional rules to prevent reverting to a previous edit if the previous edit is potentially vandalism. I can say that this would likely incur a significant delay to fetch the extra information (although I'm not certain of this). Also, I believe it's possible to get the bot accurate enough to the point where the previous edit would have already been caught if it were vandalism. Another thing to consider is that vandals tend to follow a pattern - if the current edit is reverted, it's likely previous edits in the same style would also be reverted.

But your suggestions give me a few ideas for how to potentially improve accuracy. It may be possible to add an input to the neural network that is the time of the previous edit. Also, I may be able to add a parameter in cases where both the current and previous revisions are made by IPs - the "distance" between the IPs. This parameter would just be the smallest CIDR subnet size that contains both IPs. I'll look into this. Crispy1989 (talk) 02:58, 10 November 2010 (UTC)[reply]
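The IP "distance" proposed above (the smallest CIDR subnet containing both addresses) can be computed with Python's standard `ipaddress` module. The sketch below is only an illustration; the function names are hypothetical and the addresses in the usage note come from the reserved documentation range.

```python
import ipaddress


def common_prefix_length(ip_a, ip_b):
    """Return the prefix length of the smallest subnet containing both IPs.

    32 means the IPv4 addresses are identical; smaller values mean the
    addresses are further apart.
    """
    a = ipaddress.ip_address(ip_a)
    b = ipaddress.ip_address(ip_b)
    if a.version != b.version:
        raise ValueError("cannot compare an IPv4 address with an IPv6 address")
    # XOR leaves a 1 at every bit where the addresses differ; the highest
    # differing bit determines how many leading bits are shared.
    differing = int(a) ^ int(b)
    return a.max_prefixlen - differing.bit_length()


def smallest_common_subnet(ip_a, ip_b):
    """Return the smallest network (as an ipaddress network object) holding both IPs."""
    prefix = common_prefix_length(ip_a, ip_b)
    return ipaddress.ip_network(f"{ip_a}/{prefix}", strict=False)
```

For example, `common_prefix_length("192.0.2.25", "192.0.2.200")` gives 24, so a "same /24" or "same /16" rule becomes a simple comparison such as `common_prefix_length(a, b) >= 16`.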

Here's [35] an excellent example of what I'd like the bot to hold off doing - or, at least, ask a human for help. Presumably the first bad edit wasn't bad enough, and the second was by an IP that had already been reverted on that page that day. So when it reverted the third bad edit it reverted to a version by an editor it had previously reverted. Philip Trueman (talk) 09:56, 11 November 2010 (UTC)[reply]

"Expand the dataset"

Firstly, I think this bot is doing a great job, however, it is getting a large number of false positives. Some of these false positives are things which can easily be fixed, at the source code for the bot. Such as not reverting experienced users, ignoring discussion pages, not reverting CSD-tagging etc. (I believe these things are being addressed in the code, but I'm not sure about that :D) and it's really good to see these getting resolved. However, the bulk of the false positives seem to be down to not having a large enough dataset. Some of these are understandable, for example edits which in the context of the article are good, but appear to be vandalism otherwise. But the large majority seem to be edits which can't really be said to look anything like vandalism. I'd just like to say I think it's key for this bot to not assume that edits are vandalism. Saying "we need to expand the dataset so the bot picks up more vandalism" makes sense, saying "we need to expand the dataset so the bot picks up less false positives" doesn't, for me anyway. - Kingpin¹³ (talk) 11:57, 9 November 2010 (UTC)[reply]

That's how artificial neural networks work. In this case, it is basically a classifier - either vandalism or not, with a given certainty. If the neural network has never seen a given edit before, its internal weights are not trained to classify it, so it may end up giving an unexpected output. In fact, the network needs many more good edits in its set than bad edits to not make false positives. — HELLKNOWZ ▎TALK 12:01, 9 November 2010 (UTC)[reply]

Well that's kind of what I mean. I understand that the reviewed edits are either "good" or "bad". So if you have only reviewed bad edits, the bot is going to be more likely to assume that edits are bad. But I think it's gone too far towards assuming the edits are bad. Maybe reviewing more good edits would deal with this - I don't know. Basically, it seems too concentrated on identifying bad edits, and not enough on identifying good edits. - Kingpin¹³ (talk) 12:06, 9 November 2010 (UTC)[reply]

Right now the dataset is roughly 50/50 vandalism/constructive. The dataset we are generating with the interface will come from a day's worth of edits (roughly 70k edits), and will have a more realistic ratio. -- Cobi^(t|c|b) 12:13, 9 November 2010 (UTC)[reply]

I should point out that the false positive rate is selectable, and can be reconfigured at any time. I should also point out that the false positive rate is currently set at 0.25% - and that the actual number of false positives is *below* this. For the issues in the code, here's a quick overview:

Redirects - Fixed. At the beginning of the trial, there was no metric to recognize these to input to the neural net, so the neural net just saw it as shouting. This metric has been added.
Various tags - Fixed. A metric was added for certain tags, and template names are now removed before statistical processing.
Non-main namespace pages - Fixed. This wasn't actually a bug, but was due to importing the old Cluebot's opt-in list. It has since been cleared.
Not reverting experienced users - Fixed. This was actually two separate problems. The first is that the edit threshold was initially too high, and has been decreased. The second is that the WP API was returning errors in a few cases, so the number of edits was being treated as zero. Error handling has been added to solve this.

Even context-specific false positives are at a much lower rate than existing bots, and can continue to be improved with a larger dataset.

Also, H3llkn0wz is right about the neural net - increasing dataset size and quality will both increase vandalism catch rate and decrease false positives. Cluebot-NG's false positive rate is very, very low, considering the sheer number of edits it reviews. Now, after fixing the programmatic issues, it's only getting a few false positives a day. As I mentioned earlier, Cluebot-NG's false positive rate is very low, but the false positives it does have aren't necessarily the same ones you'd expect from another bot.

About the dataset ratio, this actually doesn't really matter. Having a dataset ratio that differs from reality will affect the average result score from the neural net, but remember that the threshold is calculated and calibrated using a set false positive rate, so even if the average score is higher in general, the threshold will also be calculated to be higher, and will normalize the results. Crispy1989 (talk) 12:21, 9 November 2010 (UTC)[reply]
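The "not reverting experienced users" fix listed above comes down to not treating a failed edit-count lookup as zero edits. A minimal sketch of that idea follows. The function name, the `fetch_edit_count` callback, and the choice to leave the edit alone when the lookup fails are all assumptions for illustration; the 50-edit figure is the one mentioned earlier on this page, and the actual threshold may have changed since.

```python
def should_skip_user(fetch_edit_count, user, max_edits=50):
    """Return True if the bot should leave this user's edit alone.

    `fetch_edit_count` is a hypothetical callable that returns the user's
    edit count, returns None when it is unknown, or raises on an API error.
    """
    try:
        count = fetch_edit_count(user)
    except Exception:
        # Assumption: a failed lookup is "unknown", not "zero edits",
        # so stay on the safe side and do not revert.
        return True
    if count is None:
        return True
    return count > max_edits
```

The point of the sketch is only the error path: an API failure no longer makes an experienced editor look like a brand-new account.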

Multiple reverts

I'm not going to argue that these reverts shouldn't be done – it's quite obvious they should have been. However it was my understanding that the old ClueBot would not revert the same thing twice. Was this just a coincidence or was that true? This bot doesn't seem to follow that same pattern. Is that intentional or not? --Shirik (Questions or Comments?) 18:31, 9 November 2010 (UTC)[reply]

Cluebot-NG does follow the same behavior of the old Cluebot in this regard - the interface to Wikipedia (of which this functionality is a part) is largely just copied, and is the same code. Cobi knows the exact logic behind it, but my understanding is that, by default, it does not revert the same user/article combination twice in the same day, with some exceptions. These exceptions are for the article of the day (which this is), and any articles listed in the "angry opt-in list". Crispy1989 (talk) 18:49, 9 November 2010 (UTC)[reply]
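As described above, the rule is roughly "do not revert the same user/article combination twice in the same day, except on the article of the day and on articles in the angry opt-in list". The sketch below is one way to express that; the class and parameter names are hypothetical, "same day" is interpreted here as a rolling 24-hour window, and the real logic in the shared interface code may differ.

```python
from datetime import datetime, timedelta


class RevertLog:
    """Remembers recent reverts so the same user/article pair is not reverted twice."""

    def __init__(self, angry_opt_in=(), window=timedelta(days=1)):
        self.angry_opt_in = set(angry_opt_in)  # articles exempt from the limit
        self.window = window                   # assumption: rolling 24 hours
        self._last_revert = {}                 # (user, article) -> time of last revert

    def may_revert(self, user, article, now, featured_article=None):
        """Return True if a revert of `user` on `article` is allowed at time `now`."""
        if article == featured_article or article in self.angry_opt_in:
            return True
        last = self._last_revert.get((user, article))
        return last is None or now - last >= self.window

    def record_revert(self, user, article, now):
        """Store the time of a revert that was actually made."""
        self._last_revert[(user, article)] = now
```

A caller would check `may_revert(...)` before acting and call `record_revert(...)` afterwards, passing the current article of the day as `featured_article`.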

Another false positive for your dataset

IP added vandalism, another IP removed it and ClueBot tagged the second edit as vandalism. Timing issue I'm guessing? Millahnna (talk) 15:54, 11 November 2010 (UTC)[reply]

This is actually a real false positive. The dataset needs more instances of people reverting vandalism. Right now it has very few. The means used to generate it apparently don't generate a random sampling. As soon as the review interface generates a large enough dataset from random edit reviews, we'll replace our current dataset entirely. Crispy1989 (talk) 16:05, 11 November 2010 (UTC)[reply]

Just kind of a random question, but, what does the "NG" stand for? Allmightyduck  What did I do wrong? 03:47, 12 November 2010 (UTC)[reply]

Believe I read it stands for "Next Generation". N419 BH 07:44, 12 November 2010 (UTC)[reply]

"This is probably a silly question, but what does the "NG" stand for? New Generation? --Ixfd64 (talk) 20:22, 4 November 2010 (UTC) [reply]

Our intent was Next Generation. Crispy1989 (talk) 20:33, 4 November 2010 (UTC)"[reply]

Another Dataset Plea

Looking at some of the current data from the review interface, it seems that our training dataset is significantly biased. The bot's current performance, while still better than existing bots, is significantly inferior to what it could be. This is due entirely to the bias in the dataset. I'd like to scrap our entire existing dataset and replace it with the truly random sampling (and verified) edits from the review interface. But not enough edits have been reviewed yet to provide sufficient data for training. Is there anything we can do to make it easier to review edits, or make it seem more worthwhile to people? Thanks to those who are already helping! Crispy1989 (talk) 15:19, 12 November 2010 (UTC)[reply]

I reply to this with some diffidence, because I've already talked enough on this page. (Thank you! Thank you!) But I do have a few comments. Firstly, please give us some time: quite a few people, me included, have already helped out, and I see no indication yet of contributors dropping out - rather the reverse. But this whole project is very much a volunteer effort and we all have real lives elsewhere, no matter what may seem to be the case here. Secondly, it really would be helpful to have some feedback on our efforts: this is, I understand, a fairly basic result in experimental psychology - performance improves with feedback, even negative feedback, compared with no feedback. Even putting up a message to say "At the current rate we expect to go live with a fully-reviewed dataset in the middle of February" would give us a target to beat. Thirdly, it would be nice to have specific feedback on the quality of the classification of difficult edits. No-one expects an individual thank-you for correctly classifying the addition of a {{Persondata}} template or of "MRS FINKELSTEIN IS A GREAT BIG CHODE!!!", but in my experience some of these edits have proved quite difficult, and it's going to be the bot's ability to classify borderline cases correctly that will distinguish it from the rest, and justify the effort that goes into building it. I didn't really like the suggestion I saw somewhere that if two reviewers disagree then the edit will be dropped from the dataset - surely that is a recipe for blunting the sensitivity of the bot? If that is the case then I wonder what is happening to all those comments I've placed on difficult edits. What needs to happen is for those edits to be reviewed even more carefully, and perhaps even put up for community discussion. We'd all learn something, the reviewers as well as the bot. Enough! Philip Trueman (talk) 20:10, 12 November 2010 (UTC)[reply]

About the reviewing interface, it is really easy to work with. I do regularly get an error message that makes me have to refresh. I get both generic error messages telling me something went wrong and I need to refresh, as well as messages telling me that it is out of revisions. I like that there is a counter in the corner, so one can set a goal for themselves as 'I will review x amount of revisions this session', and then do just that.

About getting people to participate, this same problem is faced all the time by wikiprojects who organize 'drives' to improve certain parts of the encyclopedia. Some of the techniques I see used in these drives are: fixed timespans, clear goals, 'rewards' (meaning: glorified thank-you notes), and advertisement on places such as the 'Community bulletin board' (on the community portal). Arthena ^(talk) 22:57, 12 November 2010 (UTC)[reply]

Review interface is fine. Though it would be nice if 30% of edits for review were not my own bot's addition of persondata. To encourage wider participation include some stats (e.g. "after the n000 reviews bot accuracy has improved 5%" or whatever it is) and just politely spam the various tech village pump, bot owner noticeboard, huggle talk pages etc. Can we establish how many reviews are needed to reach production-level accuracy, set a target for the review phase? Rjwilmsi 00:13, 13 November 2010 (UTC)[reply]

Interesting points. The following have been added to Cobi's and my TODO list:

For giving feedback, it's not really possible to set a certain goal, because it will always be improvable. It would work fine right now. Rjwilmsi's suggestion about giving statistics on the bot's current accuracy, given the current dataset, is definitely possible, though. We're going to work on setting up a system to retrain and retrial the bot daily, each time using the new current dataset. The results of these trial runs will be posted. We may also be able to take this data over a period of a number of days and create things such as graphs of dataset size versus accuracy.
For discarding edits where there's some disagreement, we've decided to change this to a scheme where every edit is always classified as something (Vandalism, Constructive, or Skip), and that the classification that is used must have at least 3x the votes as any other classification.
For getting feedback on difficult edits, we've discussed ways to do this, and it may be possible to set something up, but it would likely require some restructuring of the database. The idea is to allow users to view a list of all edits they've classified, that others have classified differently, and allow them to view and add more comments, and change their existing vote. But the internal database currently cannot support this. The best way to implement this will be to wait until all edits currently in the database are classified (10,000ish), then upgrade the database. In the mean time, we'll see if there's any halfway point (possibly viewing controversial edits without being able to change the past vote) that we can implement without reconstructing the db.
Continue to try to figure out the few bugs that are causing random (but harmless) occasional error messages.

Crispy1989 (talk) 16:21, 13 November 2010 (UTC)[reply]
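The 3x voting scheme in the list above can be sketched as follows. The function name and the label strings are hypothetical; the minimum of two reviewers comes from the earlier comment that each edit must be reviewed by at least two different people, and how the real interface handles ties or empty vote lists is not stated here.

```python
from collections import Counter


def decide(votes, ratio=3, min_votes=2):
    """Return the winning classification, or None if the edit needs more reviews.

    A label wins only if it has at least `ratio` times the votes of every
    other label and at least `min_votes` reviews have been cast in total.
    """
    counts = Counter(votes)
    if sum(counts.values()) < min_votes:
        return None
    ranked = counts.most_common()
    winner, top = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    return winner if top >= ratio * runner_up else None
```

For instance, two "vandalism" votes against one "constructive" vote would stay undecided, while three against one would settle on "vandalism".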

Providing a better user experience

ClueBot NG seems to catch vandalism much better than the old ClueBot did. However, we must not forget that regardless of how amazingly well such statistical techniques as artificial neural networks work, there will be false positives; it will never be possible to recreate 100% of the brain of a human RC patroller in computer software. When users who make acceptable edits have them reverted, misunderstandings arise. For example, see Old revision of User talk:ClueBot Commons/Archives/2010/November#are anons not allowed to post subst:prod.

I believe that a concise, informative FAQ page is a necessity if we are to approve this bot. We need to explain that:

The bot is not perfect, and it will never reach 100% accuracy, although its false positive rate has been set to revert only 1 in 400 legitimate edits. This is to help Wikipedia remain free of vandalism.
There are certain types of edits that the Wikipedia community does not find acceptable. (Summarize the vandalism policy here, including the different types of vandalism.)
The bot's revert of a user's edit does not necessarily mean that it is unacceptable.
If the user believes that his edit is not vandalism, he may repeat the edit, and the bot will not take action. (Include instructions for reverting the bot using undo, maybe even a link in the talk page message.)
The bot operators are open to suggestions of how to improve the bot, including reports of false positives. PleaseStand ^(talk) 22:02, 12 November 2010 (UTC)[reply]

A FAQ like this would indeed be useful, and I'll work on writing something up. A few comments on your list, though:

Several of these points are already mentioned in the warning the bot posts on user talk pages (although it can't hurt to have it be elsewhere in a FAQ as well).
It's probably a good idea to emphasize, "If this edit was made in good faith, do not be afraid to post a false positive report, and clear your good name." I can understand that new users could potentially be intimidated by a big warning, so something to this effect would probably be helpful.
It may not be the best idea to make it clearly apparent that the bot will not re-revert the same edit. Even now, without this fact being made clear, a significant amount of vandalism is being caught, but slips through when the user re-vandalizes the page. This behavior of the bot is necessary (unless the false positives are eventually somehow reduced to an incredibly low amount), but making it apparent to vandals, and even providing links for them to re-vandalize in one click, could drastically reduce the actual effectiveness of the bot.
The old Cluebot has a nice user-friendly false positive reporting mechanism. When Cluebot-NG goes into production, we'll bring this interface live again.

Crispy1989 (talk) 16:32, 13 November 2010 (UTC)[reply]

Old topic, but I wouldn't put "and clear your good name", that implies that the reversion is saying something about their name in the first place. It might be good to compare it to a spam filter as well, since people understand that those sometimes have false positives. Gigs (talk) 21:00, 21 November 2010 (UTC)[reply]

Glitch?

what triggered this? Choyoołʼįįhí:Seb az86556 ^{> haneʼ} 05:41, 14 November 2010 (UTC)[reply]

Looks like a dataset completeness issue to me. Crispy1989 (talk) 07:43, 14 November 2010 (UTC)[reply]

Some comments on the review mechanism

N.B. Bot owners - feel free to move this to a new page if you feel it doesn't belong here.

I'm sure that at least twice now I've had the same edit come up for review twice - one on the safety of microwave ovens and one about common given names in Azerbaijan. Does the interface not check whether a reviewer has seen the edit before?

I support Rjwilmsi in his comments about the frequency of his bot's edits. Maybe there should be a few, but I get the impression that we're heading for a skewed dataset. The idea of "one day's edits" is flawed - much will depend on what bots are active that day, whether school's in or out, and what the major news item of the day is (we're getting a lot of stuff about the 2010 mid-term elections right now). If it has to be a random selection then it needs to be sampled from a period of several weeks. Also, there should be a limit on the number of edits in the dataset by any given editor.

I could do with three more choices: "This needs a subject matter expert", "Content dispute", and "Recuse". I've had an edit of my own come up, and Rjwilmsi has been in the uncomfortable position of having to classify an edit by his own bot.

It would be nice to know what the criteria are for asking 'Are you sure?'. I've assumed that the answer is "This is the first time we've had that answer for this edit". If that's the case, then I'd say 0.5%-1% of the edits in the dataset are currently wrongly classified. Is that right? Some feedback on how many errors the reviewers have uncovered would be welcome.

Finally, how should we handle the case where the edit we're presented with is OK on its own, but is the latest of a string of edits by the same editor that cumulatively are bad? In my experience, this is a common case when doing RCP. Normally, I'd hit revert, but for the cumulative edit. What's the correct action here? Philip Trueman (talk) 11:10, 15 November 2010 (UTC)[reply]

You should not have gotten the exact same edit twice (maybe the user re-did their edit?).
We will be able to add a more random sampling over a span of a few weeks.
I don't see why there should be a limit, so long as it is proportional to the number of edits they make in a day.
"Subject matter expert" should be able to be handled by the skip or the refresh button. "Recuse" is handled by refreshing. "Content dispute" should be a skip. We could make a dedicated button to do the same as refreshing, though.
Furthermore, about a "recuse" button, clearly, if you get an edit of your own, it's constructive. It's not a courtroom, just dataset generation. If someone who makes vandalized edits has access to the interface anyway, there are much larger problems than someone classifying their own edit.
"Are you sure?" comes up in some circumstances where the current bot isn't sure on the edit.
We are working on adding some more feedback to the interface.
For an edit that isn't vandalism, but is in the same string of edits where vandalism occurred by the same user, just hit skip.

Hope that helps. -- Cobi^(t|c|b) 14:35, 15 November 2010 (UTC)[reply]

Yes, that helps. Thank you. Philip Trueman (talk) 16:14, 15 November 2010 (UTC)[reply]
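
For anyone following the sampling question above, here is a minimal sketch of drawing the review dataset at random from a pool spanning several weeks, with an optional per-editor cap. It is not the bot's actual code: the function name `sample_edits`, the dict keys `id`, `editor` and `day`, and the cap parameter are all assumptions made for illustration. With no cap, a uniform random draw already gives each editor a share roughly proportional to how much they edit, which is the behaviour Cobi describes; the cap implements the alternative Philip Trueman suggested.

```python
import random
from collections import defaultdict


def sample_edits(edits, sample_size, max_per_editor=None, seed=0):
    """Draw a random sample of edits from a multi-week pool.

    edits: list of dicts with hypothetical keys "id", "editor", "day".
    max_per_editor: optional cap on how many edits one editor may contribute.
    """
    rng = random.Random(seed)
    shuffled = list(edits)
    rng.shuffle(shuffled)  # uniform random order over the whole period

    taken_per_editor = defaultdict(int)
    sample = []
    for edit in shuffled:
        if len(sample) == sample_size:
            break
        editor = edit["editor"]
        if max_per_editor is not None and taken_per_editor[editor] >= max_per_editor:
            continue  # this editor has already reached the cap
        taken_per_editor[editor] += 1
        sample.append(edit)
    return sample
```

If the pool is smaller than the requested size once the cap is applied, the function simply returns fewer edits rather than relaxing the cap.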

Cobi has made some major improvements to the review interface based on received comments. One of the important asked-for improvements is that users can now view what others have voted on edits they've already reviewed by clicking on the counter in the top-right corner, and potentially change their own vote in retrospect. Note that you cannot view what others have voted before voting yourself - this is to prevent any prior bias. Also, the logic to determine the final result has changed, and contested edits are no longer discarded. Crispy1989 (talk) 02:12, 22 November 2010 (UTC)[reply]

Status Update

In the last day or so we've made some major improvements to the dataset. We discovered an issue with the dataset we've been using: the output of the dataset downloader was not matching the output of the live downloader, essentially adding some degree of randomness to some of the fields, and causing the bot's live performance to fall short of its theoretical performance based on a dataset trial. After rewriting the dataset downloader to use the same code as the live downloader, and regenerating the dataset, the bot's live performance is now much closer to its theoretical dataset performance. Previously the live bot was catching only about 10% of vandalism, about twice that of existing bots; now it's catching 50%-60%, in the range of the dataset trial. The false positive rate remains at the same 0.25% as before.

Also, the classifications from the review interface are now enough to start being used. There aren't enough to use as a training dataset yet, but there are enough to use for trials. This means two things. First, it means we can train the bot using our entire existing dataset, instead of reserving a portion for trials. This should slightly increase accuracy. Second, it means that the statistics we give about the accuracy of the bot are now guaranteed to be accurate and unbiased (the 50%-60% above is an example). Crispy1989 (talk) 14:50, 15 November 2010 (UTC)[reply]
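
To make the two figures quoted above concrete, here is a minimal sketch of how a catch rate and a false positive rate could be computed from a trial set of human-reviewed edits. This is an illustration and not the bot's evaluation code: the function name `trial_rates` and the input format are hypothetical, and it assumes the false positive rate means the share of legitimate edits that the bot reverted, which the discussion above does not spell out.

```python
def trial_rates(results):
    """Compute (catch_rate, false_positive_rate) from a trial set.

    results: iterable of (is_vandalism, bot_reverted) boolean pairs, where
    is_vandalism comes from human review and bot_reverted from the bot.
    """
    caught = missed = false_pos = true_neg = 0
    for is_vandalism, bot_reverted in results:
        if is_vandalism:
            if bot_reverted:
                caught += 1
            else:
                missed += 1
        else:
            if bot_reverted:
                false_pos += 1
            else:
                true_neg += 1

    vandal_total = caught + missed
    legit_total = false_pos + true_neg
    # A rate is undefined (None) if the trial set has no edits of that kind.
    catch_rate = caught / vandal_total if vandal_total else None
    fp_rate = false_pos / legit_total if legit_total else None
    return catch_rate, fp_rate
```

For example, a trial set with 100 vandalism edits of which 55 were reverted, and 400 legitimate edits of which 1 was reverted, gives a catch rate of 55% and a false positive rate of 0.25%. Because the trial edits are classified by reviewers and kept separate from the training data, the resulting figures are not inflated by the bot having already seen those edits.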