{"id":8664,"date":"2017-02-03T16:45:23","date_gmt":"2017-02-03T21:45:23","guid":{"rendered":"https:\/\/www.ezrasf.com\/wplog\/?p=8664"},"modified":"2017-07-14T09:19:03","modified_gmt":"2017-07-14T13:19:03","slug":"fun-with-regex","status":"publish","type":"post","link":"https:\/\/www.ezrasf.com\/wplog\/2017\/02\/03\/fun-with-regex\/","title":{"rendered":"Fun With Regex"},"content":{"rendered":"<p>We replaced the old ticketing system with a new one. Naturally, some people are concerned about losing access to old tickets, so we looked at exporting all of them. My coworker had the better method of getting the data out, with one issue.<\/p>\n<p>Because the old system used an HTML editor for a specific textarea, the content in it was difficult to read without expertise in HTML. Fine for a former Webmaster like myself, but few of the people who will need this read HTML like they do English.<\/p>\n<p>My first thought was to look for products that clean up HTML. I even got excited when I noticed <a href=\"http:\/\/tidy.sourceforge.net\/\">HTML Tidy<\/a> comes with our Linux OS, but that just converted the HTML to a standardized format of HTML. (And trashed the plain-text portions of the ticket.) I did not find options for removing the HTML with Tidy.<\/p>\n<p>So, my next thought was to try Regular Expressions (Regex). Certainly it ought to be doable. It is just that Regex is hard. No, difficult. No, turn-your-hair-gray-at-22 hard. But it can do anything if you put your mind to it. 
And I ran across <a href=\"http:\/\/regexr.com\/\">RegExr<\/a>, which really simplified the process by showing how my pattern worked on sample content.<\/p>\n<p>In the end I made a simple shell script to clean up the files.<\/p>\n<blockquote><p>#!\/bin\/bash<br \/>\n#############################################################<br \/>\n# Convert HTML to plaintext using sed.<br \/>\n# Created by Ezra Freelove, email<br \/>\n#############################################################<br \/>\n# Variables<br \/>\nWORKINGDIR=\"\/stage\/$1\"<br \/>\nif [ -d \"$WORKINGDIR\" ] ; then echo \"... found dir; continuing\" ; else echo \"... missing dir ; bailing\" ; exit 1; fi<br \/>\nDESTDIR=\"${WORKINGDIR}\/fixed\"<br \/>\n# Make a list of files to convert.<br \/>\ncd \"$WORKINGDIR\"<br \/>\nWORKINGLIST=`ls *.txt`<br \/>\n# Fix the files<br \/>\nmkdir -p \"$DESTDIR\"<br \/>\nfor WORKINGFILE in $WORKINGLIST<br \/>\ndo<br \/>\nsed -e 's|&lt;br[\\ \\\/]*&gt;|\\n|g' -e 's\/&lt;[^!&gt;]*&gt;\/\/g' -e 's\/&amp;nbsp;\/ \/g' -e 's\/&amp;lt;\/&lt;\/g' -e 's\/&amp;gt;\/&gt;\/g' \"$WORKINGFILE\" &gt; \"${DESTDIR}\/fixed_${WORKINGFILE}\"<br \/>\ndone<\/p><\/blockquote>\n<p>The regexes are:<\/p>\n<ul>\n<li><strong>s|&lt;br[\\ \\\/]*&gt;|\\n|g<\/strong> which means match HTML &lt;br&gt; tags (including the &lt;br \/&gt; form) and replace each with a newline character. The &lt;br&gt; tag tells a web browser to go to the next line.<\/li>\n<li><strong>s\/&lt;[^!&gt;]*&gt;\/\/g<\/strong> which means match from a less than (&lt;) out to the next greater than, as long as there is no exclamation point in between, and delete the whole match, brackets included. This handles the HTML elements and their attributes, things like &lt;p class=\"MsoPlainText\"&gt; or &lt;\/span&gt;. For some reason the date and username of the person who updated the ticket are stored as &lt;! 
2017-02-03 username&gt;, so I had to figure out how to keep them.<\/li>\n<li><strong>s\/&amp;nbsp;\/ \/g<\/strong> which means match the text &#8220;&amp;nbsp;&#8221;, which is a non-breaking space, and replace it with a normal space.<\/li>\n<li><strong>s\/&amp;lt;\/&lt;\/g<\/strong> which means replace the text &#8220;&amp;lt;&#8221; with a &#8220;&lt;&#8221;. And finally, the same thing for greater than: &#8220;&amp;gt;&#8221; becomes &#8220;&gt;&#8221;.<\/li>\n<\/ul>\n<p>An easy way to match all of these latter ones would be pretty cool, but I think dealing with the most common ones is good enough.<\/p>\n<p>Initially I was going to remove all the character codes like &amp;nbsp;. In the end, I decided that converting the ones I handled should help people. The rarer ones can be figured out easily if someone runs across them.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We replaced the old ticketing system with a new one. Naturally there are people who are concerned about losing access to old tickets. So we looked at exporting all the tickets. My coworker had the better method of getting out the data with one issue. 
Because the old system used an HTML editor for a [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":true,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":4,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[3011],"tags":[1797,1620,2463],"class_list":["post-8664","post","type-post","status-publish","format-standard","hentry","category-scripting","tag-html","tag-regex","tag-regular-expressions"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p1rUBW-2fK","jetpack-related-posts":[],"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/www.ezrasf.com\/wplog\/wp-json\/wp\/v2\/posts\/8664","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.ezrasf.com\/wplog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.ezrasf.com\/wplog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.ezrasf.com\/wplog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.ezrasf.com\/wplog\/wp-json\/wp\/v2\/comments?post=8664"}],"version-history":[{"count":0,"href":"https:\/\/www.ezrasf.com\/wplog\/wp-json\/wp\/v2\/posts\/8664\/revisions"}],"wp:attachment":[{"href":"https:\/\/w
ww.ezrasf.com\/wplog\/wp-json\/wp\/v2\/media?parent=8664"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.ezrasf.com\/wplog\/wp-json\/wp\/v2\/categories?post=8664"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.ezrasf.com\/wplog\/wp-json\/wp\/v2\/tags?post=8664"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}