Labeling variables in R

This procedure makes it easy to remember what each variable in an R session is for. One of the troubles with exploratory data analysis is that, with a lot of variables, it can be confusing what a given variable was originally created for. Code comments can help, but in some cases they make the files larger and unwieldy. One solution is to add a comment field to each object as it is created so that we can query the object and see a description. So, for example, we could create a time series called sales_ts, then create a window of that called sales_ts_window_a, another called sales_ts_window_b, and so on for several distinct spans of time. As we move through the project we may create numerous other variables and subsets of those variables. We can see the contents of those by using head() or tail(), but that does not tell us why an object exists.

To that end, these code segments allow applying a descriptive comment to an item and then querying that comment later via a describe command.

example_object <- "I appreciate r-cran."
# This adds a describe attribute/field to objects that can be queried.
# Could also change to some other attribute/Field other than help.
describe <- function(obj) attr(obj, "help")
# to use it, take the object and modify the "help" attribute/field.  
attr(example_object, "help") <- "This is an example comment field."
describe(example_object)

The above example refers to an example object, which could just as easily be the sales_ts_window_a mentioned above. We would use attr() to apply our description to sales_ts_window_a.

attr(sales_ts_window_a, "help") <- "Sales for the three quarters Jan was manager"
attr(sales_ts_window_b, "help") <- "Sales for the five quarters Bob was manager"

After hours or days have passed and there are many more variables under investigation, a simple query reveals the comment.

describe(sales_ts_window_a)
[1] "Sales for the three quarters Jan was manager"

This might seem burdensome, but RStudio makes it very easy to add this via code snippets. We can create two code snippets. The first goes at the top of the file and defines the describe function that reads the field holding the comment. The second applies a comment to an object. Open RStudio Settings > Code > Code Snippets and add the following code. RStudio requires tabs to indent these.

snippet lblMaker
        #
        # Code and Example for Providing Descriptive Comments about Objects
        # 
        example_object <- "I appreciate r-cran."
        # This adds a describe attribute/field to objects that can be queried.
        # Could also change to some other attribute/Field other than help.
        describe <- function(obj) attr(obj, "help")
        # to use it, take the object and modify the "help" attribute/field.  
        attr(example_object, "help") <- "This is an example comment field."
        describe(example_object)

snippet lblThis
        attr(ObjectName, "help") <- "Replace this text with comment"

Now one can use code completion to add the label maker to the top of the script. Simply start typing lblMak and hit the tab key to complete the code snippet. To label an object for future examination, start typing lblTh, hit tab to complete it, then replace ObjectName with the variable name and replace the string on the right with the comment. These code snippets provide a valuable way to store descriptive information about variables as they are created and set aside for potential future use.

This functionality overlaps with the built-in comment() function, with a bit of a twist. A description added via this method appears at the end of the print output when typing the variable name, whereas a comment set with comment() is not printed. comment() is also less intuitive than calling describe() and receiving a description.
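The difference between the two approaches can be checked directly. The following is a minimal sketch, assuming two small made-up numeric vectors (x and y are illustrative names, not objects from this article); it captures the printed output of each so the two can be compared.

```r
# Sketch comparing the "help" attribute used in this article with base R's comment().
describe <- function(obj) attr(obj, "help")

x <- c(1.5, 2.5, 3.5)                       # hypothetical data
attr(x, "help") <- "Labelled with the help attribute"

y <- c(1.5, 2.5, 3.5)                       # hypothetical data
comment(y) <- "Labelled with comment()"

# Printing x shows the values followed by attr(,"help") and the description.
# Printing y shows the values only, because the comment attribute is not printed.
x_printed <- capture.output(print(x))
y_printed <- capture.output(print(y))

describe(x)   # reads the "help" attribute
comment(y)    # reads the comment attribute
```

Either approach stores the text on the object itself, so the choice mostly comes down to whether the description should show up every time the object is printed.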

Base R does not ship a describe command; packages such as Hmisc and psych provide one. summary() is the one I use most often. For a good description, I load the psych package and use psych::describe(data). Calling it with the psych:: prefix keeps it distinct from the describe function in this article, so both remain useful. The printout begins with the usual [1] index, as in the describe(sales_ts_window_a) output shown earlier.

Adding attributes other than “help” could easily be accomplished. DescribeAuthor, DescribeLocation, and other functions could be added. When using a console to program, a conversational style makes it flow better.
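As a rough sketch of that idea, the helpers below read two additional attributes. The attribute names "author" and "location", the label_object() wrapper, and the sample values are all assumptions made for illustration and are not part of the original snippets.

```r
# Hypothetical helpers; the attribute names are illustrative only.
describe         <- function(obj) attr(obj, "help")
DescribeAuthor   <- function(obj) attr(obj, "author")
DescribeLocation <- function(obj) attr(obj, "location")

# Optional convenience wrapper (an assumption, not from the article) that
# sets several labels at once and returns the labelled object.
label_object <- function(obj, help = NULL, author = NULL, location = NULL) {
  attr(obj, "help")     <- help
  attr(obj, "author")   <- author
  attr(obj, "location") <- location
  obj
}

sales_example <- label_object(
  c(100, 120, 90),                      # made-up values
  help     = "Hypothetical quarterly sales",
  author   = "Analyst A",
  location = "data/sales.csv"
)

describe(sales_example)
DescribeAuthor(sales_example)
DescribeLocation(sales_example)
```

Keeping one small reader function per attribute preserves the conversational feel at the console, at the cost of a few more definitions at the top of the script.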

My Favorite Function

My favorite function of all time is varsoc in Stata. That’s saying a lot because I have been working with computers for decades and have written software in several languages, used many different types of administrative software tool sets, and owned a lot of books with code in them. Varsoc regresses one variable, y, upon another variable, x, and then regresses each lag of y on x to produce output that allows one to know the best-fit lag for a regression model. It allows someone analyzing time series data to see immediately when data from several periods prior is a better predictor of today’s reality than more recent data. I adore Stata for scientific analysis. In order to use this for my big data project, I needed to automate it, and so I wrote an R vignette that analyzes 45 lags and produces the relevant test statistics. My vignette produces R2 values [1], parameter estimates, and F-statistics for 45 lags of y regressed on x. The p-values are then written to a CSV file. The decision rule for a p-value is that we reject the null hypothesis if the two-sided p-value is less than or equal to α, or equivalently if the one-tailed p-value is less than or equal to α/2 [2]. The data comes from 5GB of CSV files that were created via Python.

Running the lags shows us the relationships between the historical prices of two securities. When we regress y on x in this case, we are regressing the price of security 2 on security 1. We then do this on a lag: the L1 of security 2 is regressed on security 1’s L0, then the L2 of security 2 is regressed on security 1’s L0, and so on for 45 iterations. For example, we might find that the price of a gold ETF 44 days ago has a stronger relationship with the price of Apple stock today than the price of that same gold ETF 12 days ago or even today. That’s an example only and not anything substantiated in the data. There will certainly be some spurious relationships, for example an ETF buying shares of Apple and then the same ETF’s fee going up the next month. To mitigate this, the vignette uses the first difference of the logarithm so that the data is stationary. The CSVs are already produced so that unit roots are accounted for. This is a research project to identify what actually bodes well in other sectors. It runs on every listed security on the American exchanges: every symbol is regressed on Apple, every symbol is regressed on Microsoft, and so on.
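The lag loop described above can be sketched in a few lines of R. This is not the vignette itself: the two price series are simulated, the variable names and the output file name are hypothetical, and lag 0 is included as a baseline alongside lags 1 through 45.

```r
set.seed(42)
n       <- 300
max_lag <- 45

# Simulated price series standing in for two securities (hypothetical data).
price_x <- 100 * exp(cumsum(rnorm(n, sd = 0.01)))
price_y <-  50 * exp(cumsum(rnorm(n, sd = 0.01)))

# First difference of the logarithm, so the regressions use stationary series.
dx <- diff(log(price_x))
dy <- diff(log(price_y))
m  <- length(dx)

results <- do.call(rbind, lapply(0:max_lag, function(k) {
  # Pair x at time t with y at time t - k (the k-th lag of y).
  x_t   <- dx[(k + 1):m]
  y_lag <- dy[1:(m - k)]
  fit   <- summary(lm(y_lag ~ x_t))
  f     <- fit$fstatistic
  data.frame(
    lag       = k,
    r_squared = fit$r.squared,
    slope     = fit$coefficients["x_t", "Estimate"],
    f_stat    = unname(f["value"]),
    p_value   = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
  )
}))

# Write the statistics for every lag to a CSV file (hypothetical path).
out_file <- file.path(tempdir(), "lag_results.csv")
write.csv(results, out_file, row.names = FALSE)
```

Sorting results by r_squared or p_value then gives a quick view of which lag fits best for a given pair. With random simulated data no lag should stand out, which is a useful sanity check on the loop.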

I began this project some time ago and stopped because it was going to take a solid month of continuous 12-core processing to accomplish the entire series. In retrospect, I should have let that proceed, but there would have been a great tradeoff in that I couldn’t have played Roblox, The Isle, and Ark Survival Evolved with my daughter. Finally, I’ve got the research running on a new machine dedicated to that purpose. That machine uses an AMD Ryzen 5 3500 and an NVMe SSD. The program is running on 6 cores in parallel. Previously, with the one-month estimate, it was running concurrently on 12 cores of Westmere Xeon CPUs and storing the output in RAM instead of on an SSD. This will serve as an interesting test for the Ryzen since all six cores will be running at 100% for months on end. The operating system is openSUSE Leap 15.2, the R version is 4.0.5, and the Python version is 2.7.18.

One of the reasons to write these articles is for my own memory. It gets harder to remember as one gets older. These blog posts are essentially a public notebook to aid myself and others.


1  R2 is the coefficient of determination, which in a simple regression is the square of the Pearson correlation coefficient, r, the formula for which is r = β1(σx/σy), where β1 is the parameter estimate. ASCII and Unicode text do not have a single precomposed character for β with a circumflex, ^, on top. For this documentation the objective is multiplatform long-term readability, so an equation editor with specialized support for circumflexes is out of the question.

2  There is also the rejection region method. We reject the null hypothesis if the test statistic’s absolute value is greater than the critical value, which we can express with the formula: Reject if |t| > t(α/2, n-1). For the slope in a simple regression the degrees of freedom are n-2 rather than n-1.
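Both notes can be checked numerically. The sketch below uses made-up data and is only an illustration: it confirms that r = β1(σx/σy) and R2 = r squared for a simple regression, and that the rejection region rule agrees with the p-value rule. Note that lm() reports a two-sided p-value, which is compared with α; that is the same as comparing a one-tailed p-value with α/2. The degrees of freedom come from the fitted model (n-2 for one predictor).

```r
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)          # hypothetical data with a real relationship

fit <- lm(y ~ x)
b1  <- unname(coef(fit)["x"])   # parameter estimate for the slope

# Note 1: r = b1 * (sd of x / sd of y), and R^2 = r^2 in a simple regression.
r_from_slope <- b1 * sd(x) / sd(y)
r_direct     <- cor(x, y)
r_squared    <- summary(fit)$r.squared

# Note 2: rejection region for the slope, two-sided test at level alpha.
alpha  <- 0.05
t_stat <- summary(fit)$coefficients["x", "t value"]
t_crit <- qt(1 - alpha / 2, df = df.residual(fit))   # n - 2 for one predictor
reject_by_region <- abs(t_stat) > t_crit

# Equivalent p-value rule: reject when the two-sided p-value is <= alpha.
p_value     <- summary(fit)$coefficients["x", "Pr(>|t|)"]
reject_by_p <- p_value <= alpha
```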

Risk, religion, and temping

“How The Masses Deal With Risk (And Why They Remain Poor)” appeared on Capitalist Exploits in January of 2016. The quote that resonated the most was “What is also a fact is that the mean return of early stage VC investments is north of 50% per annum. This is the mean and like anything else with a little bit (OK, a lot) of work, outperforming the average in anything is entirely achievable if you put effort into it.” (Chris MacIntosh, 2016)

“For Many Americans, ‘Temp’ Work Becomes a Permanent Way of Life” appeared on NBC News in April of 2014. The article follows Kelly Sibla and others who joined the ranks of the permanent no-benefit-no-FMLA class of temporary employees. The market started calling ‘temp’ jobs ‘contract’ jobs around the end of the Great Recession. “…labor economists warn that companies’ growing hunger for a workforce they can switch on and off could do permanent damage to these workers’ career trajectories and retirement plans” (Maddie McGarvey, 2014). Andrew Moran, writing for Time Doctor, looked at the same issue in “Employee Extinction? The Rise of the Contract, Temp Workers in Business” using Federal Reserve data and data from other countries. The phenomenon is not unique to the United States; however, the United States does not have a social safety net for things like housing the way that other countries do.

James Balogun wrote a career advice piece on the subject called “Here’s the Deal with Contract to Hire Positions”, and although he left out the valuable statistics about the majority never converting to full-time employees, the article provides a great analysis of the scenarios one faces when taking such a job. The best quote is “Let’s be clear here. The employee is the one taking the risk in a contract to hire, not the employer” (Balogun, 2016).

Outcome-Based Religion by Mac Dominick describes the management theories of Peter Drucker and their penetration into organized religion in Chapter 13. It describes the tendency of many denominations to act in a business manner, and it details pharmaceutical company foundations (Eli Lilly, among others) working with theological seminaries. The book mentions one “community church” that makes hundreds of referrals for psychiatric care annually. Dominick refers to this as the rise of “Christian Psychology”. It’s an interesting read, but as with many other works that discuss the Roman Catholic faith, fact-checking its assertions remains a good idea. One example is the assertion that Catholicism teaches that salvation exists in all faiths; in August 2016, Brother Andre Marie wrote an explanation detailing the misunderstandings behind that view.

Dr. Ed Hindson at Liberty University wrote an article in 2005 denying preterism called The New Last Days Scoffers. Donald Perkins discusses the refutation and explains the futurist view. J. R. Bronger wrote another analysis of the preterist view in August 1999, and called Realized Eschatology a poisonous belief. Bronger used a broad brush, but made strong arguments, including references to Hymenaeus and Philetus, historical figures who claimed the resurrection was already past. JM wrote a more recent article with strong arguments opposed to futurism. James Lloyd’s article at Christian Media Research takes issue with preterism and contains historical detail in addition to scriptural analysis while keeping Daniel’s 7 debated years in the past rather than the future.

Age of the earth and the race of Jesus

Age of the earth debates from the old-age side are based on linear regressions, which produce parameter estimates. Arguing about whether such an estimate is a fact is like arguing about whether the expected value of a portfolio is a fact. It’s an absurd thing to claim as truth and argue about, since it is a mathematical outcome of a chosen formula.

Genetic ancestry tests DON’T ACTUALLY REVEAL ANCESTRY [1]. This one is a myth that new atheists push.

…It’s also quite possible for someone who is African American to get ancestry test results that say they’re 75 percent European… [1]

One cannot analyze a bunch of DNA and determine where someone came from a million years ago, and applying DNA results to modern geopolitical borders is snake-oil selling. At best they are correlations only and correlation doesn’t imply causation.

The second one is a favorite of anti-Israel proponents who secretly think the Judeans in the Bible were replaced en masse at some point in the past with people who looked different from the modern Israelis who got that state as a result of Judaism-following ancestors, thus proving that Jesus was ‘browner’ and did not have ‘blue eyes’ [2] because of hitherto unknown genetic predictive power proving that he would thus side with the PLA in morality questions. King David being said to have had red hair really puts the lie to that whole browner thing… Hence why genealogies are a waste except as box-checking messiah status.

1. https://now.tufts.edu/articles/pulling-back-curtain-dna-ancestry-tests

2. https://www.timesofisrael.com/anomalous-blue-eyed-people-came-to-israel-6500-years-ago-from-iran-dna-shows/

STEM jobs in the United States

The number of science, technology, engineering, and math (STEM) jobs in the United States shrank over the three decades from 1982 to 2012. The drawdown accelerated from 2000 to 2012.

The highest occupational growth occurred among occupations that rely on soft skills, such as K-12 teaching and non-doctor health care support staff (nurses, technicians, and therapists). From 2000 to 2012, physical scientists (chemists, physicists, and others), biological scientists, and engineers saw decreases in the availability of work in their fields. The percentage of the workforce that fell into the category of “engineer” declined by over 15% (David Deming, 2017). In “The Economics of Noncognitive Skills”, data from the Brookings Institution’s Hamilton Project shows that the number of service jobs increased the most over the last three decades (Timothy Taylor, 14 October 2016). These are tasks such as customer service.

Decision Theory Articles

Very good article on decision theory by James Jones, Professor of Mathematics at Richland Community College.  The modes discussed are expected value (realist), also called the Bayesian principle, Maximax (optimist), Maximin (pessimist), and Minimax (opportunist).  The article uses the example of a bicycle shop choosing how many bicycles to purchase and sell.  The example is very good and the explanation is well-constructed.
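To make the four modes concrete, here is a small sketch in R. The payoff table, the probabilities, and the action names are invented for illustration and are not the figures from the article, and "Minimax" is taken to mean the minimax regret criterion, which is an assumption.

```r
# Hypothetical payoffs (profit) for a shop choosing how many bicycles to buy.
payoff <- matrix(
  c( 50,  50,  50,
     10, 100, 100,
    -30,  60, 150),
  nrow = 3, byrow = TRUE,
  dimnames = list(action = c("buy_20", "buy_40", "buy_60"),
                  demand = c("low", "medium", "high"))
)
prob <- c(low = 0.3, medium = 0.5, high = 0.2)   # assumed demand probabilities

# Expected value (realist): probability-weighted payoff of each action.
expected_value <- drop(payoff %*% prob)

# Regret: shortfall from the best payoff available in each demand state.
best_per_state <- apply(payoff, 2, max)
regret <- sweep(payoff, 2, best_per_state, FUN = function(p, best) best - p)

choices <- c(
  expected_value = names(which.max(expected_value)),
  maximax        = names(which.max(apply(payoff, 1, max))),  # best best-case
  maximin        = names(which.max(apply(payoff, 1, min))),  # best worst-case
  minimax_regret = names(which.min(apply(regret, 1, max)))   # smallest worst regret
)
choices
```

With these made-up numbers the optimist orders the most stock, the pessimist orders the least, and the expected value and minimax regret criteria both land on the middle option.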

Forestry Economics: A Managerial Approach by John E. Wagner has a great explanation of the decision modes.  The Wikipedia article on Minimax includes pseudocode for using Minimax in games, such as chess.  Of particular interest was the mention of this technique’s use by Deep Blue, the computer which beat Garry Kasparov in chess.

Management and the Technology Professional – B302 Risk analysis using maximin criterion, minimax regret criterion, expected value criterion, and decision trees is a good example of decision theory writing as well, and it includes a Dilbert cartoon.  Ultimately the regret table at Wikipedia was one of the most useful.

Real-World Decision Making: An Encyclopedia of Behavioral Economics, edited by Morris Altman, captured my interest when I was searching for material related to Laplace decision criteria.  It’s more of an economics book on behavior.  I am mentioning it here not because of its decision theory content, but because its page on Google Books led me to IndieBound.org, which seems to be a federation of independent bookstores.

Filter Bubble Analysis

Upward Pull was a term used in marketing to describe the effect of people seeing advertisements that made them aspire. Examples include seeing a nice watch or a luxury vehicle in a magazine. Someone who may have to work for years for those items may see them and aspire to obtain them, despite the items normally appearing outside their socioeconomic demographic. With the internet and advertisers, upward pull has all but vanished. When people view websites they see almost nothing aspirational. They instead see what is immediately obtainable.

Parmy Olson describes the filter bubble that exists on the internet.

“…as algorithms make predictions about people based on their web behavior, they can inadvertently deepen existing disparities on aspects like culture, race or gender. In a few years you could, for instance, be looking at a richer or poorer version of the Internet depending on how things work out with your credit score or where you live, and not even know it.” (1)

She further says “The Princeton researchers will compare search results, prices, ads, offers and emails that their fake profiles receive over the coming months, and look for patterns to measure what kind of discrimination is happening across different sites.” (1)

1. Olson, Parmy. “This Landmark Study Could Reveal How The Web Discriminates Against You.” Forbes, 2 Dec. 2013, http://www.forbes.com/sites/parmyolson/2013/12/02/this-landmark-study-could-reveal-how-the-web-discriminates-against-you/.

2. Englehardt, Steven, et al. Web Privacy Measurement: Scientific Principles, Engineering Platform, and New Results. Draft, June 1, 2014, https://www.cs.princeton.edu/~arvindn/publications/WebPrivacyMeasurement.pdf.

3. Englehardt, Steven, et al. OpenWPM: An Automated Platform for Web Privacy Measurement. https://senglehardt.com/papers/openwpm_03-2015.pdf.

 


Addendum: The study was published. They focused on news site personalization at the time, which was before Google entered the fray as a primary driver of news content via Google News. Steven Englehardt maintains a webpage at https://senglehardt.com/pages/publications.html. The published material is very interesting and reveals how tracking mechanisms work over time. I have been unable to find detailed analysis of the different filter bubbles encountered by their bots. They published a second paper on their privacy analysis tool, but it does not delve into the different potential experiences of the web user. The material relates to the technical aspects of what occurs. Those studies have been added as notes 2 and 3.

Last modified on March 7th, 2026 at 10:44 PM

Synchronize time on CentOS 6

This will start the time service and synchronize the clocks on CentOS 6.

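# Install the NTP daemon and client tools (run as root).
yum install ntp
# Enable ntpd at boot.
chkconfig ntpd on
# Set the clock once from the pool before the daemon starts.
ntpdate pool.ntp.org
# Start the daemon.
/etc/init.d/ntpd start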